Elective patient admission and assignment planning is an important task of the strategic and operational management of a hospital and early on became a central topic of clinical operations research. The management of hospital beds is an important subtask. Various approaches have been proposed, involving the computation of efficient assignments with regard to the patients’ condition, the necessity of the treatment, and the patients’ preferences. However, these approaches are mostly based on static, unadaptable estimates of the length of stay and thus do not take into account the uncertainty of the patient’s recovery. Furthermore, the effects of aggregated bed capacities have not been investigated in this context. Computer-supported bed management, combining an adaptable length-of-stay estimation with the treatment of shared resources (aggregated bed capacities), has not yet been sufficiently investigated. The aims of our work are: 1) to define a cost function for patient admission that takes into account adaptable length-of-stay estimations and aggregated resources, 2) to define a mathematical program formally modeling the assignment problem and an architecture for decision support, 3) to investigate four algorithmic methodologies addressing the assignment problem and one baseline approach, and 4) to evaluate these methodologies with respect to cost outcome, performance, and dismissal ratio.
The expected free ward capacity is calculated based on individual length of stay estimates, introducing Bernoulli distributed random variables for the ward occupation states and approximating the probability densities. The assignment problem is represented as a binary integer program. Four strategies for solving the problem are applied and compared: an exact approach, using the mixed integer programming solver SCIP; and three heuristic strategies, namely the longest expected processing time, the shortest expected processing time, and random choice. A baseline approach serves to compare these optimization strategies with a simple model of the status quo. All the approaches are evaluated by a realistic discrete event simulation: the outcomes are the ratio of successful assignments and dismissals, the computation time, and the model’s cost factors.
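As a concrete illustration of the occupancy model described above, the expected free capacity of a ward can be computed from per-bed Bernoulli probabilities. The sketch below is our reading of that idea, not the paper's code; the function names and the Poisson-binomial recursion for the exact occupancy distribution are ours.

```python
# Minimal sketch (our reading of the abstract, not the paper's code): each
# currently occupied bed i is modeled as still occupied at the planning
# horizon with probability p_i, a Bernoulli variable derived from an
# adaptable length-of-stay estimate.

def occupancy_distribution(p):
    """Exact distribution of the number of still-occupied beds
    (a Poisson-binomial distribution), via dynamic programming."""
    dist = [1.0]                       # with no beds considered, 0 occupied
    for pi in p:
        new = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            new[k] += q * (1 - pi)     # bed i has been freed
            new[k + 1] += q * pi       # bed i is still occupied
        dist = new
    return dist

def expected_free_capacity(total_beds, p):
    """Expected free beds = capacity minus expected occupancy."""
    return total_beds - sum(p)

p = [0.9, 0.5, 0.2]                    # hypothetical per-bed probabilities
print(expected_free_capacity(5, p))    # about 3.4
print(occupancy_distribution(p))       # probabilities sum to 1
```

The full occupancy distribution, rather than just its mean, is what allows dismissal probabilities under aggregated (shared) capacities to be quantified.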
A discrete event simulation of 226,000 cases shows a reduction of the dismissal rate compared to the baseline by more than 30 percentage points (from a mean dismissal ratio of 74.7% to 40.06% when comparing the status quo with the optimization strategies). Each of the optimization strategies leads to an improved assignment. The exact approach has only a marginal advantage over the heuristic strategies in the model’s cost factors (≤3%). Moreover, this marginal advantage was only achieved at the price of a computational time fifty times that of the heuristic models (an average computing time of 141 s using the exact method vs. 2.6 s for the heuristic strategies).
In terms of its performance and the quality of its solution, the heuristic strategy RAND is the preferred method for bed assignment in the case of shared resources. Future research is needed to investigate whether an equally marked improvement can be achieved in a large scale clinical application study, ideally one comprising all the departments involved in admission and assignment planning.
Finite mixture models have come to play a very prominent role in modelling data. The finite mixture model is predicated on the assumption that distinct latent groups exist in the population. It is therefore based on a categorical latent variable that distinguishes the different groups. In practice, however, distinct sub-populations often do not actually exist. For example, disease severity (e.g. depression) may vary continuously, and therefore a distinction between diseased and not diseased may not be based on the existence of distinct sub-populations. What is needed, then, is a generalization of the finite mixture’s discrete latent predictor to a continuous latent predictor. We cast the finite mixture model as a regression model with a latent Bernoulli predictor. A latent regression model is proposed by replacing the discrete Bernoulli predictor with a continuous latent predictor having a beta distribution. Motivation for the latent regression model arises from applications where distinct latent classes do not exist, but instead individuals vary according to a continuous latent variable. The shapes of the beta density are very flexible and can approximate the discrete Bernoulli distribution. Examples and a simulation are provided to illustrate the latent regression model. In particular, the latent regression model is used to model the placebo effect among drug-treated subjects in a depression study.
Beta distribution; EM algorithm; finite and infinite mixtures; quasi-Newton algorithms; placebo effect; skew normal distribution
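The claim that the beta density can approximate the discrete Bernoulli distribution can be checked numerically: as both shape parameters of a Beta(a, b) shrink toward zero (with their ratio fixed), the mass concentrates at 0 and 1 with success mass a/(a + b). The sketch below is illustrative only; the parameter values are our choice, not the article's.

```python
# Illustrative sketch (not from the article): as both shape parameters of a
# Beta(a, b) distribution shrink toward zero, its density piles up at 0 and
# 1 and approaches a Bernoulli with success probability a / (a + b).
import random

def near_endpoint_mass(a, b, n=100_000, tol=0.05, seed=1):
    """Return (fraction of draws within tol of 0 or 1,
    fraction of draws near 1, i.e. the 'success' mass)."""
    rng = random.Random(seed)
    draws = [rng.betavariate(a, b) for _ in range(n)]
    endpoints = sum(1 for x in draws if x < tol or x > 1 - tol) / n
    ones = sum(1 for x in draws if x > 1 - tol) / n
    return endpoints, ones

endpoints, ones = near_endpoint_mass(0.005, 0.0075)
# nearly all mass sits at the endpoints; the mass near 1 is about
# a / (a + b) = 0.4, the limiting Bernoulli success probability
print(endpoints, ones)
```

The latent regression model exploits exactly this flexibility: the same beta family spans near-discrete two-class structure and genuinely continuous variation.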
Gauging the systemic effects of non-synonymous single nucleotide polymorphisms (nsSNPs) is an important topic in the pursuit of personalized medicine. However, it is a non-trivial task to understand how a change at the protein structure level eventually affects a cell's behavior. This is because complex information at both the protein and pathway level has to be integrated. Given that the idea of integrating both protein and pathway dynamics to estimate the systemic impact of missense mutations in proteins remains predominantly unexplored, we investigate the practicality of such an approach by formulating mathematical models and comparing them with experimental data to study missense mutations. We present two case studies: (1) interpreting systemic perturbation for mutations within the cell cycle control mechanisms (G2 to mitosis transition) for yeast; (2) phenotypic classification of neuron-related human diseases associated with mutations within the mitogen-activated protein kinase (MAPK) pathway. We show that the application of simplified mathematical models is feasible for understanding the effects of small sequence changes on cellular behavior. Furthermore, we show that the systemic impact of missense mutations can be effectively quantified as a combination of protein stability change and pathway perturbation.
Small changes in protein sequences, such as missense mutations resulting from genetic variations in the genome, can have a large impact on cellular behavior. Consequently, numerous studies have been carried out to evaluate the disease susceptibility of missense mutations by directly analyzing their structural or functional impact on proteins. Such an approach has been shown to be useful for inferring the likelihood of a mutation to be disease-associated. However, there are still many unexplored avenues for improving disease-association studies, due to the fact that the dynamics of biological pathways are rarely considered. We therefore explore the practicality of a structural systems biology approach, combining pathway dynamics with protein structural information, for projecting the physiological outcomes of missense mutations. We show that stability changes of proteins due to missense mutations and the sensitivity of a protein in terms of regulating pathway dynamics are useful measures for this purpose. Furthermore, we demonstrate that complicated mathematical models are not a prerequisite for mapping protein stabilities to network perturbation. Thus it may be more feasible to study the systemic impact of missense mutations associated with complex pathways.
We develop a novel peak detection algorithm for the analysis of comprehensive two-dimensional gas chromatography time-of-flight mass spectrometry (GC×GC-TOF MS) data using normal-exponential-Bernoulli (NEB) and mixture probability models. The algorithm first performs baseline correction and denoising simultaneously using the NEB model, which also defines peak regions. Peaks are then picked using a mixture of probability distributions to deal with co-eluting peaks. Peak merging is further carried out based on the mass spectral similarities among the peaks within the same peak group. The algorithm is evaluated using experimental data to study the effect of different cut-offs of the conditional Bayes factors and the effect of different mixture models, including Poisson, truncated Gaussian, Gaussian, Gamma, and exponentially modified Gaussian (EMG) distributions, and the optimal version is selected using a trial-and-error approach. We then compare the new algorithm with two existing algorithms in terms of compound identification. Data analysis shows that the developed algorithm can detect peaks with lower false discovery rates than the existing algorithms, and that a less complicated peak picking model is a promising alternative to the more complicated and widely used EMG mixture models.
Bayes factor; GC×GC-TOF MS; metabolomics; mixture model; normal-exponential-Bernoulli (NEB) model; peak detection
Finding new functional fragments in biological sequences is a challenging problem. Methods addressing this problem commonly search for clusters of pattern occurrences that are statistically significant. A measure of statistical significance is the P-value of a number of pattern occurrences, i.e., the probability of finding at least S occurrences of words from a pattern in a random text of length N generated according to a given probability model. All words of the pattern are assumed to have the same length.
We present a novel algorithm, SufPref, that computes an exact P-value for Hidden Markov models (HMMs). The algorithm is based on recursive equations on text sets related to pattern occurrences; the equations can be used for any probability model. The algorithm inductively traverses a specific data structure, an overlap graph. The nodes of the graph are associated with the overlaps of words from the pattern. The edges are associated with the prefix and suffix relations between overlaps. A distinctive feature of our data structure is that the pattern need not be explicitly represented in nodes or leaves. The algorithm relies on the Cartesian product of the overlap graph and the graph of HMM states; this approach is analogous to the automaton approach from JBCB 4: 553-569. The reduced size of the SufPref data structure leads to significant improvements in space and time complexity compared to existing algorithms. The algorithm SufPref was implemented as a C++ program; the program can be used both as a web server and as a stand-alone program for Linux and Windows. The program interface accepts special formats to describe probability models of various types (HMM, Bernoulli, Markov); a pattern can be described with a list of words, a PSSM, a degenerate pattern, or a word and a number of mismatches. It is available at http://server2.lpm.org.ru/bio/online/sf/. The program was applied to compare the sensitivity and specificity of methods for TFBS prediction based on P-values computed for Bernoulli models, Markov models of orders one and two, and HMMs. The experiments show that the methods achieve approximately the same quality.
Electronic supplementary material
The online version of this article (doi:10.1186/s13015-014-0025-1) contains supplementary material, which is available to authorized users.
P-value; Pattern occurrences; PSSM (PWM); Hidden Markov model
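The P-value defined above (the probability of at least S occurrences of pattern words in a random text of length N) can be cross-checked with a simple Monte-Carlo estimate under the i.i.d. Bernoulli text model. The sketch below is only a sanity check, not SufPref, which computes the quantity exactly; the toy pattern and probabilities are hypothetical.

```python
# Sketch: Monte-Carlo cross-check of the P-value defined above -- the
# probability of at least S occurrences of words from a pattern in a random
# text of length N under an i.i.d. (Bernoulli) model. SufPref computes this
# exactly; the toy pattern below is hypothetical.
import random

def mc_pvalue(pattern, N, S, probs, trials=20_000, seed=7):
    alphabet = list(probs)
    weights = [probs[a] for a in alphabet]
    m = len(next(iter(pattern)))       # all pattern words share one length
    words = set(pattern)
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        text = "".join(rng.choices(alphabet, weights=weights, k=N))
        occ = sum(1 for i in range(N - m + 1) if text[i:i + m] in words)
        if occ >= S:
            hits += 1
    return hits / trials

# For the pattern {"11"} over a uniform binary alphabet, the exact value is
# 1 - 144/1024 = 0.859375 (counting strings with no two adjacent 1s via
# Fibonacci numbers); the estimate should be close to that.
print(mc_pvalue({"11"}, N=10, S=1, probs={"0": 0.5, "1": 0.5}))
```

Such a simulation scales poorly for small P-values, which is precisely why exact algorithms like SufPref matter.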
Many individually rare missense substitutions are encountered during deep resequencing of candidate susceptibility genes and clinical mutation screening of known susceptibility genes. BRCA1 and BRCA2 are among the most resequenced of all genes, and clinical mutation screening of these genes provides an extensive data set for analysis of rare missense substitutions. Align-GVGD is a mathematically simple missense substitution analysis algorithm, based on the Grantham difference, which has already contributed to classification of missense substitutions in BRCA1, BRCA2, and CHEK2. However, the distribution of genetic risk as a function of Align-GVGD's output variables Grantham variation (GV) and Grantham deviation (GD) has not been well characterized. Here, we used data from the Myriad Genetic Laboratories database of nearly 70,000 full-sequence tests plus two risk estimates, one approximating the odds ratio and the other reflecting strength of selection, to display the distribution of risk in the GV-GD plane as a series of surfaces. We abstracted contours from the surfaces and used the contours to define a sequence of missense substitution grades ordered from greatest risk to least risk. The grades were validated internally using a third, personal and family history-based, measure of risk. The Align-GVGD grades defined here are applicable to both the genetic epidemiology problem of classifying rare missense substitutions observed in known susceptibility genes and the molecular epidemiology problem of analyzing rare missense substitutions observed during case-control mutation screening studies of candidate susceptibility genes.
BRCA1; BRCA2; Align-GVGD; unclassified variant; missense substitution; protein multiple sequence alignment
Large-scale sequencing of cancer genomes has uncovered thousands of DNA alterations, but the functional relevance of the majority of these mutations to tumorigenesis is unknown. We have developed a computational method, called CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations), to identify and prioritize those missense mutations most likely to generate functional changes that enhance tumor cell proliferation. The method has high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations (area under ROC curve > 0.91, area under Precision-Recall curve > 0.79). CHASM substantially outperformed previously described missense mutation function prediction methods at discriminating known oncogenic mutations in TP53 and the tyrosine kinase EGFR. We applied the method to 607 missense mutations found in a recent glioblastoma multiforme (GBM) sequencing study. Based on a model that assumed the GBM mutations are a mixture of drivers and passengers, we estimate that 8% of these mutations are drivers, causally contributing to tumorigenesis.
cancer drivers; CHASM; missense mutations; random forest; somatic mutations
To assist in distinguishing disease-causing mutations from non-pathogenic polymorphisms, we developed an objective algorithm to calculate an “estimate of pathogenic probability” (EPP) based on the prevalence of a specific variation, its segregation within families, and its predicted effects on protein structure. Eleven missense variations in the RPE65 gene were evaluated in patients with Leber congenital amaurosis (LCA) using the EPP algorithm. The accuracy of the EPP algorithm was evaluated using a cell-culture assay of RPE65-isomerase activity. The variations were engineered into plasmids containing a human RPE65 cDNA and the retinoid isomerase activity of each variant was determined in cultured cells. The EPP algorithm predicted eight substitution mutations to be disease-causing variants. The isomerase catalytic activities of these RPE65 variants were all less than 6% of wild-type. In contrast, the EPP algorithm predicted the other three substitutions to be non-disease-causing, with isomerase activities of 68%, 127% and 110% of wild-type, respectively. We observed complete concordance between the predicted pathogenicities of missense variations in the RPE65 gene and retinoid isomerase activities measured in a functional assay. These results suggest that the EPP algorithm may be useful to evaluate the pathogenicity of missense variations in other disease genes where functional assays are not available.
Leber congenital amaurosis; pathogenicity; RPE65; retinoid
Echocardiography is routinely used to assess ventricular and valvular function, particularly in patients with known or suspected cardiac disease and evidence of hemodynamic compromise. A cornerstone of echocardiographic imaging is not only the qualitative assessment, but also the quantitative Doppler-derived velocity characteristics of intracardiac blood flow. While simplified equations, such as the modified Bernoulli equation, are used to estimate intracardiac pressure gradients based upon Doppler velocity data, these simplified equations rest on assumptions about the relative contributions of the different forces that drive blood flow. Unfortunately, these assumptions can result in significant miscalculations of a gradient if they are not completely understood or are misapplied. We briefly summarize the principles of fluid dynamics that are used clinically, along with some of the inherent limitations of routine broad application of the simplified Bernoulli equation.
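For reference, the simplification discussed above can be written out; this is the standard clinical convention rather than anything specific to this article:

```latex
% Bernoulli's relation between two points on a streamline, dropping the
% viscous-friction and flow-acceleration terms:
\Delta P = \tfrac{1}{2}\,\rho\,\bigl(v_2^{2} - v_1^{2}\bigr)
% With blood density \rho \approx 1060\ \mathrm{kg/m^3}, pressure in mmHg
% (1 mmHg = 133.3 Pa) and velocity in m/s, the factor
% \tfrac{1}{2}\rho / 133.3 \approx 3.98 \approx 4; neglecting the proximal
% velocity v_1 then yields the modified Bernoulli equation:
\Delta P\ [\mathrm{mmHg}] \approx 4\,v_2^{2}
```

The dropped terms are exactly the "varying contributions" the abstract warns about: when proximal velocity, viscous losses, or flow acceleration are not negligible, the factor-of-four shortcut misestimates the gradient.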
Modeling of cancer hazards at age t deals with a dichotomous population, a small part of which (the fraction at risk) will get cancer, while the other part will not. Therefore, we conditioned the hazard function, h(t), the probability density function (pdf), f(t), and the survival function, S(t), on frailty α in individuals. Assuming α has a Bernoulli distribution, we obtained equations relating the unconditional (population-level) hazard function, hU(t), cumulative hazard function, HU(t), and overall cumulative hazard, H0, to the h(t), f(t), and S(t) for individuals from the fraction at risk. Computing procedures for estimating h(t), f(t), and S(t) were developed and used to fit the pancreatic cancer data collected by the SEER9 registries from 1975 through 2004 with the Weibull pdf suggested by the Armitage-Doll model. The parameters of the resulting excellent fit suggest that the age of pancreatic cancer presentation has a time shift of about 17 years and that five mutations are needed for pancreatic cells to become malignant.
cancer incidence; cancer hazard; frailty; Weibull distribution; pancreatic cancer
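The relations implied by a Bernoulli frailty follow a standard cure-rate argument. Writing p = P(α = 1) for the fraction at risk (our notation; the article's own equations may differ in presentation), a sketch is:

```latex
% Only the at-risk fraction p can fail, so the population-level survival,
% density, and hazard mix a point mass of "cured" individuals with S(t):
S_U(t) = (1 - p) + p\,S(t), \qquad f_U(t) = p\,f(t),
\qquad
h_U(t) = \frac{f_U(t)}{S_U(t)} = \frac{p\,f(t)}{(1 - p) + p\,S(t)},
% and the cumulative hazard saturates at a finite overall value:
H_U(t) = -\ln S_U(t), \qquad
H_0 = \lim_{t \to \infty} H_U(t) = -\ln(1 - p).
```

The finite limit H0 is what distinguishes this dichotomous-population model from a standard hazard model, in which the cumulative hazard grows without bound.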
A Boolean network is a graphical model for representing and analyzing the behavior of gene regulatory networks (GRNs). In this context, the accurate and efficient reconstruction of a Boolean network is essential for understanding the gene regulation mechanism and the complex relations that exist therein. In this paper we introduce an elegant and efficient algorithm for the reverse engineering of Boolean networks from a time series of multivariate binary data corresponding to gene expression data. We call our method ReBMM, i.e., reverse engineering based on Bernoulli mixture models. The time complexity of most of the existing reverse engineering techniques is quite high and depends upon the indegree of a node in the network. Due to their high complexity, these methods can only be applied to sparsely connected networks of small sizes. ReBMM has a time complexity that is independent of the indegree of a node and quadratic in the number of nodes in the network, a substantial improvement over other techniques, with little or no compromise in accuracy. We have tested ReBMM on a number of artificial datasets, along with simulated data derived from a plant signaling network. We also used this method to reconstruct a network from real experimental microarray observations of the yeast cell cycle. Our method provides a natural framework for generating rules from a probabilistic model. It is simple, intuitive and yields excellent empirical results.
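The probabilistic core that such a method builds on is a mixture of multivariate Bernoulli distributions fitted by expectation-maximization. The sketch below shows that core only, not ReBMM itself; the initialization and parameter-clipping choices are ours.

```python
# Illustrative sketch (ReBMM itself is not reproduced): EM for a mixture of
# multivariate Bernoulli distributions over binary vectors. Initialization
# and clipping choices here are ours, not the paper's.
import math

def em_bernoulli_mixture(X, k, iters=50):
    n, d = len(X), len(X[0])
    pi = [1.0 / k] * k                  # mixing weights
    # deterministic init: smooth k spread-out data points toward 0.5
    theta = [[0.25 + 0.5 * X[(j * (n - 1)) // max(k - 1, 1)][i]
              for i in range(d)] for j in range(k)]
    for _ in range(iters):
        # E-step: responsibilities r_j(x) ∝ pi_j * prod_i theta_ji^x_i (1-theta_ji)^(1-x_i)
        R = []
        for x in X:
            logw = []
            for j in range(k):
                ll = math.log(pi[j])
                for xi, t in zip(x, theta[j]):
                    ll += math.log(t if xi else 1 - t)
                logw.append(ll)
            m = max(logw)
            w = [math.exp(lw - m) for lw in logw]
            s = sum(w)
            R.append([wi / s for wi in w])
        # M-step: re-estimate weights and per-component Bernoulli parameters,
        # clipping away from 0 and 1 to keep the log-likelihood finite
        for j in range(k):
            nj = sum(r[j] for r in R)
            pi[j] = nj / n
            theta[j] = [min(max(sum(r[j] * x[i] for r, x in zip(R, X)) / nj,
                                1e-3), 1 - 1e-3) for i in range(d)]
    return pi, theta

# Toy binary "expression" data with two clear profiles
X = [[1, 1, 0, 0]] * 30 + [[0, 0, 1, 1]] * 30
pi, theta = em_bernoulli_mixture(X, 2)
print(pi)        # close to [0.5, 0.5]
```

Each EM iteration costs O(n·k·d), independent of any node's indegree, which is the complexity property the abstract emphasizes.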
We examine the problem of estimating the spike trains of multiple neurons from voltage traces recorded on one or more extracellular electrodes. Traditional spike-sorting methods rely on thresholding or clustering of recorded signals to identify spikes. While these methods can detect a large fraction of the spikes from a recording, they generally fail to identify synchronous or near-synchronous spikes: cases in which multiple spikes overlap. Here we investigate the geometry of failures in traditional sorting algorithms, and document the prevalence of such errors in multi-electrode recordings from primate retina. We then develop a method for multi-neuron spike sorting using a model that explicitly accounts for the superposition of spike waveforms. We model the recorded voltage traces as a linear combination of spike waveforms plus a stochastic background component of correlated Gaussian noise. Combining this measurement model with a Bernoulli prior over binary spike trains yields a posterior distribution for spikes given the recorded data. We introduce a greedy algorithm to maximize this posterior that we call “binary pursuit”. The algorithm allows modest variability in spike waveforms and recovers spike times with higher precision than the voltage sampling rate. This method substantially corrects cross-correlation artifacts that arise with conventional methods, and substantially outperforms clustering methods on both real and simulated data. Finally, we develop diagnostic tools that can be used to assess errors in spike sorting in the absence of ground truth.
Advancements in sequencing techniques place personalized genomic medicine on the horizon, bringing along the responsibility of clinicians to understand the likelihood of a mutation to cause disease, and of scientists to separate etiology from nonpathologic variability. Pathogenicity is discernible from patterns of interactions between a missense mutation, the surrounding protein structure, and intermolecular interactions. Physicochemical stability calculations are not accessible without structures, as is the case for the vast majority of human proteins, so diagnostic accuracy remains in its infancy. To model the effects of missense mutations on functional stability without structure, we combine novel protein sequence analysis algorithms to discern spatial distributions of sequence, evolutionary, and physicochemical conservation, through a new approach to optimize component selection. Novel components include a combinatory substitution matrix and two heuristic algorithms that detect positions conferring structural support to interaction interfaces. The method reaches 0.91 AUC in ten-fold cross-validation to predict alteration of function for 6,392 in vitro mutations. For clinical utility we trained the method on 7,022 disease-associated missense mutations from Online Mendelian Inheritance in Man amongst a larger randomized set. In a blinded prospective test to delineate mutations unique to 186 patients with craniosynostosis from those in the 95 highly variant Coriell controls and 1000 age-matched controls, we achieved roughly 1/3 sensitivity and perfect specificity. The component algorithms retained during machine learning constitute novel protein sequence analysis techniques for describing environments that support neutrality or pathology of mutations. This approach to pathogenetics enables new insight into the mechanistic relationship of missense mutations to disease phenotypes in our patients.
Computational biology; protein stability; machine learning; missense mutation; nonsynonymous SNP; sequence analysis
Decision makers in epidemiology and other disciplines are faced with the daunting challenge of designing interventions that will be successful with high probability and robust against a multitude of uncertainties. To facilitate the decision making process in the context of a goal-oriented objective (e.g., eradicate polio by ), stochastic models can be used to map the probability of achieving the goal as a function of parameters. Each run of a stochastic model can be viewed as a Bernoulli trial in which “success” is returned if and only if the goal is achieved in simulation. However, each run can take a significant amount of time to complete, and many replicates are required to characterize each point in parameter space, so specialized algorithms are required to locate desirable interventions. To address this need, we present the Separatrix Algorithm, which strategically locates parameter combinations that are expected to achieve the goal with a user-specified probability of success (e.g. 95%). Technically, the algorithm iteratively combines density-corrected binary kernel regression with a novel information-gathering experiment design to produce results that are asymptotically correct and work well in practice. The Separatrix Algorithm is demonstrated on several test problems, and on a detailed individual-based simulation of malaria.
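The basic building block the Separatrix Algorithm refines is regression of binary (Bernoulli) simulation outcomes on parameters. A minimal sketch of that block is shown below, assuming a plain Nadaraya-Watson estimator with a Gaussian kernel; the algorithm's density correction and adaptive experiment design are not reproduced here, and the toy problem is hypothetical.

```python
# Sketch: Nadaraya-Watson kernel regression on Bernoulli outcomes, the
# basic building block that the Separatrix Algorithm refines (its density
# correction and adaptive experiment design are not reproduced here).
import math

def kernel_success_probability(samples, x, bandwidth=0.1):
    """Estimate P(success | x) from (parameter, 0/1 outcome) pairs."""
    num = den = 0.0
    for xi, yi in samples:
        w = math.exp(-((x - xi) ** 2) / (2 * bandwidth ** 2))
        num += w * yi       # kernel-weighted successes
        den += w            # kernel-weighted sample count
    return num / den if den else 0.5   # no data at all: uninformative 1/2

# Toy 1-D problem: the goal is reliably achieved above the threshold 0.6
samples = [(i / 100, 1 if i >= 60 else 0) for i in range(100)]
print(kernel_success_probability(samples, 0.9))   # close to 1
print(kernel_success_probability(samples, 0.1))   # close to 0
```

Sweeping x over a grid and reading off where the estimate crosses the target probability (e.g. 95%) locates the separatrix in this one-dimensional toy setting.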
The most frequently used tools in bioinformatics are those searching for similarities, or local alignments, between biological sequences. Since the exact dynamic programming algorithm is quadratic, linear-time heuristics such as BLAST are used. Spaced seeds are much more sensitive than the consecutive seed of BLAST and using several seeds represents the current state of the art in approximate search for biological sequences. The most important aspect is computing highly sensitive seeds. Since the problem seems hard, heuristic algorithms are used. The leading software in the common Bernoulli model is the SpEED program.
SpEED uses a hill climbing method based on the overlap complexity heuristic. We propose a new algorithm for this heuristic that improves its speed by over one order of magnitude. We use the new implementation to compute improved seeds for several software programs. We also compute multiple seeds of the same weight as the MegaBLAST seed, which greatly improve its sensitivity.
Multiple spaced seeds are being successfully used in bioinformatics software programs. Enabling researchers to compute high-quality seeds very fast will help expand the range of their applications.
Similarity search; Local alignment; Spaced seed; Heuristic algorithm; Sensitivity
Multiple algorithms are used to predict the impact of missense mutations on protein structure and function using algorithm-generated sequence alignments or manually curated alignments. We compared the accuracy, each with its native alignment, of SIFT, Align-GVGD, PolyPhen-2 and Xvar when generating functionality predictions for well-characterized missense mutations (n = 267) within the BRCA1, MSH2, MLH1 and TP53 genes. We also evaluated the impact of the alignment employed on predictions from these algorithms (except Xvar) when supplied with the same four alignments, including alignments automatically generated by (1) SIFT, (2) PolyPhen-2, and (3) UniProt, and (4) a manually curated alignment tuned for Align-GVGD. Alignments differ in sequence composition and evolutionary depth. Data-based receiver operating characteristic curves employing the native alignment for each algorithm result in an area under the curve of 78-79% for all four algorithms. Predictions from the PolyPhen-2 algorithm were least dependent on the alignment employed. In contrast, Align-GVGD predicts all variants neutral when provided with alignments containing a large number of sequences. Of note, the algorithms make different predictions for variants even when provided the same alignment, and do not necessarily perform best using their own alignment. Thus, researchers should consider optimizing both the algorithm and the sequence alignment employed in missense prediction.
multiple sequence alignment; SIFT; PolyPhen-2; Align-GVGD; Xvar; BRCA1; MSH2; MLH1; TP53
Germ line inactivating mutations in BRCA1 confer susceptibility for breast and ovarian cancer. However, the relevance of the many missense changes in the gene for which the effect on protein function is unknown remains unclear. Determination of which variants are causally associated with cancer is important for assessment of individual risk. We used a functional assay that measures the transactivation activity of BRCA1 in combination with analysis of protein modeling based on the structure of BRCA1 BRCT domains. In addition, the information generated was interpreted in light of genetic data. We determined the predicted cancer association of 22 BRCA1 variants and verified that the common polymorphism S1613G has no effect on BRCA1 function, even when combined with other rare variants. We estimated the specificity and sensitivity of the assay, and by meta-analysis of 47 variants, we show that variants with <45% of wild-type activity can be classified as deleterious whereas variants with >50% can be classified as neutral. In conclusion, we did functional and structure-based analyses on a large series of BRCA1 missense variants and defined a tentative threshold activity for the classification of missense variants. By interpreting the validated functional data in light of additional clinical and structural evidence, we conclude that it is possible to classify all missense variants in the BRCA1 COOH-terminal region. These results bring functional assays for BRCA1 closer to clinical applicability.
The assessment of the influence of many rare BRCA2 missense mutations on cancer risk has proved difficult. A multifactorial likelihood model that predicts the odds of cancer causality for missense variants is effective, but is limited by the availability of family data. As an alternative, we developed functional assays that measure the influence of missense mutations on the ability of BRCA2 to repair DNA damage by homologous recombination and to control centriole amplification. We evaluated 22 missense mutations from the BRCA2 DNA binding domain (DBD) that were identified in multiple breast cancer families using these assays and compared the results with those from the likelihood model. Thirteen variants inactivated BRCA2 function in at least one assay; two others truncated BRCA2 by aberrant splicing; and seven had no effect on BRCA2 function. Of 10 variants with odds in favor of causality in the likelihood model of 50:1 or more and a posterior probability of pathogenicity of 0.99, eight inactivated BRCA2 function and the other two caused splicing defects. Four variants and four controls displaying odds in favor of neutrality of 50:1 and posterior probabilities of pathogenicity of no more than 1 × 10−3 had no effect on function in either assay. The strong correlation between the functional assays and likelihood model data suggests that these functional assays are an excellent method for identifying inactivating missense mutations in the BRCA2 DBD and that the assays may be a useful addition to models that predict the likelihood of cancer in carriers of missense mutations.
Cluster analysis has become a standard computational method for gene function discovery as well as for more general exploratory data analysis. A number of different approaches have been proposed for this purpose, among which mixture models provide a principled probabilistic framework. Cluster analysis is nowadays increasingly often supplemented with multiple data sources, and these heterogeneous information sources should be used as efficiently as possible.
This paper presents a novel Beta-Gaussian mixture model (BGMM) for clustering genes based on Gaussian-distributed and beta-distributed data. The proposed BGMM can be viewed as a natural extension of the beta mixture model (BMM) and the Gaussian mixture model (GMM). The proposed BGMM method differs from other mixture-model-based methods in its integration of two different data types into a single, unified probabilistic modeling framework, which makes more efficient use of multiple data sources than methods that analyze different data sources separately. Moreover, BGMM provides an exceedingly flexible modeling framework, since many data sources can be modeled as Gaussian or beta distributed random variables, and it can be extended to integrate data with other parametric distributions, adding further flexibility to this model-based clustering framework. We developed three types of estimation algorithms for BGMM, the standard expectation maximization (EM) algorithm, an approximated EM and a hybrid EM, and propose to tackle the model selection problem with well-known model selection criteria, for which we test the Akaike information criterion (AIC), a modified AIC (AIC3), the Bayesian information criterion (BIC), and the integrated classification likelihood-BIC (ICL-BIC).
Performance tests with simulated data show that combining two different data sources into a single joint mixture model greatly improves clustering accuracy compared with either of its two extreme cases, GMM or BMM. Applications with real mouse gene expression data (modeled as Gaussian distributed) and protein-DNA binding probabilities (modeled as beta distributed) also demonstrate that BGMM can yield more biologically reasonable results than either of its two extreme cases. One of our applications found three groups of genes that are likely to be involved in Myd88-dependent Toll-like receptor 3/4 (TLR-3/4) signaling cascades, which may help to better understand TLR-3/4 signal transduction.
Motivation: The number of missense mutations being identified in cancer genomes has greatly increased as a consequence of technological advances and the reduced cost of whole-genome/whole-exome sequencing methods. However, a high proportion of the amino acid substitutions detected in cancer genomes have little or no effect on tumour progression (passenger mutations). Therefore, accurate automated methods capable of discriminating between driver (cancer-promoting) and passenger mutations are becoming increasingly important. In our previous work, we developed the Functional Analysis through Hidden Markov Models (FATHMM) software and, using a model weighted for inherited disease mutations, observed improved performance over alternative computational prediction algorithms. Here, we describe an adaptation of our original algorithm that incorporates a cancer-specific model to potentiate the functional analysis of driver mutations.
Results: The performance of our algorithm was evaluated using two separate benchmarks. In our analysis, we observed improved performance when distinguishing between driver mutations and other germline variants (both disease-causing and putatively neutral mutations). In addition, when discriminating between somatic driver and passenger mutations, we observed performance comparable with that of the leading computational prediction algorithms SPF-Cancer and TransFIC.
Availability and implementation: A web-based implementation of our cancer-specific model, including a downloadable stand-alone package, is available at http://fathmm.biocompute.org.uk.
Supplementary data are available at Bioinformatics online.
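As a hypothetical illustration only (the actual FATHMM weighting scheme is defined in the paper and differs in its details), a weighted log-ratio score over HMM-derived residue probabilities could take the form below, where larger values indicate that the wild-type residue is more strongly preferred than the mutant:

```python
import math

def weighted_score(p_wild, p_mut, w_disease, w_neutral, eps=1e-9):
    """Hypothetical weighted log-ratio: the unweighted term
    ln(p_wild / p_mut) is shifted by the log ratio of neutral vs.
    disease weights attached to the model (an illustrative stand-in,
    not FATHMM's published formula)."""
    unweighted = math.log((p_wild + eps) / (p_mut + eps))
    shift = math.log((w_neutral + eps) / (w_disease + eps))
    return unweighted + shift
```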
Genetic testing for hereditary cancer syndromes contributes to the medical management of patients who may be at increased risk of one or more cancers. BRCA1 and BRCA2 testing for hereditary breast and ovarian cancer is one such widely used test. However, clinical testing methods with high sensitivity for deleterious mutations in these genes also detect many unclassified variants, primarily missense substitutions.
We developed an extension of the Grantham difference, called A‐GVGD, to score missense substitutions against the range of variation present at their position in a multiple sequence alignment. Combining two methods, co‐occurrence of unclassified variants with clearly deleterious mutations and A‐GVGD, we analysed most of the missense substitutions observed in BRCA1.
A‐GVGD was able to resolve known neutral and deleterious missense substitutions into distinct sets. Additionally, eight previously unclassified BRCA1 missense substitutions observed in trans with one or more deleterious mutations, and within the cross‐species range of variation observed at their position in the protein, are now classified as neutral.
The methods combined here can classify as neutral about 50% of missense substitutions that have been observed with two or more clearly deleterious mutations. Furthermore, odds ratios estimated for sets of substitutions grouped by A‐GVGD scores are consistent with the hypothesis that most unclassified substitutions that are within the cross‐species range of variation at their position in BRCA1 are also neutral. For most of these, clinical reclassification will require integrated application of other methods such as pooled family histories, segregation analysis, or validated functional assays.
A‐GVGD; BRCA1; co‐occurrence; Grantham difference; missense substitution; polymorphism
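A sketch of the GV/GD idea behind A-GVGD: score a variant against the range of physicochemical variation observed at its alignment position using a Grantham-style distance over composition, polarity, and volume. The coefficient values and the small amino acid property table below are illustrative assumptions, not a vetted reproduction of Grantham's tables:

```python
import math

# Grantham-style property values (composition, polarity, volume) for a
# few residues only -- illustrative numbers, not a vetted table.
PROPS = {
    "A": (0.00, 8.1, 31.0),
    "S": (1.42, 9.2, 32.0),
    "V": (0.00, 5.9, 84.0),
    "D": (1.38, 13.0, 40.0),
}
# Assumed weighting constants in the spirit of Grantham's distance.
ALPHA, BETA, GAMMA, RHO = 1.833, 0.1018, 0.000399, 50.723

def _dist(dc, dp, dv):
    return RHO * math.sqrt(ALPHA * dc ** 2 + BETA * dp ** 2 + GAMMA * dv ** 2)

def gv_gd(observed, variant):
    """GV: Grantham-style span of the residues observed at an alignment
    position.  GD: distance from the variant residue to the nearest edge
    of that span (0 if the variant lies inside the observed range)."""
    cols = list(zip(*(PROPS[a] for a in observed)))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    gv = _dist(hi[0] - lo[0], hi[1] - lo[1], hi[2] - lo[2])
    dev = [max(l - x, 0.0, x - h)
           for x, l, h in zip(PROPS[variant], lo, hi)]
    gd = _dist(*dev)
    return gv, gd
```

A variant whose properties fall inside the cross-species range at its position gets GD = 0, matching the paper's use of "within the cross-species range of variation" as evidence of neutrality.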
The MCMC procedure in SAS (PROC MCMC) is specifically designed for Bayesian analysis using the Markov chain Monte Carlo (MCMC) algorithm. The procedure is sufficiently general to handle very complicated statistical models and arbitrary prior distributions. This study introduces the SAS/MCMC procedure and demonstrates its application to quantitative trait locus (QTL) mapping. A real-life QTL mapping experiment on a female fertility trait in wheat is used as an example. The fertility trait phenotypes were described under three different models: (1) the Poisson model, (2) the Bernoulli model, and (3) the zero-truncated Poisson model. One QTL was identified on the second chromosome. This QTL appears to control the switch for seed-producing ability in female plants but does not affect the number of seeds produced once the switch is turned on.
Bayes; Markov chain Monte Carlo; quantitative trait locus; SAS
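The zero-truncated Poisson likelihood and a random-walk Metropolis update can be sketched as follows. This Python stand-in only illustrates what PROC MCMC automates, under simplifying assumptions: a single rate parameter, a flat prior on the positive reals, and no QTL genotype covariates:

```python
import math
import random

def ztp_loglik(lam, counts):
    """Log-likelihood of a zero-truncated Poisson:
    P(Y = y | Y > 0) = exp(-lam) * lam**y / (y! * (1 - exp(-lam)))."""
    ll = 0.0
    for y in counts:
        ll += (-lam + y * math.log(lam) - math.lgamma(y + 1)
               - math.log(1.0 - math.exp(-lam)))
    return ll

def metropolis_lambda(counts, n_iter=4000, step=0.3, seed=1):
    """Random-walk Metropolis for lam with a flat prior on lam > 0 --
    a minimal stand-in for what PROC MCMC does internally."""
    rng = random.Random(seed)
    lam, ll = 1.0, ztp_loglik(1.0, counts)
    draws = []
    for _ in range(n_iter):
        prop = lam + rng.gauss(0.0, step)
        if prop > 0:                      # reject moves outside support
            ll_prop = ztp_loglik(prop, counts)
            if math.log(rng.random()) < ll_prop - ll:
                lam, ll = prop, ll_prop
        draws.append(lam)
    return draws
```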
Driver mutations are somatic mutations that provide a growth advantage to tumor cells, while passenger mutations are those not functionally related to oncogenesis. Distinguishing drivers from passengers is challenging because drivers occur much less frequently than passengers, tend to have low prevalence, and have functional effects that are multifactorial and not intuitively obvious. Missense mutations are excellent candidates as drivers, as they occur more frequently and are potentially easier to identify than other types of mutations. Although several methods have been developed for predicting the functional impact of missense mutations, only a few have been specifically designed for identifying driver mutations. As more mutations are discovered, more accurate predictive models can be developed using machine learning approaches that systematically characterize the commonality and peculiarity of missense mutations against the background of specific cancer types. Here, we present a cancer driver annotation (CanDrA) tool that predicts missense driver mutations based on a set of 95 structural and evolutionary features computed by over 10 functional prediction algorithms such as CHASM, SIFT, and MutationAssessor. Through feature optimization and supervised training, CanDrA outperforms existing tools in analyzing the glioblastoma multiforme and ovarian carcinoma data sets in The Cancer Genome Atlas and the Cancer Cell Line Encyclopedia project.
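The supervised-training step can be illustrated with a tiny stand-in classifier: a logistic regression trained by stochastic gradient descent on per-mutation feature vectors. CanDrA's actual model, its 95 features, and its optimization are far more elaborate; this sketch only shows the shape of the problem (features in, driver probability out):

```python
import math

def train_logreg(X, y, lr=0.1, epochs=500):
    """Tiny logistic-regression trainer (stochastic gradient descent)
    standing in for CanDrA's supervised model over per-mutation
    feature vectors; y holds 1 for driver, 0 for passenger."""
    d = len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi                     # gradient of the log-loss
            b -= lr * g
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
    return w, b

def predict(w, b, xi):
    """Predicted probability that a mutation is a driver."""
    z = b + sum(wj * xj for wj, xj in zip(w, xi))
    return 1.0 / (1.0 + math.exp(-z))
```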
Motivation: Discriminant analysis is an effective tool for the classification of experimental units into groups. Here, we consider the typical problem of classifying subjects according to phenotypes via gene expression data and propose a method that incorporates variable selection into the inferential procedure, for the identification of the important biomarkers. To achieve this goal, we build upon a conjugate normal discriminant model, both linear and quadratic, and include a stochastic search variable selection procedure via an MCMC algorithm. Furthermore, we incorporate into the model prior information on the relationships among the genes as described by a gene–gene network. We use a Markov random field (MRF) prior to map the network connections among genes. Our prior model assumes that neighboring genes in the network are more likely to have a joint effect on the relevant biological processes.
Results: We use simulated data to assess the performance of our method. In particular, we compare the MRF prior to a situation where independent Bernoulli priors are chosen for the individual predictors. We also illustrate the method on benchmark gene expression datasets. Our simulation studies show that employing the MRF prior improves selection accuracy. In real data applications, in addition to identifying markers and improving prediction accuracy, we show that integrating existing biological knowledge into the prior model increases the ability to identify genes with strong discriminatory power and also aids the interpretation of the results.
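A minimal sketch of an MRF prior over gene-inclusion indicators and one add/delete move of a stochastic search, assuming a common parameterization with a sparsity parameter mu and a smoothing parameter eta (the values are illustrative, and a real sampler would include the discriminant-model likelihood in the acceptance ratio):

```python
import math
import random

def mrf_logprior(gamma, edges, mu=-2.0, eta=1.0):
    """Unnormalised MRF log-prior over 0/1 inclusion indicators gamma:
    mu * (number selected) + eta * (selected neighbour pairs).
    mu < 0 encodes sparsity; eta > 0 rewards selecting genes that are
    connected in the gene-gene network."""
    singles = sum(gamma)
    pairs = sum(gamma[j] * gamma[k] for j, k in edges)
    return mu * singles + eta * pairs

def flip_step(gamma, edges, rng, mu=-2.0, eta=1.0):
    """One add/delete move of a stochastic-search MCMC over gamma,
    accepting by the prior ratio alone (likelihood omitted here)."""
    j = rng.randrange(len(gamma))
    prop = list(gamma)
    prop[j] = 1 - prop[j]
    delta = (mrf_logprior(prop, edges, mu, eta)
             - mrf_logprior(gamma, edges, mu, eta))
    return prop if math.log(rng.random()) < delta else gamma
```

The prior makes jointly selecting two neighbouring genes cheaper than selecting two isolated ones, which is exactly the "neighbouring genes are more likely to have a joint effect" assumption in the abstract.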
Gastric emptying studies are of great interest in human and veterinary medical research, both to evaluate the effects of medications or diets that promote gastrointestinal motility and to examine unintended side effects of new or existing medications, diets, or procedures. Summarizing gastric emptying data is important, as it allows easier comparison between treatments or groups of subjects and comparison of results among studies. The standard method for assessing gastric emptying is scintigraphy, summarizing the nonlinear emptying of the radioisotope. A popular model for fitting gastric emptying data is the power exponential model. This model can only describe a globally decreasing pattern and thus poorly captures localized intragastric events that can occur during emptying. Hence, we develop a new model for gastric emptying studies, based on a mixture of nonlinear mixed effects models, to improve population and individual inferences. One mixture component is based on a power exponential model, which captures globally decreasing patterns. The other is based on a locally extended power exponential model, which captures both local bumps and rapid decay. We refer to this mixture model as a two-component nonlinear mixed effects model. The parameters in our model have clear graphical interpretations that provide a more accurate representation and summary of gastric emptying curves. Two methods are developed to fit the proposed model: one combines an expectation maximization (EM) algorithm with a global two-stage method, and the other combines an EM algorithm with the Monte Carlo EM algorithm. We compare these methods using simulation and find that the two approaches are comparable. For estimating the variance-covariance matrix, the second approach appears more efficient and is also numerically more stable in some cases.
Our new model and approaches are applicable for assessing gastric emptying in human and veterinary medical research and in many other biomedical fields such as pharmacokinetics, toxicokinetics, and physiological research. An example of gastric emptying data from equine medicine is used to demonstrate the advantage of our approaches.
expectation maximization algorithm; global two-stage method; random coefficient model
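Assuming the common parameterization of the power exponential retention curve, R(t) = exp(-(kappa * t)^beta), a coarse grid-search least-squares fit for a single subject can be sketched as follows (the paper's nonlinear mixed effects machinery is far more involved; this only shows the globally decreasing component):

```python
import math

def power_exp(t, kappa, beta):
    """Power exponential retention curve R(t) = exp(-(kappa*t)**beta):
    the fraction of the meal retained at time t."""
    return math.exp(-((kappa * t) ** beta))

def fit_grid(ts, rs):
    """Coarse grid-search least squares over (kappa, beta) -- a crude
    single-subject stand-in for the paper's estimation methods."""
    best = (None, None, float("inf"))
    for ka in [0.05 * i for i in range(1, 60)]:
        for be in [0.1 * j for j in range(5, 30)]:
            sse = sum((power_exp(t, ka, be) - r) ** 2
                      for t, r in zip(ts, rs))
            if sse < best[2]:
                best = (ka, be, sse)
    return best
```

Because R(t) is monotonically decreasing for any kappa, beta > 0, this component alone cannot reproduce a local bump in retention, which is precisely the limitation the second, locally extended component is designed to address.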