Echocardiography is routinely used to assess ventricular and valvular function, particularly in patients with known or suspected cardiac disease and evidence of hemodynamic compromise. A cornerstone of echocardiographic imaging is not only the qualitative assessment, but also the quantitative, Doppler-derived velocity characterization of intracardiac blood flow. While simplified equations, such as the modified Bernoulli equation, are used to estimate intracardiac pressure gradients from Doppler velocity data, these equations rest on assumptions about the relative contributions of the different forces driving blood flow. Unfortunately, if these assumptions are not fully understood or are misapplied, the resulting gradient estimates can be significantly in error. We briefly summarize the principles of fluid dynamics used clinically, along with some of the inherent limitations of routine, broad application of the simplified Bernoulli equation.
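For context (a standard worked example, not taken from this abstract): the simplified Bernoulli equation keeps only the convective term of the full Bernoulli relation, with the factor 4 absorbing blood density and unit conversions (pressure gradient in mmHg, velocity in m/s).

\[
\Delta P \approx 4\,v_2^{2},
\qquad\text{modified form: } \Delta P = 4\,(v_2^{2}-v_1^{2})\text{, which retains the proximal velocity } v_1 .
\]

For example, a peak jet velocity of \(v_2 = 4\ \mathrm{m/s}\) implies \(\Delta P \approx 4 \times 16 = 64\ \mathrm{mmHg}\); the simplified form assumes \(v_1\) is negligible relative to \(v_2\).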
Finite mixture models have come to play a prominent role in modelling data. The finite mixture model is predicated on the assumption that distinct latent groups exist in the population, and it is therefore based on a categorical latent variable that distinguishes these groups. Often in practice, however, distinct sub-populations do not actually exist. For example, disease severity (e.g., depression) may vary continuously, so a distinction between diseased and not diseased need not correspond to distinct sub-populations. What is needed, then, is a generalization of the finite mixture's discrete latent predictor to a continuous latent predictor. We cast the finite mixture model as a regression model with a latent Bernoulli predictor, and we propose a latent regression model that replaces the discrete Bernoulli predictor with a continuous latent predictor following a beta distribution. Motivation for the latent regression model arises from applications where distinct latent classes do not exist, but individuals instead vary according to a continuous latent variable. The shapes of the beta density are very flexible and can approximate the discrete Bernoulli distribution. Examples and a simulation are provided to illustrate the latent regression model. In particular, the latent regression model is used to model the placebo effect among drug-treated subjects in a depression study.
Beta distribution; EM algorithm; finite and infinite mixtures; quasi-Newton algorithms; placebo effect; skew normal distribution
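As a concrete illustration of the modelling idea, here is a minimal simulation sketch (ours, not the authors' code; all names and parameter values are illustrative): the finite mixture corresponds to a regression on a latent Bernoulli predictor, and the latent regression replaces it with a Beta-distributed predictor, whose U-shaped densities can approximate the two-point Bernoulli while other shapes describe smoothly varying severity.

import numpy as np

rng = np.random.default_rng(0)
n, beta0, beta1, sigma = 500, 0.0, 2.0, 0.5

# Finite (two-component) mixture: latent Bernoulli predictor z in {0, 1}.
z_bern = rng.binomial(1, 0.4, size=n)
y_mix = beta0 + beta1 * z_bern + rng.normal(0, sigma, size=n)

# Latent regression: continuous Beta predictor z in (0, 1).
# Beta(0.2, 0.3) is U-shaped and approximates the Bernoulli;
# Beta(2, 2) instead describes smoothly varying severity.
z_beta = rng.beta(0.2, 0.3, size=n)
y_latreg = beta0 + beta1 * z_beta + rng.normal(0, sigma, size=n)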
A Boolean network is a graphical model for representing and analyzing the behavior of gene regulatory networks (GRNs). In this context, the accurate and efficient reconstruction of a Boolean network is essential for understanding the gene regulation mechanism and the complex relations that exist therein. In this paper we introduce an elegant and efficient algorithm for the reverse engineering of Boolean networks from a time series of multivariate binary data corresponding to gene expression measurements. We call our method ReBMM, i.e., reverse engineering based on Bernoulli mixture models. The time complexity of most existing reverse engineering techniques is quite high and depends upon the indegree of a node in the network; because of this high complexity, they can only be applied to sparsely connected networks of small size. ReBMM's time complexity is independent of the indegree of a node and quadratic in the number of nodes in the network, a substantial improvement over other techniques, with little or no compromise in accuracy. We have tested ReBMM on a number of artificial datasets, along with simulated data derived from a plant signaling network, and we have also used it to reconstruct a network from real experimental microarray observations of the yeast cell cycle. Our method provides a natural framework for generating rules from a probabilistic model. It is simple and intuitive, and it yields excellent empirical results.
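ReBMM itself is specified in the paper; as background, here is a minimal EM sketch for its probabilistic building block, a plain Bernoulli mixture model, assuming a binary data matrix X of shape (samples, genes) and K components (a generic sketch, not the authors' implementation).

import numpy as np

def bernoulli_mixture_em(X, K, n_iter=100, seed=0, eps=1e-9):
    """Fit a K-component Bernoulli mixture to binary data X (n x d) by EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                  # mixing weights
    theta = rng.uniform(0.25, 0.75, (K, d))   # per-component success probs
    for _ in range(n_iter):
        # E-step: responsibilities from log Bernoulli likelihoods.
        log_lik = (X @ np.log(theta + eps).T
                   + (1 - X) @ np.log(1 - theta + eps).T) + np.log(pi + eps)
        log_lik -= log_lik.max(axis=1, keepdims=True)
        resp = np.exp(log_lik)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates.
        nk = resp.sum(axis=0)
        pi = nk / n
        theta = (resp.T @ X) / nk[:, None]
    return pi, theta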
The MCMC procedure in SAS (PROC MCMC) is designed specifically for Bayesian analysis using Markov chain Monte Carlo (MCMC) algorithms. The program is sufficiently general to handle very complicated statistical models and arbitrary prior distributions. This study introduces the SAS/MCMC procedure and demonstrates its application to quantitative trait locus (QTL) mapping. A real QTL mapping experiment on a female fertility trait in wheat is used as the demonstration example. The fertility trait phenotypes were described under three different models: (1) the Poisson model, (2) the Bernoulli model and (3) the zero-truncated Poisson model. One QTL was identified on the second chromosome. This QTL appears to control the switch for seed-producing ability in female plants but does not affect the number of seeds produced once the switch is turned on.
Bayes; Markov chain Monte Carlo; quantitative trait locus; SAS
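For reference, the three phenotype likelihoods named above can be written down directly. This is a generic Python sketch (the study itself uses SAS PROC MCMC); y, lam and p are illustrative names.

import numpy as np
from scipy.stats import poisson, bernoulli

def loglik_poisson(y, lam):
    return poisson.logpmf(y, lam).sum()

def loglik_bernoulli(y, p):
    return bernoulli.logpmf(y, p).sum()

def loglik_zt_poisson(y, lam):
    # Zero-truncated Poisson, support k = 1, 2, ...:
    # P(Y = k) = exp(-lam) * lam**k / (k! * (1 - exp(-lam)))
    return (poisson.logpmf(y, lam) - np.log1p(-np.exp(-lam))).sum()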
We examine the problem of estimating the spike trains of multiple neurons from voltage traces recorded on one or more extracellular electrodes. Traditional spike-sorting methods rely on thresholding or clustering of recorded signals to identify spikes. While these methods can detect a large fraction of the spikes from a recording, they generally fail to identify synchronous or near-synchronous spikes: cases in which multiple spikes overlap. Here we investigate the geometry of failures in traditional sorting algorithms, and document the prevalence of such errors in multi-electrode recordings from primate retina. We then develop a method for multi-neuron spike sorting using a model that explicitly accounts for the superposition of spike waveforms. We model the recorded voltage traces as a linear combination of spike waveforms plus a stochastic background component of correlated Gaussian noise. Combining this measurement model with a Bernoulli prior over binary spike trains yields a posterior distribution for spikes given the recorded data. We introduce a greedy algorithm to maximize this posterior that we call “binary pursuit”. The algorithm allows modest variability in spike waveforms and recovers spike times with higher precision than the voltage sampling rate. This method substantially corrects cross-correlation artifacts that arise with conventional methods, and substantially outperforms clustering methods on both real and simulated data. Finally, we develop diagnostic tools that can be used to assess errors in spike sorting in the absence of ground truth.
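The paper's binary pursuit handles correlated noise, waveform variability and sub-sample spike timing; the greedy core can be caricatured as follows, under simplifying assumptions we introduce here (unit-variance white noise, fixed templates, a common prior spike probability p).

import numpy as np

def greedy_binary_pursuit(v, templates, p=1e-3):
    """Greedily subtract spike templates from a voltage trace v (length T).

    templates: (K, L) array of spike waveforms. A spike is accepted only if
    it lowers the squared error by more than the Bernoulli prior penalty
    2*log((1-p)/p) (noise variance assumed to be 1 here).
    """
    resid = v.astype(float).copy()
    penalty = 2.0 * np.log((1.0 - p) / p)
    spikes = []
    while True:
        best = None
        for k, w in enumerate(templates):
            # Improvement in squared error from placing w at offset t:
            # 2*<resid, w> - ||w||^2, maximized over t.
            gain = 2.0 * np.correlate(resid, w, mode="valid") - w @ w
            t = int(np.argmax(gain))
            if gain[t] > penalty and (best is None or gain[t] > best[0]):
                best = (gain[t], k, t)
        if best is None:
            return spikes, resid
        _, k, t = best
        resid[t:t + len(templates[k])] -= templates[k]
        spikes.append((k, t))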
The LASSO-Patternsearch algorithm is proposed to efficiently identify patterns of multiple dichotomous risk factors for outcomes of interest in demographic and genomic studies. The patterns considered are those that arise naturally from the log linear expansion of the multivariate Bernoulli density. The method is designed for the case where there is a possibly very large number of candidate patterns but it is believed that only a relatively small number are important. A LASSO is used to greatly reduce the number of candidate patterns, using a novel computational algorithm that can handle an extremely large number of unknowns simultaneously. The patterns surviving the LASSO are further pruned in the framework of (parametric) generalized linear models. A novel tuning procedure based on the GACV for Bernoulli outcomes, modified to act as a model selector, is used at both steps. We applied the method to myopia data from the population-based Beaver Dam Eye Study, exposing physiologically interesting interacting risk factors. We then applied the method to data from a generative model of rheumatoid arthritis based on Problem 3 of the Genetic Analysis Workshop 15, successfully demonstrating its potential to efficiently recover higher-order patterns from attribute vectors of a length typical of genomic studies.
We develop a new principal components analysis (PCA)-type dimension reduction method for binary data. Unlike standard PCA, which is defined on the observed data, the proposed PCA is defined on the logit transform of the success probabilities of the binary observations. Sparsity is introduced into the principal component (PC) loading vectors for enhanced interpretability and more stable extraction of the principal components. Our sparse PCA is formulated as an optimization problem with a criterion function motivated by the penalized Bernoulli likelihood. A majorization-minimization (MM) algorithm is developed to solve the optimization problem efficiently. The effectiveness of the proposed sparse logistic PCA method is illustrated by application to a single nucleotide polymorphism data set and by a simulation study.
Binary data; Dimension reduction; MM algorithm; LASSO; PCA; Regularization; Sparsity
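A compressed sketch of the MM idea (our simplification, not the authors' full algorithm, which handles the penalty and normalization more carefully): the Bernoulli log-likelihood in the logit matrix Theta = Z B' is majorized by a quadratic with curvature bound 1/4, so each iteration reduces to an SVD on a working matrix followed by soft-thresholding of the loadings.

import numpy as np

def sparse_logistic_pca(Y, k=2, lam=0.05, n_iter=200):
    """MM sketch for sparse logistic PCA on binary Y (n x d), rank k."""
    n, d = Y.shape
    Theta = np.zeros((n, d))            # logits of success probabilities
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-Theta))
        # Quadratic majorizer of the negative log-likelihood at Theta
        # (logistic curvature bound 1/4) yields the working matrix:
        Xw = Theta + 4.0 * (Y - P)
        U, s, Vt = np.linalg.svd(Xw, full_matrices=False)
        Z = U[:, :k] * s[:k]            # principal component scores
        B = Vt[:k].T                    # loading vectors (d x k)
        # Soft-threshold the loadings for sparsity (lasso penalty).
        B = np.sign(B) * np.maximum(np.abs(B) - lam, 0.0)
        Theta = Z @ B.T
    return Z, B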
Modeling of cancer hazards at age t deals with a dichotomous population: a small part (the fraction at risk) will get cancer, while the other part will not. Therefore, we conditioned the hazard function, h(t), the probability density function (pdf), f(t), and the survival function, S(t), on the individual frailty α. Assuming α has a Bernoulli distribution, we obtained equations relating the unconditional (population-level) hazard function, hU(t), cumulative hazard function, HU(t), and overall cumulative hazard, H0, to the h(t), f(t), and S(t) of individuals from the fraction at risk. Computing procedures for estimating h(t), f(t), and S(t) were developed and used to fit pancreatic cancer data collected by the SEER9 registries from 1975 through 2004 with the Weibull pdf suggested by the Armitage-Doll model. The parameters of the resulting excellent fit suggest that the age of pancreatic cancer presentation has a time shift of about 17 years and that five mutations are needed for pancreatic cells to become malignant.
cancer incidence; cancer hazard; frailty; Weibull distribution; pancreatic cancer
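The key identities are standard for a Bernoulli (cure-type) frailty and can be stated compactly; writing p = P(α = 1) for the fraction at risk and f, S for that fraction (our notation):

\[
S_U(t) = (1-p) + p\,S(t), \qquad f_U(t) = p\,f(t),
\]
\[
h_U(t) = \frac{f_U(t)}{S_U(t)} = \frac{p\,f(t)}{(1-p) + p\,S(t)}, \qquad
H_U(t) = -\ln S_U(t) \;\longrightarrow\; H_0 = -\ln(1-p) \ \text{as } t \to \infty .
\]

The bounded limit H0 reflects that only the fraction p ever fails, which is what allows the fraction at risk to be estimated from population-level hazard data.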
This paper presents a novel method for the systematic implementation of low-power microelectronic circuits aimed at computing nonlinear cellular and molecular dynamics. The proposed method is based on the Nonlinear Bernoulli Cell Formalism (NBCF), an advanced mathematical framework stemming from the Bernoulli Cell Formalism (BCF) originally exploited for the modular synthesis and analysis of linear, time-invariant, high-dynamic-range, logarithmic filters. Our approach identifies and exploits the striking similarities between the NBCF and the coupled nonlinear ordinary differential equations (ODEs) typically appearing in models of naturally encountered biochemical systems. The resulting continuous-time, continuous-value, low-power CytoMimetic electronic circuits simulate cellular and molecular dynamics rapidly and with good accuracy. The application of the method is illustrated by synthesising, for the first time, microelectronic CytoMimetic topologies that successfully simulate: 1) a nonlinear intracellular calcium oscillation model for several Hill coefficient values and 2) a gene-protein regulatory system model. The dynamic behaviours generated by the proposed CytoMimetic circuits are compared with, and found to be in very good agreement with, their biological counterparts. The circuits exploit the exponential law governing the low-power subthreshold operating regime and have been simulated with realistic parameters from a commercially available CMOS process. They occupy a fraction of a square millimetre while consuming between 1 and 12 microwatts of power. Simulation results for fabrication-related variability are also presented.
The left ventricular relaxation time constant, Tau, is the best index for evaluating left ventricular diastolic function, but its measurement has traditionally been available only in the catheterization laboratory. In the echo lab, several methods for non-invasive measurement of Tau have been tried since 1992; however, almost all of these methods still use the same formula to calculate Tau as in the catheterization laboratory, which makes them inconvenient, time-consuming, and sometimes inaccurate. A simple method to calculate Tau in patients with mitral regurgitation has previously been developed based on Weiss' formula and the simplified Bernoulli equation. Similarly, formulas are developed here by purely mathematical derivation to calculate Tau by continuous-wave Doppler in patients with aortic regurgitation.
The left ventricular relaxation time constant, Tau, is the best index for evaluating left ventricular diastolic function, but its measurement has traditionally been available only in the catheterization laboratory. In the echo lab, several methods for non-invasive measurement of Tau have been tried since 1992; however, almost all of these methods still use the same formula to calculate Tau as in the catheterization laboratory, which makes them inconvenient, time-consuming, and sometimes inaccurate. Based on Weiss' formula and the simplified Bernoulli equation, a simple method is developed by purely mathematical derivation to calculate Tau by continuous-wave Doppler in patients with mitral regurgitation.
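The derivation these two companion abstracts rely on can be sketched as follows (a standard outline, not copied from the papers; it assumes the receiving-chamber pressure is small and roughly constant relative to ventricular pressure). Weiss' formula models isovolumic pressure decay as exponential, and the simplified Bernoulli equation turns two regurgitant-jet velocities read off the CW Doppler envelope into pressure gradients:

\[
P(t) = P_0\,e^{-t/\tau} \;\Rightarrow\; \tau = \frac{t_2 - t_1}{\ln(P_1/P_2)},
\qquad
P_i \approx 4\,v_i^{2} \;\Rightarrow\;
\tau \approx \frac{t_2 - t_1}{\ln\!\big(v_1^{2}/v_2^{2}\big)} = \frac{t_2 - t_1}{2\,\ln(v_1/v_2)} .
\]

Thus Tau follows from two velocity readings \(v_1, v_2\) at times \(t_1 < t_2\) on the decelerating limb of the regurgitant jet, with no invasive pressure measurement.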
The assessment of the influence of many rare BRCA2 missense mutations on cancer risk has proved difficult. A multifactorial likelihood model that predicts the odds of cancer causality for missense variants is effective, but is limited by the availability of family data. As an alternative, we developed functional assays that measure the influence of missense mutations on the ability of BRCA2 to repair DNA damage by homologous recombination and to control centriole amplification. Using these assays, we evaluated 22 missense mutations from the BRCA2 DNA binding domain (DBD) that had been identified in multiple breast cancer families, and compared the results with those from the likelihood model. Thirteen variants inactivated BRCA2 function in at least one assay; two others truncated BRCA2 by aberrant splicing; and seven had no effect on BRCA2 function. Of 10 variants with odds in favor of causality in the likelihood model of 50:1 or more and a posterior probability of pathogenicity of 0.99, eight inactivated BRCA2 function and the other two caused splicing defects. Four variants and four controls displaying odds in favor of neutrality of 50:1 and posterior probabilities of pathogenicity of at least 1 × 10⁻³ had no effect on function in either assay. The strong correlation between the functional assays and likelihood model data suggests that these functional assays are an excellent method for identifying inactivating missense mutations in the BRCA2 DBD and that the assays may be a useful addition to models that predict the likelihood of cancer in carriers of missense mutations.
We developed a generalized linear model of QTL mapping for discrete traits in line crossing experiments. Parameter estimation was achieved using two different algorithms, a mixture model-based EM (expectation–maximization) algorithm and a GEE (generalized estimating equation) algorithm under a heterogeneous residual variance model. The methods were developed using ordinal data, binary data, binomial data and Poisson data as examples. Applications of the methods to simulated as well as real data are presented. The two different algorithms were compared in the data analyses. In most situations, the two algorithms were indistinguishable, but when large QTL are located in large marker intervals, the mixture model-based EM algorithm can fail to converge to the correct solutions. Both algorithms were coded in C++ and interfaced with SAS as a user-defined SAS procedure called PROC QTL.
Advancements in sequencing techniques place personalized genomic medicine on the horizon, bringing with them the responsibility of clinicians to understand the likelihood that a mutation causes disease, and of scientists to separate etiology from nonpathologic variability. Pathogenicity is discernible from patterns of interactions between a missense mutation, the surrounding protein structure, and intermolecular interactions. Physicochemical stability calculations are not possible without structures, which are unavailable for the vast majority of human proteins, so diagnostic accuracy remains in its infancy. To model the effects of missense mutations on functional stability without structure, we combine novel protein sequence analysis algorithms that discern spatial distributions of sequence, evolutionary, and physicochemical conservation, through a new approach to optimizing component selection. Novel components include a combinatory substitution matrix and two heuristic algorithms that detect positions conferring structural support to interaction interfaces. The method reaches 0.91 AUC in ten-fold cross-validation for predicting alteration of function for 6,392 in vitro mutations. For clinical utility, we trained the method on 7,022 disease-associated missense mutations from Online Mendelian Inheritance in Man (OMIM) within a larger randomized set. In a blinded prospective test to distinguish mutations unique to 186 patients with craniosynostosis from those in the 95 highly variant Coriell controls and 1,000 age-matched controls, we achieved roughly one-third sensitivity and perfect specificity. The component algorithms retained during machine learning constitute novel protein sequence analysis techniques that describe the environments supporting neutrality or pathogenicity of mutations. This approach to pathogenetics enables new insight into the mechanistic relationship of missense mutations to disease phenotypes in our patients.
Computational biology; protein stability; machine learning; missense mutation; nonsynonymous SNP; sequence analysis
The measurement of human immunodeficiency virus ribonucleic acid levels over time leads to censored longitudinal data. Models for the dynamics of these levels need to take this data characteristic into account. If groups of patients with different trajectories of the levels over time are suspected, the model class of finite mixtures of mixed-effects models for censored data is required. We describe the model specification and derive the estimation with a suitable expectation-maximization (EM) algorithm. We propose a convenient implementation using closed-form formulae for the mean and variance of the truncated multivariate normal distribution, so that only efficient evaluation of the cumulative multivariate normal distribution function is required. Model selection and methods for inference are also discussed. The application is demonstrated on data from the clinical trial ACTG 315.
Censored response; EM algorithm; Finite mixture; Mixed effects model; Unobserved heterogeneity
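The closed-form moments referred to above are standard for the truncated normal. Here is a univariate illustration with SciPy (the multivariate E-step follows the same pattern; the detection-limit setup and all values are illustrative):

from scipy.stats import truncnorm

# E-step ingredient: mean and variance of a normal(mu, sd) response that is
# known only to lie below a detection limit `lim` (left-censored viral load).
mu, sd, lim = 2.0, 1.5, 1.0
a, b = -float("inf"), (lim - mu) / sd      # standardized truncation bounds
z = truncnorm(a, b, loc=mu, scale=sd)
print(z.mean(), z.var())                   # plug into expected sufficient stats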
To assist in distinguishing disease-causing mutations from non-pathogenic polymorphisms, we developed an objective algorithm to calculate an "estimate of pathogenic probability" (EPP) based on the prevalence of a specific variation, its segregation within families, and its predicted effects on protein structure. Eleven missense variations in the RPE65 gene were evaluated in patients with Leber congenital amaurosis (LCA) using the EPP algorithm. The accuracy of the EPP algorithm was evaluated using a cell-culture assay of RPE65 isomerase activity: the variations were engineered into plasmids containing a human RPE65 cDNA, and the retinoid isomerase activity of each variant was determined in cultured cells. The EPP algorithm predicted eight substitution mutations to be disease-causing variants. The isomerase catalytic activities of these RPE65 variants were all less than 6% of wild-type. In contrast, the EPP algorithm predicted the other three substitutions to be non-disease-causing; their isomerase activities were 68%, 127%, and 110% of wild-type, respectively. We observed complete concordance between the predicted pathogenicities of missense variations in the RPE65 gene and the retinoid isomerase activities measured in a functional assay. These results suggest that the EPP algorithm may be useful for evaluating the pathogenicity of missense variations in other disease genes for which functional assays are not available.
Leber congenital amaurosis; pathogenicity; RPE65; retinoid
The most frequently used tools in bioinformatics are those searching for similarities, or local alignments, between biological sequences. Since the exact dynamic programming algorithm is quadratic, linear-time heuristics such as BLAST are used. Spaced seeds are much more sensitive than the consecutive seed of BLAST and using several seeds represents the current state of the art in approximate search for biological sequences. The most important aspect is computing highly sensitive seeds. Since the problem seems hard, heuristic algorithms are used. The leading software in the common Bernoulli model is the SpEED program.
SpEED uses a hill-climbing method based on the overlap complexity heuristic. We propose a new algorithm for this heuristic that improves its speed by over an order of magnitude. We use the new implementation to compute improved seeds for several software programs. We also compute multiple seeds of the same weight as the MegaBLAST seed, which greatly improve its sensitivity.
Multiple spaced seeds are being used successfully in bioinformatics software programs. Enabling researchers to compute high-quality seeds very quickly will help expand the range of their applications.
Similarity search; Local alignment; Spaced seed; Heuristic algorithm; Sensitivity
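For orientation, overlap complexity is cheap to compute directly. A small sketch under the usual definition (the sum over all relative shifts of 2 raised to the number of aligned match positions; see Ilie and Ilie for the exact formulation used in SpEED):

def overlap_complexity(s1: str, s2: str) -> int:
    """Overlap complexity of two spaced seeds over {'1' (match), '*' (don't care)}.

    Sums 2**sigma(i) over all relative shifts i, where sigma(i) counts the
    positions at which both seeds have a '1' in the overlap.
    """
    n1, n2 = len(s1), len(s2)
    total = 0
    for shift in range(-(n2 - 1), n1):
        sigma = sum(1 for j in range(n2)
                    if 0 <= shift + j < n1
                    and s1[shift + j] == "1" and s2[j] == "1")
        total += 2 ** sigma
    return total

# Lower overlap complexity correlates with higher sensitivity: compare a
# consecutive weight-11 seed with the weight-11 PatternHunter spaced seed.
print(overlap_complexity("11111111111", "11111111111"))
print(overlap_complexity("111*1**1*1**11*111", "111*1**1*1**11*111"))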
Motivation: The number of missense mutations being identified in cancer genomes has greatly increased as a consequence of technological advances and the reduced cost of whole-genome/whole-exome sequencing methods. However, a high proportion of the amino acid substitutions detected in cancer genomes have little or no effect on tumour progression (passenger mutations). Therefore, accurate automated methods capable of discriminating between driver (cancer-promoting) and passenger mutations are becoming increasingly important. In our previous work, we developed the Functional Analysis through Hidden Markov Models (FATHMM) software and, using a model weighted for inherited disease mutations, observed improved performance over alternative computational prediction algorithms. Here, we describe an adaptation of our original algorithm that incorporates a cancer-specific model to potentiate the functional analysis of driver mutations.
Results: The performance of our algorithm was evaluated using two separate benchmarks. In our analysis, we observed improved performance when distinguishing between driver mutations and other germline variants (both disease-causing and putatively neutral mutations). In addition, when discriminating between somatic driver and passenger mutations, we observed performance comparable with that of the leading computational prediction algorithms: SPF-Cancer and TransFIC.
Availability and implementation: A web-based implementation of our cancer-specific model, including a downloadable stand-alone package, is available at http://fathmm.biocompute.org.uk.
Supplementary data are available at Bioinformatics online.
Nonlinear random effects models with finite mixture structures are used to identify polymorphism in pharmacokinetic/pharmacodynamic (PK/PD) phenotypes. An EM algorithm for maximum likelihood estimation is developed that uses sampling-based methods to implement the expectation step, resulting in an analytically tractable maximization step. A benefit of the approach is that no model linearization is performed and the estimation precision can be controlled arbitrarily through the sampling process. A detailed simulation study illustrates the feasibility of the estimation approach and evaluates its performance. Application of the proposed nonlinear random effects mixture model approach to other population PK/PD problems will be of interest for future investigation.
Finite mixture models; Mixed effects models; Pharmacokinetics/pharmacodynamics
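The sampling-based E-step can be illustrated in miniature with a toy version we constructed for one subject, one random effect, and a two-component normal mixture on that effect (all names and values are illustrative): draws from the mixture prior are reweighted by the nonlinear model's likelihood, giving Monte Carlo expectations whose precision is controlled by the number of samples, as described above.

import numpy as np

rng = np.random.default_rng(1)

def model(b, t):
    return np.exp(-b * t)                  # nonlinear PK-style curve

def mc_e_step(y, t, sigma, pis, mus, taus, n_samples=5000):
    """Monte Carlo E-step for one subject: E[b | y] and component posteriors."""
    comp = rng.choice(len(pis), size=n_samples, p=pis)
    b = rng.normal(np.take(mus, comp), np.take(taus, comp))  # prior draws
    resid = y[None, :] - model(b[:, None], t[None, :])
    logw = -0.5 * (resid ** 2).sum(axis=1) / sigma ** 2      # likelihood weights
    w = np.exp(logw - logw.max())
    w /= w.sum()
    post_b = w @ b                                           # E[b | y]
    post_pi = np.array([w[comp == k].sum() for k in range(len(pis))])
    return post_b, post_pi

t = np.linspace(0.5, 4.0, 8)
y = model(0.8, t) + rng.normal(0, 0.05, t.size)
print(mc_e_step(y, t, 0.05, [0.5, 0.5], [0.3, 1.0], [0.2, 0.2]))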
The appearances of perceptually bistable stimuli can, by definition, be reported with confidence, so these stimuli may be useful for investigating how visual cues are learned and combined to construct visual appearance. However, interpreting experimental data (the percentage of trials seen one way or the other) requires a theoretically motivated measure of cue effectiveness. Here we describe a simple Bayesian theory for dichotomous perceptual decisions: the Mixture of Bernoulli Experts (MBE). In this theory, a cue's subjective reliability is the product of a weight and an estimate of the cue's ecological validity. The theory (1) justifies the use of probit analysis to measure the system's reliance on a cue and (2) enables hypothesis testing. To illustrate, we used apparent 3D rotation direction in perceptually ambiguous Necker cube movies to test whether the visual system relied on a newly recruited cue (the position of the stimulus within the visual field) to the same extent whether or not a long-trusted cue (binocular disparity) was present in the display. For six trainees, reliance on the newly recruited cue was similar whether or not the long-trusted cue was present, suggesting that the visual system assumed the new cue to be conditionally independent.
cue combination; cue recruitment; cue learning; bistability; ambiguous figure; perceptual dichotomy; appearance; sensory fusion; machine learning; Bayes; naive Bayes; Bayes rule
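The probit analysis that the theory justifies is the familiar one; a minimal sketch with statsmodels (ours; seen_cw, position, and disparity are illustrative variable names for the binary report and the two cues):

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 400
position = rng.normal(size=n)                # newly recruited cue
disparity = rng.normal(size=n)               # long-trusted cue
# Simulated observer: probit link on a weighted cue combination.
p = norm.cdf(0.8 * position + 1.5 * disparity)
seen_cw = rng.binomial(1, p)

X = sm.add_constant(np.column_stack([position, disparity]))
fit = sm.Probit(seen_cw, X).fit(disp=0)
print(fit.params)    # the slopes measure reliance on each cue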
Summary: In genetics, many evolutionary pathways can be modeled by the ordered accumulation of permanent changes. Mixture models of mutagenetic trees have been used to describe disease progression in cancer and in HIV. In cancer, progression is modeled by the accumulation of chromosomal gains and losses in tumor cells; in HIV, the accumulation of drug resistance-associated mutations in the viral genome is known to be associated with disease progression. From such evolutionary models, genetic progression scores can be derived that assign measures for the disease state to single patients. Rtreemix is an R package for estimating mixture models of evolutionary pathways from observed cross-sectional data and for estimating associated genetic progression scores. The package also provides extended functionality for estimating confidence intervals for estimated model parameters and for evaluating the stability of the estimated evolutionary mixture models.
Availability: Rtreemix is an R package that is freely available from the Bioconductor project at http://www.bioconductor.org and runs on Linux and Windows.
The severity of aortic valve stenosis is assessed either by invasive catheterization or by non-invasive Doppler echocardiography in conjunction with the simplified Bernoulli equation. The catheter measurement is generally considered more accurate, but the procedure is also more likely to have dangerous complications.
The focus here is on examining computational fluid dynamics as an alternative method for analyzing the echo data and determining whether it can provide results similar to the catheter measurement.
An in vitro heart model with a rigid orifice is used as a first step in comparing three approaches: echocardiographic data analyzed with the simplified Bernoulli equation, direct catheter measurement, and echocardiographic data analyzed with computational fluid dynamics (i.e., the Navier-Stokes equations).
For a 0.93 cm² orifice, the maximum pressure gradient predicted by either the simplified Bernoulli equation or computational fluid dynamics was not significantly different from the experimental catheter measurement (p > 0.01). For a smaller 0.52 cm² orifice, there was a small but significant difference (p < 0.01) between the simplified Bernoulli equation and the computational fluid dynamics simulation, with the computational fluid dynamics simulation giving better agreement with the experimental data for some turbulence models.
For this simplified in vitro system, computational fluid dynamics provides an improvement over the simplified Bernoulli equation, with the largest improvement seen at higher degrees of valvular stenosis.
Valvular stenosis; catheter; Doppler echocardiography; computational fluid dynamics; turbulence
Conventional approaches to modeling classification image data can be described in terms of a standard linear model (LM). We show how the problem can instead be characterized as a Generalized Linear Model (GLM) with a Bernoulli distribution. We demonstrate via simulation that this approach estimates the underlying template more accurately in the absence of internal noise. With increasing internal noise, however, the advantage of the GLM over the LM decreases, until the GLM is no more accurate than the LM. We then introduce the Generalized Additive Model (GAM), an extension of the GLM that can be used to estimate smooth classification images adaptively. We show that this approach is more robust to the presence of internal noise and, finally, we demonstrate that the GAM is readily adapted to the estimation of higher-order (nonlinear) classification images and to testing their significance.
Artifacts; Classification; Computer Simulation; Humans; Linear Models; Nonlinear Dynamics; Signal Detection, Psychological; Vision, Ocular/physiology; Visual Perception/physiology; classification images; signal detection theory; generalized linear models; GLM; generalized additive models; GAM
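A minimal sketch of the GLM approach (ours, with illustrative names and a simulated observer): regress the binary responses on the per-trial noise fields with a Bernoulli GLM; the fitted coefficients form the classification image, while the LM estimate is ordinary least squares on the same design.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n_trials, n_pix = 2000, 64
noise = rng.normal(size=(n_trials, n_pix))       # per-trial noise fields
template = np.zeros(n_pix)
template[20:28] = 1.0                            # observer's true template
resp = rng.binomial(1, 1.0 / (1.0 + np.exp(-(noise @ template))))

X = sm.add_constant(noise)
glm_img = sm.GLM(resp, X, family=sm.families.Binomial()).fit().params[1:]
lm_img = sm.OLS(resp, X).fit().params[1:]        # standard LM estimate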
Guanine-rich DNA sequences of a particular form have the ability to fold into four-stranded structures called G-quadruplexes. In this paper, we present a working rule to predict which primary sequences can form this structure, and describe a search algorithm to identify such sequences in genomic DNA. We count the number of quadruplexes found in the human genome and compare that with the figure predicted by modelling DNA as a Bernoulli stream or as a Markov chain, using windows of various sizes. We demonstrate that the distribution of loop lengths is significantly different from what would be expected in a random case, providing an indication of the number of potentially relevant quadruplex-forming sequences. In particular, we show that there is a significant repression of quadruplexes in the coding strand of exonic regions, which suggests that quadruplex-forming patterns are disfavoured in sequences that will form RNA.
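The working rule in this line of research (four runs of three or more guanines separated by loops of one to seven bases) translates directly into a regular-expression search; a minimal sketch (our code, not the paper's algorithm):

import re

# Folding rule: G{3+} N{1-7} G{3+} N{1-7} G{3+} N{1-7} G{3+}
QUAD = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")

def find_quadruplexes(seq: str):
    """Yield (start, matched_sequence) for putative quadruplex motifs."""
    for m in QUAD.finditer(seq.upper()):
        yield m.start(), m.group()

print(list(find_quadruplexes("ttGGGaGGGttcGGGtaGGGcc")))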
Quite often a single protein mutation, or a combination of mutations, is linked to a specific disease. However, distinguishing from sequence information which mutations have real effects on the protein's function is not trivial. Protein design tools are commonly used to explain mutations that affect protein stability or protein–protein interactions, but not mutations that could affect protein–DNA binding. Here, we used the protein design algorithm FoldX to model all known missense mutations in the paired box domain of Pax6, a highly conserved transcription factor involved in eye development and in several diseases such as aniridia. The validity of FoldX for dealing with protein–DNA interactions was demonstrated by showing that high levels of accuracy can be achieved for mutations affecting these interactions, and we also showed that protein design algorithms can accurately reproduce experimental DNA-binding logos. We conclude that 88% of the Pax6 mutations can be linked to changes in intrinsic stability (77%) and/or in its capability to bind DNA (30%). Our study emphasizes the importance of structure-based analysis for understanding the molecular basis of disease and shows that protein–DNA interactions can be analyzed to the same level of accuracy as protein stability or protein–protein interactions.