There is a growing interest in the study of signal processing and machine learning methods, which may make the brain computer interface (BCI) a new communication channel. A variety of classification methods have been utilized to convert the brain information into control commands. However, most of the methods only produce uncalibrated values and uncertain results.
In this study, we presented a probabilistic method, “enhanced BLDA” (EBLDA), for multi-class motor imagery BCI, which utilized Bayesian linear discriminant analysis (BLDA) with probabilistic output to improve the classification performance. EBLDA builds a new classifier that enlarges the training dataset by adding test samples that are classified with high probability. EBLDA is based on the hypothesis that unlabeled samples with high probability provide valuable information to enhance the learning process and generate a classifier with refined decision boundaries. To investigate the performance of EBLDA, we first used carefully designed simulated datasets to study how EBLDA works. Then, we adopted a real BCI dataset for further evaluation. The current study shows that: 1) Probabilistic information can improve the performance of BCI for subjects with a high kappa coefficient; 2) With supplementary training samples from the test samples of high probability, EBLDA is significantly better than BLDA in classification, especially for small training datasets, in which EBLDA can obtain a refined decision boundary by a shift of the BLDA decision boundary with the support of the information from test samples.
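The idea of enlarging the training set with confidently classified test samples can be sketched as follows. This is only an illustration, not the EBLDA implementation: a one-dimensional nearest-class-mean classifier with Gaussian scores stands in for BLDA, and the names `fit_means`, `predict_proba`, `self_train`, the shared width `sigma`, and the 0.9 threshold are all assumptions.

```python
import math

def fit_means(samples):
    """Return per-class means of 1-D features from (x, label) pairs."""
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_proba(means, x, sigma=1.0):
    """Class posteriors under equal priors and a shared Gaussian width (stand-in for BLDA)."""
    scores = {y: math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for y, m in means.items()}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

def self_train(train, unlabeled, threshold=0.9):
    """Refit once after adding unlabeled samples whose top posterior reaches the threshold."""
    means = fit_means(train)
    augmented = list(train)
    for x in unlabeled:
        proba = predict_proba(means, x)
        label = max(proba, key=proba.get)
        if proba[label] >= threshold:  # hypothetical confidence cutoff
            augmented.append((x, label))
    return fit_means(augmented), len(augmented) - len(train)

# Usage: the ambiguous sample at 1.0 is skipped, the two confident ones are added.
new_means, n_added = self_train([(0.0, "a"), (2.0, "b")], [-1.0, 1.0, 3.0])
```

In this toy case the class means move from 0 and 2 to -0.5 and 2.5, which is the kind of refinement from unlabeled data that the abstract describes.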
The proposed EBLDA could potentially reduce training effort. Therefore, it is valuable for us to realize an effective online BCI system, especially for multi-class BCI systems.
Accurate estimations of life expectancy are important in the management of patients with metastatic cancer affecting the extremities, and help set patient, family, and physician expectations. Clinically, the decision whether to operate on patients with skeletal metastases, as well as the choice of surgical procedure, are predicated on an individual patient's estimated survival. Currently, there are no reliable methods for estimating survival in this patient population. Bayesian classification, which includes Bayesian belief network (BBN) modeling, is a statistical method that explores conditional, probabilistic relationships between variables to estimate the likelihood of an outcome using observed data. Thus, BBN models are being used with increasing frequency in a variety of diagnoses to codify complex clinical data into prognostic models. The purpose of this study was to determine the feasibility of developing Bayesian classifiers to estimate survival in patients undergoing surgery for metastases of the axial and appendicular skeleton.
We searched an institution-owned patient management database for all patients who underwent surgery for skeletal metastases between 1999 and 2003. We then developed and trained a machine-learned BBN model to estimate survival in months using candidate features based on historical data. Ten-fold cross-validation and receiver operating characteristic (ROC) curve analysis were performed to evaluate the BBN model's accuracy and robustness.
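The two evaluation tools named above can be sketched in a few lines. This is a generic illustration rather than the study's code: the area under the ROC curve is computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, and `k_fold_indices` is a hypothetical helper that produces contiguous folds without shuffling or stratification.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def k_fold_indices(n, k=10):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# Usage: each fold is held out once while the model is trained on the other nine.
folds = k_fold_indices(189, k=10)
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```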
A total of 189 consecutive patients were included. First-degree predictors of survival differed between the 3-month and 12-month models. Following cross validation, the area under the ROC curve was 0.85 (95% CI: 0.80–0.93) for 3-month probability of survival and 0.83 (95% CI: 0.77–0.90) for 12-month probability of survival.
A robust, accurate, probabilistic naïve BBN model was successfully developed using observed clinical data to estimate individualized survival in patients with operable skeletal metastases. This method warrants further development and must be externally validated in other patient populations.
The frequency of early fatality and the transient nature of emergency medical care mean that a single database will rarely suffice for population based injury research. Linking records from multiple data sources is therefore a promising method for injury surveillance or trauma system evaluation. The purpose of this article is to review the historical development of record linkage, provide a basic mathematical foundation, discuss some practical issues, and consider some ethical concerns.
Clerical or computer assisted deterministic record linkage methods may suffice for some applications, but probabilistic methods are particularly useful for larger studies. The probabilistic method attempts to simulate human reasoning by comparing each of several elements from the two records. The basic mathematical specifications are derived algebraically from fundamental concepts of probability, although the theory can be extended to include more advanced mathematics.
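One common way to formalize the field-by-field comparison is a Fellegi-Sunter style agreement weight; the article does not spell out its formulas in this summary, so the sketch below is an assumption about the general approach. For each field, `m` is the probability that the field agrees when two records truly match and `u` the probability that it agrees when they do not; the field names and probability values are hypothetical.

```python
import math

def match_weight(agreements, m_probs, u_probs):
    """Sum log2 likelihood ratios over compared fields for one record pair."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = m_probs[field], u_probs[field]
        if agrees:
            total += math.log2(m / u)              # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total

# Hypothetical field probabilities, for illustration only.
m_probs = {"surname": 0.8, "birth_year": 0.5}
u_probs = {"surname": 0.1, "birth_year": 0.25}

weight = match_weight({"surname": True, "birth_year": True}, m_probs, u_probs)
```

Pairs with a total weight above an upper cutoff would be accepted as links, pairs below a lower cutoff rejected, and the pairs in between sent for clerical review.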
Probabilistic, deterministic, and clerical techniques may be combined in different ways depending upon the goal of the record linkage project. If a population parameter is being estimated for a purely statistical study, a completely probabilistic approach may be most efficient; for other applications, where the purpose is to make inferences about specific individuals based upon their data contained in two or more files, the need for a high positive predictive value would favor a deterministic method or a probabilistic method with careful clerical review. Whatever techniques are used, researchers must realize that the combination of data sources entails additional ethical obligations beyond the use of each source alone.
Uncertainty often affects molecular biology experiments and data for different reasons. Heterogeneity of gene or protein expression within the same tumor tissue is an example of biological uncertainty which should be taken into account when molecular markers are used in decision making. Tissue Microarray (TMA) experiments allow for large scale profiling of tissue biopsies, investigating protein patterns characterizing specific disease states. TMA studies deal with multiple sampling of the same patient, and therefore with multiple measurements of the same protein target, to account for possible biological heterogeneity. The aim of this paper is to provide and validate a classification model taking into consideration the uncertainty associated with measuring replicate samples.
We propose an extension of the well-known Naïve Bayes classifier, which accounts for biological heterogeneity in a probabilistic framework, relying on Bayesian hierarchical models. The model, which can be efficiently learned from the training dataset, exploits a closed-form classification equation, thus incurring no additional computational cost with respect to the standard Naïve Bayes classifier. We validated the approach on several simulated datasets, comparing its performance with that of the Naïve Bayes classifier. Moreover, we demonstrated that explicitly dealing with heterogeneity can improve classification accuracy on a TMA prostate cancer dataset.
The proposed Hierarchical Naïve Bayes classifier can be conveniently applied in problems where within sample heterogeneity must be taken into account, such as TMA experiments and biological contexts where several measurements (replicates) are available for the same biological sample. The performance of the new approach is better than the standard Naïve Bayes model, in particular when the within sample heterogeneity is different in the different classes.
One goal of gene expression profiling is to identify signature genes that robustly distinguish different types or grades of tumors. Several tumor classifiers based on expression profiling have been proposed using microarray technique. Due to important differences in the probabilistic models of microarray and SAGE technologies, it is important to develop suitable techniques to select specific genes from SAGE measurements.
A new framework to select specific genes that distinguish different biological states based on the analysis of SAGE data is proposed. The new framework applies the bolstered error for the identification of strong genes that separate the biological states in a feature space defined by the gene expression of a training set. Credibility intervals defined from a probabilistic model of SAGE measurements are used to identify the genes that distinguish the different states with more reliability among all gene groups selected by the strong genes method. A score taking into account the credibility and the bolstered error values in order to rank the groups of considered genes is proposed. Results obtained using SAGE data from gliomas are presented, thus corroborating the introduced methodology.
The model representing counting data, such as SAGE, provides additional statistical information that allows a more robust analysis. The additional statistical information provided by the probabilistic model is incorporated in the methodology described in the paper. The introduced method is suitable to identify signature genes that lead to a good separation of the biological states using SAGE and may be adapted for other counting methods such as Massively Parallel Signature Sequencing (MPSS) or the recent Sequencing-By-Synthesis (SBS) technique. Some of the genes identified by the proposed method may be useful to generate classifiers.
This paper presents a paradigm for generating a quantifiable marker of pathology that supports diagnosis and provides a potential biomarker of neuropsychiatric disorders, such as autism spectrum disorder (ASD). This is achieved by creating high-dimensional nonlinear pattern classifiers using Support Vector Machines (SVM) that learn the underlying pattern of pathology using numerous atlas-based regional features extracted from Diffusion Tensor Imaging (DTI) data. These classifiers, in addition to providing insight into the group separation between patients and controls, are applicable on a single subject basis and have the potential to aid in diagnosis by assigning a probabilistic abnormality score to each subject that quantifies the degree of pathology and can be used in combination with other clinical scores to aid in diagnostic decisions. They also produce a ranking of regions that contribute most to the group classification and separation, thereby providing a neurobiological insight into the pathology. As an illustrative application of the general framework for creating diffusion based abnormality classifiers we create classifiers for a dataset consisting of 45 children with autism spectrum disorder (ASD) (mean age 10.5 ± 2.5 yrs) as compared to 30 typically developing (TD) controls (mean age 10.3 ± 2.5 yrs). Based on the abnormality scores, a distinction between the ASD population and TD controls was achieved with 80% leave one out (LOO) cross-validation accuracy with high significance of p < 0.001, ~84% specificity and ~74% sensitivity. Regions that contributed to this abnormality score involved fractional anisotropy (FA) differences mainly in right occipital regions as well as in left superior longitudinal fasciculus, external and internal capsule while mean diffusivity (MD) discriminates were observed primarily in right occipital gyrus and right temporal white matter.
Diffusion tensor imaging; support vector machines; pattern classification; abnormality score
A measure of bipolar channel importance is proposed for EEG-based detection of neonatal seizures. The channel weights are computed based on the integrated synchrony of classifier probabilistic outputs for the channels which share a common electrode. These estimated time-varying weights are introduced within a Bayesian probabilistic framework to provide a channel-specific and thus adaptive seizure classification scheme. Validation results on a clinical dataset of neonatal seizures confirm the utility of the proposed channel weighting for the two patient-independent seizure detectors recently developed by this research group; one based on support vector machines and the other on Gaussian mixture models. By exploiting the channel weighting, the ROC area can be significantly increased for the most difficult patients, with the average ROC area across 17 patients increased by 22% (relative) for the SVM and by 15% (relative) for the GMM-based detector, respectively. It is shown that the system developed here outperforms the recent published studies in this area.
Neonatal; Seizure; Detection; EEG; Channel; Selection; Weighting; Montage; Classification; Probability
An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict TF binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an on-line web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources.
Test data set, a web tool, source codes and supplementary data are available at: http://www.probtf.org.
One intractable problem with using microarray data analysis for cancer classification is how to reduce the extremely high-dimensionality gene feature data to remove the effects of noise. Feature selection is often used to address this problem by selecting informative genes from among thousands or tens of thousands of genes. However, most of the existing methods of microarray-based cancer classification utilize too many genes to achieve accurate classification, which often hampers the interpretability of the models. For a better understanding of the classification results, it is desirable to develop simpler rule-based models with as few marker genes as possible.
We screened a small number of informative single genes and gene pairs on the basis of their depended degrees proposed in rough sets. Applying the decision rules induced by the selected genes or gene pairs, we constructed cancer classifiers. We tested the efficacy of the classifiers by leave-one-out cross-validation (LOOCV) of training sets and classification of independent test sets.
We applied our methods to five cancerous gene expression datasets: leukemia (acute lymphoblastic leukemia [ALL] vs. acute myeloid leukemia [AML]), lung cancer, prostate cancer, breast cancer, and leukemia (ALL vs. mixed-lineage leukemia [MLL] vs. AML). Accurate classification outcomes were obtained by utilizing just one or two genes. Some genes that correlated closely with the pathogenesis of relevant cancers were identified. In terms of both classification performance and algorithm simplicity, our approach outperformed or at least matched existing methods.
In cancerous gene expression datasets, a small number of genes, even one or two if selected correctly, is capable of achieving an ideal cancer classification effect. This finding also means that very simple rules may perform well for cancerous class prediction.
Classification is the problem of assigning each input object to one of a finite number of classes. This problem has been extensively studied in machine learning and statistics, and there are numerous applications to bioinformatics as well as many other fields. Building a multiclass classifier has been a challenge, where the direct approach of altering the binary classification algorithm to accommodate more than two classes can be computationally too expensive. Hence the indirect approach of using binary decomposition has been commonly used, in which retrieving the class posterior probabilities from the set of binary posterior probabilities given by the individual binary classifiers has been a major issue.
In this work, we present an extension of a recently introduced probabilistic kernel-based learning algorithm called the Classification Relevance Units Machine (CRUM) to the multiclass setting to increase its applicability. The extension is achieved under the error correcting output codes framework. The probabilistic outputs of the binary CRUMs are preserved using a proposed linear-time decoding algorithm, an alternative to the generalized Bradley-Terry (GBT) algorithm whose application to large-scale prediction settings is prohibited by its computational complexity. The resulting classifier is called the Multiclass Relevance Units Machine (McRUM).
The evaluation of McRUM on a variety of real small-scale benchmark datasets shows that our proposed Naïve decoding algorithm is computationally more efficient than the GBT algorithm while maintaining a similar level of predictive accuracy. A set of experiments on a larger-scale dataset for small ncRNA classification was then conducted with Naïve McRUM and compared with the Gaussian and linear SVM. Although McRUM's predictive performance is slightly lower than that of the Gaussian SVM, the results show that a similar true positive rate can be achieved by slightly sacrificing the false positive rate. Furthermore, McRUM is computationally more efficient than the SVM, which is an important factor for large-scale analysis.
We have proposed McRUM, a multiclass extension of binary CRUM. McRUM with Naïve decoding algorithm is computationally efficient in run-time and its predictive performance is comparable to the well-known SVM, showing its potential in solving large-scale multiclass problems in bioinformatics and other fields of study.
The structure of proteins may change as a result of the inherent flexibility of some protein regions. We develop and explore probabilistic machine learning methods for predicting a continuum secondary structure, i.e. assigning probabilities to the conformational states of a residue. We train our methods using data derived from high-quality NMR models.
Several probabilistic models not only successfully estimate the continuum secondary structure, but also provide a categorical output on par with models directly trained on categorical data. Importantly, models trained on the continuum secondary structure are also better than their categorical counterparts at identifying the conformational state for structurally ambivalent residues.
Cascaded probabilistic neural networks trained on the continuum secondary structure exhibit better accuracy in structurally ambivalent regions of proteins, while sustaining an overall classification accuracy on par with standard, categorical prediction methods.
KNET is an environment for constructing probabilistic, knowledge-intensive systems within the axiomatic framework of decision theory. The KNET architecture defines a complete separation between the hypermedia user interface on the one hand, and the representation and management of expert opinion on the other. KNET offers a choice of algorithms for probabilistic inference. My coworkers and I have used KNET to build consultation systems for lymph-node pathology, bone-marrow transplantation therapy, clinical epidemiology, and alarm management in the intensive-care unit.
Most important, KNET contains a randomized approximation scheme (ras) for the difficult and almost certainly intractable problem of Bayesian inference. My algorithm can, in many circumstances, perform efficient approximate inference in large and richly interconnected models of medical diagnosis. In this article, I describe the architecture of KNET, construct a randomized algorithm for probabilistic inference, and analyze the algorithm's performance. Finally, I characterize my algorithm's empiric behavior and explore its potential for parallel speedups. From design to implementation, then, KNET demonstrates the crucial interaction between theoretical computer science and medical informatics.
This paper presents a novel probabilistic approach to speech enhancement. Instead of a deterministic logarithmic relationship, we assume a probabilistic relationship between the frequency coefficients and the log-spectra. The speech model in the log-spectral domain is a Gaussian mixture model (GMM). The frequency coefficients obey a zero-mean Gaussian whose covariance equals the exponential of the log-spectra. This results in a Gaussian scale mixture model (GSMM) for the speech signal in the frequency domain, since the log-spectra can be regarded as scaling factors. The probabilistic relation between frequency coefficients and log-spectra allows these to be treated as two random variables, both to be estimated from the noisy signals. Expectation-maximization (EM) was used to train the GSMM and Bayesian inference was used to compute the posterior signal distribution. Because exact inference of this full probabilistic model is computationally intractable, we developed two approaches to enhance the efficiency: the Laplace method and a variational approximation. The proposed methods were applied to enhance speech corrupted by Gaussian noise and speech-shaped noise (SSN). For both approximations, signals reconstructed from the estimated frequency coefficients provided higher signal-to-noise ratio (SNR), and those reconstructed from the estimated log-spectra produced a lower word recognition error rate because the log-spectra fit the inputs to the recognizer better. Our algorithms effectively reduced the SSN, which algorithms based on spectral analysis were not able to suppress.
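The model described above can be written out roughly as follows. The notation is assumed for illustration and is not taken from the paper: $s$ indexes the mixture component, $x_k$ is the log-spectrum and $X_k$ the frequency coefficient in bin $k$, and $p(x_k \mid s)$ is the Gaussian marginal of component $s$ for that bin.

```latex
% Sketch under assumed notation; per-bin form shown for readability.
\begin{align}
p(x) &= \sum_{s} p(s)\, \mathcal{N}\!\left(x \mid \mu_s, \Sigma_s\right)
  && \text{(GMM over log-spectra)} \\
p(X_k \mid x_k) &= \mathcal{N}\!\left(X_k \mid 0,\; e^{x_k}\right)
  && \text{(zero-mean Gaussian, variance } e^{x_k}\text{)} \\
p(X_k) &= \sum_{s} p(s) \int \mathcal{N}\!\left(X_k \mid 0,\; e^{x_k}\right) p(x_k \mid s)\, \mathrm{d}x_k
  && \text{(Gaussian scale mixture)}
\end{align}
```

The integral in the last line has no closed form, which is consistent with the need for the Laplace and variational approximations mentioned in the abstract.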
Gaussian scale mixture model (GSMM); Laplace method; speech enhancement; variational approximation
The ability to analyze and classify three-dimensional (3D) biological morphology has lagged behind the analysis of other biological data types such as gene sequences. Here, we introduce the techniques of data mining to the study of 3D biological shapes to bring the analyses of phenomes closer to the efficiency of studying genomes. We compiled five training sets of highly variable morphologies of mammalian teeth from the MorphoBrowser database. Samples were labeled either by dietary class or by conventional dental types (e.g. carnassial, selenodont). We automatically extracted a multitude of topological attributes using Geographic Information Systems (GIS)-like procedures that were then used in several combinations of feature selection schemes and probabilistic classification models to build and optimize classifiers for predicting the labels of the training sets. In terms of classification accuracy, computational time and size of the feature sets used, non-repeated best-first search combined with 1-nearest neighbor classifier was the best approach. However, several other classification models combined with the same searching scheme proved practical. The current study represents a first step in the automatic analysis of 3D phenotypes, which will be increasingly valuable with the future increase in 3D morphology and phenomics databases.
The diagnosis of pulmonary embolism demands flexible decision models, both for the presence of clinical confounders and for the variability of local diagnostic resources. As Bayesian networks fully meet this requirement, Bayes Pulmonary embolism Assisted Diagnosis (BayPAD), a probabilistic expert system focused on pulmonary embolism, was developed.
To quantitatively validate and improve BayPAD, the system was applied to 750 patients from a prospective study done in an Italian tertiary hospital where the true pulmonary embolism status was confirmed using pulmonary angiography or ruled out with a lung scan. The proportion of correct diagnoses made by BayPAD (accuracy) and the correctness of the pulmonary embolism probabilities predicted by the model (calibration) were calculated. The calibration was evaluated according to the Cox regression–calibration model.
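In its usual form, the Cox regression-calibration model regresses the observed outcome on the logit of the predicted probability. The symbols below are assumptions chosen for illustration: $Y$ is the confirmed pulmonary embolism status and $\hat{p}$ the probability predicted by the model.

```latex
% Sketch of the standard calibration model; symbols are assumed, not from the paper.
\[
\operatorname{logit} P(Y = 1 \mid \hat{p}) \;=\; \alpha + \beta \,\operatorname{logit}(\hat{p}),
\qquad
\operatorname{logit}(p) = \ln\frac{p}{1 - p}
\]
```

Perfect calibration corresponds to $\alpha = 0$ and $\beta = 1$. A slope $\beta < 1$ indicates predicted probabilities that are too extreme, while $\alpha \neq 0$ indicates systematic over- or under-prediction.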
Before refining the model, accuracy was 88.6%. Once refined, accuracy was 97.2% and 98%, respectively, in the training and validation samples. According to Cox analysis, calibration was satisfactory, despite a tendency to exaggerate the effect of the findings on the probability of pulmonary embolism. The lack of some investigations (such as spiral computed tomography and lower-limb Doppler ultrasound) in the pool of available data often prevents BayPAD from reaching the diagnosis without invasive procedures.
BayPAD offers clinicians a flexible and accurate strategy to diagnose pulmonary embolism. Simple to use, the system performs case‐based reasoning to optimise the use of resources available within a particular hospital. Bayesian networks are expected to have a prominent role in the clinical management of complex diagnostic problems in the near future.
A Bayesian framework is used to calibrate a mass-action model of receptor-mediated apoptosis. Despite parameter non-identifiability and model ‘sloppiness’, Bayes factor analysis discriminates between two alternative models of mitochondrial outer membrane permeabilization.
Bayesian estimation returns a statistically complete joint parameter distribution for mass-action models of receptor-mediated apoptosis calibrated to dynamic, live-cell data. Analysis of joint distributions reveals strong, non-linear correlations between parameters that are poorly captured by a conventional table of mean values and covariances; a high-dimensional distribution must therefore be reported as the true estimate of parameter values. Despite non-identifiability and model ‘sloppiness’, a Bayesian framework returns probabilistic predictions for cell death dynamics that have tight confidence intervals and match experimental data. Use of a Bayesian framework to discriminate between two competing models of mitochondrial outer membrane permeabilization shows that a ‘direct’ mechanism has ∼20-fold greater plausibility than an ‘indirect’ mechanism, even though both models exhibit equally good fits to data for some parameters.
Using models to simulate and analyze biological networks requires principled approaches to parameter estimation and model discrimination. We use Bayesian and Monte Carlo methods to recover the full probability distributions of free parameters (initial protein concentrations and rate constants) for mass-action models of receptor-mediated cell death. The width of the individual parameter distributions is largely determined by non-identifiability, but covariation among parameters, even those that are poorly determined, encodes essential information. Knowledge of joint parameter distributions makes it possible to compute the uncertainty of model-based predictions, whereas ignoring it (e.g., by treating parameters as a simple list of values and variances) yields nonsensical predictions. Computing the Bayes factor from joint distributions yields the odds ratio (∼20-fold) for competing ‘direct’ and ‘indirect’ apoptosis models having different numbers of parameters. Our results illustrate how Bayesian approaches to model calibration and discrimination combined with single-cell data represent a generally useful and rigorous approach to discriminate between competing hypotheses in the face of parametric and topological uncertainty.
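The relation between the Bayes factor and the reported odds can be made explicit. The symbols are assumed for illustration: $M_d$ and $M_i$ denote the ‘direct’ and ‘indirect’ models, $D$ the data, and $\theta$ the free parameters of a model.

```latex
% Sketch under assumed notation; the equal-prior case is an illustrative assumption.
\begin{align}
P(D \mid M) &= \int P(D \mid \theta, M)\, P(\theta \mid M)\, \mathrm{d}\theta \\
\frac{P(M_d \mid D)}{P(M_i \mid D)}
  &= \underbrace{\frac{P(D \mid M_d)}{P(D \mid M_i)}}_{\text{Bayes factor}}
     \times \frac{P(M_d)}{P(M_i)}
\end{align}
```

If the two models are given equal prior probability, a Bayes factor of about 20 corresponds to $P(M_d \mid D) \approx 20/21 \approx 0.95$. Because the marginal likelihood integrates over the whole parameter space, models with different numbers of parameters can be compared directly.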
apoptosis; Bayesian estimation; biochemical networks; modeling
We introduce an automated segmentation method, extended Markov Random Field (eMRF), to classify 21 neuroanatomical structures of mouse brain based on three dimensional (3D) magnetic resonance imaging (MRI). The image data are multispectral: T2-weighted, proton density-weighted, diffusion x, y and z weighted. Earlier research (Ali et al., 2005) successfully explored the use of MRF for mouse brain segmentation. In this research, we study the use of information generated from Support Vector Machine (SVM) to represent the probabilistic information. Since SVM in general has a stronger discriminative power than the Gaussian likelihood method and is able to handle nonlinear classification problems, integrating SVM into MRF improved the classification accuracy. First, eMRF employs the posterior probability distribution obtained from SVM to generate a classification based on the MR intensity. Second, eMRF introduces a new potential function based on location information. Third, to maximize the classification performance, eMRF uses the contribution weights optimally determined for each of the three potential functions: observation, location and contextual functions, which are traditionally equally weighted. We use the voxel overlap percentage and volume difference percentage to evaluate the accuracy of eMRF segmentation and compare the algorithm with three other segmentation methods – mixed ratio sampling SVM (MRS-SVM), atlas-based segmentation and MRF. Validation using classification accuracy indices between automatically segmented and manually traced data shows that eMRF outperforms other methods.
Automated segmentation; Data mining; Magnetic resonance microscopy; Markov Random Field; Mouse brain; Support Vector Machine
We describe a multi-purpose image classifier that can be applied to a wide variety of image classification tasks without modifications or fine-tuning, and yet provides classification accuracy comparable to state-of-the-art task-specific image classifiers. The proposed image classifier first extracts a large set of 1025 image features including polynomial decompositions, high contrast features, pixel statistics, and textures. These features are computed on the raw image, transforms of the image, and transforms of transforms of the image. The feature values are then used to classify test images into a set of pre-defined image classes. This classifier was tested on several different problems including biological image classification and face recognition. Although we cannot make a claim of universality, our experimental results show that this classifier performs as well as or better than classifiers developed specifically for these image classification tasks. Our classifier’s high performance on a variety of classification problems is attributed to (i) a large set of features extracted from images; and (ii) an effective feature selection and weighting algorithm sensitive to specific image classification problems. The algorithms are available for free download from openmicroscopy.org.
Image classification; biological imaging; image features; high dimensional classification
Methodology is developed to classify ethnic status by name using a simple probabilistic model. This method involves the consideration of four rules which may be used to classify individuals using three name components (first, middle and last names). In order to do this, conditional probabilities of ethnic status are estimated from a sample in which the ethnic status is known. Using a split sample technique, the sensitivity and specificity of this methodology were examined in a data set of death registrations. Each of the classification rules performed well on the data from which it was constructed but was not as efficient when applied to another population. Nevertheless, a linear model, in which the sum of the conditional probabilities of each name component is used, achieved a sensitivity and specificity of 97% and 100% respectively in males and 89% and 100% in females.
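The linear rule described above, which sums the conditional probabilities attached to each name component, can be sketched as follows. Everything here is hypothetical: the lookup table, the placeholder name tokens, the `cutoff` value and the function names are illustrative assumptions, and a real table would be estimated from a sample with known status.

```python
def summed_score(name, cond_prob, default=0.0):
    """Sum P(group | component value) over the first, middle and last name components."""
    return sum(cond_prob.get((part, value), default) for part, value in name.items())

def classify(name, cond_prob, cutoff=1.5):
    """Assign the group when the summed score reaches a (hypothetical) cutoff."""
    return summed_score(name, cond_prob) >= cutoff

# Placeholder tokens and probabilities, for illustration only.
cond_prob = {
    ("first", "aaa"): 0.9,
    ("middle", "bbb"): 0.2,
    ("last", "ccc"): 0.95,
}
person = {"first": "aaa", "middle": "bbb", "last": "ccc"}
assigned = classify(person, cond_prob)
```

Sensitivity and specificity would then be estimated on the held-out half of the split sample, not on the half used to build the table.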
The safety and efficacy of racecadotril to treat acute watery diarrhea (AWD) in children is well established; however, its cost effectiveness for infants and children in Europe has not yet been determined.
To evaluate the cost utility of racecadotril adjuvant with oral rehydration solution (ORS) compared to ORS alone for the treatment of AWD in children younger than 5 years old. The analysis is performed from a United Kingdom National Health Service (NHS) perspective.
A decision tree model has been developed in Microsoft® Excel. The model is populated with the best available evidence. Deterministic and probabilistic sensitivity analyses (PSA) have been performed. Health effects are measured as quality-adjusted life years (QALYs) and the model output is cost (2011 GBP) per QALY. The uncertainty in the primary outcome is explored by probabilistic analysis using 1000 iterations of a Monte Carlo simulation.
Deterministic analysis results in a total incremental cost of −£379 in favor of racecadotril and a total incremental QALY gain in favor of racecadotril of +0.0008. The observed cost savings with racecadotril arise from the reduction in primary care reconsultation and secondary referral. The difference in QALYs is largely attributable to the timely resolution of symptoms in the racecadotril arm. Racecadotril remains dominant when base case parameters are varied. Monte Carlo simulation and PSA confirm that racecadotril is the dominant treatment strategy and is almost certainly cost effective, under the central assumptions of the model, at a commonly used willingness to pay proxy threshold range of £20,000–£30,000 per QALY.
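The relationship between the deterministic result and the willingness-to-pay threshold can be made concrete with a short sketch. It uses only the two incremental figures reported above (−£379 and +0.0008 QALYs). The net monetary benefit formulation, the function names, and the normal distributions in the simulation helper are illustrative assumptions, not the structure or inputs of the published model.

```python
import random


def is_dominant(delta_cost, delta_qaly):
    """A strategy is dominant when it is both cheaper and more effective."""
    return delta_cost < 0 and delta_qaly > 0


def net_monetary_benefit(delta_cost, delta_qaly, wtp):
    """Incremental NMB = wtp * incremental QALYs - incremental cost."""
    return wtp * delta_qaly - delta_cost


def simulate_psa(mean_cost, sd_cost, mean_qaly, sd_qaly, n=1000, seed=0):
    """Draw n (delta_cost, delta_qaly) pairs.

    Independent normal distributions are a hypothetical simplification;
    a real PSA samples every model input from its own distribution.
    """
    rng = random.Random(seed)
    return [(rng.gauss(mean_cost, sd_cost), rng.gauss(mean_qaly, sd_qaly))
            for _ in range(n)]


def probability_cost_effective(draws, wtp):
    """Share of simulated draws with a positive net monetary benefit."""
    return sum(net_monetary_benefit(c, q, wtp) > 0 for c, q in draws) / len(draws)


# Deterministic figures reported in the text.
DELTA_COST = -379.0   # GBP, incremental cost of racecadotril + ORS vs ORS alone
DELTA_QALY = 0.0008   # incremental QALYs

dominant = is_dominant(DELTA_COST, DELTA_QALY)
nmb_20k = net_monetary_benefit(DELTA_COST, DELTA_QALY, 20_000)
nmb_30k = net_monetary_benefit(DELTA_COST, DELTA_QALY, 30_000)
```

Because the incremental cost is negative and the incremental QALY gain is positive, no cost-per-QALY ratio needs to be compared against the threshold: the net monetary benefit is positive at any non-negative willingness to pay. The simulation helper only shows the shape of a 1000-iteration analysis; the standard deviations passed to it would have to come from the model's actual input distributions.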
Racecadotril as adjuvant therapy is more effective and less costly compared to ORS alone, from a UK payer perspective, for the treatment of children with acute diarrhea.
cost effectiveness; health economic model; infant; QALY; racecadotril; acute watery diarrhea
Data from metabolomic studies are typically complex and high-dimensional. Principal component analysis (PCA) is currently the most widely used statistical technique for analyzing metabolomic data. However, PCA is limited by the fact that it is not based on a statistical model.
Here, probabilistic principal component analysis (PPCA), which addresses some of the limitations of PCA, is reviewed and extended. A novel extension of PPCA, called probabilistic principal component and covariates analysis (PPCCA), is introduced which provides a flexible approach to jointly model metabolomic data and additional covariate information. The use of a mixture of PPCA models for discovering the number of inherent groups in metabolomic data is demonstrated. The jackknife technique is employed to construct confidence intervals for estimated model parameters throughout. The optimal number of principal components is determined through the use of the Bayesian Information Criterion model selection tool, which is modified to address the high dimensionality of the data.
The methods presented are illustrated through an application to metabolomic data sets. Jointly modeling metabolomic data and covariates was successfully achieved and has the potential to provide deeper insight to the underlying data structure. Examination of confidence intervals for the model parameters, such as loadings, allows for principled and clear interpretation of the underlying data structure. A software package called MetabolAnalyze, freely available through the R statistical software, has been developed to facilitate implementation of the presented methods in the metabolomics field.
Many problems in bioinformatics involve classification based on features such as sequence, structure or morphology. Given multiple classifiers, two crucial questions arise: how does their performance compare, and how can they best be combined to produce a better classifier? A classifier can be evaluated in terms of sensitivity and specificity using benchmark, or gold standard, data, that is, data for which the true classification is known. However, a gold standard is not always available. Here we demonstrate that a Bayesian model for comparing medical diagnostics without a gold standard can be successfully applied in the bioinformatics domain, to genomic scale data sets. We present a new implementation, which unlike previous implementations is applicable to any number of classifiers. We apply this model, for the first time, to the problem of finding the globally optimal logical combination of classifiers.
We compared three classifiers of protein subcellular localisation, and evaluated our estimates of sensitivity and specificity against estimates obtained using a gold standard. The method overestimated sensitivity and specificity with only a small discrepancy, and correctly ranked the classifiers. Diagnostic tests for swine flu were then compared on a small data set. Lastly, classifiers for a genome-wide association study of macular degeneration with 541094 SNPs were analysed. In all cases, run times were feasible, and results precise. The optimal logical combination of classifiers was also determined for all three data sets. Code and data are available from http://bioinformatics.monash.edu.au/downloads/.
The examples demonstrate the methods are suitable for both small and large data sets, applicable to the wide range of bioinformatics classification problems, and robust to dependence between classifiers. In all three test cases, the globally optimal logical combination of the classifiers was found to be their union, according to three out of four ranking criteria. We propose as a general rule of thumb that the union of classifiers will be close to optimal.
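To make the notion of a logical combination of classifiers concrete, the following minimal sketch forms the union (logical OR) and intersection (logical AND) of binary classifier outputs and scores them against known labels. It does not implement the Bayesian model described above, which works without a gold standard; the toy labels and function names are hypothetical and serve only to show how the union trades specificity for sensitivity.

```python
def union(*predictions):
    """Logical OR: positive if any classifier calls the item positive."""
    return [any(votes) for votes in zip(*predictions)]


def intersection(*predictions):
    """Logical AND: positive only if every classifier calls it positive."""
    return [all(votes) for votes in zip(*predictions)]


def sensitivity_specificity(predicted, truth):
    """Sensitivity and specificity against known labels."""
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    tn = sum(1 for p, t in zip(predicted, truth) if not p and not t)
    positives = sum(1 for t in truth if t)
    negatives = len(truth) - positives
    return tp / positives, tn / negatives


# Toy data (not from the study): six items, two classifiers.
truth = [1, 1, 1, 0, 0, 0]
clf_a = [1, 0, 0, 0, 0, 0]
clf_b = [0, 1, 0, 0, 1, 0]

sens_a, spec_a = sensitivity_specificity(clf_a, truth)
sens_b, spec_b = sensitivity_specificity(clf_b, truth)
sens_u, spec_u = sensitivity_specificity(union(clf_a, clf_b), truth)
sens_i, spec_i = sensitivity_specificity(intersection(clf_a, clf_b), truth)
```

By construction, the union can never have lower sensitivity than any of its members and can never have higher specificity, so whether it is close to optimal depends on the ranking criterion and on how specific the individual classifiers already are.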
Binary classifier; Bayesian methods; Protein sub-cellular localisation; Diagnostic tests; Genome wide association studies.
Diffusion magnetic resonance imaging (dMRI) tractography can be employed to simultaneously analyse three-dimensional white matter tracts in the brain. Numerous methods have been proposed to model diffusion-weighted magnetic resonance data for tractography, and we have explored the functionality of some of these for studying white and grey matter pathways in ex vivo mouse brain. Using various deterministic and probabilistic algorithms across a range of regions of interest we found that probabilistic tractography provides a more robust means of visualizing both white and grey matter pathways than deterministic tractography. Importantly, we demonstrate the sensitivity of probabilistic tractography profiles to streamline number, step size, curvature, fiber orientation distribution, and whole-brain versus region of interest seeding. Using anatomically well-defined cortico-thalamic pathways, we show how density maps can permit the topographical assessment of probabilistic tractography. Finally, we show how different tractography approaches can impact on dMRI assessment of tract changes in a mouse deficient for the frontal cortex morphogen, fibroblast growth factor 17. In conclusion, probabilistic tractography can elucidate the phenotypes of mice with neurodegenerative or neurodevelopmental disorders in a quantitative manner.
mouse brain; diffusion-weighted imaging; tractography; constrained spherical deconvolution; Qball; Fgf17
Anomaly detection methods can be very useful in identifying interesting or concerning events. In this work, we develop and examine new probabilistic anomaly detection methods that let us evaluate management decisions for a specific patient and identify those decisions that are highly unusual with respect to patients with the same or similar condition. The statistics used in this detection are derived from probabilistic models such as Bayesian networks that are learned from a database of past patient cases. We evaluate our methods on the problem of detection of unusual hospitalization patterns for patients with community acquired pneumonia. The results show very encouraging detection performance with 0.5 precision at 0.53 recall and give us hope that these techniques may provide the basis of intelligent monitoring systems that alert clinicians to the occurrence of unusual events or decisions.
In the current climate of high-throughput computational biology, the inference of a protein's function from related measurements, such as protein-protein interaction relations, has become a canonical task. Most existing technologies pursue this task as a classification problem, on a term-by-term basis, for each term in a database, such as the Gene Ontology (GO) database, a popular rigorous vocabulary for biological functions. However, ontology structures are essentially hierarchies, with certain top to bottom annotation rules which protein function predictions should in principle follow. Currently, the most common approach to imposing these hierarchical constraints on network-based classifiers is to apply transitive closure to the predictions.
We propose a probabilistic framework to integrate information in relational data, in the form of a protein-protein interaction network, and a hierarchically structured database of terms, in the form of the GO database, for the purpose of protein function prediction. At the heart of our framework is a factorization of local neighborhood information in the protein-protein interaction network across successive ancestral terms in the GO hierarchy. We introduce a classifier within this framework, with computationally efficient implementation, that produces GO-term predictions that naturally obey a hierarchical 'true-path' consistency from root to leaves, without the need for further post-processing.
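One way to see why a factorization across ancestral terms yields 'true-path' consistent predictions is the following minimal sketch. It assumes a tree-shaped hierarchy in which each term has a single parent (GO itself is a directed acyclic graph) and assumes that a conditional probability of each term given its parent is already available. The term names, probabilities, and function names are hypothetical; how the conditional probabilities are obtained from the interaction network is not shown.

```python
def marginal_probs(parent, cond_prob):
    """Multiply conditional probabilities along the path from the root.

    parent: dict term -> parent term (None for the root).
    cond_prob: dict term -> P(protein has term | protein has parent term).
    Each factor is at most 1, so a term's probability never exceeds
    that of its parent.
    """
    memo = {}

    def prob(term):
        if term not in memo:
            p = parent[term]
            memo[term] = cond_prob[term] * (1.0 if p is None else prob(p))
        return memo[term]

    return {term: prob(term) for term in parent}


def is_true_path_consistent(parent, labels):
    """True if every positively predicted term has a positive parent."""
    return all(
        parent[t] is None or labels[parent[t]]
        for t, positive in labels.items() if positive
    )


# Hypothetical four-term hierarchy: root -> A -> B, root -> C.
parent = {"root": None, "A": "root", "B": "A", "C": "root"}
cond_prob = {"root": 1.0, "A": 0.8, "B": 0.9, "C": 0.3}

probs = marginal_probs(parent, cond_prob)
hierarchical = {t: p >= 0.5 for t, p in probs.items()}

# Term-by-term ('flat') predictions can violate the hierarchy.
flat = {"root": True, "A": False, "B": True, "C": False}
```

Thresholding the factorized probabilities at any fixed cut-off gives a consistent label set, because a child can only pass the cut-off if its parent does; the flat label set above needs post-processing to become consistent.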
A cross-validation study, using data from the yeast Saccharomyces cerevisiae, shows our method offers substantial improvements over both standard 'guilt-by-association' (i.e., Nearest-Neighbor) and more refined Markov random field methods, whether in their original form or when post-processed to artificially impose 'true-path' consistency. Further analysis of the results indicates that these improvements are associated with increased predictive capabilities (i.e., increased positive predictive value), and that this increase is consistent uniformly with GO-term depth. Additional in silico validation on a collection of new annotations recently added to GO confirms the advantages suggested by the cross-validation study. Taken as a whole, our results show that a hierarchical approach to network-based protein function prediction, that exploits the ontological structure of protein annotation databases in a principled manner, can offer substantial advantages over the successive application of 'flat' network-based methods.