Search tips
Search criteria

Results 1-25 (1205916)

Clipboard (0)

Related Articles

1.  Moving from Data on Deaths to Public Health Policy in Agincourt, South Africa: Approaches to Analysing and Understanding Verbal Autopsy Findings 
PLoS Medicine  2010;7(8):e1000325.
Peter Byass and colleagues compared two methods of assessing data from verbal autopsies, review by physicians or probabilistic modeling, and show that probabilistic modeling is the most efficient means of analyzing these data
Cause of death data are an essential source for public health planning, but their availability and quality are lacking in many parts of the world. Interviewing family and friends after a death has occurred (a procedure known as verbal autopsy) provides a source of data where deaths otherwise go unregistered; but sound methods for interpreting and analysing the ensuing data are essential. Two main approaches are commonly used: either physicians review individual interview material to arrive at probable cause of death, or probabilistic models process the data into likely cause(s). Here we compare and contrast these approaches as applied to a series of 6,153 deaths which occurred in a rural South African population from 1992 to 2005. We do not attempt to validate either approach in absolute terms.
Methods and Findings
The InterVA probabilistic model was applied to a series of 6,153 deaths which had previously been reviewed by physicians. Physicians used a total of 250 cause-of-death codes, many of which occurred very rarely, while the model used 33. Cause-specific mortality fractions, overall and for population subgroups, were derived from the model's output, and the physician causes coded into comparable categories. The ten highest-ranking causes accounted for 83% and 88% of all deaths by physician interpretation and probabilistic modelling respectively, and eight of the highest ten causes were common to both approaches. Top-ranking causes of death were classified by population subgroup and period, as done previously for the physician-interpreted material. Uncertainty around the cause(s) of individual deaths was recognised as an important concept that should be reflected in overall analyses. One notably discrepant group involved pulmonary tuberculosis as a cause of death in adults aged over 65, and these cases are discussed in more detail, but the group only accounted for 3.5% of overall deaths.
There were no differences between physician interpretation and probabilistic modelling that might have led to substantially different public health policy conclusions at the population level. Physician interpretation was more nuanced than the model, for example in identifying cancers at particular sites, but did not capture the uncertainty associated with individual cases. Probabilistic modelling was substantially cheaper and faster, and completely internally consistent. Both approaches characterised the rise of HIV-related mortality in this population during the period observed, and reached similar findings on other major causes of mortality. For many purposes probabilistic modelling appears to be the best available means of moving from data on deaths to public health actions.
Please see later in the article for the Editors' Summary
Editors' Summary
Whenever someone dies in a developed country, the cause of death is determined by a doctor and entered into a “vital registration system,” a record of all the births and deaths in that country. Public-health officials and medical professionals use this detailed and complete information about causes of death to develop public-health programs and to monitor how these programs affect the nation's health. Unfortunately, in many developing countries dying people are not attended by doctors and vital registration systems are incomplete. In most African countries, for example, less than one-quarter of deaths are recorded in vital registration systems. One increasingly important way to improve knowledge about the patterns of death in developing countries is “verbal autopsy” (VA). Using a standard form, trained personnel ask relatives and caregivers about the symptoms that the deceased had before his/her death and about the circumstances surrounding the death. Physicians then review these forms and assign a specific cause of death from a shortened version of the International Classification of Diseases, a list of codes for hundreds of diseases.
Why Was This Study Done?
Physician review of VA forms is time-consuming and expensive. Consequently, computer-based, “probabilistic” models have been developed that process the VA data and provide a likely cause of death. These models are faster and cheaper than physician review of VAs and, because they do not rely on the views of local doctors about the likely causes of death, they are more internally consistent. But are physician review and probabilistic models equally sound ways of interpreting VA data? In this study, the researchers compare and contrast the interpretation of VA data by physician review and by a probabilistic model called the InterVA model by applying these two approaches to the deaths that occurred in Agincourt, a rural region of northeast South Africa, between 1992 and 2005. The Agincourt health and sociodemographic surveillance system is a member of the INDEPTH Network, a global network that is evaluating the health and demographic characteristics (for example, age, gender, and education) of populations in low- and middle-income countries over several years.
What Did the Researchers Do and Find?
The researchers applied the InterVA probabilistic model to 6,153 deaths that had been previously reviewed by physicians. They grouped the 250 cause-of-death codes used by the physicians into categories comparable with the 33 cause-of-death codes used by the InterVA model and derived cause-specific mortality fractions (the proportions of the population dying from specific causes) for the whole population and for subgroups (for example, deaths in different age groups and deaths occurring over specific periods of time) from the output of both approaches. The ten highest-ranking causes of death accounted for 83% and 88% of all deaths by physician interpretation and by probabilistic modelling, respectively. Eight of the most frequent causes of death—HIV, tuberculosis, chronic heart conditions, diarrhea, pneumonia/sepsis, transport-related accidents, homicides, and indeterminate—were common to both interpretation methods. Both methods coded about a third of all deaths as indeterminate, often because of incomplete VA data. Generally, there was close agreement between the methods for the five principal causes of death for each age group and for each period of time, although one notable discrepancy was pulmonary (lung) tuberculosis, which accounted for 6.4% and 21.3% of deaths in this age group, respectively, according to the physicians and to the model. However, these deaths accounted for only 3.5% of all the deaths.
What Do These Findings Mean?
These findings reveal no differences between the cause-specific mortality fractions determined from VA data by physician interpretation and by probabilistic modelling that might have led to substantially different public-health policy programmes being initiated in this population. Importantly, both approaches clearly chart the rise of HIV-related mortality in this South African population between 1992 and 2005 and reach similar findings on other major causes of mortality. The researchers note that, although preparing the amount of VA data considered here for entry into the probabilistic model took several days, the model itself runs very quickly and always gives consistent answers. Given these findings, the researchers conclude that in many settings probabilistic modeling represents the best means of moving from VA data to public-health actions.
Additional Information
Please access these Web sites via the online version of this summary at
The importance of accurate data on death is further discussed in a perspective previously published in PLoS Medicine Perspective by Colin Mathers and Ties Boerma
The World Health Organization (WHO) provides information on the vital registration of deaths and on the International Classification of Diseases; the WHO Health Metrics Network is a global collaboration focused on improving sources of vital statistics; and the WHO Global Health Observatory brings together core health statistics for WHO member states
The INDEPTH Network is a global collaboration that is collecting health statistics from developing countries; it provides more information about the Agincourt health and socio-demographic surveillance system and access to standard VA forms
Information on the Agincourt health and sociodemographic surveillance system is available on the University of Witwatersrand Web site
The InterVA Web site provides resources for interpreting verbal autopsy data and the Umeå Centre for Global Health Reseach, where the InterVA model was developed, is found at
A recent PLoS Medicine Essay by Peter Byass, lead author of this study, discusses The Unequal World of Health Data
PMCID: PMC2923087  PMID: 20808956
2.  Kernel-imbedded Gaussian processes for disease classification using microarray gene expression data 
BMC Bioinformatics  2007;8:67.
Designing appropriate machine learning methods for identifying genes that have a significant discriminating power for disease outcomes has become more and more important for our understanding of diseases at genomic level. Although many machine learning methods have been developed and applied to the area of microarray gene expression data analysis, the majority of them are based on linear models, which however are not necessarily appropriate for the underlying connection between the target disease and its associated explanatory genes. Linear model based methods usually also bring in false positive significant features more easily. Furthermore, linear model based algorithms often involve calculating the inverse of a matrix that is possibly singular when the number of potentially important genes is relatively large. This leads to problems of numerical instability. To overcome these limitations, a few non-linear methods have recently been introduced to the area. Many of the existing non-linear methods have a couple of critical problems, the model selection problem and the model parameter tuning problem, that remain unsolved or even untouched. In general, a unified framework that allows model parameters of both linear and non-linear models to be easily tuned is always preferred in real-world applications. Kernel-induced learning methods form a class of approaches that show promising potentials to achieve this goal.
A hierarchical statistical model named kernel-imbedded Gaussian process (KIGP) is developed under a unified Bayesian framework for binary disease classification problems using microarray gene expression data. In particular, based on a probit regression setting, an adaptive algorithm with a cascading structure is designed to find the appropriate kernel, to discover the potentially significant genes, and to make the optimal class prediction accordingly. A Gibbs sampler is built as the core of the algorithm to make Bayesian inferences. Simulation studies showed that, even without any knowledge of the underlying generative model, the KIGP performed very close to the theoretical Bayesian bound not only in the case with a linear Bayesian classifier but also in the case with a very non-linear Bayesian classifier. This sheds light on its broader usability to microarray data analysis problems, especially to those that linear methods work awkwardly. The KIGP was also applied to four published microarray datasets, and the results showed that the KIGP performed better than or at least as well as any of the referred state-of-the-art methods did in all of these cases.
Mathematically built on the kernel-induced feature space concept under a Bayesian framework, the KIGP method presented in this paper provides a unified machine learning approach to explore both the linear and the possibly non-linear underlying relationship between the target features of a given binary disease classification problem and the related explanatory gene expression data. More importantly, it incorporates the model parameter tuning into the framework. The model selection problem is addressed in the form of selecting a proper kernel type. The KIGP method also gives Bayesian probabilistic predictions for disease classification. These properties and features are beneficial to most real-world applications. The algorithm is naturally robust in numerical computation. The simulation studies and the published data studies demonstrated that the proposed KIGP performs satisfactorily and consistently.
PMCID: PMC1821044  PMID: 17328811
3.  Accurate multimodal probabilistic prediction of conversion to Alzheimer's disease in patients with mild cognitive impairment☆ 
NeuroImage : Clinical  2013;2:735-745.
Accurately identifying the patients that have mild cognitive impairment (MCI) who will go on to develop Alzheimer's disease (AD) will become essential as new treatments will require identification of AD patients at earlier stages in the disease process. Most previous work in this area has centred around the same automated techniques used to diagnose AD patients from healthy controls, by coupling high dimensional brain image data or other relevant biomarker data to modern machine learning techniques. Such studies can now distinguish between AD patients and controls as accurately as an experienced clinician. Models trained on patients with AD and control subjects can also distinguish between MCI patients that will convert to AD within a given timeframe (MCI-c) and those that remain stable (MCI-s), although differences between these groups are smaller and thus, the corresponding accuracy is lower. The most common type of classifier used in these studies is the support vector machine, which gives categorical class decisions. In this paper, we introduce Gaussian process (GP) classification to the problem. This fully Bayesian method produces naturally probabilistic predictions, which we show correlate well with the actual chances of converting to AD within 3 years in a population of 96 MCI-s and 47 MCI-c subjects. Furthermore, we show that GPs can integrate multimodal data (in this study volumetric MRI, FDG-PET, cerebrospinal fluid, and APOE genotype with the classification process through the use of a mixed kernel). The GP approach aids combination of different data sources by learning parameters automatically from training data via type-II maximum likelihood, which we compare to a more conventional method based on cross validation and an SVM classifier. When the resulting probabilities from the GP are dichotomised to produce a binary classification, the results for predicting MCI conversion based on the combination of all three types of data show a balanced accuracy of 74%. This is a substantially higher accuracy than could be obtained using any individual modality or using a multikernel SVM, and is competitive with the highest accuracy yet achieved for predicting conversion within three years on the widely used ADNI dataset.
•Prediction of MCI to AD conversion using ADNI data and Gaussian processes.•74% accuracy, 0.795 area under ROC curve for predicting conversion within 3 years.•Gaussian processes allow automatic parameter tuning including multimodal weights.•Statistically significant improvement for multimodal vs best unimodal prediction.•Probabilistic interpretation of results to better reflect continuum of disease.
PMCID: PMC3777690  PMID: 24179825
Alzheimer's disease; Mild cognitive impairment; Gaussian process; Support vector machine; Multimodality; Probabilistic classification; Risk scores
4.  A probabilistic method to estimate the burden of maternal morbidity in resource-poor settings: preliminary development and evaluation 
Maternal morbidity is more common than maternal death, and population-based estimates of the burden of maternal morbidity could provide important indicators for monitoring trends, priority setting and evaluating the health impact of interventions. Methods based on lay reporting of obstetric events have been shown to lack specificity and there is a need for new approaches to measure the population burden of maternal morbidity. A computer-based probabilistic tool was developed to estimate the likelihood of maternal morbidity and its causes based on self-reported symptoms and pregnancy/delivery experiences. Development involved the use of training datasets of signs, symptoms and causes of morbidity from 1734 facility-based deliveries in Benin and Burkina Faso, as well as expert review. Preliminary evaluation of the method compared the burden of maternal morbidity and specific causes from the probabilistic tool with clinical classifications of 489 recently-delivered women from Benin, Bangladesh and India.
Using training datasets, it was possible to create a probabilistic tool that handled uncertainty of women’s self reports of pregnancy and delivery experiences in a unique way to estimate population-level burdens of maternal morbidity and specific causes that compared well with clinical classifications of the same data. When applied to test datasets, the method overestimated the burden of morbidity compared with clinical review, although possible conceptual and methodological reasons for this were identified.
The probabilistic method shows promise and may offer opportunities for standardised measurement of maternal morbidity that allows for the uncertainty of women’s self-reported symptoms in retrospective interviews. However, important discrepancies with clinical classifications were observed and the method requires further development, refinement and evaluation in a range of settings.
PMCID: PMC3975153  PMID: 24620784
Maternal health; Morbidity; Developing countries; Pregnancy; Childbirth; Bayesian analysis; Africa; Asia
5.  Practical introduction to record linkage for injury research 
Injury Prevention  2004;10(3):186-191.
The frequency of early fatality and the transient nature of emergency medical care mean that a single database will rarely suffice for population based injury research. Linking records from multiple data sources is therefore a promising method for injury surveillance or trauma system evaluation. The purpose of this article is to review the historical development of record linkage, provide a basic mathematical foundation, discuss some practical issues, and consider some ethical concerns.
Clerical or computer assisted deterministic record linkage methods may suffice for some applications, but probabilistic methods are particularly useful for larger studies. The probabilistic method attempts to simulate human reasoning by comparing each of several elements from the two records. The basic mathematical specifications are derived algebraically from fundamental concepts of probability, although the theory can be extended to include more advanced mathematics.
Probabilistic, deterministic, and clerical techniques may be combined in different ways depending upon the goal of the record linkage project. If a population parameter is being estimated for a purely statistical study, a completely probabilistic approach may be most efficient; for other applications, where the purpose is to make inferences about specific individuals based upon their data contained in two or more files, the need for a high positive predictive value would favor a deterministic method or a probabilistic method with careful clerical review. Whatever techniques are used, researchers must realize that the combination of data sources entails additional ethical obligations beyond the use of each source alone.
PMCID: PMC1730090  PMID: 15178677
6.  An Enhanced Probabilistic LDA for Multi-Class Brain Computer Interface 
PLoS ONE  2011;6(1):e14634.
There is a growing interest in the study of signal processing and machine learning methods, which may make the brain computer interface (BCI) a new communication channel. A variety of classification methods have been utilized to convert the brain information into control commands. However, most of the methods only produce uncalibrated values and uncertain results.
Methodology/Principal Findings
In this study, we presented a probabilistic method “enhanced BLDA” (EBLDA) for multi-class motor imagery BCI, which utilized Bayesian linear discriminant analysis (BLDA) with probabilistic output to improve the classification performance. EBLDA builds a new classifier that enlarges training dataset by adding test samples with high probability. EBLDA is based on the hypothesis that unlabeled samples with high probability provide valuable information to enhance learning process and generate a classifier with refined decision boundaries. To investigate the performance of EBLDA, we first used carefully designed simulated datasets to study how EBLDA works. Then, we adopted a real BCI dataset for further evaluation. The current study shows that: 1) Probabilistic information can improve the performance of BCI for subjects with high kappa coefficient; 2) With supplementary training samples from the test samples of high probability, EBLDA is significantly better than BLDA in classification, especially for small training datasets, in which EBLDA can obtain a refined decision boundary by a shift of BLDA decision boundary with the support of the information from test samples.
The proposed EBLDA could potentially reduce training effort. Therefore, it is valuable for us to realize an effective online BCI system, especially for multi-class BCI systems.
PMCID: PMC3031502  PMID: 21297944
7.  A hierarchical Naïve Bayes Model for handling sample heterogeneity in classification problems: an application to tissue microarrays 
BMC Bioinformatics  2006;7:514.
Uncertainty often affects molecular biology experiments and data for different reasons. Heterogeneity of gene or protein expression within the same tumor tissue is an example of biological uncertainty which should be taken into account when molecular markers are used in decision making. Tissue Microarray (TMA) experiments allow for large scale profiling of tissue biopsies, investigating protein patterns characterizing specific disease states. TMA studies deal with multiple sampling of the same patient, and therefore with multiple measurements of same protein target, to account for possible biological heterogeneity. The aim of this paper is to provide and validate a classification model taking into consideration the uncertainty associated with measuring replicate samples.
We propose an extension of the well-known Naïve Bayes classifier, which accounts for biological heterogeneity in a probabilistic framework, relying on Bayesian hierarchical models. The model, which can be efficiently learned from the training dataset, exploits a closed-form of classification equation, thus providing no additional computational cost with respect to the standard Naïve Bayes classifier. We validated the approach on several simulated datasets comparing its performances with the Naïve Bayes classifier. Moreover, we demonstrated that explicitly dealing with heterogeneity can improve classification accuracy on a TMA prostate cancer dataset.
The proposed Hierarchical Naïve Bayes classifier can be conveniently applied in problems where within sample heterogeneity must be taken into account, such as TMA experiments and biological contexts where several measurements (replicates) are available for the same biological sample. The performance of the new approach is better than the standard Naïve Bayes model, in particular when the within sample heterogeneity is different in the different classes.
PMCID: PMC1698579  PMID: 17125514
8.  Estimating Survival in Patients with Operable Skeletal Metastases: An Application of a Bayesian Belief Network 
PLoS ONE  2011;6(5):e19956.
Accurate estimations of life expectancy are important in the management of patients with metastatic cancer affecting the extremities, and help set patient, family, and physician expectations. Clinically, the decision whether to operate on patients with skeletal metastases, as well as the choice of surgical procedure, are predicated on an individual patient's estimated survival. Currently, there are no reliable methods for estimating survival in this patient population. Bayesian classification, which includes Bayesian belief network (BBN) modeling, is a statistical method that explores conditional, probabilistic relationships between variables to estimate the likelihood of an outcome using observed data. Thus, BBN models are being used with increasing frequency in a variety of diagnoses to codify complex clinical data into prognostic models. The purpose of this study was to determine the feasibility of developing Bayesian classifiers to estimate survival in patients undergoing surgery for metastases of the axial and appendicular skeleton.
We searched an institution-owned patient management database for all patients who underwent surgery for skeletal metastases between 1999 and 2003. We then developed and trained a machine-learned BBN model to estimate survival in months using candidate features based on historical data. Ten-fold cross-validation and receiver operating characteristic (ROC) curve analysis were performed to evaluate the BNN model's accuracy and robustness.
A total of 189 consecutive patients were included. First-degree predictors of survival differed between the 3-month and 12-month models. Following cross validation, the area under the ROC curve was 0.85 (95% CI: 0.80–0.93) for 3-month probability of survival and 0.83 (95% CI: 0.77–0.90) for 12-month probability of survival.
A robust, accurate, probabilistic naïve BBN model was successfully developed using observed clinical data to estimate individualized survival in patients with operable skeletal metastases. This method warrants further development and must be externally validated in other patient populations.
PMCID: PMC3094405  PMID: 21603644
9.  Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources 
PLoS ONE  2008;3(3):e1820.
An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict TF binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as, multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an on-line web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources. Test data set, a web tool, source codes and supplementary data are available at:
PMCID: PMC2268002  PMID: 18364997
10.  Quantifying diffusion MRI tractography of the corticospinal tract in brain tumors with deterministic and probabilistic methods☆ 
NeuroImage : Clinical  2013;3:361-368.
Diffusion MRI tractography has been increasingly used to delineate white matter pathways in vivo for which the leading clinical application is presurgical mapping of eloquent regions. However, there is rare opportunity to quantify the accuracy or sensitivity of these approaches to delineate white matter fiber pathways in vivo due to the lack of a gold standard. Intraoperative electrical stimulation (IES) provides a gold standard for the location and existence of functional motor pathways that can be used to determine the accuracy and sensitivity of fiber tracking algorithms. In this study we used intraoperative stimulation from brain tumor patients as a gold standard to estimate the sensitivity and accuracy of diffusion tensor MRI (DTI) and q-ball models of diffusion with deterministic and probabilistic fiber tracking algorithms for delineation of motor pathways.
We used preoperative high angular resolution diffusion MRI (HARDI) data (55 directions, b = 2000 s/mm2) acquired in a clinically feasible time frame from 12 patients who underwent a craniotomy for resection of a cerebral glioma. The corticospinal fiber tracts were delineated with DTI and q-ball models using deterministic and probabilistic algorithms. We used cortical and white matter IES sites as a gold standard for the presence and location of functional motor pathways. Sensitivity was defined as the true positive rate of delineating fiber pathways based on cortical IES stimulation sites. For accuracy and precision of the course of the fiber tracts, we measured the distance between the subcortical stimulation sites and the tractography result. Positive predictive rate of the delineated tracts was assessed by comparison of subcortical IES motor function (upper extremity, lower extremity, face) with the connection of the tractography pathway in the motor cortex.
We obtained 21 cortical and 8 subcortical IES sites from intraoperative mapping of motor pathways. Probabilistic q-ball had the best sensitivity (79%) as determined from cortical IES compared to deterministic q-ball (50%), probabilistic DTI (36%), and deterministic DTI (10%). The sensitivity using the q-ball algorithm (65%) was significantly higher than using DTI (23%) (p < 0.001) and the probabilistic algorithms (58%) were more sensitive than deterministic approaches (30%) (p = 0.003). Probabilistic q-ball fiber tracks had the smallest offset to the subcortical stimulation sites. The offsets between diffusion fiber tracks and subcortical IES sites were increased significantly for those cases where the diffusion fiber tracks were visibly thinner than expected. There was perfect concordance between the subcortical IES function (e.g. hand stimulation) and the cortical connection of the nearest diffusion fiber track (e.g. upper extremity cortex).
This study highlights the tremendous utility of intraoperative stimulation sites to provide a gold standard from which to evaluate diffusion MRI fiber tracking methods and has provided an object standard for evaluation of different diffusion models and approaches to fiber tracking. The probabilistic q-ball fiber tractography was significantly better than DTI methods in terms of sensitivity and accuracy of the course through the white matter. The commonly used DTI fiber tracking approach was shown to have very poor sensitivity (as low as 10% for deterministic DTI fiber tracking) for delineation of the lateral aspects of the corticospinal tract in our study. Effects of the tumor/edema resulted in significantly larger offsets between the subcortical IES and the preoperative fiber tracks. The provided data show that probabilistic HARDI tractography is the most objective and reproducible analysis but given the small sample and number of stimulation points a generalization about our results should be given with caution. Indeed our results inform the capabilities of preoperative diffusion fiber tracking and indicate that such data should be used carefully when making pre-surgical and intra-operative management decisions.
•Diffusion MRI tractography is used for presurgical brain mapping.•We use intraoperative electric stimulation as a gold standard.•We delineate motor tracts with deterministic and probabilistic DTI and q-ball models.•Probabilistic q-ball has the best sensitivity (79%).•Probabilistic q-ball fiber tracks had the smallest offset to the subcortical IES.
PMCID: PMC3815019  PMID: 24273719
Diffusion MRI Tractography; Corticospinal tract; q-Ball; DTI; Brain tumor; Intraoperative electrical stimulation (IES)
11.  Using multiparametric data with missing features for learning patterns of pathology 
The paper presents a method for learning multimodal classifiers from datasets in which not all subjects have data from all modalities. Usually, subjects with a severe form of pathology are the ones failing to satisfactorily complete the study, especially when it consists of multiple imaging modalities. A classifier capable of handling subjects with unequal numbers of modalities prevents discarding any subjects, as is traditionally done, thereby broadening the scope of the classifier to more severe pathology. It also allows design of the classifier to include as much of the available information as possible and facilitates testing of subjects with missing modalities over the constructed classifier. The presented method employs an ensemble based approach where several subsets of complete data are formed and trained using individual classifiers. The output from these classifiers is fused using a weighted aggregation step giving an optimal probabilistic score for each subject. The method is applied to a spatio-temporal dataset for autism spectrum disorders (ASD)(96 patients with ASD and 42 typically developing controls) that consists of functional features from magnetoencephalography (MEG) and structural connectivity features from diffusion tensor imaging (DTI). A clear distinction between ASD and controls is obtained with an average 5-fold accuracy of 83.3% and testing accuracy of 88.4%. The fusion classifier performance is superior to the classification achieved using single modalities as well as multimodal classifier using only complete data (78.3%). The presented multimodal classifier framework is applicable to all modality combinations.
PMCID: PMC4023481  PMID: 23286164
12.  A multi-class predictor based on a probabilistic model: application to gene expression profiling-based diagnosis of thyroid tumors 
BMC Genomics  2006;7:190.
Although microscopic diagnosis has been playing the decisive role in cancer diagnostics, there have been cases in which it does not satisfy the clinical need. Differential diagnosis of malignant and benign thyroid tissues is one such case, and supplementary diagnosis such as that by gene expression profile is expected.
With four thyroid tissue types, i.e., papillary carcinoma, follicular carcinoma, follicular adenoma, and normal thyroid, we performed gene expression profiling with adaptor-tagged competitive PCR, a high-throughput RT-PCR technique. For differential diagnosis, we applied a novel multi-class predictor, introducing probabilistic outputs. Multi-class predictors were constructed using various combinations of binary classifiers. The learning set included 119 samples, and the predictors were evaluated by strict leave-one-out cross validation. Trials included classical combinations, i.e., one-to-one, one-to-the-rest, but the predictor using more combination exhibited the better prediction accuracy. This characteristic was consistent with other gene expression data sets. The performance of the selected predictor was then tested with an independent set consisting of 49 samples. The resulting test prediction accuracy was 85.7%.
Molecular diagnosis of thyroid tissues is feasible by gene expression profiling, and the current level is promising towards the automatic diagnostic tool to complement the present medical procedures. A multi-class predictor with an exhaustive combination of binary classifiers could achieve a higher prediction accuracy than those with classical combinations and other predictors such as multi-class SVM. The probabilistic outputs of the predictor offer more detailed information for each sample, which enables visualization of each sample in low-dimensional classification spaces. These new concepts should help to improve the multi-class classification including that of cancer tissues.
PMCID: PMC1550728  PMID: 16872506
13.  Class prediction for high-dimensional class-imbalanced data 
BMC Bioinformatics  2010;11:523.
The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Standard classification methods used on class-imbalanced data often produce classifiers that do not accurately predict the minority class; the prediction is biased towards the majority class. In this paper we investigate if the high-dimensionality poses additional challenges when dealing with class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of some strategies that are available to overcome the effect of class imbalance.
Our results show that the evaluated classifiers are highly sensitive to class imbalance and that variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class from the training set, unless the difference between the classes is very large. As a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, down-sizing and asymmetric bagging embedding variable selection work well, while over-sampling does not. Variable normalization can further worsen the performance of the classifiers.
Our results show that matching the prevalence of the classes in training and test set does not guarantee good performance of classifiers and that the problems related to classification with class-imbalanced data are exacerbated when dealing with high-dimensional data. Researchers using class-imbalanced data should be careful in assessing the predictive accuracy of the classifiers and, unless the class imbalance is mild, they should always use an appropriate method for dealing with the class imbalance problem.
PMCID: PMC3098087  PMID: 20961420
14.  Identification and Optimization of Classifier Genes from Multi-Class Earthworm Microarray Dataset 
PLoS ONE  2010;5(10):e13715.
Monitoring, assessment and prediction of environmental risks that chemicals pose demand rapid and accurate diagnostic assays. A variety of toxicological effects have been associated with explosive compounds TNT and RDX. One important goal of microarray experiments is to discover novel biomarkers for toxicity evaluation. We have developed an earthworm microarray containing 15,208 unique oligo probes and have used it to profile gene expression in 248 earthworms exposed to TNT, RDX or neither. We assembled a new machine learning pipeline consisting of several well-established feature filtering/selection and classification techniques to analyze the 248-array dataset in order to construct classifier models that can separate earthworm samples into three groups: control, TNT-treated, and RDX-treated. First, a total of 869 genes differentially expressed in response to TNT or RDX exposure were identified using a univariate statistical algorithm of class comparison. Then, decision tree-based algorithms were applied to select a subset of 354 classifier genes, which were ranked by their overall weight of significance. A multiclass support vector machine (MC-SVM) method and an unsupervised K-mean clustering method were applied to independently refine the classifier, producing a smaller subset of 39 and 30 classifier genes, separately, with 11 common genes being potential biomarkers. The combined 58 genes were considered the refined subset and used to build MC-SVM and clustering models with classification accuracy of 83.5% and 56.9%, respectively. This study demonstrates that the machine learning approach can be used to identify and optimize a small subset of classifier/biomarker genes from high dimensional datasets and generate classification models of acceptable precision for multiple classes.
PMCID: PMC2965664  PMID: 21060837
15.  Multi-kernel graph embedding for detection, Gleason grading of prostate cancer via MRI/MRS 
Medical image analysis  2012;17(2):219-235.
Even though 1 in 6 men in the US, in their lifetime are expected to be diagnosed with prostate cancer (CaP), only 1 in 37 is expected to die on account of it. Consequently, among many men diagnosed with CaP, there has been a recent trend to resort to active surveillance (wait and watch) if diagnosed with a lower Gleason score on biopsy, as opposed to seeking immediate treatment. Some researchers have recently identified imaging markers for low and high grade CaP on multi-parametric (MP) magnetic resonance (MR) imaging (such as T2 weighted MR imaging (T2w MRI) and MR spectroscopy (MRS)). In this paper, we present a novel computerized decision support system (DSS), called Semi Supervised Multi Kernel Graph Embedding (SeSMiK-GE), that quantitatively combines structural, and metabolic imaging data for distinguishing (a) benign versus cancerous, and (b) high- versus low-Gleason grade CaP regions from in vivo MP-MRI. A total of 29 1.5 Tesla endorectal pre-operative in vivo MP MRI (T2w MRI, MRS) studies from patients undergoing radical prostatectomy were considered in this study. Ground truth for evaluation of the SeSMiK-GE classifier was obtained via annotation of disease extent on the preoperative imaging by visually correlating the MRI to the ex vivo whole mount histologic specimens. The SeSMiK-GE framework comprises of three main modules: (1) multi-kernel learning, (2) semi-supervised learning, and (3) dimensionality reduction, which are leveraged for the construction of an integrated low dimensional representation of the different imaging and non-imaging MRI protocols. Hierarchical classifiers for diagnosis and Gleason grading of CaP are then constructed within this unified low dimensional representation. Step 1 of the hierarchical classifier employs a random forest classifier in conjunction with the SeSMiK-GE based data representation and a probabilistic pairwise Markov Random Field algorithm (which allows for imposition of local spatial constraints) to yield a voxel based classification of CaP presence. The CaP region of interest identified in Step 1 is then subsequently classified as either high or low Gleason grade CaP in Step 2. Comparing SeSMiK-GE with unimodal T2w MRI, MRS classifiers and a commonly used feature concatenation (COD) strategy, yielded areas (AUC) under the receiver operative curve (ROC) of (a) 0.89 ± 0.09 (SeSMiK), 0.54 ± 0.18 (T2w MRI), 0.61 ± 0.20 (MRS), and 0.64 ± 0.23 (COD) for distinguishing benign from CaP regions, and (b) 0.84 ± 0.07 (SeSMiK),0.54 ± 0.13 (MRI), 0.59 ± 0.19 (MRS), and 0.62 ± 0.18 (COD) for distinguishing high and low grade CaP using a leave one out cross-validation strategy, all evaluations being performed on a per voxel basis. Our results suggest that following further rigorous validation, SeSMiK-GE could be developed into a powerful diagnostic and prognostic tool for detection and grading of CaP in vivo and in helping to determine the appropriate treatment option. Identifying low grade disease in vivo might allow CaP patients to opt for active surveillance rather than immediately opt for aggressive therapy such as radical prostatectomy.
PMCID: PMC3708492  PMID: 23294985
Prostate cancer; Grading; Data integration; Graph embedding; Semi-supervised
16.  Multiclass relevance units machine: benchmark evaluation and application to small ncRNA discovery 
BMC Genomics  2013;14(Suppl 2):S6.
Classification is the problem of assigning each input object to one of a finite number of classes. This problem has been extensively studied in machine learning and statistics, and there are numerous applications to bioinformatics as well as many other fields. Building a multiclass classifier has been a challenge, where the direct approach of altering the binary classification algorithm to accommodate more than two classes can be computationally too expensive. Hence the indirect approach of using binary decomposition has been commonly used, in which retrieving the class posterior probabilities from the set of binary posterior probabilities given by the individual binary classifiers has been a major issue.
In this work, we present an extension of a recently introduced probabilistic kernel-based learning algorithm called the Classification Relevance Units Machine (CRUM) to the multiclass setting to increase its applicability. The extension is achieved under the error correcting output codes framework. The probabilistic outputs of the binary CRUMs are preserved using a proposed linear-time decoding algorithm, an alternative to the generalized Bradley-Terry (GBT) algorithm whose application to large-scale prediction settings is prohibited by its computational complexity. The resulting classifier is called the Multiclass Relevance Units Machine (McRUM).
The evaluation of McRUM on a variety of real small-scale benchmark datasets shows that our proposed Naïve decoding algorithm is computationally more efficient than the GBT algorithm while maintaining a similar level of predictive accuracy. Then a set of experiments on a larger scale dataset for small ncRNA classification have been conducted with Naïve McRUM and compared with the Gaussian and linear SVM. Although McRUM's predictive performance is slightly lower than the Gaussian SVM, the results show that the similar level of true positive rate can be achieved by sacrificing false positive rate slightly. Furthermore, McRUM is computationally more efficient than the SVM, which is an important factor for large-scale analysis.
We have proposed McRUM, a multiclass extension of binary CRUM. McRUM with Naïve decoding algorithm is computationally efficient in run-time and its predictive performance is comparable to the well-known SVM, showing its potential in solving large-scale multiclass problems in bioinformatics and other fields of study.
PMCID: PMC3582431  PMID: 23445533
17.  Comparison of feature selection and classification for MALDI-MS data 
BMC Genomics  2009;10(Suppl 1):S3.
In the classification of Mass Spectrometry (MS) proteomics data, peak detection, feature selection, and learning classifiers are critical to classification accuracy. To better understand which methods are more accurate when classifying data, some publicly available peak detection algorithms for Matrix assisted Laser Desorption Ionization Mass Spectrometry (MALDI-MS) data were recently compared; however, the issue of different feature selection methods and different classification models as they relate to classification performance has not been addressed. With the application of intelligent computing, much progress has been made in the development of feature selection methods and learning classifiers for the analysis of high-throughput biological data. The main objective of this paper is to compare the methods of feature selection and different learning classifiers when applied to MALDI-MS data and to provide a subsequent reference for the analysis of MS proteomics data.
We compared a well-known method of feature selection, Support Vector Machine Recursive Feature Elimination (SVMRFE), and a recently developed method, Gradient based Leave-one-out Gene Selection (GLGS) that effectively performs microarray data analysis. We also compared several learning classifiers including K-Nearest Neighbor Classifier (KNNC), Naïve Bayes Classifier (NBC), Nearest Mean Scaled Classifier (NMSC), uncorrelated normal based quadratic Bayes Classifier recorded as UDC, Support Vector Machines, and a distance metric learning for Large Margin Nearest Neighbor classifier (LMNN) based on Mahanalobis distance. To compare, we conducted a comprehensive experimental study using three types of MALDI-MS data.
Regarding feature selection, SVMRFE outperformed GLGS in classification. As for the learning classifiers, when classification models derived from the best training were compared, SVMs performed the best with respect to the expected testing accuracy. However, the distance metric learning LMNN outperformed SVMs and other classifiers on evaluating the best testing. In such cases, the optimum classification model based on LMNN is worth investigating for future study.
PMCID: PMC2709264  PMID: 19594880
18.  Evaluating bias due to data linkage error in electronic healthcare records 
Linkage of electronic healthcare records is becoming increasingly important for research purposes. However, linkage error due to mis-recorded or missing identifiers can lead to biased results. We evaluated the impact of linkage error on estimated infection rates using two different methods for classifying links: highest-weight (HW) classification using probabilistic match weights and prior-informed imputation (PII) using match probabilities.
A gold-standard dataset was created through deterministic linkage of unique identifiers in admission data from two hospitals and infection data recorded at the hospital laboratories (original data). Unique identifiers were then removed and data were re-linked by date of birth, sex and Soundex using two classification methods: i) HW classification - accepting the candidate record with the highest weight exceeding a threshold and ii) PII–imputing values from a match probability distribution. To evaluate methods for linking data with different error rates, non-random error and different match rates, we generated simulation data. Each set of simulated files was linked using both classification methods. Infection rates in the linked data were compared with those in the gold-standard data.
In the original gold-standard data, 1496/20924 admissions linked to an infection. In the linked original data, PII provided least biased results: 1481 and 1457 infections (upper/lower thresholds) compared with 1316 and 1287 (HW upper/lower thresholds). In the simulated data, substantial bias (up to 112%) was introduced when linkage error varied by hospital. Bias was also greater when the match rate was low or the identifier error rate was high and in these cases, PII performed better than HW classification at reducing bias due to false-matches.
This study highlights the importance of evaluating the potential impact of linkage error on results. PII can help incorporate linkage uncertainty into analysis and reduce bias due to linkage error, without requiring identifiers.
PMCID: PMC4015706  PMID: 24597489
Data linkage; Routine data; Bias; Electronic health records; Evaluation; Linkage quality
19.  Boosting Probabilistic Graphical Model Inference by Incorporating Prior Knowledge from Multiple Sources 
PLoS ONE  2013;8(6):e67410.
Inferring regulatory networks from experimental data via probabilistic graphical models is a popular framework to gain insights into biological systems. However, the inherent noise in experimental data coupled with a limited sample size reduces the performance of network reverse engineering. Prior knowledge from existing sources of biological information can address this low signal to noise problem by biasing the network inference towards biologically plausible network structures. Although integrating various sources of information is desirable, their heterogeneous nature makes this task challenging. We propose two computational methods to incorporate various information sources into a probabilistic consensus structure prior to be used in graphical model inference. Our first model, called Latent Factor Model (LFM), assumes a high degree of correlation among external information sources and reconstructs a hidden variable as a common source in a Bayesian manner. The second model, a Noisy-OR, picks up the strongest support for an interaction among information sources in a probabilistic fashion. Our extensive computational studies on KEGG signaling pathways as well as on gene expression data from breast cancer and yeast heat shock response reveal that both approaches can significantly enhance the reconstruction accuracy of Bayesian Networks compared to other competing methods as well as to the situation without any prior. Our framework allows for using diverse information sources, like pathway databases, GO terms and protein domain data, etc. and is flexible enough to integrate new sources, if available.
PMCID: PMC3691143  PMID: 23826291
20.  Metamotifs - a generative model for building families of nucleotide position weight matrices 
BMC Bioinformatics  2010;11:348.
Development of high-throughput methods for measuring DNA interactions of transcription factors together with computational advances in short motif inference algorithms is expanding our understanding of transcription factor binding site motifs. The consequential growth of sequence motif data sets makes it important to systematically group and categorise regulatory motifs. It has been shown that there are familial tendencies in DNA sequence motifs that are predictive of the family of factors that binds them. Further development of methods that detect and describe familial motif trends has the potential to help in measuring the similarity of novel computational motif predictions to previously known data and sensitively detecting regulatory motifs similar to previously known ones from novel sequence.
We propose a probabilistic model for position weight matrix (PWM) sequence motif families. The model, which we call the 'metamotif' describes recurring familial patterns in a set of motifs. The metamotif framework models variation within a family of sequence motifs. It allows for simultaneous estimation of a series of independent metamotifs from input position weight matrix (PWM) motif data and does not assume that all input motif columns contribute to a familial pattern. We describe an algorithm for inferring metamotifs from weight matrix data. We then demonstrate the use of the model in two practical tasks: in the Bayesian NestedMICA model inference algorithm as a PWM prior to enhance motif inference sensitivity, and in a motif classification task where motifs are labelled according to their interacting DNA binding domain.
We show that metamotifs can be used as PWM priors in the NestedMICA motif inference algorithm to dramatically increase the sensitivity to infer motifs. Metamotifs were also successfully applied to a motif classification problem where sequence motif features were used to predict the family of protein DNA binding domains that would interact with it. The metamotif based classifier is shown to compare favourably to previous related methods. The metamotif has great potential for further use in machine learning tasks related to especially de novo computational sequence motif inference. The metamotif methods presented have been incorporated into the NestedMICA suite.
PMCID: PMC2906491  PMID: 20579334
21.  A feature selection approach for identification of signature genes from SAGE data 
BMC Bioinformatics  2007;8:169.
One goal of gene expression profiling is to identify signature genes that robustly distinguish different types or grades of tumors. Several tumor classifiers based on expression profiling have been proposed using microarray technique. Due to important differences in the probabilistic models of microarray and SAGE technologies, it is important to develop suitable techniques to select specific genes from SAGE measurements.
A new framework to select specific genes that distinguish different biological states based on the analysis of SAGE data is proposed. The new framework applies the bolstered error for the identification of strong genes that separate the biological states in a feature space defined by the gene expression of a training set. Credibility intervals defined from a probabilistic model of SAGE measurements are used to identify the genes that distinguish the different states with more reliability among all gene groups selected by the strong genes method. A score taking into account the credibility and the bolstered error values in order to rank the groups of considered genes is proposed. Results obtained using SAGE data from gliomas are presented, thus corroborating the introduced methodology.
The model representing counting data, such as SAGE, provides additional statistical information that allows a more robust analysis. The additional statistical information provided by the probabilistic model is incorporated in the methodology described in the paper. The introduced method is suitable to identify signature genes that lead to a good separation of the biological states using SAGE and may be adapted for other counting methods such as Massive Parallel Signature Sequencing (MPSS) or the recent Sequencing-By-Synthesis (SBS) technique. Some of such genes identified by the proposed method may be useful to generate classifiers.
PMCID: PMC1891113  PMID: 17519038
22.  CMRF: analyzing differential gene regulation in two group perturbation experiments 
BMC Genomics  2012;13(Suppl 2):S2.
Microarray experiments often measure expressions of genes taken from sample tissues in the presence of external perturbations such as medication, radiation, or disease. The external perturbation can change the expressions of some genes directly or indirectly through gene interaction network. In this paper, we focus on an important class of such microarray experiments that inherently have two groups of tissue samples. When such different groups exist, the changes in expressions for some of the genes after the perturbation can be different between the two groups. It is not only important to identify the genes that respond differently across the two groups, but also to mine the reason behind this differential response. In this paper, we aim to identify the cause of this differential behavior of genes, whether because of the perturbation or due to interactions with other genes.
We propose a new probabilistic Bayesian method CMRF based on Markov Random Field to identify such genes. CMRF leverages the information about gene interactions as the prior of the model. We compare the accuracy of CMRF with SSEM and Student's t test and our old method SMRF on semi-synthetic dataset generated from microarray data. CMRF obtains high accuracy and outperforms all the other three methods. We also conduct a statistical significance test using a parametric noise based experiment to evaluate the accuracy of our method. In this experiment, CMRF generates significant regions of confidence for various parameter settings.
In this paper, we solved the problem of finding primarily differentially regulated genes in the presence of external perturbations when the data is sampled from two groups. The probabilistic Bayesian method CMRF based on Markov Random Field incorporates dependency structure of the gene networks as the prior to the model. Experimental results on synthetic and real datasets demonstrated the superiority of CMRF compared to other simple techniques.
PMCID: PMC3394417  PMID: 22537297
23.  Cytochrome P450 site of metabolism prediction from 2D topological fingerprints using GPU accelerated probabilistic classifiers 
The prediction of sites and products of metabolism in xenobiotic compounds is key to the development of new chemical entities, where screening potential metabolites for toxicity or unwanted side-effects is of crucial importance. In this work 2D topological fingerprints are used to encode atomic sites and three probabilistic machine learning methods are applied: Parzen-Rosenblatt Window (PRW), Naive Bayesian (NB) and a novel approach called RASCAL (Random Attribute Subsampling Classification ALgorithm). These are implemented by randomly subsampling descriptor space to alleviate the problem often suffered by data mining methods of having to exactly match fingerprints, and in the case of PRW by measuring a distance between feature vectors rather than exact matching. The classifiers have been implemented in CUDA/C++ to exploit the parallel architecture of graphical processing units (GPUs) and is freely available in a public repository.
It is shown that for PRW a SoM (Site of Metabolism) is identified in the top two predictions for 85%, 91% and 88% of the CYP 3A4, 2D6 and 2C9 data sets respectively, with RASCAL giving similar performance of 83%, 91% and 88%, respectively. These results put PRW and RASCAL performance ahead of NB which gave a much lower classification performance of 51%, 73% and 74%, respectively.
2D topological fingerprints calculated to a bond depth of 4-6 contain sufficient information to allow the identification of SoMs using classifiers based on relatively small data sets. Thus, the machine learning methods outlined in this paper are conceptually simpler and more efficient than other methods tested and the use of simple topological descriptors derived from 2D structure give results competitive with other approaches using more expensive quantum chemical descriptors. The descriptor space subsampling approach and ensemble methodology allow the methods to be applied to molecules more distant from the training data where data mining would be more likely to fail due to the lack of common fingerprints. The RASCAL algorithm is shown to give equivalent classification performance to PRW but at lower computational expense allowing it to be applied more efficiently in the ensemble scheme.
PMCID: PMC4047555  PMID: 24959208
Cytochrome P450; Metabolism; Probabilistic; Classification; GPU; CUDA; 2D
24.  Diffusion based Abnormality Markers of Pathology: Towards Learned Diagnostic Prediction of ASD 
NeuroImage  2011;57(3):918-927.
This paper presents a paradigm for generating a quantifiable marker of pathology that supports diagnosis and provides a potential biomarker of neuropsychiatric disorders, such as autism spectrum disorder (ASD). This is achieved by creating high-dimensional nonlinear pattern classifiers using Support Vector Machines (SVM), that learn the underlying pattern of pathology using numerous atlas-based regional features extracted from Diffusion Tensor Imaging (DTI) data. These classifiers, in addition to providing insight into the group separation between patients and controls, are applicable on a single subject basis and have the potential to aid in diagnosis by assigning a probabilistic abnormality score to each subject that quantifies the degree of pathology and can be used in combination with other clinical scores to aid in diagnostic decision. They also produce a ranking of regions that contribute most to the group classification and separation, thereby providing a neurobiological insight into the pathology. As an illustrative application of the general framework for creating diffusion based abnormality classifiers we create classifiers for a dataset consisting of 45 children with autism spectrum disorder (ASD) (mean age 10.5 ± 2.5 yrs) as compared to 30 typically developing (TD) controls ( mean age 10.3 ± 2.5 yrs). Based on the abnormality scores, a distinction between the ASD population and TD controls was achieved with 80% leave one out (LOO) cross-validation accuracy with high significance of p < 0.001, ~84% specificity and ~74% sensitivity. Regions that contributed to this abnormality score involved fractional anisotropy (FA) differences mainly in right occipital regions as well as in left superior longitudinal fasciculus, external and internal capsule while mean diffusivity (MD) discriminates were observed primarily in right occipital gyrus and right temporal white matter.
PMCID: PMC3152443  PMID: 21609768
Diffusion tensor imaging; support vector machines; pattern classification; abnormality score
25.  A Bayesian Framework to Account for Complex Non-Genetic Factors in Gene Expression Levels Greatly Increases Power in eQTL Studies 
PLoS Computational Biology  2010;6(5):e1000770.
Gene expression measurements are influenced by a wide range of factors, such as the state of the cell, experimental conditions and variants in the sequence of regulatory regions. To understand the effect of a variable of interest, such as the genotype of a locus, it is important to account for variation that is due to confounding causes. Here, we present VBQTL, a probabilistic approach for mapping expression quantitative trait loci (eQTLs) that jointly models contributions from genotype as well as known and hidden confounding factors. VBQTL is implemented within an efficient and flexible inference framework, making it fast and tractable on large-scale problems. We compare the performance of VBQTL with alternative methods for dealing with confounding variability on eQTL mapping datasets from simulations, yeast, mouse, and human. Employing Bayesian complexity control and joint modelling is shown to result in more precise estimates of the contribution of different confounding factors resulting in additional associations to measured transcript levels compared to alternative approaches. We present a threefold larger collection of cis eQTLs than previously found in a whole-genome eQTL scan of an outbred human population. Altogether, 27% of the tested probes show a significant genetic association in cis, and we validate that the additional eQTLs are likely to be real by replicating them in different sets of individuals. Our method is the next step in the analysis of high-dimensional phenotype data, and its application has revealed insights into genetic regulation of gene expression by demonstrating more abundant cis-acting eQTLs in human than previously shown. Our software is freely available online at
Author Summary
Gene expression is a complex phenotype. The measured expression level in an experiment can be affected by a wide range of factors—state of the cell, experimental conditions, variants in the sequence of regulatory regions, and others. To understand genotype-to-phenotype relationships, we need to be able to distinguish the variation that is due to the genetic state from all the confounding causes. We present VBQTL, a probabilistic method for dissecting gene expression variation by jointly modelling the underlying global causes of variability and the genetic effect. Our method is implemented in a flexible framework that allows for quick model adaptation and comparison with alternative models. The probabilistic approach yields more accurate estimates of the contributions from different sources of variation. Applying VBQTL, we find that common genetic variation controlling gene expression levels in human is more abundant than previously shown, which has implications for a wide range of studies relating genotype to phenotype.
PMCID: PMC2865505  PMID: 20463871

Results 1-25 (1205916)