1.  Analysis of multiple compound–protein interactions reveals novel bioactive molecules 
The authors use machine learning of compound-protein interactions to explore drug polypharmacology and to efficiently identify bioactive ligands, including novel scaffold-hopping compounds for two pharmaceutically important protein families: G-protein coupled receptors and protein kinases.
We have demonstrated that machine learning of multiple compound–protein interactions is useful for efficient ligand screening and for assessing drug polypharmacology. This approach successfully identified novel scaffold-hopping compounds for two pharmaceutically important protein families: G-protein-coupled receptors and protein kinases. These bioactive compounds were not detected by existing computational ligand-screening methods in comparative studies. The results of this study indicate that data derived from chemical genomics can be highly useful for exploring chemical space, and this systems biology perspective could accelerate drug discovery processes.
The discovery of novel bioactive molecules advances our systems-level understanding of biological processes and is crucial for innovation in drug development. Perturbations of biological systems by chemical probes provide broader applications not only for analysis of complex systems but also for intentional manipulations of these systems. Nevertheless, the lack of well-characterized chemical modulators has limited their use. Recently, chemical genomics has emerged as a promising area of research applicable to the exploration of novel bioactive molecules, and researchers are currently striving toward the identification of all possible ligands for all target protein families (Wang et al, 2009). Chemical genomics studies have shown that patterns of compound–protein interactions (CPIs) are too diverse to be understood as simple one-to-one events. There is an urgent need to develop appropriate data mining methods for characterizing and visualizing the full complexity of interactions between chemical space and biological systems. However, no existing screening approach has so far succeeded in identifying novel bioactive compounds using multiple interactions among compounds and target proteins.
High-throughput screening (HTS) and computational screening have greatly aided in the identification of early lead compounds for drug discovery. However, the large number of assays required for HTS to identify drugs that target multiple proteins renders this process very costly and time-consuming. Therefore, interest in using in silico strategies for screening has increased. The most common computational approaches, ligand-based virtual screening (LBVS) and structure-based virtual screening (SBVS; Oprea and Matter, 2004; Muegge and Oloff, 2006; McInnes, 2007; Figure 1A), have been used for practical drug development. LBVS aims to identify molecules that are very similar to known active molecules and generally has difficulty identifying compounds with novel structural scaffolds that differ from reference molecules. The other popular strategy, SBVS, is constrained by the number of three-dimensional crystallographic structures available. To circumvent these limitations, we have shown that a new computational screening strategy, chemical genomics-based virtual screening (CGBVS), has the potential to identify novel, scaffold-hopping compounds and assess their polypharmacology by using a machine-learning method to recognize conserved molecular patterns in comprehensive CPI data sets.
The CGBVS strategy used in this study was made up of five steps: CPI data collection, descriptor calculation, representation of interaction vectors, predictive model construction using training data sets, and predictions from test data (Figure 1A). Importantly, step 1, the construction of a data set of chemical structures and protein sequences for known CPIs, did not require the three-dimensional protein structures needed for SBVS. In step 2, compound structures and protein sequences were converted into numerical descriptors. These descriptors were used to construct chemical or biological spaces in which decreasing distance between vectors corresponded to increasing similarity of compound structures or protein sequences. In step 3, we represented multiple CPI patterns by concatenating these chemical and protein descriptors. Using these interaction vectors, we could quantify the similarity of molecular interactions for compound–protein pairs, despite the fact that the ligand and protein similarity maps differed substantially. In step 4, concatenated vectors for CPI pairs (positive samples) and non-interacting pairs (negative samples) were input into an established machine-learning method. In the final step, the classifier constructed using training sets was applied to test data.
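To make the five-step workflow concrete, the following sketch shows one way the interaction vectors and classifier could be assembled in Python. The descriptor choices (Morgan fingerprints for compounds, amino-acid composition for proteins) and the use of a support vector machine are illustrative assumptions, not necessarily the descriptors or learner used in the study.

```python
# Minimal sketch of a CGBVS-style pipeline (illustrative; descriptor and
# classifier choices are assumptions, not the paper's exact method).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.svm import SVC

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def compound_descriptor(smiles, n_bits=1024):
    """Binary Morgan fingerprint for a compound (step 2, chemical space)."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    return np.array(list(fp), dtype=float)

def protein_descriptor(sequence):
    """Amino-acid composition of a protein sequence (step 2, biological space)."""
    counts = np.array([sequence.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / max(len(sequence), 1)

def interaction_vector(smiles, sequence):
    """Step 3: concatenate chemical and protein descriptors into one CPI vector."""
    return np.concatenate([compound_descriptor(smiles), protein_descriptor(sequence)])

def train_cgbvs(pairs):
    """Step 4: train a classifier on positive (interacting) and negative pairs.

    pairs = [(smiles, sequence, label), ...] with label 1 = known CPI, 0 = non-interacting.
    """
    X = np.array([interaction_vector(s, p) for s, p, _ in pairs])
    y = np.array([label for _, _, label in pairs])
    return SVC(kernel="rbf", probability=True).fit(X, y)

def predict_interaction(clf, smiles, sequence):
    """Step 5: score a new compound-protein pair with the trained classifier."""
    return clf.predict_proba([interaction_vector(smiles, sequence)])[0, 1]
```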
To evaluate the predictive value of CGBVS, we first compared its performance with that of LBVS by fivefold cross-validation. CGBVS performed with considerably higher accuracy (91.9%) than did LBVS (84.4%; Figure 1B). We next compared CGBVS and SBVS in a retrospective virtual screening based on the human β2-adrenergic receptor (ADRB2). Figure 1C shows that CGBVS provided higher hit rates than did SBVS. These results suggest that CGBVS is more successful than conventional approaches for prediction of CPIs.
We then evaluated the ability of the CGBVS method to predict the polypharmacology of ADRB2 by attempting to identify novel ADRB2 ligands from a group of G-protein-coupled receptor (GPCR) ligands. We ranked the prediction scores for the interactions of 826 reported GPCR ligands with ADRB2 and then analyzed the 50 highest-ranked compounds in greater detail. Of 21 commercially available compounds, 11 showed ADRB2-binding activity and were not previously reported to be ADRB2 ligands. These compounds included ligands not only for aminergic receptors but also for neuropeptide Y-type 1 receptors (NPY1R), which have low protein homology to ADRB2. Most ligands we identified were not detected by LBVS and SBVS, which suggests that only CGBVS could identify this unexpected cross-reaction for a ligand developed to target a peptidergic receptor.
The true value of CGBVS in drug discovery must be tested by assessing whether this method can identify scaffold-hopping lead compounds from a set of compounds that is structurally more diverse. To assess this ability, we analyzed 11 500 commercially available compounds to predict compounds likely to bind to two GPCRs and two protein kinases. Functional assays revealed that nine ADRB2 ligands, three NPY1R ligands, five epidermal growth factor receptor (EGFR) inhibitors, and two cyclin-dependent kinase 2 (CDK2) inhibitors were concentrated in the top-ranked compounds (hit rate=30, 15, 25, and 10%, respectively). We also evaluated the extent of scaffold hopping achieved in the identification of these novel ligands. One ADRB2 ligand, two NPY1R ligands, and one CDK2 inhibitor exhibited scaffold hopping (Figure 4), indicating that CGBVS can use this characteristic to rationally predict novel lead compounds, a crucial and very difficult step in drug discovery. This feature of CGBVS is critically different from existing predictive methods, such as LBVS, which depend on similarities between test and reference ligands, and focus on a single protein or highly homologous proteins. In particular, CGBVS is useful for targets with undefined ligands because this method can use CPIs with target proteins that exhibit lower levels of homology.
In summary, we have demonstrated that data mining of multiple CPIs is of great practical value for exploration of chemical space. As a predictive model, CGBVS could provide an important step in the discovery of such multi-target drugs by identifying the group of proteins targeted by a particular ligand, leading to innovation in pharmaceutical research.
The discovery of novel bioactive molecules advances our systems-level understanding of biological processes and is crucial for innovation in drug development. For this purpose, the emerging field of chemical genomics is currently focused on accumulating large assay data sets describing compound–protein interactions (CPIs). Although new target proteins for known drugs have recently been identified through mining of CPI databases, using these resources to identify novel ligands remains unexplored. Herein, we demonstrate that machine learning of multiple CPIs can not only assess drug polypharmacology but can also efficiently identify novel bioactive scaffold-hopping compounds. Through a machine-learning technique that uses multiple CPIs, we have successfully identified novel lead compounds for two pharmaceutically important protein families, G-protein-coupled receptors and protein kinases. These novel compounds were not identified by existing computational ligand-screening methods in comparative studies. The results of this study indicate that data derived from chemical genomics can be highly useful for exploring chemical space, and this systems biology perspective could accelerate drug discovery processes.
doi:10.1038/msb.2011.5
PMCID: PMC3094066  PMID: 21364574
chemical genomics; data mining; drug discovery; ligand screening; systems chemical biology
2.  The Cure: Design and Evaluation of a Crowdsourcing Game for Gene Selection for Breast Cancer Survival Prediction 
JMIR Serious Games  2014;2(2):e7.
Background
Molecular signatures for predicting breast cancer prognosis could greatly improve care through personalization of treatment. Computational analyses of genome-wide expression datasets have identified such signatures, but these signatures leave much to be desired in terms of accuracy, reproducibility, and biological interpretability. Methods that take advantage of structured prior knowledge (eg, protein interaction networks) show promise in helping to define better signatures, but most knowledge remains unstructured. Crowdsourcing via scientific discovery games is an emerging methodology that has the potential to tap into human intelligence at scales and in modes unheard of before.
Objective
The main objective of this study was to test the hypothesis that knowledge linking expression patterns of specific genes to breast cancer outcomes could be captured from players of an open, Web-based game. We envisioned capturing knowledge both from players' prior experience and from their ability to interpret text related to candidate genes presented to them in the context of the game.
Methods
We developed and evaluated an online game called The Cure that captured information from players regarding genes for use as predictors of breast cancer survival. Information gathered from game play was aggregated using a voting approach and used to create rankings of genes. The top genes from these rankings were evaluated using annotation enrichment analysis, comparison to prior predictor gene sets, and by training and testing machine learning systems for predicting 10-year survival.
Results
Between its launch in September 2012 and September 2013, The Cure attracted more than 1000 registered players, who collectively played nearly 10,000 games. Gene sets assembled through aggregation of the collected data showed significant enrichment for genes known to be related to key concepts such as cancer, disease progression, and recurrence. In terms of the predictive accuracy of models trained using this information, these gene sets provided comparable performance to gene sets generated using other methods, including those used in commercial tests. The Cure is available on the Internet.
Conclusions
The principal contribution of this work is to show that crowdsourcing games can be developed as a means to address problems involving domain knowledge. While most prior work on scientific discovery games and crowdsourcing in general takes as a premise that contributors have little or no expertise, here we demonstrated a crowdsourcing system that succeeded in capturing expert knowledge.
doi:10.2196/games.3350
PMCID: PMC4307816  PMID: 25654473
breast neoplasms; gene expression; artificial intelligence; survival analysis; crowdsourcing; Web applications; computer games; collaborative and social computing systems and tools; supervised learning; feature selection
3.  Application of Machine Learning Algorithms to Predict Coronary Artery Calcification With a Sibship-Based Design 
Genetic epidemiology  2008;32(4):350-360.
As part of the Genetic Epidemiology Network of Arteriopathy study, hypertensive non-Hispanic White sibships were screened using 471 single nucleotide polymorphisms (SNPs) to identify genes influencing coronary artery calcification (CAC) measured by computed tomography. Individuals with detectable CAC and CAC quantity ≥70th age- and sex-specific percentile were classified as having a high CAC burden and compared to individuals with CAC quantity <70th percentile. Two sibs from each sibship were randomly chosen and divided into two data sets, each with 360 unrelated individuals. Within each data set, we applied two machine learning algorithms, Random Forests and RuleFit, to identify the best predictors of having high CAC burden among 17 risk factors and 471 SNPs. Using five-fold cross-validation, both methods had ~70% sensitivity and ~60% specificity. Prediction accuracies were significantly different from random predictions (P-value <0.001) based on 1,000 permutation tests. Predictability of using 287 tagSNPs was as good as using all 471 SNPs. For Random Forests, among the top 50 predictors, the same eight tagSNPs and 15 risk factors were found in both data sets while eight tagSNPs and 12 risk factors were found in both data sets for RuleFit. Replicable effects of two tagSNPs (in genes GPR35 and NOS3) and 12 risk factors (age, body mass index, sex, serum glucose, high-density lipoprotein cholesterol, systolic blood pressure, cholesterol, homocysteine, triglycerides, fibrinogen, Lp(a) and low-density lipoprotein particle size) were identified by both methods. This study illustrates how machine learning methods can be used in sibships to identify important, replicable predictors of subclinical coronary atherosclerosis.
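The following is a minimal sketch of the Random Forests half of this analysis on placeholder data: a generic feature matrix of SNPs plus risk factors, five-fold cross-validation, and a label-permutation test of the prediction accuracy. It illustrates the evaluation strategy, not the authors' actual pipeline.

```python
# Illustrative sketch (not the study's code): Random Forest classification of
# high vs. low CAC burden with five-fold CV and a permutation test of accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Placeholder data standing in for one data set of 360 unrelated individuals:
# X has 471 SNP columns + 17 risk-factor columns, y = 1 for high CAC burden.
X = rng.normal(size=(360, 471 + 17))
y = rng.integers(0, 2, size=360)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
observed = cross_val_score(clf, X, y, cv=5).mean()

# Permutation test: compare the observed CV accuracy to accuracies obtained
# after shuffling the outcome labels (the study used 1,000 permutations;
# fewer are used here for brevity).
null_scores = []
for _ in range(100):
    y_perm = rng.permutation(y)
    null_scores.append(cross_val_score(clf, X, y_perm, cv=5).mean())
p_value = (np.sum(np.array(null_scores) >= observed) + 1) / (len(null_scores) + 1)
print(f"CV accuracy = {observed:.3f}, permutation p = {p_value:.3f}")
```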
doi:10.1002/gepi.20309
PMCID: PMC2828904  PMID: 18271057
machine learning; Random Forests; RuleFit; coronary artery calcification; sibship
4.  Use of Attribute Driven Incremental Discretization and Logic Learning Machine to build a prognostic classifier for neuroblastoma patients 
BMC Bioinformatics  2014;15(Suppl 5):S4.
Background
A cancer patient's outcome is written, in part, in the gene expression profile of the tumor. We previously identified a 62-probe set signature (NB-hypo) to identify tissue hypoxia in neuroblastoma tumors and showed that NB-hypo stratified neuroblastoma patients into good and poor outcome groups [1]. It was important to develop a prognostic classifier to cluster patients into risk groups benefiting from defined therapeutic approaches. Novel classification and data discretization approaches can be instrumental for the generation of accurate predictors and robust tools for clinical decision support. We explored the application to gene expression data of Rulex, a novel software suite including the Attribute Driven Incremental Discretization technique for transforming continuous variables into simplified discrete ones and the Logic Learning Machine model for intelligible rule generation.
Results
We applied Rulex components to the problem of predicting the outcome of neuroblastoma patients on the basis of the 62-probe set NB-hypo gene expression signature. The resulting classifier consisted of 9 rules utilizing mainly two conditions on the relative expression of 11 probe sets. These rules were very effective predictors, as shown in an independent validation set, demonstrating the validity of the LLM algorithm applied to microarray data and patients' classification. The LLM performed as efficiently as Prediction Analysis of Microarray and Support Vector Machine, and outperformed other learning algorithms such as C4.5. Rulex carried out feature selection by selecting a new signature (NB-hypo-II) of 11 probe sets that turned out to be the most relevant in predicting outcome among the 62 of the NB-hypo signature. Rules are easily interpretable as they involve only a few conditions.
Furthermore, we demonstrate that the application of a weighted classification associated with the rules improves the classification of poorly represented classes.
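Rulex and the Logic Learning Machine are proprietary, so the sketch below uses openly available stand-ins (quantile discretization plus a shallow, class-weighted decision tree with rule export) purely to illustrate the discretize-then-extract-rules workflow; the probe-set names, data, and thresholds are placeholders.

```python
# Stand-in for the Rulex workflow: discretize probe-set expression, then
# extract intelligible if-then rules (here from a shallow decision tree).
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = rng.normal(size=(160, 62))      # placeholder expression of the 62 NB-hypo probe sets
y = rng.integers(0, 2, size=160)    # 1 = poor outcome, 0 = good outcome

# Attribute-driven discretization is approximated by simple quantile binning.
X_disc = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile").fit_transform(X)

# class_weight="balanced" echoes the weighted classification used above to
# counteract poorly represented classes.
tree = DecisionTreeClassifier(max_depth=3, class_weight="balanced").fit(X_disc, y)
print(export_text(tree, feature_names=[f"probe_{i}" for i in range(62)]))
```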
Conclusions
Our findings provided evidence that the application of Rulex to the expression values of the NB-hypo signature created a set of accurate, high-quality, consistent and interpretable rules for the prediction of neuroblastoma patients' outcome. We identified the Rulex weighted classification as a flexible tool that can support clinical decisions. For these reasons, we consider Rulex to be a useful tool for cancer classification from microarray gene expression data.
doi:10.1186/1471-2105-15-S5-S4
PMCID: PMC4095004  PMID: 25078098
Logic Learning Machine; Attribute Driven Incremental Discretization; Explicit rules; NB-hypo-II signature; Neuroblastoma; Hypoxia; Classifier; Weighted Classification
5.  New bandwidth selection criterion for Kernel PCA: Approach to dimensionality reduction and classification problems 
BMC Bioinformatics  2014;15:137.
Background
DNA microarrays are a potentially powerful technology for improving diagnostic classification, treatment selection, and prognostic assessment. The use of this technology to predict cancer outcome has a history of almost a decade. Disease class predictors can be designed for known disease cases and provide diagnostic confirmation or clarify abnormal cases. The main input to these class predictors is high-dimensional data with many variables and few observations. Dimensionality reduction of these feature sets significantly speeds up the prediction task. Feature selection and feature transformation methods are well-known preprocessing steps in the field of bioinformatics. Several prediction tools are available based on these techniques.
Results
Studies show that a well tuned Kernel PCA (KPCA) is an efficient preprocessing step for dimensionality reduction, but the available bandwidth selection method for KPCA was computationally expensive. In this paper, we propose a new data-driven bandwidth selection criterion for KPCA, which is related to least squares cross-validation for kernel density estimation. We propose a new prediction model with a well tuned KPCA and Least Squares Support Vector Machine (LS-SVM). We estimate the accuracy of the newly proposed model based on 9 case studies. Then, we compare its performance (in terms of test set Area Under the ROC Curve (AUC) and computational time) with other well known techniques such as whole data set + LS-SVM, PCA + LS-SVM, t-test + LS-SVM, Prediction Analysis of Microarrays (PAM) and Least Absolute Shrinkage and Selection Operator (Lasso). Finally, we assess the performance of the proposed strategy against an existing KPCA parameter tuning algorithm by means of two additional case studies.
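As a rough illustration of the KPCA-plus-classifier pipeline, the sketch below substitutes an ordinary RBF support vector machine for the LS-SVM and a plain cross-validated grid over the kernel bandwidth for the paper's least-squares cross-validation criterion; the data and parameter values are placeholders.

```python
# Sketch of a KPCA -> classifier pipeline with a simple bandwidth search
# (not the paper's bandwidth criterion or LS-SVM implementation).
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2000))   # microarray-like: many genes, few samples
y = rng.integers(0, 2, size=100)

pipe = Pipeline([
    ("kpca", KernelPCA(kernel="rbf", n_components=10)),
    ("svm", SVC(kernel="rbf")),
])

# Tune the KPCA bandwidth (gamma) by cross-validated AUC as a stand-in for
# the least-squares cross-validation criterion proposed in the paper.
search = GridSearchCV(pipe, {"kpca__gamma": [1e-4, 1e-3, 1e-2]}, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```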
Conclusion
We propose, evaluate, and compare several mathematical/statistical techniques, which apply feature transformation/selection for subsequent classification, and consider their application in medical diagnostics. Both feature selection and feature transformation perform well on classification tasks. Due to the dynamic selection property of feature selection, it is hard to define significant features for the classifier, which predicts classes of future samples. Moreover, the proposed strategy enjoys a distinctive advantage with its relatively lower time complexity.
doi:10.1186/1471-2105-15-137
PMCID: PMC4025604  PMID: 24886083
6.  Osteoporosis Risk Prediction for Bone Mineral Density Assessment of Postmenopausal Women Using Machine Learning 
Yonsei Medical Journal  2013;54(6):1321-1330.
Purpose
A number of clinical decision tools for osteoporosis risk assessment have been developed to select postmenopausal women for the measurement of bone mineral density. We developed and validated machine learning models with the aim of more accurately identifying the risk of osteoporosis in postmenopausal women compared to the ability of conventional clinical decision tools.
Materials and Methods
We collected medical records from Korean postmenopausal women based on the Korea National Health and Nutrition Examination Surveys. The training data set was used to construct models based on popular machine learning algorithms, such as support vector machines (SVM), random forests, artificial neural networks (ANN), and logistic regression (LR), applied to simple surveys. The machine learning models were compared to four conventional clinical decision tools: osteoporosis self-assessment tool (OST), osteoporosis risk assessment instrument (ORAI), simple calculated osteoporosis risk estimation (SCORE), and osteoporosis index of risk (OSIRIS).
Results
SVM had significantly better area under the curve (AUC) of the receiver operating characteristic than ANN, LR, OST, ORAI, SCORE, and OSIRIS for the training set. SVM predicted osteoporosis risk with an AUC of 0.827, accuracy of 76.7%, sensitivity of 77.8%, and specificity of 76.0% at total hip, femoral neck, or lumbar spine for the testing set. The significant factors selected by SVM were age, height, weight, body mass index, duration of menopause, duration of breast feeding, estrogen therapy, hyperlipidemia, hypertension, osteoarthritis, and diabetes mellitus.
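A hedged sketch of the kind of comparison reported here: an SVM's AUC on held-out placeholder survey data versus a simple clinical index. The OST formula used below, 0.2 × (weight in kg − age in years), is the commonly cited definition and should be checked against the original instrument.

```python
# Illustrative comparison of an SVM against a clinical decision index by AUC
# (all data are synthetic placeholders, not KNHANES records).
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n = 1000
age = rng.uniform(50, 80, n)              # placeholder survey variables
weight = rng.normal(58, 9, n)
other = rng.normal(size=(n, 9))           # BMI, menopause duration, etc. (placeholders)
X = np.column_stack([age, weight, other])
y = rng.integers(0, 2, n)                 # 1 = osteoporosis at hip, femoral neck, or spine

train, test = np.arange(0, 700), np.arange(700, n)
svm = make_pipeline(StandardScaler(), SVC(probability=True)).fit(X[train], y[train])
auc_svm = roc_auc_score(y[test], svm.predict_proba(X[test])[:, 1])

# OST index: commonly cited as 0.2 * (weight_kg - age_years); lower = higher risk.
ost = 0.2 * (weight - age)
auc_ost = roc_auc_score(y[test], -ost[test])
print(f"SVM AUC = {auc_svm:.3f}, OST AUC = {auc_ost:.3f}")
```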
Conclusion
Considering various predictors associated with low bone density, the machine learning methods may be effective tools for identifying postmenopausal women at high risk for osteoporosis.
doi:10.3349/ymj.2013.54.6.1321
PMCID: PMC3809875  PMID: 24142634
Screening; machine learning; risk assessment; clinical decision tools; support vector machines
7.  Online prediction of respiratory motion: multidimensional processing with low-dimensional feature learning 
Physics in medicine and biology  2010;55(11):3011-3025.
Accurate real-time prediction of respiratory motion is desirable for effective motion management in radiotherapy for lung tumor targets. Recently, nonparametric methods have been developed and their efficacy in predicting one-dimensional respiratory-type motion has been demonstrated. To exploit the correlation among various coordinates of the moving target, it is natural to extend the 1D method to multidimensional processing. However, the amount of learning data required for such extension grows exponentially with the dimensionality of the problem, a phenomenon known as the ‘curse of dimensionality’. In this study, we investigate a multidimensional prediction scheme based on kernel density estimation (KDE) in an augmented covariate–response space. To alleviate the ‘curse of dimensionality’, we explore the intrinsic lower dimensional manifold structure and utilize principal component analysis (PCA) to construct a proper low-dimensional feature space, where kernel density estimation is feasible with the limited training data. Interestingly, the construction of this lower dimensional representation reveals a useful decomposition of the variations in respiratory motion into the contribution from semiperiodic dynamics and that from the random noise, as it is only sensible to perform prediction with respect to the former. The dimension reduction idea proposed in this work is closely related to feature extraction used in machine learning, particularly support vector machines. This work points out a pathway in processing high-dimensional data with limited training instances, and this principle applies well beyond the problem of target-coordinate-based respiratory motion prediction. A natural extension is prediction based on image intensity directly, which we will investigate in the continuation of this work. We used 159 lung target motion traces obtained with a Synchrony respiratory tracking system. Prediction performance of the low-dimensional feature learning-based multidimensional prediction method was compared against the independent prediction method where prediction was conducted along each physical coordinate independently. Under fair setup conditions, the proposed method showed uniformly better performance, and reduced the case-wise 3D root mean squared prediction error by about 30–40%. The 90th percentile 3D error is reduced from 1.80 mm to 1.08 mm for 160 ms prediction, and 2.76 mm to 2.01 mm for 570 ms prediction. The proposed method demonstrates the most noticeable improvement in the tail of the error distribution.
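The core idea, a lagged covariate-response embedding, PCA to a low-dimensional feature space, and a kernel-weighted prediction, can be sketched as follows; the synthetic trace, window lengths, and bandwidth are placeholders, and the kernel-weighted average stands in for the full KDE formulation.

```python
# Illustrative sketch (not the authors' implementation): embed the motion
# trace into lagged covariate-response pairs, project covariates onto a few
# principal components, and predict with a Gaussian-kernel weighted average
# (the conditional mean implied by a joint kernel density estimate).
import numpy as np
from sklearn.decomposition import PCA

def make_pairs(trace, hist=10, horizon=6):
    """trace: (T, 3) target coordinates; horizon ~ lookahead samples (e.g. 160 ms)."""
    X, Y = [], []
    for t in range(hist, len(trace) - horizon):
        X.append(trace[t - hist:t].ravel())   # covariate: recent 3-D history
        Y.append(trace[t + horizon])          # response: future 3-D position
    return np.array(X), np.array(Y)

def kde_predict(X_train, Y_train, x_new, n_components=3, bandwidth=1.0):
    pca = PCA(n_components=n_components).fit(X_train)
    Z, z = pca.transform(X_train), pca.transform(x_new[None, :])[0]
    w = np.exp(-np.sum((Z - z) ** 2, axis=1) / (2 * bandwidth ** 2))
    return (w[:, None] * Y_train).sum(axis=0) / w.sum()

# Example on a synthetic semiperiodic trace standing in for Synchrony data.
t = np.arange(2000) * 0.1
trace = np.stack([np.sin(t), 0.5 * np.sin(t + 0.3), 0.2 * np.cos(t)], axis=1)
trace += 0.05 * np.random.default_rng(4).normal(size=trace.shape)
X, Y = make_pairs(trace)
print(kde_predict(X[:-1], Y[:-1], X[-1]))
```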
doi:10.1088/0031-9155/55/11/002
PMCID: PMC2975024  PMID: 20442460
8.  Inferring on the Intentions of Others by Hierarchical Bayesian Learning 
PLoS Computational Biology  2014;10(9):e1003810.
Inferring on others' (potentially time-varying) intentions is a fundamental problem during many social transactions. To investigate the underlying mechanisms, we applied computational modeling to behavioral data from an economic game in which 16 pairs of volunteers (randomly assigned to “player” or “adviser” roles) interacted. The player performed a probabilistic reinforcement learning task, receiving information about a binary lottery from a visual pie chart. The adviser, who received more predictive information, issued an additional recommendation. Critically, the game was structured such that the adviser's incentives to provide helpful or misleading information varied in time. Using a meta-Bayesian modeling framework, we found that the players' behavior was best explained by the deployment of hierarchical learning: they inferred upon the volatility of the advisers' intentions in order to optimize their predictions about the validity of their advice. Beyond learning, volatility estimates also affected the trial-by-trial variability of decisions: participants were more likely to rely on their estimates of advice accuracy for making choices when they believed that the adviser's intentions were presently stable. Finally, our model of the players' inference predicted the players' interpersonal reactivity index (IRI) scores, explicit ratings of the advisers' helpfulness and the advisers' self-reports on their chosen strategy. Overall, our results suggest that humans (i) employ hierarchical generative models to infer on the changing intentions of others, (ii) use volatility estimates to inform decision-making in social interactions, and (iii) integrate estimates of advice accuracy with non-social sources of information. The Bayesian framework presented here can quantify individual differences in these mechanisms from simple behavioral readouts and may prove useful in future clinical studies of maladaptive social cognition.
Author Summary
The ability to decode another person's intentions is a critical component of social interactions. This is particularly important when we have to make decisions based on someone else's advice. Our research proposes that this complex cognitive skill (social learning) can be translated into a mathematical model, which prescribes a mechanism for mentally simulating another person's intentions. This study demonstrates that this process can be parsimoniously described as the deployment of hierarchical learning. In other words, participants learn about two quantities: the intentions of the person they interact with and the veracity of the recommendations they offer. As participants become more and more confident about their representation of the other's intentions, they make decisions more in accordance with the advice they receive. Importantly, our modeling framework captures individual differences in the social learning process: The estimated “learning fingerprint” can predict other aspects of participants' behavior, such as their perspective-taking abilities and their explicit ratings of the adviser's level of trustworthiness. The present modeling approach can be further applied in the context of psychiatry to identify maladaptive learning processes in disorders where social learning processes are particularly impaired, such as schizophrenia.
doi:10.1371/journal.pcbi.1003810
PMCID: PMC4154656  PMID: 25187943
9.  Machine Learning for Outcome Prediction of Acute Ischemic Stroke Post Intra-Arterial Therapy 
PLoS ONE  2014;9(2):e88225.
Introduction
Stroke is a major cause of death and disability. Accurately predicting stroke outcome from a set of predictive variables may identify high-risk patients and guide treatment approaches, leading to decreased morbidity. Logistic regression models allow for the identification and validation of predictive variables. However, advanced machine learning algorithms offer an alternative, in particular, for large-scale multi-institutional data, with the advantage of easily incorporating newly available data to improve prediction performance. Our aim was to design and compare different machine learning methods, capable of predicting the outcome of endovascular intervention in acute anterior circulation ischaemic stroke.
Method
We conducted a retrospective study of a prospectively collected database of acute ischaemic stroke treated by endovascular intervention. Using SPSS®, MATLAB®, and Rapidminer®, classical statistics as well as artificial neural network and support vector algorithms were applied to design a supervised machine capable of classifying these predictors into potential good and poor outcomes. These algorithms were trained, validated and tested using randomly divided data.
Results
We included 107 consecutive acute anterior circulation ischaemic stroke patients treated by endovascular technique. Sixty-six were male, and the mean age was 65.3. All the available demographic, procedural and clinical factors were included in the models. The final confusion matrix of the neural network demonstrated an overall congruency of ∼80% between the target and output classes, with favourable receiver operating characteristics. However, after optimisation, the support vector machine had a relatively better performance, with a root mean squared error of 2.064 (SD: ±0.408).
Discussion
We showed promising accuracy of outcome prediction, using supervised machine learning algorithms, with potential for incorporation of larger multicenter datasets, likely further improving prediction. Finally, we propose that a robust machine learning system can potentially optimise the selection process for endovascular versus medical treatment in the management of acute stroke.
doi:10.1371/journal.pone.0088225
PMCID: PMC3919736  PMID: 24520356
10.  An Exhaustive, Non-Euclidean, Non-Parametric Data Mining Tool for Unraveling the Complexity of Biological Systems – Novel Insights into Malaria 
PLoS ONE  2011;6(9):e24085.
Complex, high-dimensional data sets pose significant analytical challenges in the post-genomic era. Such data sets are not exclusive to genetic analyses and are also pertinent to epidemiology. There has been considerable effort to develop hypothesis-free data mining and machine learning methodologies. However, current methodologies lack exhaustivity and general applicability. Here we use a novel non-parametric, non-euclidean data mining tool, HyperCube®, to explore exhaustively a complex epidemiological malaria data set by searching for over-densities of events in m-dimensional space. Hotspots of over-density correspond to strings of variables (rules) that determine, in this case, the occurrence of Plasmodium falciparum clinical malaria episodes. The data set contained 46,837 outcome events from 1,653 individuals and 34 explanatory variables. The best predictive rule contained 1,689 events from 148 individuals and was defined as: individuals present during 1992–2003, aged 1–5 years old, having hemoglobin AA, and having had previous Plasmodium malariae malaria parasite infection ≤10 times. These individuals had 3.71 times more P. falciparum clinical malaria episodes than the general population. We validated the rule in two different cohorts. We compared and contrasted the HyperCube® rule with rules based on variables identified by both traditional statistical methods and non-parametric regression tree methods. In addition, we tried all possible sub-stratified quantitative variables. No other model with equal or greater representativity gave a higher Relative Risk. Although three of the four variables in the rule were intuitive, the effect of number of P. malariae episodes was not. HyperCube® efficiently sub-stratified quantitative variables to optimize the rule and was able to identify interactions among the variables, tasks not easy to perform using standard data mining methods. Searching for local over-densities in m-dimensional space, explained by easily interpretable rules, is thus seemingly ideal for generating hypotheses for large datasets to unravel the complexity inherent in biological systems.
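As a small worked example of what such a rule means in practice, the function below computes the relative risk of a rule-defined subgroup against the whole cohort; the column names and the rate denominator are placeholders rather than the study's actual variables.

```python
# Worked example of scoring a candidate rule by relative risk, the quantity
# the rule search optimizes (column names are hypothetical placeholders).
import pandas as pd

def relative_risk(df, rule_mask, outcome="episodes", exposure="person_years"):
    """Rate of clinical malaria episodes inside the rule vs. the whole cohort."""
    rate_rule = df.loc[rule_mask, outcome].sum() / df.loc[rule_mask, exposure].sum()
    rate_all = df[outcome].sum() / df[exposure].sum()
    return rate_rule / rate_all

# Rule from the abstract: present 1992-2003, aged 1-5, hemoglobin AA,
# <=10 prior P. malariae infections (reported relative risk: 3.71).
# mask = (df.period == "1992-2003") & df.age.between(1, 5) \
#        & (df.hb == "AA") & (df.pm_infections <= 10)
# print(relative_risk(df, mask))
```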
doi:10.1371/journal.pone.0024085
PMCID: PMC3170284  PMID: 21931645
11.  Video Game Telemetry as a Critical Tool in the Study of Complex Skill Learning 
PLoS ONE  2013;8(9):e75129.
Cognitive science has long shown interest in expertise, in part because prediction and control of expert development would have immense practical value. Most studies in this area investigate expertise by comparing experts with novices. The reliance on contrastive samples in studies of human expertise only yields deep insight into development where differences are important throughout skill acquisition. This reliance may be pernicious where the predictive importance of variables is not constant across levels of expertise. Before the development of sophisticated machine learning tools for data mining larger samples, and indeed, before such samples were available, it was difficult to test the implicit assumption of static variable importance in expertise development. To investigate if this reliance may have imposed critical restrictions on the understanding of complex skill development, we adopted an alternative method, the online acquisition of telemetry data from a common daily activity for many: video gaming. Using measures of cognitive-motor, attentional, and perceptual processing extracted from game data from 3360 Real-Time Strategy players at 7 different levels of expertise, we identified 12 variables relevant to expertise. We show that the static variable importance assumption is false - the predictive importance of these variables shifted as the levels of expertise increased - and, at least in our dataset, that a contrastive approach would have been misleading. The finding that variable importance is not static across levels of expertise suggests that large, diverse datasets of sustained cognitive-motor performance are crucial for an understanding of expertise in real-world contexts. We also identify plausible cognitive markers of expertise.
doi:10.1371/journal.pone.0075129
PMCID: PMC3776738  PMID: 24058656
12.  Fusing Dual-Event Datasets for Mycobacterium Tuberculosis Machine Learning Models and their Evaluation 
The search for new tuberculosis treatments continues as we need to find molecules that can act more quickly, be accommodated in multi-drug regimens, and overcome ever increasing levels of drug resistance. Multiple large scale phenotypic high-throughput screens against Mycobacterium tuberculosis (Mtb) have generated dose response data, enabling the generation of machine learning models. These models also incorporated cytotoxicity data and were recently validated with a large external dataset.
A cheminformatics data-fusion approach followed by Bayesian machine learning, Support Vector Machine or Recursive Partitioning model development (based on publicly available Mtb screening data) was used to compare individual datasets and subsequent combined models. A set of 1924 commercially available molecules with promising antitubercular activity (and lack of relative cytotoxicity to Vero cells) was used to evaluate the predictive nature of the models. We demonstrate that combining three datasets incorporating antitubercular and cytotoxicity data in Vero cells from our previous screens results in an external validation receiver operating characteristic (ROC) of 0.83 (Bayesian or RP Forest). Models that do not have the highest five-fold cross-validation ROC scores can outperform other models in a test-set-dependent manner.
We demonstrate with predictions for a recently published set of Mtb leads from GlaxoSmithKline that no single machine learning model may be enough to identify compounds of interest. Dataset fusion represents a further useful strategy for machine learning construction as illustrated with Mtb. Coverage of chemistry and Mtb target spaces may also be limiting factors for the whole-cell screening data generated to date.
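A hedged sketch of the dataset-fusion idea: pool several labelled screens, featurize molecules with Morgan fingerprints (a stand-in for the function-class fingerprints used in the paper), train a naive Bayes model, and check it on an external set. The function and variable names are illustrative.

```python
# Illustrative dataset-fusion sketch for whole-cell Mtb models (not the
# authors' Bayesian implementation; fingerprints and learner are stand-ins).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import BernoulliNB

def fingerprints(smiles_list, n_bits=1024):
    """Morgan (radius-3) bit fingerprints for a list of SMILES strings."""
    fps = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fps.append(list(AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=n_bits)))
    return np.array(fps)

def fused_model(datasets):
    """datasets = [(smiles_list, labels), ...], one tuple per screen to be fused."""
    X = np.vstack([fingerprints(s) for s, _ in datasets])
    y = np.concatenate([np.asarray(l) for _, l in datasets])
    return BernoulliNB().fit(X, y)

# External validation against a held-out screen:
# model = fused_model(training_screens)
# auc = roc_auc_score(ext_labels, model.predict_proba(fingerprints(ext_smiles))[:, 1])
```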
doi:10.1021/ci400480s
PMCID: PMC3910492  PMID: 24144044
Bayesian models; Collaborative Drug Discovery Tuberculosis database; Dual-event models; Function class fingerprints; Lead optimization; Mycobacterium tuberculosis; Recursive partitioning; Support vector machine; Tuberculosis
13.  Diabetic retinopathy risk prediction for fundus examination using sparse learning: a cross-sectional study 
Background
Blindness due to diabetic retinopathy (DR) is the major disability in diabetic patients. Although early management has been shown to prevent vision loss, diabetic patients have a low rate of routine ophthalmologic examination. Hence, we developed and validated sparse learning models with the aim of identifying the risk of DR in diabetic patients.
Methods
Health records from the Korea National Health and Nutrition Examination Surveys (KNHANES) V-1 were used. The prediction models for DR were constructed using data from 327 diabetic patients, and were validated internally on 163 patients in the KNHANES V-1. External validation was performed using 562 diabetic patients in the KNHANES V-2. The learning models, including ridge, elastic net, and LASSO, were compared to the traditional indicators of DR.
Results
Considering the Bayesian information criterion, LASSO predicted DR most efficiently. In the internal and external validation, LASSO was significantly superior to the traditional indicators by calculating the area under the curve (AUC) of the receiver operating characteristic. LASSO showed an AUC of 0.81 and an accuracy of 73.6% in the internal validation, and an AUC of 0.82 and an accuracy of 75.2% in the external validation.
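A minimal sketch of the model comparison described above, using scikit-learn logistic regressions with ridge, elastic net, and LASSO penalties on placeholder data; the actual KNHANES variables and tuning procedure are not reproduced.

```python
# Sketch of comparing ridge, elastic net, and LASSO logistic models by AUC
# (synthetic placeholders stand in for the KNHANES survey fields).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(490, 30))            # 327 training + 163 internal-validation patients
y = rng.integers(0, 2, size=490)          # 1 = diabetic retinopathy present
X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=327, random_state=0)

models = {
    "ridge":       LogisticRegression(penalty="l2", C=1.0, max_iter=5000),
    "elastic net": LogisticRegression(penalty="elasticnet", l1_ratio=0.5,
                                      solver="saga", C=1.0, max_iter=5000),
    "LASSO":       LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{name}: internal-validation AUC = {auc:.3f}")
```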
Conclusion
The sparse learning model using LASSO was effective in analyzing the epidemiological underlying patterns of DR. This is the first study to develop a machine learning model to predict DR risk using health records. LASSO can be an excellent choice when both discriminative power and variable selection are important in the analysis of high-dimensional electronic health records.
doi:10.1186/1472-6947-13-106
PMCID: PMC3847617  PMID: 24033926
Sparse learning; LASSO; Diabetic retinopathy; Clinical decision support; Risk assessment
14.  Investigating the Efficacy of Nonlinear Dimensionality Reduction Schemes in Classifying Gene- and Protein-Expression Studies 
The recent explosion in procurement and availability of high-dimensional gene- and protein-expression profile datasets for cancer diagnostics has necessitated the development of sophisticated machine learning tools with which to analyze them. While some investigators are focused on identifying informative genes and proteins that play a role in specific diseases, other researchers have attempted instead to classify patients based on their expression profiles in order to prognosticate disease status. A major limitation in the ability to accurately classify these high-dimensional datasets stems from the ‘curse of dimensionality’, occurring in situations where the number of genes or peptides significantly exceeds the total number of patient samples. Previous attempts at dealing with this issue have mostly centered on the use of a dimensionality reduction (DR) scheme, Principal Component Analysis (PCA), to obtain a low-dimensional projection of the high-dimensional data. However, linear PCA and other linear DR methods, which rely on Euclidean distances to estimate object similarity, do not account for the inherent underlying nonlinear structure associated with most biomedical data. While some researchers have begun to explore nonlinear DR methods for computer vision problems such as face detection and recognition, to the best of our knowledge, few such attempts have been made for classification and visualization of high-dimensional biomedical data. The motivation behind this work is to identify the appropriate DR methods for analysis of high-dimensional gene- and protein-expression studies. Towards this end, we empirically and rigorously compare three nonlinear (Isomap, Locally Linear Embedding, Laplacian Eigenmaps) and three linear DR schemes (PCA, Linear Discriminant Analysis, Multidimensional Scaling) with the intent of determining a reduced subspace representation in which the individual object classes are more easily discriminable. Owing to the inherent nonlinear structure of gene- and protein-expression studies, our claim is that the nonlinear DR methods provide a more truthful low-dimensional representation of the data compared to the linear DR schemes. Evaluation of the DR schemes was done by (i) assessing the discriminability of two supervised classifiers (Support Vector Machine and C4.5 Decision Trees) in the different low-dimensional data embeddings and (ii) applying 5 cluster validity measures to evaluate the size, distance and tightness of object aggregates in the low-dimensional space. For each of the 7 evaluation measures considered, statistically significant improvement in the quality of the embeddings across 10 cancer datasets via the use of 3 nonlinear DR schemes over 3 linear DR techniques was observed. Similar trends were observed when linear and nonlinear DR was applied to the high-dimensional data following feature pruning to isolate the most informative features. Qualitative evaluation of the low-dimensional data embedding obtained via the 6 DR methods further suggests that the nonlinear schemes are better able to identify potential novel classes (e.g. cancer subtypes) within the data.
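The comparison can be sketched with scikit-learn's implementations of the three nonlinear schemes (Isomap, LLE, and Laplacian Eigenmaps via spectral embedding) against PCA from the linear side, scoring an SVM on each embedding; embedding the full data set before cross-validation is a simplification of the paper's protocol, and the data here are placeholders.

```python
# Sketch of the DR comparison: embed the same expression matrix with several
# schemes, then score an SVM on each embedding (illustrative, simplified).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5000))          # placeholder gene-expression matrix
y = rng.integers(0, 2, size=100)          # e.g. tumor vs. normal labels

embedders = {
    "PCA": PCA(n_components=5),
    "Isomap": Isomap(n_components=5, n_neighbors=10),
    "LLE": LocallyLinearEmbedding(n_components=5, n_neighbors=10),
    "Laplacian Eigenmaps": SpectralEmbedding(n_components=5, n_neighbors=10),
}
for name, emb in embedders.items():
    Z = emb.fit_transform(X)
    acc = cross_val_score(SVC(), Z, y, cv=5).mean()
    print(f"{name}: 5-fold SVM accuracy = {acc:.3f}")
```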
doi:10.1109/TCBB.2008.36
PMCID: PMC2562675  PMID: 18670041
Dimensionality reduction; bioinformatics; data clustering; data visualization; machine learning; manifold learning; nonlinear dimensionality reduction; gene expression; proteomics; prostate cancer; lung cancer; ovarian cancer; principal component analysis; linear discriminant analysis; multidimensional scaling; Isomap; locally linear embedding; laplacian eigenmaps; classification; support vector machine; decision trees; LLE; PCA
15.  Protein-Protein Interaction Site Predictions with Three-Dimensional Probability Distributions of Interacting Atoms on Protein Surfaces 
PLoS ONE  2012;7(6):e37706.
Protein-protein interactions are key to many biological processes. Computational methodologies devised to predict protein-protein interaction (PPI) sites on protein surfaces are important tools in providing insights into the biological functions of proteins and in developing therapeutics targeting the protein-protein interaction sites. One of the general features of PPI sites is that the core regions from the two interacting protein surfaces are complementary to each other, similar to the interior of proteins in packing density and in the physicochemical nature of the amino acid composition. In this work, we simulated the physicochemical complementarities by constructing three-dimensional probability density maps of non-covalent interacting atoms on the protein surfaces. The interacting probabilities were derived from the interior of known structures. Machine learning algorithms were applied to learn the characteristic patterns of the probability density maps specific to the PPI sites. The trained predictors for PPI sites were cross-validated with the training cases (consisting of 432 proteins) and were tested on an independent dataset (consisting of 142 proteins). The residue-based Matthews correlation coefficient for the independent test set was 0.423; the accuracy, precision, sensitivity, specificity were 0.753, 0.519, 0.677, and 0.779 respectively. The benchmark results indicate that the optimized machine learning models are among the best predictors in identifying PPI sites on protein surfaces. In particular, the PPI site prediction accuracy increases with increasing size of the PPI site and with increasing hydrophobicity in amino acid composition of the PPI interface; the core interface regions are more likely to be recognized with high prediction confidence. The results indicate that the physicochemical complementarity patterns on protein surfaces are important determinants in PPIs, and a substantial portion of the PPI sites can be predicted correctly with the physicochemical complementarity features based on the non-covalent interaction data derived from protein interiors.
doi:10.1371/journal.pone.0037706
PMCID: PMC3368894  PMID: 22701576
16.  Negative Example Selection for Protein Function Prediction: The NoGO Database 
PLoS Computational Biology  2014;10(6):e1003644.
Negative examples – genes that are known not to carry out a given protein function – are rarely recorded in genome and proteome annotation databases, such as the Gene Ontology database. Negative examples are required, however, for several of the most powerful machine learning methods for integrative protein function prediction. Most protein function prediction efforts have relied on a variety of heuristics for the choice of negative examples. Determining the accuracy of methods for negative example prediction is itself a non-trivial task, given that the Open World Assumption as applied to gene annotations rules out many traditional validation metrics. We present a rigorous comparison of these heuristics, utilizing a temporal holdout, and a novel evaluation strategy for negative examples. We add to this comparison several algorithms adapted from Positive-Unlabeled learning scenarios in text-classification, which are the current state of the art methods for generating negative examples in low-density annotation contexts. Lastly, we present two novel algorithms of our own construction, one based on empirical conditional probability, and the other using topic modeling applied to genes and annotations. We demonstrate that our algorithms achieve significantly fewer incorrect negative example predictions than the current state of the art, using multiple benchmarks covering multiple organisms. Our methods may be applied to generate negative examples for any type of method that deals with protein function, and to this end we provide a database of negative examples in several well-studied organisms, for general use (The NoGO database, available at: bonneaulab.bio.nyu.edu/nogo.html).
Author Summary
Many machine learning methods have been applied to the task of predicting the biological function of proteins based on a variety of available data. The majority of these methods require negative examples: proteins that are known not to perform a function, in order to achieve meaningful predictions, but negative examples are often not available. In addition, past heuristic methods for negative example selection suffer from a high error rate. Here, we rigorously compare two novel algorithms against past heuristics, as well as some algorithms adapted from a similar task in text-classification. Through this comparison, performed on several different benchmarks, we demonstrate that our algorithms make significantly fewer mistakes when predicting negative examples. We also provide a database of negative examples for general use in machine learning for protein function prediction (The NoGO database, available at: bonneaulab.bio.nyu.edu/nogo.html).
doi:10.1371/journal.pcbi.1003644
PMCID: PMC4055410  PMID: 24922051
17.  Using Rule-Based Machine Learning for Candidate Disease Gene Prioritization and Sample Classification of Cancer Gene Expression Data 
PLoS ONE  2012;7(7):e39932.
Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL’s classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.
doi:10.1371/journal.pone.0039932
PMCID: PMC3394775  PMID: 22808075
18.  Random generalized linear model: a highly accurate and interpretable ensemble predictor 
BMC Bioinformatics  2013;14:5.
Background
Ensemble predictors such as the random forest are known to have superior accuracy but their black-box predictions are difficult to interpret. In contrast, a generalized linear model (GLM) is very interpretable, especially when forward feature selection is used to construct the model. However, forward feature selection tends to overfit the data and leads to low predictive accuracy. Therefore, it remains an important research goal to combine the advantages of ensemble predictors (high accuracy) with the advantages of forward regression modeling (interpretability). To address this goal, several articles have explored GLM based ensemble predictors. Since limited evaluations suggested that these ensemble predictors were less accurate than alternative predictors, they have received little attention in the literature.
Results
Comprehensive evaluations involving hundreds of genomic data sets, the UCI machine learning benchmark data, and simulations are used to give GLM based ensemble predictors a new and careful look. A novel bootstrap aggregated (bagged) GLM predictor that incorporates several elements of randomness and instability (random subspace method, optional interaction terms, forward variable selection) often outperforms a host of alternative prediction methods including random forests and penalized regression models (ridge regression, elastic net, lasso). This random generalized linear model (RGLM) predictor provides variable importance measures that can be used to define a “thinned” ensemble predictor (involving few features) that retains excellent predictive accuracy.
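The reference implementation is the R package randomGLM; the following is an illustrative Python analogue of the same ingredients, bootstrap bags, random feature subspaces, and forward variable selection within each bag, not the package itself.

```python
# Illustrative RGLM-like ensemble: bag GLMs fitted on bootstrap samples, each
# restricted to a random feature subspace with forward variable selection.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

def fit_rglm_like(X, y, n_bags=20, n_features=20, n_selected=5, seed=0):
    rng = np.random.default_rng(seed)
    bags = []
    for _ in range(n_bags):
        rows = rng.integers(0, len(X), len(X))                      # bootstrap sample
        cols = rng.choice(X.shape[1], n_features, replace=False)    # random subspace
        sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                        n_features_to_select=n_selected,
                                        direction="forward")
        sfs.fit(X[rows][:, cols], y[rows])
        keep = cols[sfs.get_support()]
        model = LogisticRegression(max_iter=1000).fit(X[rows][:, keep], y[rows])
        bags.append((keep, model))
    return bags

def predict_rglm_like(bags, X):
    # Aggregate by averaging the member GLM probabilities (a simple rule).
    probs = [m.predict_proba(X[:, cols])[:, 1] for cols, m in bags]
    return np.mean(probs, axis=0)
```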
Conclusion
RGLM is a state of the art predictor that shares the advantages of a random forest (excellent predictive accuracy, feature importance measures, out-of-bag estimates of accuracy) with those of a forward selected generalized linear model (interpretability). These methods are implemented in the freely available R software package randomGLM.
doi:10.1186/1471-2105-14-5
PMCID: PMC3645958  PMID: 23323760
19.  Searching for interpretable rules for disease mutations: a simulated annealing bump hunting strategy 
BMC Bioinformatics  2006;7:417.
Background
Understanding how amino acid substitutions affect protein functions is critical for the study of proteins and their implications in diseases. Although methods have been developed for predicting potential effects of amino acid substitutions using sequence, three-dimensional structural, and evolutionary properties of proteins, the applications are limited by the complexity of the features and the availability of protein structural information. Another limitation is that the prediction results are hard to interpret with physicochemical principles and biological knowledge.
Results
To overcome these limitations, we proposed a novel feature set using physicochemical properties of amino acids, evolutionary profiles of proteins, and protein sequence information. We applied the support vector machine and the random forest with the feature set to experimental amino acid substitutions occurring in the E. coli lac repressor and the bacteriophage T4 lysozyme, as well as to annotated amino acid substitutions occurring in a wide range of human proteins. The results showed that the proposed feature set was superior to the existing ones. To explore physicochemical principles behind amino acid substitutions, we designed a simulated annealing bump hunting strategy to automatically extract interpretable rules for amino acid substitutions. We applied the strategy to annotated human amino acid substitutions and successfully extracted several rules which were either consistent with current biological knowledge or providing new insights for the understanding of amino acid substitutions. When applied to unclassified data, these rules could cover a large portion of samples, and most of the covered samples showed good agreement with predictions made by either the support vector machine or the random forest.
Conclusion
The prediction methods using the proposed feature set can achieve larger AUC (the area under the ROC curve), smaller BER (the balanced error rate), and larger MCC (the Matthews' correlation coefficient) than those using the published feature sets, suggesting that our feature set is superior to the existing ones. The rules extracted by the simulated annealing bump hunting strategy have comparable coverage and accuracy to, but much better interpretability than, those extracted by the patient rule induction method (PRIM), revealing that the strategy is more effective in inducing interpretable rules.
doi:10.1186/1471-2105-7-417
PMCID: PMC1618409  PMID: 16984653
20.  Stochastic evolutionary dynamics resolve the Traveler's Dilemma 
Journal of Theoretical Biology  2012;303:119-127.
Behavior in social dilemmas is often inconsistent with the predictions of classical game theory: people (and a wide variety of other organisms) are more cooperative than might be expected. Here we consider behavior in one such social dilemma, the Traveler's Dilemma, that has received considerable attention in the economics literature but is little known among theoretical biologists. The rules of the game are as follows. Two players each choose a value between R and M, where 0 < R < M. If the players choose the same value, both receive that amount. If the players choose different values v1 and v2, where v1 < v2, then the player choosing v1 earns v1 + R and the player choosing v2 earns v1 − R. While the players would maximize their payoffs by both declaring the largest possible value, M, the Nash equilibrium is to declare the smallest possible value, R. In behavioral experiments, however, people generally declare values much larger than the minimum and the deviation from the expected equilibrium decreases with R. In this paper, we show that the cooperative behavior observed in the Traveler's Dilemma can be explained in an evolutionary framework. We study stochastic evolutionary dynamics in finite populations with varying intensity of selection and varying mutation rate. We derive analytic results showing that strategies declaring high values can be favored when selection is weak. More generally, selection favors strategies that declare high values if R is small (relative to M) and strategies that declare low values if R is large. Finally, we show that a two-parameter model involving the intensity of selection and the mutation rate can quantitatively reproduce data from a Traveler's Dilemma experiment. These results demonstrate the power of evolutionary game theory for explaining human behavior in contexts that are challenging for standard economic game theory.
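For reference, the payoff rule described above can be written as a small function; the printed examples use illustrative claim values.

```python
# The Traveler's Dilemma payoff rule stated in the abstract.
def traveler_payoffs(v1, v2, R):
    """Return (payoff to the player declaring v1, payoff to the player declaring v2)."""
    if v1 == v2:
        return v1, v2
    low = min(v1, v2)
    if v1 < v2:
        return low + R, low - R   # lower claim is rewarded, higher claim is penalized
    return low - R, low + R

print(traveler_payoffs(100, 100, 2))   # (100, 100): equal claims
print(traveler_payoffs(98, 100, 2))    # (100, 96): undercutting pays
```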
doi:10.1016/j.jtbi.2012.03.014
PMCID: PMC3348410  PMID: 22465111
21.  Improved machine learning method for analysis of gas phase chemistry of peptides 
BMC Bioinformatics  2008;9:515.
Background
Accurate peptide identification is important to high-throughput proteomics analyses that use mass spectrometry. Search programs compare fragmentation spectra (MS/MS) of peptides from complex digests with theoretically derived spectra from a database of protein sequences. Improved discrimination is achieved with theoretical spectra that are based on simulating gas phase chemistry of the peptides, but the limited understanding of those processes affects the accuracy of predictions from theoretical spectra.
Results
We employed a robust data mining strategy using new feature annotation functions of the MAE software, which revealed that the frequency of fragmentation at the second peptide bond is under-predicted. We applied methods of exploratory data analysis to pre-process the information in the MS/MS spectra, including data normalization and attribute selection, to reduce the attributes to a smaller, less correlated set for machine learning studies. We then compared our rule-building machine learning program, DataSqueezer, with commonly used association rule and decision tree algorithms (a sketch of this workflow follows the Conclusion below). All of the machine learning algorithms produced similar results, consistent with the expected properties of a second gas phase mechanism at the second peptide bond.
Conclusion
The results provide compelling evidence that we have identified underlying chemical properties in the data that suggest the existence of an additional gas phase mechanism for the second peptide bond. Thus, the methods described in this study provide a valuable approach for analyses of this kind in the future.
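As a rough illustration of the workflow in the Results (attribute pre-processing followed by rule induction), the sketch below builds a synthetic table of spectrum-derived attributes, drops highly correlated ones, and prints the rules of a small decision tree. DataSqueezer and the MAE annotations are not reproduced here; a scikit-learn decision tree stands in for the rule-building step, and every column name and value is a placeholder.

    # Placeholder rule-mining sketch: synthetic MS/MS attributes, simple
    # correlation-based attribute selection, and a shallow decision tree whose
    # rules can be read directly.
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(1)
    spectra = pd.DataFrame(rng.normal(size=(500, 6)),
                           columns=["b2_intensity", "y_n2_intensity", "proline_at_2",
                                    "basic_residues", "peptide_length", "charge"])
    cleaved_at_bond2 = (spectra["b2_intensity"] + rng.normal(size=500) > 0).astype(int)

    # attribute selection: drop any attribute correlated > 0.9 with an earlier one
    corr = spectra.corr().abs()
    keep = [c for i, c in enumerate(spectra.columns)
            if not (corr.iloc[:i][c] > 0.9).any()]

    X = StandardScaler().fit_transform(spectra[keep])
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, cleaved_at_bond2)
    print(export_text(tree, feature_names=keep))   # human-readable rules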
doi:10.1186/1471-2105-9-515
PMCID: PMC2612015  PMID: 19055745
22.  Spike-based Decision Learning of Nash Equilibria in Two-Player Games 
PLoS Computational Biology  2012;8(9):e1002691.
Humans and animals face decision tasks in an uncertain multi-agent environment where an agent's strategy may change over time due to the co-adaptation of others' strategies. The neuronal substrate and the computational algorithms underlying such adaptive decision making, however, are largely unknown. We propose a population coding model of spiking neurons with a policy gradient procedure that successfully acquires optimal strategies for classical game-theoretical tasks. The suggested population reinforcement learning reproduces data from human behavioral experiments for the blackjack and the inspector game. It performs optimally according to a pure (deterministic) and a mixed (stochastic) Nash equilibrium, respectively. In contrast, temporal-difference (TD) learning, covariance learning, and basic reinforcement learning fail to perform optimally for the stochastic strategy. Spike-based population reinforcement learning, shown to follow the stochastic reward gradient, is therefore a viable candidate to explain automated decision learning of a Nash equilibrium in two-player games.
Author Summary
Socio-economic interactions are captured in a game-theoretic framework by multiple agents acting on a pool of goods to maximize their own reward. Neuroeconomics tries to explain the agent's behavior in neuronal terms. Classical models in neuroeconomics use temporal-difference (TD) learning. This algorithm incrementally updates values of state-action pairs, and actions are selected according to a value-based policy. In contrast, policy gradient methods do not introduce values as intermediate steps, but directly derive an action selection policy that maximizes the total expected reward. We consider a decision making network consisting of a population of neurons which, upon presentation of a spatio-temporal spike pattern, encodes binary actions by the population output spike trains and a subsequent majority vote. The action selection policy is parametrized by the strengths of the synapses projecting to the population neurons. A gradient learning rule is derived which modifies these synaptic strengths and which depends on four factors: the pre- and postsynaptic activities, the action, and the reward. We show that for classical game-theoretical tasks our decision making network endowed with the four-factor learning rule leads to Nash-optimal action selections. It also mimics human decision learning for these same tasks.
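The four-factor rule described in the summary can be caricatured in a few lines. The sketch below is a heavily simplified rendering (binary stochastic units instead of spiking neurons, one decision per trial) and is not the authors' model: reward gates a presynaptic x (spike minus expected spike) term, a REINFORCE-style policy gradient for a population whose binary action is a majority vote.

    # Simplified four-factor update: pre activity, post spike, action (via the
    # majority vote the spikes determine), and reward. Patterns, sizes and the
    # learning rate are arbitrary stand-ins.
    import numpy as np

    rng = np.random.default_rng(0)
    n_inputs, n_neurons, eta = 20, 51, 0.05
    W = rng.normal(scale=0.1, size=(n_neurons, n_inputs))
    patterns = rng.normal(size=(2, n_inputs))            # two input "spike patterns"
    targets = np.array([0, 1])                           # rewarded binary action per pattern

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    for trial in range(2000):
        k = rng.integers(2)
        x = patterns[k]
        p = sigmoid(W @ x)                               # firing probabilities
        s = (rng.random(n_neurons) < p).astype(float)    # stochastic binary outputs ("spikes")
        action = int(s.sum() > n_neurons / 2)            # majority vote
        reward = 1.0 if action == targets[k] else -1.0
        W += eta * reward * np.outer(s - p, x)           # reward x (post - expected post) x pre

    for k, x in enumerate(patterns):
        p = sigmoid(W @ x)
        print("pattern", k, "-> action", int((p > 0.5).sum() > n_neurons / 2), "target", targets[k])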
doi:10.1371/journal.pcbi.1002691
PMCID: PMC3459907  PMID: 23028289
23.  Evaluating Random Forests for Survival Analysis using Prediction Error Curves 
Prediction error curves are increasingly used to assess and compare predictions in survival analysis. This article surveys the R package pec, which provides a set of functions for efficient computation of prediction error curves. The software implements inverse probability of censoring weights to deal with right-censored data and several variants of cross-validation to deal with the apparent error problem. In principle, all kinds of prediction models can be assessed, and the package readily supports most traditional regression modeling strategies, such as Cox regression or additive hazard regression, as well as state-of-the-art machine learning methods such as random forests, a nonparametric method that provides a promising alternative to traditional strategies in low- and high-dimensional settings. We show how the functionality of pec can be extended to as yet unsupported prediction models. As an example, we implement support for random forest prediction models based on the R packages randomSurvivalForest and party. Using data from the Copenhagen Stroke Study, we use pec to compare random forests to a Cox regression model derived by stepwise variable selection. Reproducible results on the user level are given for publicly available data from the German breast cancer study group.
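At the core of a prediction error curve is the inverse-probability-of-censoring-weighted (IPCW) Brier score evaluated on a grid of time points. The following plain-Python sketch (not the pec package, which is written in R) computes that score at a single time point; the predicted survival probabilities are assumed to come from some already fitted model, and tie handling and left limits of the censoring distribution are glossed over.

    # IPCW Brier score at time t (Graf-style weighting), illustration only.
    import numpy as np

    def censoring_survival(times, events, t):
        """Kaplan-Meier estimate of the censoring survival function G(t)
        (censoring, events == 0, is treated as the 'event')."""
        g, at_risk = 1.0, len(times)
        for i in np.argsort(times):
            if times[i] > t:
                break
            if events[i] == 0:
                g *= 1.0 - 1.0 / at_risk
            at_risk -= 1
        return g

    def ipcw_brier(times, events, surv_at_t, t):
        """IPCW Brier score at t for predicted survival probabilities surv_at_t."""
        score = 0.0
        for T, d, s in zip(times, events, surv_at_t):
            if T <= t and d == 1:          # event before t: true survival status is 0
                score += (0.0 - s) ** 2 / censoring_survival(times, events, T)
            elif T > t:                    # still at risk at t: true survival status is 1
                score += (1.0 - s) ** 2 / censoring_survival(times, events, t)
            # subjects censored before t contribute nothing directly
        return score / len(times)

    # toy data and a constant predicted survival probability of 70% at t = 5
    times  = np.array([2.0, 3.5, 4.0, 6.0, 7.5, 8.0])
    events = np.array([1, 0, 1, 1, 0, 1])               # 1 = event, 0 = censored
    print(ipcw_brier(times, events, np.full(6, 0.7), t=5.0))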
PMCID: PMC4194196  PMID: 25317082
Survival prediction; prediction error curves; random survival forest; R.
24.  Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art 
BMC Bioinformatics  2012;13:89.
Background
RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition ‘code’ that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction.
Results
We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naïve Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. Our results show that (i) sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity-based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); and (ii) structure-based classifiers that use a smoothed PSSM representation (SmoPSSMStr) outperform those that use a PSSM (PSSMStr) or a sequence identity-based representation (IDStr); a minimal sketch of the PSSM window encoding follows the Conclusions below. PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) than sequence-based methods. We also find that the expected performance of the classifiers at the residue level can be markedly different from that at the protein level. Our experiments show that classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, the results are significantly affected by differences in the distance threshold used to define interface residues.
Conclusions
Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, datasets constructed under several alternative definitions of interface residues and redundancy cutoffs, and evaluations on independent test sets.
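The sketch below shows only the data representation and evaluation idea discussed above, not any of the benchmarked predictors: each residue is encoded by the flattened PSSM rows of a window centred on it, an SVM labels it as interface or not, and performance is summarised by the MCC. The PSSM and the interface annotations are synthetic placeholders.

    # PSSM sliding-window encoding (PSSMSeq-style) with an SVM, illustration only.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import matthews_corrcoef

    rng = np.random.default_rng(0)
    L, window = 300, 7                                   # protein length, odd window size
    pssm = rng.normal(size=(L, 20))                      # stand-in position-specific scoring matrix
    labels = (pssm[:, 0] + rng.normal(size=L) > 1).astype(int)   # stand-in interface annotations

    half = window // 2
    padded = np.vstack([np.zeros((half, 20)), pssm, np.zeros((half, 20))])
    X = np.array([padded[i:i + window].ravel() for i in range(L)])   # one feature row per residue

    pred = cross_val_predict(SVC(class_weight="balanced"), X, labels, cv=5)
    print("MCC:", matthews_corrcoef(labels, pred))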
doi:10.1186/1471-2105-13-89
PMCID: PMC3490755  PMID: 22574904
25.  Predictive models to assess risk of type 2 diabetes, hypertension and comorbidity: machine-learning algorithms and validation using national health data from Kuwait—a cohort study 
BMJ Open  2013;3(5):e002457.
Objective
We build classification models and risk assessment tools for diabetes, hypertension and comorbidity using machine-learning algorithms on data from Kuwait. We model the increased proneness of diabetic patients to develop hypertension, and vice versa. We ascertain the importance of ethnicity (natives vs expatriate migrants) and of using regional data in risk assessment.
Design
Retrospective cohort study. Four machine-learning techniques were used: logistic regression, k-nearest neighbours (k-NN), multifactor dimensionality reduction and support vector machines. Fivefold cross-validation was used to obtain generalisation accuracies and errors (a sketch of this set-up follows the Conclusions below).
Setting
Kuwait Health Network (KHN) that integrates data from primary health centres and hospitals in Kuwait.
Participants
270 172 hospital visitors (of whom 89 858 are diabetic, 58 745 hypertensive and 30 522 comorbid), comprising Kuwaiti natives and Asian and Arab expatriates.
Outcome measures
Incident type 2 diabetes, hypertension and comorbidity.
Results
Classification accuracies of >85% (for diabetes) and >90% (for hypertension) are achieved using only simple non-laboratory-based parameters. Risk assessment tools based on k-NN classification models assign 'high' risk to 75% of diabetic patients and to 94% of hypertensive patients. Only 5% of diabetic patients are assigned 'low' risk. Asian-specific models and assessments perform even better. Pathological conditions of diabetes in the general or hypertensive population, and those of hypertension, are modelled. Two-stage aggregate classification models and risk assessment tools, built by combining the component models for diabetes (or for hypertension), perform better than the individual models.
Conclusions
Data on diabetes, hypertension and comorbidity from the cosmopolitan State of Kuwait are available for the first time. This enabled us to apply four different case–control models to assess risk. These tools aid in the preliminary non-intrusive assessment of the population. Ethnicity is seen to be significant in the predictive models. Risk assessments need to be developed using regional data, as we demonstrate by testing the applicability of the American Diabetes Association online calculator on data from Kuwait.
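A hedged sketch of the cross-validation set-up in the Design section is given below for three of the four techniques (multifactor dimensionality reduction has no standard scikit-learn implementation and is omitted). The features and outcome are synthetic stand-ins, not Kuwait Health Network variables.

    # Fivefold cross-validation of logistic regression, k-NN and an SVM on
    # placeholder non-laboratory parameters; illustration only.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 8))                                          # stand-in predictors
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=5000) > 1).astype(int)   # stand-in outcome

    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "k-NN": KNeighborsClassifier(n_neighbors=15),
        "SVM": SVC(),
    }
    for name, model in models.items():
        acc = cross_val_score(make_pipeline(StandardScaler(), model), X, y, cv=5)
        print(f"{name}: mean fivefold accuracy = {acc.mean():.3f}")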
doi:10.1136/bmjopen-2012-002457
PMCID: PMC3657675  PMID: 23676796
Predictive models; machine learning; risk assessment; Kuwait
