Related Articles

1.  Module-Based Outcome Prediction Using Breast Cancer Compendia 
PLoS ONE  2007;2(10):e1047.
Background
The availability of large collections of microarray datasets (compendia), or knowledge about grouping of genes into pathways (gene sets), is typically not exploited when training predictors of disease outcome. These can be useful since a compendium increases the number of samples, while gene sets reduce the size of the feature space. This should be favorable from a machine learning perspective and result in more robust predictors.
Methodology
We extracted modules of regulated genes from gene sets, and compendia. Through supervised analysis, we constructed predictors which employ modules predictive of breast cancer outcome. To validate these predictors we applied them to independent data, from the same institution (intra-dataset), and other institutions (inter-dataset).
Conclusions
We show that modules derived from single breast cancer datasets achieve better performance on the validation data compared to gene-based predictors. We also show that there is a trend in compendium specificity and predictive performance: modules derived from a single breast cancer dataset, and a breast cancer specific compendium perform better compared to those derived from a human cancer compendium. Additionally, the module-based predictor provides a much richer insight into the underlying biology. Frequently selected gene sets are associated with processes such as cell cycle, E2F regulation, DNA damage response, proteasome and glycolysis. We analyzed two modules related to cell cycle, and the OCT1 transcription factor, respectively. On an individual basis, these modules provide a significant separation in survival subgroups on the training and independent validation data.
doi:10.1371/journal.pone.0001047
PMCID: PMC2002511  PMID: 17940611
2.  Win-Stay-Lose-Learn Promotes Cooperation in the Spatial Prisoner's Dilemma Game 
PLoS ONE  2012;7(2):e30689.
Holding on to one's strategy is natural and common if the latter warrants success and satisfaction. This goes against widespread simulation practices of evolutionary games, where players frequently consider changing their strategy even though their payoffs may be only marginally different from those of the other players. Inspired by this observation, we introduce an aspiration-based win-stay-lose-learn strategy updating rule into the spatial prisoner's dilemma game. The rule is simple and intuitive, foreseeing strategy changes only by dissatisfied players, who then attempt to adopt the strategy of one of their nearest neighbors, while the strategies of satisfied players are not subject to change. We find that the proposed win-stay-lose-learn rule promotes the evolution of cooperation, and it does so very robustly and independently of the initial conditions. In fact, we show that even a minute initial fraction of cooperators may be sufficient to eventually secure a highly cooperative final state. In addition to extensive simulation results that support our conclusions, we also present results obtained by means of the pair approximation of the studied game. Our findings continue the success story of related win-stay strategy updating rules, and by doing so reveal new ways of resolving the prisoner's dilemma.
doi:10.1371/journal.pone.0030689
PMCID: PMC3281853  PMID: 22363470
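The updating rule in the abstract above can be illustrated with a minimal sketch. The payoff values, the four-neighbor periodic lattice, and the "copy a random neighbor only if it earns more" imitation step are assumptions made for illustration; the abstract does not specify them, and the names `payoff` and `wsll_step` are hypothetical.

```python
import random

# Weak prisoner's dilemma payoffs (an assumption; not given in the abstract).
R, S, P = 1.0, 0.0, 0.0  # reward, sucker's payoff, punishment


def neighbors(i, j, size):
    """Four nearest neighbors on a square lattice with periodic boundaries."""
    return [((i - 1) % size, j), ((i + 1) % size, j),
            (i, (j - 1) % size), (i, (j + 1) % size)]


def payoff(grid, i, j, b):
    """Total payoff of player (i, j) against its four neighbors.

    grid[i][j] is 1 for a cooperator and 0 for a defector;
    b is the temptation to defect.
    """
    total = 0.0
    for ni, nj in neighbors(i, j, len(grid)):
        mine, theirs = grid[i][j], grid[ni][nj]
        if mine and theirs:
            total += R
        elif mine and not theirs:
            total += S
        elif not mine and theirs:
            total += b
        else:
            total += P
    return total


def wsll_step(grid, b, aspiration, rng):
    """One elementary update; returns True if a strategy was adopted."""
    size = len(grid)
    i, j = rng.randrange(size), rng.randrange(size)
    own = payoff(grid, i, j, b)
    if own >= aspiration:
        return False  # win-stay: satisfied players never change
    # lose-learn: a dissatisfied player looks at one random neighbor and
    # copies it only if that neighbor earns more (simplified assumption).
    ni, nj = rng.choice(neighbors(i, j, size))
    if payoff(grid, ni, nj, b) > own:
        grid[i][j] = grid[ni][nj]
        return True
    return False
```

Calling `wsll_step` repeatedly with a seeded `random.Random` instance gives a reproducible toy simulation; the aspiration level is the single parameter that separates satisfied from dissatisfied players.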
3.  Experience and Abstract Reasoning in Learning Backward Induction 
Backward induction is a benchmark of game theoretic rationality, yet surprisingly little is known as to how humans discover and initially learn to apply this abstract solution concept in experimental settings. We use behavioral and functional magnetic resonance imaging (fMRI) data to study the way in which subjects playing in a sequential game of perfect information learn the optimal backward induction strategy for the game. Experimental data from our two studies support two main findings: First, subjects converge to a common process of recursive inference similar to the backward induction procedure for solving the game. The process is recursive because earlier insights and conclusions are used as inputs in later steps of the inference. This process is matched by a similar pattern in brain activation, which also proceeds backward, following the prediction error: brain activity initially codes the responses to losses in final positions; in later trials this activity shifts to the starting position. Second, the learning process is not exclusively cognitive, but instead combines experience-based learning and abstract reasoning. Critical experiences leading to the adoption of an improved solution strategy appear to be stimulated by brain activity in the reward system. This indicates that the negative affect induced by initial failures facilitates the switch to a different method of solving the problem. Abstract reasoning is combined with this response, and is expressed by activation in the ventrolateral prefrontal cortex. Differences in brain activation match differences in performance between subjects who show different learning speeds.
doi:10.3389/fnins.2012.00023
PMCID: PMC3282917  PMID: 22363254
neuroeconomics; game theory; backward induction; learning; deductive reasoning
4.  Survival ensembles by the sum of pairwise differences with application to lung cancer microarray studies 
The annals of applied statistics  2011;5(2A):1081-1101.
Lung cancer is among the most common cancers in the United States, in terms of incidence and mortality. In 2009, it is estimated that more than 150,000 deaths will result from lung cancer alone. Genetic information is an extremely valuable data source in characterizing the personal nature of cancer. Over the past several years, investigators have conducted numerous association studies where intensive genetic data is collected on relatively few patients compared to the numbers of gene predictors, with one scientific goal being to identify genetic features associated with cancer recurrence or survival. In this note, we propose high-dimensional survival analysis through a new application of boosting, a powerful tool in machine learning. Our approach is based on an accelerated lifetime model and minimizing the sum of pairwise differences in residuals. We apply our method to a recent microarray study of lung adenocarcinoma and find that our ensemble is composed of 19 genes while a proportional hazards (PH) ensemble is composed of nine genes, a proper subset of the 19-gene panel. In one of our simulation scenarios, we demonstrate that PH boosting in a misspecified model tends to underfit and ignore moderately-sized covariate effects, on average. Diagnostic analyses suggest that the PH assumption is not satisfied in the microarray data and may explain, in part, the discrepancy in the sets of active coefficients. Our simulation studies and comparative data analyses demonstrate how statistical learning by PH models alone is insufficient.
doi:10.1214/10-AOAS426
PMCID: PMC3148798  PMID: 21818245
Accelerated failure time; Boosting; Lasso; Proportional hazards regression; Survival analysis
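As a rough illustration of the loss named in the abstract above, the sketch below computes residuals of a log-linear accelerated lifetime model and sums their absolute pairwise differences. It ignores censoring entirely, which the abstract does not describe, and it leaves out the boosting procedure; the function names are hypothetical.

```python
import math


def aft_residuals(times, covariates, beta):
    """Residuals e_i = log(t_i) - x_i . beta of a log-linear lifetime model."""
    return [math.log(t) - sum(x * b for x, b in zip(row, beta))
            for t, row in zip(times, covariates)]


def pairwise_difference_loss(residuals):
    """Sum of |e_i - e_j| over all unordered pairs (censoring ignored)."""
    n = len(residuals)
    return sum(abs(residuals[i] - residuals[j])
               for i in range(n) for j in range(i + 1, n))
```

A loss of this form depends only on differences between residuals, so adding a constant to every residual leaves it unchanged; an intercept therefore cannot be estimated from it.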
5.  Classification of microarrays; synergistic effects between normalization, gene selection and machine learning 
BMC Bioinformatics  2011;12:390.
Background
Machine learning is a powerful approach for describing and predicting classes in microarray data. Although several comparative studies have investigated the relative performance of various machine learning methods, these often do not account for the fact that performance (e.g. error rate) is a result of a series of analysis steps of which the most important are data normalization, gene selection and machine learning.
Results
In this study, we used seven previously published cancer-related microarray data sets to compare the effects on classification performance of five normalization methods, three gene selection methods with 21 different numbers of selected genes and eight machine learning methods. Performance in terms of error rate was rigorously estimated by repeatedly employing a double cross validation approach. Since performance varies greatly between data sets, we devised an analysis method that first compares methods within individual data sets and then visualizes the comparisons across data sets. We discovered both well-performing individual methods and synergies between different methods.
Conclusion
Support Vector Machines with a radial basis kernel, linear kernel or polynomial kernel of degree 2 all performed consistently well across data sets. We show that there is a synergistic relationship between these methods and gene selection based on the T-test and the selection of a relatively high number of genes. Also, we find that these methods benefit significantly from using normalized data, although it is hard to draw general conclusions about the relative performance of different normalization procedures.
doi:10.1186/1471-2105-12-390
PMCID: PMC3229535  PMID: 21982277
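The double cross validation mentioned in the abstract above can be sketched as a nested loop in which the choice among candidate settings (for example, the number of selected genes) is made on inner folds only, and the outer test fold is used solely for the final error estimate. This is a single unshuffled pass under assumed fold counts, not the study's repeated protocol; `fit_and_score` is a hypothetical callback supplied by the caller.

```python
def k_folds(indices, k):
    """Split a list of indices into k nearly equal, disjoint folds."""
    return [indices[i::k] for i in range(k)]


def double_cross_validation(n_samples, settings, fit_and_score,
                            outer_k=5, inner_k=4):
    """Estimate error with model selection confined to the inner loop.

    fit_and_score(setting, train_idx, test_idx) must return an error rate.
    """
    all_idx = list(range(n_samples))
    outer_errors = []
    for test_idx in k_folds(all_idx, outer_k):
        held_out = set(test_idx)
        train_idx = [i for i in all_idx if i not in held_out]

        def inner_error(setting):
            # Mean validation error of one setting on the inner folds.
            errs = []
            for val_idx in k_folds(train_idx, inner_k):
                val = set(val_idx)
                fit_idx = [i for i in train_idx if i not in val]
                errs.append(fit_and_score(setting, fit_idx, val_idx))
            return sum(errs) / len(errs)

        best = min(settings, key=inner_error)  # chosen without the test fold
        outer_errors.append(fit_and_score(best, train_idx, test_idx))
    return sum(outer_errors) / len(outer_errors)
```

Keeping the selection step inside the outer training data is what prevents the reported error rate from being optimistically biased by the tuning itself.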
6.  A Geometrical Perspective for the Bargaining Problem 
PLoS ONE  2010;5(4):e10331.
A new treatment to determine the Pareto-optimal outcome for a non-zero-sum game is presented. An equilibrium point for any game is defined here as a set of strategy choices for the players, such that no change in the choice of any single player will increase the overall payoff of all the players. Determining equilibrium for multi-player games is a complex problem. An intuitive conceptual tool for reducing the complexity, via the idea of spatially representing strategy options in the bargaining problem, is proposed. Based on this geometry, an equilibrium condition is established such that the product of their gains over what each receives is maximal. The geometrical analysis of a cooperative bargaining game provides an example for solving multi-player and non-zero-sum games efficiently.
doi:10.1371/journal.pone.0010331
PMCID: PMC2859940  PMID: 20436675
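The "product of gains is maximal" condition in the abstract above resembles the classical Nash bargaining product. The sketch below uses that classical formulation as a stand-in over a finite list of outcomes; it is not the paper's geometric construction, and the disagreement point and function name are assumptions.

```python
def nash_bargaining_choice(outcomes, disagreement):
    """Return the feasible outcome maximizing the product of gains.

    outcomes is a list of payoff tuples, one entry per player;
    disagreement holds each player's payoff if no agreement is reached.
    """
    def product_of_gains(payoffs):
        prod = 1.0
        for u, d in zip(payoffs, disagreement):
            prod *= (u - d)
        return prod

    # Only outcomes that leave no player worse off than disagreement count.
    feasible = [o for o in outcomes
                if all(u >= d for u, d in zip(o, disagreement))]
    return max(feasible, key=product_of_gains)
```

For example, with outcomes `(4, 1)`, `(3, 3)`, `(1, 4)` and a disagreement point of `(1, 1)`, the balanced outcome `(3, 3)` has the largest product of gains.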
7.  Rapid and Accurate Prediction and Scoring of Water Molecules in Protein Binding Sites 
PLoS ONE  2012;7(3):e32036.
Water plays a critical role in ligand-protein interactions. However, it is still challenging to predict accurately not only where water molecules prefer to bind, but also which of those water molecules might be displaceable. The latter is often seen as a route to optimizing affinity of potential drug candidates. Using a protocol we call WaterDock, we show that the freely available AutoDock Vina tool can be used to predict accurately the binding sites of water molecules. WaterDock was validated using data from X-ray crystallography, neutron diffraction and molecular dynamics simulations and correctly predicted 97% of the water molecules in the test set. In addition, we combined data-mining, heuristic and machine learning techniques to develop probabilistic water molecule classifiers. When applied to WaterDock predictions in the Astex Diverse Set of protein ligand complexes, we could identify whether a water molecule was conserved or displaced to an accuracy of 75%. A second model predicted whether water molecules were displaced by polar groups or by non-polar groups to an accuracy of 80%. These results should prove useful for anyone wishing to undertake rational design of new compounds where the displacement of water molecules is being considered as a route to improved affinity.
doi:10.1371/journal.pone.0032036
PMCID: PMC3291545  PMID: 22396746
8.  Evolution of learned strategy choice in a frequency-dependent game 
In frequency-dependent games, strategy choice may be innate or learned. While experimental evidence in the producer–scrounger game suggests that learned strategy choice may be common, a recent theoretical analysis demonstrated that learning by only some individuals prevents learning from evolving in others. Here, however, we model learning explicitly, and demonstrate that learning can easily evolve in the whole population. We used an agent-based evolutionary simulation of the producer–scrounger game to test the success of two general learning rules for strategy choice. We found that learning was eventually acquired by all individuals under a sufficient degree of environmental fluctuation, and when players were phenotypically asymmetric. In the absence of sufficient environmental change or phenotypic asymmetries, the correct target for learning seems to be confounded by game dynamics, and innate strategy choice is likely to be fixed in the population. The results demonstrate that under biologically plausible conditions, learning can easily evolve in the whole population and that phenotypic asymmetry is important for the evolution of learned strategy choice, especially in a stable or mildly changing environment.
doi:10.1098/rspb.2011.1734
PMCID: PMC3267151  PMID: 21937494
social foraging; producer–scrounger game; phenotypic asymmetries; evolutionary simulation
9.  Combining classifiers to predict gene function in Arabidopsis thaliana using large-scale gene expression measurements 
BMC Bioinformatics  2007;8:358.
Background
Arabidopsis thaliana is the model species of current plant genomic research with a genome size of 125 Mb and approximately 28,000 genes. The function of half of these genes is currently unknown. The purpose of this study is to infer gene function in Arabidopsis using machine-learning algorithms applied to large-scale gene expression data sets, with the goal of identifying genes that are potentially involved in plant response to abiotic stress.
Results
Using in house and publicly available data, we assembled a large set of gene expression measurements for A. thaliana. Using those genes of known function, we first evaluated and compared the ability of basic machine-learning algorithms to predict which genes respond to stress. Predictive accuracy was measured using ROC50 and precision curves derived through cross validation. To improve accuracy, we developed a method for combining these classifiers using a weighted-voting scheme. The combined classifier was then trained on genes of known function and applied to genes of unknown function, identifying genes that potentially respond to stress. Visual evidence corroborating the predictions was obtained using electronic Northern analysis. Three of the predicted genes were chosen for biological validation. Gene knockout experiments confirmed that all three are involved in a variety of stress responses. The biological analysis of one of these genes (At1g16850) is presented here, where it is shown to be necessary for the normal response to temperature and NaCl.
Conclusion
Supervised learning methods applied to large-scale gene expression measurements can be used to predict gene function. However, the ability of basic learning methods to predict stress response varies widely and depends heavily on how much dimensionality reduction is used. Our method of combining classifiers can improve the accuracy of such predictions – in this case, predictions of genes involved in stress response in plants – and it effectively chooses the appropriate amount of dimensionality reduction automatically. The method provides a useful means of identifying genes in A. thaliana that potentially respond to stress, and we expect it would be useful in other organisms and for other gene functions.
doi:10.1186/1471-2105-8-358
PMCID: PMC2213690  PMID: 17888165
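A weighted-voting combination of classifiers, as described in the abstract above, can be sketched as a weighted mean of per-classifier scores. How the weights were derived in the study is not stated in the abstract; here they are simply passed in (for instance, cross-validated accuracies), and the threshold is an assumed parameter.

```python
def weighted_vote(scores, weights, threshold=0.5):
    """Combine per-classifier scores in [0, 1] into one decision.

    Returns (combined_score, predicted_label), where the label is True
    when the weighted mean score reaches the threshold.
    """
    total = sum(weights)
    combined = sum(w * s for w, s in zip(weights, scores)) / total
    return combined, combined >= threshold
```

With scores `[1.0, 0.0, 1.0]` and weights `[0.5, 0.25, 0.25]`, the combined score is 0.75 and the gene would be labeled as stress-responsive under the default threshold.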
10.  Dana-Farber Repository for Machine Learning in Immunology 
Journal of immunological methods  2011;374(1-2):18-25.
The immune system is characterized by high combinatorial complexity that necessitates the use of specialized computational tools for analysis of immunological data. Machine learning (ML) algorithms are used in combination with classical experimentation for the selection of vaccine targets and in computational simulations that reduce the number of necessary experiments. The development of ML algorithms requires standardized data sets, consistent measurement methods, and uniform scales. To bridge the gap between the immunology community and the ML community, we designed a repository for machine learning in immunology named Dana-Farber Repository for Machine Learning in Immunology (DFRMLI). This repository provides standardized data sets of HLA-binding peptides with all binding affinities mapped onto a common scale. It also provides a list of experimentally validated naturally processed T cell epitopes derived from tumor or virus antigens. The DFRMLI data were preprocessed and ensure consistency, comparability, detailed descriptions, and statistically meaningful sample sizes for peptides that bind to various HLA molecules. The repository is accessible at http://bio.dfci.harvard.edu/DFRMLI/.
doi:10.1016/j.jim.2011.07.007
PMCID: PMC3249226  PMID: 21782820
11.  A New Data Mining Scheme Using Artificial Neural Networks 
Sensors (Basel, Switzerland)  2011;11(5):4622-4647.
Classification is one of the data mining problems receiving enormous attention in the database community. Although artificial neural networks (ANNs) have been successfully applied in a wide range of machine learning applications, they are often regarded as black boxes, i.e., their predictions cannot be explained. To enhance the explanation of ANNs, a novel algorithm to extract symbolic rules from ANNs is proposed in this paper. ANN methods have not been effectively utilized for data mining tasks because how the classifications were made is not explicitly stated as symbolic rules that are suitable for verification or interpretation by human experts. With the proposed approach, concise symbolic rules with high accuracy, that are easily explainable, can be extracted from the trained ANNs. Extracted rules are comparable with other methods in terms of number of rules, average number of conditions for a rule, and the accuracy. The effectiveness of the proposed approach is clearly demonstrated by the experimental results on a set of benchmark data mining classification problems.
doi:10.3390/s110504622
PMCID: PMC3231400  PMID: 22163866
data mining; neural networks; symbolic rules; weight freezing; constructive algorithm; pruning; clustering; rule extraction
12.  Anomaly and Signature Filtering Improve Classifier Performance For Detection Of Suspicious Access To EHRs 
Our objective is to facilitate semi-automated detection of suspicious access to EHRs. Previously we have shown that a machine learning method can play a role in identifying potentially inappropriate access to EHRs. However, the problem of sampling informative instances to build a classifier still remained. We developed an integrated filtering method leveraging both anomaly detection based on symbolic clustering and signature detection, a rule-based technique. We applied the integrated filtering to 25.5 million access records in an intervention arm, and compared this with 8.6 million access records in a control arm where no filtering was applied. On the training set with cross-validation, the AUC was 0.960 in the control arm and 0.998 in the intervention arm. The difference in false negative rates on the independent test set was significant, P=1.6×10⁻⁶. Our study suggests that utilization of integrated filtering strategies to facilitate the construction of classifiers can be helpful.
PMCID: PMC3243249  PMID: 22195129
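The AUC values quoted in the abstract above measure how well classifier scores rank positive cases above negative ones. A minimal sketch of that statistic, computed directly as a pairwise comparison, is given below; it is a generic illustration and not the study's evaluation code.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    (label 1) scores above a randomly chosen negative one (label 0),
    with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This quadratic-time form is adequate for small examples; for millions of records a rank-based computation would be preferable.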
13.  Applications of Machine Learning in Cancer Prediction and Prognosis 
Cancer Informatics  2007;2:59-77.
Machine learning is a branch of artificial intelligence that employs a variety of statistical, probabilistic and optimization techniques that allow computers to “learn” from past examples and to detect hard-to-discern patterns from large, noisy or complex data sets. This capability is particularly well-suited to medical applications, especially those that depend on complex proteomic and genomic measurements. As a result, machine learning is frequently used in cancer diagnosis and detection. More recently machine learning has been applied to cancer prognosis and prediction. This latter approach is particularly interesting as it is part of a growing trend towards personalized, predictive medicine. In assembling this review we conducted a broad survey of the different types of machine learning methods being used, the types of data being integrated and the performance of these methods in cancer prediction and prognosis. A number of trends are noted, including a growing dependence on protein biomarkers and microarray data, a strong bias towards applications in prostate and breast cancer, and a heavy reliance on “older” technologies such as artificial neural networks (ANNs) instead of more recently developed or more easily interpretable machine learning methods. A number of published studies also appear to lack an appropriate level of validation or testing. Among the better designed and validated studies it is clear that machine learning methods can be used to substantially (15–25%) improve the accuracy of predicting cancer susceptibility, recurrence and mortality. At a more fundamental level, it is also evident that machine learning is helping to improve our basic understanding of cancer development and progression.
PMCID: PMC2675494  PMID: 19458758
Cancer; machine learning; prognosis; risk; prediction
14.  Spatial analysis of plague in California: niche modeling predictions of the current distribution and potential response to climate change 
Background
Plague, caused by the bacterium Yersinia pestis, is a public and wildlife health concern in California and the western United States. This study explores the spatial characteristics of positive plague samples in California and tests Maxent, a machine-learning method that can be used to develop niche-based models from presence-only data, for mapping the potential distribution of plague foci. Maxent models were constructed using geocoded seroprevalence data from surveillance of California ground squirrels (Spermophilus beecheyi) as case points and Worldclim bioclimatic data as predictor variables, and compared and validated using area under the receiver operating curve (AUC) statistics. Additionally, model results were compared to locations of positive and negative coyote (Canis latrans) samples, in order to determine the correlation between Maxent model predictions and areas of plague risk as determined via wild carnivore surveillance.
Results
Models of plague activity in California ground squirrels, based on recent climate conditions, accurately identified case locations (AUC of 0.913 to 0.948) and were significantly correlated with coyote samples. The final models were used to identify potential plague risk areas based on an ensemble of six future climate scenarios. These models suggest that by 2050, climate conditions may reduce plague risk in the southern parts of California and increase risk along the northern coast and Sierras.
Conclusion
Because different modeling approaches can yield substantially different results, care should be taken when interpreting future model predictions. Nonetheless, niche modeling can be a useful tool for exploring and mapping the potential response of plague activity to climate change. The final models in this study were used to identify potential plague risk areas based on an ensemble of six future climate scenarios, which can help public managers decide where to allocate surveillance resources. In addition, Maxent model results were significantly correlated with coyote samples, indicating that carnivore surveillance programs will continue to be important for tracking the response of plague to future climate conditions.
doi:10.1186/1476-072X-8-38
PMCID: PMC2716330  PMID: 19558717
15.  An Exhaustive, Non-Euclidean, Non-Parametric Data Mining Tool for Unraveling the Complexity of Biological Systems – Novel Insights into Malaria 
PLoS ONE  2011;6(9):e24085.
Complex, high-dimensional data sets pose significant analytical challenges in the post-genomic era. Such data sets are not exclusive to genetic analyses and are also pertinent to epidemiology. There has been considerable effort to develop hypothesis-free data mining and machine learning methodologies. However, current methodologies lack exhaustivity and general applicability. Here we use a novel non-parametric, non-euclidean data mining tool, HyperCube®, to explore exhaustively a complex epidemiological malaria data set by searching for over density of events in m-dimensional space. Hotspots of over density correspond to strings of variables, rules, that determine, in this case, the occurrence of Plasmodium falciparum clinical malaria episodes. The data set contained 46,837 outcome events from 1,653 individuals and 34 explanatory variables. The best predictive rule contained 1,689 events from 148 individuals and was defined as: individuals present during 1992–2003, aged 1–5 years old, having hemoglobin AA, and having had previous Plasmodium malariae malaria parasite infection ≤10 times. These individuals had 3.71 times more P. falciparum clinical malaria episodes than the general population. We validated the rule in two different cohorts. We compared and contrasted the HyperCube® rule with the rules using variables identified by both traditional statistical methods and non-parametric regression tree methods. In addition, we tried all possible sub-stratified quantitative variables. No other model with equal or greater representativity gave a higher Relative Risk. Although three of the four variables in the rule were intuitive, the effect of number of P. malariae episodes was not. HyperCube® efficiently sub-stratified quantitative variables to optimize the rule and was able to identify interactions among the variables, tasks not easy to perform using standard data mining methods. 
Search of local over density in m-dimensional space, explained by easily interpretable rules, is thus seemingly ideal for generating hypotheses for large datasets to unravel the complexity inherent in biological systems.
doi:10.1371/journal.pone.0024085
PMCID: PMC3170284  PMID: 21931645
16.  Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information 
BMC Bioinformatics  2007;8:201.
Background
Structural properties of proteins such as secondary structure and solvent accessibility contribute to three-dimensional structure prediction, not only in the ab initio case but also when homology information to known structures is available. Structural properties are also routinely used in protein analysis even when homology is available, largely because homology modelling is lower throughput than, say, secondary structure prediction. Nonetheless, predictors of secondary structure and solvent accessibility are virtually always ab initio.
Results
Here we develop high-throughput machine learning systems for the prediction of protein secondary structure and solvent accessibility that exploit homology to proteins of known structure, where available, in the form of simple structural frequency profiles extracted from sets of PDB templates. We compare these systems to their state-of-the-art ab initio counterparts, and with a number of baselines in which secondary structures and solvent accessibilities are extracted directly from the templates. We show that structural information from templates greatly improves secondary structure and solvent accessibility prediction quality, and that, on average, the systems significantly enrich the information contained in the templates. For sequence similarity exceeding 30%, secondary structure prediction quality is approximately 90%, close to its theoretical maximum, and 2-class solvent accessibility roughly 85%. Gains are robust with respect to template selection noise, and significant for marginal sequence similarity and for short alignments, supporting the claim that these improved predictions may prove beneficial beyond the case in which clear homology is available.
Conclusion
The predictive systems are publicly available at the address .
doi:10.1186/1471-2105-8-201
PMCID: PMC1913928  PMID: 17570843
17.  Genomic data sampling and its effect on classification performance assessment 
BMC Bioinformatics  2003;4:5.
Background
Supervised classification is fundamental in bioinformatics. Machine learning models, such as neural networks, have been applied to discover genes and expression patterns. This process is achieved by implementing training and test phases. In the training phase, a set of cases and their respective labels are used to build a classifier. During testing, the classifier is used to predict new cases. One approach to assessing its predictive quality is to estimate its accuracy during the test phase. Key limitations appear when dealing with small-data samples. This paper investigates the effect of data sampling techniques on the assessment of neural network classifiers.
Results
Three data sampling techniques were studied: Cross-validation, leave-one-out, and bootstrap. These methods are designed to reduce the bias and variance of small-sample estimations. Two prediction problems based on small-sample sets were considered: Classification of microarray data originating from a leukemia study and from small, round blue-cell tumours. A third problem, the prediction of splice-junctions, was analysed to perform comparisons. Different accuracy estimations were produced for each problem. The variations are accentuated in the small-data samples. The quality of the estimates depends on the number of train-test experiments and the amount of data used for training the networks.
Conclusion
The predictive quality assessment of biomolecular data classifiers depends on the data size, sampling techniques and the number of train-test experiments. Conservative and optimistic accuracy estimations can be obtained by applying different methods. Guidelines are suggested to select a sampling technique according to the complexity of the prediction problem under consideration.
doi:10.1186/1471-2105-4-5
PMCID: PMC149349  PMID: 12553886
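The three sampling techniques compared in the abstract above differ only in how train and test indices are drawn. The sketch below generates those index sets with the standard library; the fold assignment (every k-th sample) and the out-of-bag test set for the bootstrap are common conventions assumed here, not details taken from the paper.

```python
import random


def k_fold_splits(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    for f in range(k):
        test = idx[f::k]
        held = set(test)
        yield [i for i in idx if i not in held], test


def leave_one_out_splits(n):
    """Leave-one-out is k-fold cross-validation with k equal to n."""
    return k_fold_splits(n, n)


def bootstrap_split(n, rng):
    """Draw n training indices with replacement; test on the unseen ones."""
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    return train, [i for i in range(n) if i not in drawn]
```

Repeating `bootstrap_split` with a seeded `random.Random` instance gives the multiple train-test experiments whose number the abstract identifies as affecting the quality of the accuracy estimate.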
18.  Analysis of Two-Player Quantum Games in an EPR Setting Using Clifford's Geometric Algebra 
PLoS ONE  2012;7(1):e29015.
The framework for playing quantum games in an Einstein-Podolsky-Rosen (EPR) type setting is investigated using the mathematical formalism of geometric algebra (GA). The main advantage of this framework is that the players' strategy sets remain identical to the ones in the classical mixed-strategy version of the game, and hence the quantum game becomes a proper extension of the classical game, avoiding a criticism of other quantum game frameworks. We produce a general solution for two-player games, and as examples, we analyze the games of Prisoners' Dilemma and Stag Hunt in the EPR setting. The use of GA allows a quantum-mechanical analysis without the use of complex numbers or the Dirac Bra-ket notation, and hence is more accessible to the non-physicist.
doi:10.1371/journal.pone.0029015
PMCID: PMC3261139  PMID: 22279525
19.  Active machine learning for transmembrane helix prediction 
BMC Bioinformatics  2010;11(Suppl 1):S58.
Background
About 30% of genes code for membrane proteins, which are involved in a wide variety of crucial biological functions. Despite their importance, experimentally determined structures correspond to only about 1.7% of protein structures deposited in the Protein Data Bank due to the difficulty in crystallizing membrane proteins. Algorithms that can identify proteins whose high-resolution structure can aid in predicting the structure of many previously unresolved proteins are therefore of potentially high value. Active machine learning is a supervised machine learning approach which is suitable for this domain where there are a large number of sequences but only very few have known corresponding structures. In essence, active learning seeks to identify proteins whose structure, if revealed experimentally, is maximally predictive of others.
Results
An active learning approach is presented for selection of a minimal set of proteins whose structures can aid in the determination of transmembrane helices for the remaining proteins. TMpro, an algorithm for high accuracy TM helix prediction we previously developed, is coupled with active learning. We show that with a well-designed selection procedure, high accuracy can be achieved with only a few proteins. TMpro, trained with a single protein, achieved an F-score of 94% on benchmark evaluation and 91% on the MPtopo dataset, which correspond to the state-of-the-art accuracies on TM helix prediction that are usually achieved by training with over 100 training proteins.
Conclusion
Active learning is suitable for bioinformatics applications, where manually characterized data are not a comprehensive representation of all possible data, and in fact can be a very sparse subset thereof. It aids in selection of data instances which when characterized experimentally can improve the accuracy of computational characterization of remaining raw data. The results presented here also demonstrate that the feature extraction method of TMpro is well designed, achieving a very good separation between TM and non TM segments.
doi:10.1186/1471-2105-11-S1-S58
PMCID: PMC3009531  PMID: 20122233
20.  Using Rule-Based Machine Learning for Candidate Disease Gene Prioritization and Sample Classification of Cancer Gene Expression Data 
PLoS ONE  2012;7(7):e39932.
Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL’s classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.
doi:10.1371/journal.pone.0039932
PMCID: PMC3394775  PMID: 22808075
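The pointwise mutual information used in entry 20 to score gene rankings against disease terms can be illustrated with a small sketch. The function name `pmi` and the count-based inputs are assumptions made for illustration; the abstract does not say how the authors estimated the probabilities.

```python
import math


def pmi(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
    """Pointwise mutual information (in bits) from co-occurrence counts.

    n_xy    documents mentioning both terms (e.g. a gene and a disease)
    n_x     documents mentioning the first term
    n_y     documents mentioning the second term
    n_total documents in the corpus

    PMI = log2( p(x, y) / (p(x) * p(y)) ), with each probability
    estimated as a simple relative frequency (an assumption).
    """
    if min(n_x, n_y, n_total) <= 0:
        raise ValueError("marginal counts and corpus size must be positive")
    if n_xy == 0:
        # The terms never co-occur, so the ratio is zero.
        return float("-inf")
    # Algebraically equal to p(x, y) / (p(x) * p(y)).
    ratio = (n_xy * n_total) / (n_x * n_y)
    return math.log2(ratio)


# Hypothetical counts: a gene name and a disease term in 1000 abstracts.
score = pmi(n_xy=10, n_x=20, n_y=50, n_total=1000)
```

A positive score means the two terms co-occur more often than independence would predict, a score of zero means they co-occur exactly as often, and a negative score means less often.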
21.  Dressing the mind properly for the game. 
Game theory as a theoretical and empirical approach to interaction has spread from economics to psychology, political science, sociology and biology. Numerous social interactions (foraging, talking, trusting, coordinating, competing) can be formally represented in a game with specific rules and strategies. These same interactions seem to rely on an interweaving of mental selves, but an effective strategy need not depend on explicit strategizing and higher mental capabilities, as less sentient creatures or even lines of software can play similar games. Human players are distinct because we are less consistent and our choices respond to elements of the setting that appear to be strategically insignificant. Recent analyses of this variable response have yielded a number of insights into the mental approach of human players: we often mentalize, but not always; we are endowed with social preferences; we distinguish among various types of opponents; we manifest different personalities; we are often guided by security concerns; and our strategic sophistication is usually modest.
doi:10.1098/rstb.2002.1246
PMCID: PMC1693129  PMID: 12689383
22.  Combining Shapley value and statistics to the analysis of gene expression data in children exposed to air pollution 
BMC Bioinformatics  2008;9:361.
Background
In gene expression analysis, statistical tests for differential gene expression provide lists of candidate genes having, individually, a sufficiently low p-value. However, the interpretation of each single p-value within complex systems involving several interacting genes is problematic. In parallel, in the last sixty years, game theory has been applied to political and social problems to assess the power of interacting agents in forcing a decision and, more recently, to represent the relevance of genes in response to certain conditions.
Results
In this paper we introduce a Bootstrap procedure to test the null hypothesis that each gene has the same relevance between two conditions, where the relevance is represented by the Shapley value of a particular coalitional game defined on a microarray data-set. This method, which is called Comparative Analysis of Shapley value (shortly, CASh), is applied to data concerning the gene expression in children differentially exposed to air pollution. The results provided by CASh are compared with the results from a parametric statistical test for testing differential gene expression. Both lists of genes provided by CASh and t-test are informative enough to discriminate exposed subjects on the basis of their gene expression profiles. While many genes are selected in common by CASh and the parametric test, it turns out that the biological interpretation of the differences between these two selections is more interesting, suggesting a different interpretation of the main biological pathways in gene expression regulation for exposed individuals. A simulation study suggests that CASh offers more power than t-test for the detection of differential gene expression variability.
Conclusion
CASh is successfully applied to gene expression analysis of a data-set where the joint expression behavior of genes may be critical to characterize the expression response to air pollution. We demonstrate a synergistic effect between coalitional games and statistics that resulted in a selection of genes with a potential impact in the regulation of complex pathways.
doi:10.1186/1471-2105-9-361
PMCID: PMC2556684  PMID: 18764936
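The Shapley value that entry 22 uses to represent gene relevance can be computed exactly for a small coalitional game by averaging each player's marginal contribution over all orderings. The sketch below is a generic illustration, not the CASh procedure: the function `shapley_values`, the toy game `toy_value`, and the gene labels are all hypothetical, and the enumeration is only practical for a handful of players.

```python
from fractions import Fraction
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by enumerating every ordering of the players.

    `value` maps a frozenset of players to the worth of that coalition.
    The cost grows factorially, so this is for toy games only.
    """
    players = list(players)
    totals = {p: Fraction(0) for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += Fraction(value(with_p)) - Fraction(value(coalition))
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical game: a coalition 'wins' (worth 1) only if it contains
    g1 together with at least one of g2 or g3."""
    return 1 if "g1" in coalition and ({"g2", "g3"} & coalition) else 0


phi = shapley_values(["g1", "g2", "g3"], toy_value)
```

In this toy game `g1` receives the largest share because no winning coalition can form without it, and the shares sum to the worth of the full coalition, which is the efficiency property of the Shapley value.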
23.  Predicting deleterious nsSNPs: an analysis of sequence and structural attributes 
BMC Bioinformatics  2006;7:217.
Background
There has been an explosion in the number of single nucleotide polymorphisms (SNPs) within public databases. In this study we focused on non-synonymous protein coding single nucleotide polymorphisms (nsSNPs), some associated with disease and others which are thought to be neutral. We describe the distribution of both types of nsSNPs using structural and sequence based features and assess the relative value of these attributes as predictors of function using machine learning methods. We also address the common problem of balance within machine learning methods and show the effect of imbalance on nsSNP function prediction. We show that nsSNP function prediction can be significantly improved by 100% undersampling of the majority class. The learnt rules were then applied to make predictions of function on all nsSNPs within Ensembl.
Results
The measure of prediction success is greatly affected by the level of imbalance in the training dataset. We found the balanced dataset that included all attributes produced the best prediction. The performance as measured by the Matthews correlation coefficient (MCC) varied between 0.49 and 0.25 depending on the imbalance. As previously observed, the degree of sequence conservation at the nsSNP position is the single most useful attribute. In addition to conservation, structural predictions made using a balanced dataset can be of value.
Conclusion
The predictions for all nsSNPs within Ensembl, based on a balanced dataset using all attributes, are available as a DAS annotation. Instructions for adding the track to Ensembl are at
doi:10.1186/1471-2105-7-217
PMCID: PMC1489951  PMID: 16630345
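Entry 23 reports performance as a Matthews correlation coefficient (MCC) and balances the training data by undersampling the majority class. The sketch below shows both ideas in their generic form. The helper names, the label strings, and the reading of balancing as "keep as many majority examples as there are minority examples" are assumptions for illustration, not details taken from the paper.

```python
import math
import random
from collections import Counter


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional value when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def undersample_majority(labels, seed=0):
    """Return sorted indices of a class-balanced subset of `labels`.

    Every class is randomly reduced to the size of the smallest class.
    """
    rng = random.Random(seed)
    smallest = min(Counter(labels).values())
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, smallest))
    return sorted(kept)


# Hypothetical imbalanced label vector: 90 neutral, 10 disease-associated.
labels = ["neutral"] * 90 + ["disease"] * 10
balanced = undersample_majority(labels)
balanced_counts = Counter(labels[i] for i in balanced)
```

A classifier that always predicts the majority class has high accuracy on the imbalanced labels above but an MCC of zero, which is one reason MCC is a common choice for imbalanced problems.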
24.  Event discovery in medical time-series data. 
Vast amounts of clinical information are generated daily on patients in the health care setting. Increasingly, this information is collected and stored for its potential utility in advancing health care. Knowledge-based systems, for example, might be able to apply rules to the collected data to determine whether a patient has a certain condition. Often, however, the underlying knowledge needed to write such rules is not well understood. How could these clinical data be useful then? Use of machine learning is one answer. We present a pipeline for discovering the knowledge needed for event detection in medical time-series data. We demonstrate how this process can be applied in the development of intelligent patient monitoring for the intensive care unit (ICU). Specifically, we develop a system for detecting "true alarm" situations in the ICU, where currently as many as 86% of bedside monitor alarms are false.
PMCID: PMC2243881  PMID: 11080006
25.  Playing cards on asthma management: A new interactive method for knowledge transfer to primary care physicians 
OBJECTIVES:
To describe an interactive playing card workshop in the communication of asthma guidelines recommendations, and to assess the initial evaluation of this educational tool by family physicians.
DESIGN:
Family physicians were invited to participate in the workshop by advertisements or personal contacts. Each physician completed a standardized questionnaire on his or her perception of the rules, content and properties of the card game.
SETTING:
A university-based continuing medical education initiative.
PARTICIPANTS:
Primary care physicians.
MAIN OUTCOME MEASURES:
Physicians’ evaluation of the rules, content and usefulness of the program.
RESULTS:
The game allowed the communication of relevant asthma-related content, as well as experimentation with a different learning format. It also stimulated interaction in a climate of friendly competition. Participating physicians considered the method to be an innovative tool that facilitated reflection, interaction and learning. It generated relevant discussions on how to apply guideline recommendations to current asthma care.
CONCLUSIONS:
This new, interactive, educational intervention, integrating play and scientific components, was well received by participants. This method may be of value to help integrate current guidelines into current practice, thus facilitating knowledge transfer to caregivers.
PMCID: PMC2677773  PMID: 18060093
Asthma; Game-based learning; Knowledge implementation; Medical education