The availability of large collections of microarray datasets (compendia), or knowledge about grouping of genes into pathways (gene sets), is typically not exploited when training predictors of disease outcome. These can be useful since a compendium increases the number of samples, while gene sets reduce the size of the feature space. This should be favorable from a machine learning perspective and result in more robust predictors.
We extracted modules of regulated genes from gene sets and compendia. Through supervised analysis, we constructed predictors which employ modules predictive of breast cancer outcome. To validate these predictors, we applied them to independent data from the same institution (intra-dataset) and from other institutions (inter-dataset).
We show that modules derived from single breast cancer datasets achieve better performance on the validation data compared to gene-based predictors. We also show that there is a trend in compendium specificity and predictive performance: modules derived from a single breast cancer dataset, and a breast cancer specific compendium perform better compared to those derived from a human cancer compendium. Additionally, the module-based predictor provides a much richer insight into the underlying biology. Frequently selected gene sets are associated with processes such as cell cycle, E2F regulation, DNA damage response, proteasome and glycolysis. We analyzed two modules related to cell cycle, and the OCT1 transcription factor, respectively. On an individual basis, these modules provide a significant separation in survival subgroups on the training and independent validation data.
The increasing challenges and complexity of business environments make it more difficult for entrepreneurs to predict the outcomes of business decisions and operations. Therefore, we developed a decision support scheme that can be used and adapted to various business decision processes. These involve decisions made under uncertain situations, such as business competition in the market or wage negotiation within a firm. The scheme uses game strategies and fuzzy inference concepts to effectively grasp the variables in these uncertain situations. The games are played between human and fuzzy players. The accuracy of the fuzzy rule base and the game strategies helps to mitigate the adverse effects that a business may suffer from these uncertain factors. We also introduced learning, which enables the fuzzy player to adapt over time. We tested this scheme in different scenarios and found that it could be an invaluable tool in the hands of entrepreneurs operating in uncertain and competitive business environments.
Fuzzy logic; Membership functions; Zero sum; Decision; Business games; Game theory
Lung cancer is among the most common cancers in the United States, in terms of incidence and mortality. In 2009, it is estimated that more than 150,000 deaths will result from lung cancer alone. Genetic information is an extremely valuable data source in characterizing the personal nature of cancer. Over the past several years, investigators have conducted numerous association studies where intensive genetic data is collected on relatively few patients compared to the numbers of gene predictors, with one scientific goal being to identify genetic features associated with cancer recurrence or survival. In this note, we propose high-dimensional survival analysis through a new application of boosting, a powerful tool in machine learning. Our approach is based on an accelerated lifetime model and minimizing the sum of pairwise differences in residuals. We apply our method to a recent microarray study of lung adenocarcinoma and find that our ensemble is composed of 19 genes while a proportional hazards (PH) ensemble is composed of nine genes, a proper subset of the 19-gene panel. In one of our simulation scenarios, we demonstrate that PH boosting in a misspecified model tends to underfit and ignore moderately-sized covariate effects, on average. Diagnostic analyses suggest that the PH assumption is not satisfied in the microarray data and may explain, in part, the discrepancy in the sets of active coefficients. Our simulation studies and comparative data analyses demonstrate how statistical learning by PH models alone is insufficient.
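The criterion of minimizing a sum of pairwise differences in residuals can be illustrated with a Gehan-type rank loss for an accelerated lifetime model. The sketch below is a minimal illustration under that assumption; the abstract does not give the exact loss, and the name `gehan_loss` and its arguments are hypothetical.

```python
def gehan_loss(log_times, events, predictions):
    """Gehan-type loss for an accelerated failure time model (assumed form).

    Residuals are e = log(time) - prediction. For every pair (i, j) where
    subject i has an observed event, the pair contributes max(0, e_j - e_i).
    The sum is scaled by 1 / n^2.
    """
    residuals = [t - p for t, p in zip(log_times, predictions)]
    n = len(residuals)
    total = 0.0
    for i in range(n):
        if not events[i]:
            continue  # censored subjects only appear as the partner j
        for j in range(n):
            total += max(0.0, residuals[j] - residuals[i])
    return total / (n * n)
```

A boosting procedure would repeatedly update `predictions` along the covariate that most reduces this loss; that step is omitted here.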
Accelerated failure time; Boosting; Lasso; Proportional hazards regression; Survival analysis
Machine learning is a powerful approach for describing and predicting classes in microarray data. Although several comparative studies have investigated the relative performance of various machine learning methods, these often do not account for the fact that performance (e.g. error rate) is a result of a series of analysis steps of which the most important are data normalization, gene selection and machine learning.
In this study, we used seven previously published cancer-related microarray data sets to compare the effects on classification performance of five normalization methods, three gene selection methods with 21 different numbers of selected genes and eight machine learning methods. Performance in terms of error rate was rigorously estimated by repeatedly employing a double cross validation approach. Since performance varies greatly between data sets, we devised an analysis method that first compares methods within individual data sets and then visualizes the comparisons across data sets. We discovered both well performing individual methods and synergies between different methods.
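A double (nested) cross validation can be sketched as follows: an inner loop chooses the number of selected genes using training data only, and an outer loop estimates the error rate on data that played no part in that choice. This is a simplified illustration, not the study's pipeline. As stated assumptions, a mean-difference gene ranking stands in for the T-test, a nearest-centroid rule stands in for the Support Vector Machines, labels are binary (0/1), and all function names are hypothetical.

```python
import random
import statistics

def stratified_folds(rows, y, k, rng):
    """Split row indices into k folds with both classes spread across folds."""
    folds = [[] for _ in range(k)]
    for label in (0, 1):
        members = [r for r in rows if y[r] == label]
        rng.shuffle(members)
        for pos, r in enumerate(members):
            folds[pos % k].append(r)
    return folds

def select_genes(X, y, rows, n_genes):
    """Rank genes by absolute difference of class means on the given rows."""
    def score(g):
        a = [X[r][g] for r in rows if y[r] == 0]
        b = [X[r][g] for r in rows if y[r] == 1]
        return abs(statistics.mean(a) - statistics.mean(b))
    return sorted(range(len(X[0])), key=score, reverse=True)[:n_genes]

def error_rate(X, y, train, test, n_genes):
    """Select genes and fit class centroids on train only; score on test."""
    genes = select_genes(X, y, train, n_genes)
    centroids = {
        c: [statistics.mean([X[r][g] for r in train if y[r] == c]) for g in genes]
        for c in (0, 1)
    }
    def predict(r):
        dist = {c: sum((X[r][g] - m) ** 2 for g, m in zip(genes, centroids[c]))
                for c in (0, 1)}
        return min(dist, key=dist.get)
    return sum(predict(r) != y[r] for r in test) / len(test)

def double_cv(X, y, gene_counts, k_outer=3, k_inner=3, seed=0):
    """Outer loop estimates error; inner loop picks the number of genes."""
    rng = random.Random(seed)
    rows = list(range(len(y)))
    outer_errors = []
    for test in stratified_folds(rows, y, k_outer, rng):
        held_out = set(test)
        train = [r for r in rows if r not in held_out]
        inner_folds = stratified_folds(train, y, k_inner, rng)

        def inner_error(n_genes):
            errors = []
            for val in inner_folds:
                val_set = set(val)
                fit = [r for r in train if r not in val_set]
                errors.append(error_rate(X, y, fit, val, n_genes))
            return statistics.mean(errors)

        best = min(gene_counts, key=inner_error)  # tuned without the outer test fold
        outer_errors.append(error_rate(X, y, train, test, best))
    return statistics.mean(outer_errors)
```

Gene selection is repeated inside every training split so that the held-out samples never influence which genes are chosen.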
Support Vector Machines with a radial basis kernel, linear kernel or polynomial kernel of degree 2 all performed consistently well across data sets. We show that there is a synergistic relationship between these methods and gene selection based on the T-test and the selection of a relatively high number of genes. Also, we find that these methods benefit significantly from using normalized data, although it is hard to draw general conclusions about the relative performance of different normalization procedures.
Holding on to one's strategy is natural and common if the latter warrants success and satisfaction. This goes against widespread simulation practices of evolutionary games, where players frequently consider changing their strategy even though their payoffs may be marginally different than those of the other players. Inspired by this observation, we introduce an aspiration-based win-stay-lose-learn strategy updating rule into the spatial prisoner's dilemma game. The rule is simple and intuitive, foreseeing strategy changes only by dissatisfied players, who then attempt to adopt the strategy of one of their nearest neighbors, while the strategies of satisfied players are not subject to change. We find that the proposed win-stay-lose-learn rule promotes the evolution of cooperation, and it does so very robustly and independently of the initial conditions. In fact, we show that even a minute initial fraction of cooperators may be sufficient to eventually secure a highly cooperative final state. In addition to extensive simulation results that support our conclusions, we also present results obtained by means of the pair approximation of the studied game. Our findings continue the success story of related win-stay strategy updating rules, and by doing so reveal new ways of resolving the prisoner's dilemma.
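The win-stay-lose-learn rule can be sketched on a square lattice with periodic boundaries. The abstract does not specify the payoff parametrization or the imitation probability, so the details here are assumptions: a weak prisoner's dilemma (reward 1, temptation `b`, other payoffs 0), a Fermi-type adoption probability with a `noise` parameter, and an `aspiration` threshold compared against the total payoff from four neighbors. All function names are hypothetical.

```python
import math
import random

def neighbors(i, j, size):
    """Four nearest neighbors on a lattice with periodic boundaries."""
    return [((i - 1) % size, j), ((i + 1) % size, j),
            (i, (j - 1) % size), (i, (j + 1) % size)]

def payoff(grid, i, j, b):
    """Total payoff of player (i, j); 1 = cooperate, 0 = defect.

    Assumed weak prisoner's dilemma: a cooperating neighbor yields 1 to a
    cooperator and b to a defector; a defecting neighbor yields 0.
    """
    total = 0.0
    for x, y in neighbors(i, j, len(grid)):
        if grid[x][y] == 1:
            total += 1.0 if grid[i][j] == 1 else b
    return total

def monte_carlo_step(grid, b, aspiration, noise, rng):
    """One asynchronous sweep of the win-stay-lose-learn rule (in place)."""
    size = len(grid)
    for _ in range(size * size):
        i, j = rng.randrange(size), rng.randrange(size)
        own = payoff(grid, i, j, b)
        if own >= aspiration:
            continue  # satisfied player: win-stay
        # Dissatisfied player: try to learn from one random neighbor.
        x, y = rng.choice(neighbors(i, j, size))
        other = payoff(grid, x, y, b)
        adopt_probability = 1.0 / (1.0 + math.exp((own - other) / noise))
        if rng.random() < adopt_probability:
            grid[i][j] = grid[x][y]
```

With `aspiration` set to zero every player is always satisfied and the lattice never changes, which is a convenient sanity check before exploring higher aspiration levels.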
Backward induction is a benchmark of game theoretic rationality, yet surprisingly little is known as to how humans discover and initially learn to apply this abstract solution concept in experimental settings. We use behavioral and functional magnetic resonance imaging (fMRI) data to study the way in which subjects playing in a sequential game of perfect information learn the optimal backward induction strategy for the game. Experimental data from our two studies support two main findings: First, subjects converge to a common process of recursive inference similar to the backward induction procedure for solving the game. The process is recursive because earlier insights and conclusions are used as inputs in later steps of the inference. This process is matched by a similar pattern in brain activation, which also proceeds backward, following the prediction error: brain activity initially codes the responses to losses in final positions; in later trials this activity shifts to the starting position. Second, the learning process is not exclusively cognitive, but instead combines experience-based learning and abstract reasoning. Critical experiences leading to the adoption of an improved solution strategy appear to be stimulated by brain activity in the reward system. This indicates that the negative affect induced by initial failures facilitates the switch to a different method of solving the problem. Abstract reasoning is combined with this response, and is expressed by activation in the ventrolateral prefrontal cortex. Differences in brain activation match differences in performance between subjects who show different learning speeds.
neuroeconomics; game theory; backward induction; learning; deductive reasoning
We build classification models and risk assessment tools for diabetes, hypertension and comorbidity using machine-learning algorithms on data from Kuwait. We model the increased proneness in diabetic patients to develop hypertension and vice versa. We ascertain the importance of ethnicity (and natives vs expatriate migrants) and of using regional data in risk assessment.
Retrospective cohort study. Four machine-learning techniques were used: logistic regression, k-nearest neighbours (k-NN), multifactor dimensionality reduction and support vector machines. The study uses fivefold cross validation to obtain generalisation accuracies and errors.
Kuwait Health Network (KHN) that integrates data from primary health centres and hospitals in Kuwait.
270 172 hospital visitors (of which, 89 858 are diabetic, 58 745 hypertensive and 30 522 comorbid) comprising Kuwaiti natives, Asian and Arab expatriates.
Incident type 2 diabetes, hypertension and comorbidity.
Classification accuracies of >85% (for diabetes) and >90% (for hypertension) are achieved using only simple non-laboratory-based parameters. Risk assessment tools based on k-NN classification models are able to assign ‘high’ risk to 75% of diabetic patients and to 94% of hypertensive patients. Only 5% of diabetic patients are seen assigned ‘low’ risk. Asian-specific models and assessments perform even better. Pathological conditions of diabetes in the general population or in hypertensive population and those of hypertension are modelled. Two-stage aggregate classification models and risk assessment tools, built combining both the component models on diabetes (or on hypertension), perform better than individual models.
Data on diabetes, hypertension and comorbidity from the cosmopolitan State of Kuwait are available for the first time. This enabled us to apply four different case–control models to assess risks. These tools aid in the preliminary non-intrusive assessment of the population. Ethnicity is seen significant to the predictive models. Risk assessments need to be developed using regional data as we demonstrate the applicability of the American Diabetes Association online calculator on data from Kuwait.
predictive models; Machine learning; Risk assessment; Kuwait
Water plays a critical role in ligand-protein interactions. However, it is still challenging to predict accurately not only where water molecules prefer to bind, but also which of those water molecules might be displaceable. The latter is often seen as a route to optimizing affinity of potential drug candidates. Using a protocol we call WaterDock, we show that the freely available AutoDock Vina tool can be used to predict accurately the binding sites of water molecules. WaterDock was validated using data from X-ray crystallography, neutron diffraction and molecular dynamics simulations and correctly predicted 97% of the water molecules in the test set. In addition, we combined data-mining, heuristic and machine learning techniques to develop probabilistic water molecule classifiers. When applied to WaterDock predictions in the Astex Diverse Set of protein ligand complexes, we could identify whether a water molecule was conserved or displaced to an accuracy of 75%. A second model predicted whether water molecules were displaced by polar groups or by non-polar groups to an accuracy of 80%. These results should prove useful for anyone wishing to undertake rational design of new compounds where the displacement of water molecules is being considered as a route to improved affinity.
A new treatment to determine the Pareto-optimal outcome for a non-zero-sum game is presented. An equilibrium point for any game is defined here as a set of strategy choices for the players, such that no change in the choice of any single player will increase the overall payoff of all the players. Determining equilibrium for multi-player games is a complex problem. An intuitive conceptual tool for reducing the complexity, via the idea of spatially representing strategy options in the bargaining problem is proposed. Based on this geometry, an equilibrium condition is established such that the product of their gains over what each receives is maximal. The geometrical analysis of a cooperative bargaining game provides an example for solving multi-player and non-zero-sum games efficiently.
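The product-of-gains condition resembles the classical Nash bargaining product. Assuming that reading, a two-player version can be sketched as a grid search over how a unit surplus is split. The utility functions `u1` and `u2`, the fallback payoffs `d1` and `d2`, and the function name are hypothetical and not taken from the paper.

```python
def nash_bargaining_split(u1, u2, d1, d2, steps=1000):
    """Return the share x in [0, 1] for player 1 that maximizes the product
    of gains (u1(x) - d1) * (u2(1 - x) - d2) over a uniform grid."""
    best_x, best_product = None, float("-inf")
    for k in range(steps + 1):
        x = k / steps
        gain1 = u1(x) - d1
        gain2 = u2(1.0 - x) - d2
        if gain1 < 0 or gain2 < 0:
            continue  # neither player accepts less than the fallback payoff
        if gain1 * gain2 > best_product:
            best_x, best_product = x, gain1 * gain2
    return best_x
```

With linear utilities and equal fallback payoffs the search returns an even split; raising one player's fallback payoff shifts the split in that player's favor.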
In the ultimatum game, two players divide a sum of money. The proposer suggests how to split it, and the responder can accept or reject. If the suggestion is rejected, both players get nothing. The rational solution is that the responder accepts even the smallest offer, but humans prefer a fair share. In this paper, we study the ultimatum game by a learning-mutation process based on quantal response equilibrium, where players are assumed to be boundedly rational and make mistakes when estimating the payoffs of strategies. Social learning never stabilizes at the fair outcome or the rational outcome, but leads to oscillations between offers of 40 percent and 50 percent. To be precise, there is a clear tendency for the mean offer to increase if it is lower than 40 percent, and to decrease once it reaches the fair offer. If mutations occur rarely, fair behavior is favored in the limit of local mutation. If the mutation rate is sufficiently high, fairness can evolve for both local mutation and global mutation.
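Bounded rationality in a quantal response setting is commonly modeled with a logit choice rule, in which strategies with higher estimated payoffs are chosen more often but not always. The sketch below assumes that logit form; the precision parameter `lam` and both function names are hypothetical, and the abstract does not state which response function the authors use.

```python
import math

def logit_choice(payoffs, lam):
    """Logit quantal response: probability of each strategy is proportional
    to exp(lam * payoff). lam = 0 gives uniform choice; large lam approaches
    the best response."""
    top = max(payoffs)  # subtract the maximum for numerical stability
    weights = [math.exp(lam * (u - top)) for u in payoffs]
    norm = sum(weights)
    return [w / norm for w in weights]

def proposer_payoffs(offers, accept_probabilities):
    """Expected proposer payoff for each offered share p of a unit pie:
    (1 - p) times the probability that the responder accepts p."""
    return [(1.0 - p) * a for p, a in zip(offers, accept_probabilities)]
```

For example, `logit_choice(proposer_payoffs(offers, accept_probabilities), lam)` gives the distribution of offers made by a noisy proposer facing a given responder population.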
Arabidopsis thaliana is the model species of current plant genomic research with a genome size of 125 Mb and approximately 28,000 genes. The function of half of these genes is currently unknown. The purpose of this study is to infer gene function in Arabidopsis using machine-learning algorithms applied to large-scale gene expression data sets, with the goal of identifying genes that are potentially involved in plant response to abiotic stress.
Using in house and publicly available data, we assembled a large set of gene expression measurements for A. thaliana. Using those genes of known function, we first evaluated and compared the ability of basic machine-learning algorithms to predict which genes respond to stress. Predictive accuracy was measured using ROC50 and precision curves derived through cross validation. To improve accuracy, we developed a method for combining these classifiers using a weighted-voting scheme. The combined classifier was then trained on genes of known function and applied to genes of unknown function, identifying genes that potentially respond to stress. Visual evidence corroborating the predictions was obtained using electronic Northern analysis. Three of the predicted genes were chosen for biological validation. Gene knockout experiments confirmed that all three are involved in a variety of stress responses. The biological analysis of one of these genes (At1g16850) is presented here, where it is shown to be necessary for the normal response to temperature and NaCl.
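A weighted-voting combination of classifiers can be sketched as a weighted average of per-classifier scores. The abstract does not say how the weights are chosen; this sketch assumes each weight reflects a classifier's cross-validated performance. The function names and the example gene labels are hypothetical.

```python
def weighted_vote(scores, weights):
    """Combine one gene's scores from several classifiers into one score.

    scores[i] is classifier i's stress-response score for the gene and
    weights[i] is that classifier's (assumed) cross-validation weight.
    """
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def rank_candidates(gene_scores, weights):
    """Order genes of unknown function by combined score, highest first."""
    combined = {gene: weighted_vote(s, weights) for gene, s in gene_scores.items()}
    return sorted(combined, key=combined.get, reverse=True)
```

A usage example: `rank_candidates({"geneA": [0.9, 0.2], "geneB": [0.1, 0.8]}, [3.0, 1.0])` places `geneA` first because the more heavily weighted classifier favors it.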
Supervised learning methods applied to large-scale gene expression measurements can be used to predict gene function. However, the ability of basic learning methods to predict stress response varies widely and depends heavily on how much dimensionality reduction is used. Our method of combining classifiers can improve the accuracy of such predictions – in this case, predictions of genes involved in stress response in plants – and it effectively chooses the appropriate amount of dimensionality reduction automatically. The method provides a useful means of identifying genes in A. thaliana that potentially respond to stress, and we expect it would be useful in other organisms and for other gene functions.
The immune system is characterized by high combinatorial complexity that necessitates the use of specialized computational tools for analysis of immunological data. Machine learning (ML) algorithms are used in combination with classical experimentation for the selection of vaccine targets and in computational simulations that reduce the number of necessary experiments. The development of ML algorithms requires standardized data sets, consistent measurement methods, and uniform scales. To bridge the gap between the immunology community and the ML community, we designed a repository for machine learning in immunology named Dana-Farber Repository for Machine Learning in Immunology (DFRMLI). This repository provides standardized data sets of HLA-binding peptides with all binding affinities mapped onto a common scale. It also provides a list of experimentally validated naturally processed T cell epitopes derived from tumor or virus antigens. The DFRMLI data were preprocessed and ensure consistency, comparability, detailed descriptions, and statistically meaningful sample sizes for peptides that bind to various HLA molecules. The repository is accessible at http://bio.dfci.harvard.edu/DFRMLI/.
In frequency-dependent games, strategy choice may be innate or learned. While experimental evidence in the producer–scrounger game suggests that learned strategy choice may be common, a recent theoretical analysis demonstrated that learning by only some individuals prevents learning from evolving in others. Here, however, we model learning explicitly, and demonstrate that learning can easily evolve in the whole population. We used an agent-based evolutionary simulation of the producer–scrounger game to test the success of two general learning rules for strategy choice. We found that learning was eventually acquired by all individuals under a sufficient degree of environmental fluctuation, and when players were phenotypically asymmetric. In the absence of sufficient environmental change or phenotypic asymmetries, the correct target for learning seems to be confounded by game dynamics, and innate strategy choice is likely to be fixed in the population. The results demonstrate that under biologically plausible conditions, learning can easily evolve in the whole population and that phenotypic asymmetry is important for the evolution of learned strategy choice, especially in a stable or mildly changing environment.
social foraging; producer–scrounger game; phenotypic asymmetries; evolutionary simulation
A goal in computational chemistry has been to discover the set of rules that can accurately predict the binding affinity of any protein-drug complex, using only a single snapshot of its three-dimensional structure. Despite the continual development of structure-based models, predictive accuracy remains low, and the fundamental factors that inhibit the inference of all-encompassing rules have yet to be fully explored. Using statistical learning theory and information theory, here we prove that even the very best generalized structure-based model is inherently limited in its accuracy, and protein-specific models are always likely to be better. Our results refute the prevailing assumption that large data sets and advanced machine learning techniques will yield accurate, universally applicable models. We anticipate that the results will aid the development of more robust virtual screening strategies and scoring function error estimations.
Classification is one of the data mining problems receiving enormous attention in the database community. Although artificial neural networks (ANNs) have been successfully applied in a wide range of machine learning applications, they are often regarded as black boxes, i.e., their predictions cannot be explained. To enhance the explanation of ANNs, a novel algorithm to extract symbolic rules from ANNs is proposed in this paper. ANN methods have not been effectively utilized for data mining tasks because how the classifications were made is not explicitly stated as symbolic rules that are suitable for verification or interpretation by human experts. With the proposed approach, concise symbolic rules with high accuracy, that are easily explainable, can be extracted from the trained ANNs. Extracted rules are comparable with other methods in terms of number of rules, average number of conditions for a rule, and the accuracy. The effectiveness of the proposed approach is clearly demonstrated by the experimental results on a set of benchmark data mining classification problems.
data mining; neural networks; symbolic rules; weight freezing; constructive algorithm; pruning; clustering; rule extraction
Our objective is to facilitate semi-automated detection of suspicious access to EHRs. Previously we have shown that a machine learning method can play a role in identifying potentially inappropriate access to EHRs. However, the problem of sampling informative instances to build a classifier still remained. We developed an integrated filtering method leveraging both anomaly detection based on symbolic clustering and signature detection, a rule-based technique. We applied the integrated filtering to 25.5 million access records in an intervention arm, and compared this with 8.6 million access records in a control arm where no filtering was applied. On the training set with cross-validation, the AUC was 0.960 in the control arm and 0.998 in the intervention arm. The difference in false negative rates on the independent test set was significant, P=1.6×10−6. Our study suggests that utilization of integrated filtering strategies to facilitate the construction of classifiers can be helpful.
Individuals' beliefs about the malleability of their abilities may predict their response and outcome in learning from serious games. Individuals with growth mindsets believe their abilities can develop with practice and effort, whereas individuals with fixed mindsets believe their abilities are static and cannot improve. This study uses survey and gameplay server data to examine the implicit theory of intelligence in the context of serious game learning. The findings show that growth mindset players performed better than fixed mindset players, their mistakes did not affect their attention to the game, and they read more learning feedback than fixed mindset players. In addition, growth mindset players were more likely to actively seek difficult challenges, which is often essential to self-directed learning. General mindset measurements and domain-specific measurements were also compared. These findings suggest that players' psychological attributes should be considered when designing and applying serious games.
Machine learning is a branch of artificial intelligence that employs a variety of statistical, probabilistic and optimization techniques that allow computers to “learn” from past examples and to detect hard-to-discern patterns from large, noisy or complex data sets. This capability is particularly well-suited to medical applications, especially those that depend on complex proteomic and genomic measurements. As a result, machine learning is frequently used in cancer diagnosis and detection. More recently machine learning has been applied to cancer prognosis and prediction. This latter approach is particularly interesting as it is part of a growing trend towards personalized, predictive medicine. In assembling this review we conducted a broad survey of the different types of machine learning methods being used, the types of data being integrated and the performance of these methods in cancer prediction and prognosis. A number of trends are noted, including a growing dependence on protein biomarkers and microarray data, a strong bias towards applications in prostate and breast cancer, and a heavy reliance on “older” technologies such as artificial neural networks (ANNs) instead of more recently developed or more easily interpretable machine learning methods. A number of published studies also appear to lack an appropriate level of validation or testing. Among the better designed and validated studies it is clear that machine learning methods can be used to substantially (15–25%) improve the accuracy of predicting cancer susceptibility, recurrence and mortality. At a more fundamental level, it is also evident that machine learning is helping to improve our basic understanding of cancer development and progression.
Cancer; machine learning; prognosis; risk; prediction
Complex, high-dimensional data sets pose significant analytical challenges in the post-genomic era. Such data sets are not exclusive to genetic analyses and are also pertinent to epidemiology. There has been considerable effort to develop hypothesis-free data mining and machine learning methodologies. However, current methodologies lack exhaustivity and general applicability. Here we use a novel non-parametric, non-euclidean data mining tool, HyperCube®, to explore exhaustively a complex epidemiological malaria data set by searching for over density of events in m-dimensional space. Hotspots of over density correspond to strings of variables, rules, that determine, in this case, the occurrence of Plasmodium falciparum clinical malaria episodes. The data set contained 46,837 outcome events from 1,653 individuals and 34 explanatory variables. The best predictive rule contained 1,689 events from 148 individuals and was defined as: individuals present during 1992–2003, aged 1–5 years old, having hemoglobin AA, and having had previous Plasmodium malariae malaria parasite infection ≤10 times. These individuals had 3.71 times more P. falciparum clinical malaria episodes than the general population. We validated the rule in two different cohorts. We compared and contrasted the HyperCube® rule with the rules using variables identified by both traditional statistical methods and non-parametric regression tree methods. In addition, we tried all possible sub-stratified quantitative variables. No other model with equal or greater representativity gave a higher Relative Risk. Although three of the four variables in the rule were intuitive, the effect of number of P. malariae episodes was not. HyperCube® efficiently sub-stratified quantitative variables to optimize the rule and was able to identify interactions among the variables, tasks not easy to perform using standard data mining methods. 
Search of local over density in m-dimensional space, explained by easily interpretable rules, is thus seemingly ideal for generating hypotheses for large datasets to unravel the complexity inherent in biological systems.
Plague, caused by the bacterium Yersinia pestis, is a public and wildlife health concern in California and the western United States. This study explores the spatial characteristics of positive plague samples in California and tests Maxent, a machine-learning method that can be used to develop niche-based models from presence-only data, for mapping the potential distribution of plague foci. Maxent models were constructed using geocoded seroprevalence data from surveillance of California ground squirrels (Spermophilus beecheyi) as case points and Worldclim bioclimatic data as predictor variables, and compared and validated using area under the receiver operating characteristic curve (AUC) statistics. Additionally, model results were compared to locations of positive and negative coyote (Canis latrans) samples, in order to determine the correlation between Maxent model predictions and areas of plague risk as determined via wild carnivore surveillance.
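The AUC statistic used for model comparison can be computed directly from its rank interpretation: the probability that a randomly chosen case point receives a higher model score than a randomly chosen comparison point. The sketch below is a generic illustration with a hypothetical function name; for presence-only models the comparison points are typically background locations, which is an assumption about this study's setup.

```python
def auc(case_scores, comparison_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (case, comparison) pairs in which the case point
    scores higher; ties count as half a win.
    """
    wins = 0.0
    for c in case_scores:
        for other in comparison_scores:
            if c > other:
                wins += 1.0
            elif c == other:
                wins += 0.5
    return wins / (len(case_scores) * len(comparison_scores))
```

A value of 0.5 corresponds to a model that ranks case points no better than chance, and 1.0 to perfect separation.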
Models of plague activity in California ground squirrels, based on recent climate conditions, accurately identified case locations (AUC of 0.913 to 0.948) and were significantly correlated with coyote samples. The final models were used to identify potential plague risk areas based on an ensemble of six future climate scenarios. These models suggest that by 2050, climate conditions may reduce plague risk in the southern parts of California and increase risk along the northern coast and Sierras.
Because different modeling approaches can yield substantially different results, care should be taken when interpreting future model predictions. Nonetheless, niche modeling can be a useful tool for exploring and mapping the potential response of plague activity to climate change. The final models in this study were used to identify potential plague risk areas based on an ensemble of six future climate scenarios, which can help public managers decide where to allocate surveillance resources. In addition, Maxent model results were significantly correlated with coyote samples, indicating that carnivore surveillance programs will continue to be important for tracking the response of plague to future climate conditions.
Structural properties of proteins such as secondary structure and solvent accessibility contribute to three-dimensional structure prediction, not only in the ab initio case but also when homology information to known structures is available. Structural properties are also routinely used in protein analysis even when homology is available, largely because homology modelling is lower throughput than, say, secondary structure prediction. Nonetheless, predictors of secondary structure and solvent accessibility are virtually always ab initio.
Here we develop high-throughput machine learning systems for the prediction of protein secondary structure and solvent accessibility that exploit homology to proteins of known structure, where available, in the form of simple structural frequency profiles extracted from sets of PDB templates. We compare these systems to their state-of-the-art ab initio counterparts, and with a number of baselines in which secondary structures and solvent accessibilities are extracted directly from the templates. We show that structural information from templates greatly improves secondary structure and solvent accessibility prediction quality, and that, on average, the systems significantly enrich the information contained in the templates. For sequence similarity exceeding 30%, secondary structure prediction quality is approximately 90%, close to its theoretical maximum, and 2-class solvent accessibility roughly 85%. Gains are robust with respect to template selection noise, and significant for marginal sequence similarity and for short alignments, supporting the claim that these improved predictions may prove beneficial beyond the case in which clear homology is available.
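A structural frequency profile of the kind described can be sketched as per-position class frequencies over a set of aligned templates. The input encoding is an assumption: each template is a string of secondary structure classes (`H`, `E`, `C`) already aligned to the query, with `-` marking gaps. The optional per-template weights, for example by sequence similarity, and the function name are likewise hypothetical.

```python
def frequency_profile(template_structures, weights=None, classes="HEC"):
    """Per-position frequencies of structural classes across templates.

    template_structures: equal-length strings aligned to the query; any
    character not in `classes` (such as '-') is treated as a gap.
    Returns one dict per query position mapping class -> frequency.
    Positions not covered by any template get all-zero frequencies.
    """
    length = len(template_structures[0])
    if weights is None:
        weights = [1.0] * len(template_structures)
    profile = []
    for pos in range(length):
        counts = {c: 0.0 for c in classes}
        for structure, w in zip(template_structures, weights):
            if structure[pos] in counts:
                counts[structure[pos]] += w
        total = sum(counts.values())
        profile.append({c: (counts[c] / total if total else 0.0) for c in classes})
    return profile
```

Such a profile could be supplied to a predictor as extra input features alongside the sequence, so that the predictor can still fall back on sequence information where no template covers a position.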
The predictive systems are publicly available at the address .
Supervised classification is fundamental in bioinformatics. Machine learning models, such as neural networks, have been applied to discover genes and expression patterns. This process is achieved by implementing training and test phases. In the training phase, a set of cases and their respective labels are used to build a classifier. During testing, the classifier is used to predict new cases. One approach to assessing its predictive quality is to estimate its accuracy during the test phase. Key limitations appear when dealing with small-data samples. This paper investigates the effect of data sampling techniques on the assessment of neural network classifiers.
Three data sampling techniques were studied: cross-validation, leave-one-out, and bootstrap. These methods are designed to reduce the bias and variance of small-sample estimations. Two prediction problems based on small-sample sets were considered: classification of microarray data originating from a leukemia study and from small, round blue-cell tumours. A third problem, the prediction of splice-junctions, was analysed to perform comparisons. Different accuracy estimations were produced for each problem. The variations are accentuated in the small-data samples. The quality of the estimates depends on the number of train-test experiments and the amount of data used for training the networks.
The predictive quality assessment of biomolecular data classifiers depends on the data size, sampling techniques and the number of train-test experiments. Conservative and optimistic accuracy estimations can be obtained by applying different methods. Guidelines are suggested to select a sampling technique according to the complexity of the prediction problem under consideration.
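The three sampling techniques compared above differ only in how the available cases are divided into training and test sets. The sketch below generates index splits for each of them using the Python standard library. The function names, the number of folds, and the number of bootstrap rounds are illustrative assumptions and do not reproduce the study's exact protocol.

```python
import random


def k_fold_splits(n, k, rng):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def leave_one_out_splits(n):
    """Yield (train, test) index lists with a single held-out case each."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


def bootstrap_splits(n, rounds, rng):
    """Yield (train, test) index lists for bootstrap sampling.

    The training set is drawn with replacement; the cases that were
    never drawn form the test set for that round.
    """
    for _ in range(rounds):
        train = [rng.randrange(n) for _ in range(n)]
        drawn = set(train)
        test = [j for j in range(n) if j not in drawn]
        yield train, test


rng = random.Random(0)  # fixed seed so the splits are reproducible
cv = list(k_fold_splits(10, 5, rng))
loo = list(leave_one_out_splits(10))
boot = list(bootstrap_splits(10, 20, rng))
```

In each case a classifier would be trained on the `train` indices and scored on the `test` indices, and the accuracy estimate would be the average over all train-test experiments.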
Compare feedback strategies in three versions of an educational game.
Study abroad students (N = 482) participated by playing the game and completing pre-game and post-game surveys between January and March 2010.
This study employed an experimental design. Primary outcome measures were knowledge gain, player-satisfaction, and risk perception.
One-third had previously traveled to a malaria-risk region and two-thirds planned to do so. Baseline malaria knowledge was low. Post-game knowledge and risk perception were significantly higher than pre-game, irrespective of past travel status. The group that automatically received explanatory feedback following game decisions scored higher for mean knowledge gain, without differences in player-satisfaction.
The challenges of designing a feedback strategy to support Web-based learning make these results highly relevant to health educators developing interactive multimedia interventions. The increasing number of students traveling to higher-risk destinations demands attention. Both malaria-naive and malaria-experienced students would benefit from this approach to travel health education.
college; feedback; game design; malaria; risk reduction; study abroad; Web-based
About 30% of genes code for membrane proteins, which are involved in a wide variety of crucial biological functions. Despite their importance, experimentally determined structures correspond to only about 1.7% of protein structures deposited in the Protein Data Bank due to the difficulty in crystallizing membrane proteins. Algorithms that can identify proteins whose high-resolution structure can aid in predicting the structure of many previously unresolved proteins are therefore of potentially high value. Active machine learning is a supervised machine learning approach which is suitable for this domain where there are a large number of sequences but only very few have known corresponding structures. In essence, active learning seeks to identify proteins whose structure, if revealed experimentally, is maximally predictive of others.
An active learning approach is presented for selection of a minimal set of proteins whose structures can aid in the determination of transmembrane helices for the remaining proteins. TMpro, an algorithm for high accuracy TM helix prediction we previously developed, is coupled with active learning. We show that with a well-designed selection procedure, high accuracy can be achieved with only a few proteins. TMpro, trained with a single protein, achieved an F-score of 94% on benchmark evaluation and 91% on the MPtopo dataset, which correspond to the state-of-the-art accuracies on TM helix prediction that are usually achieved by training with over 100 training proteins.
Active learning is suitable for bioinformatics applications, where manually characterized data are not a comprehensive representation of all possible data, and in fact can be a very sparse subset thereof. It aids in the selection of data instances which, when characterized experimentally, can improve the accuracy of computational characterization of the remaining raw data. The results presented here also demonstrate that the feature extraction method of TMpro is well designed, achieving a very good separation between TM and non-TM segments.
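The F-score reported above is the harmonic mean of precision and recall. The sketch below computes it from per-residue labels; this is an assumption made for illustration, since the abstract does not state whether the score is computed per residue or per helix segment. The label alphabet ("M" for membrane, "-" for non-membrane) and the function name `f_score` are likewise hypothetical.

```python
def f_score(true_labels, predicted_labels, positive="M"):
    """Return the F-score (harmonic mean of precision and recall)
    for the positive class, given two equal-length label sequences."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 8-residue example: 3 true positives, 1 false positive,
# 1 false negative, so precision and recall are both 0.75.
example_f = f_score("MMMM----", "MMM-M---")
```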
Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL’s classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.
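The pointwise mutual information used above to relate gene names to disease terms can be estimated from literature co-occurrence counts. The sketch below uses the standard definition, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), with probabilities estimated as document frequencies. The counting scheme, the function name `pmi`, and the example numbers are assumptions for illustration, not the paper's actual text mining pipeline.

```python
import math


def pmi(n_both, n_gene, n_term, n_total):
    """Pointwise mutual information (in bits) between a gene name and
    a disease term, estimated from document counts.

    n_both:  documents mentioning both the gene and the term
    n_gene:  documents mentioning the gene
    n_term:  documents mentioning the term
    n_total: documents in the corpus
    """
    if min(n_both, n_gene, n_term) == 0:
        return float("-inf")  # never observed together (or at all)
    # p(x, y) / (p(x) * p(y)) simplifies to n_both * n_total / (n_gene * n_term).
    return math.log2(n_both * n_total / (n_gene * n_term))


# Hypothetical counts: the pair co-occurs ten times more often than
# expected under independence, giving a PMI of log2(10) bits.
example_pmi = pmi(n_both=10, n_gene=20, n_term=50, n_total=1000)
```

A positive value indicates that the gene and the term co-occur more often than chance would predict, and a value of zero corresponds to independence.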