AMIA Summits Transl Sci Proc. 2012; 2012: 87–94.
Published online Mar 19, 2012.
PMCID: PMC3392048
A Selective Voting Convex-Hull Ensemble Procedure for Personalized Medicine
Radhakrishnan Nagarajan, Ph.D.¹ and Ralph L. Kodell, Ph.D.²
¹Division of Biomedical Informatics, University of Arkansas for Medical Sciences
²Department of Biostatistics, University of Arkansas for Medical Sciences
Genes work in concert as a system, as opposed to independent entities, and mediate disease states. There has been considerable interest in understanding variations in molecular signatures between normal and disease states. However, a majority of techniques implicitly assume homogeneity between samples within a given group and use a fixed set of genes in discerning the groups. The proposed study overcomes these caveats by using a selective-voting convex-hull ensemble procedure that accommodates molecular heterogeneity within and between groups. The significance of the study is its potential to selectively retrieve sample-specific ensemble sets and investigate variations in the networks corresponding to the ensemble set across these samples. These characteristics fit well within the scope of personalized medicine and comparative effectiveness research, which emphasize patient-tailored interventions. While the results are demonstrated on colon cancer gene expression profiles, the approach as such is generic and can be readily extended to other settings.
Understanding the differences between normal and disease states from high-throughput molecular assays using suitable computational techniques is an area of active research [1–5]. Several techniques have been proposed in the literature to identify genes that change markedly between normal and disease groups from their high-throughput molecular signatures. A majority of these techniques implicitly assume homogeneity between the samples within a given group. Such homogeneity assumptions need not hold in light of inherent differences between the members of that group. More importantly, homogeneity assumptions can be violated across multiple scales. For instance, heterogeneity at the molecular level can be an outcome of inherent stochasticity underlying transcriptional mechanisms, microenvironment effects, plasticity, de-differentiation, crosstalk between pathways and co-expression of multiple differentiation programs [6–13]. On the other hand, heterogeneity between patient populations can be attributed to genetic factors, co-morbidities, variations in disease progression and other clinical parameters [6–13]. Therefore, accommodating heterogeneity between patients within a given group may be a critical step prior to developing patient-tailored interventions and falls well within the objectives of comparative effectiveness research and personalized medicine.
Popular techniques that discern molecular signatures between a given pair of groups include clustering and classification techniques. However, these techniques traditionally use the same fixed set of genes to distinguish between normal and disease groups. Many methods have also been proposed for combining the outputs of classifiers into an ensemble, including those for combining class labels and continuous outputs [14]. Alternative approaches include selecting ensemble members from among many models of the same type as well as different types [15–17]. A comprehensive review of ensemble-based methods can be found elsewhere [18]. Majority voting among ensemble members is a common approach for combining class labels to predict the class of an unknown sample [19]. Two popular methods that employ majority voting are bagging and boosting [20,21]. The Random Forest method [22] is a popular ensemble classifier that uses bagging, while LogitBoost [23] uses boosting. Recently, Classification by Ensembles from Random Partitions (CERP), designed specifically for tree-based classifiers [24,25], has also been shown to be competitive and a suitable alternative to bagging and boosting. Other notable classification methods for high-dimensional features include the random subspace method [26], attribute bagging [27] and genetic algorithm-based approaches [28]. While all of these methods show high accuracy when classifying cases not included in the training set, they tend to follow the standard “one-size-fits-all” approach to predicting class membership for unknown samples, i.e. every member of the trained ensemble of classifiers votes on every test sample. While classification accuracy is used as a benchmark, developing suitable techniques that accommodate population heterogeneity may be critical for a more realistic representation, generalization and possible incorporation in clinical workflows. More importantly, such techniques are likely to yield patient-specific ensemble sets as opposed to class-specific ensemble sets.
Recently, Kodell et al. [29] introduced a model-free convex-hull-based approach [30] in conjunction with a majority voting strategy for ensemble classification and compared its performance to other popular classification techniques using established performance measures. The convex-hull procedure [29] by its very definition captures second-order interaction and the geometry of the predictor variables in the two-dimensional space without imposing any model assumptions. This essentially renders the approach model-free. Since the ensemble set consists of gene pairs, the procedure is robust to data sets with missing values. Missing values are common across high-throughput molecular assays such as microarrays and can be an outcome of non-specific binding, poor hybridization and other measurement artifacts. A missing value for a gene in a single sample usually results in dropping that particular gene from the entire analysis. While imputation techniques can prove to be a suitable alternative, they predict the missing value under certain implicit statistical assumptions and can lead to bias when these assumptions are violated. For the convex-hull selective voting procedure proposed in this study, voting is done by pairs of genes. Given a set of n genes, the number of possible pairs or voters is n(n − 1)/2. Thus a missing value in a gene for a given patient implies dropping only the pairs that involve that gene for that particular patient, not removing the gene from the entire analysis. The selective-voting procedure is especially appealing since it allows the ensemble set to adapt across the samples, retrieving personalized molecular signatures.
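The pair-counting argument can be illustrated with a minimal sketch. The gene names and the helper `usable_pairs` are hypothetical and chosen only for illustration; they are not part of the original procedure.

```python
from itertools import combinations
from math import comb

def usable_pairs(genes, missing):
    """Return the gene pairs that can vote for one sample.

    genes   -- list of gene identifiers (hypothetical names here)
    missing -- set of genes whose value is missing for this sample
    """
    return [(a, b) for a, b in combinations(genes, 2)
            if a not in missing and b not in missing]

genes = ["g1", "g2", "g3", "g4", "g5"]
total_pairs = comb(len(genes), 2)            # n(n - 1)/2 possible voters
kept = usable_pairs(genes, missing={"g3"})   # only pairs involving g3 are dropped
```

With five genes there are 10 possible pairs; a missing value in one gene removes the four pairs that involve it for that sample, and the remaining six pairs can still vote.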
Unlike Kodell et al. [29], the present study investigates a novel selective-voting strategy and convex-hull pruning procedure that relies on pairs whose interaction contributes significantly in discerning a given pair of groups while accommodating possible molecular heterogeneity within a group. A logistic regression pre-processing filter is used to first identify gene pairs with a significant interaction term. Such an approach was inspired by the traditional belief that associated genes are likely to have a biologically relevant (i.e. functional) relationship [31,32]. Capturing genes with significant interaction can also be useful in constructing network abstractions similar to those of relevance networks [32]. These pairs are subsequently mapped onto two-dimensional space for selective voting using convex hulls. A novel pruning procedure is proposed to minimize the overlap between the hulls corresponding to the control and the cancer samples. This procedure contrasts with that of Kodell et al. [29], which used a distance metric and majority voting to assign the samples in the overlap region to the hulls. The pruning and selective voting process used in the present study results in sample-specific ensemble sets. Subsequently, a network abstraction corresponding to the ensemble set of each sample is generated. Since no explicit directionality information is available, these networks are in the form of undirected graphs. Several studies have successfully demonstrated the usefulness of network abstractions for obtaining preliminary system-level insights into biological pathways and signaling mechanisms [9,31]. It is our belief that the network abstractions of the ensemble sets are likely to provide insights into the intricate wiring of the genes and variation in the topological properties across the ensemble sets. Such an understanding may be an important step prior to devising meaningful patient-tailored interventions and falls within the scope of personalized medicine and comparative effectiveness research.
Gene Expression Data
The microarray gene expression data set [1] (http://microarray.princeton.edu/oncology/affydata/index.html) used in the present study is publicly available and consists of 62 colon tissue samples: 40 samples from cancerous colon tissue of patients with colon adenocarcinoma and 22 samples from normal colon tissue (controls) taken from the 40 cancer patients. Alon et al. [1] identified 2000 of the 6500 genes measured using the Affymetrix oligonucleotide array as having high intensity levels across the 62 tissue samples. The present study focuses on this set of 2000 genes. The expression level of each of the 2000 genes was subsequently log2-transformed and normalized across the 62 samples by subtracting the mean and dividing by the standard deviation.
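The per-gene transformation described above can be sketched as follows. This is a minimal illustration that assumes the sample standard deviation (n − 1 denominator), which the text does not specify; the function name and the intensity values are hypothetical.

```python
import math
import statistics

def normalize_gene(intensities):
    """Log2-transform one gene's intensities, then standardize across samples.

    Assumes strictly positive intensities and the sample standard deviation
    (n - 1 denominator); the source does not state which estimator was used.
    """
    logged = [math.log2(x) for x in intensities]
    mu = statistics.mean(logged)
    sd = statistics.stdev(logged)
    return [(v - mu) / sd for v in logged]

# Hypothetical intensities for one gene across four samples
z = normalize_gene([8.0, 16.0, 32.0, 64.0])
```

After this step each gene has zero mean and unit standard deviation across the samples, so that gene pairs are projected onto comparable scales.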
Pre-processing Filter
Microarray gene expression data sets are high-dimensional by their very design and provide an unbiased simultaneous screening of transcriptional activity. However, several studies have reiterated that only a fraction of the genes on these arrays are biologically meaningful. This aspect may possibly explain the widely observed positively-skewed distribution of microarray gene expression profiles, with a majority of the genes exhibiting low intensities. We use a logistic regression pre-processing filter in order to minimize noise, alleviate the computational burden on the convex-hull selective voting process and select pairs whose interaction contributes significantly to the class prediction. As noted earlier, identifying gene pairs with significant interaction terms is expected to be useful in capturing possible functional relationships. The logistic regression model was chosen to be of the form (Y = β0 + β1X1 + β2X2 + β3X1X2 + ε), where Y represents the class-prediction labels (22 control, 40 cancer) and X1, X2 and X1X2 represent the main and interaction effects of the given pair of genes. In the present study, gene pairs with a significant (α = 0.01) interaction term (i.e. β3) were deemed interesting and used for subsequent analysis. It is customary to incorporate multiple-testing correction to control the false-discovery rate (FDR) in the logistic regression procedure. However, in the present study imposing FDR control (Benjamini-Hochberg) reduced the number of pairs to zero. Therefore, we excluded the FDR correction from the pre-processing filter. The reduced set after pre-processing consisted of 3016 pairs of genes. All subsequent analysis and discussion were restricted to these 3016 pairs.
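To make the contrast between the unadjusted threshold and the Benjamini-Hochberg correction concrete, the following minimal sketch applies both to a small list of p-values. The p-values and the FDR level q = 0.01 are hypothetical (the text does not state the FDR level used), and the function is a plain implementation of the standard step-up procedure, not the authors' code.

```python
def benjamini_hochberg(p_values, q=0.01):
    """Return the indices of hypotheses rejected by the BH step-up procedure.

    q is the target false-discovery rate (a hypothetical value here).
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Largest rank whose p-value falls under the BH line rank * q / m
        if p_values[idx] <= rank * q / m:
            cutoff = rank
    return sorted(order[:cutoff])

# Hypothetical interaction-term p-values for five gene pairs
p = [0.001, 0.008, 0.009, 0.2, 0.5]
unadjusted = [i for i, v in enumerate(p) if v <= 0.01]
adjusted = benjamini_hochberg(p, q=0.01)
```

In this toy example three pairs pass the unadjusted α = 0.01 threshold but only one survives the correction. With a very large number of candidate pairs the BH line becomes much stricter for the smallest p-values, which is consistent with a filter that can return no pairs at all.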
Convex-Hull Selective Voting Algorithm
  • Step 1: Store the expression profiles of the m pairs of genes identified using the pre-processing filter (i.e. m potential voting members of the ensemble) across n samples in Xm×n. The n samples comprise control (ncon) and cancer (ncan) samples such that n = ncon + ncan. Initialize the control and cancer voting matrices υcon (r, s) ← 0; υcan (r, s) ← 0, where r = 1 … 20, s = 1 … 62.
  • Step 2: Set r ← 1.
    Choose a window length k1 and divide Xm×n into test Um×k1 and training Vm×k2 datasets such that k2 = n − k1 and k2 ≫ k1. The class labels of the n samples in Xm×n are stored in the vector L1×n where
    L(k) = 1 if the kth sample is a control sample and L(k) = 2 if the kth sample is a cancer sample, k = 1 … n.
  • Step 3: Consider the gene pair (i, j) ∈ m, i ≠ j. Their training and test data sets are (Vik2, Vjk2) and (Uik1, Ujk1) respectively. Project the training data (Vik2, Vjk2) onto a two-dimensional space and generate the convex hulls [30] (Ψcon, Ψcan) corresponding to the control (con) and cancer (can) samples. If Ψcon ∩ Ψcan ≠ ∅, then the hulls overlap. In such a case, we iteratively prune the hulls until they are separated using the procedure described below.
    Pruning:
    (i) Drop the vertices of the hulls in the overlap region and also their immediate neighbors.
    (ii) Generate the pruned hulls (Ψ′con, Ψ′can).
    (iii) Set (Ψcon ← Ψ′con, Ψcan ← Ψ′can).
      If Ψcon ∩ Ψcan = ∅, then go to Step 4.
      Else go to (i).
  • Step 4: Vote the samples in the test data (Uik, Ujk), k = 1 … k1 as follows:
    • If (Uik, Ujk) ∈ Ψcon, then the control vote is incremented: υcon(r, k) ← υcon(r, k) + 1
    • If (Uik, Ujk) ∈ Ψcan, then the cancer vote is incremented: υcan(r, k) ← υcan(r, k) + 1
  • Step 5: Repeat Steps 3 and 4 for each pair of genes (i, j) ∈ m, i ≠ j. The number of times each sample is voted as control and cancer is stored in υcon and υcan respectively.
  • Step 6: Repeat Steps 2–5 using the next window of k1 samples as test samples Um×k1 and the remaining k2 = n − k1 as training samples, Vm×k2.
  • Step 7: Repeat Step 6 until each sample is voted on as a test case once.
  • Step 8: Repeat Steps 1–7, randomly permuting the columns of Xm×n 20 times (i.e. r ← r + 1, r ≤ 20)
    • Predicted labels (L*)
    • For the kth sample, k = 1 … n, a parametric t-test (α = 0.01) was used to determine whether υcon[, k] and υcan[, k] are significantly separated.
    • L* (k) = 1 (i.e. control sample) provided the separation between the means corresponding to υcon[, k] and υcan[, k] is statistically significant (t-test, α = 0.01) and Econ[, k] > Ecan[, k], where E represents the means. The ensemble set (Gcon) of interest will contain those pairs of genes that contributed to υcon.
    • L* (k) = 2 (i.e. cancer sample) provided the separation between the means corresponding to υcon[, k] and υcan[, k] is statistically significant (t-test, α = 0.01) and Ecan[, k] > Econ[, k], where E represents the means. The ensemble set (Gcan) of interest will contain those pairs of genes that contributed to υcan.
    • L* (k) = 0 provided the separation between the means corresponding to υcon[, k] and υcan[, k] is not statistically significant (t-test, α = 0.01)
    • The kth sample is said to be correctly classified if L* (k) = L(k)
  • Step 9: Estimate the following statistics: Accuracy (ACC), Sensitivity (SEN), Specificity (SPC), Positive Predictive Value (PPV), Negative Predictive Value (NPV).
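Steps 3 and 4 for a single gene pair can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: Andrew's monotone chain is substituted for Quickhull, overlap is detected only through hull vertices that lie inside the other hull (edge crossings without a contained vertex are not handled), degenerate hulls with fewer than three vertices cast no votes, and the training points are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside(hull, p):
    """True if p lies inside or on a convex polygon given in CCW order."""
    n = len(hull)
    if n < 3:
        return False  # assumption: degenerate hulls do not vote
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

def overlap_vertices(hull, other):
    """Vertices of `hull` inside `other`, plus their immediate hull neighbours."""
    n = len(hull)
    drop = set()
    for i, v in enumerate(hull):
        if inside(other, v):
            drop.update({hull[i - 1], v, hull[(i + 1) % n]})
    return drop

def prune(con_pts, can_pts):
    """Iteratively drop overlapping vertices and rebuild both hulls.

    Terminates because every non-final iteration removes at least one point.
    """
    con, can = list(con_pts), list(can_pts)
    while True:
        h_con, h_can = convex_hull(con), convex_hull(can)
        drop_con = overlap_vertices(h_con, h_can)
        drop_can = overlap_vertices(h_can, h_con)
        if not (drop_con or drop_can):
            return h_con, h_can
        con = [p for p in con if p not in drop_con]
        can = [p for p in can if p not in drop_can]

def vote(h_con, h_can, test_point):
    """Return (control_vote, cancer_vote) cast by one gene pair for one test sample."""
    return int(inside(h_con, test_point)), int(inside(h_can, test_point))

# Hypothetical training data: two 5 x 5 grids that overlap at one corner
con = [(x, y) for x in range(5) for y in range(5)]
can = [(x + 3.5, y + 3.5) for x in range(5) for y in range(5)]
h_con, h_can = prune(con, can)
```

A test sample that falls between the pruned hulls, such as `(3.9, 3.9)` here, receives no vote from this gene pair. This is the sense in which the voting is selective: each sample is voted on only by the pairs whose pruned hulls contain it.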
Given the small sample size of the dataset in the present study, we opted for a leave-one-out cross-validation (LOOCV) procedure, i.e. k1 = 1. Thus no repeats were necessary in Step 8, resulting in r = 1 (Step 1). Voting was done by a direct comparison of υcon and υcan across the 62 samples (22 control, 40 cancer) as opposed to using a t-test, since only a single realization was used (r = 1). The control and cancer votes across the 62 samples were first determined (i.e. υcon (k), υcan (k), k = 1…62). If υcon (k) > υcan (k), then the kth sample was labeled as a control sample. On the other hand, if υcon (k) < υcan (k), then the kth sample was labeled as a cancer sample. No labels were assigned when υcon (k) = υcan (k), i.e. a tie. An example of the iterative pruning of the convex hulls generated using a training set of 19 control and 36 cancer samples for a given pair of genes (X, Y) is shown in Fig. 1 for clarity. Note the significant overlap between the cancer and control training samples in Fig. 1a. The iterative pruning process reduces the overlap until the cancer and control hulls are completely separated in Fig. 1c. The overlay of the test samples (4 cancer, 3 control) on the pruned hulls, Fig. 1d, shows that ~50% of the cancer test samples and ~66% of the control test samples are correctly classified while the remaining are unclassified. Thus not every test sample is voted on by the gene pair (X, Y).
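The direct vote comparison used in the LOOCV setting can be written as a short sketch. The function name is hypothetical; the label coding (1 = control, 2 = cancer, 0 = no label) follows the coding of L* in Step 8.

```python
def label_sample(v_con, v_can):
    """Label one sample from its control and cancer vote counts.

    Returns 1 (control), 2 (cancer) or 0 (tie, no label assigned).
    """
    if v_con > v_can:
        return 1
    if v_can > v_con:
        return 2
    return 0

# Hypothetical vote counts for three samples
labels = [label_sample(120, 40), label_sample(15, 300), label_sample(7, 7)]
```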
Figure 1
Iterative pruning process for separating the convex hulls corresponding to the 19 control and the 36 cancer training samples using the given pair of genes (X, Y) is shown in (a–d). The vertices in the overlap region that are dropped in the pruning …
Network Abstraction of the Ensemble Sets
As noted earlier, genes work in concert as a system and mediate phenotypic outcomes/disease states. Networks and graphs have proven to be convenient and useful abstractions of the underlying signaling mechanisms and pathways [9,31,32]. Understanding their variation across the patient samples is one of the objectives of the proposed study and is likely to be useful in identifying candidate genes for possible intervention. The logistic regression pre-processing filter was used to identify pairs of genes whose interaction contributes significantly in discerning the control and cancer samples. Subsequently, we use the convex-hull selective voting algorithm once (r = 1) with the leave-one-out procedure to identify the pairs that vote on each of the samples. For the network abstraction (undirected graphs), we consider only pairs in the ensemble set (Step 8). For instance, if the given sample is a control sample, we consider only pairs corresponding to υcon. Each gene is represented by a vertex in the undirected graph and each pair of genes by an edge. Since we consider pairs with no explicit directional information, the resulting graph is by definition undirected. Such networks often exhibit interesting topological and statistical properties.
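The construction of an undirected graph from one sample's ensemble set can be sketched as follows. The gene labels are hypothetical placeholders and the adjacency-set representation is one possible choice, not necessarily the one used in the study.

```python
from collections import defaultdict

def build_graph(pairs):
    """Build an undirected graph (adjacency sets) from an ensemble set of gene pairs."""
    adj = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)  # undirected: store the edge in both directions
    return dict(adj)

def degree(adj):
    """Number of neighbours of each gene (unnormalized degree centrality)."""
    return {gene: len(neighbours) for gene, neighbours in adj.items()}

# Hypothetical ensemble set for one sample
ensemble_set = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E")]
graph = build_graph(ensemble_set)
deg = degree(graph)
```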
Convex-Hull Selective Voting
Prior to investigating the network of genes and their variation across the patient samples, we determined the performance of the proposed convex-hull selective voting approach in classifying the control and cancer samples using popular performance metrics for completeness. The Accuracy (ACC = 82.3), Sensitivity (SEN = 92.5), Specificity (SPC = 63.6), Positive Predictive Value (PPV = 82.2) and Negative Predictive Value (NPV = 82.3) estimated using the proposed approach were comparable to those we had estimated using more traditional algorithms in an earlier study [29]. While the proposed method is model-free, some of the established algorithms [29] demand judicious choice of various parameters for improved performance. These parameter choices are often nontrivial and may be data dependent, failing to generalize across new samples. These inherent discrepancies in the nature of the approaches prevent us from conducting a direct comparison of the performance metrics. The control (υcon) and the cancer (υcan) votes estimated across the 62 samples are shown in Figs. 2a and 2b respectively. Note that more control samples than cancer samples were misclassified. On a related note, the number of control samples is considerably lower than that of the cancer samples. The control samples (22) were also obtained from the 40 cancer subjects as opposed to normal subjects. We believe these factors may contribute to the discrepancies in misclassification between the control and the cancer samples. As expected, for the correctly classified control samples we have υcon (k) > υcan (k), k = 1…22, Figs. 2a–2b. A similar behavior was observed for correctly classified cancer samples, i.e. υcan (k) > υcon (k), k = 23…62, Figs. 2a–2b. It should be noted that Figs. 2a–2b also reflect variation in the number of pairs of genes across the control and the cancer ensemble sets. This in turn justifies the need for selective-voting strategies and reflects inherent molecular heterogeneity between the samples. In order to obtain further insight, we investigated the overlap in voting members of the 22 control ensemble sets Gcon (k), k = 1…22 and the 40 cancer ensemble sets Gcan (k), k = 23…62. A grayscale heatmap was used to visualize the extent of overlap among the control ensemble sets and among the cancer ensemble sets, Fig. 2c, with lighter color representing larger overlap. Note the characteristic dark streaks in the heatmap corresponding to the misclassified control and cancer samples. The dark streaks reflect minimal overlap between the ensemble sets corresponding to the misclassified samples and the rest. The dark streaks were especially pronounced across the control samples since the control ensemble sets (Figs. 2a–2b) were smaller than the cancer ensemble sets. There were also additional samples that were correctly classified but shared minimal overlap with the other samples within each group. These samples are accompanied by intermittent dark streaks in Fig. 2c.
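As a worked check of the performance metrics, the sketch below computes them from confusion counts. The counts (37 of 40 cancer and 14 of 22 control samples correctly classified) are not reported in the text; they are back-calculated here from the stated percentages and group sizes under the assumptions that cancer is the positive class and that unlabeled (tied) samples count as misclassified.

```python
def performance(tp, fn, tn, fp):
    """Return ACC, SEN, SPC, PPV and NPV in percent from confusion counts."""
    return {
        "ACC": 100.0 * (tp + tn) / (tp + fn + tn + fp),
        "SEN": 100.0 * tp / (tp + fn),
        "SPC": 100.0 * tn / (tn + fp),
        "PPV": 100.0 * tp / (tp + fp),
        "NPV": 100.0 * tn / (tn + fn),
    }

# Assumed counts inferred from the reported percentages (not given in the source)
stats = performance(tp=37, fn=3, tn=14, fp=8)
```

Under these assumed counts each computed value agrees with the reported figure to within 0.1 percentage point.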
Figure 2
The number of gene pairs in the control (υcon) and cancer (υcan) ensemble sets for the 62 patient tissue samples (22 control and 40 cancer) obtained using the convex-hull ensemble selective voting technique (LOOCV) is shown in (a) and …
Network Abstraction of the Ensemble Sets
Network abstractions were generated from the adjacency matrix corresponding to the pairs of genes for the 22 control (Gcon (k), k = 1…22) and 40 cancer (Gcan (k), k = 23…62) ensemble sets. For the control ensemble sets, nine genes had non-zero degree centrality across the 22 sets. These genes were (Hsa. 4689; Hsa. 5444; Hsa. 832; Hsa. 1130; Hsa. 996; Hsa. 821; Hsa. 831; Hsa.692; Hsa.3305). Of these nine genes, (Hsa. 5444; 40S Ribosomal Protein S24 (human)) was also identified as one of the co-regulated proteins in the ribosomal cluster in the original study (see Table 1, Alon et al., 1999 [1]). These genes were shown to be accompanied by relatively low intensities in the control tissue as opposed to cancer tissue. The gene (Hsa. 3305; tropomyosin alpha chain smooth muscle (human)) was also reported in the original study. This gene was chosen as one of the 17 genes used to estimate the muscle index and distinguish the control and cancer specimens (see Fig. 4, Alon et al., 1999 [1]). The cancer ensemble sets failed to exhibit any overlap across the 40 sets. The absence of overlap in the case of cancer may possibly be due to increased heterogeneity in molecular signatures across diseased states as opposed to normal. However, based on the results from Fig. 2c, we did note that 29 of the 40 samples had a considerably large number of pairs of genes in their ensemble sets. Three genes that had non-zero degree centrality across these 29 ensemble sets were (Hsa. 10755; Hsa.692; Hsa. 122). Of these three genes, (Hsa. 1221; actin aortic smooth muscle (human)) was identified as one of the 17 smooth muscle genes used to compute the muscle index and subsequently distinguish the control and cancer tissues (see Fig. 4, Alon et al., 1999 [1]).
The connectivity of the graphs was investigated and the dominant components of each of the graphs corresponding to the ensemble sets were retrieved. The number of nodes in the dominant component of each of the graphs for the 22 control and 40 cancer ensemble sets is shown in Figs. 3a–3b respectively. It should be noted that the number of nodes in the dominant component of the misclassified samples was considerably lower than in the others, indicating sparse connectivity across the misclassified samples. This was true for the graphs corresponding to misclassified samples across control as well as cancer ensemble sets. Subsequently, the edge density of these dominant components was estimated. The graph density δ of an undirected graph G with V vertices (nodes) and E edges is given by the expression δ(G) = 2E / (V(V − 1)). In contrast to the number of nodes, the graph density of the dominant component corresponding to the misclassified samples was considerably higher than that of the correctly classified ones, Figs. 3c–3d. These results may be an outcome of the relatively small number of classifiers (Figs. 2a–2c) in the ensemble sets for the misclassified samples as opposed to others within the control and the cancer groups. Other established metrics did not provide any marked insights into the sample variations in the present study. However, these need to be investigated in greater detail.
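The two graph statistics used above can be sketched as follows, assuming that the dominant component means the largest connected component (the text does not define the term). The adjacency-set input and the gene labels are hypothetical.

```python
def connected_components(adj):
    """Return the connected components of an undirected graph as sets of nodes."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # iterative depth-first search
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def density(adj, nodes):
    """Graph density 2E / (V(V - 1)) of the subgraph induced by `nodes`."""
    v = len(nodes)
    if v < 2:
        return 0.0
    # Each undirected edge is counted once from each endpoint
    e = sum(len(adj[n] & nodes) for n in nodes) // 2
    return 2.0 * e / (v * (v - 1))

# Hypothetical graph: a path A-B-C-D plus an isolated edge X-Y
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"},
       "X": {"Y"}, "Y": {"X"}}
dominant = max(connected_components(adj), key=len)
delta = density(adj, dominant)
```

The example also shows why small components tend to be dense: the two-node component X-Y has density 1, while the four-node path has density 0.5.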
Figure 3
Nodes in the dominant components across the 22 control (left) and 40 cancer (right) ensemble sets are shown in (a) and (b) respectively. The corresponding edge densities of these graphs are shown in (c) and (d) respectively. The misclassified control …
Traditional approaches for discerning normal and disease states from their molecular signatures implicitly assume homogeneity within groups. Such an assumption may be a serious limitation of these approaches, as there can be considerable differences between samples within a given group attributed to heterogeneity across multiple scales. There has been recent interest in and emphasis on developing methodologies that accommodate heterogeneity and individual variations and that can lead to patient-tailored interventions. Such an approach falls within personalized medicine and comparative effectiveness research. The present study investigated a logistic regression pre-filtering approach in conjunction with a convex-hull ensemble selective voting strategy and network abstractions for understanding molecular heterogeneity in colon cancer gene expression profiles. While the pre-filtering is useful in identifying pairs of genes whose interaction contributes significantly to the prediction of the classes, the convex-hull selective voting is useful in identifying sample-specific marker genes. The convex-hull approach is model-free and robust to the missing values common in high-throughput molecular assays. The ability of this approach to capture second-order dependencies and geometry in the two-dimensional space makes it appealing. In contrast to more traditional classification approaches, the approach presented in this study lends itself to generating network abstractions of the ensemble sets and investigating their variation across the samples. Such networks have conventionally been useful in obtaining system-level insights and have proven to be convenient abstractions of underlying signaling mechanisms. Understanding them may be especially necessary in developing patient-tailored interventions. The results presented on publicly available colon cancer gene expression profiles also validated some of the molecular markers identified in the original study. While the proposed approach is demonstrated on gene expression profiles, its generic nature allows it to be adapted readily to other settings.
Acknowledgments
This research was supported by National Cancer Institute Grant 1R01CA152667.
References
1. Alon U, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences. 1999;96:6745–6750.
2. Perou CM, et al. Molecular portraits of human breast tumours. Nature. 2000;406(6797):747–752.
3. Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537.
4. Tibshirani R, et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences. 2002;99:6567–6572.
5. Khan J, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine. 2001;7:673–679.
6. Aubin, et al. Heterogeneity of the osteoblast phenotype. Endocrinologist. 1999;9:25–31.
7. Candeliere GA, Liu F, Aubin JE, et al. Individual osteoblasts in the developing calvaria express different gene repertoires. Bone. 2001;28:351–361.
8. Ross SE, et al. Inhibition of adipogenesis by Wnt signaling. Science. 2000;289:950–953.
9. Sachs K, et al. Causal protein-signaling networks derived from multiparameter single-cell data. Science. 2005;308:523–529.
10. Allinen M, et al. Molecular characterization of the tumor microenvironment in breast cancer. Cancer Cell. 2004;6:17–32.
11. Elowitz MB, et al. Stochastic gene expression in a single cell. Science. 2002;297(5584):1183–1186.
12. West M, et al. Embracing the complexity of genomic data for personalized medicine. Genome Research. 2006;16(5):559–566.
13. Common JEA, et al. Clinical and genetic heterogeneity of erythrokeratoderma variabilis. Journal of Investigative Dermatology. 2005;125:920–927.
14. Polikar R. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine. 2006;6:21–45.
15. Zhou Z-H, Wu J, Tang W. Ensembling neural networks: Many could be better than all. Artificial Intelligence. 2002;137:239–263.
16. Caruana R, et al. Ensemble selection from libraries of models. Proceedings of the 21st International Conference on Machine Learning. New York: ACM; 2004. p. 8.
17. Rokach L. Collective-agreement-based pruning of ensembles. Computational Statistics and Data Analysis. 2009;53:1015–1026.
18. Rokach L. Ensemble-based classifiers. Artificial Intelligence Review. 2010;33:1–39.
19. Lam L, Suen CY. Application of majority voting to pattern recognition: an analysis of its behavior and performance. Systems, Man & Cybernetics. 1997;27:553–568.
20. Breiman L. Bagging predictors. Machine Learning. 1996;24:123–140.
21. Freund Y, Schapire RE. Experiments with a new boosting algorithm. Proceedings of the 13th International Conference on Machine Learning. 1996. pp. 148–156.
22. Breiman L. Random forests. Machine Learning. 2001;45:5–32.
23. Friedman J, Hastie T, Tibshirani R. Additive logistic regression: a statistical view of boosting. Annals of Statistics. 2000;28:337–374.
24. Moon H, et al. Ensemble methods for classification of patients for personalized medicine with high-dimensional data. Artificial Intelligence in Medicine. 2007;41:197–207.
25. Ahn H, et al. Classification by ensembles from random partitions for high-dimensional data. Computational Statistics and Data Analysis. 2007;51:6166–6179.
26. Ho TK. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1998;20:832–844.
27. Bryll R, Gutierrez-Osuna R, Quek F. Attribute bagging: improving accuracy of classifier ensembles by using random feature subsets. Pattern Recognition. 2003;36:1291–1302.
28. Rokach L. Genetic algorithm-based feature set partitioning for classification problems. Pattern Recognition. 2008;41:1676–1700.
29. Kodell RL, et al. A model-free ensemble method for class prediction with application to biomedical decision making. Artificial Intelligence in Medicine. 2009;46:267–276.
30. Barber CB, Dobkin DP, Huhdanpaa HT. The Quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software. 1996;22(4):469–483.
31. Gardner TS, et al. Inferring genetic networks and identifying compound mode of action via expression profiling. Science. 2003;301:102–105.
32. Butte AJ, Kohane IS. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pacific Symposium on Biocomputing. 2000. pp. 418–429.