Understanding the differences between normal and disease states from high-throughput molecular assays using suitable computational techniques is an area of active research1
. Several techniques have been proposed in the literature to identify genes whose high-throughput molecular signatures change markedly between normal and disease groups. A majority of these techniques implicitly assume homogeneity among the samples within a given group. Such homogeneity assumptions need not hold, given the inherent differences between the members of a group. More importantly, homogeneity assumptions can be violated across multiple scales. For instance, heterogeneity at the molecular level can be an outcome of the inherent stochasticity of transcriptional mechanisms, microenvironment effects, plasticity, de-differentiation, crosstalk between pathways, and co-expression of multiple differentiation programs6
. On the other hand, heterogeneity between patient populations can be attributed to genetic factors, co-morbidities, variations in disease progression and other clinical parameters6
. Therefore, accommodating heterogeneity between patients within a given group may be a critical step prior to developing patient-tailored interventions and falls well within the objectives of comparative effectiveness research and personalized medicine.
Popular techniques that discern molecular signatures between a given pair of groups include clustering and classification techniques. However, these techniques traditionally use the same fixed set of genes to distinguish between normal and disease groups. Many methods have also been proposed for combining the outputs of classifiers into an ensemble including those for combining class labels and continuous outputs14
. Alternative approaches include selecting ensemble members from among many models of the same type as well as different types 15
. A comprehensive review of ensemble-based methods can be found elsewhere 18
. Majority voting among ensemble members is a common approach for combining class labels to predict the class of an unknown sample19
. Two popular methods that employ majority voting are bagging and boosting 20
. The Random Forest method 22
is a popular ensemble classifier that uses bagging, while LogitBoost23
uses boosting. Recently, Classification by Ensembles from Random Partitions (CERP), designed specifically for tree-based classifiers24
, has also been shown to be competitive and a suitable alternative to bagging and boosting. Other notable classification-based methods for high-dimensional feature spaces include the random subspace method26
, attribute bagging27
and genetic algorithm-based approaches28
. Although all of these methods show high accuracy when classifying cases not included in the training set, they tend to follow the standard “one-size-fits-all” approach to predicting class membership for unknown samples, i.e. every member of the trained ensemble of classifiers votes on every test sample. Classification accuracy is the usual benchmark, but techniques that accommodate population heterogeneity may be critical for a more realistic representation, better generalization and possible incorporation into clinical workflows. More importantly, such techniques are likely to yield patient-specific ensemble sets as opposed to class-specific ensemble sets.
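The “one-size-fits-all” majority vote described above can be illustrated with a minimal sketch. The labels and the five-member ensemble are hypothetical, and returning None on a tie is an assumption made here for illustration rather than a rule taken from any of the cited methods.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label, or None if there is no clear winner."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    # A tie between the two most frequent labels gives no decision (assumption).
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

# Hypothetical votes: every ensemble member votes on the same test sample.
votes = ["disease", "normal", "disease", "disease", "normal"]
prediction = majority_vote(votes)
```

Every member contributes a vote here regardless of the sample, which is the behaviour that a patient-specific ensemble would relax.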
Recently, Kodell et al.29
introduced a model-free convex-hull-based approach30
in conjunction with a majority voting strategy for ensemble classification and compared its performance to other popular classification techniques using established performance measures. The convex-hull procedure29
by its very definition captures second-order interactions and the geometry of the predictor variables in two-dimensional space without imposing any model assumptions. This essentially renders the approach model-free. Since the ensemble set consists of gene pairs, the procedure is robust to data sets with missing values. Missing values are common across high-throughput molecular assays such as microarrays and can be an outcome of non-specific binding, poor hybridization and other measurement artifacts. A missing value of a gene in a single sample usually results in dropping that gene from the entire analysis. While imputation techniques can be a suitable alternative, they predict the missing value under certain implicit statistical assumptions and can lead to bias when these assumptions are violated. For the convex-hull selective voting procedure proposed in this study, voting is done by pairs of genes. Given a set of n genes, the number of possible pairs, or voters, is n(n − 1)/2. Thus a missing value in a gene for a given patient implies dropping only the pairs that involve that gene for that particular patient, not removing the gene from the entire analysis. The selective-voting procedure is especially appealing since it allows the ensemble set to adapt across samples, retrieving personalized molecular signatures.
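The pair-counting argument can be made concrete with a small sketch. The gene names and expression values below are hypothetical, and the use of None to mark a missing measurement is an assumption made for illustration only.

```python
from itertools import combinations

def usable_pairs(sample):
    """Gene pairs for which both expression values are present in this sample."""
    genes = sorted(sample)
    return [(a, b) for a, b in combinations(genes, 2)
            if sample[a] is not None and sample[b] is not None]

# Hypothetical expression values for one patient; None marks a missing value.
patient = {"g1": 2.1, "g2": None, "g3": 0.7, "g4": 1.5}

n = len(patient)
total_pairs = n * (n - 1) // 2   # all possible voters for n genes
pairs = usable_pairs(patient)    # voters that remain for this patient
dropped = total_pairs - len(pairs)
```

With n = 4 genes there are 6 possible pairs. The single missing value removes only the 3 pairs that involve g2 for this patient, and g2 can still vote in other patients where it was measured.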
Unlike Kodell et al.29
, the present study investigates a novel selective-voting strategy and convex-hull pruning procedure that relies on pairs whose interaction contributes significantly to discerning a given pair of groups while accommodating possible molecular heterogeneity within a group. A logistic regression pre-processing filter is first used to identify gene pairs with a significant interaction term. This approach was inspired by the traditional belief that associated genes are likely to have a biologically relevant (i.e. functional) relationship31
. Capturing genes with significant interaction can also be useful in constructing network abstractions similar to those of relevance networks32
. These pairs are subsequently mapped onto two-dimensional space for selective voting using convex hulls. A novel pruning procedure is proposed to minimize the overlap between the hulls corresponding to the control and the cancer samples. This procedure contrasts with that of Kodell et al. 29
, who used a distance metric and majority voting to assign the samples in the overlap region to the hulls. The pruning and selective voting process used in the present study results in sample-specific ensemble sets. Subsequently, a network abstraction corresponding to the ensemble set of each sample is generated. Since no explicit directionality information is available, these networks take the form of undirected graphs. Several studies have successfully demonstrated the usefulness of network abstractions for obtaining preliminary system-level insights into biological pathways and signaling mechanisms9
. It is our belief that the network abstractions of the ensemble sets are likely to provide insights into the intricate wiring of the genes and variation in the topological properties across the ensemble sets. Such an understanding may be an important step prior to devising meaningful patient-tailored interventions and falls within the scope of personalized medicine and comparative effectiveness research.
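To illustrate how a single gene pair might cast a selective vote with convex hulls, the following sketch builds one hull per group and lets the pair vote only when the test point falls inside exactly one of them. This is not the pruning procedure of the present study, whose details are given elsewhere; the abstention rule, the function names and the coordinates are all assumptions made for illustration.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside(hull, p):
    """True if p lies inside or on the boundary of a counter-clockwise hull."""
    if len(hull) < 3:
        return False  # degenerate hull: treated as containing nothing
    k = len(hull)
    return all(cross(hull[i], hull[(i + 1) % k], p) >= 0 for i in range(k))

def pair_vote(control_pts, disease_pts, p):
    """Vote of one gene pair for test point p; None means the pair abstains."""
    in_control = inside(convex_hull(control_pts), p)
    in_disease = inside(convex_hull(disease_pts), p)
    if in_control == in_disease:
        return None  # overlap region or outside both hulls (assumed rule)
    return "control" if in_control else "disease"

# Hypothetical (gene A, gene B) expression coordinates for training samples.
control = [(0, 0), (2, 0), (2, 2), (0, 2)]
disease = [(1, 1), (4, 1), (4, 4), (1, 4)]
votes = [pair_vote(control, disease, p)
         for p in [(0.5, 0.5), (3, 3), (1.5, 1.5), (5, 5)]]
```

Collecting, for each test sample, only the pairs that do not abstain gives a sample-specific set of voters, and treating each voting pair as an undirected edge between its two genes yields a graph of the kind described above.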