Given a set of tissue-specific microarray experiments performed under different conditions, this work presents a new method for identifying genes that can explain or are affected by these conditions. Such informative genes are shown not only to discriminate accurately between different disease types or stages but also to reveal the existence of known or new sub-groupings within a main category and to pinpoint molecular mechanisms that are likely to support these groupings.
Unlike existing filter feature selection techniques, this method places no restrictions on the mean expression values of informative genes in the different classes. Informative genes are identified according to the existence of at least one well-defined expression region that corresponds to a significant number of samples of the same class. The underlying assumption is that if the expression of a gene in a significant number of same-class samples falls within a given interval, then this expression interval may constitute a marker feature for this sub-group of samples. Moreover, the use of multiple expression regions as distinct classifiers for different sub-groups of samples is based on the hypothesis that a gene does not necessarily need to be down-regulated (or up-regulated) in one class relative to the other in order to be informative. It is possible that a gene contains more than one expression region characteristic of the same class and one or no characteristic region for another class. One such example is gene KIAA0016, which encodes a mitochondrial import receptor (TOMM20) [44], whose expression profile in AML vs. ALL samples reveals that both the lower and the higher expression values are characteristic of ALL samples while most of the intermediate values are characteristic of AML samples (see Figure 5 in [Additional file 1]). Although not directly linked to cancer, this receptor has been shown to interact with bcl-2, a central anti-apoptotic protein whose expression is high in both AML and ALL patients, possibly allowing its insertion into the mitochondrial outer membrane [45, 46]. Interestingly, the expression of bcl-2 serves as a prognostic marker for remission outcome and long-term survival in AML patients [47] but not in ALL patients, for whom results are controversial [48, 49]. Our findings open up the possibility of a TOMM20 involvement, through the bcl-2 interaction, in the differential resistance to chemotherapy evident in AML vs. ALL patients. Note that such a gene, although of high discrimination quality as well as biological relevance, would be missed by most standard statistical feature selection techniques since its mean expression value in each class is approximately the same. Another interesting example is gene SMARCA4 (see Table 4 and Figure 7 in [Additional file 1]), which is found by our method to have a 3-step-like expression profile with low expression values in AML samples, intermediate expression values in MLL samples and higher expression values in ALL samples. This gene was previously shown to be differentially expressed in MLL vs. ALL samples [40] as well as between AML and ALL samples [23], but its heterogeneous expression profile and discrimination capacity across all three categories could not be captured in those studies.
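The point about equal class means can be illustrated with a small sketch. It is not the authors' algorithm: the function `class_regions`, the `min_run` threshold (a stand-in for "a significant number of samples"), and all expression values are hypothetical, chosen only to mimic a TOMM20-like profile in which one class occupies both extremes.

```python
from statistics import mean


def class_regions(values, labels, min_run=3):
    """Find expression intervals occupied by a run of same-class samples.

    Samples are sorted by expression value; every maximal run of
    consecutive samples sharing a label, and containing at least
    `min_run` samples, is reported as (low, high, label, count).
    `min_run` is an illustrative assumption, not the paper's criterion.
    """
    ordered = sorted(zip(values, labels))
    regions = []
    start = 0
    for i in range(1, len(ordered) + 1):
        # Close the current run at the end of the data or on a label change.
        if i == len(ordered) or ordered[i][1] != ordered[start][1]:
            count = i - start
            if count >= min_run:
                regions.append(
                    (ordered[start][0], ordered[i - 1][0], ordered[start][1], count)
                )
            start = i
    return regions


# Hypothetical expression values: ALL at both extremes, AML in between.
all_values = [1.0, 1.2, 1.4, 1.6, 8.4, 8.6, 8.8, 9.0]
aml_values = [4.6, 4.8, 5.0, 5.2, 5.4]

values = all_values + aml_values
labels = ["ALL"] * len(all_values) + ["AML"] * len(aml_values)

# The class means coincide, so a mean-based filter sees no difference.
mean_gap = abs(mean(all_values) - mean(aml_values))

# The region view still separates the classes into three intervals.
regions = class_regions(values, labels)
for low, high, label, count in regions:
    print(f"{label}: [{low}, {high}] ({count} samples)")
```

In this toy case a t-test or Signal-to-Noise score would rank the gene poorly because the difference in means is essentially zero, while the interval view recovers two ALL regions and one AML region.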
The utilization of higher order (multiple-region) genes often results in a significant improvement in discrimination performance. A good example is the classification of Central Nervous System (CNS) samples into two classes representing poor vs. good treatment outcome, where the utilization of higher order genes alone results in significantly higher accuracy compared to the use of first order genes. Note that first order (single-region) genes are similar to those detected by Signal-to-Noise, t-test or ICED methods, as they often have a single threshold for classification. It is likely, however, that treatment outcome in these patients depends heavily on genes with a more complex expression pattern that differentially characterizes the heterogeneous group of CNS tumors used in this study. A comparison between the AML/ALL Leukemia, Breast Cancer, and Central Nervous System datasets, all of which were generated using the same microarray chips, reveals several interesting differences in the number and order of selected genes. First order genes comprise nearly 70% of the total number of selected genes in both the Breast Cancer and AML/ALL Leukemia datasets, but less than about 40% in the Central Nervous System dataset. Conversely, more higher order (4th and 3rd) genes are selected in the Central Nervous System dataset as compared to the other two, supporting the hypothesis that treatment outcome for CNS tumor patients is characterized by complex gene expression patterns (see Figure 6 in [Additional file 1]). Moreover, a number of higher order genes selected by our method have been associated with CNS tumors and treatment outcome. Interesting examples include the gene encoding the CD70/CD27 ligand, the antiapoptotic gene seladin-1, the gene coding for the interleukin-1 receptor (IL1R1) and the gene coding for the Ser/Thr protein kinase CDK5 (see Table 4 in [Additional file 1]). CD70 is a member of the Tumor Necrosis Factor family which is highly expressed in human brain tumors [50] and was recently shown to play an immune stimulatory role, preventing tumor growth in vivo, that encourages its application in tumor immunotherapy [51]. The interleukin-1 receptor (IL1R1) is a membrane protein which is variably expressed in different brain tumors [52] and has also been suggested to play a role in brain immunotherapy of astrocytomas [53]. The antiapoptotic gene seladin-1, which is implicated in Alzheimer's disease and cholesterol metabolism, was also found to integrate cellular response to oncogenic and oxidative stress [54]. This gene was recently found to be downregulated in adrenocortical adenomas and carcinomas [55], while its differential expression in pituitary adenomas has been suggested to be associated with a different apoptotic response to somatostatin analogs [56]. Cyclin-dependent kinase 5 (CDK5) is a proline-directed protein kinase that is most active in the CNS and has been implicated in certain neurodegenerative diseases. It was recently shown to facilitate the progression of apoptosis by regulating the activity of the tumor suppressor protein p53 [57], the expression of which has been associated with poor prognosis in primary CNS diffuse large B-cell lymphoma [58]. In addition, overexpression of both p53 and bcl-2 proteins has been associated with ominous prognosis in pediatric glioblastoma multiforme tumors [59]. Taken together, these findings suggest that genes with heterogeneous expression detected by our method are not simply the result of technical or biologically irrelevant variation but can have an important biological role.
In addition to improving prediction accuracy, higher order genes may reveal a general tendency of samples to cluster in sub-groups within a given category. A characteristic example is given by the separation of T-cell from B-cell samples within the ALL leukemia class. Note that this separation is achieved using informative genes selected according to their discrimination capacity with respect to the original class distinctions, in this case AML vs. ALL. Interestingly, among the seven identified genes that support the T-cell vs. B-cell separation (see Table 3 in [Additional file 1]), gene X00437 corresponds to a protein that specifies part of the human T-cell receptor [23]. While it is expected that such a gene would support T-cell vs. B-cell discrimination, it is not intuitive that it would be selected as an AML vs. ALL classifier. In a similar context, our technique is able to separate "Failure" from "Success" AML samples with high accuracy as well as to identify the genes that achieve this separation (see Table 2 in [Additional file 1]). Comparable discrimination results were previously achieved with other methods, but only when treating these samples as distinct classes and selecting genes that specifically discriminate between the two [26, 60] (also see Figure 3 and Figure 4 in [Additional file 1]).
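One way to read the sub-grouping idea is that when a gene has two separate expression regions characteristic of the same class, the samples falling in each region form candidate sub-groups. The sketch below assumes the regions are already known as (low, high) intervals; the function name `assign_subgroups`, the sample identifiers, and all values are hypothetical and do not come from the datasets discussed here.

```python
def assign_subgroups(sample_ids, values, regions):
    """Group same-class samples by the expression region they fall into.

    `regions` is a list of (low, high) intervals for a single gene.
    Samples outside every region are collected under the key None.
    """
    groups = {index: [] for index in range(len(regions))}
    groups[None] = []
    for sample_id, value in zip(sample_ids, values):
        for index, (low, high) in enumerate(regions):
            if low <= value <= high:
                groups[index].append(sample_id)
                break
        else:
            # No region contains this value.
            groups[None].append(sample_id)
    return groups


# Hypothetical same-class samples and two regions of one gene.
sample_ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
values = [1.1, 1.3, 1.5, 8.5, 8.9, 5.0]
regions = [(1.0, 1.6), (8.4, 9.0)]

subgroups = assign_subgroups(sample_ids, values, regions)
print(subgroups)
```

Comparing such region-derived groups against an independent annotation (for example, T-cell vs. B-cell labels) would be the natural next step; that comparison is not shown here.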
In conclusion, this work describes a new method for the identification of informative genes that takes into account inherent genetic variation in disease samples, which may be characteristic of certain sub-groups within a disease category. This relatively simple approach, in conjunction with a committee voting classifier, allows for improved class prediction as well as identification of interesting disease sub-groups. More importantly, our method allows the detection of marker genes that support these sub-groupings, thus possibly shedding some light on the underlying molecular mechanisms involved in disease-related processes and providing a new tool that may facilitate efforts towards individualized medicine.
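The committee voting step is not specified in this section, so the following is only a minimal sketch under a simple assumption: each selected gene casts one vote for the class of the expression region that contains the sample's value, abstains if no region matches, and the class with the most votes wins. The function `committee_vote`, the gene names, and all intervals are hypothetical.

```python
from collections import Counter


def committee_vote(sample, gene_regions):
    """Predict a class label by unweighted majority vote across genes.

    sample: dict mapping gene name -> expression value.
    gene_regions: dict mapping gene name -> list of (low, high, label).
    Returns (predicted_label, votes); predicted_label is None when
    every gene abstains. Ties are resolved by first-counted label,
    which is an arbitrary choice made for this illustration.
    """
    votes = Counter()
    for gene, regions in gene_regions.items():
        value = sample.get(gene)
        if value is None:
            continue  # gene not measured for this sample
        for low, high, label in regions:
            if low <= value <= high:
                votes[label] += 1
                break
    if not votes:
        return None, votes
    return votes.most_common(1)[0][0], votes


# Hypothetical regions for three genes and one unlabeled sample.
gene_regions = {
    "geneA": [(1.0, 1.6, "ALL"), (4.6, 5.4, "AML"), (8.4, 9.0, "ALL")],
    "geneB": [(0.0, 2.0, "AML"), (3.0, 6.0, "ALL")],
    "geneC": [(2.0, 4.0, "ALL"), (5.0, 7.0, "AML")],
}
sample = {"geneA": 8.7, "geneB": 4.0, "geneC": 6.0}

predicted, votes = committee_vote(sample, gene_regions)
print(predicted, dict(votes))
```

A weighted scheme, in which each gene's vote is scaled by the quality of its regions, would be an equally plausible design; the unweighted version is used here only because it is the simplest to state.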