Classification analysis
Stool-based molecular diagnostic tests are emerging as important new approaches that have the potential of providing cost-effective, sensitive early detection of colorectal neoplasia. Details of many of the currently used and novel approaches have been recently reviewed (
22). Because a single genetic product is unlikely to have sufficient detection sensitivity and specificity to be used as a “stand-alone” diagnostic test, a fecal-based DNA detection system that exploits the concept of chromosomal instability with mutations progressively accumulating in the adenomatous polyposis coli, p53 tumor suppressor genes, and the K-
ras oncogene has been recently developed (
23,
24). Publications in small trials (16-65 subjects) reported test sensitivity ranging from 62% to 91% for adenocarcinoma detection and 27% to 82% for adenoma detection, with specificity ranging from 93% to 98%. Validation of these preliminary data in a large (4,404 evaluated subjects) prospective colorectal cancer screening trial resulted in a sensitivity of 52% (95% confidence interval, 35-68%) for detection of adenocarcinoma and 15% (95% confidence interval, 12-19%) for detection of adenomas ≥1 cm, with double the sensitivity when the adenoma had dysplasia. Specificity for the fecal DNA test was 94% (
23). Very recently, stool DNA test 2 and a novel digital melt curve assay, which targets more broadly informative markers, detected significantly more screen-relevant neoplasms compared with occult blood testing (
25,
26). From these data, it is logical to assume that fecal DNA tests could serve as an intermediate, noninvasive screening tool for colorectal adenocarcinoma.
A major disadvantage of DNA-based methods is that it is inherently limited to a small number of hybridizing oligonucleotides, which reduces the likelihood that a neoplasia-associated mutation will be found in the large number and heterogeneity of mutational events occurring in human neoplasia. In addition, a fecal DNA testing panel using nucleotide probes will not detect important epigenetic events associated with human carcinogenesis. For example, epigenetic modifications of DNA (i.e., aberrant promoter hypermethylation) of multiple tumor suppressor genes lead to loss of expression (
27). DNA-based methods do not detect these important molecular events. This severely limits the utility of current DNA-based assays. Recently, several attempts have been made to use DNA from stool to detect aberrant CpG island methylation (
28,
29). Thus, it is possible that methylated genes may be effective early detection markers for colon adenomas, and offer another mechanistic approach that may increase performance characteristics of stool markers based on mutation detection alone.
To enhance current colon cancer molecular detection assays, our laboratory was first to develop a novel noninvasive molecular method using feces containing intact viable exfoliated colonocytes to quantify colonic mRNAs and determine gene expression profiles (
9). Because “global” changes in patterns of gene expression occur throughout the colon well before macroscopic tumors are apparent (
30,
31), these data suggest that “diagnostic” gene expression profiles are associated with a large number of shed cells, and hence, recovered cell number should not be a limiting factor (
13).
In this feasibility study, our goal was to identify mRNA expression patterns that may establish the basis of a new noninvasive molecular diagnostic method. For this purpose, we applied an algorithm to 12 different pairs of classes arising from the experimental design as described in and
Supplementary Table S1. The number of genes/features for each linear classifier was limited to 3, which allowed for an exhaustive search. The use of small (three-gene) classifiers is not new in the classification of cancer. It goes back some number of years (
21,
32). As an initial step within the context of classification, we identified the best single genes (single-gene classifiers) to distinguish phenotype. To illustrate how this approach compares to the traditional statistical analysis, we considered the classes (+IR, + Polyps) versus (−IR, −Polyps) at BL1. The top 10 feature sets of size 1 were compared with the differentially expressed genes in the colonic biomarker set

, where
t tests were done using normalized and log-transformed gene intensity values. The comparison revealed that 7 of the 10 top one-feature sets (genes) identified by the linear (LDA) classifier also had
P values <0.05. This is not surprising because individual differentially expressed genes have been traditionally used to discriminate between phenotypes (
33). Interestingly, the results show that there are several cases where single genes can provide good (in terms of the error estimate) classification (). However, when comparing these results to the two-feature classification for the same two classes, a phenomenon was observed that has been recently documented in the context of gene network modeling (
34). Specifically, the expression profiles of a group of genes predicted the target (either a gene or a phenotype) with greater accuracy relative to any proper subset of these genes. For example, single-gene classifiers (one-feature) based on either the Homeobox protein-A3 (
HOXA3) or uncoupling protein-2 (
UCP2) performed very poorly when discriminating between (+IR, + Polyps) and (−IR, −Polyps) at BL1 (; ). Interestingly, HOXA3 was close to the worst predictor of all of the available 97 genes (ranked 94). In comparison, when combined as a two-feature set,
UCP2 and
HOXA3 provided one of the best two-feature classifiers (one misclassified data point only) among all of the 4,656 possible two-gene sets (; ). These data clearly illustrate why complex phenotypes can be explained better by multivariate feature sets.
| Table 1Classification of (+IR, +Polyps) subjects versus (−IR, −Polyps) subjects at BL1 |
To identify sets of genes that perform in a multivariate manner to provide strong classification, we specifically looked for pairs of genes that performed better than either of the genes individually, and triplets of genes that performed well and substantially better than the best-performing pair among the three, and so on. To estimate the improvements of the classification performance, we introduced two quantities for each feature set: εbolstered and Δ(εbolstered). εbolstered denotes the bolstered resubstitution error for the LDA classifier for the respective feature set, and Δ(εbolstered) denotes the largest decrease in error for the full feature set relative to all of its subsets. The feature sets were initially ranked based on the value of εbolstered, and subsequently ranked again based on the improvement Δ(εbolstered). For multiple-gene classifiers, we focused on feature sets with high rank in both lists. Along these lines, we designed two-feature classifiers for the classification of (+IR, +Polyps) versus (−IR, −Polyps) data at baseline BL1; (−IR, −Polyps, control diet) versus (−IR, −Polyps, legume diet) data at the end of the two diet periods DP1 and DP2; (+IR, + Polyps) versus (−IR, −Polyps) at baselines BL1 and BL2; (+Polyps) versus (−Polyps) at baselines BL1 and BL2; and (+IR) versus (−IR) at all of the time points. and describe the best (according to this ranking procedure) feature sets identified for the first two of these classification categories, and shows representative multivariate classifiers.
| Table 2Classification of (−IR, −Polyps) subjects on control diet versus (−IR, −Polyps) subjects on the legume diet |
The results in show that the two factors, IR and history of adenomas, should be considered in tandem when determining the risk for the patient. For example, combining baseline samples (BL1 and BL2) increased the classification error, indicating complications related to the crossover design (). Similarly, the three-feature set LDA classifiers performed poorly when the classification was considered separately with respect to either one of the two experimental factors (IR) or (Polyps; ). The advantage of reporting the results in this way is that multivariate discriminatory power is revealed. This is clearly shown in with regard to
HOXA3. The gene did not appear on the single-gene list, indicating that the error of the respective classifier exceeded 0.3 (
εbolstered = 0.4882). However, it appeared with
UCP2, 14-3-3ζ (
YWHAZ), insulin growth factor receptor-I (
IGF1R), beclin-1 (
BECN1), and mitogen-activated protein kinase-11 (
MAPK11) genes in the two-gene and three-gene lists, which improved classification error. Interestingly, members of the homeoprotein family of transcription factors (
HOXA3 and
HOXC6) are developmental regulators of gastrointestinal growth, patterning, and differentiation (
35). It is also noteworthy that
YWHAZ and
IGF1R are capable of regulating apoptosis and cell adhesion (
36,
37);
UCP2 promotes chemoresistance in cancer cells and mitochondrial Ca
2+ sequestration (
38,
39);
BECN1 stimulates autophagy and inhibits tumor cell growth (
40); and
MAPK11 (
p38β) mediates response to inflammatory cytokines and cellular stress (
41). For comparative purposes, fold changes in select genes are presented in
Supplementary Table S5.
Legumes and pulses are a rich source of fermentable dietary fibers, which are precursors to luminal butyrate (
4). Butyrate has well-known anti-inflammatory and antineoplastic actions (
6,
7). In addition, pulses have a low glycemic index (
5). Some studies suggest that diets high in fiber and with a lower glycemic index may reduce risk of colorectal cancer and decrease inflammatory markers (
4,
42,
43). Therefore, it was important to note that the approach applied in this study can be used to identify genes that are modulated by the consumption of a legume-rich diet (). Our data show that although transforming growth factor β (TGFβ), which plays a permissive role in cancer progression and wound repair (
44,
45), is by itself a reasonable discriminator, when it is combined with HOXA3 and death-associated protein kinase (DAPK1), the error is significantly improved. These observations are worth noting in view of the fact that DAPK1 is an extremely pleiotropic molecule capable of influencing the propensity of cells to undergo autophagy (
46). Moreover, it has been recently shown that dietary fiber (butyrate) can enhance TGFβ/Smad3-tumor suppressor signaling in the colon (
47,
48). Considering that dietary legumes promote short-chain fatty acid production in the colonic lumen, it is probable that butyrate may have altered TGFβ expression. Clearly, additional studies are needed to elucidate the effect of legume consumption on TGFβ-dependent signaling.
The objective of this proof-of-principle study was to develop diagnostic gene sets for the noninvasive identification of different phenotypes. As opposed to using expression levels of either significantly increased or decreased genes, we applied novel mRNA-based noninvasive methods to identify the best single genes and two- to three-gene combinations for distinguishing polyps, IR, and exposure to a chemoprotective legume-enriched diet. Similar to previous studies (
20,
21,
32,
49), we report that by using combinations of genes, the classification error rate can be significantly lowered. Two- and three-gene combinations thus provide robust classifiers with potential to noninvasively identify discriminative molecular signatures for differential diagnostic purposes. These findings provide insight into a new paradigm and support the development of noninvasive methods using exfoliated colonocytes to quantify colonic mRNAs. This strategy can be a complementary, and likely useful, approach to enhance current efforts to define colon cancer risk. In addition, because of a lack of genomic precision in defining clinically relevant phenotypes, two- and three-gene combinations may have application in personalized genomic medicine (e.g., the stratification of patients according to response to risk of recurrence in trials of adjuvant treatment of the disease). Further studies are needed to validate the prognostic power and reliability of this molecular diagnostic approach.