We have shown that simple two-transcript gene expression classifiers can accurately classify a wide spectrum of human diseases. The TSP algorithm used to build them is invariant to data normalization and generates robust, statistically significant biological classifiers even when sample sizes are low. Our results reveal that many pathological processes, even those not traditionally considered genetic in nature, such as infections and inflammatory disorders, can be diagnosed through just two transcriptional measurements. Whereas previous work has shown the diagnostic value of gene expression perturbations, this study demonstrates that as few as two transcriptional measurements can reliably detect diverse human diseases.
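To make the normalization-invariance claim concrete, the following is a minimal sketch of a two-transcript (top-scoring pair) classifier. It is an illustration under stated assumptions, not the implementation used in this study: the function names `train_tsp` and `predict_tsp`, the 0/1 class labels, and the first-found tie-breaking rule are all hypothetical choices made for the example. The pair score is taken to be the absolute difference between the two classes in the frequency with which one transcript is expressed below the other.

```python
from itertools import combinations


def train_tsp(samples, labels):
    """Pick the transcript pair whose relative ordering best separates two classes.

    samples: list of expression vectors (one list of floats per sample)
    labels:  list of 0/1 class labels, same length as samples
    Returns (i, j, class_if_less), a hypothetical model tuple for this sketch.
    """
    n_genes = len(samples[0])
    groups = {c: [s for s, y in zip(samples, labels) if y == c] for c in (0, 1)}
    best = None
    for i, j in combinations(range(n_genes), 2):
        # Fraction of each class in which transcript i is expressed below transcript j.
        p = {c: sum(s[i] < s[j] for s in g) / len(g) for c, g in groups.items()}
        score = abs(p[0] - p[1])
        # Assumption: ties are resolved by keeping the first pair found.
        if best is None or score > best[0]:
            best = (score, i, j, 0 if p[0] > p[1] else 1)
    _, i, j, class_if_less = best
    return i, j, class_if_less


def predict_tsp(model, sample):
    """Classify one sample using only the ordering of two measurements."""
    i, j, class_if_less = model
    return class_if_less if sample[i] < sample[j] else 1 - class_if_less
```

Because the decision depends only on whether one measurement is below the other within a single sample, any order-preserving rescaling applied per sample (for example, multiplying each array by its own positive constant) leaves both the selected pair and the predictions unchanged.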
Transcriptional networks themselves can thus be seen to encode aspects of pathological phenotypes, with strong correlation observed between gene expression status and disease state. These transcriptional signatures were sufficiently robust to be detected even in tissue samples of possibly heterogeneous cell populations. The accuracies observed in these simple diagnostic modalities were comparable to pre-existing transcription-based classifiers that rely on more complex, multivariate measurements. For example, a 12-gene classifier generated against the same Crohn's disease dataset using a weighted-voting scheme exhibited a cross-validation accuracy of 94%, compared with equivalent TSP cross-validation performance of 87% [29]. Additionally, a 35-gene k-Nearest-Neighbor classifier trained on the same viral and bacterial infection dataset achieved a cross-validation accuracy of 91%, compared with 96% for the TSP approach [30].
The TSP method compared favourably to the estimated accuracy of standard clinical methods for the differentiation of viral and bacterial infection, as well as for cardiomyopathy classification, conditions that present ongoing diagnostic challenges in the clinic. For example, a recently developed clinical prediction rule to discriminate between bacterial and viral pneumonia in children achieved a positive predictive value of under 80%, in contrast to a TSP classifier cross-validation accuracy of 96.7% [31]. Additionally, a recent study of over 1200 patients presenting with diverse cardiomyopathies found that no pathologic etiology could be definitively elucidated in over 50% of clinical cases, in comparison with a cross-validation accuracy of over 70% achieved by the corresponding TSP classifier [32]. These results do not imply that the TSP method provides intrinsically superior diagnostic discrimination to 'gold standard' clinical measures: the TSP classifiers themselves are constrained by the fidelity of the clinical methods used to diagnose the patient samples contained within their respective training datasets. However, these results do indicate that properly trained TSP classifiers may exhibit higher accuracy in medical contexts where high-fidelity diagnoses are difficult or impractical to obtain regularly using other methods.
Interestingly, the ability of the classifier to obtain an accurate diagnosis was significantly lower in the comparison of ischemic and idiopathic cardiomyopathies than in any other dataset we examined. This is likely due to the broad cellular and metabolic heterogeneity observed in these two closely related conditions. Both clinical and molecular differentiation of ischemic and idiopathic cardiomyopathies remain a significant challenge [33]. Ischemic cardiomyopathy is diagnosed when oxygen delivery to the myocardium is inhibited, most often due to coronary artery disease. However, the presence of this condition is not diagnosed with great precision in the clinic, and idiopathic cardiomyopathy is diagnosed when no etiological factor for cardiovascular dysfunction can be explicitly isolated [32]. The failure of the algorithm to accurately discriminate between these two conditions may indicate that they represent overlapping genetic and physiological states, that their respective diagnoses are not made with high fidelity in the clinic, or a combination of both factors. This molecular heterogeneity has recently been confirmed using alternative gene expression analysis methods [34]. It is possible that other factors, such as consistency of tissue collection and processing, may negatively impact the quality of microarray data and thus the apparent performance of the algorithm. It is also possible that the two-transcript classifier scheme does not capture pathological information encoded by other molecular media (for example, protein or metabolite levels) that may more accurately predict pathological state. However, it is clear that a chief factor constraining the performance of the TSP cardiomyopathy classifier is the low fidelity of the diagnostic decisions upon which it was trained. In the phenotypes studied where higher clinical diagnostic efficacy is achieved, the TSP classifier likewise exhibits higher accuracy.
We observed that the genes present in highly accurate two-transcript classifiers were often associated with disease processes in previous literature reports. For example, PRUNE2 has been shown to inhibit certain forms of oncogenic transformation, which may correspond to its differential regulation in GIST and LMS as observed through the TSP method [35]. The TSP prediction rule to diagnose Type I Diabetes is based on the relative expression of the genes CD1D and PSD. CD1D is a transmembrane protein involved in the presentation of lipid antigens to T cells and known to contribute to the generation of diabetes, and PSD belongs to a family of intracellular signal transduction proteins known to increase insulin sensitivity [36]. The change in expression of these two genes within the classifier thus recapitulates the underlying molecular etiology of the disease. While not all genes in the classifiers found through this study were known a priori to be involved in pathological processes, the strong association held by many such transcripts with their cognate phenotypes demonstrates the biomolecular relevance of these classifiers.
Intriguingly, this study found that analysis of transcription in circulating mononuclear cells provides a robust diagnostic platform both for the detection of invading cellular or viral pathogens and for the diagnosis of somatic medical conditions such as diabetes and Crohn's disease. Of particular interest are the simplicity, robustness and accuracy of two-transcript classifiers using a data source that provides an easily accessed transcriptomic 'readout' from pathologies of disparate tissues. Recent studies have examined the utility of serum-borne mRNA in the prediction of diseases, with varying fidelity [40]. These methods are constrained by the finite stability of RNA transcripts in the circulation. In contrast, the metazoan immune system exhibits an intrinsic and long-lasting 'memory' of cellular and other interactions that can persist in circulating cells for long periods. The interrogation of leukocyte gene expression would provide an easily deployed method for clinical diagnosis which, as indicated by these results, might present an informative discriminative measure in the diagnosis of diverse human diseases.
To implement the two-transcript classifiers, transcriptional measurements can be readily obtained in the clinic through routine PCR procedures [42]. The success of previous two-transcript diagnostics shows that, despite being formulated using microarray platforms, these intrinsically simple classifiers can be implemented efficiently through pre-existing gene expression methodologies. These classifiers therefore embody a promising platform for diverse diagnostic and prognostic tasks. These results also raise the exciting possibility that widespread human diseases could be reliably diagnosed through the acquisition of standard blood samples, a major objective of personalized medicine [43]. Sufficient information about the state of somatic tissues and organs may be encoded by the circulating leukocyte transcriptome to create a 'battery' of gene expression measurements that could simultaneously diagnose a large number of medical conditions. Further research is warranted to examine the degree to which different human pathologies could be inferred using simple transcriptional measurements from circulating cells.