The identification of genome-wide expression profiles that discriminate between disease phenotypes is now a relatively routine research procedure. However, clinical implementation has been slow, in part because marker sets identified by independent studies rarely display substantial overlap [
1-
4]. For example, van't Veer
et al. [
4], and Wang
et al. [
3] identified gene sets of size 70 and 76 to distinguish metastatic from non-metastatic breast cancer, but the two sets had an overlap of only 3 genes [
5]. As another example, Dressman
et al. [
2] found a set of 100 genes significant in predicting responses to cis-platinum therapy for ovarian cancer using a stochastic search method, which had no intersection with another set of 86 genes predictive of ovarian survival found by Crijns
et al. [
1] using functional class scoring analysis. Though the latter pair of sets was not studied to predict identical phenotypes in ovarian cancer, their empty intersection still indicates the lack of uniformity in predictive biomarkers among different experiments.
There is a well-recognized need for canonical biomarkers, which will make it possible to record marker-based data interchangeably among different laboratories, using well-recognized quantitative features to encode cancer and other phenotypes. This need is related to future improvements in standardized diagnosis and prognosis regimes which will incorporate genomic cancer information as a matter of course, augmenting current cancer classification protocols.
In connection with this, many researchers have suggested a more effective and robust means of marker identification which combines gene expression measurements over functional or otherwise naturally defined sets of genes. For example, Chuang
et al. [
5] used average differential expressions in protein-protein interaction subnetworks as markers for distinguishing metastatic from non-metastatic breast cancer using the Wang
et al. [
3] and van de Vijver
et al. [
6] data sets. The stability (overlap) between the two independent datasets of their selected critical genes is 12.7%, whereas it is 7% for critical individual gene markers. This overlap was calculated between 906 and 618 critical genes obtained from the Wang and van de Vijver data sets, respectively [
5]. However, protein-protein networks do not yield canonical coherent gene subsets as pathways do, and such critical gene sets change with individual experiments. In this regard, pathways, being the most documented of protein interactions, yield stable sets of functional relationships related with molecular biological activities such as metabolic, signaling, protein interaction and gene regulation processes [
7]. Nevertheless, pathway-based aggregation of gene information is one among a number of ways of incorporating gene-gene relationships in augmenting predictive performance, and protein-protein interaction-based as well as coexpression-based aggregations have been shown in various contexts to improve performance of classification methods [
5].
What distinguishes this work from the above is our goal to obtain collections of biomarkers which are not only discriminative between phenotypes, but are also canonical, in that they come from standardized gene sets in the form of pathways and other functional sets. This in turn fits into a view toward clinical applications in which comprehensible and reproducible biomarkers can be extracted and used for phenotype prediction, using existing standardized gene set enrichment algorithms.
With the recent availability of large quantities of pathway information such as KEGG (Ogata
et al., 1999), GeneGo (
http://www.genego.com) and BioCarta (
http://www.biocarta.com[
8]), pathway-based analysis has been used [
9-
11] to perform classification of expression profiles and also applied to discriminate different classes of disease. Class distinction based on differences in pathway activity can be more stable than distinction based on genes alone. For instance, 16 pathways overlapped between the 48 significant pathways obtained from the study of Dressman
et al. [
2] and the 17 pathways from Crijns
et al. [
1], using the methodology in Crijns
et al. for significant pathway identification. Thus, pathway markers (and perhaps other gene set markers) are more reproducible than individual genes selected from expression profiles. A growing body of research has focused on pathway-based classification, and has often presented comparable or better performance of classification than gene-based classifiers [
9-
11]. For example, Guo
et al. [
9] proposed the use of mean or median gene expression values in gene ontology (GO) modules [
12] to infer module activity. Recently, Su
et al. [
11] proposed a classification method based on probabilistic inference of pathway activity.
Other pathway-level analyses which have led to classification methods for cancer phenotypes include the work of Lee
et al. [
10], who identified core genes in pathways as differentiators of disease phenotypes. Vaske
et al. [
13] inferred pathway-level perturbations in cancer tissue based on omics-level analyses of individual genes using a factor graph model, yielding inferences regarding survival outcomes. Breslin
et al. [
14] used pathway signatures based on activations of their downstream genes to classify cancer samples. Svensson
et al. [
15] used pathway signatures to differentiate radiation toxicity responses in irradiated tissues involved in the treatment of prostate cancer.
The above uses of pathway-level inference to classify phenotypes provide evidence of the usefulness of better and more stable (reproducible) biomarkers related to gene expression measurements. In this paper we focus on measuring the stability of pathway-based biomarkers and evaluating protocols for using standardized (as well as some new) gene set enrichment algorithms to directly and automatically infer pathway activation levels as stable features able to discriminate cancer phenotypes. This can provide methods for supplementing or replacing gene-level activation features with additional features which form new biomarkers based directly on pathway-level activations.
Our use of the term pathway activation (or activity) markers parallels usage in other references [
10,
14] and denotes transcriptional activity of genes in these pathways which act coherently within them. In particular this term is not used in the biological sense that the pathway has been activated into producing downstream products, which generally may be the case in one of its differentiating (e.g. case vs. control) states.
The result of precisely quantifying such pathway-level activity can be new standardized sets of biomarkers (‘pathway-based expression arrays’) which we will show to be consistently more reproducible across different data sets. We study the stability properties of these biomarkers, in addition to their discriminatory abilities among phenotypes. A desired application of these results involves eventual clinical uses which will be capable of using off the shelf enrichment algorithms for genes (e.g. GSEA, [
16]) to automatically infer phenotype differences based on pathway (and other gene set) -level feature vectors.
The examples in this paper use RNA expression arrays from microarrays. However, current new methods of obtaining mRNA expression levels from RNA-Seq abundances (using methods such as Cufflinks [
17] and NEUMA [
18]) can also produce gene-level (mRNA) expression arrays which can be analyzed in the same way using the pathway-aggregated methods (e.g. GSEA and related algorithms) discussed in this paper. In addition, with newer RNA-Seq data, other groups of biologically related RNA transcripts may also be candidates for aggregation of individual RNA abundance levels in the same way as is done here for mRNA levels of pathway genes.
In this paper we obtain stable mRNA expression-based markers at two levels, the pathway level and the gene level. Given a stable pathway P involved in a given cancer phenotype, we can also identify stable gene sets based on P, by identifying the most discriminative genes in P, again using a standard enrichment algorithm such as GSEA. (In the latter case the discriminative genes can be identified by finding the leading edge genes in P).
We remark that the use of multi-level hierarchical feature aggregates, with individual genes at the first level and initial gene aggregates (pathways) at the next level, is an instance of a machine learning approach which organizes individual features into a
hierarchical feature structure. The application of an SVM classifier to features derived from such a structure (as is done here) is known as a
hierarchical SVM. A general hierarchical feature structure organizes individual features
xi in a feature vector x

=

(
x1,…,
xn) into a tree hierarchy. In such a hierarchy the first level (e.g., gene-level) features form the leaves of the tree, with aggregate features forming the second level, and higher order aggregates recursively forming the upper structure of the tree. A more general example of such a structure (than the present pathway-based hierarchy) involves, for a given gene set, a tree structure based on a gene ontology (GO) [
12] tree, which will be studied in future work.
We remark in addition that since our primary purpose here is to develop protocols involving canonical biomarkers (markers obtained using standard methodologies such as GSEA that are also stable across datasets and platforms) for identifying cancer phenotypes, the strategy of using pathway-level aggregate features would require focusing on biomarkers within a given pathway P, even if there are better markers outside of this or perhaps any other pathway. A benefit of this is indeed that such additional associated genes in P would in fact be 'false negatives,' since they are ostensibly weak classifiers which nevertheless cooperate with genes in a pathway P that has had its role in the phenotype established.
Gene set enrichment analysis (GSEA) proposed by Subramanian
et al. [
16] and extended by Hung
et al. [
19] to include network topology is a useful tool for pathway-based class distinction. Briefly, GSEA determines whether genes from a particular pathway or some other predefined gene set are significantly overrepresented in the set of genes that are most differentially expressed. Since the differentially expressed genes are rank ordered, the procedure also returns so-called leading edge genes for which enrichment has its maximum value. The leading edge genes of a given pathway are those genes which as a group maximally differentiate the gene expression signatures (in the pathway) of the two classes being studied [
16].
The approach in this paper is to use prior classes of gene groups (e.g., KEGG pathways, functional gene sets from the Molecular Signatures Databases; MSigDB [
16]) to identify group-based biomarkers which can reliably and reproducibly classify medical and biological samples. We will show this using classifications based on the 200 curated KEGG pathways and 522 functional gene sets from MSigDB [
16] collected from eight data sets and experimental literature, and compare these with classifications based on direct gene expression profiles. This methodology can provide (pathway-based) biomarkers with significantly increased stability for differentiating cancer as well as other phenotypes, as is shown here for ovarian and breast cancer (see results and discussion). In addition to the primary aim of greatly increased marker stability, the predictive accuracy using these biomarkers shows overall improvement (again for survival and metastasis prediction).
The pathway-based biomarkers related to ovarian cancer survival time as well as to breast cancer metastasis yield important stable pathways in both types of datasets (see below), most of which have had independently demonstrated involvement in cancer. In addition, we identify stable pathways between studies on breast cancer on the one hand and ovarian cancer on the other (the cytokine-cytokine receptor interaction, type 1 diabetes mellitus, and hedgehog signaling pathways), all three of which have been demonstrated in a number of independent studies to have significant involvement in cancer (see Results/Discussion). For the diabetes mellitus pathway in particular, the cancer-related immune system arm (involving the HLA family of genes) is the only portion capable of differentiating survival time in ovarian cancer or in determining metastasis in breast cancer.
Ovarian cancer is the fifth leading cause of death from gynecological malignancy in the United States and Western Europe [
20]. The major histological subtypes are classified as mucinous, serous, endometrioid, and clear cell. Among these, serous carcinoma is the most common ovarian cancer subtype [
21]. Carcinoma of the ovary is subclassified into 4 tumor stages defined by how that disease has spread (metastasized) from the original site to other parts of the body. The guidelines defining stages are provided from the International Federation of Gynecology and Obstetrics (FIGO). In particular, in stage IV, cancer has spread beyond the abdomen into tissues in the liver and other organs (NCI, FIGO). Some 75% of all patients with ovarian cancer are commonly diagnosed with stage III/IV disease, for which the 5-year survival rate is only 5% - 30%, with 21 months being the average survival time [
1]. Consequently, a convenient diagnostic for early stage disease [
1,
2] that could be routinely applied has the potential to prolong survival by enabling early intervention. Similarly, the identification of biomarkers that would distinguish short from long term survivors who are on the same therapeutic regimen can help guide therapy.
Similarly, breast cancer has a large impact on health, being the primary cancer which strikes women overwhelmingly, and the second leading cause of cancer death in women. There is high variability in the outcomes and responses to therapies among patients in the same stage of this disease [
22,
23]. Though hormonal and chemo-therapies can significantly reduce the chances of metastases in this disease [
24], predictions of such outcomes based on genomic information have not been implemented in a standardizable way.
Identification of biomarkers differentiating stages or expected survival terms in ovarian cancer, as well as metastasis in breast cancer, can help guide treatment. In this paper, we suggest our standardized pathway-based approach to biomarker analysis for discriminating cancer subtypes as a research tool with strong potential for clinical applications. The methodologies in this paper are validated in the identification of specific stable pathway markers (see below) which differentiate cancer phenotypes and validate previous studies identifying the same pathways and their component genes as significantly linked to cancer.