|Home | About | Journals | Submit | Contact Us | Français|
This study presents a novel network methodology to identify prognostic gene signatures. Implication networks based on prediction logic are used to construct genome-wide coexpression networks for different disease states. From the differential components associated with specific disease states, candidate genes that are co-expressed with major disease signal hallmarks are selected. From these candidate genes, top genes that are the most predictive of clinical outcome are identified using univariate Cox model and Relief algorithm. Using this approach, a 13-gene lung cancer prognosis signature was identified, which generated significant prognostic stratifications (log-rank P < 0.05) in Director’s Challenge Study (n=442).
Lung cancer is the leading cause of cancer-related deaths in industrialized countries. Non-small cell lung cancer (NSCLC) accounts for about 80% of lung cancer cases. Currently, surgery is the major treatment option for patients with stage I NSCLC. However, 35–50% of stage I NSCLC patients will develop recurrence and die within 5 years (Hoffman, Mauer, and Vokes, 2000). It remains an unsolved challenge for physicians to reliably identify patients at high risk for recurrence as candidates for chemotherapy. A few studies have described transcriptional profiling for lung cancer prognosis (Chen et al., 2007; Potti et al., 2006; Shedden et al., 2008). Nevertheless, there is no clinically applied gene test for this deadly disease.
The accurate assessment of disease progression in individual patients is a critical prerequisite in personalized medicine. With the completion of the Human Genome Project, the emphasis of biomarker identification has shifted from cataloguing the “parts list” of signature genes and proteins to elucidating the networks of interactions that take place among them (Ideker and Sharan, 2008). Molecular network analysis had been shown to be useful in disease classification (Chuang et al., 2007) and identification of novel therapeutic targets (Csermely, Agoston, and Pongor, 2005). Nevertheless, major challenges have been the development of methods for efficiently constructing genome-scale coexpression networks and the identification of a particular set of markers, from among the enormous number of potential markers, that has the highest predictive ability for disease outcome (Sotiriou and Piccart, 2007). In this study, we hypothesised that the combined analysis of disease-mediated genome-wide coexpression networks, hallmark signalling pathways, and clinical approaches would lead to more informed clinical decision-making. This study will focus on the molecular prognosis of lung cancer relapse and metastasis.
In current genome-wide expression studies, genes are ranked according to their association with the clinical outcome, and the top-ranked genes are included in the classifier. It has been noted that individual biomarkers showing strong association with disease outcome are not necessarily good classifiers (Emir et al., 1998). Genes and proteins do not function in isolation, but rather interact with one another to form modular machines (Hartwell et al., 1999). Molecular network analysis has led to promising applications in identifying new disease genes (Emilsson et al., 2008) and disease-related subnetworks (Calvano et al., 2005), and classifying diseases (Chuang et al., 2007).
Boolean networks can provide important biological insights into regulation functions (Albert and Othmer, 2003). Nevertheless, as the number of global states is exponential in the number of entities and the analysis relies on an exhaustive enumeration of all possible trajectories, this method is computationally expensive and only practical for small networks (Karlebach and Shamir, 2008). A recent formalism, causal Bayesian belief networks, have been utilized to model cellular networks (Friedman, 2004). Nevertheless, the number of possible networks is exponential in the number of nodes under consideration, which makes it impossible to evaluate all possible networks. Furthermore, it is not always possible to determine the causal relationships between nodes, i.e., the direction of the edges, owing to a property known as Markov equivalence (Zhu et al., 2008). More importantly, the acyclic Bayesian network structure was unable to model feedback loops, which are essential in signalling pathways (Sachs et al., 2005) and genetic networks (Milo et al., 2002; Milo et al., 2004; Wuchty, Oltvai, and Barabasi, 2003). To overcome this limitation, a more complex scheme, dynamic Bayesian networks, was explored for modelling temporal microarray data (Kim, Imoto, and Miyano, 2003; Pe’er et al., 2001).
As an alternative to Bayesian networks, an implication network model employs a partial order knowledge structure (POKS) for structural learning and uses the Bayesian theory for inference propagation (Desmarais, Maluf, and Liu, 1996; Desmarais, Meshkinfam, and Gagnon, 2006). An implication network is a general methodology for reasoning under uncertainty. POKSs are closed under union and intersection of implication relations, and have the formal properties of directed acyclic graphs. The constraints on the partial order can be entirely represented by AND/OR graphs (Desmarais, Maluf, and Liu, 1996; Falmagne et al., 1990). When the constraints on the partial order are relaxed, the implication networks can represent cyclic relations among the nodes. In this condition, the implication network structure is a directed graph with nodes connected by implication (causal) rules, which can contain cycles such as feedback loops.
Liu and Desmarais (1997) presented the first implication network formalism based on binomial distribution. Boolean implication networks (Sahoo et al., 2008) used scatter plots of expression between two genes to induce the implication relations. In this study, we developed an induction algorithm based on prediction logic (Guo, Cukic, and Singh, 2003) to derive implication relations. The implication network was employed for efficient construction of disease-mediated genome-wide coexpression networks for the identification of prognostic gene signatures. The prognostic performance of the identified gene signature was evaluated by comparing with clinical covariates and other gene expression signatures. Furthermore, functional pathway analysis was done to confirm the biological relevance of the identified gene signature.
The first implication network induction algorithm is based on binomial distribution, which is suitable for binary datasets (Liu and Desmarais, 1997). We proposed an alternative network induction algorithm based on prediction logic (Guo, Cukic, and Singh, 2003), which is applicable for more general applications, including multinomial datasets and multi-classification problems. Prediction logic reveals the implication relationships among variables in a dataset and evaluates propositions in formal logic. Prediction logic integrates formal logic theory and statistics to build a convenient predictive structure for a dataset. The most important aspect of prediction logic is the conceptual value of prediction analysis in constructing and evaluating useful statements, particularly in complex multinomial problems with moderate sample sizes. This feature is essential for clinical applications, in which many clinical parameters are multinomial and the patient sample size is small.
We used prediction logic based on formal logic rules relating two dichotomous variables to induct the implication network. The six most important implication rules relating two dichotomous variables are shown in Fig. 1, where each table is a contingency table and the shaded cells represent the errors for the corresponding implication rule. For example, A∧¬B is the error cell for the implication rule A B, NA∧¬B represents the number of error occurrences.
Our algorithm modified U-Optimality method (Hildebrand, Laing, and Rosenthal, 1977) (Fig. 2), and was used to derive the implication relation between each pair of variables in the dataset. In the implication induction algorithm (Fig. 2), Up is the scope of the implication rule, representing the portion of the data covered by the implication relation, and p is the precision of the implication rule, representing the prediction success of the corresponding implication relation. An implication rule has high precision when the number of error occurrences is a small portion of the data covered by the implication rule. The minimum scope and precision required by the implication rule are indicated respectively by Umin and min, which must be positive for a valid implication relation. The induction algorithm derives an implication rule if it has the maximum scope, Up and it satisfies the constraint that its scope, Up and precision, p are greater than the required minimum values, Umin and min, respectively. To simplify the computations of the maximization problem, the ij value of every error cell must be greater than that of the non-error cells for the corresponding implication rule (Guo, Cukic, and Singh, 2003).
For a single error cell, where Nij is the number of error occurrences, scope Up, and precisionp are defined as:
For multiple error cells, they are defined as:
where ωij = 1 for error cells; otherwise, ωij = 0.
This implication induction algorithm is general for discrete datasets. With the expansion of the contingency table Mij (Fig. 2), implication rules can be induced for multinomial datasets, where error cells are those with top precision (ij values) and satisfying all the constraints. The proposition can then be induced according to the error set.
The complexity of the induction algorithm is O(Nv2), where N is the sample size and v is the number of variables in the dataset (i.e. nodes in the implication networks) (Guo, Cukic, and Singh, 2003). The difference between this algorithm and the original U-Optimality (Hildebrand, Laing, and Rosenthal, 1977) is that minimum requirements for deriving an implication rule were set for both scope (Up) and precision (p ), instead of for precision alone.
From the set of prognostic genes identified from implication network methodology, Relief was used to rank these genes with software WEKA 3.4 (Witten and Frank, 2005) in order to select the most predictive genes. Relief evaluates the importance of a variable by repeatedly sampling an instance and checking the value of the given variable for the nearest instance from the same and different classes. The values of the attributes of the nearest neighbours are compared to the sampled instance and used to update the relevance scores for each attribute. As approximated in following equation, Relief computes the weight of attribute A as:
Relief assigns more weight to those attributes that have the same value for instances from the same class and differentiate from instances in different classes (Hall and Holmes, 2003; Witten and Frank, 2005).
A proprietary web-based software Ingenuity Pathway Analysis (IPA, Ingenuity® Systems1) were used to derive curated molecular interactions, including both physical and functional interactions, and pathway relevance reported in the literature. The databases and software toolsets weigh and integrateinformation from numerous sources, including experimental repositories and text collections from published literature. Core analysis was used to identify significant biological processes and functions from the merged network related to the identified 13-gene signature in human tissues and cell lines.
Gene expression profiles quantified with Affymetrix HG-U133A on 442 lung adenocarcinoma samples from a published study (Shedden et al., 2008) were used in this study. This study cohort is composed of four data sets (University of Michigan, H. Lee Moffitt Cancer Center, Memorial Sloan-Kettering Cancer Center, and Dana-Farber Cancer Institute) contributed by six institutions. Tumours were collected by surgical resection from patients who have provided consent and protocols were approved by the Institutional Review Boards (IRB-Med) of the respective institutions. None of the patients received preoperative chemotherapy or radiation and least two years of follow-up information was available. Regions containing a minimum of 60% tumour cellularity were required for macro-dissection, and in most instances tumour cellularity of at least 70–90% was identified for inclusion in the sample for RNA isolation. The raw microarray data are available from caArray website2. The data used in the analysis was quantile-normalized and log2-transoformed with dChip (Li, 2008).
In this study, patient samples from UM and HLM formed the training set (n = 256), whereas samples from MSK (n = 104) and DFCI (n = 82) constituted two independent test sets. Genes with missing measurements in at least half of the samples were removed from analysis. Furthermore, for genes measured with multiple probes, the average expression of the duplicates was used to represent the expression profile of a unique gene. This gave 12,566 unique genes for the implication network analysis.
To construct implication networks, the mean expression of each gene in a patient cohort was used as a cut-off to partition the expression profiles. If the expression of a gene in a patient sample was greater than the mean in the cohort, this gene was denoted as up-regulated in this tumour sample; otherwise, it was denoted as down-regulated in the tumour sample. In the training set, patients who died within 5 years were labelled as poor-prognosis (n = 125), and those who survived 5 years after surgery were labelled as good-prognosis (n = 104). Censored cases (those with follow-up of less than 5 years) were removed from the analysis (n = 27). For each patient group in the training set, a genome-scale coexpression network was constructed using the implication induction algorithm. Between each pair of genes, possible significant (P < 0.05; one-sided z-tests of Umin and min ) coexpression relations were derived in each patient group, constituting disease-mediated gene coexpression networks. By comparing the implication rules connecting each pair of nodes between the two networks, disease-specific differential network components were identified. These differential components contain the coexpression relations (interactions) that were either present in the poor-prognosis group but missing in the good-prognosis group, or conversely, those present in the good-prognosis group but missing in the poor-prognosis group.
Next, genes displaying direct co-regulation with major NSCLC signal proteins were identified from the disease-specific differential network components. The signal proteins included in the study were retrieved from the human NSCLC signalling pathways delineated by the KEGG pathway database3. Genes of a significant (P < 0.05; z-tests) coexpression relation with MET, EGF, KRAS, TP53, E2F2, and E2F4 were pinpointed from the differential components associated with each prognosis group. As a result, 76 genes were identified from the poor-prognosis group, 58 genes from the good-prognosis group, and 9 genes common in both groups, yielding a set of 125 genes. The number of nodes (genes) and edges (interactions) in the networks associated with specific prognostic groups in each analytical step are shown in Fig. 3.
We sought to evaluate whether the genes identified from the proposed implication network analysis could generate accurate prognostic prediction. From the training set of the original continuous microarray data, 24 probes out of the 125 genes selected in the previous steps were significantly associated with overall survival (P < 0.05, univariate Cox modelling). These 24 significant probes were ranked by Relief (Witten and Frank, 2005) and a step-wise forward selection was used to identify the subset with the highest prognostic accuracy. Specifically, starting from the top ranked gene, one gene was added at each step to the gene set, until the prognostic accuracy could not be improved by adding more genes. As a result, the top 13 genes were identified as the most accurate prognostic gene signature (Table 1).
Multivariate Cox proportional hazard model was fitted with the 13 genes as covariates on bootstrapped training samples for 1,000 times. The average of the 1,000 coefficients obtained for each covariate was used to represent the final coefficients in the training model. Using the training model, a survival risk score was generated for each patient. A risk score of 8.87 was identified as a cut-off value for patient stratification in the training set (Fig. 4A). This training model and cut-off value were then applied to the two validation sets to generate prognostic categorization without re-estimating parameters (Fig. 4B and 4C). In all three patient cohorts, this scheme stratified patients into prognostic groups with distinct post-operative overall survival (log-rank P < 0.006, Kaplan-Meier analyses).
When high-risk group is defined as a group of patients who survived 5 years or less, and low-risk group as a group of patients who survived 5 years or longer, this model achieved sensitivity (correctly predicted high-risk patients) of 52% in the training set, 67.65% in MSK, and 67.86% in DFCI; and specificity (correctly predicted low-risk patients) of 77.88% in the training set, 51.61 % in MSK, and 61.11% in DFCI.
Current treatment options for patients with NSCLC are given based on AJCC tumour stage. Surgical resection is the major treatment option for stage I NSCLC patients. However, about 35–50% of stage I NSCLC patients will develop and die from tumour recurrence within the five years following surgery (Hoffman, Mauer, and Vokes, 2000; Naruke et al., 1988). On the other hand, stage IB patients who received surgical resection followed by adjuvant chemotherapy showed improved survival rate (Lu et al., 2006). This indicates that certain patients with stage I NSCLC are at high risk for developing recurrent diseases. Thus, we sought to explore whether the constructed 13-gene prognostic model could identify specific high-risk patients with stage I tumours.
Results show that the 13-gene prognostic signature could identify high-risk patients with stage I tumours on both the training set and two test sets (log-rank P ≤ 0.05; Fig. 5A–5C). The prognostic model also separated high- and low-risk groups within stage IB patients in the combined test sets (log-rank P = 1.47e-4; Fig. 5D). The 13-gene signature was able to generate significant prognostic stratification on the stage IA patients on the training set but not the test set (results not shown).
These results demonstrate that the constructed prognostic model provides more refined prognosis than the current AJCC staging system. Using this model, patients with stage I NSCLC could be advised to either receive or spare from chemotherapy according to the expression profiles of the 13 prognostic genes.
The constructed 13-gene prognostic model was evaluated with common lung cancer prognostic factors to further validate the prognostic power of the model. The clinical factors studied include gender, age, tumour stage, smoking history, race, and tumour differentiation. The predicted 13-gene risk score on the combined testing cohorts (MSK and DFCI) was used as a covariate in the analysis. Hazard ratio of the 13-gene risk score represents the likelihood of death from lung cancer for predicted high-risk patients (with estimated probability of death within 5-year after surgery ≥ 50%) vs. predicted low-risk patients (with estimated probability of death < 20%), based on the estimated average rate of death associated with gene expression defined-risk scores (Fig. S1).
In the first multivariate Cox analysis with major prognostic factors (Table 2), tumour stage was the only factor significantly (P < 0.00006) associated with elevated risk of lung cancer death (recurrence) when the model was fitted without the 13-gene risk score. When the 13-gene risk score was added to the multivariate Cox model, the 13-gene risk score demonstrated a strong association with the lung cancer survival (hazard ratio = 2.25, 95% CI: [1.28, 3.98]), and tumour stage remained significant (Table 2). Similarly, a comprehensive evaluation was carried out with all available clinical covariates and demographic data in the dataset, including smoking history, race, and tumour differentiation (Table 3). In this comprehensive evaluation, the 13-gene risk score remained a highly significant prognostic factor with a hazard ratio of 2.28 (95% CI: [1.23, 4.21]). Furthermore, in the comprehensive multivariate analysis, the hazard ratios of the 13-gene risk score algorithm were higher than other clinical covariates except tumour stage (III vs. I). These results demonstrate that the 13-gene signature is a more accurate prognostic factor than some commonly used clinical parameters.
Multiple prognostic classifiers were evaluated in the Director’s Challenge Study (Shedden et al., 2008), There were twelve classifiers constructed with gene signatures alone. Five of the twelve analysed signatures were from previous studies on lung cancer molecular prognosis (Chen et al., 2007; Potti et al., 2006). These published lung cancer gene signatures were identified using traditional statistical and machine learning methods (Table S1). Among the twelve gene signatures compared in the Director’s Challenge Study, the best signature was reported as “method A” (referred to as “Shedden A” in this study), which contains about 9,591 genes/probes. In order to compare the prognostic performance of our gene signature with the best lung cancer gene signatures reported to date, the estimated hazard ratio and the concordance probability estimate (CPE) of the gene signatures in both test sets were evaluated. Hazard ratios greater than 1 indicate that patients with high predicted risk scores have poor clinical outcome. The model has strong predictive power if the CPE value is close to 1; CPE value close to 0.5 indicates that the model has poor predictive power (comparable to random prediction).
Results show that the 13-gene prognostic model gives comparative performance as “Shedden A”, and is better than all other lung cancer gene signatures. The 13-gene model and “Shedden A” are the only two models with hazard ratio significantly (P < 0.05) greater than 1 in patients with all tumour stages (Fig. 6A). CPE of the 13-gene model is close to 0.6 in patients samples with all stages (Fig. 6C), which is comparable with “Shedden A”. Nevertheless, “Shedden A” is composed of more than nine thousand genes, which would be infeasible to be implemented as a clinical prognostic test.
Having established the clinical relevance of the 13 prognostic genes identified, we sought to explore the functional involvement of this gene set in lung tumorigenesis and tumour progression. Curated molecular interactions related to the 13 genes were retried using functional pathway analysis tools, Ingenuity Pathway Analysis (IPA, Ingenuity® Systems). IPA results show that cancer is among the top five most significant disease and disorder functions in the network related to the 13 genes (Fig. 7A). Furthermore, four of the prognostic genes exhibit indirect interactions with major lung cancer signalling pathways, such as TP53 and EGFR (Fig. 7B). The functional pathway analysis suggests that the 13 genes are involved in lung cancer oncogenesis and tumour progression.
Our previous study identified a 12-gene signature using hybrid models combining differential expression analysis (SAM) and Relief algorithm (Wan et al., 2010). Both 12- and 13-gene signatures had significant hazard ratios in prognostic categorization for patients with all tumour stages in two test sets (MSK and DFCI) of the Director’s Challenge Cohorts. The overall accuracies of the 12- and 13-gene prognostic models were not significantly different in 5-year survival prediction (Table S2). For stage I patients, the 13-gene had a significant (P< 0.05) hazard ratio of 2.96 in DFCI set but not in MSK, whereas the hazard ratio of 12-gene signature was not significant in any test sets. Taken together, the 13-gene signature identified in this network-based study has better prognostic performance in stage I lung adenocarcinoma patients than our previously identified 12-gene signature using traditional statistical methods.
In this study, a significance level of P < 0.05 (one-sided z-tests of Umin and min ) for coexpression relation was used as cut-offs to define possible gene interactions in the network construction. Our approach in the first step of this methodology is not to attempt to produce a list of gene-gene interactions or co-expressions with which we can associate a robust measure of significance. In order to do this, we would need to implement one of two types of mechanism to control for multiple hypothesis testing. The first type of mechanism is to compute a raw P-value for each interaction and then to adjust those P-values using a correction such as Bonferroni or Benjamini-Hochberg test. These corrections tend to be highly conservative when there is a high degree of dependence between the hypotheses, which is clearly the case here when the set of hypotheses is all possible gene-gene interactions, and the consequent loss of statistical power would be too high. The second mechanism is to perform a large number of random permutations of the class labels of the data and to perform the algorithm for each permutation, and then to compare the actual results to the generated null distribution of data. This mechanism is prohibitively computationally expensive in this context.
Instead, our approach is as follows. Rather than determine pairs of genes between which interactions exist, for each pair of genes we determine which type of interaction is most strongly supported by the data. In other words, our methodology classifies a coexpression relation (if any) between a pair of genes, as represented in one of six common logical rules. Our cutoffs determined by Umin and min eliminate a relatively small number of pairs of genes for which no logical rule provides strong evidence. We then determine a list of differential components, which are interactions between pairs of genes that classify differently between the different disease states. This list will inevitably contain a number of false positives, for reasons outlined above. Instead of controlling for the number of false positives at this stage of the methodology, we subsequently filter the differential components through steps including selecting significant genes from univariate Cox models and then from genes ranked top in the Relief algorithm, as outlined in the previous sections. As a matter of fact, we have experimented using more stringent significance levels for Umin and min, and the size of the resultant differential components associated with different disease-states was not reduced remarkably (data not shown).
This study presents a novel network-based approach to identifying a 13-gene signature for lung cancer prognosis. The identified signature could accurately estimate post-operative survival in lung adenocarcinoma patients. Furthermore, the signature could stratify patients within stage I and specifically, stage IB, into distinct prognostic groups. The gene expression-defined risk score is a more accurate prognostic factor than commonly used clinical covariates. In functional pathway analysis, the 13 genes also exhibit strong associations with cancer signalling hallmarks and oncogenesis.
This study demonstrates that the implication network methodology based on prediction logic is suitable for constructing genome-wide coexpression networks for analyzing perturbed gene/protein expression patterns in different disease states. The disease-mediated differential network components may contain important information for the discovery of biomarkers and pathways with implications for targeted therapy and prognostic prediction. These results conclude that the presented implication network methodology can retrieve disease relevant gene coexpression patterns and is useful in the discovery of clinically important biomarkers. Furthermore, gene signatures identified with this novel network-based methodology provide better prognostic performance than those identified with traditional statistical and machine learning methods on the same datasets.
We thank Dr. David Beer at University of Michigan and Dr. Trey Ideker at University of California in San Diego for thoughtful discussions. This project is supported by NIH R01LM009500
Dr. Nancy Lan Guo is Assistant Professor of Community Medicine/Mary Babb Randolph Cancer Center at West Virginia University (WVU).
Ying-Wooi Wan is a Ph.D. candidate in Computer and Information Science at WVU.
Swetha Bose has M.S. in Electrical Engineering.
Dr. James Denvir is a Research Associate at WVU and Marshall University.
Dr. Michael L. Kashon is a Mathematical Statistician at Biostatistics and Epidemiology Branch of Health Effects Laboratory Division (HELD) in National Institute of Occupational Safety and Health (NIOSH).
Dr. Michael E. Andrew is a Mathematical Statistician at Biostatistics and Epidemiology Branch of HELD in NIOSH.
Software: The software package of implication networks is available at: http://www.hsc.wvu.edu/mbrcc/fs/GuoLab/products.asp
Nancy Lan Guo, Department of Community Medicine, Mary Babb Randolph Cancer Center, West Virginia University, Morgantown, WV 26506-9300, USA, Tel: 304-293-6455, Fax: 304-293-4667.
Ying-Wooi Wan, Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA.
Swetha Bose, Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA.
James Denvir, Department of Statistics, West Virginia University, Morgantown, WV 26506, USA.
Michael L. Kashon, Biostatistics and Epidemiology, National Institute of Occupational Safety and Health, 1095 Willowdale Road, Morgantown, WV 26505, USA.
Michael E. Andrew, Biostatistics and Epidemiology, National Institute of Occupational Safety and Health, 1095 Willowdale Road, Morgantown, WV 26505, USA.