Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Int J Comput Biol Drug Des. Author manuscript; available in PMC 2012 February 17.
Published in final edited form as:
Int J Comput Biol Drug Des. 2011; 4(1): 19–39.
Published online 2011 February 17. doi:  10.1504/IJCBDD.2011.038655
PMCID: PMC3095973

A novel network model identified a 13-gene lung cancer prognostic signature


This study presents a novel network methodology to identify prognostic gene signatures. Implication networks based on prediction logic are used to construct genome-wide coexpression networks for different disease states. From the differential components associated with specific disease states, candidate genes that are co-expressed with major disease signal hallmarks are selected. From these candidate genes, top genes that are the most predictive of clinical outcome are identified using univariate Cox model and Relief algorithm. Using this approach, a 13-gene lung cancer prognosis signature was identified, which generated significant prognostic stratifications (log-rank P < 0.05) in Director’s Challenge Study (n=442).

Keywords: Prognostic gene signature, lung cancer, implication networks, gene co-expression networks, signalling pathways

1 Introduction

Lung cancer is the leading cause of cancer-related deaths in industrialized countries. Non-small cell lung cancer (NSCLC) accounts for about 80% of lung cancer cases. Currently, surgery is the major treatment option for patients with stage I NSCLC. However, 35–50% of stage I NSCLC patients will develop recurrence and die within 5 years (Hoffman, Mauer, and Vokes, 2000). It remains an unsolved challenge for physicians to reliably identify patients at high risk for recurrence as candidates for chemotherapy. A few studies have described transcriptional profiling for lung cancer prognosis (Chen et al., 2007; Potti et al., 2006; Shedden et al., 2008). Nevertheless, there is no clinically applied gene test for this deadly disease.

The accurate assessment of disease progression in individual patients is a critical prerequisite in personalized medicine. With the completion of the Human Genome Project, the emphasis of biomarker identification has shifted from cataloguing the “parts list” of signature genes and proteins to elucidating the networks of interactions that take place among them (Ideker and Sharan, 2008). Molecular network analysis had been shown to be useful in disease classification (Chuang et al., 2007) and identification of novel therapeutic targets (Csermely, Agoston, and Pongor, 2005). Nevertheless, major challenges have been the development of methods for efficiently constructing genome-scale coexpression networks and the identification of a particular set of markers, from among the enormous number of potential markers, that has the highest predictive ability for disease outcome (Sotiriou and Piccart, 2007). In this study, we hypothesised that the combined analysis of disease-mediated genome-wide coexpression networks, hallmark signalling pathways, and clinical approaches would lead to more informed clinical decision-making. This study will focus on the molecular prognosis of lung cancer relapse and metastasis.

In current genome-wide expression studies, genes are ranked according to their association with the clinical outcome, and the top-ranked genes are included in the classifier. It has been noted that individual biomarkers showing strong association with disease outcome are not necessarily good classifiers (Emir et al., 1998). Genes and proteins do not function in isolation, but rather interact with one another to form modular machines (Hartwell et al., 1999). Molecular network analysis has led to promising applications in identifying new disease genes (Emilsson et al., 2008) and disease-related subnetworks (Calvano et al., 2005), and classifying diseases (Chuang et al., 2007).

Boolean networks can provide important biological insights into regulation functions (Albert and Othmer, 2003). Nevertheless, as the number of global states is exponential in the number of entities and the analysis relies on an exhaustive enumeration of all possible trajectories, this method is computationally expensive and only practical for small networks (Karlebach and Shamir, 2008). A recent formalism, causal Bayesian belief networks, have been utilized to model cellular networks (Friedman, 2004). Nevertheless, the number of possible networks is exponential in the number of nodes under consideration, which makes it impossible to evaluate all possible networks. Furthermore, it is not always possible to determine the causal relationships between nodes, i.e., the direction of the edges, owing to a property known as Markov equivalence (Zhu et al., 2008). More importantly, the acyclic Bayesian network structure was unable to model feedback loops, which are essential in signalling pathways (Sachs et al., 2005) and genetic networks (Milo et al., 2002; Milo et al., 2004; Wuchty, Oltvai, and Barabasi, 2003). To overcome this limitation, a more complex scheme, dynamic Bayesian networks, was explored for modelling temporal microarray data (Kim, Imoto, and Miyano, 2003; Pe’er et al., 2001).

As an alternative to Bayesian networks, an implication network model employs a partial order knowledge structure (POKS) for structural learning and uses the Bayesian theory for inference propagation (Desmarais, Maluf, and Liu, 1996; Desmarais, Meshkinfam, and Gagnon, 2006). An implication network is a general methodology for reasoning under uncertainty. POKSs are closed under union and intersection of implication relations, and have the formal properties of directed acyclic graphs. The constraints on the partial order can be entirely represented by AND/OR graphs (Desmarais, Maluf, and Liu, 1996; Falmagne et al., 1990). When the constraints on the partial order are relaxed, the implication networks can represent cyclic relations among the nodes. In this condition, the implication network structure is a directed graph with nodes connected by implication (causal) rules, which can contain cycles such as feedback loops.

Liu and Desmarais (1997) presented the first implication network formalism based on binomial distribution. Boolean implication networks (Sahoo et al., 2008) used scatter plots of expression between two genes to induce the implication relations. In this study, we developed an induction algorithm based on prediction logic (Guo, Cukic, and Singh, 2003) to derive implication relations. The implication network was employed for efficient construction of disease-mediated genome-wide coexpression networks for the identification of prognostic gene signatures. The prognostic performance of the identified gene signature was evaluated by comparing with clinical covariates and other gene expression signatures. Furthermore, functional pathway analysis was done to confirm the biological relevance of the identified gene signature.

2 Materials and Methods

2.1 Implication induction algorithm

The first implication network induction algorithm is based on binomial distribution, which is suitable for binary datasets (Liu and Desmarais, 1997). We proposed an alternative network induction algorithm based on prediction logic (Guo, Cukic, and Singh, 2003), which is applicable for more general applications, including multinomial datasets and multi-classification problems. Prediction logic reveals the implication relationships among variables in a dataset and evaluates propositions in formal logic. Prediction logic integrates formal logic theory and statistics to build a convenient predictive structure for a dataset. The most important aspect of prediction logic is the conceptual value of prediction analysis in constructing and evaluating useful statements, particularly in complex multinomial problems with moderate sample sizes. This feature is essential for clinical applications, in which many clinical parameters are multinomial and the patient sample size is small.

We used prediction logic based on formal logic rules relating two dichotomous variables to induct the implication network. The six most important implication rules relating two dichotomous variables are shown in Fig. 1, where each table is a contingency table and the shaded cells represent the errors for the corresponding implication rule. For example, A∧¬B is the error cell for the implication rule A[implies] B, NA∧¬B represents the number of error occurrences.

Figure 1
Six important implication rules relating two dichotomous variables

Our algorithm modified U-Optimality method (Hildebrand, Laing, and Rosenthal, 1977) (Fig. 2), and was used to derive the implication relation between each pair of variables in the dataset. In the implication induction algorithm (Fig. 2), Up is the scope of the implication rule, representing the portion of the data covered by the implication relation, and [nabla]p is the precision of the implication rule, representing the prediction success of the corresponding implication relation. An implication rule has high precision when the number of error occurrences is a small portion of the data covered by the implication rule. The minimum scope and precision required by the implication rule are indicated respectively by Umin and [nabla]min, which must be positive for a valid implication relation. The induction algorithm derives an implication rule if it has the maximum scope, Up and it satisfies the constraint that its scope, Up and precision, [nabla]p are greater than the required minimum values, Umin and [nabla]min, respectively. To simplify the computations of the maximization problem, the [nabla]ij value of every error cell must be greater than that of the non-error cells for the corresponding implication rule (Guo, Cukic, and Singh, 2003).

Figure 2
Implication induction algorithm for building coexpression networks.

For a single error cell, where Nij is the number of error occurrences, scope Up, and precision[nabla]p are defined as:

equation M1

For multiple error cells, they are defined as:

equation M2

where ωij = 1 for error cells; otherwise, ωij = 0.

This implication induction algorithm is general for discrete datasets. With the expansion of the contingency table Mij (Fig. 2), implication rules can be induced for multinomial datasets, where error cells are those with top precision ([nabla]ij values) and satisfying all the constraints. The proposition can then be induced according to the error set.

The complexity of the induction algorithm is O(Nv2), where N is the sample size and v is the number of variables in the dataset (i.e. nodes in the implication networks) (Guo, Cukic, and Singh, 2003). The difference between this algorithm and the original U-Optimality (Hildebrand, Laing, and Rosenthal, 1977) is that minimum requirements for deriving an implication rule were set for both scope (Up) and precision ([nabla]p ), instead of for precision alone.

2.2 Relief feature selection algorithm

From the set of prognostic genes identified from implication network methodology, Relief was used to rank these genes with software WEKA 3.4 (Witten and Frank, 2005) in order to select the most predictive genes. Relief evaluates the importance of a variable by repeatedly sampling an instance and checking the value of the given variable for the nearest instance from the same and different classes. The values of the attributes of the nearest neighbours are compared to the sampled instance and used to update the relevance scores for each attribute. As approximated in following equation, Relief computes the weight of attribute A as:

equation M3

Relief assigns more weight to those attributes that have the same value for instances from the same class and differentiate from instances in different classes (Hall and Holmes, 2003; Witten and Frank, 2005).

2.3 Functional pathway Analysis

A proprietary web-based software Ingenuity Pathway Analysis (IPA, Ingenuity® Systems1) were used to derive curated molecular interactions, including both physical and functional interactions, and pathway relevance reported in the literature. The databases and software toolsets weigh and integrateinformation from numerous sources, including experimental repositories and text collections from published literature. Core analysis was used to identify significant biological processes and functions from the merged network related to the identified 13-gene signature in human tissues and cell lines.

2.4 Microarray profiles and patient samples

Gene expression profiles quantified with Affymetrix HG-U133A on 442 lung adenocarcinoma samples from a published study (Shedden et al., 2008) were used in this study. This study cohort is composed of four data sets (University of Michigan, H. Lee Moffitt Cancer Center, Memorial Sloan-Kettering Cancer Center, and Dana-Farber Cancer Institute) contributed by six institutions. Tumours were collected by surgical resection from patients who have provided consent and protocols were approved by the Institutional Review Boards (IRB-Med) of the respective institutions. None of the patients received preoperative chemotherapy or radiation and least two years of follow-up information was available. Regions containing a minimum of 60% tumour cellularity were required for macro-dissection, and in most instances tumour cellularity of at least 70–90% was identified for inclusion in the sample for RNA isolation. The raw microarray data are available from caArray website2. The data used in the analysis was quantile-normalized and log2-transoformed with dChip (Li, 2008).

3 Results and Discussion

3.1 Identification of prognostic genes using implication networks

In this study, patient samples from UM and HLM formed the training set (n = 256), whereas samples from MSK (n = 104) and DFCI (n = 82) constituted two independent test sets. Genes with missing measurements in at least half of the samples were removed from analysis. Furthermore, for genes measured with multiple probes, the average expression of the duplicates was used to represent the expression profile of a unique gene. This gave 12,566 unique genes for the implication network analysis.

To construct implication networks, the mean expression of each gene in a patient cohort was used as a cut-off to partition the expression profiles. If the expression of a gene in a patient sample was greater than the mean in the cohort, this gene was denoted as up-regulated in this tumour sample; otherwise, it was denoted as down-regulated in the tumour sample. In the training set, patients who died within 5 years were labelled as poor-prognosis (n = 125), and those who survived 5 years after surgery were labelled as good-prognosis (n = 104). Censored cases (those with follow-up of less than 5 years) were removed from the analysis (n = 27). For each patient group in the training set, a genome-scale coexpression network was constructed using the implication induction algorithm. Between each pair of genes, possible significant (P < 0.05; one-sided z-tests of Umin and [nabla]min ) coexpression relations were derived in each patient group, constituting disease-mediated gene coexpression networks. By comparing the implication rules connecting each pair of nodes between the two networks, disease-specific differential network components were identified. These differential components contain the coexpression relations (interactions) that were either present in the poor-prognosis group but missing in the good-prognosis group, or conversely, those present in the good-prognosis group but missing in the poor-prognosis group.

Next, genes displaying direct co-regulation with major NSCLC signal proteins were identified from the disease-specific differential network components. The signal proteins included in the study were retrieved from the human NSCLC signalling pathways delineated by the KEGG pathway database3. Genes of a significant (P < 0.05; z-tests) coexpression relation with MET, EGF, KRAS, TP53, E2F2, and E2F4 were pinpointed from the differential components associated with each prognosis group. As a result, 76 genes were identified from the poor-prognosis group, 58 genes from the good-prognosis group, and 9 genes common in both groups, yielding a set of 125 genes. The number of nodes (genes) and edges (interactions) in the networks associated with specific prognostic groups in each analytical step are shown in Fig. 3.

Figure 3
The characteristics of the coexpression networks at each step in the identification of the 13-gene signature

3.2 Survival prediction using the identified prognostic genes

We sought to evaluate whether the genes identified from the proposed implication network analysis could generate accurate prognostic prediction. From the training set of the original continuous microarray data, 24 probes out of the 125 genes selected in the previous steps were significantly associated with overall survival (P < 0.05, univariate Cox modelling). These 24 significant probes were ranked by Relief (Witten and Frank, 2005) and a step-wise forward selection was used to identify the subset with the highest prognostic accuracy. Specifically, starting from the top ranked gene, one gene was added at each step to the gene set, until the prognostic accuracy could not be improved by adding more genes. As a result, the top 13 genes were identified as the most accurate prognostic gene signature (Table 1).

Table 1
The identified 13-gene lung cancer prognostic signature.

Multivariate Cox proportional hazard model was fitted with the 13 genes as covariates on bootstrapped training samples for 1,000 times. The average of the 1,000 coefficients obtained for each covariate was used to represent the final coefficients in the training model. Using the training model, a survival risk score was generated for each patient. A risk score of 8.87 was identified as a cut-off value for patient stratification in the training set (Fig. 4A). This training model and cut-off value were then applied to the two validation sets to generate prognostic categorization without re-estimating parameters (Fig. 4B and 4C). In all three patient cohorts, this scheme stratified patients into prognostic groups with distinct post-operative overall survival (log-rank P < 0.006, Kaplan-Meier analyses).

Figure 4
Prognostic prediction of lung cancer survival after surgery by the 13-gene prognostic model

When high-risk group is defined as a group of patients who survived 5 years or less, and low-risk group as a group of patients who survived 5 years or longer, this model achieved sensitivity (correctly predicted high-risk patients) of 52% in the training set, 67.65% in MSK, and 67.86% in DFCI; and specificity (correctly predicted low-risk patients) of 77.88% in the training set, 51.61 % in MSK, and 61.11% in DFCI.

3.3 Survival prediction on early stage lung cancer

Current treatment options for patients with NSCLC are given based on AJCC tumour stage. Surgical resection is the major treatment option for stage I NSCLC patients. However, about 35–50% of stage I NSCLC patients will develop and die from tumour recurrence within the five years following surgery (Hoffman, Mauer, and Vokes, 2000; Naruke et al., 1988). On the other hand, stage IB patients who received surgical resection followed by adjuvant chemotherapy showed improved survival rate (Lu et al., 2006). This indicates that certain patients with stage I NSCLC are at high risk for developing recurrent diseases. Thus, we sought to explore whether the constructed 13-gene prognostic model could identify specific high-risk patients with stage I tumours.

Results show that the 13-gene prognostic signature could identify high-risk patients with stage I tumours on both the training set and two test sets (log-rank P ≤ 0.05; Fig. 5A–5C). The prognostic model also separated high- and low-risk groups within stage IB patients in the combined test sets (log-rank P = 1.47e-4; Fig. 5D). The 13-gene signature was able to generate significant prognostic stratification on the stage IA patients on the training set but not the test set (results not shown).

Figure 5
Prognostic performance of the 13-gene model in stage I lung cancer

These results demonstrate that the constructed prognostic model provides more refined prognosis than the current AJCC staging system. Using this model, patients with stage I NSCLC could be advised to either receive or spare from chemotherapy according to the expression profiles of the 13 prognostic genes.

3.4 Prognosis evaluation with clinical covariates

The constructed 13-gene prognostic model was evaluated with common lung cancer prognostic factors to further validate the prognostic power of the model. The clinical factors studied include gender, age, tumour stage, smoking history, race, and tumour differentiation. The predicted 13-gene risk score on the combined testing cohorts (MSK and DFCI) was used as a covariate in the analysis. Hazard ratio of the 13-gene risk score represents the likelihood of death from lung cancer for predicted high-risk patients (with estimated probability of death within 5-year after surgery ≥ 50%) vs. predicted low-risk patients (with estimated probability of death < 20%), based on the estimated average rate of death associated with gene expression defined-risk scores (Fig. S1).

In the first multivariate Cox analysis with major prognostic factors (Table 2), tumour stage was the only factor significantly (P < 0.00006) associated with elevated risk of lung cancer death (recurrence) when the model was fitted without the 13-gene risk score. When the 13-gene risk score was added to the multivariate Cox model, the 13-gene risk score demonstrated a strong association with the lung cancer survival (hazard ratio = 2.25, 95% CI: [1.28, 3.98]), and tumour stage remained significant (Table 2). Similarly, a comprehensive evaluation was carried out with all available clinical covariates and demographic data in the dataset, including smoking history, race, and tumour differentiation (Table 3). In this comprehensive evaluation, the 13-gene risk score remained a highly significant prognostic factor with a hazard ratio of 2.28 (95% CI: [1.23, 4.21]). Furthermore, in the comprehensive multivariate analysis, the hazard ratios of the 13-gene risk score algorithm were higher than other clinical covariates except tumour stage (III vs. I). These results demonstrate that the 13-gene signature is a more accurate prognostic factor than some commonly used clinical parameters.

Table 2
Multivariate Cox proportional analysis of 13-gene risk score and major clinical covariates including gender, age, and tumor stage on the combined testing cohorts (MSK and DFCI).
Table 3
Multivariate Cox proportional analysis of all available clinical covariates and 13-gene risk score in the combined test cohorts (MSK and DFCI).

3.5 Comparison with other lung cancer gene signatures

Multiple prognostic classifiers were evaluated in the Director’s Challenge Study (Shedden et al., 2008), There were twelve classifiers constructed with gene signatures alone. Five of the twelve analysed signatures were from previous studies on lung cancer molecular prognosis (Chen et al., 2007; Potti et al., 2006). These published lung cancer gene signatures were identified using traditional statistical and machine learning methods (Table S1). Among the twelve gene signatures compared in the Director’s Challenge Study, the best signature was reported as “method A” (referred to as “Shedden A” in this study), which contains about 9,591 genes/probes. In order to compare the prognostic performance of our gene signature with the best lung cancer gene signatures reported to date, the estimated hazard ratio and the concordance probability estimate (CPE) of the gene signatures in both test sets were evaluated. Hazard ratios greater than 1 indicate that patients with high predicted risk scores have poor clinical outcome. The model has strong predictive power if the CPE value is close to 1; CPE value close to 0.5 indicates that the model has poor predictive power (comparable to random prediction).

Results show that the 13-gene prognostic model gives comparative performance as “Shedden A”, and is better than all other lung cancer gene signatures. The 13-gene model and “Shedden A” are the only two models with hazard ratio significantly (P < 0.05) greater than 1 in patients with all tumour stages (Fig. 6A). CPE of the 13-gene model is close to 0.6 in patients samples with all stages (Fig. 6C), which is comparable with “Shedden A”. Nevertheless, “Shedden A” is composed of more than nine thousand genes, which would be infeasible to be implemented as a clinical prognostic test.

Figure 6
Comparison of 13-gene prognostic model and various gene expression-defined models presented in the Director’s Challenge Study (Shedden et al. 2008) in two test sets in term of hazard ratio (A) and concordance probability estimate (B).

3.6 Functional pathway analysis

Having established the clinical relevance of the 13 prognostic genes identified, we sought to explore the functional involvement of this gene set in lung tumorigenesis and tumour progression. Curated molecular interactions related to the 13 genes were retried using functional pathway analysis tools, Ingenuity Pathway Analysis (IPA, Ingenuity® Systems). IPA results show that cancer is among the top five most significant disease and disorder functions in the network related to the 13 genes (Fig. 7A). Furthermore, four of the prognostic genes exhibit indirect interactions with major lung cancer signalling pathways, such as TP53 and EGFR (Fig. 7B). The functional pathway analysis suggests that the 13 genes are involved in lung cancer oncogenesis and tumour progression.

Figure 7
Functional pathways analysis of the 13 prognostic genes

3.7 Discussion

Our previous study identified a 12-gene signature using hybrid models combining differential expression analysis (SAM) and Relief algorithm (Wan et al., 2010). Both 12- and 13-gene signatures had significant hazard ratios in prognostic categorization for patients with all tumour stages in two test sets (MSK and DFCI) of the Director’s Challenge Cohorts. The overall accuracies of the 12- and 13-gene prognostic models were not significantly different in 5-year survival prediction (Table S2). For stage I patients, the 13-gene had a significant (P< 0.05) hazard ratio of 2.96 in DFCI set but not in MSK, whereas the hazard ratio of 12-gene signature was not significant in any test sets. Taken together, the 13-gene signature identified in this network-based study has better prognostic performance in stage I lung adenocarcinoma patients than our previously identified 12-gene signature using traditional statistical methods.

In this study, a significance level of P < 0.05 (one-sided z-tests of Umin and [nabla]min ) for coexpression relation was used as cut-offs to define possible gene interactions in the network construction. Our approach in the first step of this methodology is not to attempt to produce a list of gene-gene interactions or co-expressions with which we can associate a robust measure of significance. In order to do this, we would need to implement one of two types of mechanism to control for multiple hypothesis testing. The first type of mechanism is to compute a raw P-value for each interaction and then to adjust those P-values using a correction such as Bonferroni or Benjamini-Hochberg test. These corrections tend to be highly conservative when there is a high degree of dependence between the hypotheses, which is clearly the case here when the set of hypotheses is all possible gene-gene interactions, and the consequent loss of statistical power would be too high. The second mechanism is to perform a large number of random permutations of the class labels of the data and to perform the algorithm for each permutation, and then to compare the actual results to the generated null distribution of data. This mechanism is prohibitively computationally expensive in this context.

Instead, our approach is as follows. Rather than determine pairs of genes between which interactions exist, for each pair of genes we determine which type of interaction is most strongly supported by the data. In other words, our methodology classifies a coexpression relation (if any) between a pair of genes, as represented in one of six common logical rules. Our cutoffs determined by Umin and [nabla]min eliminate a relatively small number of pairs of genes for which no logical rule provides strong evidence. We then determine a list of differential components, which are interactions between pairs of genes that classify differently between the different disease states. This list will inevitably contain a number of false positives, for reasons outlined above. Instead of controlling for the number of false positives at this stage of the methodology, we subsequently filter the differential components through steps including selecting significant genes from univariate Cox models and then from genes ranked top in the Relief algorithm, as outlined in the previous sections. As a matter of fact, we have experimented using more stringent significance levels for Umin and [nabla]min, and the size of the resultant differential components associated with different disease-states was not reduced remarkably (data not shown).

4 Conclusions

This study presents a novel network-based approach to identifying a 13-gene signature for lung cancer prognosis. The identified signature could accurately estimate post-operative survival in lung adenocarcinoma patients. Furthermore, the signature could stratify patients within stage I and specifically, stage IB, into distinct prognostic groups. The gene expression-defined risk score is a more accurate prognostic factor than commonly used clinical covariates. In functional pathway analysis, the 13 genes also exhibit strong associations with cancer signalling hallmarks and oncogenesis.

This study demonstrates that the implication network methodology based on prediction logic is suitable for constructing genome-wide coexpression networks for analyzing perturbed gene/protein expression patterns in different disease states. The disease-mediated differential network components may contain important information for the discovery of biomarkers and pathways with implications for targeted therapy and prognostic prediction. These results conclude that the presented implication network methodology can retrieve disease relevant gene coexpression patterns and is useful in the discovery of clinically important biomarkers. Furthermore, gene signatures identified with this novel network-based methodology provide better prognostic performance than those identified with traditional statistical and machine learning methods on the same datasets.

Supplementary Material



We thank Dr. David Beer at University of Michigan and Dr. Trey Ideker at University of California in San Diego for thoughtful discussions. This project is supported by NIH R01LM009500



Dr. Nancy Lan Guo is Assistant Professor of Community Medicine/Mary Babb Randolph Cancer Center at West Virginia University (WVU).


Ying-Wooi Wan is a Ph.D. candidate in Computer and Information Science at WVU.


Swetha Bose has M.S. in Electrical Engineering.


Dr. James Denvir is a Research Associate at WVU and Marshall University.


Dr. Michael L. Kashon is a Mathematical Statistician at Biostatistics and Epidemiology Branch of Health Effects Laboratory Division (HELD) in National Institute of Occupational Safety and Health (NIOSH).


Dr. Michael E. Andrew is a Mathematical Statistician at Biostatistics and Epidemiology Branch of HELD in NIOSH.

Contributor Information

Nancy Lan Guo, Department of Community Medicine, Mary Babb Randolph Cancer Center, West Virginia University, Morgantown, WV 26506-9300, USA, Tel: 304-293-6455, Fax: 304-293-4667.

Ying-Wooi Wan, Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA.

Swetha Bose, Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA.

James Denvir, Department of Statistics, West Virginia University, Morgantown, WV 26506, USA.

Michael L. Kashon, Biostatistics and Epidemiology, National Institute of Occupational Safety and Health, 1095 Willowdale Road, Morgantown, WV 26505, USA.

Michael E. Andrew, Biostatistics and Epidemiology, National Institute of Occupational Safety and Health, 1095 Willowdale Road, Morgantown, WV 26505, USA.


  • Albert R, Othmer HG. The topology of the regulatory interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster. J Theor Biol. 2003;223(1):1–18. [PubMed]
  • Calvano SE, Xiao W, Richards DR, Felciano RM, Baker HV, Cho RJ, Chen RO, Brownstein BH, Cobb JP, Tschoeke SK, Miller-Graziano C, Moldawer LL, Mindrinos MN, Davis RW, Tompkins RG, Lowry SF. A network-based analysis of systemic inflammation in humans. Nature. 2005;437(7061):1032–1037. [PubMed]
  • Chen HY, Yu SL, Chen CH, Chang GC, Chen CY, Yuan A, Cheng CL, Wang CH, Terng HJ, Kao SF, Chan WK, Li HN, Liu CC, Singh S, Chen WJ, Chen JJ, Yang PC. A five-gene signature and clinical outcome in non-small-cell lung cancer. N Engl J Med. 2007;356(1):11–20. [PubMed]
  • Chuang HY, Lee E, Liu YT, Lee D, Ideker T. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007;3:140. [PMC free article] [PubMed]
  • Csermely P, Agoston V, Pongor S. The efficiency of multi-target drugs: the network approach might help drug design. Trends Pharmacol Sci. 2005;26(4):178–182. [PubMed]
  • Desmarais MC, Maluf A, Liu J. User-expertise modeling with empirically derived probabilistic implication networks. User Modeling and User-Adapted Interaction. 1996;5(3–4):283–315.
  • Desmarais MC, Meshkinfam P, Gagnon M. Learned Student Models with Item to Item Knowledge Structures. User Modeling and User-Adapted Interaction. 2006;16(5):403–434.
  • Emilsson V, Thorleifsson G, Zhang B, Leonardson AS, Zink F, Zhu J, Carlson S, Helgason A, Walters GB, Gunnarsdottir S, Mouy M, Steinthorsdottir V, Eiriksdottir GH, Bjornsdottir G, Reynisdottir I, Gudbjartsson D, Helgadottir A, Jonasdottir A, Jonasdottir A, Styrkarsdottir U, Gretarsdottir S, Magnusson KP, Stefansson H, Fossdal R, Kristjansson K, Gislason HG, Stefansson T, Leifsson BG, Thorsteinsdottir U, Lamb JR, Gulcher JR, Reitman ML, Kong A, Schadt EE, Stefansson K. Genetics of gene expression and its effect on disease. Nature. 2008;452(7186):423–428. [PubMed]
  • Emir B, Wieand S, Su JQ, Cha S. Analysis of repeated markers used to predict progression of cancer. Stat Med. 1998;17(22):2563–2578. [PubMed]
  • Falmagne JC, Doignon JP, Koppen M, Villano M, Johannesen L. Introduction to knowledge spaces: how to build, test and search them. Psychological Review. 1990;97(2):201–224.
  • Friedman N. Inferring cellular networks using probabilistic graphical models. Science. 2004;303(5659):799–805. [PubMed]
  • Guo L, Cukic B, Singh H. Predicting Fault Prone Modules by the Dempster-Shafer Belief Networks. 18th IEEE International Conference on Automated Software Engineering (ASE’03); 2003. pp. 249–252.
  • Hall MA, Holmes G. Benchmarking Attribute Selection Techniques for Discrete Class Data Mining. IEEE Transactions on Knowledge and Data Engineering. 2003;15(3):1437–1447.
  • Hartwell LH, Hopfield JJ, Leibler S, Murray AW. From molecular to modular cell biology. Nature. 1999;402(6761 Suppl):C47–C52. [PubMed]
  • Hildebrand DK, Laing JD, Rosenthal H. Prediction Analysis of Cross Classifications. John Wiley & Sons; 1977.
  • Hoffman PC, Mauer AM, Vokes EE. Lung cancer. Lancet. 2000;355(9202):479–485. [PubMed]
  • Ideker T, Sharan R. Protein networks in disease. Genome Res. 2008;18(4):644–652. [PubMed]
  • Karlebach G, Shamir R. Modelling and analysis of gene regulatory networks. Nat Rev Mol Cell Biol. 2008;9(10):770–780. [PubMed]
  • Kim SY, Imoto S, Miyano S. Inferring gene networks from time series microarray data using dynamic Bayesian networks. Brief Bioinform. 2003;4(3):228–235. [PubMed]
  • Li C. Automating dChip: toward reproducible sharing of microarray data analysis. BMC Bioinformatics. 2008;9:231. [PMC free article] [PubMed]
  • Liu J, Desmarais MC. A Method of Learning Implication Networks from Empirical Data: Algorithm and Monte-Carlo Simulation-Based Validation. IEEE Transactions on Knowledge and Data Engineering. 1997;9(6):990–1004.
  • Lu Y, Lemon W, Liu PY, Yi Y, Morrison C, Yang P, Sun Z, Szoke J, Gerald WL, Watson M, Govindan R, You M. A gene expression signature predicts survival of patients with stage I non-small cell lung cancer. PLoS Med. 2006;3(12):e467. [PMC free article] [PubMed]
  • Milo R, Itzkovitz S, Kashtan N, Levitt R, Shen-Orr S, Ayzenshtat I, Sheffer M, Alon U. Superfamilies of evolved and designed networks. Science. 2004;303(5663):1538–1542. [PubMed]
  • Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U. Network motifs: simple building blocks of complex networks. Science. 2002;298(5594):824–827. [PubMed]
  • Naruke T, Goya T, Tsuchiya R, Suemasu K. Prognosis and survival in resected lung carcinoma based on the new international staging system. J Thorac Cardiovasc Surg. 1988;96(3):440–447. [PubMed]
  • Pe’er D, Regev A, Elidan G, Friedman N. Inferring subnetworks from perturbed expression profiles. Bioinformatics. 2001;17(Suppl 1):S215–S224. [PubMed]
  • Potti A, Mukherjee S, Petersen R, Dressman HK, Bild A, Koontz J, Kratzke R, Watson MA, Kelley M, Ginsburg GS, West M, Harpole DH, Jr, Nevins JR. A genomic strategy to refine prognosis in early-stage non-small-cell lung cancer. N Engl J Med. 2006;355(6):570–580. [PubMed]
  • Sachs K, Perez O, Pe’er D, Lauffenburger DA, Nolan GP. Causal protein-signaling networks derived from multiparameter single-cell data. Science. 2005;308(5721):523–529. [PubMed]
  • Sahoo D, Dill DL, Gentles AJ, Tibshirani R, Plevritis SK. Boolean implication networks derived from large scale, whole genome microarray datasets. Genome Biol. 2008;9(10):R157. [PMC free article] [PubMed]
  • Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE, Chang AC, Zhu CQ, Strumpf D, Hanash S, Shepherd FA, Ding K, Seymour L, Naoki K, Pennell N, Weir B, Verhaak R, Ladd-Acosta C, Golub T, Gruidl M, Sharma A, Szoke J, Zakowski M, Rusch V, Kris M, Viale A, Motoi N, Travis W, Conley B, Seshan VE, Meyerson M, Kuick R, Dobbin KK, Lively T, Jacobson JW, Beer DG. Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat Med. 2008;14(8):822–827. [PMC free article] [PubMed]
  • Sotiriou C, Piccart MJ. Taking gene-expression profiling to the clinic: when will molecular signatures become relevant to patient care? Nat Rev Cancer. 2007;7(7):545–553. [PubMed]
  • Wan YW, Sabbagh E, Raese R, Qian Y, Luo D, Denvir J, Vallyathan V, Castranova V, Guo NL. Hybrid models identified a 12-gene signature for lung cancer prognosis and chemoresponse prediction. PLoS ONE. 2010;5(8):e12222. [PMC free article] [PubMed]
  • Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. 2. Morgan Kaufmann; 2005.
  • Wuchty S, Oltvai ZN, Barabasi AL. Evolutionary conservation of motif constituents in the yeast protein interaction network. Nat Genet. 2003;35(2):176–179. [PubMed]
  • Zhu J, Zhang B, Smith EN, Drees B, Brem RB, Kruglyak L, Bumgarner RE, Schadt EE. Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nat Genet. 2008;40(7):854–861. [PMC free article] [PubMed]