Despite widespread interest in the application of next-generation sequencing (NGS) to the mutation profiling of individual cancer specimens, the onset of personalized clinical genomics is currently stalled due in part to technical hurdles. As tumors are genetically heterogeneous and often mixed with normal/stromal cells, the resulting low-abundance somatic DNA mutations often produce ambiguous results or fall below the current NGS detection limit, thus hindering mutation calling that abides by clinical sensitivity/specificity standards. Here we examine the feasibility of applying COLD-PCR, a form of PCR that selectively magnifies mutations, to boost the detection of unknown rare somatic mutations prior to applying NGS-based amplicon re-sequencing to clinical samples. We amplified DNA from mutation-containing human cell lines serially diluted into wild-type (WT) DNA, as well as from lung adenocarcinoma and colorectal cancer specimens, using COLD-PCR or conventional PCR for comparison. Following individual amplification of TP53, KRAS, IDH1, and EGFR regions, PCR products were barcoded, pooled for library preparation, and sequenced on the Illumina HiSeq 2000 platform. Regardless of sequencing depth, sequencing errors dictated a mutation-detection limit of ~1–2% mutation abundance in conventional PCR amplicons analyzed by NGS. In contrast, COLD-PCR amplicons enabled genuine mutations to exceed the sequence noise levels, thus allowing reliable identification of mutation abundances of ~0.04%. Sequencing depth was not a significant factor in the identification of COLD-PCR-magnified mutations. The analyzed clinical specimens revealed several TP53 and KRAS missense mutations that could not be called following NGS of conventional amplicons, yet were clearly detectable in COLD-PCR amplicons. Extensive tumor heterogeneity in the TP53 gene was revealed in some samples.
As cancer care shifts toward personalized intervention, based on the unique genetic abnormalities in each patient’s tumor genome, we anticipate that COLD-PCR-NGS will elucidate the role of rare mutations in tumors and enable both NGS-based analysis of diverse clinical specimens and the broad interfacing of NGS with clinical practice.
COLD-PCR; mutation enrichment; low-abundance mutations; next generation sequencing; cancer
Single sample predictors (SSPs) and Subtype classification models (SCMs) are gene expression–based classifiers used to identify the four primary molecular subtypes of breast cancer (basal-like, HER2-enriched, luminal A, and luminal B). SSPs use hierarchical clustering, followed by nearest centroid classification, based on large sets of tumor-intrinsic genes. SCMs use a mixture of Gaussian distributions based on sets of genes with expression specifically correlated with three key breast cancer genes (estrogen receptor [ER], HER2, and aurora kinase A [AURKA]). The aim of this study was to compare the robustness, classification concordance, and prognostic value of these classifiers with those of a simplified three-gene SCM in a large compendium of microarray datasets.
Thirty-six publicly available breast cancer datasets (n = 5715) were subjected to molecular subtyping using five published classifiers (three SSPs and two SCMs) and SCMGENE, the new three-gene (ER, HER2, and AURKA) SCM. We used the prediction strength statistic to estimate robustness of the classification models, defined as the capacity of a classifier to assign the same tumors to the same subtypes independently of the dataset used to fit it. We used Cohen κ and Cramer V coefficients to assess concordance between the subtype classifiers and association with clinical variables, respectively. We used Kaplan–Meier survival curves and cross-validated partial likelihood to compare prognostic value of the resulting classifications. All statistical tests were two-sided.
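The two agreement statistics named above can be illustrated with a minimal sketch. It assumes subtype calls are available as two equally long lists of labels; the function names are hypothetical and this is not the code used in the study.

```python
from collections import Counter
from math import sqrt


def contingency(a, b):
    """Count co-occurrences of labels from two equally long label lists."""
    if len(a) != len(b):
        raise ValueError("label lists must have the same length")
    return Counter(zip(a, b))


def cohen_kappa(a, b):
    """Cohen's kappa: agreement between two classifiers, corrected for chance."""
    n = len(a)
    table = contingency(a, b)
    observed = sum(c for (x, y), c in table.items() if x == y) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: product of the marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def cramers_v(a, b):
    """Cramer's V: strength of association between two categorical variables."""
    n = len(a)
    table = contingency(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    chi2 = 0.0
    for x in freq_a:
        for y in freq_b:
            exp = freq_a[x] * freq_b[y] / n
            chi2 += (table.get((x, y), 0) - exp) ** 2 / exp
    k = min(len(freq_a), len(freq_b)) - 1
    return sqrt(chi2 / (n * k))
```

Both functions divide by zero when either list contains a single label only, so real data would need a guard for that case.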
SCMs were statistically significantly more robust than SSPs, with SCMGENE being the most robust because of its simplicity. SCMGENE was statistically significantly concordant with published SCMs (κ = 0.65–0.70) and SSPs (κ = 0.34–0.59), was statistically significantly associated with ER status (V = 0.64), HER2 status (V = 0.52), and histological grade (V = 0.55), and yielded similarly strong prognostic value.
Our results suggest that adequate classification of the major and clinically relevant molecular subtypes of breast cancer can be robustly achieved with quantitative measurements of three key genes.
A major goal in translational cancer research is to identify biological signatures driving cancer progression and metastasis. A common technique applied in genomics research is to cluster patients using gene expression data from a candidate prognostic gene set, and if the resulting clusters show statistically significant outcome stratification, to associate the gene set with prognosis, suggesting its biological and clinical importance. Recent work has questioned the validity of this approach by showing in several breast cancer data sets that “random” gene sets tend to cluster patients into prognostically variable subgroups. This work suggests that new rigorous statistical methods are needed to identify biologically informative prognostic gene sets. To address this problem, we developed Significance Analysis of Prognostic Signatures (SAPS) which integrates standard prognostic tests with a new prognostic significance test based on stratifying patients into prognostic subtypes with random gene sets. SAPS ensures that a significant gene set is not only able to stratify patients into prognostically variable groups, but is also enriched for genes showing strong univariate associations with patient prognosis, and performs significantly better than random gene sets. We use SAPS to perform a large meta-analysis (the largest completed to date) of prognostic pathways in breast and ovarian cancer and their molecular subtypes. Our analyses show that only a small subset of the gene sets found statistically significant using standard measures achieve significance by SAPS. We identify new prognostic signatures in breast and ovarian cancer and their corresponding molecular subtypes, and we show that prognostic signatures in ER negative breast cancer are more similar to prognostic signatures in ovarian cancer than to prognostic signatures in ER positive breast cancer. 
SAPS is a powerful new method for deriving robust prognostic biological signatures from clinically annotated genomic datasets.
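The comparison against random gene sets can be sketched as an empirical p-value. This is only an illustration of that one idea, not the SAPS implementation: `score_fn` is a hypothetical stand-in for whatever prognostic statistic is computed for a gene set, and the number of random sets and the +1 correction are assumptions.

```python
import random


def empirical_p_value(observed_score, null_scores):
    """Share of null scores at least as large as the observed one (+1 correction)."""
    hits = sum(1 for s in null_scores if s >= observed_score)
    return (hits + 1) / (len(null_scores) + 1)


def random_gene_set_test(gene_set, all_genes, score_fn, n_random=1000, seed=0):
    """Compare a candidate gene set with random gene sets of the same size.

    score_fn is a hypothetical callable mapping a list of genes to a
    prognostic score, where larger means stronger outcome stratification.
    """
    rng = random.Random(seed)
    observed = score_fn(gene_set)
    null_scores = [
        score_fn(rng.sample(all_genes, len(gene_set))) for _ in range(n_random)
    ]
    return empirical_p_value(observed, null_scores)
```

A small p-value here only says that the candidate set scores better than same-sized random sets; the abstract describes this as one of several criteria that a significant gene set must meet.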
A major goal in biomedical research is to identify sets of genes (or “biological signatures”) associated with patient survival, as these genes could be targeted to aid in diagnosing and treating disease. A major challenge in using prognostic associations to identify biologically informative signatures is that in some diseases, “random” gene sets are associated with prognosis. To address this problem, we developed a new method called “Significance Analysis of Prognostic Signatures” (or “SAPS”) for the identification of biologically informative gene sets associated with patient survival. To test the effectiveness of SAPS, we use SAPS to perform a subtype-specific meta-analysis of prognostic signatures in large breast and ovarian cancer meta-data sets. This analysis represents the largest of its kind ever performed. Our analyses show that only a small subset of the gene sets found statistically significant using standard measures achieve significance by SAPS. We identify new prognostic signatures in breast and ovarian cancer and their corresponding molecular subtypes, and we demonstrate a striking similarity between prognostic pathways in ER negative breast cancer and ovarian cancer, suggesting new shared therapeutic targets for these aggressive malignancies. SAPS is a powerful new method for deriving robust prognostic biological pathways from clinically annotated genomic datasets.
Summary: The survcomp package provides functions to assess and statistically compare the performance of survival/risk prediction models. It implements state-of-the-art statistics to (i) measure the performance of risk prediction models; (ii) combine these statistical estimates from multiple datasets using a meta-analytical framework; and (iii) statistically compare the performance of competitive models.
Availability: The R/Bioconductor package survcomp is provided open source under the Artistic-2.0 License with a user manual containing installation, operating instructions and use case scenarios on real datasets. survcomp requires R version 2.13.0 or higher. http://bioconductor.org/packages/release/bioc/html/survcomp.html
Supplementary Information: Supplementary data are available at Bioinformatics online.
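As an illustration of what measuring the performance of a risk prediction model can look like, the sketch below computes a concordance index in the style of Harrell's C. It is plain Python rather than R, it does not use or reproduce the survcomp API, and the argument names are assumptions; the package manual remains the reference for what survcomp actually provides.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs that the risk score orders correctly.

    times:  follow-up time per patient
    events: 1 if the event was observed, 0 if the patient was censored
    risks:  predicted risk, where a higher value means a worse predicted outcome
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of comparable pairs.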
Motivation: The ability to detect copy-number variation (CNV) and loss of heterozygosity (LOH) from exome sequencing data extends the utility of this powerful approach that has mainly been used for point or small insertion/deletion detection.
Results: We present ExomeCNV, a statistical method to detect CNV and LOH using depth-of-coverage and B-allele frequencies, from mapped short sequence reads, and we assess both the method's power and the effects of confounding variables. We apply our method to a cancer exome resequencing dataset. As expected, accuracy and resolution are dependent on depth-of-coverage and capture probe design.
Availability: CRAN package ‘ExomeCNV’.
Supplementary information: Supplementary data are available at Bioinformatics online.
MicroRNAs (miRNAs) are nucleic acid regulators of many human mRNAs, and are associated with many tumorigenic processes. miRNA expression levels have been used in profiling studies, but some evidence suggests that expression levels do not fully capture miRNA regulatory activity. In this study we integrate multiple gene expression datasets to determine miRNA activity patterns associated with cancer phenotypes and oncogenic pathways in mesenchymal tumors – a very heterogeneous class of malignancies.
Using a computational method, we identified differentially activated miRNAs between 77 normal tissue specimens and 135 sarcomas and we validated many of these findings with microarray interrogation of an independent, paraffin-based cohort of 18 tumors. We also showed that miRNA activity is imperfectly correlated with miRNA expression levels. Using next-generation miRNA sequencing we identified potential base sequence alterations which may explain differential activity. We then analyzed miRNA activity changes related to the RAS-pathway and found 21 miRNAs that switch from silenced to activated status in parallel with RAS activation. Importantly, nearly half of these 21 miRNAs were predicted to regulate integral parts of the miRNA processing machinery, and our gene expression analysis revealed significant reductions of these transcripts in RAS-active tumors. These results suggest an association between RAS signaling and miRNA processing in which miRNAs may attenuate their own biogenesis.
Our study represents the first gene expression-based investigation of miRNA regulatory activity in human sarcomas, and our findings indicate that miRNA activity patterns derived from integrated transcriptomic data are reproducible and biologically informative in cancer. We identified an association between RAS signaling and miRNA processing, and demonstrated sequence alterations as plausible causes for differential miRNA activity. Finally, our study highlights the value of systems level integrative miRNA/mRNA assessment with high-throughput genomic data, and the applicability of paraffin-tissue-derived RNA for validation of novel findings.
MicroRNA; Microarray; RAS; Mesenchymal tumors; MicroRNA biogenesis
Motivation: Meta-analysis of genomics data seeks to identify genes associated with a biological phenotype across multiple datasets; however, merging data from different platforms by their features (genes) is challenging. Meta-analysis using functionally or biologically characterized gene sets simplifies data integration, is biologically intuitive, and is seen as having great potential, but it is an emerging field with few established statistical methods.
Results: We transform gene expression profiles into binary gene set profiles by discretizing results of gene set enrichment analyses and apply a new iterative bi-clustering algorithm (iBBiG) to identify groups of gene sets that are coordinately associated with groups of phenotypes across multiple studies. iBBiG is optimized for meta-analysis of large numbers of diverse genomics datasets that may have unmatched samples. It does not require prior knowledge of the number or size of clusters. When applied to simulated data, it outperforms commonly used clustering methods, discovers overlapping clusters of diverse sizes, and is robust in the presence of noise. We apply it to a meta-analysis of breast cancer studies, where iBBiG extracted novel gene set–phenotype associations that predicted tumor metastases within tumor subtypes.
Availability: Implemented in the Bioconductor package iBBiG
Many human diseases, arising from mutations of disease susceptibility genes (genetic diseases), are also associated with viral infections (virally implicated diseases), either in a directly causal manner or by indirect associations. Here we examine whether viral perturbations of host interactome may underlie such virally implicated disease relationships. Using as models two different human viruses, Epstein-Barr virus (EBV) and human papillomavirus (HPV), we find that host targets of viral proteins reside in network proximity to products of disease susceptibility genes. Expression changes in virally implicated disease tissues and comorbidity patterns cluster significantly in the network vicinity of viral targets. The topological proximity found between cellular targets of viral proteins and disease genes was exploited to uncover a novel pathway linking HPV to Fanconi anemia.
Many “virally implicated human diseases” - diseases for which there is scientific consensus of viral involvement - are associated with genetic alterations in particular disease susceptibility genes. We proposed and demonstrated that for two human viruses, Epstein-Barr virus and human papillomavirus, topological proximity should exist between host targets of viruses and genes associated with virally implicated diseases on host interactome networks (local impact hypothesis). For representative EBV- and HPV16- implicated diseases, genes in the neighborhood of viral targets in the host interactome have significantly shifted expression levels in virally implicated disease tissues, in line with the local impact hypothesis. The viral neighborhoods in the host interactome, along with their disease associations, defined as “viral disease networks”, contain connections known to be informative upon disease mechanisms as well as diseases whose associations with viruses are not yet known. We prioritized these diseases for their candidacy as potential virally implicated diseases based on network topology, and benchmarked this prioritization of candidate diseases using relative risk measurement which depicts population-based clinical associations between candidate diseases and viral infection. Exogenous expression of HPV viral proteins in a human cell line offered evidence for a novel disease pathway that links HPV to Fanconi anemia.
Ovarian cancer is the fifth leading cause of cancer death for women in the U.S. and the seventh most fatal worldwide. Although ovarian cancer is notable for its initial sensitivity to platinum-based therapies, the vast majority of patients eventually develop recurrent cancer and succumb to increasingly platinum-resistant disease. Modern, targeted cancer drugs intervene in cell signaling, and identifying key disease mechanisms and pathways would greatly advance our treatment abilities. In order to shed light on the molecular diversity of ovarian cancer, we performed comprehensive transcriptional profiling on 129 advanced stage, high grade serous ovarian cancers. We implemented a resampling-based version of the ISIS class discovery algorithm (rISIS: robust ISIS) and applied it to the entire set of ovarian cancer transcriptional profiles. rISIS identified a previously undescribed patient stratification, further supported by micro-RNA expression profiles, and gene set enrichment analysis found strong biological support for the stratification by extracellular matrix, cell adhesion, and angiogenesis genes. The corresponding “angiogenesis signature” was validated in ten published independent ovarian cancer gene expression datasets and is significantly associated with overall survival. The subtypes we have defined are of potential translational interest as they may be relevant for identifying patients who may benefit from the addition of anti-angiogenic therapies that are now being tested in clinical trials.
Epstein-Barr virus (EBV) latent membrane protein 1 (LMP1) transforms rodent fibroblasts and is expressed in most EBV-associated malignancies. LMP1 (transformation effector site 2 [TES2]/C-terminal activation region 2 [CTAR2]) activates NF-κB, p38, Jun N-terminal protein kinase (JNK), extracellular signal-regulated kinase (ERK), and interferon regulatory factor 7 (IRF7) pathways. We have investigated LMP1 TES2 genome-wide RNA effects at 4 time points after LMP1 TES2 expression in HEK-293 cells. By using a false discovery rate (FDR) of <0.001 after correction for multiple hypotheses, LMP1 TES2 caused >2-fold changes in 1,916 mRNAs; 1,479 RNAs were upregulated and 437 were downregulated. In contrast to tumor necrosis factor alpha (TNF-α) stimulation, which transiently upregulates many target genes, LMP1 TES2 maintained most RNA effects through the time course, despite robust and sustained induction of negative feedback regulators, such as IκBα and A20. LMP1 TES2-regulated RNAs encode many NF-κB signaling proteins and secondary interacting proteins. Consequently, many LMP1 TES2-regulated RNAs encode proteins that form an extensive interactome. Gene set enrichment analyses found LMP1 TES2-upregulated genes to be significantly enriched for pathways in cancer, B- and T-cell receptor signaling, and Toll-like receptor signaling. Surprisingly, LMP1 TES2 and IκBα superrepressor coexpression decreased LMP1 TES2 RNA effects to only 5 RNAs with FDRs of <0.001 and >2-fold changes. Thus, canonical NF-κB activation is critical for almost all LMP1 TES2 RNA effects in HEK-293 cells and a more significant therapeutic target than previously appreciated.
The purpose of the online resource presented here, POPcorn (Project Portal for corn), is to enhance accessibility of maize genetic and genomic resources for plant biologists. Currently, many online locations are difficult to find, some are best searched independently, and individual project websites often degrade over time, sometimes disappearing entirely. The POPcorn site makes available (1) a centralized, web-accessible resource to search and browse descriptions of ongoing maize genomics projects, (2) a single, stand-alone tool that uses web services and minimal data warehousing to search for sequence matches in online resources of diverse offsite projects, and (3) a set of tools that enables researchers to migrate their data to the long-term model organism database for maize genetic and genomic information: MaizeGDB. Examples demonstrating POPcorn's utility are provided herein.
Cyclin D1 is a component of the core cell cycle machinery [1]. Abnormally high levels of cyclin D1 are detected in many human cancer types [2]. To elucidate the molecular functions of cyclin D1 in human cancers, we performed a proteomic screen for cyclin D1 protein partners in several types of human tumors. Analyses of cyclin D1 interactors revealed a network of DNA repair proteins, including RAD51, a recombinase that drives the homologous recombination process [3]. We found that cyclin D1 directly binds RAD51, and that the cyclin D1-RAD51 interaction is induced by radiation. Like RAD51, cyclin D1 is recruited to DNA damage sites in a BRCA2-dependent fashion. Reduction of cyclin D1 levels in human cancer cells impaired recruitment of RAD51 to damaged DNA, impeded homologous recombination-mediated DNA repair, and increased sensitivity of cells to radiation in vitro and in vivo. This effect was seen in cancer cells lacking the retinoblastoma protein, which do not require D-cyclins for proliferation [4, 5]. These findings reveal an unexpected function of a core cell cycle protein in DNA repair and suggest that targeting cyclin D1 may also be beneficial in retinoblastoma-negative cancers, which are currently thought to be oblivious to cyclin D1 inhibition.
Traditional strategies for selecting variables in high-dimensional classification problems aim to find sets of maximally relevant variables able to explain the target variations. Although these techniques may be effective in terms of generalization accuracy, they often do not reveal direct causes. This is essentially because high correlation (or relevance) does not imply causation. In this study, we show how to efficiently incorporate causal information into gene selection by moving from a single-input single-output to a multiple-input multiple-output setting.
We show in a synthetic case study that a better prioritization of causal variables can be obtained by considering a relevance score which incorporates a causal term. In addition, we show, in a meta-analysis of six publicly available breast cancer microarray datasets, that the improvement also occurs in terms of accuracy. The biological interpretation of the results confirms the potential of a causal approach to gene selection.
Integrating causal information into gene selection algorithms is effective both in terms of prediction accuracy and biological interpretation.
GeneSigDB (http://www.genesigdb.org or http://compbio.dfci.harvard.edu/genesigdb/) is a database of gene signatures that have been extracted and manually curated from the published literature. It provides a standardized resource of published prognostic, diagnostic and other gene signatures of cancer and related disease to the community so they can compare the predictive power of gene signatures or use these in gene set enrichment analysis. Since GeneSigDB release 1.0, we have expanded from 575 to 3515 gene signatures, which were collected and transcribed from 1604 published articles largely focused on gene expression in cancer, stem cells, immune cells, development and lung disease. We have made substantial upgrades to the GeneSigDB website to improve accessibility and usability, including adding a tag cloud browse function, faceted navigation and a ‘basket’ feature to store genes or gene signatures of interest. Users can analyze GeneSigDB gene signatures, or upload their own gene list, to identify gene signatures with significant gene overlap, and results can be viewed on a dynamic editable heatmap that can be downloaded as a publication-quality image. All data in GeneSigDB can be downloaded in numerous formats, including the .gmt file format for gene set enrichment analysis or as an R/Bioconductor data file. GeneSigDB is available from http://www.genesigdb.org.
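For readers unfamiliar with the .gmt format mentioned above, the following sketch parses it and reports the overlap between a signature and a user gene list. It assumes the usual tab-delimited layout (set name, description, then one gene per column); the signature names in the usage note are hypothetical, and GeneSigDB's own overlap analysis may differ.

```python
def parse_gmt(text):
    """Parse .gmt text into {set_name: {"description": str, "genes": [str, ...]}}."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("expected at least a name and a description column")
        name, description = fields[0], fields[1]
        # Remaining columns are gene identifiers; drop empty trailing cells.
        gene_sets[name] = {
            "description": description,
            "genes": [g for g in fields[2:] if g],
        }
    return gene_sets


def gene_overlap(genes_a, genes_b):
    """Genes shared by two gene lists, sorted for stable output."""
    return sorted(set(genes_a) & set(genes_b))
```

For example, after `sets = parse_gmt(open("signatures.gmt").read())`, the call `gene_overlap(sets["SIG_A"]["genes"], my_genes)` lists the genes that a signature named `SIG_A` shares with `my_genes`.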
Genomics provided us with an unprecedented quantity of data on the genes that are activated or repressed in a wide range of phenotypes. We have increasingly come to recognize that defining the networks and pathways underlying these phenotypes requires both the integration of multiple data types and the development of advanced computational methods to infer relationships between the genes and to estimate the predictive power of the networks through which they interact. To address these issues we have developed Predictive Networks (PN), a flexible, open-source, web-based application and data services framework that enables the integration, navigation, visualization and analysis of gene interaction networks. The primary goal of PN is to allow biomedical researchers to evaluate experimentally derived gene lists in the context of large-scale gene interaction networks. The PN analytical pipeline involves two key steps. The first is the collection of a comprehensive set of known gene interactions derived from a variety of publicly available sources. The second is to use these ‘known’ interactions together with gene expression data to infer robust gene networks. The PN web application is accessible from http://predictivenetworks.org. The PN code base is freely available at https://sourceforge.net/projects/predictivenets/.
attract is a knowledge-driven analytical approach for identifying and annotating the gene-sets that best discriminate between cell phenotypes. attract finds distinguishing patterns within pathways, decomposes pathways into meta-genes representative of these patterns, and then generates synexpression groups of highly correlated genes from the entire transcriptome dataset. attract can be applied to a wide range of biological systems and is freely available as a Bioconductor package and has been incorporated into the MeV software system.
Summary: RNA-Seq is an exciting methodology that leverages the power of high-throughput sequencing to measure RNA transcript counts at an unprecedented accuracy. However, the data generated from this process are extremely large and biologist-friendly tools with which to analyze it are sorely lacking. MultiExperiment Viewer (MeV) is a Java-based desktop application that allows advanced analysis of gene expression data through an intuitive graphical user interface. Here, we report a significant enhancement to MeV that allows analysis of RNA-Seq data with these familiar, powerful tools. We also report the addition to MeV of several RNA-Seq-specific functions, addressing the differences in analysis requirements between this data type and traditional gene expression data. These tools include automatic conversion functions from raw count data to processed RPKM or FPKM values and differential expression detection and functional annotation enrichment detection based on published methods.
Availability: MeV version 4.7 is written in Java and is freely available for download under the terms of the open-source Artistic License version 2.0. The website (http://mev.tm4.org/) hosts a full user manual as well as a short quick-start guide suitable for new users.
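The conversion from raw counts to RPKM can be written out as a short sketch using the standard definition (reads per kilobase of transcript per million mapped reads). This is not MeV's Java code; the function names are hypothetical, and defaulting the library size to the sum of the supplied counts is an assumption made for the example.

```python
def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


def rpkm_table(counts, lengths_bp, total_mapped_reads=None):
    """Convert {gene: raw_count} to {gene: RPKM}.

    lengths_bp maps each gene to its length in base pairs. If the library
    size is not given, the sum of the supplied counts is used (an assumption
    that only holds when counts cover all mapped reads).
    """
    if total_mapped_reads is None:
        total_mapped_reads = sum(counts.values())
    return {
        gene: rpkm(count, lengths_bp[gene], total_mapped_reads)
        for gene, count in counts.items()
    }
```

FPKM uses the same formula with fragments (read pairs) counted in place of individual reads.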
Gene expression analysis has become a ubiquitous tool for studying a wide range of human diseases. In a typical analysis we compare distinct phenotypic groups and attempt to identify genes that are, on average, significantly different between them. Here we describe an innovative approach to the analysis of gene expression data, one that identifies differences in expression variance between groups as an informative metric of the group phenotype. We find that genes with different expression variance profiles are not randomly distributed across cell signaling networks. Genes with low-expression variance, or higher constraint, are significantly more connected to other network members and tend to function as core members of signal transduction pathways. Genes with higher expression variance have fewer network connections and also tend to sit on the periphery of the cell. Using neural stem cells derived from patients suffering from Schizophrenia (SZ), Parkinson's disease (PD), and a healthy control group, we find marked differences in expression variance in cell signaling pathways that shed new light on potential mechanisms associated with these diverse neurological disorders. In particular, we find that expression variance of core networks in the SZ patient group was considerably constrained, while in contrast the PD patient group demonstrated much greater variance than expected. One hypothesis is that diminished variance in SZ patients corresponds to an increased degree of constraint in these pathways and a corresponding reduction in robustness of the stem cell networks. These results underscore the role that variation plays in biological systems and suggest that analysis of expression variance is far more important in disease than previously recognized. Furthermore, modeling patterns of variability in gene expression could fundamentally alter the way in which we think about how cellular networks are affected by disease processes.
Genes are a repository of information that provides the framework for cellular processes, with the flow of information from gene (DNA) to phenotype passing through an intermediate molecule, the messenger RNA. We understand that sequence variations in a gene may lead to phenotypic variations, but less well understood is how variation in the information flow itself might also affect phenotype. In this study we demonstrated that disease phenotypes were correlated with expression variance. A change in expression variance might imply that the genetic networks representing information flow were less robust. Surprisingly, we found that too little and too much variance were equally detrimental in the context of neurological disease.
Gene expression patterns characterizing clinically-relevant molecular subgroups of glioblastoma are difficult to reproduce. We suspect a combination of biological and analytic factors confounds interpretation of glioblastoma expression data. We seek to clarify the nature and relative contributions of these factors, to focus additional investigations, and to improve the accuracy and consistency of translational glioblastoma analyses.
We analyzed gene expression and clinical data for 340 glioblastomas in The Cancer Genome Atlas (TCGA). We developed a logic model to analyze potential sources of biological, technical, and analytic variability and used standard linear classifiers and linear dimensional reduction algorithms to investigate the nature and relative contributions of each factor.
Commonly described sources of classification error, including individual sample characteristics, batch effects, and analytic and technical noise, make measurable but proportionally minor contributions to inconsistent molecular classification. Our analysis suggests that three previously underappreciated factors may account for a larger fraction of classification errors: inherent non-linear/non-orthogonal relationships among the genes used in conjunction with classification algorithms that assume linearity; skewed data distributions assumed to be Gaussian; and biologic variability (noise) among tumors, of which we propose three types.
Our analysis of the TCGA data demonstrates a contributory role for technical factors in molecular classification inconsistencies in glioblastoma but also suggests that biological variability, abnormal data distribution, and non-linear relationships among genes may be responsible for a proportionally larger component of classification error. These findings may have important implications for both glioblastoma research and for translational application of other large-volume biological databases.
The majority of breast cancer deaths result from metastases rather than from direct effects of the primary tumor itself. Recently, Landemaine and colleagues described a six-gene signature purported to predict lung metastasis risk. They analyzed gene expression in 23 metastases from breast cancer patients (5 lung, 18 non-lung), identifying a 21-gene signature. Expression of 16 of these genes was analyzed in primary breast tumors from 72 patients with known outcome, and six were selected that were predictive of lung metastases: DSC2, TFCP2L1, UGT8, ITGB8, ANP32E, and FERMT1. Despite the value of such a signature, our analysis indicates that their study ignored potentially important confounding factors and that their signature is instead a surrogate for molecular subtype.
Carcinogenesis is a complex process with multiple genetic and environmental factors contributing to the development of one or more tumors. Understanding the underlying mechanism of this process and identifying related markers to assess the outcome of this process would lead to more directed treatment and thus significantly reduce the mortality rate of cancers. Recently, molecular diagnostics and prognostics based on the identification of patterns within gene expression profiles in the context of protein interaction networks were reported. However, the predictive performances of these approaches were limited. In this study we propose a novel integrated approach, named CAERUS, for the identification of gene signatures to predict cancer outcomes based on the domain interaction network in human proteome. We first developed a model to score each protein by quantifying the domain connections to its interacting partners and the somatic mutations present in the domain. We then defined proteins as gene signatures if their scores were above a preset threshold. Next, for each gene signature, we quantified the correlation of the expression levels between this gene signature and its neighboring proteins. The results of the quantification in each patient were then used to predict cancer outcome by a modified naïve Bayes classifier. In this study we achieved a favorable accuracy of 88.3%, sensitivity of 87.2%, and specificity of 88.9% on a set of well-documented gene expression profiles of 253 consecutive breast cancer patients with different outcomes. We also compiled a list of cancer-associated gene signatures and domains, which provided testable hypotheses for further experimental investigation. Our approach proved successful on different independent breast cancer data sets as well as an ovarian cancer data set. This study constitutes the first predictive method to classify cancer outcomes based on the relationship between the domain organization and protein network.
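The three performance figures quoted above follow from a confusion matrix. The sketch below shows the arithmetic on binary outcome labels; the label encoding (1 for poor outcome, 0 for good outcome) and the function name are assumptions, and no counts from the study are reproduced here.

```python
def classification_metrics(true_labels, predicted_labels, positive=1):
    """Accuracy, sensitivity and specificity for binary outcome predictions."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive:
            if truth == positive:
                tp += 1  # predicted poor outcome, correctly
            else:
                fp += 1  # predicted poor outcome, wrongly
        else:
            if truth == positive:
                fn += 1  # missed a poor outcome
            else:
                tn += 1  # predicted good outcome, correctly
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # share of true positives recovered
        "specificity": tn / (tn + fp),  # share of true negatives recovered
    }
```

Sensitivity and specificity are undefined when a class is absent from `true_labels`, so a real evaluation would check for that before dividing.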
It is widely known that cancer is a complex process in which a large number of genes appear to be involved. Through experimental approaches, some oncogenes and tumor suppressors have been identified as playing important roles in signaling and regulatory pathways. However, the mechanism by which cancer develops and leads to different disease outcomes (aggressive/dangerous or non-aggressive/less-dangerous) is not yet fully understood. In order to identify a list of gene signatures and better predict cancer outcome, we developed an integrated and systematic approach that investigates alterations in gene expression profiles caused by disruptions of protein-protein interactions and domain-domain interactions in the human interactome. Our approach achieves favorable predictive performance when tested on a set of well-documented breast cancer patients, which suggests that the disrupted interactome is important in determining patient prognosis. Our approach remains robust when tested on other independent data sets. This work provides a promising prognostic tool to classify different cancer outcomes.
Public data integration may help overcome challenges in clinical implementation of microarray profiles. We integrated several ovarian cancer datasets to identify a reproducible predictor of survival.
Four microarray datasets from different institutions comprising 265 advanced stage tumors were uniformly reprocessed into a single training dataset, also adjusting for inter-laboratory variation (“batch-effect”). Supervised principal component survival analysis was employed to identify prognostic models. Models were independently validated in a 61-patient cohort using a custom array genechip and in a publicly available 229-array dataset. Molecular correspondence of high- and low-risk outcome groups between training and validation datasets was demonstrated using Subclass Mapping. Previously established molecular phenotypes in the 2nd validation set were correlated with high- and low-risk outcome groups. Functional representation and pathway analyses were used to explore gene networks associated with high- and low-risk phenotypes. A 19-gene model showed optimal performance in the training set (median OS 31 versus 78 months, p < 0.01), 1st validation set (median OS 32 months versus not-yet-reached, p = 0.026) and 2nd validation set (median OS 43 versus 61 months, p = 0.013), maintaining independent prognostic power in multivariate analysis. There was strong molecular correspondence of the respective high- and low-risk tumors between the training and 1st validation sets. Low- and high-risk tumors were enriched for favorable and unfavorable molecular subtypes and pathways, respectively, as previously defined in the public 2nd validation set.
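The two analytic steps named above, batch adjustment and supervised principal components, can be sketched in a deliberately simplified form. This is not the pipeline used in the study: batch adjustment is reduced here to per-batch mean-centering, and gene selection uses absolute Pearson correlation with an uncensored outcome instead of Cox scores on censored survival times. The function names and toy matrices are assumptions for illustration only.

```python
import numpy as np


def center_by_batch(X, batches):
    # Simplified stand-in for batch-effect adjustment:
    # subtract each batch's per-gene mean (rows = samples, columns = genes).
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    out = np.empty_like(X)
    for b in np.unique(batches):
        mask = batches == b
        out[mask] = X[mask] - X[mask].mean(axis=0)
    return out


def supervised_pc_score(X, y, n_genes):
    # Step 1: rank genes by absolute correlation with the outcome
    # (assumption: uncensored outcome; the survival setting uses Cox scores).
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
    keep = np.argsort(corr)[::-1][:n_genes]
    # Step 2: project the selected genes onto their first principal component.
    _, _, Vt = np.linalg.svd(Xc[:, keep], full_matrices=False)
    return keep, Xc[:, keep] @ Vt[0]


# Toy data, invented for the example.
X_batch = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]])
centered = center_by_batch(X_batch, ["a", "a", "b", "b"])

X_expr = np.array([[1.0, 5.0, 0.0], [2.0, 5.0, 1.0], [3.0, 5.0, 0.0], [4.0, 5.0, 1.0]])
outcome = np.array([1.0, 2.0, 3.0, 4.0])
keep, score = supervised_pc_score(X_expr, outcome, n_genes=1)
```

The sign of a principal component is arbitrary, so the resulting risk score should be oriented against the outcome before patients are split into high- and low-risk groups.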
Integration of previously generated cancer microarray datasets may lead to robust and widely applicable survival predictors. These predictors are not simply a compilation of prognostic genes but appear to track true molecular phenotypes of good and poor outcome.