Integrating gene expression data with secondary data such as pathway or protein-protein interaction data has been proposed as a promising approach for improved outcome prediction of cancer patients. Methods employing this approach usually aggregate the expression of genes into new composite features, while the secondary data guide this aggregation. Previous studies were limited to few data sets with a small number of patients. Moreover, each study used different data and evaluation procedures. This makes it difficult to objectively assess the gain in classification performance. Here we introduce the Amsterdam Classification Evaluation Suite (ACES). ACES is a Python package to objectively evaluate classification and feature-selection methods and contains methods for pooling and normalizing Affymetrix microarrays from different studies. It is simple to use and therefore facilitates the comparison of new approaches to best-in-class approaches. In addition to the methods described in our earlier study (Staiger et al., 2012), we have included two prominent prognostic gene signatures specific for breast cancer outcome, one more composite feature selection method and two network-based gene ranking methods. Employing the evaluation pipeline we show that current composite-feature classification methods do not outperform simple single-gene classifiers in predicting outcome in breast cancer. Furthermore, we find that the stability of features across different data sets is also not higher for composite features. Most stunningly, we observe that prediction performances are not affected when extracting features from randomized PPI networks.
outcome prediction; breast cancer; classification; feature selection; networks; evaluation
Deposition of crystallographic structures should be concurrent with or prior to manuscript submission for peer review, enabling validation and increasing reliability of the PDB.
Most of the macromolecular structures in the Protein Data Bank (PDB), which are used daily by thousands of educators and scientists alike, are determined by X-ray crystallography. It was examined whether the crystallographic models and data were deposited to the PDB at the same time as the publications that describe them were submitted for peer review. This condition is necessary to ensure pre-publication validation and the quality of the PDB public archive. It was found that a significant proportion of PDB entries were submitted to the PDB after peer review of the corresponding publication started, and many were only submitted after peer review had ended. It is argued that clear description of journal policies and effective policing is important for pre-publication validation, which is key in ensuring the quality of the PDB and of peer-reviewed literature.
Protein Data Bank; deposition; validation
The characterization of post-transcriptional gene regulation by small regulatory RNAs of 20–30 nt length, particularly miRNAs and piRNAs, has become a major focus of research in recent years. A prerequisite for the characterization of small RNAs is their identification and quantification across different developmental stages, normal and diseased tissues, as well as model cell lines. Here we present a step-by-step protocol for the bioinformatic analysis of barcoded cDNA libraries for small RNA profiling generated by Illumina sequencing, thereby facilitating miRNA and other small RNA profiling of large sample collections.
Bioinformatic analysis; Small RNA; miRNA; Barcoding; Next-generation sequencing; Nucleotide variation
Traditional methods that aim to identify biomarkers that distinguish between two groups, like Significance Analysis of Microarrays or the t-test, perform optimally when such biomarkers show homogeneous behavior within each group and differential behavior between the groups. However, in many applications, this is not the case. Instead, a subgroup of samples in one group shows differential behavior with respect to all other samples. To successfully detect markers showing such imbalanced patterns of differential signal, a different approach is required. We propose a novel method, specifically designed for the Detection of Imbalanced Differential Signal (DIDS). We use an artificial dataset and a human breast cancer dataset to measure its performance and compare it with three traditional methods and four approaches that take imbalanced signal into account. Supported by extensive experimental results, we show that DIDS outperforms all other approaches in terms of power and positive predictive value. In a mouse breast cancer dataset, DIDS is the only approach that detects a functionally validated marker of chemotherapy resistance. DIDS can be applied to any continuous value data, including gene expression data, and in any context where imbalanced differential signal is manifested.
The evolution of colorectal cancer suggests the involvement of many genes. We performed insertional mutagenesis with the Sleeping Beauty (SB) transposon system in mice carrying germline or somatic Apc mutation. Analysis of common insertion sites (CISs) isolated from 446 tumors revealed many hundreds of candidate cancer drivers. Comparison to human datasets suggested that 234 CIS genes are also deregulated in human colorectal cancers. 183 CIS genes are candidate Wnt targets, and 20 are shown to be novel modifiers of canonical Wnt signaling. We also identified gene mutations associated with a subset of tumors containing an expanded number of Paneth cells, a hallmark of deregulated Wnt signaling, and genes associated with more severe dysplasia included members of the FGF signaling cascade. Some 70 genes showed pairwise co-occurrence clustering into 38 sub-networks that may regulate tumor development.
Insertional mutagenesis is a potent forward genetic screening technique used to identify candidate cancer genes in mouse model systems. An important yet unresolved issue in the analysis of these screens is the identification of the genes affected by the insertions. To address this, we developed Kernel Convolved Rule Based Mapping (KC-RBM). KC-RBM exploits distance, orientation and insertion density across tumors to automatically map integration sites to target genes. We perform the first genome-wide evaluation of the association of insertion occurrences with aberrant gene expression of the predicted targets in both retroviral and transposon data sets. We demonstrate the efficiency of KC-RBM by showing its superior performance over existing approaches in recovering true positives from a list of independently, manually curated cancer genes. The results of this work will significantly enhance the accuracy and speed of cancer gene discovery in forward genetic screens. KC-RBM is available as an R package.
Motivation: Permutation tests have become a standard tool to assess the statistical significance of an event under investigation. The statistical significance, as expressed in a P-value, is calculated as the fraction of permutation values that are at least as extreme as the original statistic, which was derived from non-permuted data. This empirical method directly couples both the minimal obtainable P-value and the resolution of the P-value to the number of permutations. Thereby, it imposes upon itself the need for a very large number of permutations when small P-values are to be accurately estimated. This is computationally expensive and often infeasible.
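The coupling between the number of permutations and the attainable P-value can be illustrated with a minimal Python sketch. The function names and the one-sided "greater than or equal to" convention are illustrative assumptions and are not taken from the authors' code.

```python
def empirical_p_value(observed, permutation_values):
    """Fraction of permutation values at least as extreme as the observed statistic.

    Assumption: "extreme" means "greater than or equal to" (one-sided test).
    """
    n_extreme = sum(1 for value in permutation_values if value >= observed)
    return n_extreme / len(permutation_values)


def smallest_nonzero_p_value(n_permutations):
    """Resolution of the empirical P-value: it can only change in steps of 1/N."""
    return 1 / n_permutations
```

With 1000 permutations, for example, any true P-value below 0.001 is reported as either 0 or 0.001, which is why accurate estimation of small P-values requires a very large number of permutations.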
Results: A method of computing P-values based on tail approximation is presented. The tail of the distribution of permutation values is approximated by a generalized Pareto distribution. A good fit and thus accurate P-value estimates can be obtained with a drastically reduced number of permutations when compared with the standard empirical way of computing P-values.
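The tail approximation can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' Matlab implementation: the generalized Pareto parameters are estimated with the method of moments (chosen here for brevity; the abstract does not specify the estimator), and the function names, the default of 250 exceedances and the threshold rule are hypothetical.

```python
import math
import statistics


def fit_gpd_tail(permutation_values, n_exceed=250):
    """Fit a generalized Pareto distribution to the largest permutation values.

    Returns (threshold, shape, scale). Method-of-moments estimates are an
    assumption made for brevity; other estimators could be substituted.
    """
    values = sorted(permutation_values, reverse=True)
    # Hypothetical threshold choice: midway between the last value kept
    # in the tail and the first value left out of it.
    threshold = (values[n_exceed - 1] + values[n_exceed]) / 2
    excesses = [v - threshold for v in values[:n_exceed]]
    mean = statistics.fmean(excesses)
    variance = statistics.variance(excesses)
    ratio = mean * mean / variance
    shape = 0.5 * (1 - ratio)         # from mean^2 / variance = 1 - 2 * shape
    scale = 0.5 * mean * (ratio + 1)  # from mean = scale / (1 - shape)
    return threshold, shape, scale


def tail_p_value(observed, permutation_values, n_exceed=250):
    """P-value = (fraction of values in the tail) * (GPD survival beyond observed)."""
    threshold, shape, scale = fit_gpd_tail(permutation_values, n_exceed)
    z = observed - threshold
    if z < 0:
        raise ValueError("observed statistic is below the threshold; "
                         "use the empirical P-value instead")
    if abs(shape) < 1e-9:
        survival = math.exp(-z / scale)  # exponential limit of the GPD
    else:
        base = 1 + shape * z / scale
        # base <= 0 means the observed value lies beyond the fitted support
        survival = base ** (-1 / shape) if base > 0 else 0.0
    return (n_exceed / len(permutation_values)) * survival
```

Because the fitted tail is a continuous function, the estimate is not restricted to multiples of 1/N. Its accuracy depends on how well the generalized Pareto distribution fits the exceedances, so the fit should be checked before the estimate is trusted.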
Availability: The Matlab code can be obtained from the corresponding author on request.
Supplementary information: Supplementary data are available at Bioinformatics online.
Colorectal cancer (CRC) is the second most common cause of cancer-related death in Europe and its prognosis is largely dependent on stage at diagnosis. Currently, there are no suitable tumour markers for early detection of CRC. In a retrospective study we previously found discriminative CRC serum protein profiles with surface-enhanced laser desorption/ionisation time-of-flight mass spectrometry (SELDI-TOF MS). We now aimed at prospective validation of these profiles. Additionally, we assessed their applicability for follow-up after surgery and investigated tissue protein profiles of patients with CRC and adenomatous polyps (AP). Serum and tissue samples were collected from patients without known malignancy with an indication for colonoscopy and patients with AP and CRC during colonoscopy. Serum samples of controls (CON; n = 359), patients with AP (n = 177) and CRC (n = 73), as well as tissue samples from AP (n = 52) and CRC (n = 47) were analysed as described previously. Peak intensities were compared by non-parametric testing. Discriminative power of differentially expressed proteins was assessed with support vector machines (SVM). We confirmed the decreased serum levels of apolipoprotein C-I in CRC in the current population. No differences were observed between CON and AP. Apolipoprotein C-I levels did not change significantly within 1 month post-surgery, although a gradual return to normal levels was observed. Several proteins differed between AP and CRC tissue, including a peak with a mass similar to that of apolipoprotein C-I. This peak was increased in CRC compared to AP. Although we prospectively validated the serum decrease of apolipoprotein C-I in CRC, serum protein profiles did not yield SVM classifiers with suitable sensitivity and specificity for classification of our patient groups.
biomarkers; colorectal cancer; SELDI-TOF MS; validation
The availability of large collections of microarray datasets (compendia), or knowledge about grouping of genes into pathways (gene sets), is typically not exploited when training predictors of disease outcome. These can be useful since a compendium increases the number of samples, while gene sets reduce the size of the feature space. This should be favorable from a machine learning perspective and result in more robust predictors.
We extracted modules of regulated genes from gene sets and compendia. Through supervised analysis, we constructed predictors which employ modules predictive of breast cancer outcome. To validate these predictors we applied them to independent data from the same institution (intra-dataset) and from other institutions (inter-dataset).
We show that modules derived from single breast cancer datasets achieve better performance on the validation data compared to gene-based predictors. We also show that there is a trend in compendium specificity and predictive performance: modules derived from a single breast cancer dataset, and a breast cancer specific compendium perform better compared to those derived from a human cancer compendium. Additionally, the module-based predictor provides a much richer insight into the underlying biology. Frequently selected gene sets are associated with processes such as cell cycle, E2F regulation, DNA damage response, proteasome and glycolysis. We analyzed two modules related to cell cycle, and the OCT1 transcription factor, respectively. On an individual basis, these modules provide a significant separation in survival subgroups on the training and independent validation data.
Despite continuous efforts, not a single predictor of breast cancer chemotherapy resistance has made it into the clinic yet. However, it has become clear in recent years that breast cancer is a collection of molecularly distinct diseases. With ever increasing amounts of breast cancer data becoming available, we set out to study if gene expression based predictors of chemotherapy resistance that are specific for breast cancer subtypes can improve upon the performance of generic predictors.
We trained predictors of resistance that were specific for a subtype and generic predictors that were not specific for a particular subtype, i.e. trained on all subtypes simultaneously. Through a rigorous double-loop cross-validation we compared the performance of these two types of predictors on the different subtypes on a large set of tumors all profiled on the same expression platform (n = 394). We evaluated predictors based on either mRNA gene expression or clinical features.
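A double-loop (nested) cross-validation of the kind mentioned above can be sketched in Python as follows. This is an illustration under assumptions, not the procedure used in the study: the inner loop is assumed to select a predictor setting, the folds are contiguous and unshuffled for brevity, and all names (`k_fold_indices`, `double_loop_cv`, `train_fn`, `score_fn`) are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def double_loop_cv(samples, labels, settings, train_fn, score_fn,
                   outer_k=5, inner_k=4):
    """Return one held-out score per outer fold.

    train_fn(samples, labels, setting) -> model
    score_fn(model, samples, labels)   -> float, higher is better
    In practice the samples would be shuffled or stratified first.
    """
    def subset(indices):
        return [samples[i] for i in indices], [labels[i] for i in indices]

    outer_scores = []
    for test_idx in k_fold_indices(len(samples), outer_k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]

        # Inner loop: choose a setting using outer-training samples only,
        # so the outer test fold never influences model selection.
        best_setting, best_score = None, float("-inf")
        for setting in settings:
            inner_scores = []
            for val_positions in k_fold_indices(len(train_idx), inner_k):
                val_idx = [train_idx[p] for p in val_positions]
                val_set = set(val_idx)
                fit_idx = [i for i in train_idx if i not in val_set]
                model = train_fn(*subset(fit_idx), setting)
                inner_scores.append(score_fn(model, *subset(val_idx)))
            mean_score = sum(inner_scores) / len(inner_scores)
            if mean_score > best_score:
                best_setting, best_score = setting, mean_score

        # Outer loop: retrain with the chosen setting, score on the test fold.
        model = train_fn(*subset(train_idx), best_setting)
        outer_scores.append(score_fn(model, *subset(test_idx)))
    return outer_scores
```

The point of the design is that the performance estimate comes only from the outer test folds, which take no part in choosing the setting.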
For HER2+, ER− breast cancer, the subtype specific predictor based on clinical features outperformed the generic, non-specific predictor. This can be explained by the fact that the generic predictor included HER2 and ER status, features that are predictive over the whole set, but not within this subtype. In all other scenarios the generic predictors outperformed the subtype specific predictors or showed equal performance.
Since which type of predictor (subtype specific or generic) performs better depends on the specific context, it is highly recommended to evaluate both specific and generic predictors when attempting to predict treatment response in breast cancer.
Inhibitors of the ALK and EGF receptor tyrosine kinases provoke dramatic but short-lived responses in lung cancers harboring EML4-ALK translocations or activating mutations of EGFR, respectively. We used a large-scale RNAi screen to identify MED12, a component of the transcriptional MEDIATOR complex that is mutated in cancers, as a determinant of response to ALK and EGFR inhibitors. MED12 is in part cytoplasmic where it negatively regulates TGF-βR2 through physical interaction. MED12 suppression therefore results in activation of TGF-βR signaling, which is both necessary and sufficient for drug resistance. TGF-β signaling causes MEK/ERK activation, and consequently MED12 suppression also confers resistance to MEK and BRAF inhibitors in other cancers. MED12 loss induces an EMT-like phenotype, which is associated with chemotherapy resistance in colon cancer patients and with gefitinib resistance in lung cancer. Inhibition of TGF-βR signaling restores drug responsiveness in MED12KD cells, suggesting a strategy to treat drug-resistant tumors that have lost MED12.
Cancer develops through a multistep process in which normal cells progress to malignant tumors via the evolution of their genomes as a result of the acquisition of mutations in cancer driver genes. The number, identity and mode of action of cancer driver genes, and how they contribute to tumor evolution, are largely unknown. This study deployed the Mouse Mammary Tumor Virus (MMTV) as an insertional mutagen to find both the driver genes and the networks in which they function. Using deep insertion site sequencing we identified around 31000 retroviral integration sites in 604 MMTV-induced mammary tumors from mice with mammary gland-specific deletion of Trp53, Pten heterozygous knockout mice, or wildtype strains. We identified 18 known common integration sites (CISs) and 12 previously unknown CISs marking new candidate cancer genes. Members of the Wnt, Fgf, Fgfr, Rspo and Pdgfr gene families were commonly mutated in a mutually exclusive fashion. The sequence data we generated also yielded information on the clonality of insertions in individual tumors, allowing us to develop a data-driven model of MMTV-induced tumor development. Insertional mutations near Wnt and Fgf genes mark the earliest “initiating” events in MMTV-induced tumorigenesis, whereas Fgfr genes are targeted later during tumor progression. Our data show that insertional mutagenesis can be used to discover the mutational networks, the timing of mutations, and the genes that initiate and drive tumor evolution.
Pancreatic ductal adenocarcinoma (PDA) remains a lethal malignancy despite tremendous progress in its molecular characterization. Indeed, PDA tumors harbor four signature somatic mutations [1–4], and a plethora of lower frequency genetic events of uncertain significance [5]. Here, we used Sleeping Beauty (SB) transposon-mediated insertional mutagenesis [6,7] in a mouse model of pancreatic ductal preneoplasia [8] to identify genes that cooperate with oncogenic KrasG12D to accelerate tumorigenesis and promote progression. Our screen revealed new candidates and confirmed the importance of many genes and pathways previously implicated in human PDA. Interestingly, the most commonly mutated gene was the X-linked deubiquitinase Usp9x, which was inactivated in over 50% of the tumors. Although prior work had attributed a pro-survival role to USP9X in human neoplasia [9], we found instead that loss of Usp9x enhances transformation and protects pancreatic cancer cells from anoikis. Clinically, low USP9X protein and mRNA expression in PDA correlates with poor survival following surgery, and USP9X levels are inversely associated with metastatic burden in advanced disease. Furthermore, chromatin modulation with trichostatin A or 5-aza-2′-deoxycytidine elevates USP9X expression in human PDA cell lines to suggest a clinical approach for certain patients. The conditional deletion of Usp9x cooperated with KrasG12D to rapidly accelerate pancreatic tumorigenesis in mice, validating their genetic interaction. Therefore, we propose USP9X as a major new tumor suppressor gene with prognostic and therapeutic relevance in PDA.
Breast cancer outcome can be predicted using models derived from gene expression data or clinical data. Only a few studies have created a single prediction model using both gene expression and clinical data. These studies often remain inconclusive regarding an obtained improvement in prediction performance. We rigorously compare three different integration strategies (early, intermediate, and late integration) as well as classifiers employing no integration (only one data type) using five classifiers of varying complexity. We perform our analysis on a set of 295 breast cancer samples, for which gene expression data and an extensive set of clinical parameters are available, as well as four breast cancer datasets containing 521 samples that we used as independent validation. On the 295 samples, a nearest mean classifier employing a logical OR operation (late integration) on clinical and expression classifiers significantly outperforms all other classifiers. Moreover, regardless of the integration strategy, the nearest mean classifier achieves the best performance. All five classifiers achieve their best performance when integrating clinical and expression data. Repeating the experiments using the 521 samples from the four independent validation datasets also indicated a significant performance improvement when integrating clinical and gene expression data. Whether integration also improves performance on other datasets (e.g. other tumor types) has not been investigated, but seems worth pursuing. Our work suggests that future models for predicting breast cancer outcome should exploit both data types by employing a late OR or intermediate integration strategy based on nearest mean classifiers.
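The late OR integration of two nearest mean classifiers can be sketched in Python as follows. This is an illustration under assumptions rather than the study's implementation: Euclidean distance to the class means, the label 1 for poor outcome, and all function names are hypothetical choices.

```python
def train_nearest_mean(samples, labels):
    """Compute one mean vector (centroid) per class label."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict_nearest_mean(centroids, features):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda label: squared_distance(centroids[label], features))


def late_or_integration(clinical_prediction, expression_prediction):
    """Predict poor outcome (1) if either single-data-type classifier does."""
    return 1 if clinical_prediction == 1 or expression_prediction == 1 else 0
```

In this late integration scheme, each classifier is trained on its own data type and only the two predictions are combined, so a patient is called good outcome only when both classifiers agree on that.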
Recently, several classifiers that combine primary tumor data, like gene expression data, and secondary data sources, such as protein-protein interaction networks, have been proposed for predicting outcome in breast cancer. In these approaches, new composite features are typically constructed by aggregating the expression levels of several genes. The secondary data sources are employed to guide this aggregation. Although many studies claim that these approaches improve classification performance over single genes classifiers, the gain in performance is difficult to assess. This stems mainly from the fact that different breast cancer data sets and validation procedures are employed to assess the performance. Here we address these issues by employing a large cohort of six breast cancer data sets as benchmark set and by performing an unbiased evaluation of the classification accuracies of the different approaches. Contrary to previous claims, we find that composite feature classifiers do not outperform simple single genes classifiers. We investigate the effect of (1) the number of selected features; (2) the specific gene set from which features are selected; (3) the size of the training set and (4) the heterogeneity of the data set on the performance of composite feature and single genes classifiers. Strikingly, we find that randomization of secondary data sources, which destroys all biological information in these sources, does not result in a deterioration in performance of composite feature classifiers. Finally, we show that when a proper correction for gene set size is performed, the stability of single genes sets is similar to the stability of composite feature sets. Based on these results there is currently no reason to prefer prognostic classifiers based on composite features over single genes classifiers for predicting outcome in breast cancer.
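The construction of a composite feature can be illustrated with a minimal Python sketch. Averaging is used here as one possible aggregation; it is an assumption, since the evaluated methods differ in how they aggregate and in how the secondary data select the genes. The function name and data layout are hypothetical.

```python
def composite_feature(expression, gene_set):
    """Average the expression profiles of the genes in gene_set.

    expression: dict mapping gene name -> list of values, one per sample.
    gene_set:   genes grouped by a secondary data source, for example a
                pathway or a subnetwork of a protein-protein interaction network.
    Genes absent from the expression data are skipped (an assumption).
    """
    profiles = [expression[gene] for gene in gene_set if gene in expression]
    if not profiles:
        raise ValueError("none of the genes in the set were measured")
    n_samples = len(profiles[0])
    return [sum(profile[i] for profile in profiles) / len(profiles)
            for i in range(n_samples)]
```

A single-gene classifier would use the rows of `expression` directly as features, whereas a composite-feature classifier would use one aggregated value per gene set and sample.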
Background and Methods
Formalin Fixed Paraffin Embedded (FFPE) samples represent a valuable resource for cancer research. However, the discovery and development of new cancer biomarkers often requires fresh frozen (FF) samples. Recently, the Whole Genome (WG) DASL (cDNA-mediated Annealing, Selection, extension and Ligation) assay was specifically developed to profile FFPE tissue. However, a thorough comparison of data generated from FFPE RNA and Fresh Frozen (FF) RNA using this platform is lacking. To this end we profiled, in duplicate, 20 FFPE tissues and 20 matched FF tissues and evaluated the concordance of the DASL results from FFPE and matched FF material.
Methodology and Principal Findings
We show that after proper normalization, all FFPE and FF pairs exhibit a high level of similarity (Pearson correlation >0.7), significantly larger than the similarity between non-paired samples. Interestingly, the probes showing the highest correlation had a higher percentage G/C content and were enriched for cell cycle genes. Predictions of gene expression signatures developed on frozen material (Intrinsic subtype, Genomic Grade Index, 70 gene signature) showed a high level of concordance between FFPE and FF matched pairs. Interestingly, predictions based on a 60 gene DASL list (best match with the 70 gene signature) showed very high concordance with the MammaPrint® results.
Conclusions and Significance
We demonstrate that data generated from FFPE material with the DASL assay, if properly processed, are comparable to data extracted from the FF counterpart. Specifically, gene expression profiles for a known set of prognostic genes for a specific disease are highly comparable between the two conditions. This opens up the possibility of using both FFPE and FF material in gene expression analyses, leading to a vast increase in the potential resources available for cancer research.
Accurate staging of colorectal cancer (CRC) with clinicopathological parameters is important for predicting prognosis and guiding treatment but provides no information about organ site of metastases. Patterns of genomic aberrations in primary colorectal tumors may reveal a chromosomal signature for organ specific metastases.
Array Comparative Genomic Hybridization (aCGH) was employed to assess DNA copy number changes in primary colorectal tumors of three distinctive patient groups. This included formalin-fixed, paraffin-embedded tissue of patients who developed liver metastases (LM; n = 36), metastases (PM; n = 37) and a group that remained metastases-free (M0; n = 25).
A novel statistical method for identifying recurrent copy number changes, KC-SMART, was used to find specific locations of genomic aberrations specific for various groups. We created a classifier for organ specific metastases based on the aCGH data using Prediction Analysis for Microarrays (PAM).
Specifically in the tumors of primary CRC patients who subsequently developed liver metastasis, KC-SMART analysis identified genomic aberrations on chromosome 20q. LM-PAM, a shrunken centroids classifier for liver metastases occurrence, was able to distinguish the LM group from the other groups (M0&PM) with 80% accuracy (78% sensitivity and 86% specificity). The classification is predominantly based on chromosome 20q aberrations.
Liver specific CRC metastases may be predicted with a high accuracy based on specific genomic aberrations in the primary CRC tumor. The ability to predict the site of metastases is important for improvement of personalized patient management.
Polycomb repressive complex 1 (PRC1) core member Ring1b/Rnf2, with ubiquitin E3 ligase activity towards histone H2A at lysine 119, is essential for early embryogenesis. To obtain more insight into the role of Ring1b in early development, we studied its function in mouse embryonic stem (ES) cells.
We investigated the effects of Ring1b ablation on transcriptional regulation using Ring1b conditional knockout ES cells and large-scale gene expression analysis. The absence of Ring1b results in aberrant expression of key developmental genes and deregulation of specific differentiation-related pathways, including TGFbeta signaling, cell cycle regulation and cellular communication. Moreover, ES cell markers, including Zfp42/Rex-1 and Sox2, are downregulated. Importantly, retained expression of ES cell regulators Oct4, Nanog and alkaline phosphatase indicates that Ring1b-deficient ES cells retain important ES cell specific characteristics. Comparative analysis of our expression profiling data with previously published global binding studies shows that the genes that are bound by Ring1b in ES cells have bivalent histone marks, i.e. both active H3K4me3 and repressive H3K27me3, or the active H3K4me3 histone mark alone and are associated with CpG-‘rich’ promoters. However, deletion of Ring1b results in deregulation, mainly derepression, of only a subset of these genes, suggesting that additional silencing mechanisms are involved in repression of the other Ring1b bound genes in ES cells.
Ring1b is essential to stably maintain an undifferentiated state of mouse ES cells by repressing genes with important roles during differentiation and development. These genes are characterized by high CpG content promoters and bivalent histone marks or the active H3K4me3 histone mark alone.