Summary: The R/Bioconductor package RamiGO is an R interface to AmiGO that enables visualization of Gene Ontology (GO) trees. Given a list of GO terms, RamiGO uses the AmiGO visualize API to import Graphviz-DOT format files into R, and export these either as images (SVG, PNG) or into Cytoscape for extended network analyses. RamiGO provides easy customization of annotation, highlighting of specific GO terms, colouring of terms by P-value or export of a simplified summary GO tree. We illustrate RamiGO functionalities in a genome-wide gene set analysis of prognostic genes in breast cancer.
Availability and implementation: RamiGO is provided in R/Bioconductor, is open source under the Artistic-2.0 License and is available with a user manual containing installation, operating instructions and tutorials. It requires R version 2.15.0 or higher. URL: http://bioconductor.org/packages/release/bioc/html/RamiGO.html
Supplementary data are available at Bioinformatics online.
Summary: The survcomp package provides functions to assess and statistically compare the performance of survival/risk prediction models. It implements state-of-the-art statistics to (i) measure the performance of risk prediction models; (ii) combine these statistical estimates from multiple datasets using a meta-analytical framework; and (iii) statistically compare the performance of competitive models.
Availability: The R/Bioconductor package survcomp is provided open source under the Artistic-2.0 License with a user manual containing installation, operating instructions and use case scenarios on real datasets. survcomp requires R version 2.13.0 or higher. http://bioconductor.org/packages/release/bioc/html/survcomp.html
Contact: email@example.com; firstname.lastname@example.org
Supplementary Information: Supplementary data are available at Bioinformatics online.
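As a rough illustration of steps (i) and (ii) described above, the following Python sketch computes a concordance index (one widely used performance measure for risk prediction models) and pools per-dataset estimates by inverse-variance weighting. It is not survcomp's code, since survcomp is an R package; the function names `concordance_index` and `combine_fixed_effect` and the choice of a fixed-effect pooling rule are assumptions made for this illustration only.

```python
from itertools import combinations
from math import sqrt


def concordance_index(times, events, risks):
    """Harrell-style C-index: fraction of usable pairs that are ranked correctly.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk scores (higher = earlier event expected)
    """
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is usable only if the times differ and the earlier one is an event.
        if times[a] == times[b] or not events[a]:
            continue
        usable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # ties in predicted risk count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


def combine_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted (fixed-effect) pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * x for w, x in zip(weights, estimates)) / sum(weights)
    return pooled, sqrt(1.0 / sum(weights))
```

In this sketch a pair of patients contributes only when the shorter follow-up time ends in an observed event, which is the usual way censoring is handled in a pairwise concordance estimate.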
Survival prediction from a large number of covariates is a current focus of statistical and medical research. In this paper, we study a methodology known as the compound covariate prediction performed under univariate Cox proportional hazards models. We demonstrate via simulations and real data analysis that the compound covariate method generally competes well with ridge regression and Lasso methods, both already well-studied methods for predicting survival outcomes with a large number of covariates. Furthermore, we develop a refinement of the compound covariate method by incorporating likelihood information from multivariate Cox models. The new proposal is an adaptive method that borrows information contained in both the univariate and multivariate Cox regression estimators. We show that the new proposal has a theoretical justification from a statistical large sample theory and is naturally interpreted as a shrinkage-type estimator, a popular class of estimators in statistical literature. Two datasets, the primary biliary cirrhosis of the liver data and the non-small-cell lung cancer data, are used for illustration. The proposed method is implemented in the R package “compound.Cox”, available from CRAN at http://cran.r-project.org/.
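The basic compound covariate idea can be sketched in a few lines of Python: each covariate is weighted by the coefficient from its own one-covariate Cox fit, and the weighted sum serves as the risk score. The sketch assumes the univariate coefficients have already been estimated elsewhere; the function name `compound_covariate_score`, the coefficient values and the patient data are hypothetical, and neither the compound.Cox API nor the shrinkage refinement is reproduced here.

```python
def compound_covariate_score(x, univariate_betas):
    """Linear risk score: each covariate weighted by its own univariate Cox coefficient."""
    if len(x) != len(univariate_betas):
        raise ValueError("x and univariate_betas must have the same length")
    return sum(b * xi for b, xi in zip(univariate_betas, x))


# Hypothetical coefficients, as if taken from three separate one-covariate Cox fits.
betas = [0.8, -0.5, 0.1]
# Hypothetical standardized covariate values for two patients.
patients = {"A": [1.2, 0.3, -0.4], "B": [-0.7, 1.5, 0.2]}

scores = {pid: compound_covariate_score(x, betas) for pid, x in patients.items()}
# A higher score corresponds to a higher predicted hazard under this model.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because no multivariate model is fitted, the score remains computable when the number of covariates far exceeds the number of patients.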
Motivation: Meta-analysis of genomics data seeks to identify genes associated with a biological phenotype across multiple datasets; however, merging data from different platforms by their features (genes) is challenging. Meta-analysis using functionally or biologically characterized gene sets simplifies data integration, is biologically intuitive and is seen as having great potential, but is an emerging field with few established statistical methods.
Results: We transform gene expression profiles into binary gene set profiles by discretizing results of gene set enrichment analyses and apply a new iterative bi-clustering algorithm (iBBiG) to identify groups of gene sets that are coordinately associated with groups of phenotypes across multiple studies. iBBiG is optimized for meta-analysis of large numbers of diverse genomics data that may have unmatched samples. It does not require prior knowledge of the number or size of clusters. When applied to simulated data, it outperforms commonly used clustering methods, discovers overlapping clusters of diverse sizes and is robust in the presence of noise. We apply it to meta-analysis of breast cancer studies, where iBBiG extracted novel gene set-phenotype associations that predicted tumor metastases within tumor subtypes.
Availability: Implemented in the Bioconductor package iBBiG.
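The discretization step that precedes bi-clustering can be illustrated with a short Python sketch that thresholds a table of enrichment p-values into 0/1 calls. The cutoff of 0.05, the function name `binarize_enrichment` and the toy values are assumptions for illustration; the abstract does not state the cutoff used, and the iBBiG bi-clustering algorithm itself is not reproduced.

```python
def binarize_enrichment(p_values, alpha=0.05):
    """Turn a gene-set-by-phenotype table of enrichment p-values into 0/1 calls."""
    return {
        gene_set: [1 if p < alpha else 0 for p in row]
        for gene_set, row in p_values.items()
    }


# Rows: gene sets; columns: phenotype comparisons pooled from several studies (toy values).
p_values = {
    "CELL_CYCLE": [0.001, 0.20, 0.03],
    "EMT": [0.60, 0.04, 0.70],
}
binary = binarize_enrichment(p_values)
```

A binary table of this kind no longer depends on matching genes or samples across platforms, which is what makes it a convenient input for clustering across studies.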
Mounting evidence suggests that malignant tumors are initiated and maintained by a subpopulation of cancerous cells with biological properties similar to those of normal stem cells. However, descriptions of stem-like gene and pathway signatures in cancers are inconsistent across experimental systems. Driven by a need to improve our understanding of molecular processes that are common and unique across cancer stem cells (CSCs), we have developed the Stem Cell Discovery Engine (SCDE)—an online database of curated CSC experiments coupled to the Galaxy analytical framework. The SCDE allows users to consistently describe, share and compare CSC data at the gene and pathway level. Our initial focus has been on carefully curating tissue and cancer stem cell-related experiments from blood, intestine and brain to create a high quality resource containing 53 public studies and 1098 assays. The experimental information is captured and stored in the multi-omics Investigation/Study/Assay (ISA-Tab) format and can be queried in the data repository. A linked Galaxy framework provides a comprehensive, flexible environment populated with novel tools for gene list comparisons against molecular signatures in GeneSigDB and MSigDB, curated experiments in the SCDE and pathways in WikiPathways. The SCDE is available at http://discovery.hsci.harvard.edu.
GeneSigDB (http://www.genesigdb.org or http://compbio.dfci.harvard.edu/genesigdb/) is a database of gene signatures that have been extracted and manually curated from the published literature. It provides a standardized resource of published prognostic, diagnostic and other gene signatures of cancer and related disease to the community so they can compare the predictive power of gene signatures or use these in gene set enrichment analysis. Since GeneSigDB release 1.0, we have expanded from 575 to 3515 gene signatures, which were collected and transcribed from 1604 published articles largely focused on gene expression in cancer, stem cells, immune cells, development and lung disease. We have made substantial upgrades to the GeneSigDB website to improve accessibility and usability, including adding a tag cloud browse function, facetted navigation and a ‘basket’ feature to store genes or gene signatures of interest. Users can analyze GeneSigDB gene signatures, or upload their own gene list, to identify gene signatures with significant gene overlap, and results can be viewed on a dynamic editable heatmap that can be downloaded as a publication quality image. All data in GeneSigDB can be downloaded in numerous formats including .gmt file format for gene set enrichment analysis or as an R/Bioconductor data file. GeneSigDB is available from http://www.genesigdb.org.
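For readers unfamiliar with the .gmt format mentioned above, the following Python sketch writes and reads it: each line holds a set name, a description and then the member genes, all separated by tabs. The helper names `to_gmt` and `from_gmt` and the toy signature are hypothetical and are not part of GeneSigDB.

```python
def to_gmt(signatures):
    """Serialize {name: (description, genes)} as GMT text: one tab-separated line per set."""
    lines = []
    for name, (description, genes) in signatures.items():
        lines.append("\t".join([name, description, *genes]))
    return "\n".join(lines) + "\n"


def from_gmt(text):
    """Parse GMT text back into {name: (description, genes)}."""
    signatures = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, description, *genes = line.split("\t")
        signatures[name] = (description, genes)
    return signatures


# Toy signature for illustration only.
toy = {"SIG_A": ("toy signature", ["TP53", "BRCA1"])}
gmt_text = to_gmt(toy)
```

Gene sets in this layout can have different lengths, so the file is read line by line rather than as a rectangular table.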
The majority of breast cancer deaths result from metastases rather than from direct effects of the primary tumor itself. Recently, Landemaine and colleagues described a six-gene signature purported to predict lung metastasis risk. They analyzed gene expression in 23 metastases from breast cancer patients (5 lung, 18 non-lung), identifying a 21-gene signature. Expression of 16 of these was analyzed in primary breast tumors from 72 patients with known outcome, and six were selected that were predictive of lung metastases: DSC2, TFCP2L1, UGT8, ITGB8, ANP32E, and FERMT1. Despite the value of such a signature, our analysis indicates that their study ignored potentially important confounding factors and that their signature is instead a surrogate for molecular subtype.
The primary objective of most gene expression studies is the identification of one or more gene signatures: lists of genes whose transcriptional levels are uniquely associated with a specific biological phenotype. Whilst thousands of experimentally derived gene signatures are published, their potential value to the community is limited by their computational inaccessibility. Gene signatures are embedded in published article figures, tables or in supplementary materials, and are frequently presented using non-standard gene or probeset nomenclature. We present GeneSigDB (http://compbio.dfci.harvard.edu/genesigdb), a manually curated database of gene expression signatures. GeneSigDB release 1.0 focuses on cancer and stem cell gene signatures and was constructed from more than 850 publications from which we manually transcribed 575 gene signatures. Most gene signatures (n = 560) were successfully mapped to the genome to extract standardized lists of EnsEMBL gene identifiers. GeneSigDB provides the original gene signature, the standardized gene list and a fully traceable gene mapping history for each gene from the original transcribed data table through to the standardized list of genes. The GeneSigDB web portal is easy to search, allows users to compare their own gene list to those in the database, and download gene signatures in most common gene identifier formats.
Mutations affecting the BRCT domains of the breast cancer–associated tumor suppressor BRCA1 disrupt the recruitment of this protein to DNA double-strand breaks (DSBs). The molecular structures at DSBs recognized by BRCA1 are presently unknown. We report the interaction of the BRCA1 BRCT domain with RAP80, a ubiquitin-binding protein. RAP80 targets a complex containing the BRCA1-BARD1 (BRCA1-associated ring domain protein 1) E3 ligase and the deubiquitinating enzyme (DUB) BRCC36 to MDC1-γH2AX–dependent lysine6- and lysine63-linked ubiquitin polymers at DSBs. These events are required for cell cycle checkpoint and repair responses to ionizing radiation, implicating ubiquitin chain recognition and turnover in the BRCA1-mediated repair of DSBs.
Numerous feature selection methods have been applied to the identification of differentially expressed genes in microarray data. These include simple fold change, classical t-statistic and moderated t-statistics. Even though these methods return gene lists that are often dissimilar, few direct comparisons of these exist. We present an empirical study in which we compare some of the most commonly used feature selection methods. We apply these to 9 publicly available datasets and compare both the gene lists produced and how these perform in class prediction of test datasets.
In this study, we compared the efficiency of the following feature selection methods: significance analysis of microarrays (SAM), analysis of variance (ANOVA), empirical Bayes t-statistic, template matching, maxT, between group analysis (BGA), area under the receiver operating characteristic (ROC) curve, the Welch t-statistic, fold change, rank products, and sets of randomly selected genes. In each case these methods were applied to 9 different binary (two class) microarray datasets. First, we found little agreement in gene lists produced by the different methods. Only 8 to 21% of genes were in common across all 10 feature selection methods. Second, we evaluated the class prediction efficiency of each gene list in training and test cross-validation using four supervised classifiers.
We report that the choice of feature selection method, the number of genes in the gene list, the number of cases (samples) and the noise in the dataset substantially influence classification success. Recommendations are made for choice of feature selection. Area under a ROC curve performed well with datasets that had low levels of noise and large sample size. Rank products performed well when datasets had low numbers of samples or high levels of noise. The empirical Bayes t-statistic performed well across a range of sample sizes.
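To make the disagreement between gene lists concrete, the following Python sketch ranks genes by two of the statistics named above, fold change and the Welch t-statistic, and compares the resulting top lists. The toy expression values, the helper name `top_genes` and the use of a difference of means as the fold change (which assumes log-scale data) are assumptions for illustration; this is not the evaluation code used in the study.

```python
from math import sqrt
from statistics import mean, variance


def fold_change(a, b):
    """Difference of group means (data assumed to be on a log2 scale)."""
    return mean(a) - mean(b)


def welch_t(a, b):
    """Welch t-statistic: mean difference scaled by the unpooled standard error."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def top_genes(expr, groups, stat, k):
    """Rank genes by |stat| between the two classes and return the top k as a set."""
    scores = {}
    for gene, values in expr.items():
        a = [v for v, g in zip(values, groups) if g == 0]
        b = [v for v, g in zip(values, groups) if g == 1]
        scores[gene] = abs(stat(a, b))
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


# Toy two-class dataset: three samples per class.
groups = [0, 0, 0, 1, 1, 1]
expr = {
    "g1": [1.0, 5.0, 9.0, 7.0, 11.0, 15.0],  # large but noisy difference
    "g2": [2.0, 2.1, 1.9, 3.0, 3.1, 2.9],    # small but very consistent difference
    "g3": [4.0, 4.5, 5.0, 4.2, 4.4, 5.2],    # essentially unchanged
}
by_fold_change = top_genes(expr, groups, fold_change, k=1)
by_welch_t = top_genes(expr, groups, welch_t, k=1)
overlap = by_fold_change & by_welch_t
```

On these toy values the two statistics select different genes, because fold change ignores within-class variability while the t-statistic rewards consistency.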
A recent publication described a supervised classification method for microarray data: Between Group Analysis (BGA). This method, which is based on performing multivariate ordination of groups, proved to be very efficient for both classification of samples into pre-defined groups and disease class prediction of new unknown samples. Classification and prediction with BGA are classically performed using the whole set of genes and no variable selection is required. We hypothesize that an optimized selection of highly discriminating genes might improve the prediction power of BGA.
We propose an optimized between-group classification (OBC) which uses a jackknife-based gene selection procedure. OBC emphasizes classification accuracy rather than feature selection. OBC is a backward optimization procedure that maximizes the percentage of between group inertia by removing the least influential genes one by one from the analysis. This selects a subset of highly discriminative genes which optimize disease class prediction. We applied OBC to four datasets and compared it to other classification methods.
OBC considerably improved the classification and predictive accuracy of BGA, when assessed using independent data sets and leave-one-out cross-validation.
The R code is freely available [see Additional file 1] as well as supplementary information [see Additional file 2].
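The backward optimization idea can be sketched in Python as a greedy loop that repeatedly drops the gene whose removal leaves the highest between-group share of total inertia. This is a simplified illustration under stated assumptions: inertia is computed as plain sums of squares on the raw variables, the jackknife step of OBC is not reproduced, and the function names `gene_inertias` and `backward_select` and the toy data are hypothetical. It is not the R code in the additional files.

```python
from statistics import mean


def gene_inertias(values, groups):
    """Return (between-group, total) sums of squares for one gene."""
    grand = mean(values)
    total = sum((v - grand) ** 2 for v in values)
    between = 0.0
    for g in set(groups):
        members = [v for v, lab in zip(values, groups) if lab == g]
        between += len(members) * (mean(members) - grand) ** 2
    return between, total


def backward_select(expr, groups, n_keep):
    """Drop genes one at a time, each time removing the gene whose removal
    leaves the highest between-group share of total inertia."""
    parts = {gene: gene_inertias(vals, groups) for gene, vals in expr.items()}
    kept = set(parts)

    def ratio(genes):
        between = sum(parts[g][0] for g in genes)
        total = sum(parts[g][1] for g in genes)
        return between / total if total else 0.0

    while len(kept) > n_keep:
        # sorted() makes tie-breaking deterministic.
        worst = max(sorted(kept), key=lambda g: ratio(kept - {g}))
        kept.remove(worst)
    return kept, ratio(kept)


# Toy data: two samples per group.
groups = [0, 0, 1, 1]
expr = {
    "g1": [1, 1, 3, 3],  # varies only between groups
    "g2": [1, 3, 1, 3],  # varies only within groups
    "g3": [1, 2, 3, 4],  # mixed
}
kept, share = backward_select(expr, groups, n_keep=1)
```

In practice the stopping point would be chosen by cross-validated prediction accuracy rather than a fixed `n_keep`, which is used here only to keep the sketch short.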
Expression Profiler (EP, http://www.ebi.ac.uk/expressionprofiler) is a web-based platform for microarray gene expression and other functional genomics-related data analysis. The new architecture, Expression Profiler: next generation (EP:NG), modularizes the original design and allows individual analysis-task-related components to be developed by different groups while still working together seamlessly and sharing the same user interface look and feel. Data analysis components for gene expression data preprocessing, missing value imputation, filtering, clustering methods, visualization, significant gene finding, between group analysis and other statistical components are available from the EBI (European Bioinformatics Institute) web site. The web-based design of Expression Profiler supports data sharing and collaborative analysis in a secure environment. Developed tools are integrated with the microarray gene expression database ArrayExpress and form the exploratory analytical front-end to those data. EP:NG is an open-source project, encouraging broad distribution and further extensions from the scientific community.
Rapid development of DNA microarray technology has resulted in different laboratories adopting numerous different protocols and technological platforms, which has severely affected the comparability of array data. Current cross-platform comparisons of microarray gene expression data are usually based on cross-referencing the annotation of each gene transcript represented on the arrays, extracting a list of genes common to all arrays and comparing expression data of this gene subset. Unfortunately, filtering of genes to a subset represented across all arrays often excludes many thousands of genes, because different subsets of genes from the genome are represented on different arrays. We wish to describe the application of a powerful yet simple method for cross-platform comparison of gene expression data. Co-inertia analysis (CIA) is a multivariate method that identifies trends or co-relationships in multiple datasets which contain the same samples. CIA simultaneously finds ordinations (dimension reduction diagrams) from the datasets that are most similar. It does this by finding successive axes from the two datasets with maximum covariance. CIA can be applied to datasets where the number of variables (genes) far exceeds the number of samples (arrays), as is the case with microarray analyses.
We illustrate the power of CIA for cross-platform analysis of gene expression data by using it to identify the main common relationships in expression profiles on a panel of 60 tumour cell lines from the National Cancer Institute (NCI) which have been subjected to microarray studies using both Affymetrix and spotted cDNA array technology. The co-ordinates of the CIA projections of the cell lines from each dataset are graphed in a bi-plot and are connected by a line, the length of which indicates the divergence between the two datasets. Thus, CIA provides graphical representation of consensus and divergence between the gene expression profiles from different microarray platforms. Secondly, the genes that define the main trends in the analysis can be easily identified.
CIA is a robust, efficient approach to coupling of gene expression datasets. CIA provides simple graphical representations of the results making it a particularly attractive method for the identification of relationships between large datasets.
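The core computation, finding a pair of axes with maximum covariance between two datasets measured on the same samples, can be sketched in plain Python with power iteration on the cross-product matrix. This is a simplified, unweighted illustration of the first axis only; a full CIA typically starts from a separate ordination of each dataset and extracts several axes, which is not reproduced here. The function names and the toy matrices are assumptions.

```python
from math import sqrt


def center_columns(m):
    """Subtract each column's mean (rows = samples, columns = genes)."""
    means = [sum(col) / len(col) for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def normalize(vec):
    """Scale a vector to unit length."""
    norm = sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def first_coinertia_axis(x, y, n_iter=200):
    """Leading pair of loadings (u, v) maximizing the covariance of X.u and Y.v.

    Found by power iteration on the cross-product matrix C = Xc^T Yc,
    where Xc and Yc are the column-centered inputs.
    """
    xc, yc = center_columns(x), center_columns(y)
    p, q = len(xc[0]), len(yc[0])
    # c[j][k] = sum over samples of xc[sample][j] * yc[sample][k]
    c = [[sum(rx[j] * ry[k] for rx, ry in zip(xc, yc)) for k in range(q)]
         for j in range(p)]
    v = normalize([1.0] * q)
    for _ in range(n_iter):
        u = normalize([sum(c[j][k] * v[k] for k in range(q)) for j in range(p)])
        v = normalize([sum(c[j][k] * u[j] for j in range(p)) for k in range(q)])
    # Sample coordinates on the shared axis, one set per dataset.
    x_scores = [sum(a * b for a, b in zip(row, u)) for row in xc]
    y_scores = [sum(a * b for a, b in zip(row, v)) for row in yc]
    return u, v, x_scores, y_scores


# Toy example: four samples profiled on two platforms with different variables.
x = [[1, 0], [2, 1], [3, 0], [4, 1]]
y = [[2, 1], [4, 0], [6, 1], [8, 0]]
u, v, x_scores, y_scores = first_coinertia_axis(x, y)
```

The two matrices need the same rows (samples) but not the same columns, so no gene matching across platforms is required; the per-sample gap between `x_scores` and `y_scores` plays the role of the connecting line in the plot described above.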
We propose to make use of the wealth of underused DNA chip data available in public repositories to study the molecular mechanisms behind the adaptation of cancer cells to hypoxic conditions leading to the metastatic phenotype. We have developed new bioinformatics tools and adapted others to identify with maximum sensitivity those genes which are expressed differentially across several experiments. The comparison of two analytical approaches, based on either Over Representation Analysis or Functional Class Scoring, by a meta-analysis-based approach, led to the retrieval of known information about the biological situation – thus validating the model – but also more importantly to the discovery of the previously unknown implication of the spliceosome, the cellular machinery responsible for mRNA splicing, in the development of metastasis.
Long non-coding RNAs (lncRNAs) are emerging as potent regulators of cell physiology, and recent studies highlight their role in tumor development. However, while established protein-coding oncogenes and tumor suppressors often display striking patterns of focal DNA copy-number alteration in tumors, similar evidence is largely lacking for lncRNAs. Here, we report on a genomic analysis of GENCODE lncRNAs in high-grade serous ovarian adenocarcinoma, based on The Cancer Genome Atlas (TCGA) molecular profiles. Using genomic copy-number data and deep coverage transcriptome sequencing, we derived dual copy-number and expression data for 10,419 lncRNAs across 407 primary tumors. We describe global correlations between lncRNA copy-number and expression, and associate established expression subtypes with distinct lncRNA signatures. By examining regions of focal copy-number change that lack protein-coding targets, we identified an intergenic lncRNA on chromosome 1, OVAL, that shows narrow focal genomic amplification in a subset of tumors. While weakly expressed in most tumors, focal amplification coincided with strong OVAL transcriptional activation. Screening of 16 other cancer types revealed similar patterns in serous endometrial carcinomas. This shows that intergenic lncRNAs can be specifically targeted by somatic copy-number amplification, suggestive of functional involvement in tumor initiation or progression. Our analysis provides testable hypotheses and paves the way for further study of lncRNAs based on TCGA and other large-scale cancer genomics datasets.
Even in genomes lacking operons, a gene's position in the genome influences its potential for expression. The mechanisms by which adjacent genes are co-expressed are still not completely understood. Using lactation and the mammary gland as a model system, we explore the hypothesis that chromatin state contributes to the co-regulation of gene neighborhoods. The mammary gland represents a unique evolutionary model, due to its recent appearance, in the context of vertebrate genomes. An understanding of how the mammary gland is regulated to produce milk is also of biomedical and agricultural importance for human lactation and dairying. Here, we integrate epigenomic and transcriptomic data to develop a comprehensive regulatory model. Neighborhoods of mammary-expressed genes were determined using expression data derived from pregnant and lactating mice and a neighborhood scoring tool, G-NEST. Regions of open and closed chromatin were identified by ChIP-Seq of histone modifications H3K36me3, H3K4me2, and H3K27me3 in the mouse mammary gland and liver tissue during lactation. We found that neighborhoods of genes in regions of uniquely active chromatin in the lactating mammary gland, compared with liver tissue, were extremely rare. Rather, genes in most neighborhoods were suppressed during lactation as reflected in their expression levels and their location in regions of silenced chromatin. Chromatin silencing was largely shared between the liver and mammary gland during lactation, and what distinguished the mammary gland was mainly a small tissue-specific repertoire of isolated, expressed genes. These findings suggest that an advantage of the neighborhood organization is in the collective repression of groups of genes via a shared mechanism of chromatin repression. Genes essential to the mammary gland's uniqueness are isolated from neighbors, and likely have less tolerance for variation in expression, properties they share with genes responsible for an organism's survival.
Breast cancer in young women is more aggressive with a poorer prognosis and overall survival compared to older women diagnosed with the disease. Despite recent research, the underlying biology and molecular alterations that drive the aggressive nature of breast tumors associated with breast cancer in young women have yet to be elucidated. In this study, we performed transcriptomic profile and network analyses of breast tumors arising in Middle Eastern women to identify age-specific gene signatures. Moreover, we studied molecular alterations associated with cancer progression in young women using a cross-species comparative genomics approach coupled with copy number alterations (CNA) associated with breast cancers from independent studies. We identified 63 genes specific to tumors in young women that showed alterations distinct from two age cohorts of older women. The network analyses revealed potential critical regulatory roles for Myc, PI3K/Akt, NF-κB, and IL-1 in disease characteristics of breast tumors arising in young women. Cross-species comparative genomics analysis of progression from pre-invasive ductal carcinoma in situ (DCIS) to invasive ductal carcinoma (IDC) revealed 16 genes with concomitant genomic alterations, CCNB2, UBE2C, TOP2A, CEP55, TPX2, BIRC5, KIAA0101, SHCBP1, UBE2T, PTTG1, NUSAP1, DEPDC1, HELLS, CCNB1, KIF4A, and RRM2, that may be involved in tumorigenesis and in the processes of invasion and progression of disease. Array findings were validated using qRT-PCR, immunohistochemistry, and extensive in silico analyses of independently performed microarray datasets. To our knowledge, this study provides the first comprehensive genomic analysis of breast cancer in Middle Eastern women in age-specific cohorts and potential markers for cancer progression in young women. Our data demonstrate that cancer appearing in young women contains distinct biological characteristics and deregulated signaling pathways.
Moreover, our integrative genomic and cross-species analysis may provide robust biomarkers for the detection of disease progression in young women, and lead to more effective treatment strategies.
Heterogeneous nuclear ribonucleoprotein C1/C2 (hnRNP C) is a core component of 40S ribonucleoprotein particles that bind pre-mRNAs and influence their processing, stability and export. Breast cancer tumor suppressors BRCA1, BRCA2 and PALB2 form a complex and play key roles in homologous recombination (HR), DNA double strand break (DSB) repair and cell cycle regulation following DNA damage.
PALB2 nucleoprotein complexes were isolated using tandem affinity purification from nuclease-solubilized nuclear fraction. Immunofluorescence was used for localization studies of proteins. siRNA-mediated gene silencing and flow cytometry were used for studying DNA repair efficiency and cell cycle distribution/checkpoints. The effect of hnRNP C on mRNA abundance was assayed using quantitative reverse transcriptase PCR.
Results and Significance
We identified hnRNP C as a component of a nucleoprotein complex containing breast cancer suppressor proteins PALB2, BRCA2 and BRCA1. Notably, other components of the 40S ribonucleoprotein particle were not present in the complex. hnRNP C was found to undergo significant changes of sub-nuclear localization after ionizing radiation (IR) and to partially localize to DNA damage sites. Depletion of hnRNP C substantially altered the normal balance of repair mechanisms following DSB induction, reducing HR usage in particular, and impaired S phase progression after IR. Moreover, loss of hnRNP C strongly reduced the abundance of key HR proteins BRCA1, BRCA2, RAD51 and BRIP1, which can be attributed, at least in part, to the downregulation of their mRNAs due to aberrant splicing. Our results establish hnRNP C as a key regulator of BRCA gene expression and HR-based DNA repair. They also suggest the existence of an RNA regulatory program at sites of DNA damage, which involves a unique function of hnRNP C that is independent of the 40S ribonucleoprotein particles and most other hnRNP proteins.
Although ovarian cancer is often initially chemotherapy-sensitive, the vast majority of tumors eventually relapse and patients die of increasingly aggressive disease. Cancer stem cells are believed to have properties that allow them to survive therapy and may drive recurrent tumor growth. Cancer stem cells or cancer-initiating cells are a rare cell population and difficult to isolate experimentally. Genes that are expressed by stem cells may characterize a subset of less differentiated tumors and aid in prognostic classification of ovarian cancer. The purpose of this study was the genomic identification and characterization of a subtype of ovarian cancer that has stem cell-like gene expression. Using human and mouse gene signatures of embryonic, adult, or cancer stem cells, we performed an unsupervised bipartition class discovery on expression profiles from 145 serous ovarian tumors to identify a stem-like and more differentiated subgroup. Subtypes were reproducible and were further characterized in four independent, heterogeneous ovarian cancer datasets. We identified a stem-like subtype characterized by a 51-gene signature, which is significantly enriched in tumors with properties of Type II ovarian cancer: high grade, serous tumors, and poor survival. Conversely, the differentiated tumors share properties with Type I, including lower grade and mixed histological subtypes. The stem cell-like signature was prognostic within high-stage serous ovarian cancer, classifying a small subset of high-stage tumors with better prognosis, in the differentiated subtype. In multivariate models that adjusted for common clinical factors (including grade, stage, age), the subtype classification was still a significant predictor of relapse.
The prognostic stem-like gene signature yields new insights into prognostic differences in ovarian cancer, provides a genomic context for defining Type I/II subtypes, and identifies potential gene targets which, following further validation, may be valuable in the clinical management or treatment of ovarian cancer.
Quantifying chromosomal instability (CIN) has both prognostic and predictive clinical utility in breast cancer. In order to establish a robust and clinically applicable gene expression-based measure of CIN, we assessed the ability of four qPCR quantified genes selected from the 70-gene Chromosomal Instability (CIN70) expression signature to stratify outcome in patients with grade 2 breast cancer.
AURKA, FOXM1, TOP2A and TPX2 (CIN4) were selected from the CIN70 signature due to their high level of correlation with histological grade and mean CIN70 signature expression in silico. We assessed the ability of CIN4 to stratify outcome in an independent cohort of patients diagnosed between 1999 and 2002. A total of 185 formalin-fixed, paraffin-embedded (FFPE) samples were included in the qPCR measurement of CIN4 expression. In parallel, ploidy status of tumors was assessed by flow cytometry. We investigated whether the categorical CIN4 score derived from the CIN4 signature was correlated with recurrence-free survival (RFS) and ploidy status in this cohort.
We observed a significant association of tumor proliferation, defined by Ki67 and mitotic index (MI), with both CIN4 expression and aneuploidy. The CIN4 score stratified grade 2 carcinomas into good and poor prognostic cohorts (mean RFS: 83.8±4.9 and 69.4±8.2 months, respectively, p = 0.016) and its predictive power was confirmed by multivariate analysis, outperforming MI and Ki67 expression.
The first clinically applicable qPCR-derived measure of tumor aneuploidy from FFPE tissue stratifies grade 2 tumors into good and poor prognosis groups.
We recently showed that differential expression of extracellular matrix (ECM) genes delineates four subgroups of breast carcinomas (ECM1, -2, -3 and -4) with different clinical outcome. To further investigate the characteristics of ECM signature and its impact on tumor progression, we conducted unsupervised clustering analyses in 6 additional independent datasets of invasive breast tumors from different platforms for a total of 643 samples. Use of four different clustering algorithms identified ECM3 tumors as an independent group in all datasets tested. ECM3 showed a homogeneous gene pattern, consisting of 58 genes encoding 43 structural ECM proteins. From 26 to 41% of the cases were ECM3-enriched, and analysis of datasets relevant to gene expression in neoplastic or corresponding stromal cells showed that both stromal and breast carcinoma cells can coordinately express ECM3 genes. In in vitro experiments, β-estradiol induced ECM3 gene production in ER-positive breast carcinoma cell lines, whereas TGFβ induced upregulation of the genes leading to ECM3 gene classification, especially in ER-negative breast carcinoma cells and in fibroblasts. Multivariate analysis of distant metastasis-free survival in untreated breast tumor patients revealed a significant interaction between ECM3 and histological grade (p = 0.001). Cox models, estimated separately in grade I–II and grade III tumors, indicated a highly significant association between ECM3 and worse survival probability only in grade III tumors (HR = 3.0, 95% CI = 1.3–7.0, p = 0.0098). Gene Set Enrichment analysis of ECM3 compared to non-ECM3 tumors revealed significant enrichment of epithelial-mesenchymal transition (EMT) genes in both grade I–II and grade III subsets of ECM3 tumors. Thus, ECM3 is a robust cluster that identifies breast carcinomas with EMT features but with accelerated metastatic potential only in the undifferentiated (grade III) phenotype.
These findings support the key relevance of neoplastic and stroma interaction in breast cancer progression.
Robust transcriptional signatures in cancer can be identified by data similarity-driven meta-analysis of gene expression profiles. An unbiased data integration and interrogation strategy has not previously been available.
Methods and Findings
We implemented and performed a large meta-analysis of breast cancer gene expression profiles from 223 datasets containing 10,581 human breast cancer samples using a novel data similarity-based approach (iterative EXALT). Cancer gene expression signatures extracted from individual datasets were clustered by data similarity and consolidated into a meta-signature with a recurrent and concordant gene expression pattern. A retrospective survival analysis was performed to evaluate the predictive power of a novel meta-signature deduced from transcriptional profiling studies of human breast cancer. Validation cohorts consisting of 6,011 breast cancer patients from 21 different breast cancer datasets and 1,110 patients with other malignancies (lung and prostate cancer) were used to test the robustness of our findings. During the iterative EXALT analysis, 633 signatures were grouped by their data similarity and formed 121 signature clusters. From the 121 signature clusters, we identified a unique meta-signature (BRmet50) based on a cluster of 11 signatures sharing a phenotype related to highly aggressive breast cancer. In patients with breast cancer, there was a significant association between BRmet50 and disease outcome, and the prognostic power of BRmet50 was independent of common clinical and pathologic covariates. Furthermore, the prognostic value of BRmet50 was not specific to breast cancer, as it also predicted survival in prostate and lung cancers.
We have established and implemented a novel data similarity-driven meta-analysis strategy. Using this approach, we identified a transcriptional meta-signature (BRmet50) in breast cancer, and the prognostic performance of BRmet50 was robust and applicable across a wide range of cancer-patient populations.
Breast cancer cells with the CD44+/CD24− phenotype have been reported to be tumourigenic due to their enhanced capacity for cancer development and their self-renewal potential. The identification of human tumourigenic breast cancer cells in surgical samples has recently received increased attention due to the implications for prognosis and treatment, although limitations exist in the interpretation of these studies. To better identify the CD44+/CD24− cells in routine surgical specimens, 56 primary breast carcinoma cases were analysed by immunofluorescence and confocal microscopy, and the results were compared using flow cytometry analysis to correlate the amount and distribution of the CD44+/CD24− population with clinicopathological features. Using these methods, we showed that the breast carcinoma cells displayed four distinct sub-populations based on the expression pattern of CD44 and CD24. The CD44+/CD24− cells were found in 91% of breast tumours and constituted an average of 6.12% (range, 0.11%–21.23%) of the tumour. A strong correlation was found between the percentage of CD44+/CD24− cells in primary tumours and distant metastasis development (p = 0.0001); in addition, there was a significant inverse association with ER and PGR status (p = 0.002 and p = 0.001, respectively). No relationship was evident with tumour size (T), regional lymph node (N) status, differentiation grade, proliferative index or HER2 status. In a multivariate analysis, the percentage of CD44+/CD24− cancer cells was an independent factor related to metastasis development (p = 0.004). Our results indicate that confocal analysis of fluorescence-labelled breast cancer samples obtained at surgery is a reliable method to identify the CD44+/CD24− tumourigenic cell population, allowing for the stratification of breast cancer patients into two groups with substantially different relapse rates on the basis of CD44+/CD24− cell percentage.
Receptor Associated Protein 80 (RAP80) is a subunit of the BRCA1-A complex and targets BRCA1 to DNA damage sites in response to DNA double strand breaks. Since mutations of BRCA1 are associated with familial ovarian cancers, we screened 26 ovarian cancer-derived cell lines for RAP80 mutations and found that TOV-21G cells harbor a RAP80 mutation (c.1107G>A). This mutation generates a stop codon at Trp369, which deletes part of the AIR region and the C-terminal zinc fingers of RAP80. Interestingly, both the mutant and wild type alleles of RAP80 lose their expression due to promoter hypermethylation, suggesting that TOV-21G is a RAP80-null cell line. In these cells, not only is the BRCA1-A complex disrupted, but the relocation of the remaining subunits in the BRCA1-A complex, including BRCA1, CCDC98, NBA1, BRCC36 and BRE, is significantly suppressed. Moreover, TOV-21G cells are hypersensitive to ionizing radiation, which is due to the compromised DNA damage repair capacity in these cells. Reconstitution of TOV-21G cells with wild type RAP80 rescues these cellular defects in response to DNA damage. Thus, our results demonstrate that RAP80 is a scaffold protein in the BRCA1-A complex. Identification of TOV-21G as a RAP80-null tumor cell line will be very useful for the study of the molecular mechanism of the DNA damage response.
The Cancer Genome Atlas (TCGA) Network recently produced a comprehensive catalogue of the molecular aberrations in 487 high-grade serous ovarian cancers, but much remains to be elucidated regarding the microRNAs (miRNAs). Here, using TCGA ovarian data, we surveyed the miRNAs in the context of their predicted gene targets.
Methods and Results
Integration of miRNA and gene patterns yielded evidence that proximal pairs of miRNAs are processed from polycistronic primary transcripts, and that intronic miRNAs and their host gene mRNAs derive from common transcripts. Patterns of miRNA expression revealed multiple tumor subtypes and a set of 34 miRNAs predictive of overall patient survival. In a global analysis, miRNA:mRNA pairs anti-correlated in expression across tumors showed a higher frequency of in silico predicted target sites in the mRNA 3′-untranslated region (with less frequency observed for coding sequence and 5′-untranslated regions). The miR-29 family and predicted target genes were among the most strongly anti-correlated miRNA:mRNA pairs; over-expression of miR-29a in vitro repressed several anti-correlated genes (including DNMT3A and DNMT3B) and substantially decreased ovarian cancer cell viability.
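The anti-correlation step described above can be illustrated with a small example. This is only a sketch under stated assumptions: it uses Pearson correlation with an arbitrary cutoff of -0.5, the names `miR_x`, `gene_a` and `gene_b` are hypothetical, and the expression values are made-up toy numbers rather than TCGA data. The source does not state which correlation measure or threshold was actually used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def anticorrelated_pairs(mirna_expr, mrna_expr, threshold=-0.5):
    """Return (miRNA, mRNA, r) tuples with r <= threshold, most negative first.

    Both arguments map a feature name to its expression values across the
    same ordered set of tumors. The threshold is an illustrative assumption.
    """
    pairs = []
    for mirna, mirna_values in mirna_expr.items():
        for gene, gene_values in mrna_expr.items():
            r = pearson(mirna_values, gene_values)
            if r <= threshold:
                pairs.append((mirna, gene, r))
    return sorted(pairs, key=lambda item: item[2])

# Toy data for five hypothetical tumors (not real measurements).
mirna_expr = {"miR_x": [1.0, 2.0, 3.0, 4.0, 5.0]}
mrna_expr = {
    "gene_a": [9.8, 8.1, 6.2, 3.9, 2.0],  # falls as miR_x rises
    "gene_b": [1.1, 2.0, 2.9, 4.2, 5.0],  # rises with miR_x
}

pairs = anticorrelated_pairs(mirna_expr, mrna_expr)
for mirna, gene, r in pairs:
    print(f"{mirna} : {gene}  r = {r:.3f}")
```

In a full analysis, the pairs retained here would then be compared against in silico target predictions to ask whether anti-correlated pairs carry predicted target sites more often than other pairs.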
This study establishes miRNAs as having a widespread impact on gene expression programs in ovarian cancer, further strengthening our understanding of miRNA biology as it applies to human cancer. As with gene transcripts, miRNAs exhibit high diversity reflecting the genomic heterogeneity within a clinically homogeneous disease population. Putative miRNA:mRNA interactions, as identified using integrative analysis, can be validated. TCGA data are a valuable resource for the identification of novel tumor suppressive miRNAs in ovarian as well as other cancers.