|Home | About | Journals | Submit | Contact Us | Français|
Breast cancer is a disease with high heterogeneity. Many issues on tumorigenesis and progression are still elusive. It is critical to identify genes that play important roles in the progression of tumors, especially for tumors with poor prognosis such as basal-like breast cancer and tumors in very young women. To facilitate the identification of potential regulatory or driver genes, we present the Breast Cancer Integrative Platform (BCIP, http://omics.bmi.ac.cn/bcancer/). BCIP maintains multi-omics data selected with strict quality control and processed with uniform normalization methods, including gene expression profiles from 9,005 tumor and 376 normal tissue samples, copy number variation information from 3,035 tumor samples, microRNA-target interactions, co-expressed genes, KEGG pathways, and mammary tissue-specific gene functional networks. This platform provides a user-friendly interface integrating comprehensive and flexible analysis tools on differential gene expression, copy number variation, and survival analysis. The prominent characteristic of BCIP is that users can perform analysis by customizing subgroups with single or combined clinical features, including subtypes, histological grades, pathologic stages, metastasis status, lymph node status, ER/PR/HER2 status, TP53 mutation status, menopause status, age, tumor size, therapy responses, and prognosis. BCIP will help to identify regulatory or driver genes and candidate biomarkers for further research in breast cancer.
Breast cancer is a frequently diagnosed carcinoma and is the leading cause of cancer death among females worldwide. An estimated 1,676,600 cases were diagnosed and 521,900 deaths occurred in 2012, accounting for 25% of the total cancer cases and 15% of all the cancer deaths among females1. Breast cancer is a heterogeneous disease with a high degree of diversity in morphology, histology, pathological features, and molecular alterations, such as gene mutations and abnormal expression2. Researches based on gene expression (GE) patterns have classified breast cancer into distinct subgroups corresponding to different prognostic outcomes and therapeutic responses3,4,5,6. A number of studies have focused on the recognition of biomarkers and characterization of gene function in particular breast cancer subgroups7,8,9,10.
Triple-negative breast cancer (TNBC) is a highly aggressive subtype of breast cancer and the vast majority is basal-like phenotype11. Due to its high genetic heterogeneity, TNBC does not possess a common genetic mutation and thus lacks effective targeted therapies12. However, several studies based on GE have identified several critical genes that may be potential druggable targets for the treatment of TNBC13,14,15. For example, MELK has been characterized as an oncogenic kinase essential for basal-like breast cancer (BBC) via a kinome-wide screening, integrative analysis with multiple GE datasets, and further in vitro and in vivo experiments14. BCL11A has also been reported to be a novel TNBC oncogene by in silico analysis of several microarray datasets and subsequent experimental validations15. These studies suggest that GE profiles are important resources for regulatory gene and biomarker identification in breast cancer. Differential expression analysis, copy number variation (CNV) analysis, survival analysis, and co-expression analysis on multiple credible and qualified datasets are effective approaches for recognizing novel regulatory genes and biomarkers.
In order to help researchers use gene expression profiles, some databases and tools have been developed16,17,18,19. However, integrative platforms combined with multi-omics data and customized analysis tools for breast cancer are still lacking. In this study, we developed BCIP, which provides differential expression analysis, copy number variation analysis, survival analysis, co-expression analysis, microRNA (miRNA) regulation analysis, and pathway analysis for query genes. To ensure the reliability of the analysis, we collected and obtained GE profile data on 9,381 samples from 29 datasets with strict quality control and uniform processing. We also incorporated CNV information on 3,035 samples, 324,219 miRNA-target interactions, 286 KEGG pathways, and data from tissue-specific gene functional networks of mammary gland and mammary epithelium. In order to facilitate researchers’ analysis of the specific subgroups they focused on, we developed a comprehensive and flexible interface that permits users to customize subgroups with single or combined clinical features of interest, including subtypes, grades, stages, metastasis status, lymph node status, prognosis, age, tumor size, ER/PR/HER2 status, TP53 mutation status, menopause status, and therapy response. BCIP will be a valuable tool for the identification of regulatory or driver genes in breast cancer.
We initially retrieved and collected data from NCBI Gene Expression Omnibus20 (GEO), European Genome-phenome Archive of EMBL European Bioinformatics Institute21 (EMBL-EBI), and The Cancer Genome Atlas22 (TCGA) with the following criteria: (1) gene microarray or high-throughput sequencing data of RNAs extracted from primary breast tumor or adjacent normal tissues; (2) the sample size of each dataset is no less than 50; (3) clinical information were provided together with the dataset, mainly including subtypes, histological grades, pathologic stages, ER/PR/HER2 status, and prognosis; (4) the dataset was available for download before Jan 1, 2016, which was the latest date we collected the datasets. Finally, we obtained a preliminary collection of 86 independent datasets. To assure adequate specimens in subgrouping, we assessed the sample number demand and removed the datasets with less than 100 samples. The rest 30 datasets include 27 datasets from GEO (measured by Affymetrix microarray), 2 datasets from TCGA (measured by Agilent microarray and Illumina HiSeq), and 1 dataset of the METABRIC from EBI (measured by Illumina HT-12 v3 microarray).
Then we performed quality control, normalization and duplicate removing on all the 30 datasets. Quality control was carried out by simpleaffy and affyPLM R packages on each of the 27 GEO datasets independently. The raw data of each dataset were then normalized, summarized, and log-transformed using robust multi-array average (RMA) function of affy R package. The probe-based expression was converted into GE profiles, and the gene containing multiple probes was represented by the probe with the largest interquartile range across the samples. For the METABRIC dataset, we deleted 12 samples since 8 samples were duplicated in the discovery and validation sets and 4 were represented twice in the validation set. We used the processed expression matrix data of METABRIC directly23. For the TCGA Agilent and RNA-Seq data, we removed 22 samples without matched clinical information and used level_3 log2 normalized data from TCGA directly.
Furthermore, the tumor purity of the samples profiled on Affymetrix platforms was detected through a robust method, ESTIMATE, which uses the ESTIMATE-based tumor purity score developed by Affymetrix data to evaluate tumor purity24. This method was not applied to predict the tumor purity of the samples profiled on Affymetrix Human Genome U133B Array because of the insufficiency of the gene signatures intersection. Depending on the results of tumor purity estimation (Supplementary Figure S1), we eliminated one dataset with the lowest mean tumor purity, in order to reduce noises caused by diverse tumor purity. Finally, GE profiles of a total of 26,339 genes from 9,381 samples of 29 datasets were available for transcriptome analysis.
We compiled a series of clinical features along with each sample for sample subgrouping. For samples with some clinical features that were not initially provided (mainly ER−/+/PR−/+/HER2−/+, TNBC, and PAM50 subtypes) in certain collected datasets, we defined these features using a computational method based on GE profiles. Expressions of ER, PR, and HER2 were respectively fitted by a Gaussian bimodal distribution model and the parameters were estimated via EM algorithm using Mclust function in mclust R package. The expression status for ER, PR, and HER2 were discriminated as positive (ER+, PR+, and HER2+) or negative (ER−, PR−, and HER2−). On the basis of the identification of ER, PR, and HER2 positive or negative expression status, we classified samples into TNBC or non-TNBC subtype6. Samples defined as ER−, PR− and HER2− status were identified as TNBC and otherwise non-TNBC. Molecular classification for PAM50 subtypes was provided in some datasets, and if not, we classified the patients into the five intrinsic breast cancer subtypes using the 50-gene subtype classifier, PAM504. The feature of the prognosis status was classified into good/poor prognosis using the median survival time as the delimitation.
We obtained CNV data for 28,678 genes of 3,035 samples from METABRIC and TCGA. Both the DNA microarray platforms of the 2 datasets are Affymetrix Genome-Wide Human SNP Array 6.0. The numerical values of CNV were processed, summarized, and normalized relying on the relative intensity of probe hybridization on the arrays. Segmented data were converted to the gene level matrix using GISTIC 2.025, which were annotated for gene content based on hg19/GRCh37 for the TCGA data. For the METABRIC data, we generated a patient-by-gene CNV matrix through the processed segment data by matching the overlap of the segments with the gene regions whose annotations and coordinates were given by hg18/Ensembl 54. For more accurate and reliable analysis, we set the gain/loss threshold to 0.1 and −0.1, respectively. When the CNV value of a gene is greater than 0.1, the gene is defined as copy number gain. When the CNV value of a gene is smaller than −0.1, the gene is defined as copy number loss.
All the statistical analysis were performed using R programming platform. An unpaired t test was used for differential GE analysis in Transcriptome Analysis for 2 subgroups. One-way analysis of variance (ANOVA) was used for more than 2 subgroups if GE satisfied the assumption of a normal distribution, and if not, the non-parametric test (Kruskal–Wallis test) was used to assess statistical differences among these subgroups. The survfit function of survival R package was used for survival analysis. Kaplan-Meier curves and log-rank test were used to assess survival differences. In Transcriptome Survival Analysis, we classified patients into 2 groups according to an optimal GE cutoff value based on the Cutoff Finder application26. This program will traverse the GE values of all patients and the optimal cutoff value can minimizes the p value of survival differences. In CNV Survival Analysis, patients are separated into 2 groups according to their CNV status (gain/loss) of the query gene. Hazard ratio (HR) was calculated using Cox proportional hazards regression model. Co-expression analysis was performed using cox function of WGCNA R package. The correlation of GE was evaluated by Pearson correlation coefficient (PCC) as well as false discovery rate (FDR) adjusted p-value. The genes with absolute PCC≥0.3 and adjusted p-value≤0.05 were considered co-expressed in Co-expression Analysis.
BCIP is a gene-centered platform that provides (1) differential expression analysis, survival analysis, and co-expression analysis based on transcriptome data; (2) differential analysis and survival analysis based on CNVs; (3) miRNA regulation analysis on miRNA-target interactions; (4) KEGG pathway analysis; and (5) network analysis on mammary tissue-specific gene function networks (Fig. 1a). BCIP provides a user-friendly interface consisting of four panels: Analysis Type, Sample Subgrouping, Dataset, and Result (Fig. 1b). A gene symbol can be input in the text field where we provide a fuzzy matching function. Users can then select any of 5 analytical categories in the Analysis Type panel, including Transcriptome Analysis, Copy Number Variation Analysis, MicroRNA-target Interaction Analysis, Pathway Analysis, and Gene Functional Network Analysis. After selecting analytical category, users can customize subgroups with single or combined clinical features of interest in the Sample Subgrouping panel. BCIP provides a total of 15 clinical features, including TNBC and non-TNBC subtypes, PAM50 subtypes, histological grades, pathologic stages, metastasis status, lymph node status, ER/PR/HER2 status, TP53 mutation status, menopause status, age, tumor size, therapy responses, and prognosis. The Dataset panel provides all of the available datasets for the selected options in the Analysis Type and Sample Subgrouping. Finally, the Result panel returns corresponding graphical and tabular presentation and analysis results after choosing from the above options.
We collected GE data of breast cancer tissue samples from publicly available databases of GEO, EMBL-EBI and TCGA and obtained 86 datasets. After excluding the datasets with insufficient samples (less than 100) or low tumor purity, we finally retained 29 datasets with the GE profiles of 9,381 samples (Fig. 2a and Supplementary Table S1). The GE profiles are used for differential expression analysis, survival analysis, and co-expression analysis.
CNVs exist pervasively in human genomes and contribute to the diversity and susceptibility of numerous diseases30. It may be an important factor in cancer occurrence and development. A series of studies and attempts have been carried out to explore the impact of CNV on breast cancer31,32,33. We collected and incorporated CNVs information of 3,035 tumor samples from METABRIC and TCGA. BCIP provides differential analysis and survival analyses for CNV data.
miRNAs are small non-coding RNAs that can regulate protein-coding messenger RNAs (mRNAs) at the post-transcriptional level34. The pivotal role of miRNAs is known as a modulator participating in various biological processes. Numerous studies suggest that dysregulation of miRNAs may contribute to the initiation and progression of cancers35, and miRNAs can be regarded as diagnostic signatures or therapeutic biomarkers in breast cancer36,37. To facilitate researchers’ investigations into potential regulation mechanisms of query genes, BCIP provides MicroRNA-target Interaction Analysis, which illustrates miRNAs targeting the query gene. There is a total of 324,219 miRNA-target interactions between 2,619 miRNAs and 14,884 target genes from miRTarBase38 that are maintained in BCIP. All of these interactions are experimentally validated.
It is known that some biological pathways involved in metabolism, apoptosis, and signal transduction play critical roles in cell proliferation and differentiation during tumorigenesis and cancer development. Understanding which pathways a gene participates in will be of great help for researchers in characterizing its functions in breast cancer. We have collected 22,455 linked entries between 6,755 genes and 286 human pathways from the KEGG database39.
Each biomolecule is located in complex biological networks and exerts its functions together with other related molecules. Notably, gene expression as well as gene-gene functional relationships in the complex biological regulation network may be tissue-specific18,40. Identifying a gene’s functional partner in specified tissue can facilitate researchers to infer gene functions and molecular mechanisms. Here we provide mammary tissue-specific gene functional networks analysis. Data on both mammary epithelium and mammary gland gene functional networks were collected and processed from the GIANT webserver40.
For a query gene, BCIP helps to demonstrate its potential as a biomarker or regulatory gene in breast cancer. Take MELK, a promising therapeutic target of BBC reported recently, as an example to demonstrate the utility and advantage of BCIP14. Differential expression analysis shows that MELK has a much higher expression level in tumors than adjacent normal tissues across all of the available datasets (Supplementary Figure S2) and has the highest expression level in basal-like subtype among PAM50 subtypes across all of the datasets (Supplementary Figure S3). When subgroups are customized with tumor grades, we found that higher MELK expression level was significantly associated with higher histological grades among all of the datasets (Supplementary Figure S4). Survival analysis shows that overexpression of MELK is strongly correlated with poor prognosis (Supplementary Figure S5–9). These in silico results indicate that MELK might play roles in BBC.
To provide clues of the possible molecular mechanism of MELK in BBC, we further analyzed the co-expression genes and regulatory miRNAs of MELK. In basal-like subtype of the METABRIC dataset, MELK was significantly co-expressed with 78 genes with PCC>0.6, including CDCA5, TPX2, and CEP55 (Fig. 2e). Several studies have demonstrated that TPX2 and CEP55 are critical molecules for breast cancer migration, invasion, cell proliferation, and metastasis41,42,43. CDCA5 has been reported to play a crucial role in human lung carcinogenesis and has the potential of being a therapeutic target for oral squamous cell carcinoma44,45. These results may be valuable clues for the investigation of potential function and molecular mechanism of MELK in breast cancer. Additionally, we found miRNAs targeting MELK, including hsa-miR-193b-3p and hsa-miR-372-5p (Fig. 4). Previous studies have shown that miR-193b represses cell proliferation and regulates cyclin D1 in melanoma46, and miR-372 suppresses tumor proliferation, invasion, and migration in various tumor types47,48. This result indicates that MELK might be regulated by miR-193b-3p and miR-372-5p in breast cancer.
We have developed BCIP, a user-friendly, open-access, integrative analysis platform that integrates almost 10,000 tumor and normal tissue samples of breast cancer. It will facilitate the identification of potential biomarkers and regulatory genes in breast cancer. Compared with other bioinformatics resources and analysis tools, BCIP has 3 unique characteristics: (i) BCIP incorporates multiple analysis types, including differential expression analysis, copy number variation and survival analysis, gene co-expression analysis, miRNA regulation analysis, KEGG pathways presentation, and mammary tissue-specific gene functional network analysis. All of these analysis tools help to sketch an overview of a gene in breast cancer. (ii) It provides dozens of datasets that are screened from publicly available databases, selected with strict quality control and processed with uniform normalization methods. Users can observe the consistency of the analysis results across multiple datasets, which will be helpful to evaluate the robustness of analysis results. (iii) BCIP permits users to perform analysis in specific breast cancer subgroups that are customized with single or combined clinical features of interest, including molecular subtypes, therapy response, and various clinical features.
Lots of studies have been done to identify biomarkers and to uncover molecular mechanisms of tumorigenesis, cell invasion, and metastasis in breast cancer. However, many tumors with high invasion and poor outcomes, such as TNBC or basal-like tumors, still lack well-defined molecular biomarkers and therapy targets due to the high heterogeneity. BCIP serves as a convenient and efficient platform to identify biomarkers, characterize potential functions and mechanisms of genes in breast cancer. Researchers can find clues for subsequent experiments and clinical analysis. In our future work, we will continue incorporating newly available, credible data into BCIP and provide reliable supports for researchers of breast cancer.
How to cite this article: Wu, J. et al. BCIP: a gene-centered platform for identifying potential regulatory genes in breast cancer. Sci. Rep. 7, 45235; doi: 10.1038/srep45235 (2017).
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work was supported by a grant from China National High Technology Research and Development Program 2014AA020604 (X. Ying). The authors gratefully acknowledge the technical assistance for platform construction from Mingming Su, Honglei Li, Jiemin Zou, and Xinlei Zhang. This study makes use of data generated by Molecular Taxonomy of Breast Cancer International Consortium, the authors greatly thank to the METABRIC group for the approval of the data usage. The authors gratefully acknowledge the copyright permission of KEGG for pathway map image used in our article.
The authors declare no competing financial interests.
Author Contributions J.W. and S.H. carried out the majority of the data analysis and platform construction, participated in the platform and web page design, and drafted the manuscript. Y.C. participated in providing technical assistance to platform construction. Z.L. and J.Z. provided guidance for data processing procedures, and participated in part of data analysis. H.Y. and Q.S. contributed to the data collection and arrangement from publicly available databases. N.S. provided guidance to the study. X.Y. conceived the study and platform construction, contributed to the web page design, and revised the manuscript. All authors read and approved the final manuscript.