Motivation: The Illumina Infinium 450 k DNA Methylation Beadchip is a prime candidate technology for Epigenome-Wide Association Studies (EWAS). However, a difficulty associated with these beadarrays is that probes come in two different designs, characterized by widely different DNA methylation distributions and dynamic range, which may bias downstream analyses. A key statistical issue is therefore how best to adjust for the two different probe designs.
Results: Here we propose a novel model-based intra-array normalization strategy for 450 k data, called BMIQ (Beta MIxture Quantile dilation), to adjust the beta-values of type2 design probes into a statistical distribution characteristic of type1 probes. The strategy involves application of a three-state beta-mixture model to assign probes to methylation states, subsequent transformation of probabilities into quantiles and finally a methylation-dependent dilation transformation to preserve the monotonicity and continuity of the data. We validate our method on cell-line data, fresh frozen and paraffin-embedded tumour tissue samples and demonstrate that BMIQ compares favourably with two competing methods. Specifically, we show that BMIQ improves the robustness of the normalization procedure, reduces the technical variation and bias of type2 probe values and successfully eliminates the type1 enrichment bias caused by the lower dynamic range of type2 probes. BMIQ will be useful as a preprocessing step for any study using the Illumina Infinium 450 k platform.
Availability: BMIQ is freely available from http://code.google.com/p/bmiq/.
Supplementary data are available at Bioinformatics online.
The cellular phenotype is described by a complex network of molecular interactions. Elucidating network properties that distinguish disease from the healthy cellular state is therefore of critical importance for gaining systems-level insights into disease mechanisms and ultimately for developing improved therapies. By integrating gene expression data with a protein interaction network we here demonstrate that cancer cells are characterised by an increase in network entropy. In addition, we formally demonstrate that gene expression differences between normal and cancer tissue are anticorrelated with local network entropy changes, thus providing a systemic link between gene expression changes at the nodes and their local correlation patterns. In particular, we find that genes which drive cell-proliferation in cancer cells and which often encode oncogenes are associated with reductions in network entropy. These findings may have potential implications for identifying novel drug targets.
BMC Research Notes recently published a research article regarding the use of ligated DNA extracted from formalin-fixed paraffin embedded (FFPE) tissue on the Illumina Infinium methylation platform - “Interpretation of genome-wide infinium methylation data from ligated DNA in formalin-fixed, paraffin-embedded paired tumor and normal tissue” Jasmine et al. BMC Research Notes 2012, 5:117. This article repeatedly refers to our previous work and concludes that methylation data obtained from ligated FFPE extracted DNA should be used with great caution. In this Discussion we review the data analysis performed in Jasmine et al.’s paper and identify limitations that lead the authors to draw what we believe are incorrect conclusions. Moreover, we continue to analyse genome-wide methylation data from DNA extracted from FFPE tissue successfully on both the HumMeth27 and 450 K arrays.
Oestrogen receptor-α (ER) is the defining and driving transcription factor in the majority of breast cancers and its target genes dictate cell growth and endocrine response, yet genomic understanding of ER function has been restricted to model systems [1-3]. We now map genome-wide ER binding events, by chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq), in primary breast cancers from patients with different clinical outcome and in distant ER positive (ER+) metastases. We find that drug resistant cancers still have ER-chromatin occupancy, but that ER binding is a dynamic process, with the acquisition of unique ER binding regions in tumours from patients that are likely to relapse. The acquired, poor outcome ER regulatory regions observed in primary tumours reveal gene signatures that predict clinical outcome in ER+ disease exclusively. We find that the differential ER binding programme observed in tumours from patients with poor outcome is not due to the selection of a rare subpopulation of cells, but is due to the FoxA1-mediated reprogramming of ER binding on a rapid time scale. The parallel redistribution of ER and FoxA1 cis-regulatory elements in drug resistant cellular contexts is supported by histological co-expression of ER and FoxA1 in metastatic samples. By establishing transcription factor mapping in primary tumour material, we show that there is plasticity in ER binding capacity, with distinct combinations of cis-regulatory elements linked with the different clinical outcomes.
The 27k Illumina Infinium Methylation Beadchip is a popular high-throughput technology that allows the methylation state of over 27,000 CpGs to be assayed. While feature selection and classification methods have been comprehensively explored in the context of gene expression data, relatively little is known as to how best to perform feature selection or classification in the context of Illumina Infinium methylation data. Given the rising importance of epigenomics in cancer and other complex genetic diseases, and in view of the upcoming epigenome wide association studies, it is critical to identify the statistical methods that offer improved inference in this novel context.
Using a total of 7 large Illumina Infinium 27k Methylation data sets, encompassing over 1,000 samples from a wide range of tissues, we here provide an evaluation of popular feature selection, dimensional reduction and classification methods on DNA methylation data. Specifically, we evaluate the effects of variance filtering, supervised principal components (SPCA) and the choice of DNA methylation quantification measure on downstream statistical inference. We show that for relatively large sample sizes feature selection using test statistics is similar for M and β-values, but that in the limit of small sample sizes, M-values allow more reliable identification of true positives. We also show that the effect of variance filtering on feature selection is study-specific and dependent on the phenotype of interest and tissue type profiled. Specifically, we find that variance filtering improves the detection of true positives in studies with large effect sizes, but that it may lead to worse performance in studies with smaller yet significant effect sizes. In contrast, supervised principal components improves the statistical power, especially in studies with small effect sizes. We also demonstrate that classification using the Elastic Net and Support Vector Machine (SVM) clearly outperforms competing methods like LASSO and SPCA. Finally, in unsupervised modelling of cancer diagnosis, we find that non-negative matrix factorisation (NMF) clearly outperforms principal components analysis.
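For reference, the two quantification measures compared above are related by a logit transform. The sketch below assumes the commonly used definition M = log2(β / (1 − β)); the function names and the clipping offset eps are illustrative choices and are not taken from the study.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta-value in [0, 1] to an M-value (log2 logit)."""
    # Clip away from 0 and 1 so the logarithm stays finite.
    # eps is an assumed offset, chosen here only for illustration.
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse transform: map an M-value back to a beta-value."""
    return 2.0 ** m / (2.0 ** m + 1.0)


if __name__ == "__main__":
    # Beta-values near 0 or 1 are stretched out on the M scale.
    for beta in (0.05, 0.5, 0.95):
        print(beta, beta_to_m(beta))
```

Because the transform stretches values near 0 and 1, the two measures can rank CpGs differently when test statistics are computed from only a few samples.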
Our results highlight the importance of tailoring the feature selection and classification methodology to the sample size and biological context of the DNA methylation study. The Elastic Net emerges as a powerful classification algorithm for large-scale DNA methylation studies, while NMF does well in the unsupervised context. The insights presented here will be useful to any study embarking on large-scale DNA methylation profiling using Illumina Infinium beadarrays.
DNA methylation; Classification; Feature selection; Beadarrays
Recently, it has been proposed that epigenetic variation may contribute to the risk of complex genetic diseases like cancer. We aimed to demonstrate that epigenetic changes in normal cells, collected years in advance of the first signs of morphological transformation, can predict the risk of such transformation.
We analyzed DNA methylation (DNAm) profiles of over 27,000 CpGs in cytologically normal cells of the uterine cervix from 152 women in a prospective nested case-control study. We used statistics based on differential variability to identify CpGs associated with the risk of transformation and a novel statistical algorithm called EVORA (Epigenetic Variable Outliers for Risk prediction Analysis) to make predictions.
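To illustrate what a statistic based on differential variability looks like, the sketch below ranks CpGs by the log-ratio of sample variances between two groups. This is a deliberately simple stand-in and is not the EVORA algorithm or the test used in the study; the function names and the toy data are hypothetical.

```python
import math
from statistics import variance


def log_variance_ratio(cases, controls):
    """log2 of the sample-variance ratio (cases / controls) for one CpG.

    Assumes both groups have non-zero variance.
    """
    return math.log2(variance(cases) / variance(controls))


def rank_by_differential_variability(profiles):
    """Rank CpGs by decreasing absolute log-variance ratio.

    profiles: dict mapping CpG id -> (case values, control values).
    """
    scores = {cpg: log_variance_ratio(ca, co) for cpg, (ca, co) in profiles.items()}
    return sorted(scores, key=lambda cpg: abs(scores[cpg]), reverse=True)


# Hypothetical toy data: cpg_A has one outlier among the cases,
# cpg_B differs in mean only and has equal spread in both groups.
toy_profiles = {
    "cpg_A": ([0.10, 0.12, 0.11, 0.60, 0.09], [0.10, 0.11, 0.12, 0.10, 0.11]),
    "cpg_B": ([0.50, 0.52, 0.48, 0.51, 0.49], [0.30, 0.32, 0.28, 0.31, 0.29]),
}

if __name__ == "__main__":
    print(rank_by_differential_variability(toy_profiles))
```

In the toy data, a variance-based statistic flags the CpG with a single outlier, whereas a statistic based on mean differences would favour the other CpG.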
We observed many CpGs that were differentially variable between women who developed a non-invasive cervical neoplasia within 3 years of sample collection and those that remained disease-free. These CpGs exhibited heterogeneous outlier methylation profiles and overlapped strongly with CpGs undergoing age-associated DNA methylation changes in normal tissue. Using EVORA, we demonstrate that the risk of cervical neoplasia can be predicted in blind test sets (AUC = 0.66 (0.58 to 0.75)), and that assessment of DNAm variability allows more reliable identification of risk-associated CpGs than statistics based on differences in mean methylation levels. In independent data, EVORA showed high sensitivity and specificity to detect pre-invasive neoplasia and cervical cancer (AUC = 0.93 (0.86 to 1) and AUC = 1, respectively).
We demonstrate that the risk of neoplastic transformation can be predicted from DNA methylation profiles in the morphologically normal cell of origin of an epithelial cancer. Having profiled only 0.1% of CpGs in the human genome, studies of wider coverage are likely to yield improved predictive and diagnostic models with the accuracy needed for clinical application.
The ARTISTIC trial is registered with the International Standard Randomised Controlled Trial Number ISRCTN25417821.
Aberrant DNA methylation is an important cancer hallmark, yet the dynamics of DNA methylation changes in human carcinogenesis remain largely unexplored. Moreover, the role of DNA methylation for prediction of clinical outcome is still uncertain and confined to specific cancers. Here we perform the most comprehensive study of DNA methylation changes throughout human carcinogenesis, analysing 27,578 CpGs in each of 1,475 samples, ranging from normal cells in advance of non-invasive neoplastic transformation to non-invasive and invasive cancers and metastatic tissue. We demonstrate that hypermethylation at stem cell PolyComb Group Target genes (PCGTs) occurs in cytologically normal cells three years in advance of the first morphological neoplastic changes, while hypomethylation occurs preferentially at CpGs which are heavily Methylated in Embryonic Stem Cells (MESCs) and increases significantly with cancer invasion in both the epithelial and stromal tumour compartments. In contrast to PCGT hypermethylation, MESC hypomethylation progresses significantly from primary to metastatic cancer and defines a poor prognostic signature in four different gynaecological cancers. Finally, we associate expression of TET enzymes, which are involved in active DNA demethylation, to MESC hypomethylation in cancer. These findings have major implications for cancer and embryonic stem cell biology and establish the importance of systemic DNA hypomethylation for predicting prognosis in a wide range of different cancers.
DNA methylation is an important chemical modification of DNA that can affect and regulate the activity of genes in human tissue. Abnormal DNA methylation and its subsequent effects on gene activity are a hallmark of cancer, yet when precisely these DNA methylation changes occur and how they contribute to the development of cancer remains largely unexplored. In this work we measure the methylation state of DNA at over 14,000 genes in over 1,475 samples, including normal and benign cells, invasive cancers, and metastatic cancer tissue. Using cervical cancer as a model, we show that gain of abnormal methylation at genes typically un-methylated in stem cells can be detected up to 3 years in advance of the appearance of pre-cancerous cells, while those genes typically methylated in stem cells lose this methylation progressively throughout cancer development. Furthermore, we discover that this process of methylation loss during cancer progression is a marker of poor disease outcome common to all four major women-specific cancers: breast, ovarian, endometrial, and cervical cancers. Finally we demonstrate the relationship between loss of methylation and cancer-specific over-production of a specific protein known to play an active role in removing methylation from DNA. Taken together these findings highlight the complex nature of DNA methylation dynamics in cancer development as well as their potential exploitation for clinical gain.
A substantial proportion of lymph node-negative patients who receive adjuvant chemotherapy do not derive any benefit from this aggressive and potentially toxic treatment. However, standard histopathological indices cannot reliably detect patients at low risk of relapse or distant metastasis. In the past few years several prognostic gene expression signatures have been developed and shown to potentially outperform histopathological factors in identifying low-risk patients in specific breast cancer subgroups with predictive values of around 90%, and therefore hold promise for clinical application. We envisage that further improvements and insights may come from integrative expression pathway analyses that dissect prognostic signatures into modules related to cancer hallmarks.
Recent multi-dimensional approaches to the study of complex disease have revealed powerful insights into how genetic and epigenetic factors may underlie their aetiopathogenesis. We examined genotype-epigenotype interactions in the context of Type 2 Diabetes (T2D), focussing on known regions of genomic susceptibility. We assayed DNA methylation in 60 females, stratified according to disease susceptibility haplotype using previously identified association loci. CpG methylation was assessed using methylated DNA immunoprecipitation on a targeted array (MeDIP-chip) and absolute methylation values were estimated using a Bayesian algorithm (BATMAN). Absolute methylation levels were quantified across LD blocks, and we identified increased DNA methylation on the FTO obesity susceptibility haplotype, tagged by the rs8050136 risk allele A (p = 9.40×10^−4, permutation p = 1.0×10^−3). Further analysis across the 46 kb LD block using sliding windows localised the most significant difference to be within a 7.7 kb region (p = 1.13×10^−7). Sequence level analysis, followed by pyrosequencing validation, revealed that the methylation difference was driven by the co-ordinated phase of CpG-creating SNPs across the risk haplotype. This 7.7 kb region of haplotype-specific methylation (HSM) encapsulates a Highly Conserved Non-Coding Element (HCNE) that has previously been validated as a long-range enhancer, supported by the histone H3K4me1 enhancer signature. This study demonstrates that integration of Genome-Wide Association (GWA) SNP and epigenomic DNA methylation data can identify potential novel genotype-epigenotype interactions within disease-associated loci, thus providing a novel route to aid unravelling common complex diseases.
Elucidating the activation pattern of molecular pathways across a given tumour type is a key challenge necessary for understanding the heterogeneity in clinical response and for developing novel more effective therapies. Gene expression signatures of molecular pathway activation derived from perturbation experiments in model systems as well as structural models of molecular interactions ("model signatures") constitute an important resource for estimating corresponding activation levels in tumours. However, relatively few strategies for estimating pathway activity from such model signatures exist and only few studies have used activation patterns of pathways to refine molecular classifications of cancer.
Here we propose a novel network-based method for estimating pathway activation in tumours from model signatures. We find that, although the pathway networks inferred from cancer expression data are highly consistent with the prior information contained in the model signatures, they also exhibit a highly modular structure, and that estimation of pathway activity depends on this modular structure. We apply our methodology to a panel of 438 estrogen receptor negative (ER-) and 785 estrogen receptor positive (ER+) breast cancers to infer activation patterns of important cancer related molecular pathways.
We show that in ER negative basal and HER2+ breast cancer, gene expression modules reflecting T-cell helper-1 (Th1) and T-cell helper-2 (Th2) mediated immune responses play antagonistic roles as major risk factors for distant metastasis. Using Boolean interaction Cox-regression models to identify non-linear pathway combinations associated with clinical outcome, we show that simultaneous high activation of Th1 and low activation of a TGF-beta pathway module defines a subtype of particularly good prognosis and that this classification provides a better prognostic model than those based on the individual pathways. In ER+ breast cancer, we find that simultaneous high MYC and RAS activity confers significantly worse prognosis than either high MYC or high RAS activity alone. We further validate these novel prognostic classifications in independent sets of 173 ER- and 567 ER+ breast cancers.
We have proposed a novel method for pathway activity estimation in tumours and have shown that pathway modules antagonize or synergize to delineate novel prognostic subtypes. Specifically, our results suggest that simultaneous modulation of T-helper differentiation and TGF-beta pathways may improve clinical outcome of hormone insensitive breast cancers over treatments that target only one of these pathways.
Diabetic nephropathy is a serious complication of diabetes mellitus and is associated with considerable morbidity and high mortality. There is increasing evidence to suggest that dysregulation of the epigenome is involved in diabetic nephropathy. We assessed whether epigenetic modification of DNA methylation is associated with diabetic nephropathy in a case-control study of 192 Irish patients with type 1 diabetes mellitus (T1D). Cases had T1D and nephropathy whereas controls had T1D but no evidence of renal disease.
We performed DNA methylation profiling in bisulphite converted DNA from cases and controls using the recently developed Illumina Infinium® HumanMethylation27 BeadChip, that enables the direct investigation of 27,578 individual cytosines at CpG loci throughout the genome, which are focused on the promoter regions of 14,495 genes.
Singular Value Decomposition (SVD) analysis indicated that significant components of DNA methylation variation correlated with patient age, time to onset of diabetic nephropathy, and sex. Adjusting for confounding factors using multivariate Cox-regression analyses, and with a false discovery rate (FDR) of 0.05, we observed 19 CpG sites that demonstrated correlations with time to development of diabetic nephropathy. Of note, this included one CpG site located 18 bp upstream of the transcription start site of UNC13B, a gene in which the first intronic SNP rs13293564 has recently been reported to be associated with diabetic nephropathy.
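The abstract does not state which procedure was used to control the FDR at 0.05. As a minimal sketch, assuming the widely used Benjamini-Hochberg step-up procedure, selecting sites at a given FDR could look like this (the function name and the p-values in the usage example are hypothetical):

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Return indices of hypotheses rejected at the given FDR (BH step-up)."""
    m = len(pvalues)
    # Indices of the p-values in ascending order of p.
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Track the largest rank k whose ordered p-value satisfies p_(k) <= k * fdr / m.
        if pvalues[idx] <= rank * fdr / m:
            cutoff_rank = rank
    # Everything up to that rank is rejected, even if some smaller ranks failed.
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(toy_pvalues))
```

The step-up rule means that a p-value can be rejected even if it exceeds its own threshold, provided a larger p-value further down the ordered list meets its threshold.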
This high throughput platform was able to successfully interrogate the methylation state of individual cytosines and identified 19 prospective CpG sites associated with risk of diabetic nephropathy. These differences in DNA methylation are worthy of further follow-up in replication studies using larger cohorts of diabetic patients with and without nephropathy.
The statistical study of biological networks has led to important novel biological insights, such as the presence of hubs and hierarchical modularity. There is also a growing interest in studying the statistical properties of networks in the context of cancer genomics. However, relatively little is known as to what network features differ between the cancer and normal cell physiologies, or between different cancer cell phenotypes.
Based on the observation that frequent genomic alterations underlie a more aggressive cancer phenotype, we asked if such an effect could be detectable as an increase in the randomness of local gene expression patterns. Using a breast cancer gene expression data set and a model network of protein interactions we derive constrained weighted networks defined by a stochastic information flux matrix reflecting expression correlations between interacting proteins. Based on this stochastic matrix we propose and compute an entropy measure that quantifies the degree of randomness in the local pattern of information flux around single genes. By comparing the local entropies in the non-metastatic versus metastatic breast cancer networks, we here show that breast cancers that metastasize are characterised by a small yet significant increase in the degree of randomness of local expression patterns. We validate this result in three additional breast cancer expression data sets and demonstrate that local entropy better characterises the metastatic phenotype than other non-entropy based measures. We show that increases in entropy can be used to identify genes and signalling pathways implicated in breast cancer metastasis and provide examples of de-novo discoveries of gene modules with known roles in apoptosis, immune-mediated tumour suppression, cell-cycle and tumour invasion. Importantly, we also identify a novel gene module within the insulin growth factor signalling pathway, alteration of which may predispose the tumour to metastasize.
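A minimal sketch of a local entropy of this kind is given below. It assumes that the edge weight between two interacting proteins is a non-negative number such as the absolute expression correlation, that the weights around each gene are normalised to sum to one to give a row of the stochastic matrix, and that the entropy of a gene is the Shannon entropy of its row divided by the logarithm of its number of neighbours. These specific choices, and all names in the code, are illustrative assumptions rather than the exact definitions used in the study.

```python
import math


def stochastic_row(weights):
    """Normalise the non-negative edge weights around one gene to sum to 1.

    Assumes the weights have a positive total.
    """
    total = sum(weights.values())
    return {nbr: w / total for nbr, w in weights.items()}


def local_entropy(weights):
    """Normalised Shannon entropy of the flux distribution around one gene.

    weights: dict mapping neighbour id -> non-negative weight.
    Returns a value in [0, 1]; 1 means flux is spread evenly over neighbours.
    """
    k = len(weights)
    if k < 2:
        raise ValueError("local entropy needs at least two neighbours")
    p = stochastic_row(weights)
    # Zero-probability edges contribute nothing to the entropy.
    h = -sum(pj * math.log(pj) for pj in p.values() if pj > 0)
    return h / math.log(k)


if __name__ == "__main__":
    # Hypothetical weights around one gene in two phenotypes.
    print(local_entropy({"a": 0.9, "b": 0.05, "c": 0.05}))
    print(local_entropy({"a": 0.5, "b": 0.5, "c": 0.5}))
```

Dividing by the logarithm of the neighbour count makes entropies comparable between genes of different degree, so that a hub is not scored as more random simply because it has more interaction partners.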
These results demonstrate that a metastatic cancer phenotype is characterised by an increase in the randomness of the local information flux patterns. Measures of local randomness in integrated protein interaction mRNA expression networks may therefore be useful for identifying genes and signalling pathways disrupted in one phenotype relative to another. Further exploration of the statistical properties of such integrated cancer expression and protein interaction networks will be a fruitful endeavour.
Recent studies have shown that DNA methylation (DNAm) markers in peripheral blood may hold promise as diagnostic or early detection/risk markers for epithelial cancers. However, to date no study has evaluated the diagnostic and predictive potential of such markers in a large case control cohort and on a genome-wide basis.
By performing genome-wide DNAm profiling of a large ovarian cancer case control cohort, we here demonstrate that active ovarian cancer has a significant impact on the DNAm pattern in peripheral blood. Specifically, by measuring the methylation levels of over 27,000 CpGs in blood cells from 148 healthy individuals and 113 age-matched pre-treatment ovarian cancer cases, we derive a DNAm signature that can predict the presence of active ovarian cancer in blind test sets with an AUC of 0.8 (95% CI (0.74–0.87)). We further validate our findings in another independent set of 122 post-treatment cases (AUC = 0.76 (0.72–0.81)). In addition, we provide evidence for a significant number of candidate risk or early detection markers for ovarian cancer. Furthermore, by comparing the pattern of methylation with gene expression data from major blood cell types, we here demonstrate that age and cancer elicit common changes in the composition of peripheral blood, with a myeloid skewing that increases with age and which is further aggravated in the presence of ovarian cancer. Finally, we show that most cancer and age associated methylation variability is found at CpGs located outside of CpG islands.
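For readers unfamiliar with the AUC figures quoted above, the sketch below computes an AUC from classifier scores using its rank-based (Mann-Whitney) interpretation: the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half. The function name and scores are invented for illustration, and the confidence intervals reported in the study are not reproduced here.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0   # case correctly ranked above control
            elif case == control:
                wins += 0.5   # ties contribute one half
    return wins / (len(case_scores) * len(control_scores))


if __name__ == "__main__":
    # Hypothetical signature scores for three cases and three controls.
    print(auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))
```

An AUC of 0.5 corresponds to a signature that ranks cases and controls no better than chance, and an AUC of 1 to perfect separation.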
Our results underscore the potential of DNAm profiling in peripheral blood as a tool for detection or risk-prediction of epithelial cancers, and warrant further in-depth and higher CpG coverage studies to elucidate this role.
Acquired somatic mutations are responsible for approximately 90% of breast tumours. However, only one somatic aberration, amplification of the HER2 locus, is currently used to define a clinical subtype, one that accounts for approximately 10% to 15% of breast tumours. In recent years, a number of mutational profiling studies have attempted to further identify clinically relevant mutations. While these studies have confirmed the oncogenic or tumour suppressor role of many known suspects, they have exposed complexity as a main feature of the breast cancer mutational landscape (the 'muta-ome'). The two defining features of this complexity are (a) a surprising richness of low-frequency mutants contrasting with the relative rarity of high-frequency events and (b) the relatively large number of somatic genomic aberrations (approximately 20 to 50) driving an average tumour. Structural features of this complex landscape have begun to emerge from follow-up studies that have tackled the complexity by integrating the spectrum of genomic mutations with a variety of complementary biological knowledge databases. Among these structural features are the growing links between somatic gene disruptions and those conferring breast cancer risk, mutually exclusive coexistence and synergistic mutational patterns, and a clearly non-random distribution of mutations implicating specific molecular pathways in breast tumour initiation and progression. Recognising that a shift from a gene-centric to a pathway-centric approach is necessary, we envisage that further progress in identifying clinically relevant genomic aberration patterns and associated breast cancer subtypes will require not only multi-dimensional integrative analyses that combine mutational and functional profiles, but also larger profiling studies that use second- and third-generation sequencing technologies in order to fill out the important gaps in the current mutational landscape.
Patients with primary operable oestrogen receptor (ER) negative (-) breast cancer account for about 30% of all cases and generally have a worse prognosis than ER-positive (+) patients. Nevertheless, a significant proportion of ER- cases have favourable outcomes and could potentially benefit from a less aggressive course of therapy. However, identification of such patients with a good prognosis remains difficult and at present is only possible through examining histopathological factors.
Building on a previously identified seven-gene prognostic immune response module for ER- breast cancer, we developed a novel statistical tool based on Mixture Discriminant Analysis in order to build a classifier that could accurately identify ER- patients with a good prognosis.
We report the construction of a seven-gene expression classifier that accurately predicts, across a training cohort of 183 ER- tumours and six independent test cohorts (a total of 469 ER- tumours), ER- patients of good prognosis (in test sets, average predictive value = 94% [range 85 to 100%], average hazard ratio = 0.15 [range 0.07 to 0.36] p < 0.000001) independently of lymph node status and treatment.
This seven-gene classifier could be used in a polymerase chain reaction-based clinical assay to identify ER- patients with a good prognosis, who may therefore benefit from less aggressive treatment regimens.
The recent whole-genome scan for breast cancer has revealed the FGFR2 (fibroblast growth factor receptor 2) gene as a locus associated with a small, but highly significant, increase in the risk of developing breast cancer. Using fine-scale genetic mapping of the region, it has been possible to narrow the causative locus to a haplotype of eight strongly linked single nucleotide polymorphisms (SNPs) spanning a region of 7.5 kilobases (kb) in the second intron of the FGFR2 gene. Here we describe a functional analysis to define the causative SNP, and we propose a model for a disease mechanism. Using gene expression microarray data, we observed a trend of increased FGFR2 expression in the rare homozygotes. This trend was confirmed using real-time (RT) PCR, with the difference between the rare and the common homozygotes yielding a Wilcoxon p-value of 0.028. To elucidate which SNPs might be responsible for this difference, we examined protein–DNA interactions for the eight most strongly disease-associated SNPs in different breast cell lines. We identify two cis-regulatory SNPs that alter binding affinity for transcription factors Oct-1/Runx2 and C/EBPβ, and we demonstrate that both sites are occupied in vivo. In transient transfection experiments, the two SNPs can synergize, giving rise to increased FGFR2 expression. We propose a model in which the Oct-1/Runx2 and C/EBPβ binding sites in the disease-associated allele are able to lead to an increase in FGFR2 gene expression, thereby increasing the propensity for tumour formation.
Recently, a number of whole-genome association studies have identified genes that predispose individuals to common diseases such as cancer. The challenge now is to understand how the identified risk loci contribute to disease, since the majority of these loci are located within introns (which are discarded after transcription) and intergenic regions, and therefore do not change the coding region of nearby genes. This manuscript describes how two single–base pair changes in intron 2 of the FGFR2 (fibroblast growth factor receptor 2) gene, “the top hit” of the breast cancer susceptibility study, exert their function. We find that the changes alter the binding of two transcription factors and cause an increase in FGFR2 gene expression, thus providing a molecular explanation for the risk phenotype. This is the first functional study, to our knowledge, of the risk loci identified for breast cancer in a whole-genome scan and demonstrates that these studies can be used as valid starting points for studying the underlying biology of cancer.
Recent whole-genome scans have identified novel risk genes for many common diseases, challenging researchers to determine how these genes contribute to disease. A new study provides molecular insights into a breast cancer risk factor.
High resolution array-CGH and expression profiling identifies a novel genomic subtype of ER negative breast cancer, and provides a genome-wide list of common copy number alterations associated with aberrant expression and poor prognosis.
The characterization of copy number alteration patterns in breast cancer requires high-resolution genome-wide profiling of a large panel of tumor specimens. To date, most genome-wide array comparative genomic hybridization studies have used tumor panels of relatively large tumor size and high Nottingham Prognostic Index (NPI) that are not as representative of breast cancer demographics.
We performed an oligo-array-based high-resolution analysis of copy number alterations in 171 primary breast tumors of relatively small size and low NPI, which was therefore more representative of breast cancer demographics. Hierarchical clustering over the common regions of alteration identified a novel subtype of high-grade estrogen receptor (ER)-negative breast cancer, characterized by a low genomic instability index. We were able to validate the existence of this genomic subtype in one external breast cancer cohort. Using matched array expression data we also identified the genomic regions showing the strongest coordinate expression changes ('hotspots'). We show that several of these hotspots are located in the phosphatome, kinome and chromatinome, and harbor members of the 122-breast cancer CAN-list. Furthermore, we identify frequently amplified hotspots on 8q22.3 (EDD1, WDSOF1), 8q24.11-13 (THRAP6, DCC1, SQLE, SPG8) and 11q14.1 (NDUFC2, ALG8, USP35) associated with significantly worse prognosis. Amplification of any of these regions identified 37 samples with significantly worse overall survival (hazard ratio (HR) = 2.3 (1.3-1.4) p = 0.003) and time to distant metastasis (HR = 2.6 (1.4-5.1) p = 0.004) independently of NPI.
We present strong evidence for the existence of a novel subtype of high-grade ER-negative tumors that is characterized by a low genomic instability index. We also provide a genome-wide list of common copy number alteration regions in breast cancer that show strong coordinate aberrant expression, and further identify novel frequently amplified regions that correlate with poor prognosis. Many of the genes associated with these regions represent likely novel oncogenes or tumor suppressors.
A feature selection method was used in an analysis of three major microarray expression datasets to identify molecular subclasses and prognostic markers in estrogen receptor-negative breast cancer, showing that it is a heterogeneous disease with at least four main subtypes.
Estrogen receptor (ER)-negative breast cancer specimens are predominantly of high grade, have frequent p53 mutations, and are broadly divided into HER2-positive and basal subtypes. Although ER-negative disease has overall worse prognosis than does ER-positive breast cancer, not all ER-negative breast cancer patients have poor clinical outcome. Reliable identification of ER-negative tumors that have a good prognosis is not yet possible.
We apply a recently proposed feature selection method in an integrative analysis of three major microarray expression datasets to identify molecular subclasses and prognostic markers in ER-negative breast cancer. We find a subclass of basal tumors, characterized by over-expression of immune response genes, which has a better prognosis than the rest of ER-negative breast cancers. Moreover, we show that, in contrast to ER-positive tumours, the majority of prognostic markers in ER-negative breast cancer are over-expressed in the good prognosis group and are associated with activation of complement and immune response pathways. Specifically, we identify an immune response related seven-gene module and show that downregulation of this module confers greater risk for distant metastasis (hazard ratio 2.02, 95% confidence interval 1.2-3.4; P = 0.009), independent of lymph node status and lymphocytic infiltration. Furthermore, we validate the immune response module using two additional independent datasets.
We show that ER-negative basal breast cancer is a heterogeneous disease with at least four main subtypes. Furthermore, we show that the heterogeneity in clinical outcome of ER-negative breast cancer is related to the variability in expression levels of complement and immune response pathway genes, independent of lymphocytic infiltration.
The quantity of mRNA transcripts in a cell is determined by a complex interplay of cooperative and counteracting biological processes. Independent Component Analysis (ICA) is one of a small number of unsupervised algorithms that have been applied to microarray gene expression data in an attempt to understand phenotype differences in terms of changes in the activation/inhibition patterns of biological pathways. While the ICA model has been shown to outperform other linear representations of the data such as Principal Components Analysis (PCA), a validation using explicit pathway and regulatory element information has not yet been performed. We apply a range of popular ICA algorithms to six of the largest microarray cancer datasets and use pathway-knowledge and regulatory-element databases for validation. We show that ICA outperforms PCA and clustering-based methods in that ICA components map closer to known cancer-related pathways, regulatory modules, and cancer phenotypes. Furthermore, we identify cancer signalling and oncogenic pathways and regulatory modules that play a prominent role in breast cancer and relate the differential activation patterns of these to breast cancer phenotypes. Importantly, we find novel associations linking immune response and epithelial–mesenchymal transition pathways with estrogen receptor status and histological grade, respectively. In addition, we find associations linking the activity levels of biological pathways and transcription factors (NF1 and NFAT) with clinical outcome in breast cancer. ICA provides a framework for a more biologically relevant interpretation of genomewide transcriptomic data. Adopting ICA as the analysis tool of choice will help understand the phenotype–pathway relationship and thus help elucidate the molecular taxonomy of heterogeneous cancers and of other complex genetic diseases.
The amount of a given transcript or protein in a cell is determined by a balance of expression and repression in a complex network of biological processes. This delicate balance is compromised in complex genetic diseases such as cancer by alterations in the activation patterns of functionally important biological processes known as pathways. In recent years, a large number of microarray experiments profiling the expression levels of more than 20,000 human genes in hundreds of tumor samples have shown that most cancer types are heterogeneous diseases, each characterized by many different expression subtypes. The biological and clinical goal is to explain the observed tumor and clinical heterogeneity in terms of specific patterns of altered pathways. The bioinformatic challenge is therefore to devise mathematical tools that explicitly attempt to infer these altered pathways. To this end, we applied a signal processing tool in a meta-analysis of breast cancer, encompassing more than 800 tumor specimens derived from four different patient cohorts, and showed that this algorithm significantly outperforms popular standard bioinformatics tools in identifying altered pathways underlying breast cancer. These results suggest that the same tool could be applied to other complex human genetic diseases to better elucidate the underlying altered pathways.
A consensus prognostic classifier for estrogen receptor positive breast tumors has been developed and shown to be valid in nearly 900 samples across different microarray platforms.
A consensus prognostic gene expression classifier is still elusive in heterogeneous diseases such as breast cancer.
Here we perform a combined analysis of three major breast cancer microarray data sets to home in on a universally valid prognostic molecular classifier in estrogen receptor (ER) positive tumors. Using a recently developed robust measure of prognostic separation, we further validate the prognostic classifier in three external independent cohorts, confirming the validity of our molecular classifier in a total of 877 ER positive samples. Furthermore, we find that molecular classifiers may not outperform classical prognostic indices but that they can be used in hybrid molecular-pathological classification schemes to improve prognostic separation.
The prognostic molecular classifier presented here is the first to be valid in 877 ER positive breast cancer samples and across three different microarray platforms. Larger multi-institutional studies will be needed to fully determine the added prognostic value of molecular classifiers when combined with standard prognostic factors.
Post-translational modification of histones resulting in chromatin remodelling plays a key role in the regulation of gene expression. Here we report characteristic patterns of expression of 12 members of 3 classes of chromatin modifier genes in 6 different cancer types: histone acetyltransferases (HATs): EP300, CREBBP, and PCAF; histone deacetylases (HDACs): HDAC1, HDAC2, HDAC4, HDAC5, HDAC7A, and SIRT1; and histone methyltransferases (HMTs): SUV39H1 and SUV39H2. Expression of each gene in 225 samples (135 primary tumours, 47 cancer cell lines, and 43 normal tissues) was analysed by QRT-PCR, normalized with 8 housekeeping genes, and given as a ratio by comparison with a universal reference RNA.
This involved a total of 13,000 PCR assays, allowing for rigorous analysis by fitting a linear regression model to the data. Mutation analysis of HDAC1, HDAC2, SUV39H1, and SUV39H2 revealed only two out of 181 cancer samples (both cell lines) with significant coding-sequence alterations. Supervised analysis and Independent Component Analysis showed that expression of many of these genes was able to discriminate tumour samples from their normal counterparts. Clustering based on the normalized expression ratios of the 12 genes also showed that most samples were grouped according to tissue type. A linear discriminant classifier with internal cross-validation revealed that with as few as 5 of the 12 genes (SIRT1, CREBBP, HDAC7A, HDAC5 and PCAF), most samples were correctly assigned.
The expression patterns of HATs, HDACs, and HMTs suggest these genes are important in neoplastic transformation and have characteristic patterns of expression depending on tissue of origin, with implications for potential clinical application.
Inferring molecular pathway activity is an important step towards reducing the complexity of genomic data, understanding the heterogeneity in clinical outcome, and obtaining molecular correlates of cancer imaging traits. Increasingly, approaches towards pathway activity inference combine molecular profiles (e.g. gene or protein expression) with independent and highly curated structural interaction data (e.g. protein interaction networks) or more generally with prior knowledge pathway databases. However, it is unclear how best to use the pathway knowledge information in the context of molecular profiles of any given study.
We present an algorithm called DART (Denoising Algorithm based on Relevance network Topology) which filters out noise before estimating pathway activity. Using simulated and real multidimensional cancer genomic data and by comparing DART to other algorithms which do not assess the relevance of the prior pathway information, we here demonstrate that substantial improvement in pathway activity predictions can be made if prior pathway information is denoised before predictions are made. We also show that genes encoding hubs in expression correlation networks represent more reliable markers of pathway activity. Using the Netpath resource of signalling pathways in the context of breast cancer gene expression data we further demonstrate that DART leads to more robust inferences about pathway activity correlations. Finally, we show that DART identifies a hypothesized association between oestrogen signalling and mammographic density in ER+ breast cancer.
Evaluating the consistency of prior information of pathway databases in molecular tumour profiles may substantially improve the subsequent inference of pathway activity in clinical tumour specimens. This de-noising strategy should be incorporated in approaches which attempt to infer pathway activity from prior pathway models.
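The denoising idea described for DART can be illustrated with a minimal Python sketch. It is not the published implementation: the function names, the correlation threshold `r_min`, and the degree-weighted activity score are all assumptions made for illustration. The sketch assumes the prior pathway model assigns each gene a sign (+1 if expected to be upregulated on pathway activation, -1 if downregulated), builds a relevance network from pairwise Pearson correlations, keeps only edges whose sign agrees with the prior, and prunes genes left without any consistent edge.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def denoise_pathway(expr, prior_sign, r_min=0.5):
    """Return {gene: degree} for genes kept in the pruned relevance network.

    expr: gene -> list of expression values across samples.
    prior_sign: gene -> +1 or -1 (assumed direction from the prior pathway model).
    r_min: hypothetical threshold on |correlation| for drawing an edge.
    """
    genes = [g for g in prior_sign if g in expr]
    degree = {g: 0 for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(expr[g], expr[h])
            # An edge is consistent if the observed correlation has the sign
            # predicted by the prior (same direction -> positive correlation).
            consistent = r * prior_sign[g] * prior_sign[h] > 0
            if abs(r) >= r_min and consistent:
                degree[g] += 1
                degree[h] += 1
    return {g: d for g, d in degree.items() if d > 0}


def pathway_activity(expr, prior_sign, degree):
    """Degree-weighted mean of signed expression per sample (illustrative score)."""
    total = sum(degree.values())
    if total == 0:
        return []
    n_samples = len(expr[next(iter(degree))])
    return [
        sum(degree[g] * prior_sign[g] * expr[g][s] for g in degree) / total
        for s in range(n_samples)
    ]


# Toy data: A, B and C behave as the prior predicts; D is noise.
toy_expr = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
    "D": [1.0, -1.0, 1.0, -1.0],
}
toy_prior = {"A": 1, "B": 1, "C": -1, "D": 1}
toy_degree = denoise_pathway(toy_expr, toy_prior)
toy_activity = pathway_activity(toy_expr, toy_prior, toy_degree)
print(sorted(toy_degree), toy_activity)
```

Weighting by degree reflects the observation that hub genes in the expression correlation network appear to be more reliable markers of pathway activity; an unweighted mean over the retained genes would be an equally simple alternative.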
Several gene expression signatures have been proposed and demonstrated to be predictive of outcome in breast cancer. In the present article we address the following issues: Do these signatures perform similarly? Are there (common) molecular processes reported by these signatures? Can better prognostic predictors be constructed based on these identified molecular processes?
We performed a comprehensive analysis of the performance of nine gene expression signatures on seven different breast cancer datasets. To better characterize the functional processes associated with these signatures, we enlarged each signature by including all probes with a significant correlation to at least one of the genes in the original signature. The enrichment of functional groups was assessed using four ontology databases.
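The signature enlargement step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually used: significance is assessed here with a two-sided test based on the Fisher z-transformation of the Pearson correlation, the cutoff `alpha` is hypothetical, no multiple-testing correction is applied, and all function names are invented for the example.

```python
from math import atanh, erfc, sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def correlation_p_value(r, n):
    """Approximate two-sided p-value for correlation r over n > 3 samples,
    using the Fisher z-transformation and a normal approximation."""
    if abs(r) >= 1.0:
        return 0.0
    z = atanh(r) * sqrt(n - 3)
    return erfc(abs(z) / sqrt(2))


def enlarge_signature(expr, signature, alpha=0.01):
    """Add every probe significantly correlated (positively or negatively)
    with at least one gene of the original signature.

    expr: probe -> list of expression values across samples.
    signature: iterable of probe names forming the original signature.
    alpha: hypothetical significance cutoff.
    """
    original = [g for g in signature if g in expr]
    enlarged = set(signature)
    for probe, values in expr.items():
        if probe in enlarged:
            continue
        for gene in original:
            r = pearson(values, expr[gene])
            if correlation_p_value(r, len(values)) < alpha:
                enlarged.add(probe)
                break  # one significant partner is enough
    return enlarged


# Toy data: one signature gene, one correlated, one anticorrelated, one noise probe.
toy_expr = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "p_corr": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.0, 8.1],
    "p_anti": [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.5],
    "p_noise": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
print(sorted(enlarge_signature(toy_expr, {"g1"})))
```

In a genome-wide setting the number of probe-gene pairs is large, so in practice the cutoff would need to account for multiple testing.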
The classification performance of the nine gene expression signatures is very similar in terms of assigning a sample to either a poor outcome group or a good outcome group. Nevertheless the concordance in classification at the sample level is low, with only 50% of the breast cancer samples classified in the same outcome group by all classifiers. The predictive accuracy decreases with the number of poor outcome assignments given to a sample. The best classification performance was obtained for the group of patients with only good outcome assignments. Enrichment analysis of the enlarged signatures revealed 11 functional modules with prognostic ability. The combination of the RNA-splicing and immune modules resulted in a classifier with high prognostic performance on an independent validation set.
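The sample-level concordance reported above can be made concrete with a short Python sketch. The data layout (one 'good' or 'poor' call per signature for each sample) and the function names are assumptions for illustration; the toy calls are invented and do not reproduce the study data.

```python
def concordance(calls):
    """Fraction of samples assigned to the same outcome group by all signatures.

    calls: sample -> list of 'good'/'poor' labels, one per signature.
    """
    unanimous = sum(1 for labels in calls.values() if len(set(labels)) == 1)
    return unanimous / len(calls)


def poor_vote_counts(calls):
    """Number of 'poor' outcome assignments received by each sample."""
    return {sample: labels.count("poor") for sample, labels in calls.items()}


# Toy example with four samples and three signatures.
toy_calls = {
    "s1": ["good", "good", "good"],
    "s2": ["poor", "poor", "poor"],
    "s3": ["good", "poor", "good"],
    "s4": ["poor", "good", "good"],
}
print(concordance(toy_calls), poor_vote_counts(toy_calls))
```

Grouping samples by their number of poor outcome assignments is one way to examine how predictive accuracy varies with the level of agreement among signatures.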
The study revealed that the nine signatures perform similarly but exhibit a large degree of discordance in prognostic group assignment. Functional analyses indicate that proliferation is a common cellular process, but that other functional categories are also enriched and show independent prognostic ability. We provide new evidence of the potentially promising prognostic impact of immunity and RNA-splicing processes in breast cancer.