The method is designed for combined analysis of datasets from two types of genome-wide arrays: DNA genomic microarrays and RNA expression microarrays. These arrays provide copy number and expression quantitative data, respectively. The analysis places both types of signals along the genome, taking into account the gene loci for the CN data and the GE data. The rationale of the method is to search for copy number alterations with a major influence in the expression levels of the genes encoded. As a distinctive element from other integrative approaches we do not consider only SNPs or genes individually. We take into account the gene loci following the strategy described in [11
], that is based on the application of the same smoothing and segmentation algorithm to CN and GE in order to establish comparable regions. Once we get the smoothed segments, we perform two independent analyses for each gene loci: a signal correlation analysis and an alteration frequency analysis. (The workflow described in Materials and Methods, presented in last figure, illustrates the procedure of the method including these two independent analyses).
Analysis of correlation between gene expression and copy number levels
The method matches the CN and GE segmented signals within each chromosomal region -i.e., the log2 ratio signals of the corresponding segments- and selects the gene loci that show a significant correlation. These loci can be considered candidate hotspots. In Figure we present the results of this analysis done for the GBM dataset, marking in purple the number of gene loci with Pearson Correlation Coefficient r ≥ 0.60 (that corresponds to a Bonferroni-adjusted p-value < 0.005). Such cutoff (r ≥ 0.60) includes around 55 % of the human gene loci, providing a good coverage with a significant p-value. Setting more stringent cutoffs reduces the coverage too much: r ≥ 0.70 includes only ~26 % of the gene loci; r ≥ 0.80 includes only ~6 %.
Figure 1 Density distribution of the correlation coefficients between GE and CN for the GBM dataset. Purple represents the number of gene loci that present significant correlation (r ≥ 0.60) between gene expression and copy number signals, counted considering (more ...)
The number of probes in the SNP arrays -used to calculate the segmented signals for CN- is large and uniform along the genome. However, in the expression arrays some genomic regions do not have enough allocated gene loci and the number of probes is sparse. This fact is a problem when a GE segment includes outliers (i.e. gene locus which have expression levels very different from the mean of their neighbours). To solve this problem, we look for statistically significant outliers within the GE segments -which were at least in 1/3 of the samples- and we recalculate the signal correlation between their unsegmented GE and the corresponding CN segments. In this way, we find a new set of gene loci with correlation r ≥ 0.60, which is added to the initial set of candidate hotspots identified. This step of the procedure is important to recover some gene loci with quite significant correlation (e.g. EGFR or SEC61G), which were missed in the first step due to the described problem.
Analysis of frequencies for the categorical states Up-Gain and Down-Loss
The method also proposes to find the genomic regions that present a significant GE and CN alteration in the same direction. To assess this, we included a second selective step based on stratification of the segmented data. The genomic regions are stratified in several categories: up-regulation (U), down-regulation (D) or no-change (N) for expression; and gain (G), loss (L) or no-change (N) for copy number. This approach allows a discretization of the genomic regions into 9 different categories as shown in Figure (inserted table): U-G, N-G, D-G, U-N, N-N, D-N, U-L, N-L, D-L. Figure also presents the empirical cumulative distributions for these 9 categories of the GBM samples per gene loci, counting the frequency of samples for all the gene loci in each category. As expected, the distributions show that the "no change" (i.e. N-N, neutral-neutral) is the most frequent state. The analysis of distributions also finds some regions that show a clear correspondence between GE and CN alterations: i.e. the scenario where GE up-regulation is observed co-located with a CN gain (U-G category) and the scenario where GE down-regulation is co-located with a CN loss (D-L category). Our interest focuses on these regions, since they are the ones altered in the same way in both types of data. The analysis of the empirical frequency distributions for the U-G and D-L categories allows identifying the frequency cutoffs that correspond to the 10% upper quantiles. These cutoffs were: 13 samples for U-G and 11 samples for D-L (out of 64 in GBM dataset). The set up of these thresholds identifies those genomic regions that are the most frequently assigned to such altered categories (U-G and D-L) in the studied dataset.
Figure 2 Observed frequency distributions of the 9 categorical states for the GBM dataset: U-G, N-G, D-G, U-N, N-N, D-N, U-L, N-L, D-L. Boxplots corresponding to the distributions of number of samples assigned to each category for all the gene loci. Insert: Contingency (more ...)
Genome-wide identification of hotspots: candidate key genomic regions
Our method identifies candidate key regions that show high correlation between CN and GE and that are frequently altered in the same direction, in both types of signals. The overlapping between the regions with the most significant correlation and the ones with the highest frequencies of simultaneous alteration (CN and GE) along the genome, will constitute hotspots where putative driver genes are likely to be encoded.
Figure presents the combined view of GE and CN alterations on the complete genome obtained for the GBM dataset. The graph shows the alteration frequency, either in CN or in GE independently, along all the genome (22 human chromosomes). The dark colors correspond to GE up-regulated regions (red) or down-regulated regions (green), and the light colors -placed on top- correspond to CN gains (pale red) and losses (pale green). These results show that the method finds the alterations previously described for CN in GBM cancer [12
]. In fact, the most frequent alterations in glioblastoma are the gain of chromosome 7 and the loss of chromosome 10. Our analysis finds such alterations in CN, and also finds their correlation with GE up-regulation for chromosome 7 and with GE down-regulation for chromosome 10. Figure presents a detailed view of the alterations that occur in chromosome 7. It includes a profile of the regions with significant correlation (purple dots along the chromosome) and a profile of the frequency of U-G
regions (pale red). They cover nearly the complete chromosome. A figure with the representation for all the 22 human chromosomes for the GBM samples is included as Additional File 3
Figure 3 Combined view of GE and CN alterations obtained for the GBM dataset. The regions for all the human chromosomes (chr) that are altered either in CN or in GE are presented along the whole genome keeping the proportional size of the chromosomes. The graph (more ...)
Figure 4 Detailed view of chromosome 7 showing the CN and GE correlation and the U-G category frequency for the GBM dataset. The genomic regions for chromosome 7 are represented in X-axis. Blue and purple dots show the correlation coefficients between CN and GE (more ...)
Key genomic regions found for the 64 paired GBM cancer samples
As shown in Figure , the method presented in this work allows the identification of relevant altered genomic regions suffering significant changes in most of the GBM samples. The results also show that many of the detected CN alterations and GE changes overlap along the genome. These regions can be proposed as relevant "hotspots". In Table and Table we present a detailed description of the common genomic regions found in GBM; indicating the correlation and frequency of the U-G
regions (Table , which includes 19 regions), and the D-L
regions (Table , which includes 24 regions). The tables include the correlation between GE and CN for each region (as average correlation for all the gene loci); and the percentage of samples -frequency- in each region, counting only the samples where simultaneous GE and CN alterations occur: either up gene expression and gain in copy number (U-G
) or down gene expression and loss in copy number (D-L
). The regions detected are in the chromosomes that suffer the most significant changes in GBM samples: U-G
, chr 7 and chr 20; D-L
chr 10, chr 13, chr 14 and chr 22. The tables also include the genes enclosed in these regions. The most remarkable changes correspond to a large part of chr 7 (U-G
) and to a large part of chr 10 (D-L
). Two important genes are precisely located in these chromosomes: EGFR (in chr 7) usually up-regulated and PTEN (in chr 10) usually down-regulated [12
]. PTEN is not found in our analysis, but it has been reported an absence of PTEN alterations in more than half of de novo
glioblastomas and more than 90 % of glioblastomas developed from a pre-existing lower grade gliomas [14
], which has been linked to the presence of additional tumor suppressor genes on chr 10, such as LGI1 [15
] and MXI1 [16
]. We found these two genes in regions 8 and 10 of the D-L
list (Table ), and we observed a very variable profile of PTEN in the GBM samples. These facts may indicate that PTEN is not the best genomic marker for this altered region. By contrast, we found RB1 tumor suppressor in region 13 of the D-L
list; and this gene -included in chr 13- is a clear candidate to drive the alteration of tumor cells. With respect to EGFR, it has the highest U-G
frequency observed (60.9%, Table ) and therefore the method reveals this gene locus as the most common GE up-regulated and CN gained in the GBM samples. The alteration of EGFR can be associated with other genes that regulate its function, also found by the method. This is the case of VOPP1 and RAB11FIP2. VOPP1 is also known as ECOP (EGFR-coamplified and overexpressed protein) or GASP (Glioblastoma-amplified secreted protein), and is found in region 12 of the U-G
list (Table ). RAB11FIP2 is a suppressor of the endocytic internalization of EGFR and it is found in region 10 of the D-L
list (Table ) [17
]. The presence of these genes in the hotspots found for GBM supports the value of the method described. There are many other interesting genes in the identified altered genomic regions, that can be useful for further investigations on the disease studied.
Significant U-G regions with the associated genes.
Significant D-L regions with the associated genes.
Complete information corresponding to the genes found in the significant U-G
regions and D-L
regions is included respectively as supplementary material in Additional-file-1
(for the data corresponding to Table ) and Additional-file-2
(for the data corresponding to Table ).