The widespread occurrence of alternative splicing and tissue-specific transcript isoforms adds another layer of complexity to our understanding of human variation. Recent publications have estimated that as many as 95% of human multi-exon genes are influenced by alternative splicing [26], a significant increase from past estimates of 74% [28]. In addition to custom-designed isoform-sensitive microarrays [29], commercial technologies such as the Affymetrix GeneChip Human Exon 1.0 Array have made research in these areas more accessible. The most recent addition to the Affymetrix product line, the GeneChip Human Gene 1.0 Array, is a whole-transcript microarray designed to target the entire length of each gene with the purpose of optimizing gene expression profiling. However, the probe placement strategy of the Gene Array may also make it suitable for detection of alternative splicing. The goal of our study was to compare the performance of the Gene Array with the Exon Array and subsequently determine whether the Gene Array was capable of detecting differentially expressed isoforms. Such knowledge would provide researchers with two benefits. First, the ability of older 3'-targeted gene expression arrays to detect alternative splicing and isoform variants was limited, mainly due to probe placement. By contrast, the Gene Array interrogates the entire length of the gene and has been shown to be an excellent platform for measuring gene expression [12]. The whole-transcript design offers the potential to examine the expression of individual exons and consequently, as already demonstrated with the well-studied Exon Array, to profile isoform variation. Second, because the Gene Array is more economical than the Exon Array, the study of alternative splicing can become more accessible by giving researchers an additional application to go along with gene expression studies.
We modelled our analysis approach on our previously successful comparative analysis of the Exon Array with other 3' arrays, outlined in Bemmo et al. [11]. Our study was made possible by the availability of datasets from both platforms assayed on the commercially available MAQC RNA samples, a high-quality biological sample set derived from human brain and human universal reference RNA. The MAQC samples are valuable for benchmarking and well suited to detecting transcript isoform variation, owing to the high degree of alternative splicing that occurs in brain tissue.
We first compared the platforms at the gene expression level and concluded that the Gene Array results are highly concordant with the Exon Array results, reaffirming the utility of the Gene Array as a gene expression profiling tool. Noting that the majority of discordant genes between the two platforms were weakly expressed, we used a detection threshold cutoff to filter and correct for these genes. Interestingly, the correlation gradually decreased as the threshold increased past 30, our selected optimal cutoff. With our improved correlation results, we were encouraged to consider comparisons at the exon level.
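The filtering step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the field names, the toy values, and the choice to apply the cutoff to the lower of the two platform signals are all assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def concordance_at_threshold(genes, threshold):
    """Keep genes whose signal reaches the cutoff on both platforms
    (an assumption), then correlate their log2 fold changes."""
    kept = [
        g for g in genes
        if min(g["gene_array_signal"], g["exon_array_signal"]) >= threshold
    ]
    r = pearson(
        [g["gene_array_log2fc"] for g in kept],
        [g["exon_array_log2fc"] for g in kept],
    )
    return len(kept), r

# Invented toy data: four well-expressed genes that agree between
# platforms and two weakly expressed genes that do not.
TOY_GENES = [
    {"gene_array_signal": 120, "exon_array_signal": 140, "gene_array_log2fc": 1.0, "exon_array_log2fc": 1.1},
    {"gene_array_signal": 450, "exon_array_signal": 400, "gene_array_log2fc": -2.0, "exon_array_log2fc": -1.8},
    {"gene_array_signal": 80, "exon_array_signal": 95, "gene_array_log2fc": 0.5, "exon_array_log2fc": 0.4},
    {"gene_array_signal": 300, "exon_array_signal": 310, "gene_array_log2fc": 3.0, "exon_array_log2fc": 2.9},
    {"gene_array_signal": 12, "exon_array_signal": 18, "gene_array_log2fc": 0.2, "exon_array_log2fc": -1.5},
    {"gene_array_signal": 25, "exon_array_signal": 9, "gene_array_log2fc": -0.3, "exon_array_log2fc": 1.2},
]

if __name__ == "__main__":
    for cutoff in (0, 30):
        n_kept, r = concordance_at_threshold(TOY_GENES, cutoff)
        print(f"cutoff={cutoff}: {n_kept} genes, r={r:.3f}")
```

Sweeping the cutoff over a range of values and plotting the resulting correlation is one way to locate the point beyond which further filtering no longer helps.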
Studying the Gene Array at the exon level posed a significant challenge, as no approach for summarizing probe hybridization intensities to reflect exon expression has, to our knowledge, been developed for the Gene Array. To overcome this, we modified the Gene Array's Probe Group File such that each probe grouping corresponds roughly to an exon rather than a gene. The modified Gene Array groups could then be subjected to the same PLIER summarization step as the Exon Array data. This resulted in full summarization of probe sets consisting of two or more probes, and simple normalization and background correction for single-probe groupings. This approach makes the Gene Array and Exon Array analyses directly comparable. We have made this modified file available for download in Additional File 5.
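The regrouping idea can be illustrated with a small sketch. It does not read or write the actual Probe Group File format; the probe records, identifiers, and the labels `summarize` and `normalize_only` are hypothetical and serve only to show how gene-level groups become exon-level groups and how group size decides the processing path.

```python
from collections import defaultdict

# Hypothetical probe annotations: (probe_id, gene_id, exon_id).
PROBES = [
    ("p1", "geneA", "geneA_exon1"),
    ("p2", "geneA", "geneA_exon1"),
    ("p3", "geneA", "geneA_exon2"),
    ("p4", "geneA", "geneA_exon3"),
    ("p5", "geneA", "geneA_exon3"),
    ("p6", "geneA", "geneA_exon3"),
]

def group_probes(probes, level):
    """Group probe IDs by gene or by exon, preserving input order."""
    key_index = {"gene": 1, "exon": 2}[level]
    groups = defaultdict(list)
    for record in probes:
        groups[record[key_index]].append(record[0])
    return dict(groups)

def plan_processing(exon_groups):
    """Groups of two or more probes can be summarized; single-probe
    groups only receive normalization and background correction."""
    return {
        exon: "summarize" if len(probe_ids) >= 2 else "normalize_only"
        for exon, probe_ids in exon_groups.items()
    }

if __name__ == "__main__":
    exon_groups = group_probes(PROBES, "exon")
    for exon, action in plan_processing(exon_groups).items():
        print(exon, exon_groups[exon], action)
```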
An interesting question raised in this study was whether the reduced probe coverage could sufficiently profile exon expression levels. We first observed that the majority of exons were targeted by one or two probes on the Gene Array, and that for these exons the calculated fold changes within our datasets were highly correlated with those on the Exon Array. A similar result was observed when the number of probes per probe set on the Exon Array was reduced and the data re-summarized. As the majority of the Gene Array probes were selected for consistency with the Exon Array as well as for matching uniquely to the human genome, we would expect the Gene Array to be optimized for effective probe hybridization. This suggests that, in general, exon expression levels on the Gene Array may be sufficiently estimated with fewer than four probes.
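The probe reduction experiment can be mimicked on toy data. In the sketch below, the median of per-probe log2 ratios stands in for PLIER summarization, which is a deliberate simplification and not the method used in the study; the intensities are invented.

```python
import math
import statistics
from itertools import combinations

def log2_fold_change(sample_a, sample_b):
    """Median of per-probe log2 ratios (a simple stand-in for PLIER)."""
    return statistics.median(
        math.log2(a / b) for a, b in zip(sample_a, sample_b)
    )

def subsampled_fold_changes(sample_a, sample_b, n_probes):
    """Fold change for every subset of n_probes probes in a probe set."""
    indices = range(len(sample_a))
    return [
        log2_fold_change(
            [sample_a[i] for i in subset],
            [sample_b[i] for i in subset],
        )
        for subset in combinations(indices, n_probes)
    ]

# Invented intensities for one four-probe probe set in two samples.
BRAIN = [410.0, 380.0, 450.0, 395.0]
REFERENCE = [100.0, 105.0, 98.0, 110.0]

if __name__ == "__main__":
    full = log2_fold_change(BRAIN, REFERENCE)
    print(f"all four probes: {full:.2f}")
    for fc in subsampled_fold_changes(BRAIN, REFERENCE, 2):
        print(f"two-probe subset: {fc:.2f}")
```

When the probes in a set respond consistently, the two-probe estimates stay close to the four-probe estimate, which is the behaviour the comparison above relies on.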
To shed light on the potential application of the Gene Array to detect transcript isoform variation, we used two previously described genes, ELAVL1, as examples for comparison. The Gene Array demonstrated high reproducibility in exon expression levels and detected the same splicing events as the Exon Array, as seen by their similar overall fold change patterns when plotted. Interestingly, in the case of ELAVL1, the gene level fold changes were not in agreement. Here, the long isoform was targeted by twice as many probe sets as the short isoform. Since half of these interrogated the extended 3' region of the isoform, the overall gene level expression summarization as calculated by PLIER is heavily influenced by the individual expression measurements of its probes. In this context, gene level fold changes are not meaningful in describing isoform events. Despite these differences, visualization of the exon level probe sets provided supporting evidence that ELAVL1 contains differentially expressed isoforms. This reiterates the importance of careful visualization when examining exon level expression for isoform variation, as previously noted by Bemmo et al. [11]. Such findings may be of significant biological value and warrant further research.
To better understand the splicing detection differences, we used splicing index (SI) analysis to compare the performance of the two platforms. Variations or inaccuracies in expression estimates can have a large impact on the SI, which makes direct comparison of SI values to determine inter-platform reproducibility a challenge. However, from our analysis we conclude that the isoform-level results, as determined using SI analysis, show a good degree of correspondence between platforms. This further suggests that on a whole-genome scale, the Gene Array may be a valuable tool for profiling alternative splicing.
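For readers unfamiliar with the splicing index, the sketch below uses a commonly cited definition: each exon signal is divided by its gene signal to give a normalized intensity, and the SI is the log2 ratio of the normalized intensities in the two samples. The exact variant and any filtering used in this study are not restated here, so the function names and toy values are assumptions for illustration.

```python
import math

def normalized_intensity(exon_signal, gene_signal):
    """Exon signal relative to the signal of the gene it belongs to."""
    return exon_signal / gene_signal

def splicing_index(exon_a, gene_a, exon_b, gene_b):
    """log2 ratio of gene-normalized exon intensities, sample A vs. B."""
    ni_a = normalized_intensity(exon_a, gene_a)
    ni_b = normalized_intensity(exon_b, gene_b)
    return math.log2(ni_a / ni_b)

if __name__ == "__main__":
    # Invented values: the gene is equally expressed in both samples,
    # but the exon is twice as abundant in sample A.
    print(splicing_index(exon_a=200.0, gene_a=400.0, exon_b=100.0, gene_b=400.0))
    # Exon and gene change together, so there is no evidence of splicing.
    print(splicing_index(exon_a=200.0, gene_a=400.0, exon_b=400.0, gene_b=800.0))
```

Because the gene signal appears in the denominator, a small error in a weak gene-level estimate can shift the SI substantially, which is why direct SI comparisons across platforms are difficult.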
We note two limitations of the Gene Array. Firstly, with the exception of the analysis on detecting known splicing events illustrated in Table , we considered only the probes that were matched with the Exon Array. This ensures that we are considering the same Exon Array probe set genomic boundaries. It should be noted that the probes unique to the Gene Array that were omitted may provide informative expression data. However, their inclusion would require determining whether they can be accurately mapped within an exon region defined by an external annotation source, which we did not do for the sake of this comparative analysis. In addition, in order to maintain high confidence results, speculative Exon Array probe sets not from the core design were excluded in this part of our analysis. Secondly, the Gene Array is less comprehensive than the Exon Array as it mainly targets only well-annotated genes. As a result, the Gene Array may be limited in its potential to identify novel isoforms in genes that have not been well-studied and/or annotated by database curators. As expected, from comparing the probe locations of the two platforms with the Known Alt Events track on the UCSC Genome Browser, the Exon Array targets a higher proportion of annotated events. Notably, by further comparing the results of the MAQC data analysis and the Alt Events database, both platforms demonstrate the ability to effectively detect these known splicing events, with comparable false positive rates.
Our approach for exon level summarization on the Gene Array relies on utilizing existing Affymetrix software (i.e. PowerTools) and analysis pipelines. As it simply involves the replacement of a standard annotation file, it is relatively easy for users to make their own modifications. Users are also free to supply their own parameter settings for the summarization step to suit their analytical needs.
In addition to microarrays, an emerging technology in genomics is next-generation sequencing. In particular, high-throughput RNA sequencing (RNA-Seq) enables the profiling of the entire transcriptome and produces both sequence and gene expression information. RNA-Seq holds a number of advantages over microarray solutions, including more precise expression measurements with fewer biases and the ability to discover novel transcripts and isoforms [30]. While the cost of this technology is rapidly falling, it is currently still considerably more expensive than microarrays. We expect microarrays to continue to be in use and look forward to RNA-Seq adding greater power to transcriptome analysis.
We also note that while this manuscript was under review, another method for differential splicing detection using the Gene Array was published [32]. This provides further support for the potential use of the Gene Array as a cost-effective platform for alternative splicing discovery.