Exon array data are analyzed in two separate but parallel tracks. One examines exon-level data, allowing alternative exon use to be determined. The other uses the summarized exon data for gene-level or differential expression analysis. This paper focuses on the pre- and post-analysis filtration steps of the exon-level analysis. To obtain meaningful splicing information and reduce the effect of multiple testing, a number of different filtering steps can be applied, and several different methodologies have been suggested.13,18,25 In this analysis, we compared some of the suggested pre- and post-analysis filtration methods. One limitation of this study is the small data set used in the analysis. However, the small size of this data set greatly facilitated the initial understanding and interpretation of the filtration results. We feel that even with this stated limitation, the results are of great interest and will be informative for researchers embarking on exon array analysis.
One of the most important pre-analysis filters is the removal of probe sets with the potential for cross-hybridization. Xing et al.15 showed that this was a major cause of false predictions of differential alternative splicing. Fortunately, exon array probe sets are classified based on annotational confidence. The core meta-probe set targets exons of well-annotated RefSeq genes with a hybridization target designation of unique. Thus, probes in a probe set match only one sequence perfectly in the putatively transcribed array design, allowing us to use probe sets that show no cross-hybridization.26 This results in 228,871 probe sets, which generate data for 17,881 TCs. Probe set behavior is also impacted by the presence of a single nucleotide polymorphism (SNP) within the probe set sequence, which can cause a false-positive call.27 Partek Genomics Suite provides a list of probe sets containing known SNPs, allowing them to be filtered out. For this particular data set, the results were not impacted by this filtration, although it did reduce the data set to 222,207 probe sets, which grouped into 17,648 TCs. The number of probe sets in a TC can also impact the interpretation of alternative splicing events. We found that when fewer than five probe sets were present in a TC, it was difficult to interpret the results. For this analysis, we used all TCs that had ≥5 probe sets.
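The SNP and TC-size filters described above can be sketched as follows. This is only an illustrative outline, not the Partek Genomics Suite implementation: the function name, the input layout, and the toy probe set IDs are hypothetical, and it assumes the ≥5 threshold is applied to the probe sets remaining after SNP removal, which the text does not state explicitly.

```python
from collections import defaultdict

MIN_PROBE_SETS_PER_TC = 5  # threshold stated in the text


def filter_probe_sets(probe_to_tc, snp_probe_sets, min_per_tc=MIN_PROBE_SETS_PER_TC):
    """Drop SNP-containing probe sets, then drop TCs left with too few probe sets.

    probe_to_tc: dict mapping probe set ID -> transcript cluster (TC) ID
    snp_probe_sets: set of probe set IDs known to contain a SNP
    Returns a dict mapping TC ID -> sorted list of retained probe set IDs.
    """
    by_tc = defaultdict(list)
    for probe_set, tc in probe_to_tc.items():
        if probe_set not in snp_probe_sets:
            by_tc[tc].append(probe_set)
    # Assumption: TC size is counted after SNP-containing probe sets are removed.
    return {tc: sorted(ps) for tc, ps in by_tc.items() if len(ps) >= min_per_tc}


# Toy input with made-up IDs: TC "A" starts with 6 probe sets, TC "B" with 5.
probe_to_tc = {f"A{i}": "A" for i in range(6)}
probe_to_tc.update({f"B{i}": "B" for i in range(5)})
kept = filter_probe_sets(probe_to_tc, snp_probe_sets={"A3", "B0"})
print(kept)
```

In this toy case TC "A" keeps five probe sets and is retained, whereas TC "B" falls to four and is dropped.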
In all exon analysis workflows, a mandatory step is the removal of any probe sets that are not expressed in at least one group, which prevents a lack of expression from being mistaken for alternative splicing. This can be done by filtering on DABG P values or by setting a log2 signal intensity cutoff value. Different stringencies will determine the number of probe sets passing the filtration condition. The decision as to which methodology to use depends on the number of targets one wants in the final list, which in turn relates to the aim of the analysis. For example, discovery of new splicing events would not be well served by using the KAS filter. It is clear that the pre-analysis filtration step alone does not reduce the data set to a manageable size for visual examination of alternative splicing. However, as pointed out in several publications, the P value threshold for calling an exon alternatively spliced is not obvious.18 Again, it depends on the number of targets one wants to see in the final list. Multiple test correction was applied to the TC P values generated by the ANOVA using the conservative Bonferroni correction. Other methodologies, such as Dunn-Sidak, false discovery rate, and q value, can also be used. These corrections do not change the order of the P values and, once again, only impact the number of TCs in the final list.
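The two steps in this paragraph, an expression filter followed by multiple test correction, can be sketched as below. The DABG cutoff (0.05) and the fraction of samples required per group (0.5) are hypothetical placeholders, not the stringencies used in this study, and the function names are illustrative. The Bonferroni and Dunn-Sidak formulas are the standard ones.

```python
def expressed_in_any_group(dabg_p_by_group, alpha=0.05, min_fraction=0.5):
    """Return True if the probe set is detected in at least one group.

    dabg_p_by_group: dict mapping group name -> list of per-sample DABG P values
    alpha, min_fraction: hypothetical stringency settings (assumptions)
    """
    for p_values in dabg_p_by_group.values():
        if p_values:
            detected = sum(p < alpha for p in p_values)
            if detected / len(p_values) >= min_fraction:
                return True
    return False


def bonferroni(p_values):
    """Bonferroni-adjusted P values: p * m, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def dunn_sidak(p_values):
    """Dunn-Sidak-adjusted P values: 1 - (1 - p)^m."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


def rank_order(values):
    """Indices of values sorted from smallest to largest."""
    return sorted(range(len(values)), key=values.__getitem__)


raw = [0.0004, 0.03, 0.002, 0.2]  # made-up TC-level P values
print(rank_order(raw), rank_order(bonferroni(raw)), rank_order(dunn_sidak(raw)))
```

Both adjustments are monotone in the raw P value, so the ranking of TCs is unchanged; only the number of TCs falling below a chosen cutoff differs.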
In an attempt to increase the percentage of alternative splicing events that were considered likely, three post-analysis filters were implemented. These achieved significant data reduction and did impact the number of alternative splicing events considered true-positives by visual inspection (% Y). Comparison of these in relation to the two mandatory pre-analysis, low-stringency filtration steps (DABG or signal intensity) showed FC > EE > KAS. The binomial distribution test showed that Level 1 + FC had a higher Y detection rate compared with all other filtration steps (Supplemental file 2). Other suggestions for second-level filtration include removing TCs with high (>5) or low (<0.2) exon/gene-intensity ratios.18 For this data set, neither of these had any impact in reducing the TC numbers. These ratios are affected by cross-hybridization and nonlinear responses of probe sets, respectively, and probe sets prone to the former have already been removed from the data set.
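The exon/gene-intensity ratio filter mentioned above can be sketched as follows. The thresholds (>5 and <0.2) come from the text; everything else is an assumption made for illustration: the inputs are taken to be log2 signals, the ratio is computed on the linear scale, and a TC is removed if any of its probe sets falls outside the range. The function names are hypothetical.

```python
def exon_gene_ratio(exon_log2, gene_log2):
    """Linear-scale exon/gene intensity ratio computed from log2 signals."""
    return 2.0 ** (exon_log2 - gene_log2)


def passes_ratio_filter(exon_log2, gene_log2, low=0.2, high=5.0):
    """True if the exon/gene ratio lies within [low, high]."""
    return low <= exon_gene_ratio(exon_log2, gene_log2) <= high


def tc_passes(exon_log2_signals, gene_log2, low=0.2, high=5.0):
    """Assumption: keep a TC only if every probe set passes the ratio filter."""
    return all(passes_ratio_filter(e, gene_log2, low, high) for e in exon_log2_signals)


# Made-up log2 signals for one TC with a gene-level signal of 8.0.
print(tc_passes([7.5, 8.0, 9.0], gene_log2=8.0))   # ratios stay within 0.2 to 5
print(tc_passes([7.5, 8.0, 11.0], gene_log2=8.0))  # last ratio is 8, above 5
```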
Several classes of false-positives are picked up as significant by the statistical analysis, and manual inspection of the data in a genomic context is required for their detection. For this analysis, we chose to use the UCSC RefSeq database for our annotation.23 Inclusion in this database requires significant verification of isoforms, and it is considered a highly conservative database. Other databases are available and, if used, would be expected to change the annotations of our experimental data (Supplemental file 1). The impact of using a different database for annotation is illustrated in Supplemental file 3.
Additional file 3 - Figure 1 annotated to the UCSC Known Genes dataset.
Visual inspection is impacted by the database chosen for transcript annotation. This figure allows comparisons to be made between annotations using the highly conservative UCSC RefSeq database and the UCSC Known Genes database. All other legend annotations are the same as for Figure 1 in the manuscript.
Bemmo et al.6 noted that probe sets at the 3′ and 5′ ends of TCs showed response properties that were different from the rest of the transcript. They attributed this to increased GC content in 5′-end probe sets as a result of proximity to unmethylated promoters and CpG islands. They postulated that the 3′ effects are likely caused by fewer random-primed, first-strand synthesis events, resulting in lower signal intensities. Our analysis shows that edge effects are present in >50% of TCs having an alternative splicing event. They account for approximately 75% of the false-positive calls, and 5′ edge effects clearly outnumber the 3′ effects. Probe-level intensity is known to be significantly dependent on the GC content of the sequence, so Partek Genomics Suite models these effects and attempts to remove them before background correction. Comparison of results with and without the probe sequence adjustments showed no differences in the numbers of TCs identified as significant for the first level of data filtration. It also did not impact the number of edge effects identified, suggesting that it may be best simply to remove probe sets known to be associated with high GC content.
Other false-positive results can be attributed to poorly hybridizing or nonresponsive probe sets. However, this scenario could also be interpreted as the exon not being included in both groups, which is not a false-positive event. Currently, it is not possible to differentiate between the two. Many of the probe sets causing these types of artifacts are difficult to correct using the filtration methods suggested in analytical workflows. As more exon array data are accumulated and deposited in public repositories, it will be possible to build lists of poorly behaving probe sets across different tissues, cell types, etc., facilitating their removal prior to analysis. This concept can be extended to building lists of intronic, intergenic, and exonic probe sets for systematic analyses.
This would make it possible to explore novel regions of transcription and overcome some of the issues surrounding technically generated false-positives. This has been implemented partly in the X:MAP software, a BioConductor/R package,28 in which probe sets can be mapped to exons or introns. By comparing different systematic analyses, one would be able to build a more complete picture of the true alternative splicing events and reduce the dependence on visual inspection. Gathering these probe set lists is a rather daunting task, as genomic annotation is far from complete and ever-changing.29,30 However, if exon array data are going to contribute to this knowledge, this is a necessary step. It will aid the discovery of new alternative splice events and of transcription events outside of known genes.
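The edge effects discussed above could be flagged automatically before visual inspection. The sketch below is only one possible convention and is not the procedure used in this study: it assumes an "edge" probe set is the single outermost probe set of a TC when probe sets are ranked by genomic start coordinate, and it uses strand to decide which end is 5′. The function name and arguments are hypothetical.

```python
def classify_edge(position_index, n_probe_sets, strand):
    """Label a probe set as '5prime', '3prime', or 'internal'.

    position_index: 0-based rank of the probe set by genomic start coordinate
    n_probe_sets: number of probe sets in the transcript cluster (TC)
    strand: '+' or '-'
    Assumption: only the outermost probe set at each end counts as an edge.
    """
    if n_probe_sets < 2 or not 0 <= position_index < n_probe_sets:
        raise ValueError("need at least two probe sets and a valid index")
    first = position_index == 0
    last = position_index == n_probe_sets - 1
    if not (first or last):
        return "internal"
    if strand == "+":
        return "5prime" if first else "3prime"
    if strand == "-":
        # On the minus strand the lowest genomic coordinate is the 3' end.
        return "3prime" if first else "5prime"
    raise ValueError("strand must be '+' or '-'")


# Made-up example: a 6-probe-set TC on the minus strand.
print([classify_edge(i, 6, "-") for i in range(6)])
```

Tallying these labels over the probe sets called significant would give the kind of 5′ versus 3′ comparison reported in the text, under the stated definition of an edge.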
It is evident that interpretation of exon-level results currently requires that the data be overlaid onto a genomic profile. Using the highly conservative RefSeq annotation, of the 866 TCs identified as being alternatively spliced (log2 signal >3 filter), only 281/791 (35.5%; excluding all false-positive or N calls) were known to show alternative splicing. To balance this, we also provide alternative splicing information from AceView,31 which reconstructs transcripts by merging cDNA and genome sequences, allowing discovery of thousands of transcript variants. Using this database, 740/791 (93.6%) had been identified as having splice variants. The truth probably lies somewhere in between.
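As a quick arithmetic check of the two proportions quoted above, the following snippet recomputes them from the counts given in the text. The helper name is hypothetical.

```python
def percent(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


refseq_known = percent(281, 791)   # TCs with known alternative splicing in RefSeq
aceview_known = percent(740, 791)  # TCs with splice variants in AceView
print(refseq_known, aceview_known)
```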
This paper illustrates the power of exon arrays in identifying different and possibly novel alternative splicing events (such as exon skipping and intron retention) and differential UTR use (e.g., transcription initiation and alternative polyadenylation). The current workflow-filtering steps are a tradeoff between the number of TCs that can be visually assessed, to identify “false-positives” arising from technical probe set behaviors or from multiple testing, and the number of true positives retained. With the perceived importance of alternative splicing at an all-time high, robust and accurate tools capable of fully interrogating the complete transcriptome are critical to a clearer understanding of the regulation of biological processes. There is a large degree of imprecision when dealing with the human transcriptome as it pertains to alternative splicing, emphasizing the need for more research, which is now readily facilitated by exon arrays.