We have developed a simple method for distinguishing alternative splicing from changes in gene expression, which could be applied to many types of microarray data. This method also provides a way of clustering genes and samples according to their alternative splicing patterns, producing a regulatory picture that is distinct from hierarchical clustering of the same microarray data according to gene expression. To test these methods, we have applied them to a small experimental dataset generated using a microarray platform and amplification methods shown previously to be sensitive to tissue-specific changes in alternative splicing (14). However, we have not analyzed the biological questions or significance of this experimental dataset, which, for reasons of space, will be presented elsewhere.
Alternative splicing represents a qualitative change in a gene's expression—production of a different form of the gene product, bearing a different combination of functional elements in its sequence. Thus, it is useful to develop high-throughput methods for detecting such qualitative changes on a genome-wide basis. Fundamentally, alternative splicing provides a very different source of information for tracking regulatory events, monitoring cellular differentiation and classifying different tissues, than has been considered by traditional gene expression analysis. For example, a gene product may be switched from one splice form to another splice form with very different functional properties, while leaving the total amount of gene product unchanged. Although current microarray analysis would be unlikely to detect such a regulatory event, our qualitative analysis could make such changes clear, on a genome-wide scale.
Alternative splicing may also cause confusion or misinterpretation in existing gene expression analyses. For example, if a gene has two major splice forms that are regulated quite differently, the gene's expression can only be represented accurately by two distinct numbers (one value for the expression level of each form). Seeking to extract a single number as the ‘expression level’ of the gene may actually be inappropriate in this case. At a minimum, this measurement is likely to be confounded by apparently inconsistent behaviors of different probes for the gene. Indeed, gene expression analyses such as the dChip software (18) are likely to systematically exclude as unreliable any microarray probes that reveal strong patterns of alternative splicing, since these will diverge from the consistent expression profile across samples observed in the majority of probes for the gene. Thus, one benefit of our qualitative analysis approach is recognition that some probes are not simply unreliable, but contain additional information about the gene's regulation that was not captured by its total expression level.
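The point that a single expression value can hide a splice-form switch can be illustrated with a small idealized sketch. All names and numbers below (the two tissues, the two forms, the three probes and the isoform amounts) are hypothetical and chosen only for illustration; probe signal is assumed to be simply the sum of the amounts of the splice forms a probe matches, ignoring probe sensitivity and noise.

```python
# Hypothetical isoform amounts (arbitrary units) in two tissues.
# The total is 100 in both tissues, but the dominant form switches.
isoform_amounts = {
    "tissue_1": {"form_a": 90, "form_b": 10},
    "tissue_2": {"form_a": 10, "form_b": 90},
}

# Hypothetical probes and the splice forms each one matches.
probe_targets = {
    "shared_exon": ("form_a", "form_b"),
    "exon_only_in_a": ("form_a",),
    "junction_only_in_b": ("form_b",),
}


def total_expression(tissue):
    """Single 'expression level': the sum of all splice forms."""
    return sum(isoform_amounts[tissue].values())


def probe_signal(tissue, probe):
    """Idealized probe signal: sum of the amounts of the forms the probe matches."""
    return sum(isoform_amounts[tissue][form] for form in probe_targets[probe])


for tissue in isoform_amounts:
    signals = {probe: probe_signal(tissue, probe) for probe in probe_targets}
    print(tissue, "total:", total_expression(tissue), "probes:", signals)
```

In this sketch the total and the shared-exon probe are identical in the two tissues, while the two form-specific probes move in opposite directions, which is the kind of probe behavior that a single-number summary would treat as inconsistent.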
To avoid such problems, it is essential to distinguish alternative splicing from gene expression within microarray data analysis. Our method provides a simple and general way to do this, but with a number of caveats. First, our approach should be considered a discovery method for detecting possible alternative splicing, rather than a validation method that ensures a given result is definitely due to alternative splicing. A number of other effects might give rise to systematic variation within the probes for a single gene. For example, if a subset of the probes cross-hybridize to transcripts from a paralogous gene, changes in expression of that paralogous gene would produce the kind of systematic variation (anti-correlation of tissue log-ratios) that our method detects. The nature of alternative splicing probe design makes it difficult to exclude such cross-hybridization entirely. Since alternative splice detection requires probes that match specific exons and splice junctions, probe selection is tightly constrained, and it is often not possible to completely avoid sequences that have a match somewhere else in the human genomic sequence. Weighing the evidence for true alternative splicing versus cross-hybridization to other genes requires detailed consideration of the specific gene structure and likely splice forms for the gene in question, its paralogs and other factors, which our method does not take into account. Second, our method is a qualitative analysis (identification of the presence or absence of changes in splicing), which we consider to be useful in its own right, rather than a quantitative method. Many additional kinds of statistical analysis would be required for a quantitative method. Moreover, alternative splicing arrays have required different amplification protocols than those ordinarily used for expression arrays, because of the necessity of coverage across the full length of the gene (including the 5′ end) (14). There are many questions about the quantitative accuracy and reproducibility of the amplification protocol, which need to be addressed more fully as a prerequisite to reliable quantitative analysis. For example, the amplification method used in this study [and also in previous work (16)] yields substantial quantitative differences versus measurements made from total RNA (25), although it does provide greatly improved coverage over the full length of transcripts (14).
Despite these technical challenges, there is now broadly reproducible evidence that alternative splicing can be detected using microarrays. Hu et al. used standard Affymetrix array designs to search for evidence of alternative splicing in 1600 rat genes, by performing hybridizations with 10 normal tissue samples. A total of 268 genes (17%) showed signs of alternative splicing, and RT–PCR validation indicated that about half of these represented genuine alternative splice events. Other studies have focused on individual genes with known alternative splicing patterns, to demonstrate that microarray technology can detect these events. Clark et al. used a cDNA spotted array to demonstrate successful detection of experimentally induced intron-retention in a number of Saccharomyces cerevisiae genes containing introns. Yeakley et al. described detection of alternative splicing in six human genes using a fiber-optic microarray platform. Wang et al. reported quantification of distinct splice forms of two human genes (including CD44), using the well-known Affymetrix microarray platform. Castle et al. reported studies of two genes (including RB1), examining in great detail the experimental factors determining probe response as a function of distance from an exon junction, position within the gene and so on. For example, they have analyzed in detail the effect of probe length on accurate detection of both exons and splice junctions. Kampa et al. used Affymetrix microarrays to look for novel transcripts, and reported that most human genes show evidence of more than one distinct isoform (26). By far, the largest microarray study was performed recently by Johnson et al. using exon–exon junction probes to detect exon skipping. This study included probes for over 10 000 human genes and examined 52 distinct tissue samples. For genes in which alternative splice forms had not been previously reported by expressed sequence tag (EST) studies, about half were reported to show microarray evidence of exon-skipping. Validation by RT–PCR suggested that 45% of these positive candidates were genuinely alternatively spliced, indicating new discovery of alternative splicing in a large number of genes (estimated 798 in this study alone). Our work has used a similar microarray platform (Agilent microarrays), but has examined a variety of different types of alternative splicing including exon skipping, alternative 3′ and alternative 5′ splice site usage, alternative initiation and alternative termination.
In comparison with these extensive experimental studies, relatively little has been published on bioinformatics methodology for general detection and analysis of alternative splicing from microarray data. Wang et al. described a detailed method for quantitating distinct splice forms of a gene, and tested it both on a mixture of two isoforms and on a mixture of three isoforms. This method was designed for quantification of well-known isoforms, as the authors emphasized: ‘This algorithm is intended for splice variant typing, not discovery’. Johnson et al. apparently analyzed their microarray data by fitting the probe intensities to a model of probe sensitivity, based on a single value representing total expression of the gene, and then identifying probes with strong ‘residuals’, indicating a poor fit to this model (16). Both the Wang et al. and Johnson et al. methods are based on constructing a sophisticated model of probe sensitivity and comparing this model to the actual probe behaviors. Our approach is somewhat different. It compares the behavior of probes for a gene in one tissue versus their behavior in other tissues (instead of to a model), and adopts a simpler method of detection designed for discovery of novel alternative splicing. The use of normalization versus tissue-averaged ‘pool’ intensities largely removes probe sensitivity and total gene expression from consideration, enabling our analysis to focus on distinguishing three qualitatively different cases: uncorrelated, random scatter (no evidence of alternative splicing); anti-correlation (the two samples differ in splicing); and correlation (the two samples have the same splicing, compared with other tissues that have different splicing). Computation of this correlation factor for all possible pairs of replicate arrays allows direct assessment of its statistical significance. This simple approach works for ab initio discovery of a wide variety of types of alternative splicing (not just exon skipping), and could be applied to many kinds of microarray designs and data.
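The comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the exact procedure used in this work: the pool is taken to be the per-probe mean over tissues, the correlation factor is taken to be a Pearson correlation of per-probe log2 ratios, and the classification threshold of 0.7, the probe sensitivities and the tissue profiles are all hypothetical. The assessment of statistical significance across replicate array pairs is not shown.

```python
import math


def log_ratios(sample, pool):
    """Per-probe log2 ratio of one sample's intensity to the pool intensity."""
    return [math.log2(s / p) for s, p in zip(sample, pool)]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def classify(r, threshold=0.7):
    """Map a correlation to one of the three qualitative cases.

    The threshold is an illustrative assumption, not a value from the text.
    """
    if r >= threshold:
        return "correlated"
    if r <= -threshold:
        return "anti-correlated"
    return "uncorrelated"


# Hypothetical gene with six probes; probes 2 and 3 target an exon
# that is mostly skipped in tissue C.
sensitivity = [100.0, 200.0, 50.0, 400.0, 150.0, 300.0]
relative = {
    "A": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "B": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # same splicing as A, twice the expression
    "C": [1.0, 1.0, 0.2, 0.2, 1.0, 1.0],  # exon mostly skipped
}
intensity = {
    tissue: [k * f for k, f in zip(sensitivity, factors)]
    for tissue, factors in relative.items()
}

# Tissue-averaged pool intensity for each probe (assumed definition of the pool).
pool = [
    sum(intensity[t][i] for t in intensity) / len(intensity)
    for i in range(len(sensitivity))
]
ratios = {t: log_ratios(intensity[t], pool) for t in intensity}

r_ab = pearson(ratios["A"], ratios["B"])
r_ac = pearson(ratios["A"], ratios["C"])
print("A vs B:", classify(r_ab))
print("A vs C:", classify(r_ac))
```

In this noise-free example the probe sensitivities cancel in the ratio to the pool, and the twofold difference in total expression between tissues A and B shifts all of their log-ratios equally, so it does not affect the correlation. Tissues A and B come out as correlated and tissues A and C as anti-correlated.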
One important foundation for the detection of complex phenomena such as alternative splicing is high-quality hybridization data displaying good specificity, reproducibility and signal-to-noise. Our data validate previous reports of the advantages of the Agilent array platform, which makes longer probe sequences possible (36–40 nt in this study). We wish to emphasize that all the data presented in this paper are raw microarray hybridization intensities directly reflecting the quality of the experimental data. Each data point shown in our figures is the signal from a single spot (on a single microarray), in contrast with common practices such as averaging up to four replicate arrays to suppress noise, or using data from up to 40 hybridization spots per array to obtain a single expression signal. The reproducibility of our data across four replicate arrays (with dye-swaps) indicates a good level of signal-to-noise, taking into account both variation between arrays and variation between different experiments and labeling. The reproducibility of our data for each gene across many different tissues shows that this level of signal-to-noise is also well above the level of variation between different experiments and samples. The use of longer probe sequences [60 nt total, including a specific probe sequence of 36–40 nt on top of a base of 20–24 bases of poly(T) to raise the probe sequence off the surface of the array] appears to work well for clear, reproducible detection of alternative splicing. The Agilent array platform's reproducibility (comparing a single spot between replicate arrays) and consistency (comparing the absolute intensities of many probes for an individual gene) provide a good foundation for detecting alternative splicing. The development of amplification and labeling methods that give robust coverage over the full length of each gene (as opposed to just the 3′ end) has also been crucial to reliable detection of tissue-specific alternative splicing (14).
Our approach has many deficiencies that need to be addressed. For example, in this paper, we have de-emphasized quantification in favor of qualitative analysis, as a way of stressing the distinct character of alternative splicing when compared with increases or decreases in total gene expression. However, the next stage of analysis requires accurate estimation of the amounts of each distinct splice form. This is clearly more challenging than accurate estimation of the total amount of mRNA for a gene. Once the individual sets of probes that distinguish different splice forms have been identified, it is possible to measure the amounts of each splice form. For example, Wang et al. have described a matrix-based method for estimating the amounts of distinct transcript forms given a set of individual probes that distinguish them.
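To make the idea of a matrix-based estimate concrete, the following is a generic least-squares sketch for the two-isoform case. It is not the algorithm of Wang et al.: it assumes an idealized linear model in which each probe intensity is the sum of the amounts of the splice forms it matches, with equal probe sensitivity and no noise model. The function name, the design matrix and the intensities are hypothetical.

```python
def estimate_two_isoforms(design, intensities):
    """Least-squares amounts (x1, x2) such that design * x approximates intensities.

    design[i] = (d1, d2), where d_j is 1 if probe i matches isoform j, else 0.
    Solves the 2x2 normal equations (D^T D) x = D^T y directly.
    """
    a11 = sum(row[0] * row[0] for row in design)
    a12 = sum(row[0] * row[1] for row in design)
    a22 = sum(row[1] * row[1] for row in design)
    b1 = sum(row[0] * y for row, y in zip(design, intensities))
    b2 = sum(row[1] * y for row, y in zip(design, intensities))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("probes do not distinguish the two isoforms")
    x1 = (a22 * b1 - a12 * b2) / det
    x2 = (a11 * b2 - a12 * b1) / det
    return x1, x2


# Hypothetical probe set for a gene with one skipped exon:
# isoform 1 includes the exon, isoform 2 skips it.
design = [
    (1, 1),  # constitutive exon probe, matches both isoforms
    (1, 1),  # second constitutive exon probe
    (1, 0),  # probe inside the skipped exon, isoform 1 only
    (0, 1),  # exon-skipping junction probe, isoform 2 only
]
intensities = [100, 100, 30, 70]  # idealized, noise-free signals

amount_1, amount_2 = estimate_two_isoforms(design, intensities)
print(amount_1, amount_2)
```

With these idealized inputs the estimate recovers 30 units of the exon-including form and 70 units of the exon-skipping form. Real data would additionally require per-probe sensitivities and a treatment of noise, and the check on the determinant reflects the requirement that the probe set actually distinguishes the splice forms.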