With the rapidly declining cost of data generation, and the accumulation of massive datasets, molecular biology is entering an era in which incisive analysis of existing data will play an increasingly prominent role in the discovery of new biological phenomena and the elucidation of molecular mechanisms. Here, we discuss resources of publicly available sequencing data most useful for interrogating the mechanisms of gene expression. Existing next-generation sequence datasets, however, come with significant challenges in the form of technical and bioinformatic artifacts, which we discuss in detail. We also recount several breakthroughs made largely through the analysis of existing data, primarily in the RNA field.
Recent technological breakthroughs in DNA sequencing have vastly accelerated the rate and greatly reduced the cost of generating high-throughput molecular data. The cost of nucleotide sequencing, for example, is falling faster than even Moore’s law for integrated circuits (www.genome.gov/sequencingcosts). Given sufficient download bandwidth, storage capacity and a desktop computer, anyone with the appropriate tools can analyze these existing molecular datasets to address myriad questions in molecular biology. Furthermore, increasing accessibility to supercomputers and cloud computing allows sophisticated analyses to be performed by an ever greater number of scientists.
Because individual laboratories or consortia producing and analyzing large-scale datasets do not (and typically cannot) explore every possible hypothesis that is supported by their data, opportunities abound to test new ideas that may have not been considered previously. In terms of individual datasets, the opportunities are vast. The NCBI Gene Expression Omnibus (GEO) has archived more than 32,000 microarray and sequencing studies that comprise over 800,000 samples since 2001 (Barrett et al., 2013; http://www.ncbi.nlm.nih.gov/geo/). The Sequence Read Archive (SRA), which maintains sequence data that are either submitted directly to the SRA or extracted from GEO submissions, currently hosts over 1 petabase (Kodama et al., 2011; http://www.ncbi.nlm.nih.gov/sra) and is one of the largest datasets hosted by Google (www.dnanexus.com). Clearly, access to molecular data has never been greater.
Before diving headfirst into this immense sea of data, it is essential to identify publicly available datasets that constitute the equivalent of a properly controlled experiment. For example, can datasets be identified that are derived from matched biological samples? In some cases, answers to these questions can be easily obtained from metadata associated with each dataset at the GEO and SRA databases. In other cases, however, this information may be difficult to find, incomplete, or even incorrect. Furthermore, expert technical knowledge of the experimental procedures used to generate the datasets is required to assess potential technical artifacts and other caveats. Below, we identify particularly useful collections of publicly available data and discuss common technical artifacts that should be taken into consideration when analyzing next-generation sequencing (NGS) datasets. To demonstrate the utility of analyzing existing data, we highlight successful approaches that have generated new ideas regarding the mechanisms that regulate gene expression.
Related dataset collections that are published together as a resource provide one solution to the problem of identifying comparable datasets. For example, Keji Zhao’s group at the NIH has generated one of the most comprehensive ChIP-seq studies of epigenomic information from a single human cell-type: resting CD4+ T cells (Barski et al., 2007; Schones et al., 2008; Wang et al., 2008b). These particular datasets have been analyzed by many other investigators to identify specific chromatin marks that combine to constitute “chromatin states” (Ernst and Kellis, 2010; Hon et al., 2009), domains (Shu et al., 2011), and boundaries (Wang et al., 2012), or those marks which are best correlated with tissue-specific gene expression (Pekowska et al., 2010; Visel et al., 2009) or gene architecture (Andersson et al., 2009; Hon et al., 2009; Huff et al., 2010; Schwartz et al., 2009; Spies et al., 2009; Tilgner et al., 2009). The studies demonstrate the range of information that can be gleaned from a single large, coherent collection of datasets. One reason that these studies were so successful was because all of the datasets were generated by a single group, using a consistent method, from a single cell type. Identifying similarly coherent datasets among the vast sea of the GEO, SRA and other data repositories, however, can be a challenge.
Initiatives such as the 1000 Genomes (Clarke et al., 2012; The 1000 Genomes Project Consortium et al., 2010), The Cancer Genome Atlas, ENCODE (ENCODE Consortium et al., 2012), modENCODE (modENCODE Consortium et al., 2010; Gerstein et al., 2010), and the Epigenomics Roadmap (Chadwick, 2012) projects provide additional resources that are ideal for data integration (Table 1). Because these initiatives are specifically tasked with providing high-quality resources, common sets of biological samples, reagents, and methods are defined for each project. In the case of ENCODE and modENCODE datasets, established standards must be met for data to be released (Consortium, 2011; http://genome.ucsc.edu/ENCODE/protocols/dataStandards/). For example, the modENCODE project has evaluated the specificity and efficiency of commercial antibodies that are commonly used to generate ChIP-seq datasets (fewer than 75% of which passed muster) (Egelhofer et al., 2010). Thus, when using data from one of these public projects, users can be reasonably confident that the quality of the data meets or exceeds certain standards.
These initiatives also generate important control datasets as resources. For example, it is most appropriate to align sequence reads from RNA-seq and ChIP-seq experiments to the cell line-, strain- or individual-specific genome rather than to a generic reference genome. Otherwise, single-nucleotide polymorphisms (SNPs) may be mistaken for RNA editing sites and copy-number variations (CNVs) may be mistaken for differential gene expression or changes in chromatin structure (Pickrell et al., 2011; Schrider et al., 2011). Accordingly, many of these initiatives have resequenced the genomes of the cell lines, strains, and individuals used in the projects.
Lastly, more than 1,500 curated databases are described in the Nucleic Acids Research online Molecular Biology Database Collection, many of which collect and integrate existing data to produce user-friendly, searchable websites (Fernandez-Suarez and Galperin, 2013). As the field of bioinformatics has evolved to more effectively tackle specific questions in biology (Butte, 2009), cutting-edge databases have been designed to place mechanistic hypotheses within arm's reach of investigators by automating novel data integration strategies. For example, the HaploReg database integrates user-defined genome-wide association study results with linkage disequilibrium (The 1000 Genomes Project Consortium, 2010), sequence conservation information (Lindblad-Toh et al., 2011), and chromatin structure (Ernst et al., 2011) to link disease-associated genetic variation with putative regulatory elements (Ward and Kellis, 2012). Similarly, many of the large initiatives (e.g. ENCODE, modENCODE, etc.) also provide unified exploratory tools (e.g. UCSC and IGV genome browsers) that allow straightforward evaluation of genomic regions of interest.
New advances in NGS technologies are greatly expanding the current volume and the range of existing data (Metzker, 2010). With no sign that innovation in sequencing technology is slowing, the pace of sequence data generation can only be expected to increase as costs continue to fall. By the start of 2012, approximately 75,000 genomic, 15,000 transcriptome and 15,000 epigenomic submissions had been contributed to the SRA (Figure 1A). However, that volume of data represents only the tip of the iceberg, as transcriptome and epigenomic applications are extended to a greater range of cell types and species. Indeed, the number of transcriptome and epigenomic submissions has been steadily increasing, particularly in recent years (Figure 1B).
Transcriptome and epigenomic applications have been applied most extensively to humans and model organisms such as mouse, fly, worm and yeast (Figure 1C). While the transcriptome datasets consist almost entirely of RNA-seq experiments, the epigenomic datasets are generated using a large collection of methods that interrogate various aspects of chromatin structure. The epigenomic experiments include methods to assess DNA accessibility (Boyle et al., 2008), DNA methylation (Laird, 2010), the genomic locations of transcription factors and chromatin marks (ChIP-seq) (Park, 2009; Schones and Zhao, 2008), and nucleosome positions (MNase-seq) (Jiang and Pugh, 2009) (Figure 1A, red bar graph). Additionally, specialized applications have also been submitted to the SRA that assess chromatin conformation (de Wit and de Laat, 2012; Dekker et al., 2002; Fullwood et al., 2010), RNA:protein interactions (Licatalosi and Darnell, 2010; Ule et al., 2005), RNA polymerase elongation (Churchman and Weissman, 2011; Core et al., 2008), and ribosome occupancy (Ingolia et al., 2009). In theory, any process related to nucleic acid metabolism can be assessed with the proper biochemical preparation, which makes NGS applications a rich and powerful source for integrative data analysis (Hawkins et al., 2010).
Furthermore, NGS datasets are extraordinarily rich. The sequence reads from a single experiment can provide a vast array of quantitative, positional and sequence information. For instance, RNA-seq datasets provide sufficient information to measure mRNA expression levels and alternative splicing, to identify transcription start sites and polyadenylation sites, and to identify instances of RNA editing. In certain cases, allele-specific gene expression, allele-specific splicing, and even trans-splicing can be measured. Further, RNA-seq can be used as a discovery tool to annotate novel coding and non-coding transcripts as well as chimeric transcripts that result from genomic rearrangements (Figure 2) (Martin and Wang, 2011; McManus et al., 2010; Ozsolak and Milos, 2011; Wang et al., 2009). Though these datasets are very rich, they must be analyzed carefully. Essentially every step involved in generating NGS data introduces detectable, sometimes substantial, biases or errors (Figure 3). This presents a particular challenge for data integration, since different sequencing platforms, biochemical procedures, and data processing methods are associated with unique caveats. With the proper controls, however, these effects can often be identified and accounted for in downstream analyses.
NGS platforms have been in use long enough that biases attributable to library construction and sequencing have been evaluated in great detail. Data generated by the Illumina platform, for example, are subject to base-call errors that increase with read position due to phasing issues (Dohm et al., 2008) and underrepresentation of high and low GC-content reads (Dohm et al., 2008; Risso et al., 2011). As these technical issues are well characterized, popular analysis packages attempt to correct for such nucleotide biases. For example, Cufflinks, a popular program used to measure differential gene expression, empirically determines nucleotide biases present in RNA-seq datasets and corrects for them (Meacham et al., 2011; Trapnell et al., 2012). While this strategy vastly improves comparisons between independently generated datasets, and even between different sequencing platforms, third-generation sequencing platforms, such as those from Ion Torrent, Pacific Biosciences, and Oxford Nanopore (among other emerging companies), use radically different chemistries, the biases of which will need to be identified and remedied.
Less recognized are the myriad experiment-specific biases or artifacts that are introduced at nearly every step involved in preparing libraries and sequencing them. For example, it has clearly been shown that RNA-seq libraries prepared using random-hexamer priming display a systematic non-templated sequence bias at the beginning of reads that arises primarily during first-strand synthesis (Hansen et al., 2010). This technical artifact appears to be a major source of the controversial ~10,000 “RNA DNA differences” (proposed RNA editing sites) identified from human cell lines that were recently reported (Li et al., 2011) and subsequently called into question (Kleinman and Majewski, 2012; Lin et al., 2012; Pickrell et al., 2012). Similarly, ChIP-seq experiments over-represent regions of open chromatin, which can create false positives (Chen et al., 2012).
Another technical artifact associated with library preparations is template-switching that occurs during the amplification steps, which can give rise to molecules that did not exist in the initial biological sample. For RNA-seq experiments, this type of artifact can generate ‘chimeric’ RNAs that appear to be synthesized by trans-splicing or some unknown biological process (Gingeras, 2009; McManus et al., 2010). Sequence data from control libraries that assess the frequency of template-switching are absolutely essential to distinguish biologically derived chimeric RNAs from those generated artifactually (McManus et al., 2010).
In addition to experimental artifacts, bioinformatic artifacts can severely impact data interpretation. A major source of these artifacts is the mappability of NGS sequence reads, which are typically 25–100 nucleotides in length. Mapping these sequences to a reference genome can be particularly problematic due to the plethora of repetitive elements present in most genomes. Repetitive elements such as LINEs and SINEs have always presented difficulties for correctly mapping sequences, but the short size of NGS reads significantly amplifies this problem – as read length decreases, so does the number of unique regions that can be mapped within a reference genome (Treangen and Salzberg, 2012). Consequently, ‘mappability’ differs depending on read length. Such biases can create illusory non-random associations with biological features (e.g. exons) in ChIP- and MNase-seq experiments. For example, with 32 bp reads, tiny but common genomic features, such as coding starts, ends, exons and splice sites accumulate greater read densities than other local features (e.g. introns) (Schwartz et al., 2011). Thus, mappability must be carefully considered when interpreting any type of alignment data.
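The dependence of mappability on read length can be illustrated with a minimal sketch (hypothetical Python on a toy genome; real mappability tracks are computed from full genome alignments, allowing mismatches and both strands). A window is counted as uniquely mappable only if its sequence occurs exactly once among all windows of that length:

```python
from collections import Counter

def mappability(genome: str, read_len: int) -> float:
    """Fraction of read-length windows in `genome` that map uniquely.

    Toy model: single strand, exact matches only. A window is 'uniquely
    mappable' if its sequence occurs exactly once in the genome.
    """
    windows = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(windows)
    unique = sum(1 for w in windows if counts[w] == 1)
    return unique / len(windows)

# A toy genome with an internal repeat ("ACGTACGT" appears twice):
# short reads cannot distinguish the two copies, longer reads can.
toy = "TTTTACGTACGTGGGGACGTACGTCCCC"
assert mappability(toy, 6) < 1.0    # reads inside the repeat are ambiguous
assert mappability(toy, 14) == 1.0  # longer reads reach unique flanking sequence
```

The same qualitative behavior underlies the read-length-dependent biases at short genomic features described above.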
A second bioinformatic artifact is caused by genetic variation that has not been accounted for. CNVs that differ between experimental samples and reference genomes, for example, can create false-positive enrichment regions in ChIP-seq experiments (Pickrell et al., 2011). Studies of RNA editing are particularly susceptible to high false-discovery rates if SNPs are not accounted for in the analysis. In the case of the reported RNA DNA differences (Li et al., 2011), 55% of the sites match at least one of the 27 individual genomes used in the original analysis, which suggests that the relatively low coverage (2–6×) of these genomes was not sufficient to identify and eliminate confounding SNPs (Schrider et al., 2011). Thus, a matched reference genome – with sufficiently deep coverage – should be used when mapping short reads since experimental samples such as cell lines, strains, and individuals may differ in their SNPs, CNVs, and chromosome number. Even then, care must be taken to ensure that interesting results are not correlated with regions of shallow sequence depth.
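The filtering logic implied above can be sketched as follows (a hypothetical Python example; the function name, site tuples, and depth threshold are illustrative assumptions, not any published pipeline). Candidate RNA-DNA differences are discarded both when they coincide with a known genomic variant and when genomic coverage is too shallow to rule out an unseen SNP:

```python
def filter_editing_candidates(candidates, known_variants, depth, min_depth=10):
    """Filter candidate RNA editing sites (RNA-DNA differences).

    candidates:     iterable of (chrom, pos) tuples
    known_variants: set of (chrom, pos) tuples called from a matched genome
    depth:          dict mapping (chrom, pos) -> genomic sequencing depth
    """
    kept = []
    for site in candidates:
        if site in known_variants:
            continue  # a genomic SNP masquerading as RNA editing
        if depth.get(site, 0) < min_depth:
            continue  # coverage too shallow to exclude a hidden SNP
        kept.append(site)
    return kept

candidates = [("chr1", 100), ("chr1", 250), ("chr2", 42)]
snps = {("chr1", 250)}
coverage = {("chr1", 100): 30, ("chr1", 250): 25, ("chr2", 42): 3}
assert filter_editing_candidates(candidates, snps, coverage) == [("chr1", 100)]
```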
NGS technologies are rapidly evolving. Consequently, robust computational methods lag behind this moving target. Thus, data generated by the early adoption of exciting new technologies should be evaluated, first and foremost, with critical attention to sequence biases. It is our opinion that novel phenomena will increasingly be discovered by the use of existing data. However, these phenomena are only as compelling as the support for an underlying mechanism. The vast potential for technical artifacts in NGS data is reason enough for caution, particularly since some technical artifacts are, in fact, not random and may correlate strongly with known biological features. Here, we are reminded of a sage warning made by Daniel MacArthur in reference to remarkable results obtained by analyzing large NGS datasets: “The more surprising a result seems to be, the less likely it is to be true” (MacArthur et al., 2012). We implore data analysts to heed this warning and perform extensive validation of remarkable findings, or they may indeed fall victim to MacArthur’s rule.
The Watson-Crick model of DNA – the double helix – was developed largely through model building informed by existing data (Watson and Crick, 1953b, a). Although structural evidence supporting the model required meticulous experimentation (Franklin and Gosling, 1953; Wilkins et al., 1953), the model alone suggested the basis of genetic inheritance as well as DNA replication and recombination (Watson and Crick, 1953a, b). Despite lacking any knowledge of the complex protein machines responsible for replication and recombination, the central predictions were, nonetheless, essentially proven within a decade (Alberts, 2003). Thus, the double helix exemplifies how the insightful analysis of existing data can revolutionize a field.
There are numerous examples where insights into the molecular mechanisms of various biological processes have been gleaned from analyzing existing data. Below we describe several of these and highlight the general aspects of each study as a lesson for how each approach can be applied to other problems. We focus on problems related to various aspects of RNA biology, though these approaches can be used for other molecular processes as well.
Functionally important sequence elements are expected to be conserved over time. Thus, one way of investigating a particular process is to identify conserved sequence elements using alignments of multiple whole genome sequences. Conservation plots can be generated from such alignments using several software packages that calculate nucleotide substitution rates. Conveniently, conservation scores based on whole-genome alignments using phastCons (Siepel et al., 2005), phyloP (Pollard et al., 2010), or SiPhy (Garber et al., 2009) can be downloaded from the UCSC genome browser for many model organisms, including yeast, worm, fly and various mammals including mouse and human (Kent et al., 2002). Analyzing the identity, characteristics, and locations of conserved sequences can provide tremendous mechanistic insight. Below, we highlight two such examples, in the fields of RNA editing and alternative splicing, that utilized this approach.
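As a toy illustration of the underlying idea (not the actual phastCons, phyloP, or SiPhy algorithms, which model substitution rates on a phylogenetic tree), per-column conservation can be crudely approximated from a multiple alignment as the fraction of species sharing the most common base at each position:

```python
from collections import Counter

def column_conservation(alignment):
    """Per-column conservation for equal-length aligned sequences:
    the fraction of sequences sharing the most common character.
    A crude stand-in for phylogeny-aware scores like phastCons/phyloP.
    """
    scores = []
    for column in zip(*alignment):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores

# Toy alignment of one region from three species
aln = ["ACGTAA",
       "ACGTTA",
       "ACGACA"]
scores = column_conservation(aln)
assert scores[:3] == [1.0, 1.0, 1.0]  # invariant columns score highest
```

Stretches of consecutive high-scoring columns would then flag candidate functional elements for closer inspection.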
Adenosine to inosine (A-to-I) editing of RNA is an evolutionarily conserved process catalyzed by the ADAR family of adenosine deaminases (reviewed in Rieder and Reenan, 2011). A mystery that dogged the field for some time was the paucity of known endogenous RNA targets – only a few chance discoveries had been described – despite evidence that the inosine content of mRNA isolated from brain tissues might be as high as 1 in every 17,000 nucleotides (Paul and Bass, 1998). Armed with the knowledge that ADAR mutations resulted in neurological defects (Palladino et al., 2000) and that ADAR editing required an RNA duplex formed between the targeted region (containing the edited adenosine) and a complementary sequence (Higuchi et al., 1993), Hoopengardner and colleagues (2003) searched for new targets of RNA editing among the neuronally expressed genes of Drosophila. In the case of para, which encodes a Na+ channel, the exon containing a known editing site is very highly conserved near the editing site, as is a region in the adjacent intron which basepairs with the edited exon. Based on this observation, Hoopengardner et al. (2003) reasoned that they might be able to identify new RNA editing targets by identifying very highly conserved exons. They therefore assessed 914 neuronally expressed genes to identify exons with a high level of sequence constraint between Drosophila melanogaster and Drosophila pseudoobscura. This approach proved highly productive, as 16 new editing sites were identified that were validated by cDNA sequencing (Hoopengardner et al., 2003) (Figure 4A illustrates one such example). Importantly, this use of comparative genomics demonstrated a previously unanticipated degree of phylogenetic conservation between A-to-I editing sites, solidified the RNA duplex-dependent mechanism of ADAR function, and provided a facile bioinformatic strategy for editing-site identification.
Indeed, improved variations of this approach using existing EST:genome alignments (Levanon et al., 2004) or archived sequence chromatograms (Zaranek et al., 2010) have now greatly increased the number of high-confidence A-to-I editing candidates (reviewed in Wulff et al., 2011).
A similar approach has been used to uncover novel mechanisms that regulate alternative splicing. In one particularly illustrative case, the Drosophila Dscam gene, conservation within introns revealed a novel mechanism of mutually exclusive splicing. Dscam is a textbook example of the importance of alternative splicing in increasing protein diversity, as it may generate over 38,000 different protein isoforms (Schmucker et al., 2000). Each time the Dscam gene is transcribed, the pre-mRNA is spliced such that each mRNA contains one and only one exon from each of four exon clusters (exons 4, 6, 9, and 17, specifically). But at the time of its discovery, no previously described mechanism of mutually exclusive splicing could explain how the many variable exons of Dscam are spliced in a mutually exclusive manner.
Compared to exons, which are often highly conserved due to their coding potential, intronic regions typically have little conservation except at sites that have non-coding function, such as RNA splicing. Nucleotide alignments from 15 insect species revealed two types of conserved sequence elements in the introns of the exon 6 cluster. The first element, the docking site, was located between exon 5 and the first exon 6 variant. Importantly, the docking site was found only once in the exon 6 cluster, but was present in every species examined, even species that diverged over 450 million years ago. Selector sequences, the second type of sequence element, were located in introns upstream of each exon 6 variant. Based on their complementary sequences, the docking site and selector sites appeared to base-pair with one another; however, one and only one of the selector sequences could base-pair with the docking site at a time (Figure 4B). Thus, base pairing of one selector sequence with the docking site would be predicted to promote inclusion of that exon, while simultaneously inhibiting the splicing of the 47 other exon 6 variants. Though this mechanism was discovered purely by comparative genomics and bioinformatics, the elegance and universal conservation among insects lent credence to the proposed mechanism. Experimental confirmation of the model was subsequently obtained using mutagenized BACs containing the entire Dscam gene (May et al., 2011). Further demonstrating the power of this approach, additional docking site:selector sequences within the other Dscam clusters and even other alternatively-spliced genes have been identified largely on the basis of intronic conservation (Yang et al., 2011).
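The docking site:selector logic lends itself to a simple computational screen. The sketch below is a hypothetical Python illustration (real selector prediction must also accommodate G:U wobble pairs, bulges, and thermodynamic scoring): it scans an intron for stretches perfectly complementary to a docking site.

```python
COMP = str.maketrans("ACGU", "UGCA")

def revcomp(rna: str) -> str:
    """Reverse complement of an RNA sequence."""
    return rna.translate(COMP)[::-1]

def find_selectors(docking_site: str, intron: str, min_len: int = 8):
    """Scan an intron for candidate selector sequences: windows that are
    perfectly complementary to (part of) the docking site."""
    target = revcomp(docking_site)
    hits = []
    for i in range(len(intron) - min_len + 1):
        window = intron[i:i + min_len]
        if window in target:
            hits.append((i, window))
    return hits

# Toy intron carrying one embedded selector complementary to the docking site
docking = "AGGCAUUGG"
intron = "GGG" + "CCAAUGCC" + "AAA"
assert find_selectors(docking, intron) == [(3, "CCAAUGCC")]
```

Ranking such hits by conservation across species, as in the original analyses, would then separate genuine selectors from chance complementarity.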
These two examples illustrate how conservation can be used to identify functional elements that provide insight into the mechanisms of RNA editing and alternative splicing. However, these approaches can be used to study many other processes. For instance, candidate functions have been assigned to ~60% of the conserved sequences in mammals (Lindblad-Toh et al., 2011), yet 40% of these elements have unknown functions. Moreover, another 10,000 regions of mammalian coding sequences are predicted to have overlapping functions; yet again, these functions are mostly unknown. Thus, a fruitful avenue of research is to use existing multiple sequence alignments to identify the conserved sequence elements associated with a gene or process of interest. Determining the functions of these conserved sequences will likely require experimental approaches, but functions may also be inferred from sequence features or locations alone.
More recently, analyses of existing data have made a significant impact on the burgeoning study of co-transcriptional splicing. A growing body of evidence now supports the notion that transcription and splicing are not only concurrent, but also coupled, such that transcriptional dynamics profoundly influence RNA splicing (Neugebauer, 2002). For example, the elongation rate of RNA polymerase can influence the propensity for exon skipping (Kornblihtt et al., 2004). Until recently, however, little was known about the relationship between co-transcriptional splicing and the chromatin context in which it takes place. By integrating existing epigenomic datasets with known splicing patterns, recent studies have generated exciting new hypotheses that intimately connect chromatin structure to RNA splicing.
A major challenge associated with studying chromatin structure is its immense complexity. Even the most fundamental unit of chromatin, the nucleosome, can differ between genomic regions in occupancy, positioning, and myriad post-translational modifications (a.k.a. chromatin marks). It has long been observed in S. cerevisiae, for instance, that the chromatin mark, histone H3 lysine 36 trimethylation (H3K36me3), is enriched within the bodies of active genes (reviewed in Li et al., 2007). Thus H3K36me3, and many other marks, are thought to be intimately associated with transcriptional processes. Questions concerning whether chromatin marks might affect, or be affected by, splicing were rarely discussed until genome-wide ChIP surveys in C. elegans demonstrated higher levels of H3K36me3 at exons compared to nearby introns within the same gene (Kolasinska-Zwierz et al., 2009). As this finding was entirely unanticipated, Kolasinska-Zwierz et al. turned to publicly available H3K36me3 ChIP-seq data (Barski et al., 2007) to test the relevance of their findings to humans. By aggregating all of the H3K36me3 sequence reads that aligned near exons (a so-called “metagene” analysis), the authors observed strikingly similar H3K36me3 enrichment at the average human exon (Figure 5A and 5B, blue panel).
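Metagene aggregation of this kind is conceptually simple. A minimal sketch (hypothetical Python with toy coordinates, ignoring strand, read length, normalization, and per-chromosome bookkeeping) averages read depth at each offset relative to a set of anchor positions such as exon starts:

```python
from collections import Counter

def metagene(read_positions, anchors, flank=5):
    """Average read density around anchor coordinates (e.g. exon starts).

    Returns a profile of length 2*flank + 1, where index flank corresponds
    to the anchor itself (offset 0).
    """
    depth = Counter(read_positions)          # reads per genomic position
    profile = [0.0] * (2 * flank + 1)
    for anchor in anchors:
        for offset in range(-flank, flank + 1):
            profile[offset + flank] += depth[anchor + offset]
    return [total / len(anchors) for total in profile]

# Toy data: reads pile up just downstream of each of two anchors
anchors = [100, 200]
reads = [101, 101, 102, 201, 202, 202, 150]
profile = metagene(reads, anchors)
assert profile[5 + 1] == 1.5   # offset +1: 3 reads over 2 anchors
```

Peaks in such a profile correspond to the exon-centered enrichment plotted in metagene figures like Figure 5.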
Furthermore, integrating H3K36me3 ChIP data with annotations of alternative and constitutive exons revealed a potential connection between the degree of chromatin marking and alternative splicing. A modest but suggestive decrease in H3K36me3 reads at alternatively spliced exons was observed in both worm and mouse (Kolasinska-Zwierz et al., 2009). Similar analyses have been somewhat conflicting (Hon et al., 2009; Spies et al., 2009), but this may simply indicate that the differences between constitutive and alternatively-spliced exons are subtle. Nonetheless, the notion that chromatin marks, and in particular, H3K36me3 can affect alternative splicing is consistent with several recently reported experiments. Indeed, two H3K36me3-binding proteins, MRG15 and PSIP1, have been implicated in mediating alternative splicing regulation by the splicing factors PTB and SRSF1 (ASF/SF2), respectively (Luco et al., 2010; Pradeepa et al., 2012). Additional marks, such as H3K4me3 and H3 acetylations, have also been shown to influence splicing (Gunderson and Johnson, 2009; Sims et al., 2007). Most likely, chromatin marks function in splicing by modulating RNA polymerase elongation rates or by recruiting specific splicing factors to active genes (Luco et al., 2011 and references therein; Nilsen and Graveley, 2010). Intriguingly, depletion of the only known H3K36me3 methyltransferase influences the splicing of only a small, but significant, subset of PTB-dependent splicing events (Luco et al., 2010). Such gene-specific effects might also explain why genome-wide correlations between H3K36me3 and alternative splicing have been modest.
Numerous studies have since analyzed more than 41 chromatin marks using publicly available epigenomic datasets, which have yielded a strong consensus: chromatin structure reflects gene architecture. In humans, three types of exon/intron boundaries have been shown to be associated with particular chromatin marks: 1) H3K4me3 and H3K9Ac throughout the length of the first exon (Bieberstein et al., 2012), 2) H3K79me2 (among several other marks) throughout the length of the first intron (Huff et al., 2010), and most notably 3) H3K36me3 enrichment at internal exons (Andersson et al., 2009; Hon et al., 2009; Huff et al., 2010; Spies et al., 2009). Because these chromatin marks are intimately associated with transcription, these results suggest a much closer connection between the splicing and transcription machineries than previously thought.
Similarly, analysis of published MNase-seq data (Schones et al., 2008; Valouev et al., 2008) revealed that nucleosomes themselves were also highly associated with internal exons (Andersson et al., 2009; Huff et al., 2010; Schwartz et al., 2009; Spies et al., 2009; Tilgner et al., 2009). Based on this observation, a novel mechanism for exon definition has been proposed (though it has yet to be proven), whereby nucleosome-occupied exons serve as RNA polymerase “speed bumps” that provide additional time for the spliceosomal machinery to recognize nearby splice sites (Schwartz et al., 2009; Tilgner et al., 2009).
In theory, elevated nucleosome occupancy at exons alone could explain the previously identified enrichment of H3K36me3 at exons (Schwartz et al., 2009; Tilgner et al., 2009). However, bioinformatic analyses comparing nucleosome occupancy at exons, exon-like composition regions (ECRs), and pseudoexons, have uncovered evidence that the mechanisms determining nucleosome occupancy and H3K36me3 enrichment at exons are distinct. To discern which aspects of exon sequence might be necessary and sufficient for high nucleosome occupancy, the sequence characteristics of exons were analyzed separately. In this case, the average ECR, which is not flanked by splice sites but does have the same GC-content as the average exon, displayed nucleosome occupancies equal to that of the average exon (Spies et al., 2009) (Figure 5B, brown panel). Conversely, the average pseudoexon, which has lower GC-content than the average exon, was depleted for nucleosome occupancy (Tilgner et al., 2009). Thus, the DNA sequence composition of exons alone may be sufficient for exon-like nucleosome occupancy. Lastly, ECRs were not enriched for H3K36me3, which suggests that exon marking reflects some aspect of splicing rather than exon-like sequence composition (Huff et al., 2010) (Figure 5B, brown panel). Further supporting a role for the spliceosome in specifying chromatin structure, recent experiments in which splicing was inhibited by splice-site mutations or Spliceostatin exposure have indeed caused changes in H3K36me3 marking (De Almeida et al., 2011; Kim et al., 2011).
The above analyses demonstrate the power of data integration to establish new connections between related processes whose mechanistic links may yet be unclear. In some cases, these relationships can be readily observed at single genes using a genome browser to facilitate comparisons between datasets. In such cases, moving from observations at single loci to genome-wide analyses can be accomplished by aggregating values from genome-wide datasets at specific features of interest (Figure 5). Genome-wide relationships can also be revealed by plotting all relevant loci in a high-density heatmap that is aligned and sorted to highlight features of interest (Hawkins et al., 2010). The latter approach is particularly useful in instances where summary statistics like genome-wide averages might be deceptive (e.g., the mean of a bimodal distribution). Thus, by integrating and aggregating existing data, a mere anecdote can be transformed into a global principle.
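The hazard of summary statistics mentioned above is easy to demonstrate with a contrived example (synthetic Python values, not real ChIP-seq signal): when loci fall into two distinct classes, the genome-wide mean describes no locus at all, whereas sorting the per-locus values, as a heatmap does, exposes the bimodality.

```python
# Two classes of loci: half with strong signal, half with none.
signals = [0.0] * 50 + [10.0] * 50

mean = sum(signals) / len(signals)   # 5.0 -- a level no locus exhibits
sorted_rows = sorted(signals)        # rows of a one-column 'heatmap'

assert mean == 5.0
assert signals.count(mean) == 0      # the mean value never actually occurs
assert sorted_rows[0] == 0.0 and sorted_rows[-1] == 10.0
```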
Publicly available data can also be used to assess the prevalence and functional consequences of previously ignored biological phenomena. In animals, alternatively spliced genes are the norm; more than 92% of human genes produce at least one alternatively spliced transcript (Pan et al., 2008; Wang et al., 2008a). While many of these alternatively spliced transcripts are predicted to encode functionally distinct protein isoforms, others encode protein isoforms whose biological relevance is questionable. Thus, the perennial question: which of these splicing events are regulated and which are stochastic?
For instance, alternative splicing may introduce premature termination codons (PTCs) that target the message for degradation by nonsense-mediated decay (NMD). Such unproductively spliced transcripts could be regulated to function as post-transcriptional on/off switches, or they could merely be splicing mistakes in need of triage. Only a few genes were previously known to be regulated by unproductive splicing. Publicly available EST data, on the other hand, suggested that nearly one-third of alternatively spliced transcripts were potential NMD targets (Lewis et al., 2003). This unexpectedly high prevalence brought new attention to the hypothesis that unproductive splicing might post-transcriptionally regulate the expression of entire classes of genes. Controversy initially surrounded the original EST-based estimates because experiments that depleted NMD factors to identify stabilized unproductively spliced transcripts yielded more conservative estimates. By microarray, only ~10% of cassette exons substantially elicited NMD (Pan et al., 2006), and tissue-specific regulation was found to be rare. More recent approaches using RNA-seq, which allow all major forms of splicing to be considered, have brought these numbers closer to the initial estimates, but only for some tissues (Weischenfeldt et al., 2012). Nonetheless, the question of whether a significant portion of unproductive splicing regulates the expression of entire classes of genes was answered by analyses showing that unproductively spliced transcripts were enriched for genes encoding splicing factors and other RNA-binding proteins (Ni et al., 2007; Saltzman et al., 2008). Another experimental study demonstrated that the entire family of human SR proteins was associated with unproductive splicing (Lareau et al., 2007). These studies satisfyingly confirmed and extended previous reports of autoregulation by unproductive splicing (reviewed in McGlincy and Smith, 2008).
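Computationally, a transcript is commonly flagged as a likely NMD target using the "50-nucleotide rule": a stop codon lying more than ~50 nt upstream of the last exon-exon junction is treated as premature. A minimal sketch of such a classifier follows; the coordinates and the exact threshold are illustrative assumptions, not the precise criteria of the studies cited above:

```python
# Sketch: flag likely NMD targets with the 50-nt rule. A transcript is
# represented by its stop-codon position and its exon-junction positions,
# all in transcript (mRNA) coordinates. Data and names are illustrative.

def is_likely_nmd_target(stop_codon_pos, junction_positions, threshold=50):
    """Return True if the stop codon lies more than `threshold` nt upstream
    of the last exon-exon junction, the classic signature of a PTC."""
    if not junction_positions:        # single-exon transcript: no junctions
        return False
    last_junction = max(junction_positions)
    return (last_junction - stop_codon_pos) > threshold

# Normal transcript: the stop codon sits near (or beyond) the last junction.
print(is_likely_nmd_target(stop_codon_pos=950, junction_positions=[300, 600, 900]))  # False
# Unproductive isoform: an included PTC far upstream of the last junction.
print(is_likely_nmd_target(stop_codon_pos=400, junction_positions=[300, 600, 900]))  # True
```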
Thus, as with the previous examples, an initial breakthrough was achieved through the analysis of existing data, followed by further refinement and proof through experimental studies.
The functional consequences of alternative splicing decisions that produce nearly identical protein isoforms have also been assessed using publicly available datasets. In this case, introns ending in NAGNAG (a tandem duplication of the 3′ splice site NAG) had previously been shown, based on publicly available EST data, to be alternatively spliced such that the resulting protein isoforms differ by only a single amino acid (Hiller et al., 2004; Hiller and Platzer, 2008). However, whether these small differences are regulated or stochastic has been questioned (Chern et al., 2006). The authors of a more recent study first analyzed their own experimental data and confirmed that NAGNAG splicing is broadly used in human and mouse tissues. Motivated by these findings, the authors also mined the extensive collection of RNA-seq datasets generated by the Drosophila and C. elegans modENCODE projects. Strikingly, 500 NAGNAG splice sites were found to be alternatively spliced in at least one of 30 developmental time points in Drosophila, while NAGNAG splicing in C. elegans was considerably less dynamic. Approximately 5–14% of alternatively spliced NAGNAGs were found to be developmentally regulated and conserved, such that the most dynamically spliced NAGNAGs were associated with the greatest intronic sequence conservation. While the mechanisms regulating NAGNAG splicing remain unclear, these analyses provide the best evidence to date that even small changes in splicing are commonly regulated (at least in some animals).
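Identifying candidate NAGNAG acceptors of the kind analyzed above is itself a simple pattern-matching exercise: a tandem 3′ splice site is two consecutive NAG triplets at the intron's 3′ end. A minimal sketch (the sequences are invented examples, and real pipelines would of course also verify splicing support from the data):

```python
import re

# Sketch: detect candidate NAGNAG (tandem acceptor) 3' splice sites.
# A NAGNAG acceptor is two consecutive NAG triplets at the intron's 3' end,
# offering the spliceosome two acceptor choices three nucleotides apart.

NAGNAG = re.compile(r"[ACGT]AG[ACGT]AG$")

def has_nagnag_acceptor(intron_seq):
    """True if the last six nucleotides of the intron match NAGNAG."""
    return bool(NAGNAG.search(intron_seq.upper()))

print(has_nagnag_acceptor("GTAAGTTTTTCAGCAG"))  # True: intron ends in CAGCAG
print(has_nagnag_acceptor("GTAAGTTTTTTTTCAG"))  # False: only a single NAG
```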
The above examples demonstrate the utility of large datasets to assess the prevalence of intriguing phenomena. With large collections of datasets, the prevalence of tissue-specific or developmental regulation can be estimated with tremendous breadth. In such pursuits, however, it is essential to accurately assess the false discovery rate of the analysis. In the above example of NAGNAG splicing, the mean false discovery rates for technical and biological replicates were estimated at a very reasonable 4.4% and 1.1%, respectively. This is another area where heeding MacArthur’s rule is well advised.
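One common way to obtain replicate-based false discovery estimates like those quoted above is to treat differences called between replicates of the same sample as a null distribution, and compare their frequency to the calls made between genuinely distinct conditions. The sketch below illustrates this general idea with invented inclusion-level (PSI) values and an arbitrary cutoff; it is not the specific procedure used in the NAGNAG study:

```python
# Sketch: replicate-based false discovery estimation. Calls made between
# replicates of the same sample approximate the null; their frequency
# relative to calls between distinct conditions gives a rough empirical FDR.
# The cutoff and all values below are illustrative assumptions.

def call_changes(psi_a, psi_b, cutoff=0.2):
    """Count events whose inclusion levels (PSI) differ by more than cutoff."""
    return sum(1 for a, b in zip(psi_a, psi_b) if abs(a - b) > cutoff)

def empirical_fdr(replicate_pairs, condition_pairs, cutoff=0.2):
    """FDR estimate: null calls (between replicates) over real calls."""
    null_calls = sum(call_changes(a, b, cutoff) for a, b in replicate_pairs)
    real_calls = sum(call_changes(a, b, cutoff) for a, b in condition_pairs)
    return null_calls / real_calls if real_calls else 0.0

# Toy data: two technical replicates mostly agree; two conditions differ.
rep1 = [0.10, 0.50, 0.90, 0.30]
rep2 = [0.12, 0.48, 0.88, 0.61]   # one spurious "difference"
cond = [0.80, 0.10, 0.20, 0.30]
fdr = empirical_fdr([(rep1, rep2)], [(rep1, cond)])
print(fdr)  # 1 null call / 3 real calls
```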
Here we have highlighted successful strategies for supervised data mining of existing data. Although not discussed at length here, machine learning algorithms also show great promise for researchers seeking to analyze large datasets in an unbiased manner. A machine learning algorithm has recently been used to write a first draft of the splicing code, a set of rules so expansive that it is best referenced with the aid of a computer (Barash et al., 2010). This approach led Barash et al. to identify novel sequence motifs associated with regulated alternative splicing as well as a new example of developmentally regulated unproductive splicing. As the quantity of available datasets increases, it is almost certain that such methods will become more broadly used for analyses.
In a recent poll, most scientists reported that they “rarely” accessed data or used datasets from the published literature for their original research papers (Science Staff, 2011). However, this will undoubtedly change. The opportunities and challenges of ‘Big Data’ are not only being felt in the biological sciences, but in society at large (Lohr, 2012). Thus, students will gain transferable skills from exercises that teach basic workflows using large datasets and scripting languages.
At a minimum, we envision teaching exercises that require biology students to devise an experimental design using only existing data, access relevant datasets from archives, parse and integrate data using programming languages such as Perl, Python, Ruby or R, and apply an appropriate visualization technique. Laboratory protocols for the use of analytic software currently exist to aid these pursuits (e.g., Cufflinks) (Trapnell et al., 2012). Such exercises will empower students to explore and assess the quantitative data published in the manuscripts that they read, which can no longer be assessed at a glance like the qualitative gel-based results on which molecular biology was founded. Ultimately, knowing how to write code will be as important as knowing how to pipette.
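To give a flavor of such an exercise, the sketch below parses a small tab-delimited expression table, of the kind a student might download from an archive such as GEO, and reports the highest-expressed gene in each sample. The file format, gene names, and values are invented for illustration:

```python
import csv
import io

# Sketch of a beginner exercise: parse a tab-delimited expression table and
# report the highest-expressed gene in each sample. The embedded table stands
# in for a file downloaded from a public archive.

table = """gene\tliver\tbrain
GAPDH\t95.2\t88.1
ALB\t900.4\t1.3
GFAP\t0.8\t412.7
"""

def top_gene_per_sample(handle):
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    best = {s: (None, float("-inf")) for s in samples}
    for row in reader:
        gene, values = row[0], [float(v) for v in row[1:]]
        for sample, value in zip(samples, values):
            if value > best[sample][1]:
                best[sample] = (gene, value)
    return {s: g for s, (g, _) in best.items()}

print(top_gene_per_sample(io.StringIO(table)))  # {'liver': 'ALB', 'brain': 'GFAP'}
```

From here, the same exercise naturally extends to reading a real file from disk and plotting the full expression distributions rather than a single maximum.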
We thank members of the Graveley lab for discussions and comments on the manuscript. Work in the Graveley lab is supported by NIH grants R01GM067842, R01GM095296, and U54HG007005 to B.R.G. and U54HG006994 and U01HG004271 to Susan E. Celniker and a Ruth L. Kirschstein National Research Service Award (F32GM105264) to A.M.P.