

HHS Public Access; Author Manuscript; accepted for publication in a peer-reviewed journal.
Mol Cell. Author manuscript; available in PMC 2014 February 21.
Published in final edited form as:
PMCID: PMC3590807

New insights from existing sequence data: generating breakthroughs without a pipette


With the rapidly declining cost of data generation, and the accumulation of massive datasets, molecular biology is entering an era in which incisive analysis of existing data will play an increasingly prominent role in the discovery of new biological phenomena and the elucidation of molecular mechanisms. Here, we discuss resources of publicly available sequencing data most useful for interrogating the mechanisms of gene expression. Existing next-generation sequence datasets, however, come with significant challenges in the form of technical and bioinformatic artifacts, which we discuss in detail. We also recount several breakthroughs made largely through the analysis of existing data, primarily in the RNA field.


Recent technological breakthroughs in DNA sequencing have vastly accelerated the rate and greatly reduced the cost of generating high-throughput molecular data. The cost of nucleotide sequencing, for example, is falling faster than even Moore’s law for integrated circuits. Given sufficient download bandwidth, storage capacity and a desktop computer, anyone with the appropriate tools can analyze these existing molecular datasets to address myriad questions in molecular biology. Furthermore, increasing accessibility to supercomputers and cloud computing allows sophisticated analyses to be performed by an ever greater number of scientists.

Because individual laboratories or consortia producing and analyzing large-scale datasets do not (and typically cannot) explore every possible hypothesis that is supported by their data, opportunities abound to test new ideas that may not have been considered previously. In terms of individual datasets, the opportunities are vast. The NCBI Gene Expression Omnibus (GEO) has archived more than 32,000 microarray and sequencing studies that comprise over 800,000 samples since 2001 (Barrett et al., 2013). The Sequence Read Archive (SRA), which maintains sequence data that are either submitted directly to the SRA or extracted from GEO submissions, currently hosts over 1 petabase (Kodama et al., 2011) and is one of the largest datasets hosted by Google. Clearly, access to molecular data has never been greater.

Before diving head first into this immense sea of data, it is essential to first identify publicly available datasets that constitute the equivalent of a properly controlled experiment. For example, can datasets be identified that are derived from matched biological samples? In some cases, answers to these questions can be easily obtained from metadata associated with each dataset at the GEO and SRA databases. In other cases, however, this information may be difficult to find, incomplete, or even incorrect. Furthermore, expert technical knowledge of the experimental procedures used to generate the datasets is required to assess potential technical artifacts and other caveats. Below, we identify particularly useful collections of publicly available data and discuss common technical artifacts that should be taken into consideration when analyzing next generation sequencing (NGS) datasets. To demonstrate the utility of analyzing existing data, we highlight successful approaches that have generated new ideas regarding the mechanisms that regulate gene expression.

Identifying useful datasets

Related dataset collections that are published together as a resource provide one solution to the problem of identifying comparable datasets. For example, Keji Zhao’s group at the NIH has generated one of the most comprehensive ChIP-seq studies of epigenomic information from a single human cell type: resting CD4+ T cells (Barski et al., 2007; Schones et al., 2008; Wang et al., 2008b). These particular datasets have been analyzed by many other investigators to identify specific chromatin marks that combine to constitute “chromatin states” (Ernst and Kellis, 2010; Hon et al., 2009), domains (Shu et al., 2011), and boundaries (Wang et al., 2012), or those marks that are best correlated with tissue-specific gene expression (Pekowska et al., 2010; Visel et al., 2009) or gene architecture (Andersson et al., 2009; Hon et al., 2009; Huff et al., 2010; Schwartz et al., 2009; Spies et al., 2009; Tilgner et al., 2009). These studies demonstrate the range of information that can be gleaned from a single large, coherent collection of datasets. One reason these studies were so successful is that all of the datasets were generated by a single group, using a consistent method, from a single cell type. Identifying similarly coherent datasets among the vast sea of the GEO, SRA and other data repositories, however, can be a challenge.

Initiatives such as the 1000 Genomes (Clarke et al., 2012; The 1000 Genomes Project Consortium et al., 2010), The Cancer Genome Atlas, ENCODE (ENCODE Consortium et al., 2012), modENCODE (modENCODE Consortium et al., 2010; Gerstein et al., 2010), and the Epigenomics Roadmap (Chadwick, 2012) projects provide additional resources that are ideal for data integration (Table 1). Because these initiatives are specifically tasked with providing high-quality resources, common sets of biological samples, reagents, and methods are defined for each project. In the case of ENCODE and modENCODE datasets, established standards must be met for data to be released (Consortium, 2011). For example, the modENCODE project has evaluated the specificity and efficiency of commercial antibodies that are commonly used to generate ChIP-seq datasets (of which, fewer than 75% passed muster) (Egelhofer et al., 2010). Thus, when using data from one of these public projects, users can be reasonably confident that the quality of the data meets or exceeds certain standards.

Table 1
Summary of selected datasets available from ongoing genome projects

These initiatives also generate important control datasets as resources. For example, it is most appropriate to align sequence reads from RNA-seq and ChIP-seq experiments to the cell line-, strain- or individual-specific genome rather than to a generic reference genome. Otherwise, single-nucleotide polymorphisms (SNPs) may be mistaken for RNA editing sites and copy-number variations (CNVs) may be mistaken for differential gene expression or changes in chromatin structure (Pickrell et al., 2011; Schrider et al., 2011). Accordingly, many of these initiatives have resequenced the genomes of the cell lines, strains, and individuals used in the projects.

Lastly, more than 1,500 curated databases are described in the Nucleic Acids Research online Molecular Biology Database Collection, many of which collect and integrate existing data to produce user-friendly, searchable websites (Fernandez-Suarez and Galperin, 2013). As the field of bioinformatics has evolved to more effectively tackle specific questions in biology (Butte, 2009), cutting-edge databases have been designed to place mechanistic hypotheses within arm’s reach of investigators by automating novel data integration strategies. For example, the HaploReg database integrates user-defined genome-wide association study results with linkage disequilibrium (The 1000 Genomes Project Consortium, 2010), sequence conservation information (Lindblad-Toh et al., 2011), and chromatin structure (Ernst et al., 2011) to link disease-associated genetic variation with putative regulatory elements (Ward and Kellis, 2012). Similarly, many of the large initiatives (e.g. ENCODE and modENCODE) also provide unified exploratory tools (e.g. UCSC and IGV genome browsers) that allow straightforward evaluation of genomic regions of interest.

NGS datasets: prospects and best practices

New advances in NGS technologies are greatly expanding the current volume and the range of existing data (Metzker, 2010). As there is no evidence that innovations in sequencing technology are slowing down, it can only be anticipated that the pace of generating sequence data will continue to increase and the cost will decrease. By the start of 2012, approximately 75,000 genomic, 15,000 transcriptome and 15,000 epigenomic submissions had been contributed to the SRA (Figure 1A). However, that volume of data represents only the tip of the iceberg, as transcriptome and epigenomic applications will be extended to a greater range of cell types and species. Indeed, the number of transcriptome and epigenomic submissions has been steadily increasing, particularly in recent years (Figure 1B).

Figure 1
An overview of the publicly available data at the Sequence Read Archive (SRA) based on user-submitted metadata

Transcriptome and epigenomic applications have been applied most liberally to humans and model organisms such as mouse, fly, worm and yeast (Figure 1C). While the transcriptome datasets consist almost entirely of RNA-seq experiments, the epigenomic datasets are generated using a large collection of methods that interrogate various aspects of chromatin structure. The epigenomic experiments include methods to assess DNA accessibility (Boyle et al., 2008), DNA methylation (Laird, 2010), the genomic locations of transcription factors and chromatin marks (ChIP-seq) (Park, 2009; Schones and Zhao, 2008), and nucleosome positions (MNase-seq) (Jiang and Pugh, 2009) (Figure 1A, red bar graph). Additionally, specialized applications have also been submitted to the SRA that assess chromatin conformation (de Wit and de Laat, 2012; Dekker et al., 2002; Fullwood et al., 2010), RNA:protein interactions (Licatalosi and Darnell, 2010; Ule et al., 2005), RNA polymerase elongation (Churchman and Weissman, 2011; Core et al., 2008), and ribosome occupancy (Ingolia et al., 2009). In theory, any process related to nucleic acid metabolism can be assessed with the proper biochemical preparation, which makes NGS applications a rich and powerful source for integrative data analysis (Hawkins et al., 2010).

Furthermore, NGS datasets are extraordinarily rich. The sequence reads from a single experiment can provide a vast array of quantitative, positional and sequence information. For instance, RNA-seq datasets provide sufficient information to measure mRNA expression levels and alternative splicing, to identify transcriptional start sites and polyadenylation sites, and to identify instances of RNA editing. In certain cases, allele-specific gene expression, allele-specific splicing, and even trans-splicing can be measured. Further, RNA-seq can be used as a discovery tool to annotate novel coding and non-coding transcripts as well as chimeric transcripts that result from genomic rearrangements (Figure 2) (Martin and Wang, 2011; McManus et al., 2010; Ozsolak and Milos, 2011; Wang et al., 2009). Though these datasets are very rich, they must be analyzed carefully. Essentially every step involved in generating NGS data introduces detectable, sometimes substantial, biases or errors (Figure 3). This presents a particular challenge for data integration, since different sequencing platforms, biochemical procedures, and data processing methods are associated with unique caveats. With the proper controls, however, these effects can often be identified and accounted for in downstream analyses.

Figure 2
RNA-seq datasets are information rich
Figure 3
Stages at which artifacts, errors, and biases can be introduced in NGS experiments and analysis

NGS platforms have been in use long enough that biases attributable to library construction and sequencing have been evaluated in great detail. Data generated by the Illumina platform, for example, are subject to base-call errors that increase with read position due to phasing issues (Dohm et al., 2008) and underrepresentation of high and low GC-content reads (Dohm et al., 2008; Risso et al., 2011). As these technical issues are well characterized, popular analysis packages attempt to correct for such nucleotide biases. For example, Cufflinks, a popular program used to measure differential gene expression, empirically determines nucleotide biases present in RNA-seq datasets and corrects for them (Meacham et al., 2011; Trapnell et al., 2012). While this strategy vastly improves comparisons between independently generated datasets, and even from different sequencing platforms, third-generation sequencing platforms, such as those from IonTorrent, PacificBiosciences, and Oxford Nanopore (and other lurking companies), use radically different chemistries, the biases of which will need to be identified and remedied.

Less recognized are the myriad experiment-specific biases or artifacts that are introduced at nearly every step involved in preparing libraries and sequencing them. For example, it has clearly been shown that RNA-seq libraries prepared using random-hexamer priming display a systematic non-templated sequence profile at the beginning of reads which is primarily due to first-strand synthesis (Hansen et al., 2010). This technical artifact appears to be a major source of the controversial ~10,000 “RNA DNA differences” (proposed RNA editing sites) identified from human cell lines that were recently reported (Li et al., 2011) and subsequently called into question (Kleinman and Majewski, 2012; Lin et al., 2012; Pickrell et al., 2012). Similarly, ChIP-seq experiments over-represent regions of open chromatin, which can create false positives (Chen et al., 2012).

Another technical artifact associated with library preparations is template-switching that occurs during the amplification steps, which can give rise to molecules that did not exist in the initial biological sample. For RNA-seq experiments, this type of artifact can generate ‘chimeric’ RNAs that appear to be synthesized by trans-splicing or some unknown biological process (Gingeras, 2009; McManus et al., 2010). Sequence data from control libraries that assess the frequency of template-switching are absolutely essential to distinguish biologically derived chimeric RNAs from those generated artifactually (McManus et al., 2010).

In addition to experimental artifacts, bioinformatic artifacts can severely impact data interpretation. A major source of these artifacts is the mappability of NGS sequence reads, which are typically 25–100 nucleotides in length. Mapping these sequences to a reference genome can be particularly problematic due to the plethora of repetitive elements present in most genomes. Repetitive elements such as LINEs and SINEs have always presented difficulties for correctly mapping sequences, but the short size of NGS reads significantly amplifies this problem – as read length decreases, so does the number of unique regions that can be mapped within a reference genome (Treangen and Salzberg, 2012). Consequently, ‘mappability’ differs depending on read length. Such biases can create illusory non-random associations with biological features (e.g. exons) in ChIP- and MNase-seq experiments. For example, with 32 bp reads, tiny but common genomic features, such as coding starts, ends, exons and splice sites accumulate greater read densities than other local features (e.g. introns) (Schwartz et al., 2011). Thus, mappability must be carefully considered when interpreting any type of alignment data.
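The dependence of mappability on read length can be illustrated with a minimal Python sketch. The toy genome, the function name `unique_fraction`, and the exact-match criterion are illustrative assumptions and are not taken from any of the cited studies; real analyses would rely on aligner-specific mappability tracks computed over a full reference genome.

```python
from collections import Counter

def unique_fraction(genome: str, read_len: int) -> float:
    """Fraction of start positions whose read_len-mer occurs exactly once in the genome."""
    kmers = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(kmers)
    return sum(1 for k in kmers if counts[k] == 1) / len(kmers)

# Hypothetical toy genome: two copies of a 10-nt repeat embedded in unrelated sequence.
repeat = "ACGTTGCAAC"
genome = "GATTACAGGC" + repeat + "TTCCGGAATG" + repeat + "CCATAGTCAA"

# Compare short and longer simulated read lengths (forward strand only).
for read_len in (4, 8, 12):
    print(read_len, round(unique_fraction(genome, read_len), 2))
```

Reads that fall entirely within a repeat copy cannot be placed uniquely, whereas reads long enough to span a repeat boundary can, so the uniquely mappable fraction would be expected to rise with read length. The sketch ignores the reverse strand and mismatches, both of which matter in practice.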

A second bioinformatic artifact is caused by genetic variation that has not been accounted for. CNVs that differ between experimental samples and reference genomes, for example, can create false-positive enrichment regions in ChIP-seq experiments (Pickrell et al., 2011). Studies of RNA editing are particularly susceptible to high false-discovery rates if SNPs are not accounted for in the analysis. In the case of RNA DNA differences (Li et al., 2011), 55% match the genome of at least one of the 27 individual genomes used in the original analysis, which suggests that the relatively low coverage (2–6×) of these genomes was not sufficient to identify and eliminate confounding SNPs (Schrider et al., 2011). Thus, a matched reference genome – with sufficiently deep coverage – should be used when mapping short reads since experimental samples such as cell lines, strains, and individuals may differ in their SNPs, CNVs, and chromosome number. Even then, care must be taken to ensure that interesting results are not correlated with regions of shallow sequence depth.
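As a rough sketch of the filtering logic described above, candidate RNA-DNA mismatches can be screened against known genomic variants and against genomic sequencing depth. The data structures, the function name `filter_candidates`, and the `min_depth` cutoff of 10 are hypothetical choices made for illustration; they do not reproduce the pipeline of any cited study.

```python
def filter_candidates(candidates, known_snps, genome_depth, min_depth=10):
    """Keep RNA-DNA mismatches that are not explained by a known variant
    and that lie where the matched genome was sequenced deeply enough.

    candidates:   list of (chrom, pos, genome_base, rna_base)
    known_snps:   set of (chrom, pos, alt_base) from the matched genome
    genome_depth: dict mapping (chrom, pos) to genomic read depth
    """
    kept = []
    for chrom, pos, genome_base, rna_base in candidates:
        if (chrom, pos, rna_base) in known_snps:
            continue  # mismatch is explained by a genomic variant
        if genome_depth.get((chrom, pos), 0) < min_depth:
            continue  # genome too shallow here to rule out an undetected SNP
        kept.append((chrom, pos, genome_base, rna_base))
    return kept

# Hypothetical example data.
candidates = [("chr1", 1050, "A", "G"), ("chr1", 2075, "A", "G"), ("chr2", 310, "C", "T")]
known_snps = {("chr1", 2075, "G")}
genome_depth = {("chr1", 1050): 30, ("chr1", 2075): 28, ("chr2", 310): 3}

print(filter_candidates(candidates, known_snps, genome_depth))
```

Treating a shallow site as unresolved rather than as SNP-free is the conservative choice: absence of a variant call at low coverage is weak evidence that no variant exists.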

NGS technologies are rapidly evolving. Consequently, robust computational methods lag behind this moving target. Thus, data generated by the early adoption of exciting new technologies should be evaluated, first and foremost, with critical attention to sequence biases. It is our opinion that novel phenomena will increasingly be discovered by the use of existing data. However, these phenomena are only as compelling as the support for an underlying mechanism. The vast potential for technical artifacts in NGS data is more than enough reason for caution, particularly since some technical artifacts are, in fact, not random and may correlate strongly with known biological features. Here, we are reminded of a sage warning made by Daniel MacArthur in reference to remarkable results obtained by analyzing large NGS datasets: “The more surprising a result seems to be, the less likely it is to be true” (MacArthur et al., 2012). We implore data analysts to heed this warning and perform extensive validation of remarkable findings, or they may indeed fall victim to MacArthur’s rule.

Successful uses of existing data

The Watson-Crick model of DNA – the double helix – was developed largely through model building informed by existing data (Watson and Crick, 1953b, a). Although structural evidence supporting the model required meticulous experimentation (Franklin and Gosling, 1953; Wilkins et al., 1953), the model alone suggested the basis of genetic inheritance as well as DNA replication and recombination (Watson and Crick, 1953a, b). Despite lacking any knowledge of the complex protein machines responsible for replication and recombination, the central predictions were, nonetheless, essentially proven within a decade (Alberts, 2003). Thus, the double helix exemplifies how the insightful analysis of existing data can revolutionize a field.

There are numerous examples where insights into the molecular mechanisms of various biological processes have been gleaned from analyzing existing data. Below we highlight several of these and attempt to draw out the general aspects of each study as a lesson for how each approach can be applied to other problems. We focus on problems related to various aspects of RNA biology, though these approaches can be used for other molecular processes as well.

Identifying functional elements through conservation

Functionally important sequence elements are expected to be conserved over time. Thus, one way of investigating a particular process is to identify conserved sequence elements using alignments of multiple whole genome sequences. Conservation plots can be generated from such alignments using several software packages that calculate nucleotide substitution rates. Conveniently, conservation scores based on whole-genome alignments using phastCons (Siepel et al., 2005), phyloP (Pollard et al., 2010), or SiPhy (Garber et al., 2009) can be downloaded from the UCSC genome browser for many model organisms, including yeast, worm, fly and various mammals including mouse and human (Kent et al., 2002). Analyzing the identity, characteristics, and locations of conserved sequences can provide tremendous mechanistic insight. Below, we highlight two such examples, in the fields of RNA editing and alternative splicing, that utilized this approach.

Adenosine to inosine (A-to-I) editing of RNA is an evolutionarily conserved process catalyzed by the ADAR family of adenosine deaminases (reviewed in Rieder and Reenan, 2011). A mystery that dogged the field for some time was the paucity of known endogenous RNA targets (only a few chance discoveries had been described) despite evidence that the inosine content of mRNA isolated from brain tissues might be as high as 1 in every 17,000 nucleotides (Paul and Bass, 1998). Armed with the knowledge that ADAR mutations resulted in neurological defects (Palladino et al., 2000) and that ADAR editing required an RNA duplex formed between the targeted region (containing the edited adenosine) and a complementary sequence (Higuchi et al., 1993), Hoopengardner and colleagues (2003) searched for new targets of RNA editing among the neuronally expressed genes of Drosophila. In the case of para, which encodes a Na+ channel, the exon containing a known editing site is very highly conserved near the editing site, as is a region in the adjacent intron which basepairs with the edited exon. Based on this observation, Hoopengardner et al. (2003) reasoned that they might be able to identify new RNA editing targets by identifying very highly conserved exons. They therefore assessed 914 neuronally expressed genes to identify exons with a high level of sequence constraint between Drosophila melanogaster and Drosophila pseudoobscura. This approach proved highly productive, as 16 new editing sites were identified that were validated by cDNA sequencing (Hoopengardner et al., 2003) (Figure 4A illustrates one such example). Importantly, this use of comparative genomics demonstrated a previously unanticipated degree of phylogenetic conservation between A-to-I editing sites, solidified the RNA duplex-dependent mechanism of ADAR function, and provided a facile bioinformatic strategy for editing-site identification. Indeed, improved variations of this approach using existing EST:genome alignments (Levanon et al., 2004) or archived sequence chromatograms (Zaranek et al., 2010) have now greatly increased the number of high-confidence A-to-I editing candidates (reviewed in Wulff et al., 2011).
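The underlying screen, flagging exons with unusually high sequence constraint between two species, can be sketched in a few lines of Python. The identity measure, the 95% threshold, and the example sequences are assumptions made for illustration and are not the criteria used by Hoopengardner et al. (2003).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns with identical, non-gap bases."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)

def highly_conserved(exons: dict, threshold: float = 95.0) -> list:
    """Names of exons whose pairwise identity meets the (hypothetical) threshold."""
    return sorted(name for name, (a, b) in exons.items()
                  if percent_identity(a, b) >= threshold)

# Hypothetical pairwise alignments of orthologous exons from two species.
exons = {
    "exon_A": ("ATGGCTAAAGGTCTGACC", "ATGGCTAAAGGTCTGACC"),  # no substitutions at all
    "exon_B": ("ATGGCAAAGGGACTTACC", "ATGGCTAAAGGTCTGACC"),  # several substitutions
}

print(highly_conserved(exons))
```

The rationale is that most coding exons tolerate some substitutions over evolutionary time, so an exon with essentially none is a candidate for additional constraint, such as the need to maintain an RNA duplex.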

Figure 4
Comparative approaches reveal insights into the mechanisms of RNA editing and alternative splicing

A similar approach has been used to uncover novel mechanisms that regulate alternative splicing. In one particularly illustrative case, the Drosophila Dscam gene, conservation within introns was used to uncover a novel mechanism of mutually exclusive splicing. Dscam is a textbook example of the importance of alternative splicing in increasing protein diversity, as it may generate over 38,000 different protein isoforms (Schmucker et al., 2000). Each time the Dscam gene is transcribed, the pre-mRNA is spliced such that each mRNA contains one and only one exon from each of four exon clusters (exons 4, 6, 9, and 17, specifically). But at the time of its discovery, no previously described mechanism of mutually exclusive splicing could explain how the many variable exons of Dscam are spliced in a mutually exclusive manner.

Compared to exons, which are often highly conserved due to their coding potential, intronic regions typically have little conservation except at sites that have non-coding function, such as RNA splicing. Nucleotide alignments from 15 insect species revealed two types of conserved sequence elements in the introns of the exon 6 cluster. The first element, the docking site, was located between exon 5 and the first exon 6 variant. Importantly, the docking site was found only once in the exon 6 cluster, but was present in every species examined, even species that diverged over 450 million years ago. Selector sequences, the second type of sequence element, were located in introns upstream of each exon 6 variant. Based on their complementary sequences, the docking site and selector sites appeared to base-pair with one another; however, one and only one of the selector sequences could base-pair with the docking site at a time (Figure 4B). Thus, base pairing of one selector sequence with the docking site would be predicted to promote inclusion of that exon, while simultaneously inhibiting the splicing of the 47 other exon 6 variants. Though this mechanism was discovered purely by comparative genomics and bioinformatics, the elegance and universal conservation among insects lent credence to the proposed mechanism. Experimental confirmation of the model was subsequently obtained using mutagenized BACs containing the entire Dscam gene (May et al., 2011). Further demonstrating the power of this approach, additional docking site:selector sequences within the other Dscam clusters and even other alternatively-spliced genes have been identified largely on the basis of intronic conservation (Yang et al., 2011).
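The complementarity test at the heart of this model can be sketched as follows. The sequences are invented for illustration and are not the Dscam docking site or selector sequences; the function name `paired_fraction`, the ungapped antiparallel alignment, and the inclusion of G:U wobble pairs are likewise simplifying assumptions.

```python
# Watson-Crick pairs plus G:U wobble pairs, as found in RNA duplexes.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def paired_fraction(docking: str, selector: str) -> float:
    """Fraction of positions that can pair when two equal-length RNAs
    are aligned antiparallel without gaps."""
    if len(docking) != len(selector):
        raise ValueError("sequences must have equal length")
    if not docking:
        return 0.0
    paired = sum((d, s) in PAIRS for d, s in zip(docking, reversed(selector)))
    return paired / len(docking)

# Hypothetical sequences, written 5' to 3'.
docking_site = "GGCAUCC"
selectors = {"variant_1": "GGAUGCC", "variant_2": "AAAAAAA"}

for name, seq in selectors.items():
    print(name, round(paired_fraction(docking_site, seq), 2))
```

In a real analysis, candidate duplexes would more likely be evaluated with RNA folding software that accounts for bulges, loops, and free energy, rather than by ungapped pairing.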

These two examples illustrate how conservation can be used to identify functional elements that provided insight into the mechanisms of RNA editing and alternative splicing. However, these approaches can be used to study many other processes. For instance, candidate functions have been assigned to ~60% of the conserved sequences in mammals (Lindblad-Toh et al., 2011), yet the remaining ~40% of these elements have unknown functions. Moreover, another 10,000 regions of mammalian coding sequences are predicted to have overlapping functions; yet again, these functions are mostly unknown. Thus, a fruitful avenue of research is to use existing multiple sequence alignments to identify the conserved sequence elements associated with a gene or process of interest. Determining the functions of conserved sequences will likely require experimental approaches, but functions may also be inferred from sequence features or locations alone.

Functional relationships identified through data integration

More recently, analyses of existing data have made a significant impact on the burgeoning study of co-transcriptional splicing. A growing body of evidence now supports the notion that transcription and splicing are not only concurrent, but also coupled, such that transcriptional dynamics profoundly influence RNA splicing (Neugebauer, 2002). For example, the elongation rate of RNA polymerase can influence the propensity for exon skipping (Kornblihtt et al., 2004). Until recently, however, little was known about the relationship between co-transcriptional splicing and the chromatin context in which it takes place. By integrating existing epigenomic datasets with known splicing patterns, recent studies have generated exciting new hypotheses that intimately connect chromatin structure to RNA splicing.

A major challenge associated with studying chromatin structure is its immense complexity. Even the most fundamental unit of chromatin, the nucleosome, can differ between genomic regions in occupancy, positioning, and myriad post-translational modifications (a.k.a. chromatin marks). It has long been observed in S. cerevisiae, for instance, that the chromatin mark histone H3 lysine 36 trimethylation (H3K36me3) is enriched within the bodies of active genes (reviewed in Li et al., 2007). Thus H3K36me3, and many other marks, are thought to be intimately associated with transcriptional processes. Questions concerning whether chromatin marks might affect, or be affected by, splicing were rarely discussed until genome-wide ChIP surveys in C. elegans demonstrated higher levels of H3K36me3 at exons compared to nearby introns within the same gene (Kolasinska-Zwierz et al., 2009). As this finding was entirely unanticipated, Kolasinska-Zwierz et al. turned to publicly available H3K36me3 ChIP-seq data (Barski et al., 2007) to test the relevance of their findings to humans. By aggregating all of the H3K36me3 sequence reads that aligned near exons (a so-called “metagene” analysis), the authors observed strikingly similar H3K36me3 enrichment at the average human exon (Figure 5A and 5B, blue panel).
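A metagene analysis of this kind reduces to averaging a per-base signal over windows centered on a set of anchor positions. The sketch below assumes a single chromosome represented as a list of read counts and anchors given as (position, strand) pairs; the function name `metagene` and the example values are hypothetical, and the normalization steps used in the cited studies are omitted.

```python
def metagene(coverage, anchors, flank):
    """Average per-base signal in a window of +/- flank around each anchor.

    coverage: list of per-base read counts for one chromosome
    anchors:  list of (position, strand) pairs, e.g. exon start sites
    flank:    number of bases to include on each side of the anchor
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for pos, strand in anchors:
        start, end = pos - flank, pos + flank + 1
        if start < 0 or end > len(coverage):
            continue  # skip windows that run off the chromosome
        window = coverage[start:end]
        if strand == "-":
            window = window[::-1]  # orient every window 5' to 3'
        for i, value in enumerate(window):
            totals[i] += value
        used += 1
    return [t / used for t in totals] if used else totals

# Hypothetical coverage with two plus-strand anchors at positions 3 and 8.
coverage = [0, 0, 1, 5, 1, 0, 0, 2, 6, 2, 0]
print(metagene(coverage, [(3, "+"), (8, "+")], flank=1))
```

Reversing minus-strand windows before summing keeps upstream and downstream consistent across genes, which matters whenever the expected signal is asymmetric around the anchor.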

Figure 5
From observations at individual genes to genome-wide analyses

Furthermore, integrating H3K36me3 ChIP data with annotations of alternative and constitutive exons revealed a potential connection between the degree of chromatin marking and alternative splicing. A modest but suggestive decrease in H3K36me3 reads at alternatively spliced exons was observed in both worm and mouse (Kolasinska-Zwierz et al., 2009). Similar analyses have been somewhat conflicting (Hon et al., 2009; Spies et al., 2009), but this may simply indicate that the differences between constitutive and alternatively-spliced exons are subtle. Nonetheless, the notion that chromatin marks, and in particular H3K36me3, can affect alternative splicing is consistent with several recently reported experiments. Indeed, two H3K36me3-binding proteins, MRG15 and PSIP1, have been implicated in mediating alternative splicing regulation by the splicing factors PTB and SRSF1 (ASF/SF2), respectively (Luco et al., 2010; Pradeepa et al., 2012). Additional marks, such as H3K4me3 and H3 acetylations, have also been shown to influence splicing (Gunderson and Johnson, 2009; Sims et al., 2007). Most likely, chromatin marks function in splicing by modulating RNA polymerase elongation rates or by recruiting specific splicing factors to active genes (Luco et al., 2011 and references therein; Nilsen and Graveley, 2010). Intriguingly, depletion of the only known H3K36me3 methyltransferase influences the splicing of only a small, but significant, subset of PTB-dependent splicing events (Luco et al., 2010). Such gene-specific effects might also explain why genome-wide correlations between H3K36me3 and alternative splicing have been modest.

Numerous studies have since analyzed more than 41 chromatin marks using publicly available epigenomic datasets, which have yielded a strong consensus: chromatin structure reflects gene architecture. In humans, three types of exon/intron boundaries have been shown to be associated with particular chromatin marks: 1) H3K4me3 and H3K9Ac throughout the length of the first exon (Bieberstein et al., 2012), 2) H3K79me2 (among several other marks) throughout the length of the first intron (Huff et al., 2010), and most notably 3) H3K36me3 enrichment at internal exons (Andersson et al., 2009; Hon et al., 2009; Huff et al., 2010; Spies et al., 2009). Because these chromatin marks are intimately associated with transcription, these results suggest a much closer connection between the splicing and transcription machineries than previously thought.

Similarly, analysis of published MNase-seq data (Schones et al., 2008; Valouev et al., 2008) revealed that nucleosomes themselves were also highly associated with internal exons (Andersson et al., 2009; Huff et al., 2010; Schwartz et al., 2009; Spies et al., 2009; Tilgner et al., 2009). Based on this observation, a novel mechanism for exon definition has been proposed (but has yet to be proven), whereby nucleosome-occupied exons serve as RNA polymerase “speed bumps” that provide additional time for the spliceosomal machinery to recognize nearby splice sites (Schwartz et al., 2009; Tilgner et al., 2009).

In theory, elevated nucleosome occupancy at exons alone could explain the previously identified enrichment of H3K36me3 at exons (Schwartz et al., 2009; Tilgner et al., 2009). However, bioinformatic analyses comparing nucleosome occupancy at exons, exon-like composition regions (ECRs), and pseudoexons have uncovered evidence that the mechanisms determining nucleosome occupancy and H3K36me3 enrichment at exons are distinct. To discern which aspects of exon sequence might be necessary and sufficient for high nucleosome occupancy, sequence characteristics of exons were analyzed separately for high nucleosome occupancy. In this case, the average ECR, which is not flanked by splice sites but does have the same GC-content as the average exon, displayed nucleosome occupancies equal to that of the average exon (Spies et al., 2009) (Figure 5B, brown panel). Conversely, the average pseudoexon, which has lower GC-content than the average exon, was depleted for nucleosome occupancy (Tilgner et al., 2009). Thus, the DNA sequence composition of exons alone may be sufficient for exon-like nucleosome occupancy. Lastly, ECRs were not enriched for H3K36me3, which suggests that exon marking reflects some aspect of splicing rather than exon-like sequence composition (Huff et al., 2010) (Figure 5B, brown panel). Further supporting a role for the spliceosome in specifying chromatin structure, recent experiments in which splicing was inhibited by splice-site mutations or Spliceostatin exposure have indeed caused changes in H3K36me3 marking (De Almeida et al., 2011; Kim et al., 2011).

The above analyses demonstrate the power of data integration to establish new connections between related processes whose mechanistic links may yet be unclear. In some cases, these relationships can be readily observed at single genes using a genome browser to facilitate comparisons between datasets. In such cases, moving from observations at single loci to genome-wide analyses can be accomplished by aggregating values from genome-wide datasets at specific features of interest (Figure 5). Genome-wide relationships can also be revealed by plotting all relevant loci in a high-density heatmap that is aligned and sorted to highlight features of interest (Hawkins et al., 2010). The latter approach is particularly useful in instances where summary statistics such as genome-wide averages might be deceptive (e.g., the mean of a bimodal distribution). Thus, by integrating and aggregating existing data, a mere anecdote can be transformed into a global principle.
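
As a rough illustration of these two views, the following Python sketch computes an aggregate profile and a sorted set of heatmap rows from the same signal windows. The data are invented, and the helper names `aggregate_profile` and `sort_for_heatmap` are hypothetical; real analyses would draw windows from aligned genome-wide datasets and render the rows graphically.

```python
from statistics import mean


def aggregate_profile(rows):
    """Column-wise mean of equal-length signal windows (one row per locus)."""
    return [mean(column) for column in zip(*rows)]


def sort_for_heatmap(rows):
    """Order loci by total signal, highest first, as heatmap rows might be sorted."""
    return sorted(rows, key=sum, reverse=True)


# Toy signal windows centered on a feature of interest; values are invented.
# Half the loci carry strong signal and half carry none (a bimodal population).
windows = [
    [0, 10, 0],
    [0, 0, 0],
    [0, 10, 0],
    [0, 0, 0],
]

profile = aggregate_profile(windows)
heatmap_rows = sort_for_heatmap(windows)

print("aggregate profile:", profile)
for row in heatmap_rows:
    print(row)
```

The aggregate profile reports an intermediate central value that no individual locus actually shows, whereas the sorted rows make the two populations visible. This is the situation in which a heatmap is more informative than an average.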

Estimating the frequency and functional consequences of poorly characterized biological phenomena

Publicly available data can also be used to assess the prevalence and functional consequences of previously ignored biological phenomena. In animals, alternatively spliced genes are the norm; more than 92% of human genes produce at least one alternatively spliced transcript (Pan et al., 2008; Wang et al., 2008a). While many of these alternatively spliced transcripts are predicted to encode functionally distinct protein isoforms, others encode protein isoforms whose biological relevance is questionable. Thus, the perennial question: which of these splicing events are regulated and which are stochastic?

For instance, alternative splicing may introduce premature termination codons (PTCs) that target the message for degradation by nonsense-mediated decay (NMD). Such unproductively spliced transcripts could be regulated to function as post-transcriptional on/off switches, or could merely be splicing mistakes in need of triage. Only a few genes were previously known to be regulated by unproductive splicing. Publicly available EST data, on the other hand, suggested that nearly one-third of alternatively spliced transcripts were potential NMD targets (Lewis et al., 2003). This unexpectedly high prevalence brought new attention to the hypothesis that unproductive splicing might post-transcriptionally regulate the expression of entire classes of genes. Controversy initially surrounded the original EST-based estimates because experiments that depleted NMD factors to identify stabilized unproductively spliced transcripts yielded more conservative estimates. By microarray, only ~10% of cassette exons substantially elicited NMD (Pan et al., 2006) and tissue-specific regulation was found to be rare. More recent approaches using RNA-seq, which allow for all major forms of splicing to be considered, have brought these numbers closer to initial estimates, but only for some tissues (Weischenfeldt et al., 2012). Nonetheless, the question of whether a significant portion of unproductive splicing regulates the expression of entire classes of genes was answered through analyses that showed that unproductively spliced transcripts were enriched for genes encoding splicing factors and other RNA-binding proteins (Ni et al., 2007; Saltzman et al., 2008). Another experimental study demonstrated that the entire family of human SR proteins was associated with unproductive splicing (Lareau et al., 2007). These studies satisfyingly confirmed and extended previous reports of autoregulation by unproductive splicing (reviewed in McGlincy and Smith, 2008). Thus, as with the previous examples, an initial breakthrough was achieved through the analysis of existing data, followed by further refinement and proof through experimental studies.
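
To make the notion of a potential NMD target concrete, the sketch below applies a commonly used heuristic: a stop codon lying more than about 50 nucleotides upstream of the last exon-exon junction is treated as premature. The threshold, the way distance is measured, the function names, and the toy transcripts are all assumptions made for illustration; this is not a reconstruction of the pipeline used in any of the cited studies.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_end(mrna, cds_start):
    """Return the 0-based position just past the first in-frame stop codon, or None."""
    for i in range(cds_start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return i + 3
    return None


def is_potential_nmd_target(exon_seqs, cds_start, min_distance=50):
    """Flag a spliced transcript (list of exon sequences) using the ~50-nt heuristic.

    min_distance is an assumed threshold, not a value taken from the text.
    """
    if len(exon_seqs) < 2:
        return False  # no exon-exon junction, so the heuristic does not apply
    mrna = "".join(exon_seqs).upper()
    last_junction = len(mrna) - len(exon_seqs[-1])
    stop_end = first_stop_end(mrna, cds_start)
    if stop_end is None:
        return False
    return last_junction - stop_end > min_distance


# Toy transcripts, invented for illustration only.
early_stop = ["ATGTAA" + "C" * 60, "GGGTTT"]    # stop far upstream of the last junction
normal_stop = ["ATG" + "GCC" * 20, "GCCTAAGGG"]  # stop inside the last exon

print(is_potential_nmd_target(early_stop, cds_start=0))
print(is_potential_nmd_target(normal_stop, cds_start=0))
```

In this toy case the first transcript is flagged and the second is not. As the discussion above makes clear, such sequence-based predictions only nominate candidates; whether a transcript is actually degraded by NMD requires experimental confirmation.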

The functional consequences of alternative splicing decisions that produce nearly identical protein isoforms have also been assessed using publicly available datasets. In this case, introns ending in NAGNAG (a tandem duplication of the 3′ splice site NAG) have been previously shown, based on publicly available EST data, to be alternatively spliced such that their protein isoforms differ by only a single amino acid (Hiller et al., 2004; Hiller and Platzer, 2008). However, whether these small differences are regulated or stochastic has been questioned (Chern et al., 2006). By first analyzing their own experimental data, the authors confirmed that there is broad use of NAGNAG splicing in human and mouse tissues. Motivated by these findings, the authors also mined the extensive collection of RNA-seq datasets generated by the Drosophila and C. elegans modENCODE projects. Strikingly, 500 NAGNAG splice sites were found to be alternatively spliced in at least one of 30 developmental time points in Drosophila, while NAGNAG splicing in C. elegans was considerably less dynamic. Approximately 5–14% of alternatively spliced NAGNAGs were found to be developmentally regulated and conserved, such that the most dynamically spliced NAGNAGs were associated with the greatest intronic sequence conservation. While the mechanisms regulating NAGNAG splicing remain unclear, these analyses provide the best evidence to date that even small changes in splicing are commonly regulated (at least in some animals).
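
The NAGNAG definition used here lends itself to a short Python sketch. It checks whether an intron sequence ends in the tandem motif and shows why the two products differ by three nucleotides, and hence by a single amino acid when the reading frame is preserved. The sequences are invented and the function names are hypothetical; this is an illustration of the definition, not the method of the cited studies.

```python
import re

# N is any base: two consecutive NAG triplets at the 3' end of the intron.
NAGNAG_AT_END = re.compile(r"[ACGT]AG[ACGT]AG$")


def ends_in_nagnag(intron):
    """True if the intron sequence (5' to 3') ends in the tandem acceptor motif NAGNAG."""
    return NAGNAG_AT_END.search(intron.upper()) is not None


def alternative_products(intron, exon):
    """Return the downstream exon as spliced at the downstream and at the upstream AG.

    Splicing at the upstream AG leaves the final NAG triplet in the mRNA,
    so the two products differ by exactly three nucleotides.
    """
    if not ends_in_nagnag(intron):
        raise ValueError("intron does not end in NAGNAG")
    at_downstream_ag = exon.upper()
    at_upstream_ag = intron.upper()[-3:] + at_downstream_ag
    return at_downstream_ag, at_upstream_ag


# Toy sequences, invented for illustration only.
intron = "GTAAGTTTTTTTTCAGCAG"
exon = "GTTGAA"

short_form, long_form = alternative_products(intron, exon)
print(ends_in_nagnag(intron), short_form, long_form)
```

Counting how often each form appears among reads spanning such a junction, across tissues or time points, is the kind of tally from which regulated NAGNAG usage could then be assessed.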

The above examples demonstrate the utility of large datasets to assess the prevalence of intriguing phenomena. With large collections of datasets, the prevalence of tissue-specific or developmental regulation can be estimated with tremendous breadth. In such pursuits, however, it is essential to accurately assess the false discovery rate of the analysis. In the above example of NAGNAG splicing, the mean false discovery rates for technical and biological replicates were estimated at a very reasonable 4.4% and 1.1%, respectively. This is another area where heeding MacArthur’s rule is well advised.

Unsupervised methods

Here we have highlighted successful strategies for supervised data mining of existing data. Although not discussed at length here, unsupervised methods like machine learning algorithms also demonstrate great promise for researchers seeking unbiased analyses of large datasets. A machine learning algorithm has recently been used to write a first draft of the splicing code, a set of rules so expansive that it is best referenced with the aid of a computer (Barash et al., 2010). This approach led Barash et al. to identify novel sequence motifs associated with regulated alternative splicing as well as a new example of developmentally regulated unproductive splicing. As the quantity of available datasets increases, it is almost certain that unsupervised methods will become more broadly used for analyses.

Analytical skills needed for the future

In a recent poll, most scientists reported that they “rarely” accessed data or used datasets from the published literature for their original research papers (Science Staff, 2011). However, this will undoubtedly change. The opportunities and challenges of ‘Big Data’ are being felt not only in the biological sciences, but also in society at large (Lohr, 2012). Thus, students will gain transferable skills from exercises that teach basic workflows using large datasets and scripting languages.

At a minimum, we envision teaching exercises that require biology students to devise an experimental design using only existing data, access relevant datasets from archives, parse and integrate data using programming languages such as Perl, Python, Ruby or R, and apply an appropriate visualization technique. Laboratory protocols for the use of analytic software currently exist to aid these pursuits (e.g., Cufflinks) (Trapnell et al., 2012). Such exercises will empower students to explore and assess the quantitative data published in the manuscripts that they read, which can no longer be assessed at a glance like the qualitative gel-based results on which molecular biology was founded. Ultimately, it will be as important to know how to write code as it is to know how to pipette.


We thank members of the Graveley lab for discussions and comments on the manuscript. Work in the Graveley lab is supported by NIH grants R01GM067842, R01GM095296, and U54HG007005 to B.R.G. and U54HG006994 and U01HG004271 to Susan E. Celniker and a Ruth L. Kirschstein National Research Service Award (F32GM105264) to A.M.P.




  • Alberts B. DNA replication and recombination. Nature. 2003;421:431–435. [PubMed]
  • Andersson R, Enroth S, Rada-Iglesias A, Wadelius C, Komorowski J. Nucleosomes are well positioned in exons and carry characteristic histone modifications. Genome Research. 2009;19:1732–1741. [PubMed]
  • Barash Y, Calarco JA, Gao W, Pan Q, Wang X, Shai O, Blencowe BJ, Frey BJ. Deciphering the splicing code. Nature. 2010;465:53–59. [PubMed]
  • Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 2013;41:D991–995. [PMC free article] [PubMed]
  • Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K. High-Resolution Profiling of Histone Methylations in the Human Genome. Cell. 2007;129:823–837. [PubMed]
  • Bieberstein NI, Carrillo Oesterreich F, Straube K, Neugebauer KM. First exon length controls active chromatin signatures and transcription. Cell reports. 2012;2:62–68. [PubMed]
  • Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, Furey TS, Crawford GE. High-resolution mapping and characterization of open chromatin across the genome. Cell. 2008;132:311–322. [PMC free article] [PubMed]
  • Butte AJ. Translational bioinformatics applications in genome medicine. Genome Medicine. 2009;1:64. [PMC free article] [PubMed]
  • Chadwick LH. The NIH Roadmap Epigenomics Program data resource. Epigenomics. 2012;4:317–324. [PMC free article] [PubMed]
  • Chen Y, Negre N, Li Q, Mieczkowska JO, Slattery M, Liu T, Zhang Y, Kim TK, He HH, Zieba J, et al. Systematic evaluation of factors influencing ChIP-seq fidelity. Nat Methods 2012 [PMC free article] [PubMed]
  • Chern TM, van Nimwegen E, Kai C, Kawai J, Carninci P, Hayashizaki Y, Zavolan M. A simple physical model predicts small exon length variations. PLoS Genet. 2006;2:e45. [PMC free article] [PubMed]
  • Churchman LS, Weissman JS. Nascent transcript sequencing visualizes transcription at nucleotide resolution. Nature. 2011;469:368–373. [PMC free article] [PubMed]
  • Clarke L, Zheng-Bradley X, Smith R, Kulesha E, Xiao C, Toneva I, Vaughan B, Preuss D, Leinonen R, Shumway M, et al. The 1000 Genomes Project: data management and community access. Nat Methods. 2012;9:459–462. [PMC free article] [PubMed]
  • Core LJ, Waterfall JJ, Lis JT. Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science. 2008;322:1845–1848. [PMC free article] [PubMed]
  • De Almeida SF, Grosso AR, Koch F, Fenouil R, Carvalho S, Andrade J, Levezinho H, Gut M, Eick D, Gut I, et al. Splicing enhances recruitment of methyltransferase HYPB/Setd2 and methylation of histone H3 Lys36. Nature Structural & Molecular Biology. 2011;18:977–983. [PubMed]
  • de Wit E, de Laat W. A decade of 3C technologies: insights into nuclear organization. Genes Dev. 2012;26:11–24. [PubMed]
  • Dekker J, Rippe K, Dekker M, Kleckner N. Capturing chromosome conformation. Science. 2002;295:1306–1311. [PubMed]
  • Dohm JC, Lottaz C, Borodina T, Himmelbauer H. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research. 2008;36:e105. [PMC free article] [PubMed]
  • ENCODE Project Consortium. A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 2011;9:e1001046. [PMC free article] [PubMed]
  • ENCODE Project Consortium. Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, Doyle F, Epstein CB, Frietze S, Harrow J, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. [PMC free article] [PubMed]
  • Egelhofer TA, Minoda A, Klugman S, Lee K, Kolasinska-Zwierz P, Alekseyenko AA, Cheung MS, Day DS, Gadel S, Gorchakov AA, et al. An assessment of histone-modification antibody quality. Nature Structural & Molecular Biology. 2010;18:91–93. [PMC free article] [PubMed]
  • Ernst J, Kellis M. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nature Biotechnology. 2010;28:817–825. [PMC free article] [PubMed]
  • Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Zhang X, Wang L, Issner R, Coyne M, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011;473:43–49. [PMC free article] [PubMed]
  • Fernandez-Suarez XM, Galperin MY. The 2013 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Res. 2013;41:D1–7. [PMC free article] [PubMed]
  • Franklin RE, Gosling RG. Molecular configuration in sodium thymonucleate. Nature. 1953;171:740–741. [PubMed]
  • Fullwood MJ, Han Y, Wei CL, Ruan X, Ruan Y. Chromatin interaction analysis using paired-end tag sequencing. In: Ausubel Frederick M, et al., editors. Current protocols in molecular biology. Unit 21. Chapter 21. 2010. pp. 15pp. 21–25. [PubMed]
  • Garber M, Guttman M, Clamp M, Zody MC, Friedman N, Xie X. Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics (Oxford, England) 2009;25:i54–62. [PMC free article] [PubMed]
  • Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. Genomes Project C. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. [PMC free article] [PubMed]
  • Gerstein MB, Lu ZJ, Van Nostrand EL, Cheng C, Arshinoff BI, Liu T, Yip KY, Robilotto R, Rechtsteiner A, Ikegami K, et al. Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project. Science. 2010;330:1775–1787. [PMC free article] [PubMed]
  • Gingeras TR. Implications of chimaeric non-co-linear transcripts. Nature. 2009;461:206–211. [PMC free article] [PubMed]
  • Gunderson FQ, Johnson TL. Acetylation by the transcriptional coactivator Gcn5 plays a novel role in co-transcriptional spliceosome assembly. PLoS Genet. 2009;5:e1000682. [PMC free article] [PubMed]
  • Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research. 2010;38:e131. [PMC free article] [PubMed]
  • Hawkins RD, Hon GC, Ren B. Next-generation genomics: an integrative approach. Nat Rev Genet. 2010;11:476–486. [PMC free article] [PubMed]
  • Higuchi M, Single FN, Köhler M, Sommer B, Sprengel R, Seeburg PH. RNA editing of AMPA receptor subunit GluR-B: A base-paired intron-exon structure determines position and efficiency. Cell. 1993;75:1361–1370. [PubMed]
  • Hiller M, Huse K, Szafranski K, Jahn N, Hampe J, Schreiber S, Backofen R, Platzer M. Widespread occurrence of alternative splicing at NAGNAG acceptors contributes to proteome plasticity. Nat Genet. 2004;36:1255–1257. [PubMed]
  • Hiller M, Platzer M. Widespread and subtle: alternative splicing at short-distance tandem sites. Trends in genetics : TIG. 2008;24:246–255. [PubMed]
  • Hon G, Wang W, Ren B. Discovery and Annotation of Functional Chromatin Signatures in the Human Genome. PLoS Comput Biol. 2009;5:e1000566. [PMC free article] [PubMed]
  • Hoopengardner B, Bhalla T, Staber C, Reenan R. Nervous System Targets of RNA Editing Identified by Comparative Genomics. Science. 2003;301:832–836. [PubMed]
  • Huff JT, Plocik AM, Guthrie C, Yamamoto KR. Reciprocal intronic and exonic histone modification regions in humans. Nature Structural & Molecular Biology. 2010;17:1495–1499. [PMC free article] [PubMed]
  • Ingolia NT, Ghaemmaghami S, Newman JR, Weissman JS. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science. 2009;324:218–223. [PMC free article] [PubMed]
  • Jiang C, Pugh BF. Nucleosome positioning and gene regulation: advances through genomics. Nat Rev Genet. 2009;10:161–172. [PMC free article] [PubMed]
  • Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Research. 2002;12:996–1006. [PubMed]
  • Kim S, Kim H, Fong N, Erickson B, Bentley DL. Pre-mRNA splicing is a determinant of histone H3K36 methylation. Proceedings of the National Academy of Sciences of the United States of America. 2011;108:13564–13569. [PubMed]
  • Kleinman CL, Majewski J. Comment on “Widespread RNA and DNA sequence differences in the human transcriptome” Science. 2012;335:1302. author reply 1302. [PubMed]
  • Kodama Y, Shumway M, Leinonen R, International Nucleotide Sequence Database Collaboration. The sequence read archive: explosive growth of sequencing data. Nucleic Acids Research. 2011;40:D54–D56. [PMC free article] [PubMed]
  • Kolasinska-Zwierz P, Down T, Latorre I, Liu T, Liu XS, Ahringer J. Differential chromatin marking of introns and expressed exons by H3K36me3. Nature Genetics. 2009;41:376–381. [PMC free article] [PubMed]
  • Kornblihtt AR, de la Mata M, Fededa JP, Munoz MJ, Nogues G. Multiple links between transcription and splicing. RNA (New York, NY) 2004;10:1489–1498. [PubMed]
  • Laird PW. Principles and challenges of genome-wide DNA methylation analysis. Nat Rev Genet. 2010;11:191–203. [PubMed]
  • Lareau LF, Inada M, Green RE, Wengrod JC, Brenner SE. Unproductive splicing of SR genes associated with highly conserved and ultraconserved DNA elements. Nature. 2007;446:926–929. [PubMed]
  • Levanon EY, Eisenberg E, Yelin R, Nemzer S, Hallegger M, Shemesh R, Fligelman ZY, Shoshan A, Pollock SR, Sztybel D, et al. Systematic identification of abundant A-to-I editing sites in the human transcriptome. Nat Biotech. 2004;22:1001–1005. [PubMed]
  • Levanon EY, Hallegger M, Kinar Y, Shemesh R, Djinovic-Carugo K, Rechavi G, Jantsch MF, Eisenberg E. Evolutionarily conserved human targets of adenosine to inosine RNA editing. Nucleic Acids Research. 2005;33:1162–1168. [PMC free article] [PubMed]
  • Lewis BP, Green RE, Brenner SE. Evidence for the widespread coupling of alternative splicing and nonsense-mediated mRNA decay in humans. Proceedings of the National Academy of Sciences of the United States of America. 2003;100:189–192. [PubMed]
  • Li B, Carey M, Workman JL. The role of chromatin during transcription. Cell. 2007;128:707–719. [PubMed]
  • Li M, Wang IX, Li Y, Bruzel A, Richards AL, Toung JM, Cheung VG. Widespread RNA and DNA Sequence Differences in the Human Transcriptome. Science 2011 [PMC free article] [PubMed]
  • Licatalosi DD, Darnell RB. RNA processing and its regulation: global insights into biological networks. Nature Reviews. Genetics. 2010;11:75–87. [PMC free article] [PubMed]
  • Lin W, Piskol R, Tan MH, Li JB. Comment on “Widespread RNA and DNA sequence differences in the human transcriptome” Science. 2012;335:1302. author reply 1302. [PubMed]
  • Lindblad-Toh K, Garber M, Zuk O, Lin MF, Parker BJ, Washietl S, Kheradpour P, Ernst J, Jordan G, Mauceli E, et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature. 2011;478:476–482. [PMC free article] [PubMed]
  • Lohr S. Big Data’s Impact in the World. The New York Times; 2012.
  • Luco RF, Allo M, Schor IE, Kornblihtt AR, Misteli T. Epigenetics in alternative pre-mRNA splicing. Cell. 2011;144:16–26. [PMC free article] [PubMed]
  • Luco RF, Pan Q, Tominaga K, Blencowe BJ, Pereira-Smith OM, Misteli T. Regulation of alternative splicing by histone modifications. Science (New York, NY) 2010;327:996–1000. [PMC free article] [PubMed]
  • MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB, et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science. 2012;335:823–828. [PMC free article] [PubMed]
  • Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet. 2011;12:671–682. [PubMed]
  • May GE, Olson S, McManus CJ, Graveley BR. Competing RNA secondary structures are required for mutually exclusive splicing of the Dscam exon 6 cluster. RNA. 2011;17:222–229. [PubMed]
  • McGlincy NJ, Smith CW. Alternative splicing resulting in nonsense-mediated mRNA decay: what is the meaning of nonsense? Trends Biochem Sci. 2008;33:385–393. [PubMed]
  • McManus CJ, Duff MO, Eipper-Mains J, Graveley BR. Global analysis of trans-splicing in Drosophila. Proc Natl Acad Sci U S A. 2010;107:12975–12979. [PubMed]
  • Meacham F, Boffelli D, Dhahbi J, Martin DI, Singer M, Pachter L. Identification and correction of systematic error in high-throughput sequence data. BMC Bioinformatics. 2011;12:451. [PMC free article] [PubMed]
  • Metzker ML. Sequencing technologies - the next generation. Nature reviews. Genetics. 2010;11:31–46. [PubMed]
  • Roy S, Ernst J, Kharchenko PV, Kheradpour P, Negre N, Eaton ML, Landolin JM, Bristow CA, Ma L, et al. modEncode Consortium. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science. 2010;330:1787–1797. [PMC free article] [PubMed]
  • Neugebauer KM. On the importance of being co-transcriptional. Journal of Cell Science. 2002;115:3865–3871. [PubMed]
  • Ni JZ, Grate L, Donohue JP, Preston C, Nobida N, O’Brien G, Shiue L, Clark TA, Blume JE, Ares M., Jr Ultraconserved elements are associated with homeostatic control of splicing regulators by alternative splicing and nonsense-mediated decay. Genes & Development. 2007;21:708–718. [PubMed]
  • Nilsen TW, Graveley BR. Expansion of the eukaryotic proteome by alternative splicing. Nature. 2010;463:457–463. [PMC free article] [PubMed]
  • Ozsolak F, Milos PM. RNA sequencing: advances, challenges and opportunities. Nat Rev Genet. 2011;12:87–98. [PMC free article] [PubMed]
  • Palladino MJ, Keegan LP, O’Connell MA, Reenan RA. A-to-I pre-mRNA editing in Drosophila is primarily involved in adult nervous system function and integrity. Cell. 2000;102:437–449. [PubMed]
  • Pan Q, Saltzman AL, Kim YK, Misquitta C, Shai O, Maquat LE, Frey BJ, Blencowe BJ. Quantitative microarray profiling provides evidence against widespread coupling of alternative splicing with nonsense-mediated mRNA decay to control gene expression. Genes & Development. 2006;20:153–158. [PubMed]
  • Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature Genetics. 2008;40:1413–1415. [PubMed]
  • Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009;10:669–680. [PMC free article] [PubMed]
  • Paul MS, Bass BL. Inosine exists in mRNA at tissue-specific levels and is most abundant in brain mRNA. EMBO J. 1998;17:1120–1127. [PubMed]
  • Pekowska A, Benoukraf T, Ferrier P, Spicuglia S. A unique H3K4me2 profile marks tissue-specific gene regulation. Genome Research. 2010;20:1493–1502. [PubMed]
  • Pickrell JK, Gaffney DJ, Gilad Y, Pritchard JK. False positive peaks in ChIP-seq and other sequencing-based functional assays caused by unannotated high copy number regions. Bioinformatics. 2011;27:2144–2146. [PMC free article] [PubMed]
  • Pickrell JK, Gilad Y, Pritchard JK. Comment on “Widespread RNA and DNA sequence differences in the human transcriptome” Science. 2012;335:1302. author reply 1302. [PubMed]
  • Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010;20:110–121. [PubMed]
  • Pradeepa MM, Sutherland HG, Ule J, Grimes GR, Bickmore WA. Psip1/Ledgf p52 Binds Methylated Histone H3K36 and Splicing Factors and Contributes to the Regulation of Alternative Splicing. PLoS Genet. 2012;8:e1002717. [PMC free article] [PubMed]
  • Rieder LE, Reenan RA. The intricate relationship between RNA structure, editing, and splicing. Seminars in Cell & Developmental Biology 2011 [PubMed]
  • Risso D, Schwartz K, Sherlock G, Dudoit S. GC-Content Normalization for RNA-Seq Data. BMC Bioinformatics. 2011;12:480. [PMC free article] [PubMed]
  • Saltzman AL, Kim YK, Pan Q, Fagnani MM, Maquat LE, Blencowe BJ. Regulation of multiple core spliceosomal proteins by alternative splicing-coupled nonsense-mediated mRNA decay. Molecular and Cellular Biology. 2008;28:4320–4330. [PMC free article] [PubMed]
  • Schmucker D, Clemens JC, Shu H, Worby CA, Xiao J, Muda M, Dixon JE, Zipursky SL. Drosophila Dscam is an axon guidance receptor exhibiting extraordinary molecular diversity. Cell. 2000;101:671–684. [PubMed]
  • Schones DE, Cui K, Cuddapah S, Roh TY, Barski A, Wang Z, Wei G, Zhao K. Dynamic regulation of nucleosome positioning in the human genome. Cell. 2008;132:887–898. [PubMed]
  • Schones DE, Zhao K. Genome-wide approaches to studying chromatin modifications. Nature Reviews. Genetics. 2008;9:179–191. [PubMed]
  • Schrider DR, Gout JF, Hahn MW. Very few RNA and DNA sequence differences in the human transcriptome. PLoS ONE. 2011;6:e25842. [PMC free article] [PubMed]
  • Schwartz S, Meshorer E, Ast G. Chromatin organization marks exon-intron structure. Nat Struct Mol Biol. 2009;16:990–995. [PubMed]
  • Schwartz S, Oren R, Ast G. Detection and removal of biases in the analysis of next-generation sequencing reads. PLoS ONE. 2011;6:e16685. [PMC free article] [PubMed]
  • Science Staff. Challenges and Opportunities. Science. 2011;331:692–693. [PubMed]
  • Shu W, Chen H, Bo X, Wang S. Genome-wide analysis of the relationships between DNaseI HS, histone modifications and gene expression reveals distinct modes of chromatin domains. Nucleic Acids Research. 2011;39:7428–7443. [PMC free article] [PubMed]
  • Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research. 2005;15:1034–1050. [PubMed]
  • Sims RJ, Millhouse S, Chen CF, Lewis BA, Erdjument-Bromage H, Tempst P, Manley JL, Reinberg D. Recognition of trimethylated histone H3 lysine 4 facilitates the recruitment of transcription postinitiation factors and pre-mRNA splicing. Molecular Cell. 2007;28:665–676. [PMC free article] [PubMed]
  • Spies N, Nielsen CB, Padgett RA, Burge CB. Biased Chromatin Signatures around Polyadenylation Sites and Exons. Molecular Cell. 2009;36:245–254. [PMC free article] [PubMed]
  • The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. [PMC free article] [PubMed]
  • Tilgner H, Nikolaou C, Althammer S, Sammeth M, Beato M, Valcárcel J, Guigó R. Nucleosome positioning as a determinant of exon recognition. Nature Structural & Molecular Biology. 2009;16:996–1001. [PubMed]
  • Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols. 2012;7:562–578. [PMC free article] [PubMed]
  • Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet. 2012;13:36–46. [PMC free article] [PubMed]
  • Ule J, Jensen K, Mele A, Darnell RB. CLIP: a method for identifying protein-RNA interaction sites in living cells. Methods. 2005;37:376–386. [PubMed]
  • Valouev A, Ichikawa J, Tonthat T, Stuart J, Ranade S, Peckham H, Zeng K, Malek JA, Costa G, McKernan K, et al. A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. Genome Research. 2008;18:1051–1063. [PubMed]
  • Visel A, Blow MJ, Li Z, Zhang T, Akiyama JA, Holt A, Plajzer-Frick I, Shoukry M, Wright C, Chen F, et al. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature. 2009;457:854–858. [PMC free article] [PubMed]
  • Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008a;456:470–476. [PMC free article] [PubMed]
  • Wang J, Lunyak VV, Jordan IK. Genome-wide prediction and analysis of human chromatin boundary elements. Nucleic Acids Research. 2012;40:511–529. [PMC free article] [PubMed]
  • Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews. Genetics. 2009;10:57–63. [PMC free article] [PubMed]
  • Wang Z, Zang C, Rosenfeld JA, Schones DE, Barski A, Cuddapah S, Cui K, Roh TY, Peng W, Zhang MQ, et al. Combinatorial patterns of histone acetylations and methylations in the human genome. Nature Genetics. 2008b;40:897–903. [PMC free article] [PubMed]
  • Ward LD, Kellis M. HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucleic Acids Research. 2012;40:D930–D934. [PMC free article] [PubMed]
  • Watson JD, Crick FH. Genetical implications of the structure of deoxyribonucleic acid. Nature. 1953a;171:964–967. [PubMed]
  • Watson JD, Crick FH. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature. 1953b;171:737–738. [PubMed]
  • Weischenfeldt J, Waage J, Tian G, Zhao J, Damgaard I, Jakobsen JS, Kristiansen K, Krogh A, Wang J, Porse BT. Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns. Genome biology. 2012;13:R35. [PMC free article] [PubMed]
  • Wilkins MH, Stokes AR, Wilson HR. Molecular structure of deoxypentose nucleic acids. Nature. 1953;171:738–740. [PubMed]
  • Wulff BE, Sakurai M, Nishikura K. Elucidating the inosinome: global approaches to adenosine-to-inosine RNA editing. Nat Rev Genet. 2011;12:81–85. [PMC free article] [PubMed]
  • Yang Y, Zhan L, Zhang W, Sun F, Wang W, Tian N, Bi J, Wang H, Shi D, Jiang Y, et al. RNA secondary structure in mutually exclusive splicing. Nature structural & molecular biology. 2011;18:159–168. [PubMed]
  • Zaranek AW, Levanon EY, Zecharia T, Clegg T, Church GM. A Survey of Genomic Traces Reveals a Common Sequencing Error, RNA Editing, and DNA Editing. PLoS Genet. 2010;6:e1000954. [PMC free article] [PubMed]