Mouse embryonic stem cells (V6.5) were co-cultured with irradiated MEFs (GlobalStem; GSC-6002C) on 0.2% gelatin-coated plates in culture medium consisting of Knockout DMEM (Invitrogen; 10829018) containing 10% FBS (GlobalStem; GSM-6002), 1% pen-strep, 1% non-essential amino acids, 1% L-glutamine, 4 μl β-mercaptoethanol, and 0.01% LIF (Millipore; ESG1106). ESCs were passaged once on gelatin without MEFs before RNA extraction. V6.5 ES cells were differentiated into neural progenitor cells (NPCs) through embryoid body formation for 4 days and selection in ITSFn media for 5–7 days, and were maintained in FGF2 and EGF (R&D Systems) as described27. The cells uniformly express Nestin and Sox2 and can differentiate into neurons, astrocytes, and oligodendrocytes. Mouse lung fibroblasts (MLF; ATCC) were grown in DMEM with 10% fetal bovine serum and penicillin/streptomycin at 37 °C, 5% CO2.
RNA extraction and library preparation
RNA was extracted using the protocol outlined in the RNeasy kit (Qiagen). Extracts were treated with DNase (Ambion 2238). Polyadenylated RNA was selected using Ambion's MicroPoly(A)Purist kit (AM1919M), and RNA integrity was confirmed using a Bioanalyzer (Agilent). We used a cDNA preparation procedure that combines a random priming step with a shearing step8–9,28 and results in fragments of ~700 bp in size. We previously found9,28 that this protocol provides relatively uniform coverage of the whole transcript, thus assisting in ab initio reconstruction. Specifically, a 'regular' (non-strand-specific) RNA sequencing library was created as previously described28, with the following modifications. 250 ng of polyA+ RNA was fragmented by heating at 98°C for 33 minutes in 0.2 mM sodium citrate, pH 6.4 (Ambion). Fragmented RNA was mixed with 3 μg random hexamers, incubated at 70°C for 10 minutes, and placed on ice briefly before starting cDNA synthesis. First-strand cDNA synthesis was performed using Superscript III (Invitrogen) for 1 hour at 55°C, and second-strand synthesis using E. coli DNA polymerase and E. coli DNA ligase at 16°C for 2 hours. cDNA was eluted using the Qiagen MinElute kit with 30 μl EB buffer. DNA ends were repaired using dNTPs and T4 polymerase (NEB), followed by purification using the MinElute kit. Adenine was added to the 3′ end of the DNA fragments to allow adaptor ligation, using dATP and Klenow fragment (3′→5′ exo–) (NEB; M0212S), and the product was purified using MinElute. Adaptors were ligated and incubated for 15 minutes at room temperature. Phenol/chloroform/isoamyl alcohol (Invitrogen 15593-031) extraction followed to remove the DNA ligase. The pellet was then resuspended in 10 μl EB buffer. The sample was run on a 3% agarose gel (NuSieve 3:1 agarose), and a 160–380 bp fragment was cut out and extracted. PCR was performed with Phusion High-Fidelity DNA Polymerase with GC buffer (New England Biolabs) and 2 M betaine (Sigma). PCR conditions: 30 sec at 98°C; 16 cycles of (10 sec at 98°C, 30 sec at 65°C, 30 sec at 72°C); 5 min at 72°C; hold at 4°C. Products were run on a polyacrylamide gel for 60 minutes at 120 volts. The PCR products were cleaned up with Agencourt AMPure XP magnetic beads (A63880) to completely remove primers, and the product was submitted for Illumina sequencing.
The "strand-specific" library was created from 100 ng of polyA+ RNA using the previously published RNA ligation method29, with modifications from the manufacturer (Illumina, manuscript in preparation). The insert size was 110 to 170 bp.
RNA-Seq library sequencing
All libraries were sequenced using the Illumina Genome Analyzer (GAII). We sequenced 3 lanes for ESC corresponding to 152 million reads, 2 lanes for MLF corresponding to 161 million reads, and 2 lanes for NPC corresponding to 180 million reads.
Alignment of reads to the genome
All reads were aligned to the mouse reference genome (NCBI 37, mm9) using the TopHat aligner13. Briefly, TopHat uses a two-step mapping process: it first uses Bowtie30 to align all reads that map directly to the genome (with no gaps), and then maps all reads that were not aligned in the first step using gapped alignment. TopHat uses canonical and non-canonical splice sites to determine possible locations for gaps in the alignment.
Generation of connectivity graph
Given a set of reads aligned to the genome, we first identified all spliced reads as those whose alignment to the reference genome contains a gap. These reads and the reference genome are used to construct connectivity graphs. Each connectivity graph contains all bases from a single chromosome. The nodes in the graph are bases, and the edges connect each base to the next base in the genome as well as to all bases to which it is connected through a 'spliced' read. In the analysis presented, we defined an edge between any two bases in the chromosome that were connected by two or more spliced reads. The connectivity graph thus represents the contiguity that exists in the RNA but that is interrupted by intron sequences in the reference genome.
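As an illustrative sketch (not the published implementation), the following Python code builds a per-chromosome connectivity graph from spliced-read junctions, assuming each junction is available as a (donor base, acceptor base) pair in 0-based genomic coordinates; all names are ours.

```python
from collections import defaultdict

def build_connectivity_graph(chrom_length, spliced_junctions, min_support=2):
    """Connectivity graph for one chromosome: each base connects to its genomic
    neighbour, plus 'spliced' edges across junctions supported by >= min_support
    spliced reads (sketch of the idea described in the text)."""
    # Count how many spliced reads support each (donor_base, acceptor_base) pair.
    junction_support = defaultdict(int)
    for donor, acceptor in spliced_junctions:
        junction_support[(donor, acceptor)] += 1

    spliced_edges = defaultdict(set)
    for (donor, acceptor), n in junction_support.items():
        if n >= min_support:
            spliced_edges[donor].add(acceptor)

    def neighbours(base):
        """All bases reachable from `base`: the next genomic base plus spliced targets."""
        out = set(spliced_edges.get(base, ()))
        if base + 1 < chrom_length:
            out.add(base + 1)
        return out

    return neighbours

# Example: two spliced reads supporting a junction from base 99 to base 500.
neighbours = build_connectivity_graph(1000, [(99, 500), (99, 500)])
print(neighbours(99))   # {100, 500} (order may vary)
```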
Identification of splice site motifs and directionality
We restricted our analysis to spliced reads that connected donor/acceptor splice sites, either canonical (GT/AG) or non-canonical (GC/AG and AT/AC). We oriented each mapped spliced read using the orientation of the donor/acceptor sites it connected.
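A minimal sketch of this orientation step is shown below, assuming the genome is available as a dictionary of chromosome sequences and junctions are given as 0-based intron start/end coordinates; the function and variable names are illustrative.

```python
# Donor/acceptor motifs on the plus strand, as listed in the text.
DONOR_ACCEPTOR = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}

def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def junction_strand(genome, chrom, intron_start, intron_end):
    """Return '+', '-', or None based on canonical/non-canonical splice motifs."""
    donor = genome[chrom][intron_start:intron_start + 2]
    acceptor = genome[chrom][intron_end - 2:intron_end]
    if (donor, acceptor) in DONOR_ACCEPTOR:
        return "+"
    # Minus-strand junctions show the motifs reverse-complemented on the plus strand.
    if (revcomp(acceptor), revcomp(donor)) in DONOR_ACCEPTOR:
        return "-"
    return None

genome = {"chr1": "A" * 10 + "GT" + "C" * 20 + "AG" + "A" * 10}
print(junction_strand(genome, "chr1", 10, 34))  # '+'
```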
Construction of transcript graphs
The ‘spliced’ edges in the connectivity graph reflect bases that were connected in the original RNA but are not contiguous in the genome. To construct a transcript graph, we ‘thread’ the connectivity graph (which was constructed only from the genome and spliced reads) with the non-spliced (contiguous) reads, to provide a quantitative measure of the reads supporting each base and edge. We then use a statistical segmentation strategy to traverse the graph topology directly and determine paths through the connectivity graph that represent a contiguous path of significant enrichment over the background distribution (below). In this segmentation process, we scan variable sized windows across the graph and assign significance to each window. We then merge significant paths into a transcript graph. Specifically, for a window of fixed size, we slide the window across each base in the connectivity graph (after augmenting it with the non-spliced reads). If a window contains only contiguous non-spliced reads, then it represents a non-spliced part of the transcript. However, if the window hits an edge in the connectivity graph connecting two separate parts of the genome (based on two or more spliced reads), then the path follows this edge to a non-contiguous part of the genome, denoting a splicing event. Similarly, when alternative splice isoforms are present, if a base connects to multiple possible places, then all windows across these alternative paths are computed. Using a simple recursive procedure we can compute all paths of a fixed size across the graph.
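To make the recursive path enumeration concrete, the sketch below (our illustration, not the published code) lists every fixed-length path through a toy connectivity graph in which one base carries both a contiguous edge and a spliced edge.

```python
def all_paths(neighbours, start, window_size):
    """Recursively enumerate every path of `window_size` bases through a
    connectivity graph starting at `start`; `neighbours(base)` yields the bases
    reachable from `base` (contiguous neighbour plus any spliced edges)."""
    if window_size == 1:
        return [[start]]
    paths = []
    for nxt in neighbours(start):
        for tail in all_paths(neighbours, nxt, window_size - 1):
            paths.append([start] + tail)
    return paths

# Toy graph: bases 98..101 plus a spliced edge 99 -> 500.
edges = {98: {99}, 99: {100, 500}, 100: {101}, 500: {501}}
print(all_paths(lambda b: edges.get(b, set()), 98, 3))
# [[98, 99, 100], [98, 99, 500]] (order may vary)
```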
Identification of significant segments
To assess the significance of each path, we first define a background distribution. We estimate a genome-wide background distribution by permuting the read alignments in the genome and counting the number of reads that overlap each region and the frequency with which they occur. Specifically, if we are interested in computing the probability of observing alignment a (of length r) at position i (out of a total genome size of L), we can permute the alignments and ask how often read a overlaps position i. Under this uniform permutation model, the probability that read a overlaps position i is simply r/L. Extending this reasoning, we can compute the probability of observing k reads (of average length r) at position i as the binomial probability. Given the large number of reads and the large genome size, the binomial formula is well approximated by a Poisson distribution with λ = np (or the number of reads divided by the number of possible positions).
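The following sketch computes a per-position background P-value under this Poisson model; it is our illustration of the approximation described above, with p taken as r/L, and the function and argument names are ours.

```python
from scipy.stats import poisson

def position_pvalue(read_count, total_reads, read_length, genome_length):
    """P-value for observing `read_count` or more reads covering one base under
    the uniform-placement background: each read covers a given base with
    probability ~ read_length / genome_length, so counts are ~ Poisson(n * p)."""
    lam = total_reads * read_length / genome_length   # lambda = n * p
    return poisson.sf(read_count - 1, lam)            # P(X >= read_count)

# e.g. 15 reads over one base, 100 million 76-bp reads, 2.6 Gb genome
print(position_pvalue(15, 1e8, 76, 2.6e9))
```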
Given a distribution for the number of read counts over each position, we scan the genome for regions that deviate from the expected background distribution. First consider a fixed window size w. We slide this window across each position (allowing for overlapping windows) and compute the probability of each observed window based on a Poisson distribution with λ = wnp. Since we are sliding this window across a genome of size L, we correct our nominal significance for multiple testing by computing the maximum value observed for a window of size w across a number of permutations of the data. This distribution controls the family-wise error rate, defined as the probability of observing at least one such value in the null distribution31. Notably, we can approximate this maximum permutation distribution well by a distribution known as the scan statistic distribution32, which depends on the size of the genome that we scan, the window size used, and our estimate of the Poisson λ parameter. This method provides a general strategy for determining a multiple-testing-corrected P-value for a specified region of the genome in any given sample. We use this method to compute a corrected significance cutoff for any given region.
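As a rough sketch of the permutation side of this correction (the paper also uses the closed-form scan-statistic approximation, which is not reproduced here), the code below places reads uniformly at random, records the maximum windowed read-start count per permutation, and takes a quantile of those maxima as the family-wise threshold; all names and simplifications (ignoring read length within a window) are ours.

```python
import numpy as np

def fwer_window_threshold(total_reads, genome_length, window_size,
                          n_perm=100, alpha=0.05, rng=None):
    """Permutation estimate of a multiple-testing-corrected window-count threshold:
    the (1 - alpha) quantile of the per-permutation maximum windowed count."""
    rng = rng or np.random.default_rng(0)
    maxima = []
    for _ in range(n_perm):
        starts = rng.integers(0, genome_length, size=total_reads)
        per_base = np.bincount(starts, minlength=genome_length).astype(float)
        # Windowed read-start counts via a cumulative sum.
        csum = np.concatenate(([0.0], np.cumsum(per_base)))
        window_counts = csum[window_size:] - csum[:-window_size]
        maxima.append(window_counts.max())
    return float(np.quantile(maxima, 1 - alpha))

# Toy example on a small "genome"
print(fwer_window_threshold(total_reads=10_000, genome_length=100_000,
                            window_size=500))
```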
Finally, to identify significant intervals, we scan the genome using variable-sized windows, computing significance values for each and filtering at a 5% significance threshold. For each window size, we merge the significant regions that pass this cutoff into consecutive intervals. We trim the ends of these intervals as needed, since we compute significance over windows (rather than exact regions), so the merged interval boundaries need not coincide with the enriched region. Trimming is performed by computing a normalized read count for each base in the interval relative to the average number of reads in the genome. We then trim the interval to the maximum contiguous subsequence of these values. We test the trimmed interval using the scan procedure and retain it only if it passes our defined significance level.
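The trimming step amounts to the classic maximum contiguous subsequence (maximum subarray) problem. Below is a minimal sketch using Kadane's algorithm, assuming the normalized value per base is positive where coverage exceeds the genome average and negative otherwise; the exact normalization used in the pipeline may differ.

```python
def trim_interval(normalized_counts):
    """Trim an interval to its maximum contiguous subsequence of normalized
    read counts (Kadane's algorithm); returns half-open [start, end) offsets."""
    best_sum = cur_sum = float("-inf")
    best = (0, 0)
    cur_start = 0
    for i, x in enumerate(normalized_counts):
        if cur_sum < 0:
            cur_sum, cur_start = x, i     # restart the running segment here
        else:
            cur_sum += x
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best

# Reads enriched in the middle of the candidate interval:
print(trim_interval([-1.0, -0.5, 2.0, 3.0, -0.2, 1.5, -2.0, -1.0]))  # (2, 6)
```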
We work with a range of different window sizes in order to detect paths (intervals) with variable support. Small windows have power to identify short regions of strong enrichment (e.g., a short, highly expressed exon), whereas long windows capture long contiguous regions with often lower and more 'diffuse' enrichment levels (e.g., a longer, lowly expressed transcript whose 'moderate evidence' aggregates along its entire length).
Estimation of library insert size
We estimated the insert size distribution by taking all reconstructed transcripts for which we reconstructed only a single isoform and computing the distribution of distances between the paired-end reads that aligned to them.
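A minimal sketch of this estimate, assuming mate positions are available in transcript coordinates as (left mate start, right mate end) pairs from single-isoform transcripts; the representation is ours.

```python
import numpy as np

def insert_size_distribution(single_isoform_pairs):
    """Empirical insert-size distribution from paired-end reads aligned to
    transcripts with a single reconstructed isoform."""
    sizes = np.array([end - start for start, end in single_isoform_pairs])
    return sizes.mean(), sizes.std(), sizes

mean_insert, sd_insert, sizes = insert_size_distribution(
    [(0, 310), (50, 345), (120, 430), (200, 480)])
print(mean_insert, sd_insert)
```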
Weighting of isoforms using paired end edges
Using the size constraints imposed by the length of the paired ends, we assigned weights to each path in the transcript graph. We classified all paired ends overlapping a given path and assigned them to all possible paths that they overlapped. We then assigned each paired end a probability that it was observed from a given transcript, based on the insert size inferred for the pair along that path. We used an empirically determined distribution of insert sizes, estimated from single-isoform graphs, and scaled each value by the average insert size; we refer to this scaled value as our insert distribution. For each paired end in a path, we computed I, the inferred insert size (the distance between the mates following the full path) minus the average insert size. We then determined the probability of I as the area of our insert distribution between −I and I. This value is the probability of obtaining the observed paired-end insert distance given this distribution of paired-end reads. To aggregate these into weights for each path, we weight each paired end by the probability that it was observed from the given path. Paired ends that equally support multiple isoforms count equally for all, whereas paired ends biased toward some isoforms and against others provide weighted evidence for each isoform. We assign this weight to each isoform path and normalize the score by the number of paired ends overlapping the path. We filter out paths with little paired-end support (normalized score < 0.1).
Determination of expression levels from RNA-Seq data
Expression levels are computed as previously described8. Briefly, the expression of a transcript is computed in Reads Per Kilobase of exonic sequence per Million aligned reads (RPKM), defined as RPKM = (10^9 × r)/(t × R), where r is the number of reads mapped to the exonic region of the transcript, t is the total exonic length of the transcript, and R is the total number of reads mapped in the experiment.
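The following one-line function applies this definition directly; it is offered as an illustrative sketch, and the argument names are ours.

```python
def rpkm(exonic_reads, exonic_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exonic sequence per Million aligned reads:
    RPKM = 10^9 * r / (t * R)."""
    return 1e9 * exonic_reads / (exonic_length_bp * total_mapped_reads)

# e.g. 2,000 reads over a 3 kb transcript in a library of 50 million mapped reads
print(rpkm(2000, 3000, 50_000_000))  # ~13.3
```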
Array expression profiling in ESC cells
Microarray hybridization data were obtained from our previous studies, including ESC and NPC16.
Comparisons to known annotation
The reconstructed transcripts were compared to the RefSeq genome annotation15 (NCBI Release 39). To determine whether a known annotation of a protein-coding gene from RefSeq was fully reconstructed, we first compared the 5′ and 3′ ends of the reconstructed vs. the annotated transcript. If these overlapped, we further verified that all exons in the annotated transcript matched those in the reconstructed version. To score the portion of an annotated transcript covered by our reconstructions, we found the reconstructed transcript whose exons covered the largest fraction of the annotated transcript and reported the portion of the annotation that it covered.
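The two checks can be sketched as simple exon-interval comparisons, as below. This is our simplified illustration, assuming exons are (start, end) half-open intervals on the same strand and ignoring edge cases such as overlapping reconstructed exons.

```python
def fully_reconstructed(annotated_exons, reconstructed_exons):
    """True if the reconstruction's 5'/3' ends overlap the annotation and every
    annotated exon has an exactly matching reconstructed exon."""
    ann_start, ann_end = annotated_exons[0][0], annotated_exons[-1][1]
    rec_start, rec_end = reconstructed_exons[0][0], reconstructed_exons[-1][1]
    ends_overlap = rec_start < ann_end and ann_start < rec_end
    return ends_overlap and all(e in reconstructed_exons for e in annotated_exons)

def covered_fraction(annotated_exons, reconstructed_exons):
    """Fraction of annotated exonic bases covered by the reconstruction's exons."""
    covered = 0
    total = sum(e - s for s, e in annotated_exons)
    for a_start, a_end in annotated_exons:
        for r_start, r_end in reconstructed_exons:
            covered += max(0, min(a_end, r_end) - max(a_start, r_start))
    return covered / total

ann = [(100, 200), (300, 400), (500, 600)]
rec_full = [(100, 200), (300, 400), (500, 600)]
rec_partial = [(300, 400), (500, 600)]
print(fully_reconstructed(ann, rec_full))   # True
print(covered_fraction(ann, rec_partial))   # ~0.67
```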
ChIP-seq profiles in ESC cells and determination of K4 and K36 regions
To determine regions enriched in chromatin marks from ChIP-seq data, we used our previously described method4, applied to ESC, MLF, and NPC data4,16.
Determination of external and internal 5′ start sites
We identified alternative 5′ start sites by comparing the 5′ exon of our reconstructed transcripts to the location of the 5′ exon of the annotated gene overlapping it. If the reconstructed 5′ start site resided upstream of the annotated 5′ end, we termed it an 'external start site'. For novel 5′ ends that are downstream of the annotated 5′ end (internal), we required additional criteria to avoid reconstruction biases due to low coverage. First, we required that the novel internal 5′ end not overlap any of the known exons within the known gene. Second, we required that the reconstructed gene contain a complete 3′ end. To determine the presence of H3K4me3 modifications overlapping the promoter regions defined by these novel start sites, we computed regions of K4me3 enrichment genome-wide (as previously described) and intersected the location of the novel 5′ exon (both internal and external) with the locations of K4me3 peaks.
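The basic external/internal classification reduces to a strand-aware coordinate comparison, sketched below with illustrative names; the additional filters described above are not repeated here.

```python
def classify_5prime(rec_5p, ann_5p, strand):
    """Classify a reconstructed 5' start relative to the annotated one:
    'external' if upstream of the annotation, 'internal' if downstream."""
    upstream = rec_5p < ann_5p if strand == "+" else rec_5p > ann_5p
    return "external" if upstream else "internal"

print(classify_5prime(900, 1000, "+"))   # external
print(classify_5prime(1200, 1000, "+"))  # internal
```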
Determination of premature/extended 3′ end
To determine novel 3′ ends, we compared the location of the 3′ exon of our reconstructed transcripts with that of the annotated gene. If the reconstruction extended past the annotated 3′ end, we classified it as an extended 3′ end. If the reconstruction ended before the annotated 3′ end, we required that it not overlap any known exon and that the transcript have a fully reconstructed 5′ start site.
Determination of sequence conservation levels
We used the SiPhy19 algorithm and software package (http://www.broadinstitute.org/genome_bio/siphy/) to estimate ω, the deviation ('contraction' or 'extension') of the branch length compared to the neutral tree, based on the total number of substitutions estimated from the alignment of the region of interest across 20 placental mammals (build mm9, http://hgdownload.cse.ucsc.edu/goldenPath/mm9/multiz30way/). For global (whole-transcript) conservation, we estimated ω for each protein-coding, lincRNA, and antisense transcript exon and compared it to similarly sized regions within introns. To identify local regions of conservation within a transcript, we computed ω for all 12-mers within the transcript sequence and assigned a p-value to each 12-mer based on the chi-square distribution, as previously described19. We then took all 12-mers showing significance at p < 0.05, collapsed overlapping 12-mers, and identified constrained regions within the transcript (e.g., Supplementary Fig. 6).
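The final collapsing step is an interval-merge over the significant 12-mers; a minimal sketch is shown below (our illustration, with start offsets given in transcript coordinates), while the ω estimation itself is performed by SiPhy and is not reproduced here.

```python
def collapse_significant_kmers(significant_starts, k=12):
    """Collapse overlapping significant k-mers (given by their start offsets
    within the transcript) into maximal constrained regions."""
    regions = []
    for start in sorted(significant_starts):
        end = start + k
        if regions and start <= regions[-1][1]:
            regions[-1][1] = max(regions[-1][1], end)   # extend the open region
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]

# 12-mers starting at 10, 14, and 60 collapse into two constrained regions:
print(collapse_significant_kmers([10, 14, 60]))  # [(10, 26), (60, 72)]
```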
Estimation of ORFs and coding potential
We estimated maximal supported open reading frames (ORFs) for each transcript we built, by scanning for start codons and computing the length (in nucleotides) until the first stop codon was reached.
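A minimal sketch of this scan, for the forward strand only and including the stop codon in the reported length (both of which are our simplifying assumptions):

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf_length(seq):
    """Length in nucleotides of the longest ORF: scan every ATG and extend in
    frame until the first stop codon."""
    seq = seq.upper()
    best = 0
    for i in range(len(seq) - 2):
        if seq[i:i + 3] != "ATG":
            continue
        j = i
        while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
            j += 3
        if j + 3 <= len(seq) and seq[j:j + 3] in STOP_CODONS:
            best = max(best, j + 3 - i)   # ORF length including the stop codon
    return best

print(longest_orf_length("CCATGAAATTTGGGTAGCC"))  # ATG AAA TTT GGG TAG -> 15
```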
To further estimate the coding potential of novel transcripts, we evaluated whether evolutionary sequence substitutions were consistent with preservation of the reading frame of any detected peptide. In a nutshell, if a transcript encodes a protein, we expect a reduction in frame-shifting indels, non-synonymous changes, and, in general, any substitution that affects the encoded protein. To assess this, we used the Codon Substitution Frequency (CSF) method as previously described17–18.
RT-PCR validation
Primers were obtained for a randomly selected set of predicted lincRNAs, protein-coding genes, antisense transcripts, and introns (Supplementary Table 2), all beginning with an M13 primer sequence. RNA from ESC cells was extracted using Qiagen's RNeasy kit (74106). A one-step cDNA/RT-PCR reaction was run using Invitrogen's one-step RT-PCR kit (12574-018), following the manufacturer's instructions, with the following PCR protocol: 55°C for 30 minutes; 94°C for 2 minutes; 40 cycles of (94°C for 15 seconds, 64°C for 30 seconds, 68°C for 1 minute); 68°C for 5 minutes; hold at 4°C. Samples were separated on a 3% agarose gel, and all bands were cut out and gel extracted using the QIAquick Gel Extraction Kit (28706). 30 ng of DNA was mixed with 3.2 pmol of M13 forward or M13 reverse primer for sequencing.