In the ten years since the publication of draft human genomes (Lander et al., 2001; Venter et al., 2001), extraordinary advances in DNA sequencing technology (Bentley et al., 2008) have made it possible to obtain comprehensive genomic information rapidly and at low cost. Decoding the information contained in these genomes represents a central challenge for the biological community. Protein-coding regions have been defined according to simple rules about the nature of translation: for example, that open reading frames (ORFs) have a minimum length, show biased codon usage and start at the first AUG in a transcript (Brent, 2005). Yet there are many exceptions to these rules, including internal ribosome entry sites, initiation at non-AUG codons, leaky scanning, translational reinitiation and translational frameshifts (Atkins and Gesteland, 2010). Additionally, an abundant class of large intergenic non-coding RNAs (lincRNAs) that do not contain canonical ORFs has recently been described (Guttman et al., 2009; Guttman et al., 2010). Many of these newly identified transcripts are likely to be functional RNAs, but there are well-documented cases of biologically important short coding regions. For example, the Drosophila tarsal-less/polished rice gene was originally thought to be a lincRNA (Tupy et al., 2005) but actually encodes a series of short peptides that modulate the activity of the shavenbaby transcription factor (Kondo et al., 2010). The question of which of the potential lincRNAs are actually translated remains largely unaddressed.
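To make the simple annotation rules mentioned above concrete, the following minimal sketch applies two of them: the ORF must begin at the first AUG in the transcript and must span a minimum number of codons. It is an illustration only, not the procedure used in this work. The function name `first_aug_orf` and the default threshold of 100 codons are assumptions chosen for the example, and codon usage bias is not modeled.

```python
# Illustrative sketch of rule-based ORF calling (hypothetical helper,
# not the method used in this paper).
STOP_CODONS = {"UAA", "UAG", "UGA"}  # stop codons of the standard genetic code


def first_aug_orf(transcript, min_codons=100):
    """Return (start, end) of the ORF that begins at the first AUG.

    Coordinates are 0-based, and `end` is exclusive and includes the
    stop codon. Returns None if there is no AUG, no in-frame stop codon,
    or fewer than `min_codons` sense codons (an assumed threshold).
    """
    # Accept DNA or RNA input by normalizing to upper-case RNA.
    rna = transcript.upper().replace("T", "U")
    start = rna.find("AUG")
    if start == -1:
        return None  # rule: translation starts at the first AUG
    # Walk codon by codon in the frame set by the first AUG.
    for pos in range(start, len(rna) - 2, 3):
        if rna[pos:pos + 3] in STOP_CODONS:
            n_codons = (pos - start) // 3  # sense codons before the stop
            if n_codons >= min_codons:
                return (start, pos + 3)
            return None  # rule: ORF is too short to be called
    return None  # no in-frame stop codon found


if __name__ == "__main__":
    # A toy transcript with a two-codon ORF (AUG GCU) followed by UAA.
    print(first_aug_orf("GGAUGGCUUAAGG", min_codons=2))
    # The same ORF is rejected under the default length threshold.
    print(first_aug_orf("GGAUGGCUUAAGG"))
    # A CUG-initiated ORF is invisible to this rule set.
    print(first_aug_orf("GGCUGGCUUAAGG", min_codons=2))
```

The last two calls show how such rules fail on the exceptions listed above: a short coding region is discarded by the length threshold, and an ORF that initiates at a non-AUG codon is never considered.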
We also know that the rate of translation is not constant across a message and that translational pauses can regulate synthesis (Darnell et al., 2011; Morris and Geballe, 2000), folding (Kimchi-Sarfaty et al., 2007; Zhang et al., 2009), and localization of a protein (Mariappan et al., 2010) or mRNA (Yanagitani et al., 2011). These pauses can result from codon usage (Irwin et al., 1995), mRNA structure (Namy et al., 2006), or peptide sequence (Nakatogawa and Ito, 2002; Tenson and Ehrenberg, 2002), but little information exists on how commonly they occur, let alone on their functional impact.
Recently, we described a strategy, termed ribosome profiling, based on deep-sequencing of ribosome-protected mRNA fragments, that makes it possible to monitor translation with a depth, speed and accuracy that rivals existing approaches for following mRNA levels (Guo et al., 2010; Ingolia et al., 2009). By revealing the precise location of ribosomes on each mRNA, ribosome profiling also has the potential to identify protein-coding regions. However, initiation from multiple sites within a single transcript makes it challenging to define all open reading frames, especially in complex transcriptomes. Additionally, ribosome profiling provides a snapshot of ribosome positions but does not report directly on the kinetics of translational elongation or distinguish stalled ribosomes from those engaged in active elongation.
Here we describe a simplified, robust protocol for ribosome profiling in mammalian systems. We have used this technique to determine the kinetics of translation by following run-off elongation after stalling new initiation using the drug harringtonine (Fresno et al., 1977; Huang, 1975; Robert et al., 2009; Tscherne and Pestka, 1975). We further employ harringtonine, which causes ribosomes to accumulate precisely at initiation codons, together with a machine learning algorithm, to define the sites of translation initiation genome-wide. Application of our approach to mouse embryonic stem cells reveals a wide range of novel or modified ORFs, including highly translated short ORFs in the majority of annotated lincRNAs. We now classify these atypical protein-coding transcripts as short, polycistronic ribosome-associated coding RNAs (sprcRNAs). Additionally, we identify over a thousand strong translational pauses that could act as key regulatory sites. Our approach is readily applicable to other cells and organisms and as such provides a general scheme for decoding complex genomes, monitoring rates of protein production and exploring the molecular mechanisms used to regulate translation.