Large surveys of transcriptomes, such as ENCODE (ENCODE Project Consortium et al., 2007) and FANTOM (Maeda et al., 2006), demonstrated that eukaryotic genomes are pervasively transcribed (Jacquier, 2009). Long, mRNA-like, non-protein-coding transcripts (mlncRNAs) are an important component of this transcriptional output, often arising from regions unlinked to annotated protein-coding genes (Khalil et al., 2009). With few exceptions, however, the detailed function of these transcripts remains unknown. The cases that are reasonably well understood implicate mlncRNAs as key molecules orchestrating essential cellular processes, including gene expression, transcriptional and post-transcriptional regulation, chromatin remodeling, differentiation, and development (Mercer et al., 2009).
As a group, mlncRNAs show evidence of stabilizing selection (Ponjavic et al., 2007; Marques and Ponting, 2009). Although widespread evolutionary constraint on the sequences of ncRNAs is the most direct evidence that at least a large fraction of them is in fact functional, we know very little about the evolutionary history of individual transcripts. In contrast to protein-coding genes or short structured ncRNAs, for which comprehensive evolutionary information is available in databases like Pfam (Finn et al., 2010) or Rfam (Gardner et al., 2011), there is no comparable resource for long ncRNAs. The lncRNA database (Amaral et al., 2011) is a first step in this direction, predominantly compiling non-coding transcripts from the model organisms human and mouse.
To date, only a few detailed case studies are available. Chodroff et al. (2010) recently considered the conservation of a few brain-specific mlncRNAs, reporting weak sequence conservation and major changes in gene structure across amniotes. More detailed, sequence-level descriptions of mlncRNA evolution are available only for a few “famous” transcripts. Xist, a eutherian-specific regulatory long ncRNA that plays a central role in the inactivation of one female X chromosome by recruiting chromatin-remodeling complexes (reviewed, e.g., by Arthold et al., 2011), is the only long ncRNA whose evolutionary origin is understood in detail. It arose after the divergence of marsupials and placental mammals from the protein-coding Lnx3 gene upon incorporation of additional, repeat-derived exons (Duret et al., 2006; Elisaphenko et al., 2008; Kolesnikov and Elisafenko, 2010). Xist, along with Kcnq1ot1 (Kanduri, 2011), HOTAIR (Tsai et al., 2010), and HOTTIP (Wang et al., 2011), belongs to a class of chromatin-regulatory mlncRNAs. The evolutionary features of HOTAIR were recently studied in some detail by He et al. (2011). MALAT-1 and its apparent relative MENε/β, on the other hand, are nuclear-retained ncRNAs that are mostly unspliced (Hutchinson et al., 2007), undergo a highly unusual processing of their 3′-ends (Wilusz and Spector, 2010), and function as organizers of nuclear speckle structures (Sasaki et al., 2009). MALAT-1, which exhibits an atypically high level of sequence conservation, dates back at least to the radiation of the gnathostomes (Stadler, 2010).
Besides long intergenic RNAs (lincRNAs), vertebrate genomes also harbor tens of thousands of totally and partially intronic transcripts (TINs and PINs; Nakaya et al., 2007; Louro et al., 2008, 2009). A fraction of these comprises unspliced long antisense intronic RNAs (Rinn et al., 2003; Reis et al., 2004) and other predominantly unspliced transcripts (Engelhardt and Stadler, 2011), while another subgroup consists of spliced RNAs. These could potentially be very similar to lincRNAs. In this contribution, we explore in detail the evolution of one particular example of the latter class, the eosinophil granule ontogeny transcript (EGOT).
The eosinophil granule ontogeny transcript is a transcriptional regulator of granule protein expression during eosinophil development (Wagner et al., 2007). Using sucrose density gradients, Wagner et al. (2007) demonstrated that EGOT is not associated with ribosomes and thus most likely functions as a bona fide non-coding RNA. The same authors proposed that EGOT may act as an siRNA against the eosinophil granule major basic protein (MBP) and eosinophil-derived neurotoxin (EDN). We chose EGOT as an example of a spliced antisense TIN because it is probably the experimentally best-characterized ncRNA of this type. It is located in an intron of the ITPR1 gene, which codes for the type 1 inositol 1,4,5-trisphosphate receptor mediating calcium release from the endoplasmic reticulum upon stimulation by inositol 1,4,5-trisphosphate.
Human EGOT has two known isoforms that share the same transcriptional start site. EGO-B consists of two closely spaced exons. Its primary transcript covers about 2.4 kb, of which about 1.4 kb are exonic. In contrast, EGO-A remains unspliced, reaching about 190 nt into the intron. Both transcripts are polyadenylated (Wagner et al., 2007). Overall, EGOT is quite poorly conserved at the sequence level. The intron, however, contains a sequence element that Wagner et al. (2007) already recognized as conserved between human and chicken.
Here, we report on an in-depth computational analysis of EGOT, focusing in particular on the spliced and polyadenylated EGO-B transcript, which because of these properties is classified as an mlncRNA.