Preliminary analysis of the draft sequences of the human, mouse and rat genomes (1–3) suggested that less than 5–6% of the genomic sequence is under selective constraint and less than 1–2% codes for proteins, while most of the genomic sequence comprises neutrally evolving remnants of various transposable elements. Such interspersed repeats are normally assumed not to have any host-specific functionality and are therefore commonly omitted from functional analysis, e.g. by applying the RepeatMasker program (Smit & Green, http://repeatmasker.org) prior to gene prediction (4). However, some repetitive elements do encode proteins, and a considerable number of genes predicted in these genomes are similar to Retroviral or Transposon-associated (RETRA) genes. Indeed, fragments of transposable elements (TEs) have been found to insert into vertebrate genes, contributing to at least 4% of current coding regions (5, 6). Moreover, a number of reports demonstrate or propose (7–9) the domestication of genes from TEs by vertebrate genomes. Well-characterized examples include the major centromere-binding protein CENP-B, which is related to pogo-like DNA transposases (9), and telomerase, a reverse transcriptase related to non-LTR retrotransposons (10). Yet in many cases there does not seem to be any relationship between the sizes of protein families with similarity to RETRA genes and the number of well-characterized family members with known functions in the vertebrate genomes. For example, as many as 307 human and 244 mouse reverse transcriptases were predicted in the respective landmark genome sequence papers [see table 25 in (1) and table 11 in (2)], although to our knowledge only one well-characterized vertebrate member, telomerase (11), has been described so far. The inconsistent inclusion of RETRA genes in gene sets can result in misleading comparative analysis due to artificially inflated sizes of RETRA gene families. Therefore, there is a need for reliable identification and annotation of such genes, particularly if they contribute to vertebrate function.
To get an overview of the coding potential of RETRA genes, we compiled a list of known characteristic protein domains. We then used these domains to evaluate the instances of RETRA genes included in several frequently used gene prediction sets derived from four completely sequenced vertebrate genomes (human, mouse, rat and puffer fish), and developed a strategy to discriminate those with likely vertebrate function. For all candidate RETRA genes in the three mammalian genomes, we measured selective constraints to identify genes with a function in the host genome. It has been shown that 97% of human and rat orthologous genes are retained in orthologous genomic regions (3). Hence, if a RETRA gene has been preserved in synteny in either rodent or human, we assume that it is performing a vertebrate function, because otherwise purifying selection would have led to the elimination of the gene. This is a much more rigorous criterion than the requirement of a supporting expressed sequence tag (EST), which has previously been used to identify 34 RETRA genes (annotated in the current Ensembl gene set build 34) with putative functionality in human (1), as ESTs can also be derived from pseudogenes or other non-coding regions (12, 13). If a RETRA gene is not in synteny, it may either have recently acquired a vertebrate function or, much more likely, it functions only in the context of retroviral or transposon activity. Although the procedure outlined above to identify RETRA genes with vertebrate functionality can in principle be applied automatically, it depends on derived data (e.g. gene predictions) and there are inherent limitations in the methods used (e.g. use of best-reciprocal hits for orthology detection); hence we manually refined the results. Therefore, the curated data sets obtained, in combination with the marker domains, should result in a reliable automatic method for RETRA gene detection.
TEs and retroviruses with coding potential
Transposable elements are repetitive mobile sequences that are dispersed throughout the genome. In vertebrates, the content and diversity of these elements vary considerably. In mammalian genomes, the recognizable copies of these elements are estimated to cover 40–50% of the DNA content (1–3), whereas in the more compact vertebrate genome of puffer fish (fugu) the fraction is only 2.7% of the genome (14). TEs can be classified into class I and class II depending on whether their transposition intermediate is RNA or DNA, respectively. Each class can be subdivided into elements that code for genes that catalyze transposition (autonomous TEs) and those that do not contain such genes (non-autonomous TEs).
Class I elements, or retrotransposons, replicate through a reverse transcription mechanism. The most common elements of this class are the non-LTR retrotransposons [short (SINE) and long (LINE) interspersed nuclear elements], LTR retrotransposons and endogenous retroviruses. While SINEs have no open reading frames (ORFs) and are therefore always non-autonomous, all other class I elements encode a number of proteins. When retroviruses occasionally insert into the genome of a germ line cell they can become endogenous, and for this reason we also considered retroviruses in this study.
Class II elements, or DNA transposons, excise and reinsert as DNA. The autonomous DNA transposons usually contain only a single gene encoding a transposase. Vertebrate genomes contain only a few copies of autonomous full-length TEs together with numerous fragmented copies. Taken together, both TEs and retroviruses have coding potential, and we thus derive marker domains of the characteristic ORFs from these elements.
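The two-level classification described in this section can be written down as a small Python sketch. The names `TEClass` and `classify_te` are hypothetical and introduced only for illustration. The sketch encodes nothing beyond the two criteria stated above: the transposition intermediate (RNA or DNA) and whether the element encodes genes that catalyze transposition.

```python
from enum import Enum
from typing import Tuple


class TEClass(Enum):
    CLASS_I = "class I (RNA intermediate, retrotransposon)"
    CLASS_II = "class II (DNA intermediate, DNA transposon)"


def classify_te(intermediate: str, encodes_transposition_genes: bool) -> Tuple[TEClass, str]:
    """Return the class and autonomy of a transposable element."""
    key = intermediate.strip().upper()
    if key == "RNA":
        te_class = TEClass.CLASS_I
    elif key == "DNA":
        te_class = TEClass.CLASS_II
    else:
        raise ValueError(f"unknown transposition intermediate: {intermediate!r}")
    # Autonomous elements carry the genes that catalyze their own transposition.
    autonomy = "autonomous" if encodes_transposition_genes else "non-autonomous"
    return te_class, autonomy


# A SINE has no ORFs, so it is a non-autonomous class I element.
sine = classify_te("RNA", encodes_transposition_genes=False)
# A full-length DNA transposon encoding a transposase is autonomous class II.
dna_transposon = classify_te("DNA", encodes_transposition_genes=True)
```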