The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions/at/oupjournals.org
We suggest an annotation strategy for genes encoded by retroviruses and transposable elements (RETRA genes) based on a set of marker protein domains. Usually RETRA genes are masked in vertebrate genomes prior to the application of automated gene prediction pipelines under the assumption that they provide no selective advantage to the host. Yet, we show that about 1000 genes in four vertebrate gene sets analyzed contain at least one RETRA gene marker domain. Using the conservation of genomic neighborhood (synteny), we were able to discriminate between RETRA genes with putative functionality in the vertebrates and those that probably function only in the context of mobile elements. We identified 35 such genes in human, along with their corresponding mouse and rat orthologs, which included almost all known human genes with similarity to mobile elements. The results also imply that the vast majority of the remaining RETRA genes in current gene sets are unlikely to encode vertebrate functions. To automatically annotate RETRA genes in other vertebrate genomes, we provide as a tool a set of marker protein domains and a manually refined list of domesticated or ancestral RETRA genes for rescuing genes with vertebrate functions.
Preliminary sequence analysis of the draft sequences of the human, mouse and rat genomes (1–3) suggested that less than 5–6% of the genomic sequence appears to be under selective constraint and less than 1–2% is coding for proteins, while most of the genomic sequence comprises neutrally evolving remnants of various transposable elements. Such interspersed repeats are normally assumed not to have any host-specific functionality and are therefore commonly omitted from functional analysis, e.g. by applying the RepeatMasker program (Smit & Green, http://repeatmasker.org) prior to gene prediction (4). However some repetitive elements do encode proteins, and a considerable number of genes predicted in these genomes are similar to Retroviral or Transposon-associated (RETRA) genes. Indeed fragments of transposable elements (TEs) have been found to insert into vertebrate genes, contributing to at least 4% of current coding regions (5,6). Moreover a number of reports demonstrate or propose (7–9) the domestication of genes from TEs by vertebrate genomes. Well-characterized examples include the major centromere-binding protein CENP-B, which is related to pogo-like DNA transposases (9) and telomerase, a reverse transcriptase related to non-LTR retrotransposons (10). Yet in many cases there does not seem to be any relationship between the sizes of protein families with similarity to RETRA genes and the number of well-characterized family members with known functions in the vertebrate genomes. For example as many as 307 human and 244 mouse reverse transcriptases had been predicted in the respective landmark genome sequence papers [see table 25 in (1) and table 11 in (2)] although to our knowledge only one well-characterized vertebrate member, telomerase (11), has been described so far. The inconsistent inclusion of RETRA genes into gene sets can result in misleading comparative analysis due to artificially inflated sizes of RETRA gene families. 
Therefore, there is a need for reliable identification and annotation of such genes, particularly if they contribute to vertebrate function.
To get an overview of the coding potential of RETRA genes we compiled a list of known characteristic protein domains. We then applied these domains to evaluate the instances of RETRA genes included into several frequently used gene prediction sets derived from four completely sequenced vertebrate genomes (human, mouse, rat and puffer fish), and developed a strategy to discriminate those with likely vertebrate function. For all the candidate RETRA genes in three mammalian genomes, we measured selective constraints to identify genes with a function in the host genome. It has been shown that 97% of human and rat orthologous genes are retained in orthologous genomic regions (3). Hence if a RETRA gene has been preserved in synteny in either rodent or human, we then assume that it is performing a vertebrate function because otherwise purifying selection would have led to the elimination of the gene. This is a much more rigorous criterion than the requirement of a supporting expressed sequence tag (EST), which has been previously used to identify 34 RETRA genes (annotated in the current Ensembl gene set build 34) with putative functionality in human (1), as ESTs can also be derived from pseudogenes or other non-coding regions (12,13). If a RETRA gene is not in synteny, it may either have recently acquired a vertebrate function or, much more likely, it functions only in the context of retroviral or transposon activity. Although the procedure to identify RETRA genes with vertebrate functionality outlined above can be applied in principle automatically, it depends on derived data (e.g. gene predictions) and there are inherent limitations in the methods used (e.g. use of best-reciprocal hits for orthology detection), hence we did a manual refinement of the results. Therefore, the curated data sets obtained, in combination with the marker domains, should result in a reliable automatic method for RETRA gene detection.
Transposable elements are repetitive mobile sequences that are dispersed throughout the genome. In vertebrates, the content and diversity of these elements varies considerably. In mammalian genomes, the recognizable copies of these elements are estimated to cover 40–50% of their DNA content (1–3), whereas in the more compact vertebrate genome of puffer fish (fugu) the fraction is only 2.7% of the genome (14). TEs can be classified into class I and class II depending on whether their transposition intermediate is RNA or DNA respectively. Each class can be subdivided into elements that code for genes that catalyze transposition (autonomous TEs) (Figure 1) and those that do not contain such genes (non-autonomous TEs).
Class I elements or retrotransposons replicate through a reverse transcription mechanism and the most common elements of this class are the non-LTR retrotransposon short (SINE) and long (LINE) interspersed nuclear elements, LTR-retrotransposons and endogenous retroviruses. While SINEs have no open reading frames (ORFs) and are therefore always non-autonomous, all other class I elements encode a number of proteins. When retroviruses occasionally insert into the genome of a germ line cell they can become endogenous (Figure 1) and for this reason we also considered retroviruses in this study.
Class II elements or DNA transposons excise and reinsert as DNA. The autonomous DNA transposons usually contain only a single gene encoding a transposase. Vertebrate genomes contain only a few copies of autonomous full-length TEs together with numerous fragmented copies. Taken together, both TEs and Retroviruses have coding potential and we thus derive marker domains of the characteristic ORFs from these elements.
The analysis was based on publicly available gene prediction sets and genomic sequences as of October 9, 2003 (Supplementary Table 2). Together with the final sets of genes provided by NCBI (http://www.ncbi.nlm.nih.gov) and Ensembl (15) for human, rat and mouse genomes, and by JGI (http://www.jgi.doe.gov) for the fugu genome, we also used gene prediction sets directly produced by automatic gene calling methods, such as Genscan (16) used in Ensembl and JGI pipelines and Gnomon (http://www.ncbi.nlm.nih.gov/genome/guide/gnomon.html) at NCBI. A comparative gene prediction method, Twinscan (17), was also included.
We decided to use a set of protein domains characteristic of genes encoded by transposable elements and retroviruses as a discriminator of such genes. Since protein coding signal is more conserved through evolution, this approach is more sensitive than DNA-based analysis. Protein domain signatures associated with retrotransposons, DNA transposons and retroviruses were identified on the basis of literature survey, InterPro domain annotation and annotation of proteins in SWISS-PROT and TrEMBL databases as follows: (i) we selected all InterPro protein signatures annotated with any of the following keywords (substrings): ‘transposable’, ‘transposase’, ‘transposon’, ‘retroelement’, ‘retroid’, ‘retrotransposon’, ‘retroviral’, ‘retrovirus’, and (ii) we considered all HMM-based InterPro domains that are over-represented (100-fold) in vertebrate proteins annotated in SWISS-PROT and TrEMBL with a keyword (substring) ‘transposa’ in the description or keyword lines with respect to the rest of the vertebrate proteins, or that are over-represented (10-fold) in retroviral proteins with respect to the rest of proteins in the database. The ratios were based on the fraction of the total number of retroviral proteins in the databases. A manual inspection refined this list to a total of 85 RETRA Pfam HMM profiles that we consider as being RETRA gene specific. As an example, this manual refinement excluded from the list two profiles of CCHC Zn-finger (IPR001878) and Endonuclease (IPR005135) domains that are also found in a variety of non-RETRA proteins. The list contains a number of profiles characteristic of RETRA genes even though no endogenous genes with the domains have yet been detected.
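The over-representation criterion in step (ii) can be illustrated with a minimal Python sketch. It assumes that the fold value is the fraction of annotated (foreground) proteins carrying a domain divided by the corresponding fraction among the remaining (background) proteins; the function names, the input layout and the example counts are hypothetical and are not taken from the original analysis.

```python
import math

def fold_overrepresentation(fg_with, fg_total, bg_with, bg_total):
    """Ratio of the fraction of foreground proteins carrying a domain
    to the fraction of background proteins carrying it."""
    if fg_total <= 0 or bg_total <= 0:
        raise ValueError("both protein sets must be non-empty")
    fg_frac = fg_with / fg_total
    bg_frac = bg_with / bg_total
    if bg_frac == 0:
        # Domain never seen in the background: infinitely enriched
        # if it occurs in the foreground at all.
        return math.inf if fg_frac > 0 else 0.0
    return fg_frac / bg_frac

def select_domains(counts, fg_total, bg_total, min_fold):
    """counts maps a domain id to (foreground count, background count).
    Returns the sorted ids of domains enriched at least min_fold times."""
    return sorted(
        domain
        for domain, (fg_with, bg_with) in counts.items()
        if fold_overrepresentation(fg_with, fg_total, bg_with, bg_total) >= min_fold
    )

# Hypothetical counts: dom_A is strongly enriched, dom_B is not.
example_counts = {"dom_A": (50, 5), "dom_B": (10, 1000)}
candidates = select_domains(example_counts, fg_total=100, bg_total=10000, min_fold=100)
```

Candidates selected in this way would still require the manual inspection described above, since a high ratio alone does not exclude domains that also occur in non-RETRA proteins.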
We used only corresponding signatures from the Pfam database to simplify the surveying procedure. Although not all protein domains are characterized and as Pfam does not have complete coverage (providing profiles for only about 75% of known proteins (18)) the use of HMM profiles gives significant advantages in terms of both sensitivity and specificity. For searching the gene sets using HMMER2, both ls (global) and fs (local) modes were considered yielding practically the same results and the fs mode was selected for further use as it is more tolerant to gene prediction errors (such as gene truncation). The results were filtered using family specific ‘gathering’ cut-offs specified in the HMM model descriptions (18). The selected HMMs (Supplementary Table 1) were retrieved from the Pfam (v.10) database and were scanned against the predicted proteomes (Supplementary Table 2). Parsed results were loaded and analyzed in a PostgreSQL (http://www.postgresql.org) database.
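As an illustration of the filtering step, the following Python sketch keeps only those profile matches whose bit score reaches the family-specific gathering cut-off. It assumes that the search output has already been parsed into simple records and that a single cut-off per model is used; the record layout, model names and scores are hypothetical, and the sketch does not reproduce any HMMER output format.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    protein: str      # identifier of the predicted protein
    model: str        # identifier of the HMM profile
    bit_score: float  # score reported for the match

def filter_by_gathering(hits, gathering):
    """Return a mapping protein -> set of models whose match scores
    reach the gathering cut-off defined for that model."""
    kept = {}
    for hit in hits:
        if hit.bit_score >= gathering[hit.model]:
            kept.setdefault(hit.protein, set()).add(hit.model)
    return kept

# Hypothetical parsed hits and cut-offs.
example_hits = [
    Hit("prot1", "model_RT", 45.2),
    Hit("prot1", "model_INT", 12.0),
    Hit("prot2", "model_INT", 30.5),
    Hit("prot3", "model_RT", 5.1),
]
example_cutoffs = {"model_RT": 25.0, "model_INT": 20.0}
candidate_genes = filter_by_gathering(example_hits, example_cutoffs)
```

Proteins that retain at least one marker model after filtering correspond to the candidate RETRA genes counted in the subsequent analysis.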
Since scanning HMM profiles directly against genomic sequences is extremely CPU intensive, in Supplementary Table 3 we report the number of matches found by TBlastN (24) for sample protein fragments, extracted from the corresponding PfamA seed alignments, in non-masked vertebrate genomic sequences with E-value less than 0.001.
To identify RETRA genes that probably encode a host-specific function in mammals, we checked all human genes with the characteristic RETRA domains for conservation of their genomic neighborhood (synteny) in the two rodent genomes. The synteny maps were derived using all genes as orthologous markers as outlined below. Although DNA-level comparison can provide additional details we do not expect many false negatives as it has been estimated that 97% of human and rat orthologous genes are retained in synteny (3). First, we determined putative orthologous genes requiring them to be best reciprocal hits in an inter-species BlastP analysis without low-complexity filtering and using the default E-value cut-off. The synteny of the best reciprocal hits was identified requiring at least two putative orthologous pairs to be nearby on the genome but allowing for up to four intervening genes as described before (19) using the SyntQL tool (Zdobnov, unpublished). We checked synteny manually for all human genes with RETRA domains for which orthologous genes in mouse or rat were not found automatically. Intrinsic limitations of this approach are discussed in detail in Results and Discussion. In addition, we inspected EST support for the human genes with RETRA domains: all ESTs from dbEST (20) (as of February 2004) were aligned against the human genome (build 34) using stand-alone BLAT (2) and we considered only EST alignments in the genome with a percentage identity greater than 96% and an alignment length greater than 100 bases. If the difference in score between the best hit and second-best hit was less than 10 in a BLAT-like scoring scheme, we considered such an EST alignment as ambiguous.
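The EST filtering rules stated above can be sketched in Python as follows. The thresholds (identity above 96%, length above 100 bases, score difference below 10) are those given in the text; the tuple layout, the identifiers and the choice to compare the best and second-best scores only among alignments that pass the identity and length filters are assumptions made for illustration.

```python
from collections import defaultdict

def classify_ests(alignments, min_identity=96.0, min_length=100, min_gap=10):
    """alignments: iterable of (est_id, locus, percent_identity, length, score).
    Returns est_id -> (best locus, 'unique' or 'ambiguous') for every EST
    that has at least one alignment passing the identity and length filters."""
    by_est = defaultdict(list)
    for est_id, locus, identity, length, score in alignments:
        if identity > min_identity and length > min_length:
            by_est[est_id].append((score, locus))

    result = {}
    for est_id, hits in by_est.items():
        hits.sort(reverse=True)  # highest score first
        best_score, best_locus = hits[0]
        # Ambiguous if a second placement scores almost as well as the best.
        ambiguous = len(hits) > 1 and best_score - hits[1][0] < min_gap
        result[est_id] = (best_locus, "ambiguous" if ambiguous else "unique")
    return result

# Hypothetical alignments: est1 maps almost equally well to two loci,
# est2 maps uniquely, est3 fails the identity filter.
example_alignments = [
    ("est1", "chr1:100", 99.0, 300, 290),
    ("est1", "chr5:900", 97.0, 300, 285),
    ("est2", "chr2:50", 98.5, 150, 140),
    ("est3", "chr3:10", 95.0, 400, 350),
]
est_support = classify_ests(example_alignments)
```

Only ESTs classified as unique in such a scheme would count as unambiguous support for a gene.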
The list of the 85 selected RETRA-characteristic Pfam HMM models, the models themselves from Pfam version 10 and the list of true vertebrate genes with similarity to RETRA genes are available from: http://www.bork.embl-heidelberg.de/Docu/RETRA/.
To identify characteristic protein signatures that could be used as RETRA gene markers in vertebrate genomes, we surveyed known characteristic domains of RETRA genes as described in the literature (Figure 1) and extended the list by including domains that are clearly annotated as being RETRA in the InterPro database (21) or those that are over-represented in annotated RETRA genes in protein databases (SWISS-PROT and TrEMBL (22); see data flow in Figure 2). As a result we collected a manually curated set of 85 HMM profiles (23) for the domains that can be considered as markers for RETRA genes (Supplementary Table 1, see also Table 3).
In order to evaluate inconsistencies in RETRA gene inclusion into current vertebrate gene sets, we scanned the 85 marker domains against several popular gene sets (Table 1). The numbers are considerably lower than those in previous releases due to ongoing annotation efforts. For example, we find only 251 candidate RETRA genes in the Ensembl gene set based on human assembly build 33 compared with more than 1000 in the early releases. Despite these considerable improvements there are still as many as 54 predicted genes containing the reverse transcriptase domain [including the well-characterized telomerase and a recently identified LTR retrotransposon element conserved in synteny in human and rodents species (25)] and 127 L1 transposases in this gene set. Given the background of thousands of human reverse transcriptases and L1 transposases in the non-masked human genomic sequence (Supplementary Table 3), gene prediction pipelines already filter out many of the unwanted RETRA genes. Yet Table 1 also indicates that a considerable number of such genes still exist in current annotation schemes, despite manual curation efforts.
Although the overall number of RETRA genes in vertebrate gene sets appears similar in five popular gene prediction protocols, a more detailed breakdown of the results in different species reveals that different gene prediction pipelines give considerably different results among each other and among different species (Table 2). It indicates that there is still a need for consistent annotation of RETRA genes in gene prediction pipelines. The automatic detection of RETRA genes is complicated by the fact that some RETRA genes encode functionality for the host genome, i.e. are true vertebrate genes. We have thus screened such genes among the RETRA genes recognized by marker domain analysis.
Under the assumption that synteny between species as divergent as human and rodents should be a sufficient denominator for host-specific functionality, we analyzed all identified RETRA genes in mammals for this feature. In brief, our synteny analysis requires the conservation of local genomic neighborhood of putative orthologs (19), operationally defined by best reciprocal hits in an inter-species BlastP analysis. Since the method relies on existing genome annotation, the analysis was complemented by manual inspection of all human genes with RETRA domains for which no putative mouse or rat orthologs were found in synteny automatically. As orthology and synteny identification methods are sensitive to genome sequence completeness and quality, the fugu genome was not included in this analysis.
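The two operational definitions used here, best reciprocal hits and conservation of local gene neighborhood, can be sketched in Python as shown below. This is not the SyntQL implementation, which is unpublished; it assumes that an orthologous pair counts as syntenic when at least one other orthologous pair lies within four intervening genes of it on the same chromosome in both species. All gene names, positions and function names are hypothetical.

```python
def best_reciprocal_hits(best_ab, best_ba):
    """best_ab maps each gene of species A to its best hit in species B,
    best_ba the reverse. A pair is kept only if each is the other's best hit."""
    return {a: b for a, b in best_ab.items() if best_ba.get(b) == a}

def syntenic_pairs(orthologs, pos_a, pos_b, max_intervening=4):
    """pos_a / pos_b map a gene to (chromosome, rank in gene order).
    Returns the orthologous pairs supported by a neighboring orthologous pair."""
    def near(p, q):
        # Same chromosome, at most max_intervening genes in between.
        return p[0] == q[0] and abs(p[1] - q[1]) <= max_intervening + 1

    pairs = list(orthologs.items())
    supported = set()
    for a, b in pairs:
        for a2, b2 in pairs:
            if a2 != a and near(pos_a[a], pos_a[a2]) and near(pos_b[b], pos_b[b2]):
                supported.add((a, b))
                break
    return supported

# Hypothetical example: h1/m1 and h2/m2 are neighbors in both genomes,
# h3/m3 is an isolated pair, and h4 is not a reciprocal best hit.
best_hm = {"h1": "m1", "h2": "m2", "h3": "m3", "h4": "m1"}
best_mh = {"m1": "h1", "m2": "h2", "m3": "h3"}
pos_h = {"h1": ("chr1", 0), "h2": ("chr1", 3), "h3": ("chr2", 0)}
pos_m = {"m1": ("chr4", 10), "m2": ("chr4", 12), "m3": ("chr9", 5)}

orthologs = best_reciprocal_hits(best_hm, best_mh)
in_synteny = syntenic_pairs(orthologs, pos_h, pos_m)
```

The pairwise comparison is quadratic in the number of orthologous pairs, which is acceptable for a sketch; a genome-scale analysis would sort genes by position and examine only a sliding window of neighbors.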
The majority of the RETRA domain families contain none or one gene in synteny, the latter being often a known human gene with similarity to RETRA genes (see Table 3 and below). Other families of RETRA genes contain a few genes in synteny, the majority of which have not been noted before. Examples are genes with HAT dimerization (InterPro family identifier: IPR008906), Integrase (IPR001584) and BED finger (IPR003656) domains (Table 3). Surprisingly, in a small number of families (almost) all genes were found in synteny between human and rodents, suggesting that the members of these families perform host-specific functions in vertebrates.
For example, an entire family of gag-like proteins (IPR005162) (at least four paralogs in mammals) appears to play a role in mammalian biology. One member of this family has been detected as an antigen in patients with testicular cancer (26) and another, PEG10, is probably a regulator of transcription (27). We can only speculate about the mammalian functionality of the two other members of this gag-like family, but the recent discovery of the cellular interaction partner of the homologous viral gag protein of the Moloney murine leukemia virus, endophilin 2 (28), might give a first hint.
Another family for which host-specific functionality in mammals was discovered using our procedure contains homologs of the centromeric protein CENP-B (IPR004875). Out of 13 CENP-B family members that were detected (not counting YCE7_HUMAN gene, see Table 3), 11 were found to be in synteny in mammals including two experimentally proven human genes, CENP-B itself and jerky (29,30). The mammalian function of the remaining nine genes is unknown, although all of them with one exception have already been either predicted based on EST support (1) or based on orthology in rodents (31). CENP-B binds to alpha-repetitive sequences at the centromere of autosomes and the X chromosome (29), therefore the presence of homologs with slightly different binding specificities might explain the inability of CENP-B to bind Y chromosome centromeric regions.
In total, 35 human genes with similarity to RETRA genes were found in synteny; of which only 27 were recognized automatically as best-reciprocal hits in synteny with rodents, and the remainder through manual analysis of all other candidates detected by the marker domains.
To get an overview of how many known human genes with similarity to RETRA genes our procedure identified, we surveyed the literature for the respective reports. We retrieved 21 genes with a proven human function (Table 3), of which 18 were recovered by our procedure as likely having a host function (in 3 of these 18 cases synteny was detected only by manual refinement). Of the three genes that we did not detect using synteny, two, Syncytin1 and Syncytin2, were previously reported as primate-specific acquisitions (32,33), and the third, human transcription factor ZBED1 with HAT and BED finger domains, was described before as a homolog of the Drosophila DREF transcription regulator (34).
In addition to the recovery of 18 out of 21 known mammalian RETRA genes, our procedure led to the identification of an additional 17 expressed human genes with similarity to RETRA genes (Table 3); they all have retained their genomic neighborhood in rodents and human, and therefore being under selective constraint they are likely to have a specific function in mammals. We combined these two sets and recorded the respective vertebrate orthologs to derive a list of vertebrate genes with similarity to RETRA (‘rescue list’). Despite additional manual effort to compile this list, it can now be used automatically in conjunction with the set of marker domains for the annotation of RETRA genes in forthcoming vertebrate genomes. This concept can also be extended to other metazoan genomes.
Although most RETRA genes are commonly filtered out prior to the application of automated gene prediction pipelines (4) as elements without any selective advantage for the host, we find that current gene prediction pipelines still include a considerable number of RETRA genes (Table 1), and detect inconsistencies not only between methods applied to the same genome but also between applications of each method to different genomes. These discrepancies cannot be explained merely by the different time points at which the analyses were done or by the amount of manual work invested for a particular genome. They are most likely due to the absence of an established criterion for the annotation of RETRA genes, which can easily lead to erroneous conclusions in comparative analyses. We therefore propose that, as is attempted by the repeat masking process, RETRA elements should be specifically identified and excluded from gene sets unless evidence for host-specific functionality is found.
For this reason we have developed a procedure for the identification of RETRA genes and suggest a criterion based on genomic neighborhood conservation to distinguish between those with a host-specific function and those that function only in the context of mobile elements. By applying this methodology we have identified the vast majority of described RETRA genes with functionality in mammals (e.g. Telomerase, CENP-B, RNAse H etc.) and have revealed additional functional genes, all with EST or mRNA support. In contrast, only 15% of the non-syntenic RETRA genes are supported by uniquely mapped ESTs or mRNAs (Supplementary Table 4), although this is likely to be a significant underestimate due to the high level of sequence similarity within families of repetitive elements. The sensitivity of the method could in principle be further increased as the orthology identification is far from perfect and the domain detection also has its limits. For instance, the procedure missed the human transcription factor ZBED1 (34), whose homolog can be found in Drosophila but not in rodents. Yet synteny will not be able to identify all RETRA genes with functionality in the host genomes as it is possible that true orthologous genes have been translocated, lost or even acquired and domesticated since the divergence of rodents and human approximately 75 MYA (2). This seems to be the case for Syncytin1 and Syncytin2 in primates (32,33). Apparently both proteins have been acquired from the envelope proteins of human endogenous retroviruses (HERV-W and HERV-FRD, respectively) and now potentially have a role in placenta formation (32,33,35). Although this is a clear example of the domestication of a RETRA gene, analysis of the phylogenetic trees of other RETRA gene families does not give a clear picture of the origin of these genes, i.e. in most cases we were not able to distinguish whether they have been domesticated by the vertebrates or have been picked up by the mobile elements (data not shown).
As the method relies in part on existing genome annotation (e.g. gene sets in different vertebrates), the analysis might have to be extended to the whole genome to allow a more comprehensive overview of RETRA genes.
These limitations should not hamper the detection of both RETRA genes and the subset with vertebrate functionality as we not only provide the set of marker domains but also a list of known or identified RETRA genes with functionality in human, mouse and rat. Thus, even without synteny identification, these true mammalian genes together with their orthologs in fugu can be used to rescue genes with similarity to RETRA genes in other vertebrates. When applied to the four vertebrates, over 1000 RETRA genes can be flagged (Table 1), the vast majority of which should probably not be included in the gene sets. In human, 33–251 RETRA genes are found depending on the gene prediction method used (Tables 1 and 2), despite manual curation efforts. We have identified 35 true human genes (none of the methods predicted the complete set of genes in mammals). The number of RETRA genes increases significantly in more automatically annotated organisms (e.g. in the mouse gene sets 116–807 RETRA proteins are found). For other vertebrate genomes to be sequenced, a reproducible automatic pipeline should be applied to separate RETRA genes from host genes and we have taken the first step in this direction. In summary, the proposed use of HMM models of characteristic RETRA protein domains for identification of RETRA genes has greater sensitivity than DNA similarity-based methods. The concept of using conserved synteny in species as divergent as human and rodents to identify RETRA genes with host-specific functionality has proved able to detect the vast majority of such genes, despite the limitations discussed above. The fact that the majority of these genes are also present in the fugu genome indicates that the compiled list of such genes is applicable with high confidence to other more distant vertebrate genomes.
The automatic procedure proposed here consists of two steps: (i) the identification of RETRA genes using the set of RETRA marker domains we have derived here [this can be done using standard HMM searches (Eddy, http://hmmer.wustl.edu/)] and (ii) the detection of genes with vertebrate functions in those sets using the ‘rescue list’ derived here by manually refined synteny analysis.
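The two steps can be combined as in the following minimal Python sketch. It assumes that domain matches per gene are already available from an HMM search and that membership in the rescue list is decided by gene identifier; in practice a gene from a new genome would be matched to the rescue list through orthology. The labels, identifiers and domain names are hypothetical.

```python
def annotate_retra(gene_domains, marker_domains, rescue_list):
    """gene_domains maps a gene id to the set of domains detected in it.
    Step (i): flag genes carrying at least one marker domain.
    Step (ii): rescue flagged genes that appear on the rescue list."""
    labels = {}
    for gene, domains in gene_domains.items():
        if not domains & marker_domains:
            labels[gene] = "host gene"
        elif gene in rescue_list:
            labels[gene] = "RETRA-like, host function"
        else:
            labels[gene] = "RETRA gene"
    return labels

# Hypothetical inputs.
markers = {"marker_RT", "marker_transposase"}
rescue = {"geneB"}
detected = {
    "geneA": {"kinase"},
    "geneB": {"marker_transposase", "dna_binding"},
    "geneC": {"marker_RT"},
}
annotation = annotate_retra(detected, markers, rescue)
```

Under this scheme only genes labeled as RETRA genes would be excluded from a gene set, while rescued genes are retained as host genes with similarity to mobile elements.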
Supplementary Material is available at NAR Online.
The authors are grateful to all members of the P. Bork group and T. Gibson for the useful discussions. M.C. is a recipient of a FEBS long-term Fellowship. E.D.H. is a recipient of an E-STAR fellowship funded by the EC FP6 Marie Curie Host Fellowship for Early Stage Research Training under contract number MEST-CT-2004-504640. Funding to pay the Open Access publication charges for this article was provided by EMBL.