L1Base is a dedicated database containing putatively active LINE-1 (L1) insertions residing in human and rodent genomes. These fall into two classes: (i) full-length L1s intact in the two open reading frames (ORFs) (FLI-L1s) and (ii) L1s with an intact ORF2 but a disrupted ORF1 (ORF2-L1s). In addition, owing to their regulatory potential, full-length (>6000 bp) non-intact L1s (FLnI-L1s) were also included in the database. Application of a novel annotation methodology, L1Xplorer, allowed in-depth annotation of functional sequence features important for L1 activity, such as transcription factor binding sites and amino acid residues. L1Base is available online at http://l1base.molgen.mpg.de. In addition, the data stored in the database can be accessed from the Ensembl web browser via a DAS service (http://l1das.molgen.mpg.de:8080/das).
Long interspersed elements (LINE-1, L1s) are the only active autonomous retrotransposons (1) in mammals, covering as much as 18% of their genomes. L1s' activity results in a great repertoire of actions, such as gene disruption (2), transcriptional regulation (3), alternative splicing (4), creation of exons and gene coding regions (5) and amplification of the processed pseudogenes and the Alu SINE family (6–8).
The full-length mammalian L1 is ~6000 bp long and is composed of a 5′-untranslated region (5′-UTR) bearing an internal promoter, two open reading frames (ORF1 and ORF2) separated by an intergenic region, and a 3′-UTR containing a poly(A) tail (1). ORF1 encodes a non-sequence-specific RNA binding protein (1), and ORF2 harbors three domains involved in L1 retrotransposition activity: endonuclease, reverse transcriptase and a 3′ terminal zinc finger-like domain (1).
The great majority of L1 insertions are truncated in the 5′ region and/or contain various insertions/deletions. Still, it is the full-length L1s with two intact ORFs, an intact 5′-UTR-located internal promoter and an intact 3′-UTR (FLI-L1s) that are the most likely to display retrotransposition activity. Interestingly, it has recently been proposed that L1 insertions containing a disrupted ORF1 but an intact ORF2 (ORF2-L1s) may be competent for the mobilization of Alu sequences (7).
Another class of L1s contributing to genomic content and functionality comprises the full-length L1s that are non-intact owing to multiple mutations and are therefore retrotransposition-inactive (FLnI-L1s). A population of these may have retained the ability to be expressed and, although at a low frequency (9), could be retrotransposed by the proteins encoded by retrotransposition-active FLI-L1s. Part of their regulatory potential lies in the antisense promoter located in the 5′-UTR, which, when intact, may be capable of guiding the expression of many genes (10).
It is therefore an important task to identify and functionally characterize the FLI-L1s, ORF2-L1s and FLnI-L1s residing in mammalian genomes. This task can only be accomplished when detailed analyses of the conservation of sites known to be important for L1 activity can be carried out on a genomic scale. With this motivation we built L1Base.
To identify and functionally annotate the two types of putatively active L1 insertions residing in mammalian genomes, FLI-L1s and ORF2-L1s, we created and applied a novel annotation methodology, L1Xplorer. Briefly, L1Xplorer is a suite of Perl scripts designed to detect L1 insertions either by performing genomic BLAST searches (11) with an L1 template sequence as a query (i.e. Homo sapiens L1.2, gi: M80343) or by analyzing the RepeatMasker annotation [provided by Ensembl (12)]. During a series of tests, we established that the sensitivity of BLAST searches (BLAST parameters: -p blastall –f F, E-value threshold of E−10) is sufficient when mining for L1s harboring intact ORFs. After extracting the genomic region corresponding to an L1 insertion, L1Xplorer detects the two LINE-1 ORFs, checks their intactness and recognizes a number of experimentally characterized features important for activity on the LINE-1 sequence (such as transcription factor binding sites) using HMM profiles (HMMER versions 1.8.4 and 2.3.2) (13), the TFASTX program of the FASTA suite (14) and ClustalW (15) alignments. In addition, it carries out family classification based on diagnostic residues located in the 5′- and 3′-UTRs. Supplementary Table 1 lists the recognized sites, including specific classifications for human, mouse and rat L1 elements.
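The notion of ORF "intactness" can be illustrated with a minimal Python sketch. It is not the L1Xplorer implementation (which is written in Perl and relies on HMM profiles, TFASTX and ClustalW); the function name and the criteria used here (ATG start, length divisible by three, terminal stop codon, no premature in-frame stop codon) are simplifying assumptions made for illustration only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_is_intact(seq: str, start: int, end: int) -> bool:
    """Return True if seq[start:end] reads as an uninterrupted ORF.

    Assumed (hypothetical) criteria: the region starts with ATG, its
    length is a multiple of three, it ends with a stop codon, and no
    stop codon occurs in frame before the last codon.
    """
    orf = seq[start:end].upper()
    if len(orf) < 6 or len(orf) % 3 != 0:
        return False
    # Split the region into consecutive in-frame codons.
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Any stop codon before the final one is a premature stop.
    return not any(codon in STOP_CODONS for codon in codons[:-1])
```

In this sketch, coordinates are 0-based and end-exclusive, so `orf_is_intact(sequence, orf_start, orf_end)` examines exactly the slice `sequence[orf_start:orf_end]`.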
The database contains sequences, along with the annotations produced by L1Xplorer, for three classes of L1 elements residing in the human, mouse and rat genomes: (i) putatively active FLI-L1s identified by L1Xplorer, (ii) ORF2-L1s and (iii) full-length (>6000 bp) L1s identified by RepeatMasker and classified by L1Xplorer as non-intact and retrotransposition-inactive (FLnI-L1s).
The functional annotation of the LINE-1 loci produced by L1Xplorer was further complemented by SNP (dbSNP) (16), repeat [RepeatMasker (17)] and coding gene annotation, as available in a recent version of Ensembl (12) (H.sapiens, Ensembl v23.34e; Mus musculus, Ensembl v24.33; and Rattus norvegicus, Ensembl v23.3c). Table 1 contains a summary of the current L1Base content.
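The three classes can be summarized as a simple decision rule. The Python sketch below is an illustration under stated assumptions, not part of L1Xplorer: the function name and arguments are hypothetical, and the >6000 bp threshold is applied only to the non-intact class, because that is the only class for which a length cutoff is given explicitly.

```python
from typing import Optional


def classify_l1(length_bp: int, orf1_intact: bool, orf2_intact: bool,
                min_full_length: int = 6000) -> Optional[str]:
    """Assign an L1 insertion to one of the three L1Base classes.

    Hypothetical helper: returns "FLI-L1", "ORF2-L1", "FLnI-L1",
    or None when the element belongs to none of the classes.
    """
    if orf1_intact and orf2_intact:
        return "FLI-L1"    # both ORFs intact
    if orf2_intact:
        return "ORF2-L1"   # intact ORF2, disrupted ORF1
    if length_bp > min_full_length:
        return "FLnI-L1"   # full length, but ORF2 is not intact
    return None            # short and non-intact: not stored
```

For example, `classify_l1(6100, False, True)` yields `"ORF2-L1"`, whereas a 6100 bp element with a disrupted ORF2 is assigned to `"FLnI-L1"`.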
L1Base can be searched via the MySQL-driven query system by using criteria, such as conservation of the functional sites important for activity (for details see the Supplementary Table 1), chromosomal localization and families.
A user can take advantage of MySQL regular expressions and Boolean AND/OR operators to compose complex queries. The database can also be searched by executing Blastn-based (11) queries with a user-specified L1 sequence. A detailed display mode (DDM), activated each time a user points to any particular result of a query, allows for an easy identification of all annotated features on LINE-1 sequence via a graphical interface utilizing color-coding schemes.
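A compound query combining a regular expression with Boolean operators might look like the following Python sketch. Everything in it is an assumption made for illustration: the table name, column names and rows are invented toy data rather than the L1Base schema, and an in-memory SQLite database with a user-defined REGEXP function stands in for the MySQL back end.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Back SQLite's REGEXP operator with Python's re.search."""
    return 1 if value is not None and re.search(pattern, value) else 0


conn = sqlite3.connect(":memory:")
# SQLite evaluates "X REGEXP Y" by calling regexp(Y, X).
conn.create_function("REGEXP", 2, regexp)

# Hypothetical schema and toy rows, not the real L1Base tables.
conn.execute(
    "CREATE TABLE l1_entries "
    "(id INTEGER, chromosome TEXT, family TEXT, orf1_intact INTEGER)"
)
conn.executemany(
    "INSERT INTO l1_entries VALUES (?, ?, ?, ?)",
    [
        (1, "X", "Ta-1", 1),
        (2, "4", "Ta-0", 1),
        (3, "X", "PA2", 0),
        (4, "7", "Ta-1", 0),
    ],
)

# Family matches a regular expression AND
# (element lies on chromosome X OR its ORF1 is intact).
rows = conn.execute(
    "SELECT id FROM l1_entries "
    "WHERE family REGEXP ? AND (chromosome = ? OR orf1_intact = 1) "
    "ORDER BY id",
    ("^Ta", "X"),
).fetchall()
matching_ids = [row[0] for row in rows]
```

Passing the pattern and the chromosome as bound parameters, rather than interpolating them into the SQL string, keeps user-supplied search terms from altering the query itself.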
The results of queries can be exported in comma-separated values (CSV), FASTA and GenBank formats. While in the DDM, the database entry can be exported to FASTA and TinySeq XML formats.
Each entry of the database is cross-linked via HTML to the Ensembl genome web browser (12).
The L1Base is freely available through http://l1base.molgen.mpg.de. In addition, the annotation data stored in the L1Base can be accessed from the Ensembl genome web browser via a DAS service (l1das.molgen.mpg.de:8080/das).
Mouse LINE-1s are characterized by the presence of a variable number of 200 bp long repeats, called monomers, in the 5′-UTR region. It has been shown that monomers possess promoter activity and that increasing their number increases the level of transcription (18,19). Therefore, using the number of monomers as a criterion, we searched L1Base for putatively highly expressed full-length, intact L1s (FLI-L1s) belonging to the young G(F) subfamily (20) (query executed using the regular expression ‘G.F-monomer*’, with the ‘Search Monomers’ option selected). As a result, L1s with the following L1Base IDs (FLI-L1 Database): 692, 636 and 113 were identified as the top three hits, with ID 692 having as many as 13 monomers.
LINE-1 elements are autonomously active and cover a substantial fraction (up to 18%) of mammalian genomes. Given their extent, a precise and full annotation of their specific features and subclasses is helpful for a better understanding of mammalian genomes, their evolution and encoded activities (e.g. promoter activity in the case study). We plan to extend our annotation to other mammalian genomes, exploiting the ongoing progress of sequencing projects (e.g. by including the chimpanzee and cat genomes). Meanwhile, the database will be improved with respect to the total number of features annotated on rodent L1 sequences. Finally, we aim for automatic updates of L1Base.
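The ranking step of this case study can be sketched in Python. The sketch assumes that each annotated monomer is stored as one feature label per entry and that G(F) monomers carry a label such as `G_F-monomer`; both the data layout and the label spelling are hypothetical, and only the regular expression is taken from the query above.

```python
import re
from collections import Counter

# Regular expression quoted in the case study.
MONOMER_PATTERN = re.compile(r"G.F-monomer*")


def rank_by_monomer_count(features, top_n=3):
    """Rank entries by the number of matching monomer features.

    `features` is an iterable of (entry_id, feature_label) pairs;
    this layout is an assumption, not the L1Base schema.
    Returns up to `top_n` (entry_id, count) pairs, highest first.
    """
    counts = Counter(
        entry_id
        for entry_id, label in features
        if MONOMER_PATTERN.search(label)
    )
    return counts.most_common(top_n)
```

Calling `rank_by_monomer_count(features)` on such pairs returns the three entries with the most matching monomers, mirroring the "top three hits" reported above.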
Supplementary Material is available at NAR Online.
T.Z. thanks Prof. Martin Vingron for helpful discussions and support. This work is part of the BioSapiens project funded by the European Commission within its FP6 Programme, under the thematic area ‘Life sciences, genomics and biotechnology for health’, contract number LHSG-CT-2003-503265.