The most abundant type of DNA in the human genome consists of the four major classes of interspersed transposable elements (TEs), which together comprise ~45% of our total DNA [1]. Short interspersed repeat elements (SINEs), long interspersed repeat elements (LINEs), and retrovirus-like long terminal repeat (LTR) retrotransposons propagate by reverse transcription of an RNA intermediate, whereas DNA transposons move by a direct “cut and paste” mechanism [2]. TEs have been active in mammalian genomes for hundreds of millions of years and have profoundly shaped our genomic structure [3]. Each TE has had a distinct period of transpositional activity during which it spread through the genome, followed by inactivation and the accumulation of mutations. Both SINE and LINE transpositions have been associated with insertional mutations that cause human disease and with pseudogene formation [1]. TEs may also actively influence the expression of nearby genes, usually through the regulatory promoter and terminator sequences found in LTRs [5].
TEs in the human and other genomes have been classified in a comprehensive database called Repbase [6]. A program called RepeatMasker [7] was developed to identify all known repeat elements based on homology to the derived consensus sequences curated in Repbase. RepeatMasker has proven extremely valuable in gene identification and genome annotation, primarily by “masking” transposable elements in query sequences during homology searches so that the presence of a common transposon does not produce many spurious, biologically uninteresting matches. RepeatMasker also reports the classification, genome position, length, fragmentation, and divergence of each repeat element.
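The masking step described above can be illustrated with a minimal sketch. It assumes the repeat intervals have already been identified (for example, taken from an annotation table) and only shows how masked positions are hidden from a later homology search. The function name `mask_repeats` and the 0-based, half-open interval convention are assumptions made for this illustration; this is not RepeatMasker's own code.

```python
def mask_repeats(sequence, repeat_intervals, mask_char="N"):
    """Return `sequence` with every annotated repeat interval masked.

    `repeat_intervals` is a list of (start, end) pairs, 0-based and
    half-open. This coordinate convention is an assumption of the sketch.
    """
    masked = list(sequence)
    for start, end in repeat_intervals:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is out of range")
        # Overwrite the repeat with the mask character, preserving length
        # so that downstream coordinates remain valid.
        masked[start:end] = mask_char * (end - start)
    return "".join(masked)


# Hypothetical query with one annotated repeat at positions 2..4.
query = "ACGTACGTAC"
print(mask_repeats(query, [(2, 5)]))  # ACNNNCGTAC
```

Because the masked sequence keeps its original length, any hit found in the unmasked portion still maps back to the same coordinates in the query.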
Each copy of a particular TE in a genome is derived from an active sequence that, once transposed, has accumulated mutations randomly and independently of other copies [3]. Consensus sequences of the original active copies, found in Repbase [8], have been derived from multiple sequence alignments of the present-day diverged copies. The age of an element can be inferred from the average sequence divergence of its copies from the consensus sequence, and such classification has been applied to both Alu [9] and L1 [11] elements, permitting assignment of approximate ages [3]. However, these divergence-based classifications are limited by the assumption that the mutation rate, or molecular clock, has been constant both over time and between the different classes of transposable elements [12]. Substitution rates depend on the original sequence of the element, especially its CpG frequency, because CpG dinucleotides mutate at a higher rate. Substitution rates are also known to change significantly during evolution and to differ between species, between chromosomes of the same species, and even between regions of the same chromosome [14]. Furthermore, the variance in percent divergence within a TE family depends on both the length and the age of the element. Hence, while age estimates for younger TE subfamilies have been presented [9], this has not been possible for older, more diverged elements.
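To make the dependence on the molecular clock concrete, the following is a minimal sketch of a divergence-based age estimate. It assumes a Jukes-Cantor correction for multiple substitutions and a single constant substitution rate. Both are assumptions of this illustration rather than methods taken from the cited studies, and the rate and divergence values are placeholders, not measurements.

```python
import math


def jukes_cantor_distance(p):
    """Convert an observed mismatch fraction p into substitutions per site.

    Uses the Jukes-Cantor (JC69) correction, which is defined for p < 0.75.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("mismatch fraction must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def estimate_age_myr(divergences, rate_per_site_per_year):
    """Estimate element age in millions of years from copy divergences.

    `divergences` holds the mismatch fraction of each copy relative to the
    consensus; `rate_per_site_per_year` is the ASSUMED constant clock rate.
    """
    mean_p = sum(divergences) / len(divergences)
    return jukes_cantor_distance(mean_p) / rate_per_site_per_year / 1e6


# Placeholder inputs: three copies, each about 10% diverged from consensus.
copies = [0.09, 0.10, 0.11]
for rate in (2e-9, 1e-9):  # two hypothetical clock rates
    print(rate, round(estimate_age_myr(copies, rate), 1))
```

Halving the assumed rate doubles the estimated age, which is why any uncertainty in the clock translates directly into uncertainty in divergence-based dating.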
Nevertheless, the apparent age of TEs is increasingly being used to obtain reference points in phylogenomic analysis [17]. Schueler et al. relied on the relative ages of LINE elements to date different parts of the human X chromosome centromeric alpha satellite arrays [18]. Specific insertions of MLT1A0 and L1MA9 elements were used as evidence for the sister–taxon relationship of primates and rodents [20]. Recently, evidence has been presented that some individual TEs have been exapted for use as conserved, functional, noncoding elements in mammalian genomes, which places these particular elements under selective pressure [22].
This study presents a novel genomic analysis of TE evolution and its impact on genomic organization, which should greatly facilitate the use of TEs in phylogenomics. A genome-wide defragmentation of TEs in the human and other mammalian genomes was performed, and the number of times that each TE has inserted into every other TE was compiled in a matrix. A computational method was then developed that uses the age information implicit in these insertion patterns to determine the relative chronological age of TEs in the human and other genomes over a span of more than 100 million years, independent of sequence divergence and the molecular clock. This method confirms the relative ages of TEs within classes, and was used to determine the relative ages of TEs between different classes and of older elements for which sequence divergence is particularly unreliable. This study also provides a methodological framework for analyzing the patterns of interruption of TEs by TEs on a genome-wide level, a large and essentially untapped genomic dataset of fundamental importance for TE classification and organization. The data and analysis tools supplied here provide a rich source of genomic information for data mining to further explore transposon biology and genome evolution.
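The idea of reading relative age from insertion patterns can be sketched on a toy example. An element can only insert into elements that were already present in the genome, so an insertion of A into B suggests that A is younger than B. The sketch below searches by brute force for the ordering that contradicts the fewest observed insertions. This exhaustive search is a stand-in chosen for clarity, not the computational method developed in this study, and the element names and counts are invented for illustration.

```python
from itertools import permutations


def violations(order, counts):
    """Count insertions that contradict `order` (listed oldest to youngest).

    counts[(a, b)] is the number of times element a is found inserted into
    element b. Such an insertion is consistent with `order` only if a is
    younger than b, that is, if a appears later in `order`.
    """
    rank = {name: i for i, name in enumerate(order)}
    return sum(n for (a, b), n in counts.items() if rank[a] < rank[b])


def best_order(names, counts):
    """Return the oldest-to-youngest ordering with the fewest violations.

    Exhaustive search over all permutations: feasible only for a handful
    of elements, since the number of orderings grows factorially.
    """
    return min(permutations(names), key=lambda order: violations(order, counts))


# Invented toy matrix for three hypothetical elements.
toy_counts = {
    ("young", "old"): 40,
    ("young", "middle"): 25,
    ("middle", "old"): 30,
    ("old", "middle"): 3,    # a few insertions in the "wrong" direction
    ("middle", "young"): 1,
}
order = best_order(["young", "middle", "old"], toy_counts)
print(order, violations(order, toy_counts))  # ('old', 'middle', 'young') 4
```

A small number of contradicting insertions can remain even under the best ordering, for instance if the activity periods of two elements overlapped or if some fragments were misannotated. The sketch tolerates this by minimizing contradictions instead of requiring none.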