The mammalian centromeric protein CENP-B shares significant sequence similarity with three proteins in fission yeast (Abp1, Cbh1 and Cbh2) that also bind centromeres and have essential functions in chromosome segregation and centromeric heterochromatin formation. Each of these proteins displays extensive sequence similarity with pogo-like transposases, which have been previously identified in the genomes of various insects and vertebrates, in the protozoan Entamoeba and in plants. Based on this distribution, it has been proposed that the mammalian and fission yeast centromeric proteins are derived from ‘domesticated’ pogo-like transposons. Here we took advantage of the vast amount of sequence information that has recently become available for a wide range of fungal and animal species to investigate the origin of the mammalian CENP-B and yeast CENP-B-like genes. A highly conserved ortholog of CENP-B was detected in 31 species of mammals, including opossum and platypus, but was absent from all non-mammalian species represented in the databases. Similarly, no ortholog of the fission yeast centromeric proteins was identified in any of the various fungal genomes currently available. In contrast, we discovered a plethora of novel pogo-like transposons in diverse invertebrates and vertebrates and in several filamentous fungi. Phylogenetic analysis revealed that the mammalian and fission yeast CENP-B proteins fall into two distinct monophyletic clades, each of which includes a different set of pogo-like transposons. These results are most parsimoniously explained by independent domestication events of pogo-like transposases into centromeric proteins in the mammalian and fission yeast lineages, a case of ‘convergent domestication’. These findings highlight the propensity of transposases to give rise to new host proteins and the potential of transposons as sources of genetic innovation.
The origin of new genes is key to our understanding of how genomes evolve and how new biological functions emerge. The most intensively studied mechanism underlying the evolution of new genes involves the duplication or rearrangement of pre-existing genes or exons (Long et al. 2003). Gene duplication may occur at the DNA level, through segmental or whole-genome duplication, or at the RNA level through retroposition, a process by which messenger RNA is reverse-transcribed into a DNA copy that is reintegrated elsewhere in the genome. A distinct, less characterized mechanism for the emergence of new genes is the recycling of sequences and activities previously encoded by transposable elements (TEs), also known as TE ‘domestication’ or ‘exaptation’ (Brosius and Gould 1992; Miller et al. 1999; Volff 2006).
TEs are selfish mobile genetic elements, and their genes encode proteins that normally serve only their propagation. On some occasions, however, TE genes can be co-opted or ‘domesticated’ by the host to assume cellular function, a form of ‘exaptation’ (Brosius and Gould 1992). While several cases of exaptation of TE coding sequences have been documented, very few of these TE-derived proteins have been characterized functionally, and thus, for the most part, their biological functions remain unknown (Brosius 1999; Miller et al. 1999; Smit 1999; Britten 2004; Volff 2006; Feschotte and Pritham 2007). Another open question is whether TE domestication is merely an evolutionary incident, resulting from the sheer abundance and nearly ubiquitous nature of TEs in eukaryotic genomes, or whether certain TEs possess intrinsic properties that enhance exaptation and their recycling into functional components of the genome (Cordaux et al. 2006; Feschotte and Pritham 2007).
One of the earliest documented cases of TE exaptation is the gene encoding human centromere-associated protein B (CENP-B) (for review, Masumoto, Nakano, and Ohzeki 2004). CENP-B encodes a ~599-aa protein that localizes densely at the centromere of all human chromosomes, except the Y chromosome (Earnshaw et al. 1987; Earnshaw, Ratrie, and Stetten 1989; Yoda et al. 1992). The CENP-B protein binds as a homodimer specifically to a 17-bp motif called the CENP-B box located within alpha-satellite centromeric DNA (Masumoto et al. 1989; Yoda et al. 1992; Tanaka et al. 2001). CENP-B and the CENP-B box appear to be highly conserved throughout mammals. The mouse homologous protein is 92% identical to human CENP-B and is associated with centromeric satellites through binding of a DNA motif highly similar to the human CENP-B box (Sullivan and Glass 1991; Kipling et al. 1995). Sequences displaying high sequence identity to the human CENP-B have also been isolated in hamster, sheep and in several primates (Haaf et al. 1995; Burkin et al. 1996; Goldberg et al. 1996; Yoda et al. 1996; Bejarano and Valdivia 1996). More recently, a DNA motif similar in sequence to the CENP-B box was identified within the centromeric satellite repeats of the marsupial Macropus rufogriseus (Bulazel et al. 2006). A fragment containing the motif was bound in vitro by recombinant human CENP-B protein, suggesting that it represents a binding site for an as yet uncharacterized marsupial CENP-B homolog (Bulazel et al. 2006). There is no convincing report of the isolation of a CENP-B homolog outside mammals, although some have reported the presence of motifs weakly similar to the CENP-B box in the satellite DNA repeats of Xenopus, insects and plants (Coelho et al. 1996; Lopez and Edstrom 1998; Weide et al. 1998; Heslop-Harrison et al. 1999; Nonomura and Kurata 1999; Lorite et al. 2004; Mravinac, Plohl, and Ugarkovic 2004; Edwards and Murray 2005).
Despite the apparent selective constraint acting on CENP-B and the CENP-B box in diverse mammals, its exact function at the centromere remains unclear and even controversial, since a mouse null mutant for Cenp-b exhibits no obvious defects in chromosome segregation and only weak phenotypic abnormalities (Hudson et al. 1998; Kapoor et al. 1998; Perez-Castro et al. 1998; Fowler et al. 2000). Thus, it remains unclear whether CENP-B is involved in chromosome segregation.
It was initially noted that CENP-B displays significant similarity throughout its entire sequence with the transposase encoded by the pogo element of Drosophila melanogaster (Tudor et al. 1992). This relationship was later confirmed through the discovery and analysis of distantly related pogo-like elements in human and Arabidopsis (Robertson 1996; Smit and Riggs 1996; Kapitonov and Jurka 1999; Feschotte and Mouchès 2000). These elements are DNA transposons that form a monophyletic subgroup within the extended Tc1/mariner superfamily (Capy et al. 1998; Plasterk, Izsvák, and Ivics 1999). Since the taxonomic distribution of CENP-B was apparently narrower than that of pogo-like elements and other Tc1/mariner transposons, which are widespread in eukaryotes, it has been hypothesized that CENP-B arose from domestication of a pogo-like transposon (Smit and Riggs 1996; Kipling and Warburton 1997).
Subsequent to the isolation of CENP-B in human and mouse, three centromere-binding proteins were identified in the fission yeast Schizosaccharomyces pombe, ARS-binding protein (Abp1), CENP-B homolog 1 (Cbh1) and CENP-B homolog 2 (Cbh2), that share significant sequence similarity to each other and to mammalian CENP-B (Murakami, Huberman, and Hurwitz 1996b; Lee, Huberman, and Hurwitz 1997; Irelan, Gutkin, and Clarke 2001). Abp1, Cbh1 and Cbh2 have been shown to localize and bind distinct degenerate DNA motifs within the centromeres of S. pombe chromosomes (Halverson et al. 1997; Lee, Huberman, and Hurwitz 1997; Ngan and Clarke 1997; Irelan, Gutkin, and Clarke 2001; Nakagawa et al. 2002). In addition, genetic and biochemical analysis indicates that all three proteins have partially redundant function required for centromeric heterochromatin assembly and chromosome segregation in this organism (Baum and Clarke 2000; Irelan, Gutkin, and Clarke 2001; Nakagawa et al. 2002). It has been proposed that the fission yeast proteins represent functional homologs of the mammalian CENP-B. Studies of these proteins in fission yeast are often used as a model to better understand the function of CENP-B in human centromeric function (e.g. Irelan, Gutkin, and Clarke 2001; Nakagawa et al. 2002; Amor et al. 2004), with the underlying, but undemonstrated, assumption that they are orthologous.
The apparent phenotypic redundancy of the three fission yeast CENP-B homologs has led to the hypothesis that the lack of phenotypic deficiencies in mouse cenp-b mutants might stem from the presence in the mouse genome of genes encoding proteins functionally redundant with CENP-B (Hudson et al. 1998; Kapoor et al. 1998; Irelan, Gutkin, and Clarke 2001; Nakagawa et al. 2002). Although several genes encoding proteins distantly related to CENP-B have been identified in mammals, (Toth et al. 1995; Smit and Riggs 1996; Kipling and Warburton 1997; Zeng et al. 1997; Dou et al. 2004), it is unknown whether any of these proteins act redundantly with CENP-B. Furthermore, the evolutionary relationships of these proteins with mammalian CENP-B or with the fission yeast homologs remain unclear.
The observations summarized above raise two important questions. First, do the fission yeast proteins and mammalian CENP-B descend from a common CENP-B-like ancestor, i.e. are they orthologous, or did they arise independently from domestication events of distinct pogo-like transposases? Second, are there any paralogous genes in mammals that could be functionally redundant with CENP-B, as demonstrated for the S. pombe CENP-B-like proteins? To begin answering these questions, we capitalized on the vast amount of sequence information that recently became available to investigate the origin and evolutionary relationships of the mammalian CENP-B, fission yeast ‘homologs’ and pogo-like transposons. The results point to a remarkable scenario of convergent TE domestication, whereby two different sources of pogo-like transposase were independently recruited into centromere-binding proteins in the mammalian and fission yeast lineages.
In order to retrieve all sequences related to mammalian CENP-B and S. pombe Abp1, Cbh1 and Cbh2 proteins, we used these four proteins, as well as representative pogo-like transposases obtained from GenBank and Repbase (Jurka et al. 2005), to perform exhaustive and reiterative similarity searches of all NCBI protein and nucleotide databases using PSI-BLAST and TBLASTN, respectively. All sequences returned that had more than 30% identity to any of the queries were parsed and retained for further analysis. We next sought to distinguish which of these proteins are encoded by pogo-like transposons and which are encoded as stationary, ‘domesticated’ genes by the host. pogo-like elements, like most DNA (class 2) transposons, are characterized by terminal inverted repeats (TIRs) and conserved target site duplications (TSDs) (TA for pogo-like elements). We thus inspected the regions flanking each sequence retrieved from the databases for TIRs and TSDs. This procedure allowed us to identify numerous novel pogo-like transposons, which were clustered into families in which individual copies shared extensive nucleotide similarity (>80% over their entire length). To assess syntenic relationships among mammalian pogo-derived genes, we used the pairwise “chained” alignments available at the UCSC Genome Browser Database (http://genome.ucsc.edu/). Orthology was also confirmed by observing the synteny conservation of genes flanking CENP-B and other pogo-derived genes. Accession numbers of newly described pogo-derived genes and pogo-like transposons are reported in table 1 and table 2, respectively.
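The TIR and TSD inspection described above can be illustrated with a minimal Python sketch. It is not the procedure actually used in this study: the function names, the 20-bp TIR window, the 80% TIR identity cutoff and the ungapped comparison are all assumptions chosen for illustration, and the element coordinates are assumed to be known already.

```python
# Illustrative sketch only: names, window size and cutoff are hypothetical.

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def tir_identity(element, tir_len=20):
    """Ungapped identity between the 5' end of the element and the
    reverse complement of its 3' end, over tir_len positions."""
    left = element[:tir_len].upper()
    right = reverse_complement(element[-tir_len:])
    matches = sum(a == b for a, b in zip(left, right))
    return matches / tir_len


def has_ta_tsd(genomic, start, end):
    """True if the element at genomic[start:end] is immediately flanked
    by a TA dinucleotide on both sides (the TSD of pogo-like elements)."""
    if start < 2 or end + 2 > len(genomic):
        return False
    return (genomic[start - 2:start].upper() == "TA"
            and genomic[end:end + 2].upper() == "TA")


def looks_like_transposon(genomic, start, end, tir_len=20, min_identity=0.8):
    """Combine both criteria; the 0.8 cutoff is an arbitrary assumption."""
    element = genomic[start:end]
    return (tir_identity(element, tir_len) >= min_identity
            and has_ta_tsd(genomic, start, end))
```

A candidate that passes both checks would still need manual inspection, since degenerate TIRs or imprecise element boundaries would defeat such a simple ungapped comparison.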
Accession numbers of CENP-B orthologs used in supplementary figures 1 and 2 are: Ateles geoffroyi GI:145279281; Aotus nancymaae GI:74096596; Bos taurus GI:112152000; Callithrix jacchus GI:74217361; Canis familiaris GI:63121068; Cavia porcellus GI:78823234; Chlorocebus aethiops GI:78097410; Colobus guereza GI:82491700; Cricetulus griseus, GI:836955; Dasypus novemcinctus GI:64743060; Equus caballus GI:124093125; Erinaceus europaeus GI:87976610; Felis catus GI:94069308; Homo sapiens GI:148138332; Lemur catta GI:83745227; Macaca mulatta GI:86636346; Monodelphis domestica GI:84821469; Muntiacus reevesi GI:151933349; Mus musculus GI:69973226; Myotis lucifugus GI:105819785; Ornithorhynchus anatinus GI:91357513; Oryctolagus cuniculus GI:63919666; Otolemur garnettii GI:106166469; Ovis aries GI:1016291; Pan troglodytes GI:89627452; Papio anubis GI:78097412; Pongo pygmaeus GI:83745199; Rattus norvegicus GI:32740592; Saimiri boliviensis boliviensis GI:60418061; Spermophilus tridecemlineatus GI:107608120; Tupaia belangeri GI:107987561.
Pairwise alignments were constructed using ClustalX (Chenna et al. 2003) and MAFFT (Katoh et al. 2005), and manually refined using BioEdit (Hall 1999) and GeneDoc v2.6.002 (http://www.psc.edu/biomed/genedoc). K-estimator (Comeron 1999) was used to tabulate the number of non-synonymous sites and substitutions as well as frequencies for each of three classes of synonymous sites and substitutions (2-S, 2-V, and 4-fold). Using these data, we calculated dN/dS after the method of Pamilo and Bianchi (Pamilo and Bianchi 1993). For the maximum likelihood method, we utilized the free ratio branch model of the codeml program in the PAML suite (Yang 1997). We confined this analysis to full length sequences comprising unambiguous ORFs, using the following input tree, aided by Treeview: ((((CENP-B_hs, CENP-B_Papio_anubis), (CENP-B_Ateles_geoffroyi,CENP-B_Aotus_nancymaae, CENP-B_Callithrix_jacchus)), CENP-B_Lemur_catta), ((CENP-B_mus,CENP-B_rat), CENP-B_hamster), CENP-B_opo). To formally test whether the observed ω on each individual branch was significantly < 1, we ran a series of models in which all branches were free, except 1, which was constrained to ω=1. Likelihood ratio test statistics were then compared to a chi-squared distribution with 1 degree of freedom. The neighbor-joining tree in supplementary figure 1 was obtained using MEGA 3.1 (Kumar, Tamura, and Nei 2004).
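As an illustration of the likelihood ratio test described above, the following Python sketch computes the test statistic and its p value from two log-likelihoods, one from the model in which all branches are free and one from the model in which a single branch is constrained to ω=1. The function name and the example log-likelihood values are hypothetical; in practice the log-likelihoods would be read from codeml output.

```python
import math
from typing import Tuple


def lrt_pvalue_1df(lnl_free: float, lnl_constrained: float) -> Tuple[float, float]:
    """Likelihood ratio test with 1 degree of freedom.

    lnl_free:        log-likelihood with all branch omegas free
    lnl_constrained: log-likelihood with one branch fixed at omega = 1
    Returns (test statistic, p value).
    """
    # The statistic is twice the log-likelihood difference; clamp small
    # negative values that can arise from optimizer noise.
    stat = max(2.0 * (lnl_free - lnl_constrained), 0.0)
    # For a chi-squared variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods, for illustration only.
stat, p = lrt_pvalue_1df(-12345.6, -12360.1)
print(f"2*deltaLnL = {stat:.2f}, p = {p:.2e}")
```

The closed form with `math.erfc` holds only for 1 degree of freedom, which matches the single constrained branch in each comparison; a general chi-squared survival function would be needed otherwise.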
Phylogenetic trees of the pogo family were inferred with MrBayes (Ronquist and Huelsenbeck 2003), applying a mixed amino acid model with a discrete gamma-distribution with four rate categories and random starting trees. Two independent runs with four Markov chains each were run for one million generations with a sampling frequency set to 100. When the standard deviation of split frequencies was <0.01, we considered the two runs converged. The temperature difference between the ‘cold’ chain and the ‘heated’ chains was set to 0.1 to improve chain swapping. For the consensus tree the “burnin” parameter was set to 25% of the samples. We also tested a slightly different approach to infer the phylogenetic tree of the pogo family. First, we determined the most likely amino acid model of evolution on the final multialignment of 65 protein sequences using ProtTest1.3 (Abascal, Zardoya, and Posada 2005). ProtTest implements the Akaike Information Criterion and the Bayesian Information Criterion to establish the likelihood of up to ten different models, and produces a maximum-likelihood phylogenetic tree according to the best model. Then, we executed MrBayes using as initial tree the maximum-likelihood tree instead of random trees for both runs, and fixing the model to the best fitting one according to ProtTest. We observed no significant difference between the trees obtained with the two methods.
To identify sequences related to CENP-B, we performed reiterative PSI-BLAST and TBLASTN searches of all NCBI protein and translated nucleotide databases using the human CENP-B protein as the initial query (see Methods). These searches yielded several hundred significant hits that fell into two distinct categories. The first category of hits exhibited amino acid identity >70% over >100 amino acids and all were of mammalian origin. The second category of hits had less than 35% amino acid identity and originated from various organisms. The restricted taxonomic distribution of the high-similarity hits, coupled to the absence of hits with intermediate levels of sequence identity, intuitively suggested that the first category of sequences represented CENP-B orthologs. This assumption was subsequently corroborated by phylogenetic analyses (see suppl. fig. 1 and below) and by syntenic relationship across the genomes of human, mouse, rat, dog and opossum (see Methods). Together these data revealed the presence of a CENP-B ortholog in 31 mammalian species (fig. 1). Reciprocal and reiterative blast searches with each query against the databases revealed only one sequence with high sequence identity (74–100% identical over 100 amino acids) to CENP-B in each mammalian species represented in the database, which includes 27 species with genome sequencing projects completed or nearing completion. For 10 of these mammalian species, high coverage, assembled genome data are currently available. Thus, CENP-B appears to be present as a single-copy gene in all mammalian species examined, including the nonplacental species (opossum and platypus). However, because some species have only partial genome coverage, we cannot exclude the possibility that they possess one or several recently generated CENP-B paralog(s).
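The two categories of hits described above can be expressed as a simple decision rule. The Python sketch below is only an illustration of the thresholds quoted in the text (identity >70% over >100 amino acids, versus identity below 35%); the function name and category labels are hypothetical, and hits falling between the two thresholds, which were not observed here, are reported separately.

```python
def classify_hit(identity_pct, aln_len):
    """Assign a BLAST hit to one of the categories described in the text.

    identity_pct: percent amino acid identity to the CENP-B query
    aln_len:      alignment length in amino acids
    """
    if identity_pct > 70.0 and aln_len > 100:
        return "ortholog_candidate"   # first category: all mammalian
    if identity_pct < 35.0:
        return "distant_homolog"      # second category: various organisms
    return "intermediate"             # none were observed in this study


# Hypothetical hits given as (identity %, alignment length).
for hit in [(92.0, 599), (28.0, 410), (50.0, 300)]:
    print(hit, classify_hit(*hit))
```

Such a rule only sorts candidates; as stated above, orthology was then corroborated by phylogenetic analysis and synteny.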
Out of the 31 mammalian CENP-B orthologs detected, 28 genes harbor a single uninterrupted ORF, while 3 genes exhibit at least one frameshift mutation that could not be attributed to obvious sequencing errors based on the available sequence data (indicated by an unshaded symbol in fig. 1). To confirm that the 28 apparently intact orthologs are extant, functional genes, we sought evidence of selective constraint, in the form of ω (dN/dS), by both a maximum likelihood method (Yang 1997) and a counting method using K-estimator (Comeron 1999) (see Methods). Both analyses produced evidence of strong purifying selection acting in all mammalian lineages, with ω estimates no greater than 0.21. All ω estimates were significantly lower than 1 (p values < 0.00001, Likelihood Ratio Test for PAML, and p values < 0.002, z-test, for counting method) and the tree-wide ω estimate was 0.078. Similar ω estimates were obtained for the 3 ambiguous genes noted above, suggesting that the apparent frameshift mutations in these ORFs are either sequencing artifacts or that these genes underwent independent and recent pseudogenization events. However, given the omnipresence of CENP-B and the strong selective constraint operating on the encoded protein in all the mammalian lineages examined, the hypothesis that CENP-B would be lost independently in these three species seems improbable.
In summary, CENP-B appears to be present as a functional, single-copy gene in most (if not all) mammalian species examined (fig. 1). The gene has followed a similar pattern of evolution in all lineages, characterized by strong purifying selection. The mark of selection is particularly intense on the N-terminal DNA-binding domain (Tanaka et al. 2001), which is nearly identical in the 17 mammalian species where sequence coverage is available for this region (see alignment in suppl. fig. 2). We could not detect any potential CENP-B ortholog in any non-mammalian vertebrates, including three tetrapod species for which draft genome assemblies are available (chicken, the squamate Anolis carolinensis, and the amphibian Xenopus tropicalis). Likewise, we could not detect a direct homolog of CENP-B in any of the numerous and diverse invertebrates with draft genome assemblies (including one echinoderm, 3 ascidians, 12 Drosophila, 3 mosquitoes, one beetle, one lepidopteran, 3 nematodes, 2 flatworms and 1 cnidarian). Given the extreme level of sequence conservation of CENP-B in mammals, it seems inconceivable that homology-based searches would systematically fail to identify a possible ortholog in every one of these animals. Therefore, CENP-B must be taxonomically restricted to mammals. These data indicate that CENP-B originated prior to the split of monotremes, marsupials and placentals and, most likely, in the last common ancestor of these lineages.
The second category of BLAST hits to CENP-B in mammalian genomes appears to be the result of sequence similarity to distantly related transposase pseudogenes (Tigger elements, see next section)(Smit and Riggs 1996) or to several additional ‘host’ genes with distant homology to CENP-B. Besides CENP-B, nine orthologous clusters of pogo-derived genes can be recognized in mammals: Tigger-derived genes 1–7 (TIGD1-7) (Robertson 2002; Dou et al. 2004), JERKY (JRK) (Toth et al. 1995) and Jerky-like (JRKL) (Zeng et al. 1997) (see table 1). Only the mouse jerky gene has been functionally characterized. Disruption of this gene in mice causes a phenotype that is characterized by recurrent limbic seizures reminiscent of some forms of human inherited epilepsy (Toth et al. 1995). The mouse JRK protein has both DNA and RNA-binding activity and specific neuronal localization (Liu et al. 2003). Like CENP-B, JRK and other pogo-derived genes seem to be restricted to mammals. Based on the current data, the only two exceptions appear to be TIGD4 and TIGD5. We identified an ortholog of TIGD4 in Anolis carolinensis and an ortholog of TIGD5 in Xenopus tropicalis and chicken (though it may be a pseudogene in the latter species, data not shown) (table 1). Thus, TIGD4 and TIGD5 are the only two mammalian pogo-derived genes found in other vertebrates.
Overall, each of these additional pogo-derived proteins has relatively weak similarity to CENP-B (18–28% identity). TIGD3, TIGD4 and TIGD6 seem most closely related to CENP-B (see below) and, like CENP-B, they are widely distributed and highly conserved in mammals (fig. 1). Within a given mammalian genome, TIGD6 is consistently the closest relative to CENP-B, with an average pairwise amino acid identity of 28%. Phylogenetic distance analysis based on synonymous sites (Ks tree, not shown) with an application of an average neutral substitution rate of ~2.2 × 10^−9 substitutions per site per year in mammals (Kumar and Subramanian 2002) is indicative of a coalescence time congruent with a much earlier divergence of CENP-B, TIGD3, TIGD4 and TIGD6 than is suggested by their taxonomic distribution (i.e. mammalian-wide or possibly amniote-wide in the case of TIGD4). These observations are in line with the hypothesis that CENP-B and TIGD3, 4 and 6 are not paralogous genes that arose by ‘classic’ duplication of an ancestral gene. It is more probable that each of these genes originated by independent domestication of a different source of pogo-like transposase. These events must have taken place during an evolutionary timeframe ranging from the emergence of amniotes (~360 Mya) to the split of monotremes and marsupials (~230 Mya) (Hedges and Kumar 2003; van Rheede et al. 2006). This scenario is further supported by the phylogenetic placement and relationships of the encoded proteins to each other and to various pogo-like transposases (see below). Thus, it appears that pogo-like transposons were a recurrent source of new protein-coding genes in mammals.
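The reasoning behind the Ks-based dating can be sketched with the standard pairwise molecular clock relation T = Ks / (2r), where r is the neutral substitution rate per site per year and the factor 2 accounts for substitutions accumulating along both lineages. The Python sketch below assumes this relation and the rate of Kumar and Subramanian (2002), taken here as 2.2 × 10^-9; it is not necessarily the exact calculation applied to the Ks tree, and the function names are hypothetical.

```python
NEUTRAL_RATE = 2.2e-9  # substitutions per synonymous site per year (assumed)


def divergence_time_mya(ks, rate=NEUTRAL_RATE):
    """Divergence time in million years from a pairwise synonymous
    distance, assuming a strict clock: T = Ks / (2 * rate)."""
    if ks < 0 or rate <= 0:
        raise ValueError("ks must be >= 0 and rate must be > 0")
    return ks / (2.0 * rate) / 1e6


def expected_ks(time_mya, rate=NEUTRAL_RATE):
    """Pairwise synonymous distance expected after time_mya million
    years of divergence under the same clock."""
    return 2.0 * rate * time_mya * 1e6


# Ks expected if two genes had duplicated within the timeframe bounded
# by the monotreme split (~230 Mya) and the emergence of amniotes (~360 Mya).
print(round(expected_ks(230), 3), round(expected_ks(360), 3))
```

Under these assumptions, two genes that duplicated within that timeframe would be expected to differ by a Ks of roughly 1.0 to 1.6; substantially larger distances would point to an older divergence, although Ks estimates of this magnitude approach saturation and should be interpreted with caution.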
While we could not detect an ortholog of CENP-B in any of the non-mammalian metazoan species represented in the databases, we could readily identify multiple pogo-like transposons in many of these species (table 2). Some of these transposons have been previously described, such as the founding pogo element in D. melanogaster (Tudor et al. 1992) and the Tigger elements of H. sapiens (Smit and Riggs 1996; Robertson 1996), but most represent newly identified families from the lizard Anolis carolinensis, the nematode Trichinella spiralis, the sea hare Aplysia californica, the flatworm Schmidtea mediterranea, the starlet sea anemone Nematostella vectensis and several insects (table 2). Most of these elements have TIRs very similar to those of known pogo-like transposons and are flanked by canonical 5′-TA-3′ TSDs (fig. 3). Each family is represented by multiple copies dispersed throughout the respective host genome, suggesting relatively recent transpositional activity. Although the transposase sequences from different families or from different species may sometimes be highly divergent (16–52% identity, but in most cases less than 30% identity), phylogenetic analysis unequivocally places these elements within the pogo subgroup of the Tc1/mariner superfamily (fig. 2). These data underscore the high diversity and evolutionary persistence of pogo-like transposons in metazoans, and particularly in invertebrates. Moreover, the widespread distribution of pogo-like transposons in metazoans stands in contrast with the taxonomic restriction of CENP-B to mammals. These findings illustrate that highly divergent pogo-like sequences can be readily retrieved from a wide diversity of animals using conventional homology-based searches. With one exception (TIGD4 in A. carolinensis), these transposases represent the most closely related sequences to the mammalian CENP-B in each of the non-mammalian species represented in the databases.
Hence, we believe that our inability to detect CENP-B in non-mammalian species is not due to a lack of sensitivity of the BLAST algorithms, but rather reflects the absence of CENP-B orthologs in these species.
We employed the same BLAST-based strategy to identify sequences related to the fission yeast Abp1, Cbh1 and Cbh2 proteins and to pogo-like transposases in all fungal genomes available. These searches revealed several fungal sequences related to the S. pombe genes. With two exceptions (discussed hereafter), each of these sequences exhibits typical features of transposons, such as multiple interspersed copies, TIRs and TSDs (table 2). Inspection of the TIR sequences and phylogenetic analyses of the predicted transposases indicate that these transposons belong to the pogo subgroup (fig. 2). To our knowledge, these elements are the first characterized fungal pogo-like transposons sensu stricto. Indeed, these elements are clearly distinct from Fot1-like transposons, another subgroup of the Tc1/mariner superfamily that is abundant in filamentous fungi (Daboussi and Capy 2003). In fact, some genomes harbor both Fot1-like and pogo-like elements (e.g. Aspergillus), akin to the single-celled eukaryote Entamoeba (Pritham, Feschotte, and Wessler 2005). Thus, pogo-like and Fot1-like transposons diverged prior to the divergence of fungi and Entamoeba. The newly discovered fungal pogo-like transposons are the only relatives of the S. pombe CENP-B-like genes in most of the fungi for which we have genome sequence data. Conversely, S. pombe lacks detectable pogo-like transposons and, in fact, this species does not appear to contain any DNA transposons (Wood et al. 2002; Feschotte and Pritham 2007).
Neither Saccharomyces cerevisiae nor other Saccharomycetales harbor detectable pogo-like transposons. Instead, each of these yeasts possesses a single gene, first described as Pdc2 in S. cerevisiae (Mojzita and Hohmann 2006), which is distantly related to the fission yeast ‘CENP-B homologs’ (Smit and Riggs 1996). Orthologs of Pdc2 could be identified in 23 fungal species, all of which are Saccharomycetales. Several lines of evidence suggest that Pdc2 is not orthologous to the fission yeast CENP-B-like genes. First, Pdc2 functions as a transcription factor in the pyruvate decarboxylase pathway (Mojzita and Hohmann 2006) and there is no evidence that the protein has centromere-binding activity or that it plays a role in chromosome segregation. Second, the entire Pdc2 protein is about 1.5 times longer and aligns poorly with the fission yeast proteins (25% identity over 526 amino acids in the most conserved region). Pdc2 and the closely related proteins in other Saccharomycetales contain a fast-evolving C-terminal extension with essentially no homology to the fission yeast proteins (data not shown). Third, Pdc2 and the S. pombe proteins fall into two separate phylogenetic clades of pogo-derived proteins intermingled with different pogo-like transposases (see fig. 2 and further description below). Thus, the Pdc2 orthologous gene cluster appears to be derived from yet another source of pogo-like transposase than the S. pombe CENP-B-like proteins.
Two Aspergillus species possess, in addition to pogo-like transposons, a single-copy orthologous gene predicted to encode a protein related to pogo-like transposases (table 1). These two ORFs could not be associated with obvious transposon features. The protein sequences do not group with the fission yeast proteins or with the Pdc2 proteins in phylogenetic reconstructions (see below, fig. 2), suggesting that these two Aspergillus genes are unlikely to be orthologous to the fission yeast CENP-B-like genes. Most likely, they represent previously undescribed stationary pogo-derived genes with an independent origin.
Together these data indicate that the CENP-B ‘homologs’ of S. pombe are restricted to this species, or possibly to the Schizosaccharomycetales. The three fission yeast proteins share 39–47% amino acid identity over their entire length and have redundant functions in chromosome segregation, although each protein appears to bind distinct DNA sites at S. pombe centromeres (Lee, Huberman, and Hurwitz 1997; Ngan and Clarke 1997; Baum and Clarke 2000; Irelan, Gutkin, and Clarke 2001; Nakagawa et al. 2002). These data are consistent with a scenario whereby Abp1, Cbh1 and Cbh2 arose from a single domestication event of a pogo-like transposase in the lineage of S. pombe followed by ‘standard’ gene duplication and subfunctionalization (Prince and Pickett 2002).
The above data raise the question whether the fission yeast and mammalian centromere-binding proteins descend from a common ancestral gene domesticated prior to the divergence of metazoa and fungi (i.e. they are orthologous) or originated independently from different transposase sources. In light of the taxonomic distribution described above, and taking into account only the taxa wherein one or more whole-genome sequences are available, the first scenario would require a minimum of 15 independent losses of the CENP-B gene during animal and fungi evolution (10 losses in animals and 5 in fungi, see fig. 4). Although gene loss is a common process in eukaryotic evolution (Aravind et al. 2000; Krylov et al. 2003), this scenario is difficult to reconcile with the critical function of the fission yeast genes for chromosome segregation and cell cycle progression (Halverson et al. 1997; Irelan, Gutkin, and Clarke 2001; Nakagawa et al. 2002; Locovei et al. 2006). Additionally, it seems unlikely that a sequence domesticated prior to divergence of fungal and animal lineages would be so highly constrained in a single animal lineage (i.e. mammals) while independently becoming dispensable in most other animal lineages. Thus, a scenario that invokes two separate domestication events of pogo-like transposases in the lineages of fission yeast and mammals is indeed more parsimonious (fig. 4).
Two distinct predictions as to the topology of a phylogenetic tree of mammalian and fission yeast centromere-binding proteins and pogo-like transposases arise as a result of these observations.
Under the single domestication event hypothesis, one would expect that the mammalian and fission yeast CENP-B proteins will form a monophyletic clade sister to a clade of pogo-like transposases descended from the transposase source that gave rise to CENP-B in the common ancestor of fungi and animals. Alternatively, the vast evolutionary gap between mammals and yeasts might hinder the phylogenetic clustering of the two groups of proteins and obscure their relationship to each other and to a particular clade of transposases. Under the independent domestication hypothesis, the fission yeast proteins and mammalian CENP-B proteins should fall into distinct clades together with fungal and animal transposases from which they respectively derived.
To test these predictions, we selected 65 sequences from a data set of non-redundant transposases and pogo-derived proteins and built a multiple alignment using the MAFFT program (Katoh et al. 2005) (tables 1 and 2). After manually adjusting the alignment and removing poorly conserved regions that introduced long gaps, we reconstructed the evolutionary tree from our data set using a Bayesian approach (see Methods). The Bayesian tree (fig. 2) shows that the mammalian pogo-derived proteins, including CENP-B, and the fission yeast centromeric proteins Abp1, Cbh1 and Cbh2 fall into two separate clades with robust statistical support.
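The alignment was trimmed manually in this study. As a rough automated analogue of that step, the sketch below drops alignment columns in which gaps exceed a chosen fraction. The 0.5 threshold, the function name and the toy sequences are assumptions for illustration only and do not reproduce the manual curation actually performed.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Remove columns whose gap fraction exceeds max_gap_fraction.

    alignment: dict mapping sequence name to an aligned string ('-' marks a gap).
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        i for i in range(length)
        if sum(row[i] == "-" for row in rows) / len(rows) <= max_gap_fraction
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}

# Hypothetical toy alignment: columns 3 and 4 are gapped in two of three rows.
toy = {"seqA": "MK--LV", "seqB": "MR--LI", "seqC": "MKAALV"}
trimmed = drop_gappy_columns(toy)
```

A fixed threshold is only a crude stand-in for expert judgment: it ignores residue conservation and would treat a short informative indel the same way as a long unalignable region.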
As predicted under the independent domestication hypothesis, the mammalian pogo-derived proteins cluster with transposases isolated from various animal genomes (insects, nematodes, mollusks and flatworms). However, we note that the pogo-like transposases currently found in vertebrate genomes (Tiggers from mammals and amniotes and TIGGU from pufferfish) are not the most closely related to CENP-B. Tiggers and TIGGU transposases fall into a different clade (denoted ‘JR’ for JERKY-related in fig. 2) together with several newly identified transposases from various invertebrates and from the microsporidian Nosema bombycis and several pogo-derived proteins (JRK, JRKL and TIGD2, 5 and 7 in one subclade and TIGD1 in a second subclade; see fig. 2, tables 1 and 2). This topology suggests that there are two major, anciently diverged clades of pogo-like transposons in metazoans: the CR clade (CENP-B-related) and the JR clade. Most likely, the two clades were already separated prior to the split of invertebrates and vertebrates, but transposons from the CR clade have gone extinct and are presently undetectable in the dataset of vertebrate genomes currently available (mostly mammals). CENP-B, TIGD3, 4 and 6 might be viewed as derivatives of these transposons that have persisted through evolutionary time due to the action of natural selection.
All the fungal sequences group into a strongly supported clade that includes the fission yeast centromeric proteins, the Pdc2 proteins and pogo-like transposases from fungi, plants and the oomycete Phytophthora. We designate this clade ‘AR’ for Abp1-related. Within the AR clade, the fission yeast proteins are significantly closer to transposases found in the ascomycete class Eurotiomycetes, which comprises the genera Aspergillus, Neosartorya and Coccidioides. These transposases thus appear to represent the closest extant relatives of the transposases that gave rise to the fission yeast centromeric proteins. This group of transposons has now completely disappeared from the S. pombe genome, together with all other DNA transposons. Finally, the Pdc2 group of proteins falls outside of this group, consistent with its independent origination from yet another source of pogo-like transposase.
Our analysis provides evidence that two distinct sources of pogo-like transposases gave rise independently to mammalian and fission yeast proteins with centromere-binding activity. Based on this common activity and overall sequence similarity, these proteins have been previously considered functional homologs (Murakami, Huberman, and Hurwitz 1996b; Halverson et al. 1997; Lee, Huberman, and Hurwitz 1997; Irelan, Gutkin, and Clarke 2001; Nakagawa et al. 2002; Amor et al. 2004; Masumoto, Nakano, and Ohzeki 2004). Consequently, the results drawn from the analysis of fission yeast proteins have often been used to speculate on the function of the mammalian CENP-B and vice-versa. While functional parallels clearly exist between these proteins and in the general centromeric protein–DNA architecture of fission yeast and mammals (Clarke 1998; Amor et al. 2004; Henikoff and Dalal 2005), our results indicate that the sequence relationship between these proteins does not derive from vertical inheritance of an ancestral host protein with centromere-binding activity. Instead, our data indicate that the relationship of these proteins reflects a process of convergent domestication, whereby the same type of transposases (i.e. pogo-like transposases) were exapted independently in the lineage of fission yeast and mammals to give rise to host proteins with centromere-binding activity.
At first, it may seem surprising that the same type of transposase would be recruited twice independently in evolutionarily distant lineages to become centromere-binding proteins. However, we note that the association of TEs and centromeres appears to be a recurrent theme in molecular evolution. Many intimate links have been uncovered between various kinds of TEs and the structure and/or function of centromeres in diverse species (reviewed in Kipling and Warburton 1997; Dawe 2003; Wong and Choo 2004). Furthermore, it is conceivable that pogo-like transposases possess a predisposition to be recruited as centromeric proteins. For example, one can envision that these transposases have an intrinsic ability to interact with centromeric DNA, either directly (all transposases so far characterized have DNA-binding activity) or indirectly via interaction with a host protein. This host factor could be a constitutive component of the kinetochore or a protein transiently associated with centromeric chromatin.
Genetic and biochemical analyses of the three fission yeast centromeric proteins have clearly established their role in the function and organization of the centromeres (Halverson et al. 1997; Lee, Huberman, and Hurwitz 1997; Ngan and Clarke 1997; Baum and Clarke 2000; Irelan, Gutkin, and Clarke 2001; Nakagawa et al. 2002). No single protein is essential for viability, but double deletion mutants display loss of viability and dramatic morphological changes, including abnormal branching and cell elongation. All three proteins are also required for the recruitment of the major heterochromatin protein Swi6 to the centromeres, although Abp1 seems to make the greatest contribution to this process (Nakagawa et al. 2002). Thus the functions of the three proteins seem to be partially redundant, but together essential for cell cycle progression and specification of a chromatin state at the centromeres that is competent for chromosome segregation. Consistent with some level of functional partitioning, the three proteins have different DNA binding affinities in vitro (Murakami, Huberman, and Hurwitz 1996a; Halverson et al. 1997; Lee, Huberman, and Hurwitz 1997) and distinct targets in vivo. Cbh1 binds the outer repeats of the S. pombe centromere and perhaps also non-centromeric regions (Baum and Clarke 2000), while Cbh2 appears to bind predominantly the inner centromeric region (Irelan, Gutkin, and Clarke 2001). Abp1 binds specifically the outer centromeric repeats, where it promotes specific histone modifications that lead to the recruitment of Swi6 and the formation of centromeric heterochromatin (Nakagawa et al. 2002). The notion of subfunctionalization is compatible with the results of our analyses, although we cannot at the moment determine whether the three S. pombe proteins originated from a single domestication event followed by gene duplication events or from the domestication of three closely related transposase genes.
The function of CENP-B at mammalian centromeres is less clear than for the fission yeast proteins (for review, see Warburton 2001; Masumoto, Nakano, and Ohzeki 2004). On one hand, it has been demonstrated that CENP-B and its binding sites on alphoid satellite DNA (CENP-B box) are required for de novo centromere formation in human cells and faithful segregation of human artificial chromosomes (Harrington et al. 1997; Ohzeki et al. 2002). On the other hand, the CENP-B box and CENP-B are not found on the Y chromosome and CENP-B is dispensable for the activation of neo-centromeres (Broccoli, Miller, and Miller 1990; Depinet et al. 1997; Warburton 2001). Finally, in CENP-B knockout mouse cells, functional kinetochores are maintained and null mutant mice exhibit only mild growth and reproductive abnormalities in the laboratory (Hudson et al. 1998; Kapoor et al. 1998; Perez-Castro et al. 1998; Fowler et al. 2000). While what is mild in the laboratory is not necessarily invisible to natural selection, these observations and others (Goldberg et al. 1996; Masumoto, Nakano, and Ohzeki 2004) raise the hypothesis that CENP-B might have functionally redundant homologs within mammalian genomes. Our analysis suggests that all mammalian genomes indeed possess other transposase-derived proteins related to CENP-B. However, the overall similarity of these proteins to CENP-B is relatively weak and lower than that among the three fission yeast proteins. Based on our analysis, the best candidates as functional homologs of CENP-B are TIGD3, 4 and 6, with the latter being the closest relative to CENP-B (fig. 2). Recently, it was found that recombinant human JRK-GFP fusion proteins densely co-localize with CREST antigens (which recognize CENP-A, -B and -C) at distinct chromosomal foci of cultured cells in S and G2 phase (Waldron and Moore 2004). Thus JRK might also interact directly or indirectly with some centromeric components.
Given the distant relationship of JRK to CENP-B, but its close proximity to mammalian Tigger4 transposons (see fig. 2), it could be that JRK represents yet another case of independently domesticated pogo-like transposase with a centromeric function. Further experiments are needed to explore the function of mammalian pogo-derived proteins and to assess whether any could act redundantly and/or cooperatively with CENP-B at the centromere.
We thank Ellen Pritham for critical reading of the manuscript and Esther Betrán for providing logistical and financial support to C.C. We also thank the following Sequencing Centers: Agencourt, Inc., WUSTL School of Medicine, Joint Genome Institute, TIGR, Broad Institute and Baylor College of Medicine for prepublication access to their genome data. This work was supported by start-up funds from UT Arlington and by grant R01GM77582-01 from the National Institutes of Health to C.F.