Chimpanzee orthologues of NANOG and NANOGP1
NANOG is the functional gene in the human genome, whereas NANOGP1 is apparently an unprocessed pseudogene derived from tandem duplication of the chromosomal region containing NANOG. However, cDNA and EST data show that NANOGP1 may be transcriptionally active, albeit at a lower level than NANOG, and that its transcripts are spliced differently from those derived from NANOG. Hart et al. [5] designated NANOGP1 and referred to it as a functional gene, whereas Booth and Holland [4] argued that because of its relatively high degree of divergence from NANOG, and the comparative paucity and ambiguity of transcripts derived from it, NANOGP1 is an unprocessed duplication pseudogene.
MEGABLAST searches of the chimpanzee genome readily identified the orthologues of NANOG and NANOGP1. However, the organization of the chimpanzee orthologue of the human NANOG gene in the chimpanzee Build 1.1 genome assembly suggests that the gene is either rearranged in the chimpanzee genome, or that the assembly is incorrect within this gene. All four exons of the orthologue are present in the assembly but in two different GenBank accessions. The entire sequences of the 5' UTR, exon 1, and exon 2 are found in the region spanning nucleotides 683046 through 686855 of the chromosome 12 contig [GenBank:NW_114668], in a region on the short arm of chromosome 12 near the telomere at a location orthologous to that of the human NANOG gene at 12p13.31. Introns 1 and 2 of the chimpanzee orthologue are also within this region but large segments of them are unsequenced. The complete sequences of exon 3, intron 3, exon 4, and the 3' UTR of the chimpanzee orthologue are found in nucleotides 3808 through 5350 of another accession [GenBank:NW_115304], which is known to reside on chromosome 12 but has not been placed in the Build 1.1 assembly of this chromosome. Furthermore, exon 4 in this accession contains an apparent single nucleotide-pair insertion mutation, resulting in a frameshift and premature termination codon in the reading frame.
To determine if the apparent gene rearrangement and frameshift mutation are present in the chimpanzee NANOG gene, or whether these are assembly and sequencing errors, we compared the available sequences of NANOG and NANOGP1 in the chimpanzee assembly and selected PCR primer sequences in regions that differed sufficiently to ensure specific amplification of the NANOG gene. To verify that the amplicons were not derived from processed NANOG pseudogenes, all target sequences included at least a portion of a NANOG-specific intron.
Two primer combinations amplified fragments that include the region of apparent misassembly within intron 2. Both of these primer combinations amplified PCR fragments of the sizes expected if the gene is intact. We sequenced these fragments (and all other amplified fragments) of the gene and found that their sequences most closely matched those of the intact human NANOG gene and less closely the corresponding sequences in the human pseudogenes, including NANOGP1, confirming that our sequences are derived from the intact chimpanzee NANOG gene. Furthermore, our sequences show that the apparent frameshift mutation in exon 4 in the Build 1.1 assembly is a sequencing error. Our sequencing enabled us to assemble and annotate the genomic sequence of the intact chimpanzee NANOG gene [GenBank:DQ179631].
Evolution of the NANOG gene and pseudogene family
The entire functional NANOG gene (according to our sequencing data) and NANOGP1 are present in both the human and chimpanzee genome assemblies at orthologous chromosomal positions. In the 3' UTR of the NANOG gene, there is an Alu element, which is missing from NANOGP1 in both genomes. Therefore, the NANOGP1 unprocessed pseudogene arose through duplication of the chromosomal region containing NANOG before the human-chimpanzee (H/C) divergence and before insertion of the Alu element into the NANOG gene. Because the same Alu element is present in both the human and chimpanzee NANOG genes, its insertion must also have preceded the H/C divergence. The processed pseudogenes NANOGP2, NANOGP3, NANOGP4, NANOGP5, NANOGP6, NANOGP7, NANOGP9, and NANOGP10 lack this Alu element. They thus likely arose before its insertion and, therefore, also predate the H/C divergence. The presence of the NANOGP11 pseudogene fragment in both the human and chimpanzee genomes likewise shows that its origin preceded H/C divergence.
The human NANOGP8 pseudogene is highly similar to the NANOG gene, is absent from the chimpanzee genome, and contains the same Alu element as the NANOG gene, indicating that this processed pseudogene is the most recent of the NANOG pseudogenes and was inserted into human chromosome 15 after the H/C divergence.
Based on the assumption of a pseudogene mutation rate of 1.25 × 10⁻⁹ mutations per site per year in humans [16], Booth and Holland [4] estimated that NANOGP8, the most recent pseudogene, originated about 5.2 million years ago, near the time of the H/C divergence. Our results demonstrate that NANOGP8 arose after the H/C divergence, and thus are consistent with this date. Booth and Holland [4] estimated the origins of the other pseudogenes as ranging from over 150 million years ago for NANOGP6 to 22 million years ago for NANOGP1, with the caveat that these dates may be inaccurate, and are likely overestimates, because nucleotide substitution rates for pseudogenes are not well calibrated within this range.
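The relationship between divergence and age behind such estimates can be sketched as follows. This is a minimal illustration, assuming that all differences accumulated in the pseudogene at the constant rate quoted above while the source gene stayed unchanged; the function names are hypothetical and this is not necessarily the exact calculation used by Booth and Holland.

```python
# Rate quoted in the text, in mutations per site per year.
PSEUDOGENE_RATE = 1.25e-9


def estimate_age_years(divergence: float, rate: float = PSEUDOGENE_RATE) -> float:
    """Estimate age from per-site divergence.

    Assumption: every difference arose in the pseudogene at a constant
    rate, and the functional gene did not change over the same period.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return divergence / rate


def expected_divergence(age_years: float, rate: float = PSEUDOGENE_RATE) -> float:
    """Per-site divergence expected after age_years under the same assumption."""
    return age_years * rate


if __name__ == "__main__":
    # Fraction of sites expected to differ after 5.2 million years.
    print(expected_divergence(5.2e6))
```

Under these assumptions, an age of 5.2 million years corresponds to roughly 0.65% of sites differing. The caveat in the text applies directly: the result is only as reliable as the assumed rate.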
Booth and Holland [4] determined the relative ages of the human NANOG pseudogenes by counting the number of mutations in the reading-frame regions of the pseudogenes when compared to the reading frame of the functional NANOG gene. They scaled their analysis by counting each run of adjacent deleted nucleotides as a single unit site, to compensate for the reduced opportunity for substitution mutations in deleted regions. They concluded that NANOGP6 is the most ancient of the pseudogenes, followed in order of most ancient to most recent by NANOGP5 and NANOGP3, then NANOGP10, then NANOGP2 and NANOGP9, then NANOGP7, then NANOGP4, then NANOGP1, and NANOGP8 as the most recent. Booth and Holland's analysis did not distinguish the order of NANOGP5 and NANOGP3 relative to each other, nor of NANOGP2 and NANOGP9 relative to each other, because of similar degrees of divergence for each of these pairs of pseudogenes from NANOG.
We conducted a similar analysis of relative age, with the same scaling for multiple-nucleotide deletions as a single unit site when those deletions were shared by the human and chimpanzee sequences. We identified mutations that occurred after the H/C divergence as differences between the human and chimpanzee sequences and corrected them to reflect the ancestral sequence at the time of the H/C divergence before completing our analysis. This correction was especially important for NANOGP10, which has accumulated 20 mutations since the H/C divergence, compared to 1–10 mutations for the other pseudogenes. We excluded NANOGP8 from this correction because of its absence in the chimpanzee genome. Also, since NANOGP3 is a truncated pseudogene with only 254 nucleotides within the NANOG coding region, we compared only the portions of NANOG and the other pseudogenes that aligned with these 254 nucleotides when determining the relative age of NANOGP3. The pseudogene fragment NANOGP11 was not included in Booth and Holland's analysis nor ours because it lacks the entire reading frame and has no significant homology with several of the other processed pseudogenes.
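The deletion scaling described above can be illustrated with a short sketch. It assumes two pre-aligned sequences of equal length with `-` marking deleted positions, and treats each run of adjacent gap columns as one site that differs; the function name and the toy sequences are hypothetical and are not taken from the NANOG alignment.

```python
def scaled_similarity(reference: str, pseudogene: str) -> float:
    """Percent identity, counting each run of adjacent gaps as one unit site."""
    if len(reference) != len(pseudogene):
        raise ValueError("sequences must be aligned to the same length")

    sites = 0       # unit sites compared
    matches = 0     # unit sites with identical nucleotides
    in_gap = False  # True while inside a run of gap columns

    for ref_base, pseudo_base in zip(reference.upper(), pseudogene.upper()):
        if ref_base == "-" or pseudo_base == "-":
            if not in_gap:
                sites += 1  # the whole run counts as a single differing site
                in_gap = True
            continue
        in_gap = False
        sites += 1
        if ref_base == pseudo_base:
            matches += 1

    if sites == 0:
        raise ValueError("no comparable sites")
    return 100.0 * matches / sites


if __name__ == "__main__":
    # Toy alignment: one substitution and one three-nucleotide deletion.
    print(scaled_similarity("ATGGCCTTAA", "ATGA---TAA"))
```

In the toy alignment, the three deleted nucleotides contribute one differing site rather than three, so eight unit sites are compared and six match.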
Comparison of the sequences after these adjustments results in a relative order that is the same as that determined by Booth and Holland [4]. Also similar to Booth and Holland's conclusions, our analysis showed that NANOGP5 and NANOGP3 were almost identical in their degree of similarity to NANOG (88.6% and 88.2%), and that NANOGP2 and NANOGP9 were likewise nearly identical in their degree of similarity to NANOG (94.6% and 94.4%). Thus, like Booth and Holland [4], we could not conclusively determine the relative orders within each of these two pairs of pseudogenes using this type of analysis.
Such an analysis assumes that natural selection has conserved the functional gene's sequence so that the modern sequence of the reading frame represents the source sequence of each of the pseudogenes. Under most circumstances, such an assumption cannot readily be tested. However, the periodic insertion and fixation of ten NANOG pseudogenes with a complete or partial reading frame should have left a record, albeit an imperfect one, of the functional NANOG gene-sequence evolution. If we assume that the reading frame of the functional NANOG gene has changed during the time when the pseudogenes were inserted into the genome, the mutational differences in the pseudogenes should consist of three different types: 1) source-gene mutations, defined as those that occurred in the functional NANOG gene after the insertion of one pseudogene but before the insertion of another, resulting in a polymorphism between these pseudogenes, 2) post-insertion mutations, defined as those that occurred in a pseudogene after its insertion but before the H/C divergence, and 3) post-H/C divergence mutations, defined as mutations that occurred in the NANOG gene and its pseudogenes after the H/C divergence. We readily identified 88 post-H/C divergence mutations in the reading-frame regions of the NANOG gene and its pseudogenes, and in all but four cases we were able to determine the mutant and ancestral nucleotides at each site by comparison of the human and chimpanzee orthologues with the NANOG gene and the other pseudogenes.
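One way to picture the correction for post-H/C divergence mutations is a simple majority rule: when the human and chimpanzee copies of a pseudogene differ at a site, the nucleotide also found in the NANOG gene and the other pseudogenes is taken as ancestral. The sketch below is an illustration under that assumption only. The function name and the tie-handling are hypothetical, and unresolved sites are returned as `None`, in the spirit of the four cases that could not be resolved.

```python
from collections import Counter
from typing import Iterable, Optional


def infer_ancestral_base(human: str, chimp: str,
                         others: Iterable[str]) -> Optional[str]:
    """Infer the base present at the time of the H/C divergence.

    human, chimp: bases in the two orthologous copies of one sequence.
    others: bases at the same site in the NANOG gene and other pseudogenes.
    Returns None when the comparison does not favour either base.
    """
    if human == chimp:
        return human  # no post-divergence change at this site

    # Count only the comparison sequences that carry one of the two bases.
    support = Counter(base for base in others if base in (human, chimp))
    if support[human] == support[chimp]:
        return None  # ambiguous: leave uncorrected
    return human if support[human] > support[chimp] else chimp


if __name__ == "__main__":
    # Chimpanzee copy carries G; most comparison sequences carry A.
    print(infer_ancestral_base("A", "G", ["A", "A", "G", "A"]))
```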
Some of the source-gene mutations should be distinguishable from post-insertion pseudogene mutations in our data as a nucleotide that is identical in a set of older pseudogenes, which then changes to a different nucleotide in a set of younger pseudogenes. Moreover, if possible source-gene mutations can be identified, they can be used to reconstruct the evolutionary history of the pseudogene family, and to some extent the evolutionary history of the gene itself.
To reconstruct the evolutionary history of the NANOG gene and its pseudogene family with source-gene mutation analysis, we aligned the reading frame of the human and chimpanzee NANOG gene with the corresponding sequences in all pseudogenes (except NANOGP11, which lacks the reading frame), and corrected (in all but four cases) post-H/C divergence mutations to reflect the ancestral sequence. We identified sites with possible source-gene mutations as a nucleotide shared by two or more pseudogenes and a different nucleotide shared by two or more additional pseudogenes. Any nucleotide present in a particular position in only one pseudogene was considered as a post-insertion pseudogene mutation. A total of 68 sites (out of 918) within the reading frame met these criteria for identification of possible source-gene mutations. We then identified the most parsimonious order of pseudogenes as the one which required the fewest number of source-gene mutations across these 68 sites.
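The site criterion and the search for the most parsimonious order can be sketched as below. This is one plausible reading of the procedure, not the implementation used in the study. It assumes that the cost of an ordering at a site is the number of times the shared nucleotide changes when the pseudogenes are read from oldest to youngest, that nucleotides found in only one pseudogene are ignored as post-insertion mutations, and that the modern NANOG base is appended as the most recent state. The pseudogene names and data are hypothetical.

```python
from collections import Counter
from itertools import permutations

NUCLEOTIDES = {"A", "C", "G", "T"}


def shared_bases(column):
    """Bases carried by at least two pseudogenes at one site (gaps ignored)."""
    counts = Counter(b for b in column.values() if b in NUCLEOTIDES)
    return {base for base, n in counts.items() if n >= 2}


def is_candidate_site(column):
    """True if two different bases are each shared by two or more pseudogenes."""
    return len(shared_bases(column)) >= 2


def changes_along(order, column, gene_base):
    """Count state changes from oldest to youngest, ending at the modern gene."""
    shared = shared_bases(column)
    states = [column[name] for name in order if column[name] in shared]
    states.append(gene_base)
    return sum(a != b for a, b in zip(states, states[1:]))


def most_parsimonious(names, columns, gene_bases):
    """Exhaustively search orderings; practical only for small toy inputs."""
    sites = [(c, g) for c, g in zip(columns, gene_bases) if is_candidate_site(c)]

    def cost(order):
        return sum(changes_along(order, c, g) for c, g in sites)

    best = min(permutations(names), key=cost)
    return list(best), cost(best)


if __name__ == "__main__":
    names = ["psA", "psB", "psC", "psD", "psE"]  # hypothetical pseudogenes
    columns = [
        {"psA": "T", "psB": "T", "psC": "C", "psD": "C", "psE": "C"},
        {"psA": "G", "psB": "G", "psC": "G", "psD": "A", "psE": "A"},
        {"psA": "T", "psB": "C", "psC": "C", "psD": "C", "psE": "C"},  # singleton
    ]
    gene_bases = ["C", "A", "C"]
    print(most_parsimonious(names, columns, gene_bases))
```

The third toy site is skipped because its variant nucleotide occurs in only one pseudogene. Several orderings can tie at the minimum cost, as with the two members of each early and late group here, which mirrors the difficulty of resolving pairs supported by few informative sites.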
The most parsimonious ordering of the NANOG pseudogenes (154 possible source-gene mutations across 68 sites) from most ancient to most recent is NANOGP6, NANOGP5, NANOGP3, NANOGP10, NANOGP2, NANOGP9, NANOGP7, NANOGP1, NANOGP4, and NANOGP8 as the most recent. The next most parsimonious ordering (156 mutations) is the same as the above order but with the positions of NANOGP5 and NANOGP3 reversed. As a truncated pseudogene, NANOGP3 contains only 19 possible source-gene mutation sites. Of these, only five are informative in distinguishing NANOGP3 and NANOGP5, three supporting NANOGP5 as the older pseudogene and two supporting NANOGP3. Sites with only one mutation in a particular order are more likely to represent a true source-gene mutation than sites with multiple mutations, which probably consist of a combination of source-gene and post-insertion mutations. The three sites, 399, 531, and 568, that support NANOGP5 as the older pseudogene require 1, 2, and 1 mutations to explain the order, respectively. The two sites that support NANOGP3 as the older pseudogene (sites 390 and 566) require 5 and 4 mutations, respectively, to explain that order, suggesting that the most parsimonious order (NANOGP5 older than NANOGP3) is also the most plausible with respect to these two pseudogenes. Additionally, our analysis clarifies the relative order of NANOGP2 and NANOGP9 by clearly placing NANOGP2 as the older of the two (reversing their positions in the order requires 168 mutations).
The only notable discrepancy between the results of source-gene mutation analysis and ordering by overall similarity to the modern NANOG gene is the relative placement of NANOGP1 and NANOGP4. In the latter analysis, the functional NANOG gene is more similar to NANOGP1 (98.6%) than it is to NANOGP4 (96.4%), implying that NANOGP4 is the older pseudogene. However, source-gene mutation analysis places NANOGP4 as the more recent of the two. Examination of the mutations that distinguish NANOGP1 provides compelling evidence that NANOGP1 is indeed the older pseudogene. NANOGP1 is an unprocessed pseudogene that arose from duplication of a segment of chromosome 12, and thus may have remained functional for an undetermined period of time after its formation. As Booth and Holland [4] pointed out, NANOGP1 cannot use the same initiation codon as NANOG because a mutation at position 25 in the reading frame produced a premature termination codon after only eight amino acids. This mutation is present in both the human and chimpanzee orthologues, indicating that it preceded the H/C divergence. Booth and Holland noted, however, that of the three characterized human transcripts from NANOGP1, two are alternatively spliced to remove all of exon 1, so that the NANOGP1 reading frame begins at a position corresponding to the 58th amino acid in the protein encoded by NANOG, which is an internal methionine in the NANOG protein. If NANOGP1 did indeed remain functional after its formation, we would expect natural selection to conserve the sequence within its reading frame when compared to NANOG.
After correction to the ancestral sequence for post-H/C divergence mutations, 15 mutations distinguish NANOGP1 from the NANOG reading frame, and they are nonrandomly distributed. Twelve are clustered in a 121-nucleotide region entirely within exon 1 of the NANOG gene, a region removed during splicing in two characterized NANOGP1 transcripts. Of the three mutations in NANOGP1's apparent reading frame, two are nonsynonymous and one is synonymous. A nonsynonymous mutation at position 246 is a guanine-to-thymine substitution that results in a lysine-to-asparagine substitution in the protein. Comparison with the human and chimpanzee sequences of the other pseudogenes reveals that this is a source-gene mutation that supports NANOGP1 as being older than NANOGP4. Comparison of this polymorphism to the sequences of the other pseudogenes reveals that the guanine in NANOGP1, and therefore the lysine in the protein, are ancestral, and that the source-gene mutation occurred after duplication of NANOGP1 but before insertion of NANOGP4. Interestingly, Booth and Holland [4] found through experimental sequencing that this particular mutation (and amino acid substitution) is polymorphic in modern humans, suggesting that neither lysine nor asparagine is detrimental to protein function at this position.
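The distinction between synonymous and nonsynonymous substitutions can be checked mechanically with the standard genetic code. The sketch below assumes that positions are numbered from 1 at the first nucleotide of the reading frame with no alignment gaps before the site, so that position 246 is the third base of codon 82; under that assumption the only lysine codon ending in guanine is AAG. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def classify_substitution(position: int, codon: str, new_base: str):
    """Classify a single-nucleotide substitution in a reading frame.

    position: 1-based nucleotide position in the reading frame.
    codon: the ancestral codon that contains this position.
    Returns (codon number, old amino acid, new amino acid, kind).
    """
    codon_number = (position - 1) // 3 + 1
    offset = (position - 1) % 3  # 0, 1, or 2 within the codon
    mutant = codon[:offset] + new_base + codon[offset + 1:]
    old_aa = CODON_TABLE[codon]
    new_aa = CODON_TABLE[mutant]
    kind = "synonymous" if old_aa == new_aa else "nonsynonymous"
    return codon_number, old_aa, new_aa, kind


if __name__ == "__main__":
    # Guanine-to-thymine at position 246: lysine (K) to asparagine (N).
    print(classify_substitution(246, "AAG", "T"))
```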
The other nonsynonymous mutation is a cytosine-to-thymine substitution at position 477, resulting in a proline-to-leucine substitution in the protein. Because proline and leucine have similar biochemical properties, this mutation is also not likely to adversely affect protein function. The NANOG gene and all other pseudogenes in both the human and chimpanzee genomes have a cytosine residue at this position, indicating that this is a post-duplication mutation in NANOGP1.
The single synonymous mutation in the apparent reading frame is at position 384, which lies within the homeobox region. This is clearly a source-gene mutation that also supports the ordering of NANOGP1 as being older than NANOGP4. Only NANOG, NANOGP4, and NANOGP8 have a cytosine at this position; all other pseudogenes, including NANOGP1, have a thymine at this position.
Taken in the aggregate, these observations strongly support the hypothesis that NANOGP1 remained functional after duplication and, therefore, was subject to selection-driven conservation of its reading frame. They also raise the possibility that NANOGP1 may retain some functionality or that its loss of function may be evolutionarily recent.
Nucleotide polymorphisms at possible source-gene mutation sites may represent true source-gene mutations or post-insertion pseudogene mutations. Sites in which a single mutation separates a set of older pseudogenes from a set of younger pseudogenes are the most plausible sites for identification of true source-gene mutations. In the most parsimonious ordering, 29 of the 68 sites contained a single possible source-gene mutation (Figure 2). Twenty of these mutations are nonsynonymous and nine are synonymous. If a mutation is indeed a true source-gene mutation, the amino acid it encodes may be reflected in the NANOG proteins of other vertebrates. To determine if this is the case, we used the amino acid sequence of the polypeptide encoded by the human NANOG gene [GenBank:NP_079141] as a query for a BLASTP search of the protein database of all organisms. Proteins from six species displayed full-length or nearly full-length homology to the NANOG protein: crab-eating macaque (Macaca fascicularis [GenBank:BAD72891]), house mouse (Mus musculus [GenBank:XP_132755]), Norway rat (Rattus norvegicus [GenBank:XP_575662]), domestic cattle (Bos taurus [GenBank:AAY84556]), domestic goat (Capra hircus [GenBank:AAW50709]), and domestic dog (Canis familiaris [GenBank:XP_543828]). We excluded a match to a computationally generated hypothetical protein in chimpanzee [GenBank:XP_510125] because it is derived from the DNA sequence of chimpanzee NANOGP7.
Figure 2. Potential single source-gene mutations in the most parsimonious ordering of the NANOG pseudogenes by source-gene mutation analysis. The left side depicts nucleotide sequences of the NANOG gene and pseudogenes after correction of post-H/C divergence mutations.
As shown in Figure 2, several of the putative source-gene mutations and their inferred effect on amino acid sequence in the human/chimpanzee NANOG pseudogene family are consistent with the corresponding amino acids in the NANOG proteins of other eutherian mammals. For example, at site 52 in the reading frame, an adenine-to-guanine substitution in the NANOG gene apparently occurred after the insertion of NANOGP10 but before the insertion of NANOGP2, resulting in an asparagine-to-aspartic acid substitution in amino-acid residue 18 of the polypeptide. The dog, cattle, and rat proteins have asparagine at this position, whereas the macaque, chimpanzee, and human have aspartic acid at this position. Similar patterns of congruence between amino acid substitution and amino acid sequences in other mammals are evident at positions 250–251, 275, 568, 713, 817, and 820–821 of the reading frame (Figure 2).
Another feature of the putative source-gene mutations is the paucity of amino acid substitutions at source-gene mutation sites within the homeobox region (positions 283–462 in the reading frame), indicative of high source-gene sequence conservation in this region. Six possible source-gene mutation sites are present within the homeobox region (three of which are single-mutation sites depicted in Figure 2). Five of these six sites have only synonymous mutations. The single nonsynonymous mutation is at position 358, with thymine present in NANOGP7 and NANOGP9 and cytosine present in all other pseudogenes and the NANOG gene, resulting in a leucine-to-phenylalanine substitution in the NANOGP7 and NANOGP9 sequences. These thymines may be independent post-insertion mutations or they could be a source-gene mutation that reverted to its original sequence after the insertion of NANOGP7.
Pseudogene mutations can be used to estimate the dates of origin for individual pseudogenes. However, only post-insertion mutations not subject to purifying selection are reliable indicators of the age of a pseudogene. Our analysis shows that, in the case of the NANOG pseudogene family, source-gene mutations are present and may contribute to a significant number of polymorphisms in the pseudogenes. Although some source-gene and post-insertion mutations may be readily distinguished based on their patterns when the pseudogenes are ordered, others may not be so easily discerned. Even when post-insertion mutations can be reliably identified, pseudogene evolution rates have not been well calibrated prior to the H/C divergence, as pointed out by Booth and Holland [4]. For these reasons, we have avoided age estimations in this study, focusing instead on the relative order of NANOG