We started with the 12,687 presumably orthologous protein pairs retrieved (see Materials and Methods
) from the Inparanoid web site [17
] for which gene sequences could be found. We used BLAST to scan the human and mouse genomes for potential pseudogenic sequence pairs (see Materials and Methods
). A pseudogene pair corresponding to a protein pair was then used together with the gene sequences to form the sequence quartets on which we base our analysis.
This initial search with subsequent refinement resulted in 168,855 such quartets. For the vast majority of these quartets, one or both pseudogene sequences overlap regions of known
protein-coding genes. Gene position data from Ensembl were used to filter out known genes. Predicted genes are kept for further analysis, since it is known [8
] that gene predictors sometimes mistake pseudogenes for protein-coding genes.
The set that remains after filtering constitutes 11,146 sequence quartets originating from 1,349 protein pairs. The distribution of quartets per protein pair is highly nonuniform. While many gene pairs lack corresponding pseudogene pairs, a handful (EF12, G3PT, LDHB, TSY1, UB46, and several ribosomal genes) are origins of large pseudogene families in both species. Using the mutual-best-hit filtering outlined in , we, however, removed a large number of pairs likely to be insignificant; after this step 1,453 sequence quartets remained. We divide these into four classes according to the following: class 1—pairs that have detectable disablements and do not overlap any Ensembl gene prediction; class 2—pairs that have detectable disablements and overlap an Ensembl gene prediction; class 3—pairs that do not have detectable pseudogenic disablements and do not overlap any Ensembl gene prediction; and class 4—pairs that do not have detectable pseudogenic disablements and overlap an Ensembl gene prediction.
Visualization of the Effect of Mutual-Best-Hit Filtering
We used the partition of our data induced by this classification in combination with mutual-best-hit filtering. The number of sequence quartets belonging to the classes are: class 1—247 quartets; class 2—299; class 3—146; and class 4—761 (see Table 1).
Our aim is to find those quartets for which S
1 is the most likely explanation. We use a probabilistic methodology to compare scenarios (see Materials and Methods
), to obtain p
-values for any possible alternative hypothesis with respect to the interesting one, namely, that S
1 best describes a given quartet. For visualization purposes, we also consider quotients of type L1
, where Li
is the log-likelihood corresponding to scenario i;
for a particular quartet, a value L1
< 1 suggests that L1
is preferable to L2
For the majority of our 1,453 quartets, data support S2 (), the scenario where pseudogenes originated later than the species split. In 425 out of 1,453 cases (29%), the p-value for S1 being the scenario that best explains our data is less than 0.001.
Histogram of Likelihood Quotients when Comparing Scenarios S1 and S2
Interestingly, we note a bimodal pattern with one large hump distributed around 1.1 and another one distributed around 0.9. That is, in a large majority of cases, data show clear preference for either S1 or S2; it is only for a comparatively small number of cases that the quotients are close to 1.
We now use the same technique to compare S1 and S3. Remember that S3 is the scenario where the transitions from genes to pseudogenes were independent of each other and occurred subsequently to the speciation. Hence it is only the models of sequence evolution used for genes and pseudogenes, respectively, that distinguish S1 from S3. The likelihood values are in this case much less varied, yielding many quotients close to one ().
Histogram of Likelihood Quotients when Comparing Scenarios S1 and S3
For 73 of the 425 quartets, S1 is the explanation favored by our method and for 30 of these 73 the p-value is lower than 0.1. For 352 sequence pairs, S3 is the most likely topology, and 262 of those clearly favor S3 (p-value again lower than 0.1).
To summarize, we have 30 quartets for which the sequences suggest that: 1) the pseudogenes are evolutionarily conserved since before the human and mouse speciation; 2) they have been pseudogenes since prior to the speciation.
Because we find 30 such quartets, and the number of quartets expected to pass our scenario test is 1453 * 0.001 * 0.1 ≈ 0.15, it is reasonable to conclude that a significant number of these 30 quartets are ancient pseudogenes, i.e., satisfying 1) and 2).
We are now going to investigate these 30 sequence quartets further, with the aim of testing their potential biological function. The criteria that will be our focus are synteny, expression evidence, and conservation.
Synteny can be used as a means to evaluate our methodology's capacity to separate S1 and S3 from S2 quartets. It is also interesting to compare the fraction of syntenic quartets among S1, S3, and genes. The latter can be seen as a test of functionality.
It has long been known that eukaryotic genomes undergo rearrangements on both microscopic (intrachromosomal with a span < 1 Mb) and macroscopic (intrachromosomal with larger span, as well as interchromosomal) level during evolution [18
]. By using so-called sequence markers, often protein-coding segments, it has been possible to infer maps of syntenic regions, that is, regions of conserved marker order.
The orthologous pairs of protein-coding genes in our data set have the following synteny relations: 69% syntenic, 2% reversed syntenic, 11% corresponding chromosomes, 4% nonsyntenic, and 13% unknown synteny (see Materials and Methods
). We find 20 out of the 30 S
1 pairs in synteny and five are found close to synteny (Materials and Methods
It is reasonable that sequences that have originated from duplication events prior to the species split (sequences belonging to S
1 and S
3 quartets) are primarily found in syntenic positions, as we have seen is the case for genes. Conversely, there is no reason to presuppose this for quartets showing preference for S
2 (remember that the pseudogenes here are expected to have originated independently of each other). From inspection of , we see the following: out of our 30 S
1 sequence pairs, 20 are syntenic (67%). A similar amount, 149 out of 262 (57%), of S
3 sequences are found in syntenic regions. Only 130 out of the total 702 (19%) S
2 sequence pairs are syntenic. In fact, one could argue that the latter percentage is unexpectedly high. A possible explanation is that these are a result of tandem duplications; that is, the duplicated sequences are found nearby the original ones and are therefore in synteny. The tendency for class 4 to be found in syntenic regions could simply be due to the fact that these are detected by comparative gene finders, which often use synteny as a criterion [19
Number of Sequence Pairs in Each Class Favoring a Particular Scenario
It is notable that within classes 1, 2, and 3, ten of the 13 S1 pseudogene pairs are syntenic, while only 43 out of 100 S3 pairs are syntenic.
If we again consider the ATXN7L3 tree (), we see that among the pseudogenes only the human Chr12/mouse Chr10 pair, the pair retained after mutual-best-hit filtering, is found in syntenic positions.
We investigated whether our candidates for potential function are enriched for transcription or not, by searching publicly available databases for transcript sequences, expressed sequence tags (ESTs) and mRNAs. An EST or mRNA sequence is postulated to come from a specific pseudogene if its sequence is more similar to the pseudogene than it is to any other known gene or pseudogene (see Materials and Methods
for details). We found that, out of the 30 sequence pairs showing preference for S
1, 22 are transcribed in both human and mouse. For the 20 syntenic S
1 sequence pairs, 17 are transcribed in both species and all but one are transcribed in either human or mouse (see ). Notable are the ATX1 and ATXN7L3 duplicates, both class 1 members, for which we find ESTs from many different tissues (human thyroid, colon, and prostate among them), and also class 3 ZNF629 duplicates, each perfectly matched by approximately 1,000-bp-long mRNAs.
p-Values for Scenario Comparisons and Pseudogene Expression Evidence (Number of Matching EST and mRNA Sequences) for the 20 Syntenic S1 Quartets
Among the 20 syntenic S1 sequence pairs, the only completely unexpressed example is the IMB1 copy found on the X chromosome. This pair shows clear preference for S
1, and is also unusually well-preserved for a nonfunctional pseudogene. This might indicate that the IMB1 pseudogene was functional for a short period after the species split, but it could also simply be an effect of the X-chromosome's lower mutation rate [20
The majority of these pseudogenes are much less expressed than their respective genes (to the latter, one can generally map large numbers of ESTs originating from tissues throughout the body).
To perform an enrichment test we need a good comparison set. We believe that S3 contains many young pseudogenes, that is, those which recently underwent the transition from gene to pseudogene, but also protein-coding genes. It is reasonable to assume that young pseudogenes are more frequently transcribed than older ones. For these reasons, S3 is not a good comparison set, and no other such set is available either.
Instead, we focus on the correlation between the S3 pairs' positioning of the gene-to-pseudogene transitions (see ) and their pseudogene expression. For this we adopted the following labeling scheme with notation from :
S3 Topology with Gene-to-Pseudogene Breakpoints
If we assume that the rate of pseudogene creation does not vary over time, then the low number of detected early pairs—only nine out of 198 non-unclear S3 examples conform to this group—is a sign that most pairs of the same age as the early pairs have diverged beyond recognition.
shows the number of examples in each group together with an evaluation of their tendency to be expressed. We note that whereas in the genelike and late groups a large majority (94% and 80%, respectively) is expressed in both organisms, the figures are much lower for the early (22%) and unclear (44%) ones. We could also compare these figures with the corresponding figures for S1 examples where 22 out of 30 examples (73%) are expressed in both organisms. As can be seen in , this tendency is even more pronounced if we count only syntenic examples. It is reasonable to conclude that the majority of the early pairs are nonfunctional, because they are expressed to such a low extent. Considering the higher age and the extent of expression for the 20 S1 pseudogenes, it is also reasonable to conclude that this set contains pseudogenes of potential biological function.
Human and Mouse Expression for 262 S3 Quartets Selected as Described in
According to estimates in [21
], the neutral rate of substitution has been roughly 0.5 substitutions per site since the divergence of the human and mouse lineages. This estimate conforms to 67% sequence identity for orthologous regions under no selective pressure. At the other extreme, protein-coding sequences have, on average, approximately 85% conservation [21
We will now address where our putatively functional pseudogenes are placed along that scale. We note ( and ) that although the conservation for the 20 syntenic S
1 pseudogene pairs is not as strong as for the corresponding genes, it is in most cases significantly above the 67% limit (p
-values are computed using Hoeffding's bound, see Materials and Methods
). For instance, the ATX1 derivative shows conservation at least as high as a typical gene, even slightly higher than its paralogous gene (). The ATXN7L3 duplicate, previously discussed, has conservation similar to that of a protein-coding gene. It is 77% conserved, counted over the total alignment, but a 288-bp-long section in the beginning is 89% conserved ().
Conservation between Human and Mouse Gene and Pseudogene Sequences for the 20 Syntenic S1 Sequences
Conservation Percentage in and around the Pseudogene
Recognizing Pseudogenes by Inspecting Their Alignment
Substitution rates vary along and between chromosomes. To make sure that it is the pseudogenes only, and not their genomic vicinities in general, that are conserved, we also aligned a 1,000-bp section upstream and downstream of each pseudogene. We observe in most cases () that the conservation for the surroundings is about the expected 67% and much lower than for the actual pseudogene. The flanking sequences have in most cases about the 67% conservation that we expect. The unexpectedly high value registered downstream of, among others, the ATX1 relative, might be due to conservation of the 3′ UTR.
Conceivably, a potential pseudogene could in its close vicinity have protein-coding exons originating from the same gene. To exclude this possibility, we also checked the proximity for signs of exons originating from the same gene, with potentially intact protein-coding ability. No additional such protein-coding exons were found. For the absolute majority of our pseudogenes, no hit could be found on the same chromosome, and in no case was any hit found closer than 10,000 bp.
We also applied our methodology to the human–chimpanzee pair of genomes. This choice was motivated by our desire to discover young pseudogenes. Remember that the mouse Makorin pseudogene, although vital, has only been functional over a relatively short evolutionary period [7
The procedure was the same as for human–mouse. The chimpanzee data was downloaded from Ensembl, including assembly 1 as of April 2005 together with protein sequences and gene-sequence data. For human–chimpanzee, sequence conservation is less effective as a means to separate functional from nonfunctional pseudogenes. The reason is of course that many pseudogenes originating before the comparatively recent primate species split can be expected to be nonfunctional, although they have not diverged sufficiently to be easily recognized as such. So, while in the human–mouse case we can be relatively confident that syntenic pseudogenes that prefer S1 are functional, it is likely that many S1 pseudogenes found in a human–chimpanzee comparison are nonfunctional. In fact, conservation estimates can, even together with expression evidence, be expected to be insufficient for revealing whether an individual pseudogene is functional or not. What we can hope for is a signal in the data showing that the quartets preferring S1 include functional pseudogenes.
As expected, the human–chimpanzee comparison resulted in a large set of pseudogene pairs. We therefore restricted our analysis to the most interesting class, i.e., class 1. We found 742 class 1 pseudogenes belonging to quartets favoring S1 (using p-value 0.001 for comparing S1 and S2 and 0.1 for comparing S1 and S3). The aforementioned class 1 pseudogenes found in the human–mouse comparison all belong to this set. We are fairly confident that these 742 sequences have indeed evolved in a manner atypical for a protein-coding gene. A key question is whether these are functional or not. Note that more than 1/5 of them show transcriptional evidence ().
Percentage of Expressed Pseudogenes in Relation to Their Conservation p-Values (Calculated with Hoeffding's Bound)
Many pseudogenes have regulatory regions showing high similarity to those of the corresponding protein-coding genes. This is either because few mutations have occurred in these regions, or alternatively because many of the mutations that have occurred have been selected against, due to functionality of the pseudogenes.
To further purify our result set, i.e., the 742 pseudogenes favoring S1, we again looked at significant deviations of conservation. This approach requires a reliable estimate of the background conservation percentage and we used 5,000-bp-long alignments flanking each pseudogene to compute such an estimate.
Typical values for the background mismatch percentage range from 1.2% to 3% (counting the first but not subsequent indels in a gap), which conforms well with previous estimates of 1.4% [22
] and 1.6% [23
]. Note that the percentage of pseudogenes with EST and/or mRNA expression evidence is higher for more conserved pseudogenes. contains a list of those pseudogenes that we have found to be expressed as well as conserved since the human–chimpanzee speciation. Notable is that the pseudogenes originating from proteins ATXN7L3, PDZRN3, or IMB1 are not found in . ATXN7L3 and PDZRN3 are not sufficiently conserved (in the former case we remember from the human–mouse analysis that it is only sections of the pseudogenes that exhibit exceptional conservation). The IMB1 pair residing on chromosome X is indeed conserved enough (it has a conservation p
-value of 4.3*10−4
) but lacks, as was noted in the human–mouse section, expression evidence.
Human–Chimpanzee Conserved and Expressed Pseudogene Pairs