Using all annotated genes from human, chimpanzee, mouse, rat and dog (Ensembl v42) [30
], we extracted 412 genes annotated as protein-coding in all species but dog. These genes exhibit a '1:1:1:1:0' phyletic pattern, that is indicative of the presence/absence of genes with a one-to-one orthologous relationship among the five species. We refer to these as 'missing genes' for purposes of this study. We examined the structural features of the 412 missing genes in the four mammalian reference sequences and compared them to an independent and randomly selected set of 400 genes. The mean length of the protein products of the missing genes set was 722 amino acids (AA), which is significantly smaller than the random set at 905 AA (t
= 6.8e - 11). Similarly, the mean transcript size was ~50% smaller than observed in a random set (t
= 2.6e - 9). The mean number of exons in missing genes was also smaller (5.8 vs 9.8; t
= 3.7e - 13) than the random set and particularly single-exon genes were found to be over represented by 15%. To ensure that single-exon missing genes were functional and not processed pseudogenes, we analyzed each, using the human dataset, for accumulated degenerative mutations (frameshifts and premature stop codons) in their coding sequence and found none. In addition, we identified sequence alignment between single-exon genes and ESTs (sequence similarity > 96% for at least 150 bp) for 95% of them.
To test the underlying assumption that missing genes may be implicated in particular biological pathways, we examined their functional annotation in the context of Gene Ontology (GO) using the program GO Tree Machine [31
]. Using the human sequence as a reference, the results demonstrate that the missing gene set is enriched for genes implicated in physiological pathways of immunity and organism responses to pathogens (12 genes), olfaction (16) and regulation of transcription (63). This classification comprises functional pathways that play an important role in the adaptation of organisms to their environment. Interestingly, these biological functions are often linked to large proteins families that are attractive targets for lineage-specific functions and lineage-specific loss and gain of genes [32
Constructing synteny maps with 1:1 orthologs
We extracted pairwise sets of 14,997; 14,798; 14,667 and 14,065 one-to-one (1:1) orthologous protein-coding genes (Ensembl v42) between human and dog (H-D), chimpanzee-dog (C-D), mouse-dog (M-D) and rat-dog (R-D), respectively. Using those 1:1 orthologs as comparative anchors, we built four fine-scale whole-genome pairwise synteny maps (Additional data file 1
) with the program AutoGRAPH, which we recently developed [13
]. We identified 218, 229, 326 and 325 CSOs, i.e. chromosomal segments for which markers are in the same linear order on the chromosome as noted across species [25
], between H-D, C-D, M-D and R-D respectively. The mean distance between two consecutive genes was ~180 kb. In all synteny maps, CSOs cover almost the entire genome while breakpoint regions, areas delimitating CSOs, cover only ~5% of a genome and may contain single-gene segment or very short synteny blocks [33
] (Additional data file 2
In each pairwise synteny map, we localized the missing gene orthologs on the reference sequence (Figure ). Of the 412 missing genes, the vast majority (mean of 92.3%; range 92 to 94%) mapped within CSOs with only 7.7% mapping within breakpoints. In all reference species the missing genes spanned all chromosomes, although their distribution varied greatly, i.e. one to 44 per human (HSA) chromosome in the case of the human-dog synteny map.
Figure 1 Consensus Ortholog IntervaL identification. The figure illustrates the 4-step method to infer targeted interval for gene prediction. (1) is the first step that build the pairwise synteny map (here a schematic Human-dog syntenic map) using 1:1 orthologs (more ...)
Targeting genomic intervals
We used multiple pairwise synteny maps described above to identify short, targeted, orthologous genomic intervals. On each reference genome, these intervals are delimited by the closest flanking 1:1 orthologs on either side of each missing gene that in turn define orthologous intervals on the canine genome as shown in Figure . The use of multiple pairwise maps enabled us to identify the shortest consensus interval on the canine genome to search for genes, that we refer to as Consensus Ortholog IntervaLs (COILs) (Figure ). From the 412 missing genes, we delimited 383 COILs (92.9%) having a mean size of 347 kb (Additional data file 3
). For a set of 17 COILs (4.1%) localized in common breakpoint regions (i.e. overlapping between at least two species) [24
] and for 12 missing genes, no COIL could be determined because of the absence of a consensus interval.
Targeted gene prediction
Within each canine COIL, we used the GeneWise program [6
] to splice and align the protein sequence of each reference species in order to most accurately predict the structure of the dog gene. We retained gene predictions produced by at least two reference species protein templates. This produced 231 gene structure predictions with amino acid identity > 40% (Figure ). Fifty-three genes were predicted using only rodent protein sequence as templates, thus illustrating the complementary contribution of multispecies analysis. We post-processed GeneWise results to detect potential gene features and found the presence of a coding start site for 53.1% of the gene predictions. In addition, amongst the 231 predicted genes, 75% of the predictions with multi-exonic structure exhibit at least a canonical splice site (GT/AG).
Figure 2 Flowchart of the computational analysis. The left pipeline indicates all steps in the computational analysis of gene predictions and the right pipeline shows a detailed account of the process of undetected genes and pseudogenes. Gray boxes summarize the (more ...)
To address the question whether COIL delimitation is too restrictive for gene prediction, we aligned the human transcript sequences corresponding to the 383 missing genes for which we defined a COIL, against the assembly of the canine genome sequence (CanFam 2.0) with the Exonerate program [35
]. We repeated the analysis with chimpanzee, mouse and rat transcript sequences. We considered the best five matching sequences to relax the limitations of conventional best-match methods [29
]. Then, we defined a concordance between the COIL approach and the whole-genome sequence analysis, when matching sequences from the Exonerate-based analysis for at least two species were totally embedded in COILs. Based on this criterion, concordance was obtained for 342 (89.2%) genes. Of the 41 instances with no agreement between the expected syntenic location and the whole-genome sequence analysis, 36 showed weak match (identity < 20%) within the canine genome assembly suggesting unspecific alignment while five showed a significant match, from at least two species suggesting that these genes may have acquired a new location in the dog. Of the latter five instances, we identified only one gene prediction (PLA2G4C
) with conservative criteria indicating a relocated gene in a non-syntenic genomic area.
In this study, we applied Genewise program with a sequence similarity-based method that explicitly models the conservation of gene structure and a high degree of conservation. As such model is known to show a marked decrease in performance for less similar genes [36
], we further investigate the undetected subset of genes using a probabilistic pair hidden Markov model (HMM) that show a weaker dependence on percent identity and performs better to pick out distant homologs. The Genewise HMM based analysis allowed to predict 36 additional genes (Figure ). Both prediction sets were merged into a single set (n = 268) for further analysis.
Sequence alignments were next generated between gene predictions and canine transcript sequences (Unigene april 08 [37
]). We identified significant alignment (sequence similarity > 96% for at least 150 bp) in 53% of cases with an average of 7.5 ESTs/mRNA per gene prediction (range 1–99). Using Interproscan, [38
] protein motifs were found from InterPro database for 80.5% of the gene predictions, providing additional evidence for dog gene identification.
As a further validation step, the construction of canine predicted protein three-dimensional models was investigated based on the homologous structure of the human ortholog or paralog (>40% identity), which was used as a template. For the subset of genes for which the 3D structure is solved (n = 21), canine-human comparative modelling was determined using the SWISS-MODEL server [39
]. In 16 instances of canine-human comparative modelling, the mean identity obtained between sequences was 70%. Homology-based 3D model for each canine prediction was validated using the Verify 3D graphs [40
] (data not shown) that distinguish between homology models of higher and lower accuracy.
To test for possible overlap between gene predictions obtained in this study and all canine genes annotated in Ensembl (v42), we performed sequence alignment between these two sets of predictions. A total of 232 (88%) predicted genes did not overlap any Ensembl annotated protein-coding genes. Therefore, these were classified as "definite" gene identifications together with the delineation of new orthologous relationships with the four reference species (Additional data file 4
). The remaining 36 gene predictions overlapped an annotated gene (protein identity > 80%) indicating that these gene predictions correspond to sequences already defined as genes, but with undetected or spurious orthologous relationships (Figure ). [41
Gene prediction assessment from dN/dS analysis
To assess the validity of gene predictions through the strength and direction of selective constraints, we used a functionality test that uses the ratio of replacement to silent nucleotide substitution rates (dN
). The ratio dN
, where dN
is the number of non-synonymous nucleotide substitution per non-synonymous site and dS
the number of synonymous nucleotide substitution per synonymous site, is used as a proxy for the evolutionary constraints that occur on nucleotide substitution [42
]. The calculation of the dN
ratio requires the comparison to a homologous reference sequence. First, we constructed a benchmark set of true orthologous genes using all 1:1 orthologous genes between human and dog (n = 14,994) to obtain a representative dN
value. From this benchmark set, we calculated the median dN
ratio of 0.15 using all dN
values extracted from the pairwise alignments of transcripts (Figure ). To assess the 232 gene predictions identified in this study with the functionality test, we determined dN
ratio for each of the gene predictions in comparison to their human functional orthologous gene from pairwise transcripts alignments. We calculated a median dN
of 0.19, a value highly similar to the benchmark set (0.15). To further assess the dN
values were analyzed through their distributions (as log dN
) between benchmark and predicted genes sets (Figure ) and we did not detect statistically significant differences (Mann-Whitney test; P
= 0.16). Therefore dN
similar distributions are indicative of similar high selective constraints and little or no positive selection on both benchmark and predicted genes sets, suggesting the functional properties of the canine gene predictions products involved are conserved.
Figure 3 DN/dS cumulative frequency distribution of references, gene predictions and pseudogene predictions sets. Benchmark, predicted genes, pseudogenes (with one mutation) and pseudogenes (with accumulated mutations) sets exhibit a median dN/dS of 0.15, 0.18, (more ...)
Figure 4 DN/dS distributions of benchmark and test sets. A. The dN/dS distribution (as log dN/dS) of the test set (new predicted genes) is represented in purple and benchmark set (human-dog 1:1 orthologous) is represented in blue. Test set exhibits a dN/dS distribution (more ...)
To analyze the evolutionary rate of the new canine predicted gene sequences in a phylogenetic context we used the 232 mouse genes in addition to human genes and dog predicted genes to assess the levels of selective constraint of each lineage in comparison to the rest of the tree. In this way, differences or similarity in selective constraints can be predicted on all lineages within the phylogeny. For each of the 232 genes, we inferred the dN
values and calculated the dN
ratio. The median dN
for the dog lineage was found between human and mouse (Table ), a result in agreement to these determined for 13,816 human, mouse and dog genes with 1:1:1 orthologs [2
] with similar differences found across the three lineages.
Median and mean dS and dN/dS values of pseudogenes, predicted genes and reference set of human-canine orthologues
Off the 412 missing genes, a subset of 55 predictions containing ORF-disrupting mutations lead to pseudogene identification. Among pseudogenes, we determined if protein sequences have different numbers of in-frame stop codons and/or frameshift disruptions. Using such quantitative measures, two mutation levels were apparent. A set of inactivated genes (n = 21) was predicted with accumulated mutations (mean = 4.2; range 2–11) and a second set (n = 34) was predicted with one mutation (Figure ). To normalize the mutation rate by taking into account the coding sequence length, we expect proteins of similar lengths to now have similar numbers of stop-codons or a frameshift. We therefore examined the ratio of accumulation of ORF-disrupting mutations per 100 AA in both groups of pseudogenes. A mutation rate of 0.28 was determined for the group of pseudogenes with one mutation and a significant higher rate of 1.21 (Mann-Whitney test; P = 8.052e - 7) was found for the set of pseudogenes with accumulated mutations.
Although transcribed pseudogenes have been experimentally identified [43
], a significant part of pseudogenes are thought to be transcriptionally silent in comparison to protein-coding genes. We thus searched for sequence alignment with canine transcript sequences (Unigene april 08 [37
]) to assess the transcription activity of the pseudogene predictions with two and more mutations. We obtained alignment for 14%, a result in agreement with a recent report [44
] showing that 19% of pseudogenes are the sources of novel RNA transcripts. These data indicate that the predicted pseudogenes are mostly undetected as expressed sequences in comparison to gene predictions with intact ORF (53%) and therefore significantly correspond to untranscribed pseudogenes [44
Detecting nonfunctionality from dN/dS analysis
To assess independently of the presence of stop codons or frame-shifts, the validity of pseudogene predictions, we used the functionality test that uses the dN
ratio. Assuming a constant mutation rate, the dN
ratio between dog pseudogenes, for which a loss of function occurred, and their human functional orthologs should theoretically relax towards 0.57 (as the average of 1.0 in the absence of selection and 0.15 for negative selection as we calculated from the benchmark set) [10
]. Thus, we calculated dN
ratio for each of the candidate pseudogene predictions in comparison to their human functional orthologous gene from pairwise transcripts pair alignments. For the pseudogene set with accumulated mutations, we calculated a median dN
of 0.50 indicating a considerable relaxation of selective constraints of the canine pseudogenes in comparison to their human functional orthologous (Figure and Table ). Furthermore, the dN
distributions obtained were shifted upwards in comparison to the benchmark set (Figure ), which is significant to a Mann-Whitney test (P
= 5.17e - 6), indicating relaxation of evolutionary constraints on the predicted pseudogenes. For the pseudogene set with one mutation, the median dN
of 0.18 was observed, suggesting no detectable differences in selective constraints between predicted pseudogenes from the canine sequence and their human functional counterparts. In addition, we analyzed whether the dN
ratio has an independent value before and after the stop codon among the predicted pseudogenes. In 26/28 instances, no significant differences were detected when comparing dN
ratio for the two parts of each gene. In two cases, the dN
value before the stop was indicative of strong selective constraints (<0.1), in comparison to the value detected after the stop (>0.9), which suggest that the biological function may have been preserved.
We next searched to determine if the canine predicted pseudogenes showed any deviations from the expected rate of evolution using a phylogenetic context that includes human and mouse gene sequences. Such variation in rate may reflect relaxation of constraints in the dog lineage. The deviation between dog predicted pseudogenes with multiple mutations and the human and mouse lineages differs clearly (dN/dS = 0.41 for dog, 0.19 for mouse and 0.26 for human; Kruskal-Wallis test: P = 1.04e - 2) while no significant deviation (P = 0.36) was observed for the set of pseudogenes with one mutation (Table ). We therefore retained the 21 pseudogene predictions with both the higher dN/dS value as characterized by pairwise and phylogenetic approaches and high mutation rate as gene loss candidates.
Evolutionary constraints (dS and dN/dS) for 1:1:1 orthologs among human, mouse and dog
Gene loss identification
In addition to pseudogene identification, 11 gene predictions could not be detected with sufficient protein identity (average = 21.7%), both in the targeted genomic region (COIL) and in the whole canine sequence. For these predictions with no readily identifiable counterparts in dog, we searched for sequence alignment with canine expressed sequences (Unigene april 08) to address the underlying assumption that genes are not transcribed when placed in the context of highly degraded sequence. We identified sequence alignment in only three cases. These results showed that the gene predictions with poor sequence similarity were largely undetected as expressed sequences in comparison to gene predictions with intact ORF.
For the last subset of 49 canine genes that remained undetected in this study, we address the possibility that gene predictions could have been prevented because of a gap in the canine sequence assembly. We searched for gap content in the COILs that lack canine orthologous genes. For 12 COILs, the gap content was found to account for >10% of the total size of the COIL, seven-fold more than a random expectation set (n = 1000, gap = 1.32%) and manual inspection of sequence content resulted in identifying multiple sequence gaps. The 12 missing genes in those short targeted regions were therefore not retained in further analysis. Based on these results, a total of 37 undetected genes was considered and merged with the 11 gene predictions that could not be detected with sufficient protein identity and the 21 pseudogenes into a single set (n = 69) of gene loss candidates for further analyses (Figure and Additional data file 5
Evolutionary scenarios of the canine gene losses
Do we detect losses of genes that occur specifically in the dog or do such losses occur in other mammalian lineages as well? If so, do such losses correspond to the time the dog branch diverged from the Euarchontoglires (rodent/primate) lineage? One way to analyze these possibilities is to determine their phyletic pattern using ten species from chicken to human and to define the amount of time between gene origin and present. The timing of genes origin was defined by searching for 1:1 orthologs between human and nine species. In addition to human, chimp, mouse and rat genome sequence assemblies, we used scaffold assemblies of elephant, tenrec and armadillo from the Afrotheria and Xenarthra superorder and two non-placental genome assemblies of opossum and platypus. We also included the chicken sequence to infer gene origins that occurred as long as 310 million years ago (MYA) (Figure ).
Figure 5 Gene origin timing. Timing of gene origin is assessed by determining the one-to-one orthologs between human and nine species listed on the left side of the figure. The species belong to Euarchontoglire (Primates and rodents), Xenarthra (Armadillo), Afrotheria (more ...)
Orthologous genes were detected between human and all species (except dog) for 11 genes. Therefore, they have an origin that occurred before the separation of the mammals and birds lineages and have been functional for 310 million years (My). In addition, 17 genes were identified in all species of the opossum/platypus, elephant-tenrec-armadillo and Euarchontoglires branches, a period of 170 My, 17 in all species of the elephant-tenrec-armadillo and Euarchontoglires branches (100 My), and 10 in Euarchontoglires only (87 My) (Figure ) [45
Overall, 28 canine gene losses could be characterized as being functional in other species for more than 170 My and 10 genes were not detected before 87 My and therefore specifically arose in rodent and primate lineages. For these genes, postulating that they arose through duplication events of a parental gene, we searched for paralogs among all human genes. For seven genes (ZNF426, WFDC12, ZIK1, HLA-SX-alpha, PNMA5, PNMA3, ZNF251) we identified at least one paralog (sequence identity >30%) in the close vicinity of the parental gene (mean of 71 kb; range: 16–128 kb).
We further used the Ensembl reconciliation tree method [46
] for checking possible duplication events specific of the primates and rodents lineages. Indeed, assuming that all homologous genes are known, the reconciliation of the gene tree with the species tree allows to distinguish duplication from speciation events and therefore orthologous from paralogous genes. Five genes (ZNF426, ZIK1, HLA-SX-alpha, PNMA5, PNMA3
) have in-paralogs in the reference species suggesting a pattern of duplication event (Additional data file 6
These results suggest that tandem duplication events have occurred and lead to specific in-paralogs in the branch leading to human species. Another contribution of this analysis is that it permits identification of 10 losses that occur in several lineages indicating multiple and independent gene loss events [47
Functional characteristics of gene losses
For the 28 canine-specific gene losses that have been functional for more than 170 My, we determined the functional annotation of the human genes using WebGestalt, a Web-based gene set analysis toolkit [48
]. The classification using the GOTree sub-module includes seven genes that belong to the biological process of response to stimulus with PROZ
, a vitamin K-dependent protein Z precursor involved in blood coagulation pathway and SERPINA10
a protein Z-dependant protease inhibitor that regulates factor Xa involved in blood coagulation. Moreover, it includes five genes involved in response to stimulus pathways that play a role in sensory function such as UGT2A
which encodes an enzyme with transferase activity that may catalyze inactivation and facilitate elimination of odorants, OR1Q1
which arethree olfactory receptors, and Noggin, a secreted polypeptide encoded by the NOG
gene that appears to have pleiotropic effect, both early in development as well as in sensory perception of sound. Other genes of interest belong to families with at least six members such as TBX22
a transcription factor involved in the regulation of various aspects of embryonic development, in particular cell type specification and regulation of morphogenetic movements [49
], and MS4A3
which is a subset of the superfamily of tetraspan transmembrane protein encoding genes. Several genes were classified with function involved in DNA repair, apoptosis and tumor formation such as BOK
which encodes a Bcl-2 related protein and PDE1B
which may play a role in apoptosis. To address the question of which tissue might be significantly affected by gene loss, we determined a gene-expression profile characterization per tissue based on the occurrence frequency of the ESTs profiles of human genes corresponding to the gene lost set using the tissue expression profile sub-module of WebGestalt. Testis-expressed gene expression profiles showed a significant over or under representation and, to a lesser extent, expression profiles related to placenta and kidney tissues did as well (Additional data file 7