Over the last decade, many different approaches for identifying gene orthology between species have been proposed in the literature. The process of gene annotation, as well as the discrimination between protein-coding and non-coding genes [39], will become even more important as the number of available genome sequences increases, in line with the rapid progress of sequencing technology. The fraction of genes without orthologs between species varies with the sensitivity and specificity of the methods used to identify orthologous genes, and also depends on the quality of the genome assembly [40]. Among these genes are the so-called orphans [32], which have no homologs among the genes of other species. Several explanations have been proposed for the absence of homologs; one possibility is that orphans represent species-specific genes. In the literature, the search for orphan genes has been carried out in different species by comparing gene sets at the protein level [34]. In the work presented here we addressed the problem of non-orthologous genes between species at the nucleotide level. We focused on the bovine genome (version 4.0), whose assembly and annotation are still ongoing. Ensembl orthology predictions from release 50 were used, as these represented the highest quality genome annotations across several mammalian species. Ensembl automatically produces orthology predictions between species, and for each release of the database these predictions can be easily queried using the "BioMart" tool. A simple query for the number of bovine genes with no ortholog in the human genome returned 5,507 out of 22,836 genes. The reverse query returned 19,811 out of 36,396 human genes with no orthologs in the bovine genome. The differences depend on the level and quality of annotation for the two genomes, and on the larger set of annotated human genes. In order to reduce this effect in a simple two-way comparison, the bovine and human datasets were filtered with information from other completed genomes, specifically mouse and dog. A total of 3,801 bovine genes had no orthologs in human, mouse or dog, while 1,010 human genes with orthologs in mouse and dog had no orthologs in cow (Figure ). These two groups of genes were considered the most consistent non-orthologous genes to use in further work. In the previous assembly of the cow genome (Btau 3.1, Ensembl release 49), a similar query gave almost double the number of bovine non-orthologous genes (6,247), while non-orthologous human genes were slightly fewer (865). This reflects major improvements in the bovine genome assembly and annotation between versions 3.1 and 4.0, but suggests that there are still problems with either the assembly or the annotation of the bovine sequence.
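The filtering step described above amounts to set operations on orthology predictions. The following is a minimal sketch of that logic, not the BioMart queries actually used: the function names, the toy gene identifiers and the dictionary layout (gene mapped to the species in which an ortholog is predicted) are all assumptions made for illustration. Only the four gene counts in the final lines are taken from the text.

```python
from typing import Dict, Iterable, Set


def select_genes(
    ortholog_species: Dict[str, Set[str]],
    absent_in: Iterable[str],
    present_in: Iterable[str] = (),
) -> Set[str]:
    """Return genes with no ortholog in any `absent_in` species
    and an ortholog in every `present_in` species."""
    absent = set(absent_in)
    present = set(present_in)
    return {
        gene
        for gene, species in ortholog_species.items()
        if not (species & absent) and present <= species
    }


def fraction(count: int, total: int) -> float:
    """Share of a gene set, e.g. non-orthologous genes over all genes."""
    return count / total


# Toy data (hypothetical identifiers): gene -> species with a predicted ortholog.
cow_genes = {
    "cow_g1": {"human", "mouse", "dog"},
    "cow_g2": set(),
    "cow_g3": {"mouse"},
    "cow_g4": set(),
}
human_genes = {
    "hum_g1": {"mouse", "dog"},
    "hum_g2": {"cow", "mouse", "dog"},
    "hum_g3": {"mouse"},
    "hum_g4": set(),
}

# Bovine genes with no ortholog in human, mouse or dog.
cow_only = select_genes(cow_genes, absent_in=["human", "mouse", "dog"])
# Human genes with orthologs in mouse and dog but none in cow.
human_no_cow = select_genes(
    human_genes, absent_in=["cow"], present_in=["mouse", "dog"]
)

print(sorted(cow_only))
print(sorted(human_no_cow))

# Counts reported in the text for the simple two-way queries.
print(f"{fraction(5507, 22836):.1%} of bovine genes lack a human ortholog")
print(f"{fraction(19811, 36396):.1%} of human genes lack a bovine ortholog")
```

Requiring orthologs in mouse and dog on the human side is what removes human genes whose apparent absence from cow could simply reflect a gene that is poorly conserved or poorly annotated in general.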
The two sets of non-orthologous genes (cow vs. human and human vs. cow) were investigated in order to test the quality of orthology predictions, to reveal genuine differences between species and, most commonly, to expose problems with the genome assemblies. A bioinformatic pipeline and web tool were developed to describe the alignments of each library with the genome of the other species, and the alignments were classified into five categories according to the annotation associated with the sequence in each genome (Figure , Table ). These classes were established according to the different scenarios that might explain annotation problems: potential orthologs, gene variants, new genes, intronic genes and not-aligned sequences. For this analysis only the protein-coding genes, which most likely represent functional genes, were selected; pseudogenes and retrotransposed genes were removed, as were the non-coding RNAs, which were analysed separately [44].
Although all the aligned sequences showed highly significant E-values, only results with more than 75% overall identity were selected for detailed manual curation. A web-based tool was created to provide easy access to the alignments and the available annotation for each gene.
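As an illustration of the identity cutoff, the sketch below keeps only alignments with more than 75% overall identity. It assumes that overall identity is already available as a percentage for each alignment; the `Alignment` record, its field names and the example values are hypothetical and do not describe the pipeline's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alignment:
    gene_id: str             # hypothetical identifier of the query gene
    percent_identity: float  # overall identity of the alignment, 0-100
    evalue: float            # significance reported by the aligner


def select_for_curation(
    alignments: List[Alignment], min_identity: float = 75.0
) -> List[Alignment]:
    """Keep alignments whose overall identity is strictly above the cutoff."""
    return [a for a in alignments if a.percent_identity > min_identity]


hits = [
    Alignment("gene_a", 92.3, 1e-120),
    Alignment("gene_b", 75.0, 1e-40),  # exactly at the cutoff, so excluded
    Alignment("gene_c", 61.8, 1e-15),
]
curated = select_for_curation(hits)
print([a.gene_id for a in curated])
```

The comparison is strict because the text specifies "more than 75%"; all three example hits have significant E-values, so identity alone decides which ones go forward to manual curation.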
Among the genes examined, 90% of the sequences had a significant match, although for a small fraction the alignments were not reliable. These included very short sequences and genes that had short alignments or that aligned with two different genes within the same genomic region; they were considered sequence or alignment artefacts. These "problematic" sequences were distributed throughout the genome and did not point to localised regions with assembly or annotation problems (data not shown).
The current level of annotation of the bovine genome is not comparable with that of human. Nevertheless, the alignment of the annotated bovine genes with the human genome produced some interesting results. In some cases there was evidence to suggest new, presently unannotated features in the human genome, including additional exons, as observed in the "gene variant" class, or potential new human genes, from the "new gene" and "intronic" classes. The latter were supported by other evidence in the region of the alignments, such as the coincident alignment of ESTs and Genscan predictions. Indeed, some of the features identified appeared in later releases of the Ensembl database, where additional human genes have been annotated exactly where the pipeline used here had aligned a bovine gene. This observation supports the value of this type of comparative approach. The "potential ortholog" class helped to identify additional orthology relationships, but it also revealed deficiencies in the genome sequence and errors in the annotation of many bovine genes. Generally, the annotation suggested that cow genes were shorter than their human orthologs, in many cases because exons had been missed at gene boundaries. Alignment of ESTs and Genscan predictions in the corresponding positions of the bovine genome suggested the presence of new bovine exons. In addition, many genes were identified in the bovine genome that had not been annotated.
It would be expected that genes with orthologs in human, mouse and dog should have homology relationships in cow, even if these had not been identified by the automated orthology prediction. Thus, the alignment of the human genes to the bovine genome should reveal new features that improve genome annotation in cow. Of the results in the "new gene" class, 46% could be considered new bovine genes. Indeed, in the latest Ensembl releases, half of those identified using the approach described here were added in a new Ensembl feature called "EST based genes", in agreement with our alignments. The interpretation of the results for genes in the "potential ortholog", "gene variant" and "intronic" classes is more complex, as it is not completely clear whether the observed alignments and differences are due to species-specific features or to problems with the bovine annotation or the genome assembly. Of the genes belonging to the "potential ortholog" class, 20% may be considered true orthologs that were missed by the Ensembl prediction pipeline, for the most part because of minor differences between the sequences. Accepting the current annotation of the bovine genome, 80% of the results in the "gene variant" class highlighted the presence of new exons for genes currently annotated in cow.
The "not aligned" class may contain real non-orthologs between the four species, but also orphan genes with no match in other species. This class was analysed for both the cow and human genomes by searching for similarities in the complete non-redundant protein database from NCBI. For most of the human sequences, a match was identified with bovine proteins whose annotation and description are exactly the same as in human. These results most probably represent gene sequences that are still not annotated or assembled into the bovine genome, and hence were completely missed by the Ensembl orthology prediction. Some of the cow genes for which there is no match with the human genome may indeed be novel bovine orphan genes, as only 11% in this class had a significant match with a human sequence and 37% had no match at all in the NCBI database. Among these genes there are novel sequences that also have supporting protein evidence; these are interesting candidates among which to look for cow-specific coding regions. The functions of orphan genes are generally poorly characterized [43]; they show distinctive features such as high tissue specificity, rapid evolution and short peptide size [34]. Recent studies have demonstrated that they evolve three to four times faster than the average gene in Drosophila [43] and in primates [34]. In some cases the sequence divergence between species may be so great that the orthology between the genes is not obvious. This situation is represented by the "Stella fragment" related gene (DPPE3), which is annotated and has good supporting evidence. Indeed, this gene has human and mouse counterparts, but the sequences are highly divergent between the species.
The discrimination between orthologs and paralogs remains difficult, especially when comparing large and incomplete genomes, as addressed by Fulton et al. [45]. Genes predicted as paralogs by Ensembl account for 49% and 60% of the bovine and human libraries, respectively. Paralogs, which mainly arise from a duplication event and may undergo structural rearrangements during evolution [1], are found in the non-orthologous sets described herein. Their sequence divergence might explain why they were missed as orthologs between species, and in some cases they can be traced back with the similarity approach used in this work.
Ontology descriptions, although incomplete for the bovine gene set because of its lower level of annotation, were useful in describing the groups of genes created in this work. Many of the genes with no apparent orthologs were clustered as proteins with binding properties. The typical modular composition of such proteins and their specificity for different ligands could explain structural differences that might affect the orthology prediction. Despite the annotation and similarity searches performed to retrieve GO terms for the bovine non-orthologous genes, no valid annotation was found for 75% of the cow genes in the "not aligned" group. This highlights the need to focus on this particular group of genes, which might reveal orphan as well as species-specific coding sequences.