Assigning functions to protein-coding genes is one of the most important tasks in the post-genome era. With the wealth of genomes available, automatic methods for identifying evolutionary relationships between genes become important when transferring functions from already annotated genes to unannotated ones. Consequently, it is of the utmost importance that the evolutionary relationships inferred between genes reflect their true evolutionary history. The term "homology" is not sufficiently precise for describing the evolutionary relationship between genes, and previous publications have therefore established more precise definitions. Orthologs are genes that derive from a single gene in the last common ancestor and have been separated by a speciation event [1]. They can typically be considered functional counterparts in different species. Paralogs, on the other hand, are genes that derive from a single gene that has been duplicated within a genome. When a gene has been duplicated, one of the copies may be freer to adapt to new functions, whereas the other retains the original function. Paralogs can be further separated into two subgroups, inparalogs and outparalogs, depending on when during evolution the duplication occurred [2]. If the duplication occurred after the speciation event, the genes are considered inparalogs, meaning that they are co-orthologs to one or several genes in another species. Analysis of inparalogs can be used to detect lineage-specific adaptations. However, if the duplication happened before the speciation event, the sequences are outparalogs and as such do not form any co-ortholog relationship with genes in another genome. Hence, outparalogs cannot be used to transfer functional assignments between species.
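The distinction between inparalogs and outparalogs depends only on the relative timing of the duplication and the speciation event. The following minimal sketch illustrates that rule; the function name and the use of numeric ages (measured backwards from the present) are assumptions made for illustration and are not part of any published method.

```python
def classify_paralogs(duplication_age: float, speciation_age: float) -> str:
    """Classify a paralog pair relative to one speciation event.

    Both ages are hypothetical inputs measured backwards from the present,
    so a larger value means an older event.
    """
    if duplication_age < speciation_age:
        # The duplication is younger than the species split, so both copies
        # are lineage-specific and co-orthologous to the other species' gene.
        return "inparalog"
    # The duplication predates (or coincides with) the split, so both copies
    # already existed in the common ancestor.
    return "outparalog"


print(classify_paralogs(duplication_age=50, speciation_age=100))   # inparalog
print(classify_paralogs(duplication_age=200, speciation_age=100))  # outparalog
```

In practice the event ages are rarely known directly; they are inferred from sequence similarity or from a gene tree, which is what ortholog detection methods attempt to do.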
Several strategies have been employed for identifying orthologs, e.g., bidirectional best hits (BBH) [3], InParanoid [4], OrthoMCL [5], KOG [6], Ensembl Compara [7], HomoloGene [8], eggNOG [9], and OMA [10]. These include both pairwise matching-based methods and tree-based methods, and they also differ in whether they can assign orthology across two or several species. The performance of these strategies has been compared previously [11]. Although these comparative studies do not fully agree, InParanoid [4] was found to be one of the most accurate pairwise ortholog assignment algorithms. Particularly when analyzing evolutionary relationships among eukaryotic genes, it becomes very important to distinguish inparalogs from outparalogs, which methods based on simple two-way best matching fail to accomplish. Therefore, the InParanoid algorithm was designed to separate inparalogs, which are to be included in the cluster, from outparalogs, which are to be excluded; it also supplies a confidence score for each inparalog in the cluster (Figure 1). Moreover, the ortholog assignments are fully automatic and the algorithm is fast, thus enabling re-analysis of data upon new releases of genomes.
Figure 1. Graphical representation of an InParanoid ortholog cluster with the outparalogs outside the cluster indicated. The seed orthologs from the different species are denoted A1 and B1, and they are the bidirectional best BLAST hits. Their similarity score ...
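The simple two-way best matching mentioned above can be sketched as follows. This is an illustration of the generic bidirectional best-hit idea only, not the InParanoid implementation; the function names, the score dictionaries, and the example scores are all hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score, for example
    a bit score parsed from a sequence search (assumed input format).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented scores: A1 and A2 are from species A, B1 and B2 from species B.
a_vs_b = {("A1", "B1"): 900, ("A1", "B2"): 300,
          ("A2", "B1"): 850, ("A2", "B2"): 200}
b_vs_a = {("B1", "A1"): 900, ("B1", "A2"): 850,
          ("B2", "A1"): 300, ("B2", "A2"): 200}

print(bidirectional_best_hits(a_vs_b, b_vs_a))  # [('A1', 'B1')]
```

In this toy example A2 is left unassigned even though it scores almost as well against B1 as A1 does. Best-hit matching alone cannot say whether A2 is an inparalog of A1 or an outparalog, which is the gap the clustering step described above is meant to close.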
Ever since the discovery of introns, their evolution has been studied. It has been shown that introns often maintain their positions over very long evolutionary timescales [14]. At these longer evolutionary distances, the sequence or length of the introns is never conserved. However, for very closely related species there might be selective pressure to maintain some intronic sequences due to the presence of regulatory elements in the introns. Consequently, the conserved intron positions found between even distant species might be used to separate orthologs from other homologs. Indeed, this has been done successfully in specific case studies of gene families, e.g., chemoreceptors [16], heat shock proteins [18], and homeobox genes [19]. Also, an algorithm called Exalign has been published in which exon-intron gene structures are used to resolve phylogenetic relationships [20]. This method relies solely on exon lengths and, when available, phase to infer gene structural alignments. A drawback is that genes need at least four to five internal exons to produce high-scoring alignments with a significant E-value, which limits its applicability.
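Exon lengths and intron phase are closely linked: the phase of an intron is the number of coding bases preceding it, modulo 3. The sketch below shows only this relationship and is not the Exalign algorithm. It assumes that the input lists the coding portion of each exon in transcript order, starting at the first base of the start codon; the function name is hypothetical.

```python
def intron_phases(coding_exon_lengths):
    """Return the phase (0, 1, or 2) of each intron in a coding sequence.

    `coding_exon_lengths` holds the number of coding bases in each exon, in
    transcript order (assumed input). Phase 0 means the intron lies between
    two codons, phase 1 after the first base of a codon, and phase 2 after
    the second base.
    """
    phases = []
    coding_bases = 0
    # There is no intron after the last exon, so it is skipped.
    for length in coding_exon_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


print(intron_phases([100, 50, 31, 60]))  # [1, 0, 1]
```

Two genes with the same sequence of exon lengths and phases have compatible gene structures, which is the kind of signal a structure-based comparison can exploit without looking at the residues themselves.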
Intron insertion is not a random process; introns preferentially insert into, or are fixed at, so-called protosplice sites [21]. One study claimed that the majority of introns shared between distant species were the result of parallel gain at these sites [25]. These findings were later disputed, and it was shown that protosplice sites are no more conserved during eukaryotic evolution than random sites [26]. In addition, simulation of intron insertion into protosplice sites with the observed protosplice site frequencies and intron densities showed that parallel gain could account for only 5-10% of shared intron positions in distantly related species. This has subsequently been verified in other studies, in which on average ~8% of shared intron positions in distantly related species were found to be due to parallel gain [27]. However, across the eukaryotic lineages, the distribution of parallel gain was highly heterogeneous: evolutionarily closer species showed virtually no shared introns due to parallel gain, whereas evolutionarily more distant species, such as human and plants, exhibited up to 20% parallel gain. A complicating factor when analyzing intron position conservation (IPC) is that different lineages exhibit very divergent rates and patterns of intron loss and gain [15]. Intron loss seems to be generally more prevalent than gain among orthologous genes [30], although some studies show that the opposite can sometimes be true [33].
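The logic behind such parallel-gain simulations can be illustrated with a toy model. The sketch below is not the published simulation: it assumes that two lineages gain introns independently and uniformly at random among a fixed set of candidate protosplice sites, with no intron loss, and all names and numbers are invented for illustration.

```python
import random


def simulate_parallel_gain(n_sites, gains_a, gains_b, trials=1000, seed=0):
    """Estimate the mean number of positions gained in both lineages.

    Each lineage independently picks its gained positions uniformly at
    random among `n_sites` candidate sites (a simplifying assumption).
    """
    rng = random.Random(seed)
    total_shared = 0
    for _ in range(trials):
        sites_a = set(rng.sample(range(n_sites), gains_a))
        sites_b = set(rng.sample(range(n_sites), gains_b))
        total_shared += len(sites_a & sites_b)
    return total_shared / trials


def expected_parallel_gain(n_sites, gains_a, gains_b):
    """Analytic expectation of the overlap under the same toy model."""
    return gains_a * gains_b / n_sites


simulated = simulate_parallel_gain(n_sites=1000, gains_a=100, gains_b=100)
expected = expected_parallel_gain(n_sites=1000, gains_a=100, gains_b=100)
print(simulated, expected)
```

Comparing the number of coincidences produced by such a null model with the number of shared positions actually observed gives an estimate of the fraction attributable to parallel gain. Realistic simulations would additionally need the observed site frequencies and intron densities mentioned above.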
The question remains whether shared intron positions in different genes can be used on a global scale to aid the elucidation of evolutionary relationships, even between distant eukaryotic species. Therefore, in this study, we analyzed the full genomes of seven eukaryotic species (human versus six other eukaryotes) to determine whether IPC can be used to distinguish orthologs from proteins that merely share amino acid similarity. More specifically, we examine whether ortholog-ortholog (o-o) pairs have a higher IPC score than ortholog-closest non-ortholog (o-cno) pairs. By analogy, we also investigate whether inparalog-inparalog (i-i) pairs have a higher IPC score than inparalog-closest non-inparalog (i-cni) pairs. If this is the case, IPC could be used as a discriminatory variable when elucidating evolutionary relationships. Since evolutionarily conserved sequences tend to have a higher sequence identity than unrelated sequences, we also examined the possible dependence between IPC and sequence identity. Finally, if IPC is to be a predictor of orthology, it must agree at least to some extent with existing reliable orthology detection methods. Therefore, we analyzed the agreement between the InParanoid orthology score and the IPC score.
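To make the o-o versus o-cno comparison concrete, the sketch below scores a pair of genes by the overlap of their intron positions. The scoring formula is an assumption: this section does not define the IPC score, so a Jaccard-style ratio (shared positions divided by all distinct positions) is used purely as a stand-in. Intron positions are assumed to be expressed as columns of a pairwise protein alignment, and the example coordinates are invented.

```python
def ipc_score(introns_a, introns_b):
    """Hypothetical IPC score: shared positions / all distinct positions.

    Both arguments are iterables of intron positions given in the same
    alignment coordinate system (assumed input). Returns 0.0 when neither
    gene has any intron.
    """
    a, b = set(introns_a), set(introns_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Invented example: a human gene, its ortholog, and its closest non-ortholog.
human = [12, 57, 130, 201]
ortholog = [12, 57, 130, 244]
closest_non_ortholog = [40, 130, 300]

o_o = ipc_score(human, ortholog)                 # 3 shared of 5 distinct
o_cno = ipc_score(human, closest_non_ortholog)   # 1 shared of 6 distinct
print(o_o > o_cno)  # True
```

Under the hypothesis examined here, o-o pairs would tend to score higher than o-cno pairs across the genome, as they do in this toy example.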