As in the previous release, we generated an orthology-based phylogenetic tree by UPGMA clustering of pairwise species distances derived from shared ortholog content. The distances were calculated as 1 minus the fraction of orthologous proteins, averaged over both directions (34). This ‘orthophylogram’ is now too large to be shown as a figure but can be accessed online at http://InParanoid.sbc.su.se/download/current/orthophylogram.gif.
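The distance measure and the clustering step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the InParanoid implementation: the function names `ortholog_distance` and `upgma`, the use of each proteome's total size as the denominator, and the example counts are all hypothetical.

```python
from itertools import combinations


def ortholog_distance(n_orthologs_a, n_orthologs_b, size_a, size_b):
    """1 minus the fraction of orthologous proteins, averaged over both directions.

    Assumption: the fraction for each direction is the number of proteins in
    that proteome assigned to an ortholog cluster, divided by its proteome size.
    """
    frac_a = n_orthologs_a / size_a
    frac_b = n_orthologs_b / size_b
    return 1.0 - (frac_a + frac_b) / 2.0


def upgma(names, dist):
    """Plain UPGMA. `dist` maps frozenset({a, b}) to a distance.

    Returns the tree topology as nested tuples (branch lengths are omitted).
    """
    clusters = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to every remaining cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


if __name__ == "__main__":
    # Invented counts for three placeholder species, for illustration only.
    sizes = {"sp1": 10000, "sp2": 12000, "sp3": 9000}
    with_ortholog = {
        ("sp1", "sp2"): 9000, ("sp2", "sp1"): 9500,
        ("sp1", "sp3"): 6000, ("sp3", "sp1"): 5800,
        ("sp2", "sp3"): 6100, ("sp3", "sp2"): 5900,
    }
    dist = {
        frozenset((a, b)): ortholog_distance(
            with_ortholog[(a, b)], with_ortholog[(b, a)], sizes[a], sizes[b]
        )
        for a, b in combinations(sizes, 2)
    }
    print(upgma(list(sizes), dist))
```

Because the distance depends only on how many proteins have orthologs, two proteomes with highly similar sequences can still appear distant if one of them contains many proteins that cannot be assigned to an ortholog cluster.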
The difference between this tree and sequence alignment-based trees is that it reflects the entire proteome's content, and the level of sequence similarity is not explicitly taken into account. Because of this, but also because of incompleteness in the proteomes themselves, it may differ from classical phylogenetic trees. For most species, it corresponds to the accepted phylogeny, but a number of noteworthy differences were observed. For instance, the guinea pig (Cavia porcellus), which is a new species in release 7, clusters with dog rather than with other rodents. The egg-laying venomous mammal platypus (Ornithorhynchus anatinus) is strangely placed at the root of all other vertebrates outside of birds, frog and fish.
Intriguingly, the macaque monkey (Macaca mulatta) is placed far outside of the other primates, even outside cow and horse. This was not the case in release 6 and appears to be an artifact of the proteome sequence. Drastic changes have been made to the proteomes of human and chimpanzee between releases 6 and 7 (>25% of the sequences have been modified), but macaque is essentially unchanged. Comparing the average identity of the best BLAST HSP between H. sapiens, P. troglodytes, M. mulatta, Bos taurus and Canis familiaris in both the previous and current versions showed no major changes (see Supplementary Table S2).
Consistency for proteomes found in both InParanoid 6 and 7
However, looking at one-way fractions of shared orthologs reveals the problem. The distance ‘to H. sapiens’ was higher for M. mulatta than for all other species in the group. Also, the distance to chimpanzee and to orangutan was highest or second highest for M. mulatta. This indicates that macaque contains a large number of proteins that did not find orthologs in closely related species. It is possible that these are fragments or short splice variants, preventing them from being detected as orthologs. Even if the same splice variant exists in human, it would not be used by InParanoid if a longer variant exists, and the orthology may be lost due to small overlap. It thus seems that the macaque gene annotations should be updated to be more in line with other primates.
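The one-way comparison can be illustrated with a short Python sketch. It assumes that the distance "to" a target species is 1 minus the fraction of the source proteome that has an ortholog in the target. The function name `one_way_distances`, the data layout and the counts are hypothetical and are not taken from InParanoid.

```python
def one_way_distances(ortholog_counts, proteome_sizes, target):
    """Rank source species by their one-way distance to `target`.

    ortholog_counts maps (source, target) to the number of source proteins
    that have an ortholog in the target species.
    Returns {source: distance}, ordered from most to least distant.
    """
    distances = {
        source: 1.0 - n / proteome_sizes[source]
        for (source, tgt), n in ortholog_counts.items()
        if tgt == target
    }
    return dict(sorted(distances.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    # Invented numbers: species_B has many proteins without an ortholog in the
    # reference species, as would happen with fragments or short splice variants.
    sizes = {"species_A": 20000, "species_B": 36000}
    counts = {("species_A", "reference"): 18000, ("species_B", "reference"): 18000}
    for species, distance in one_way_distances(counts, sizes, "reference").items():
        print(species, round(distance, 3))
```

In this toy case both species share the same number of orthologs with the reference, yet the one with the inflated proteome ends up with a much larger one-way distance, which is the pattern described for macaque.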
One of the orthophylogram anomalies found with InParanoid 6 was that Danio rerio was not grouped with other fishes. This is, however, the case in release 7, although as an outlier of the other fishes, not far from its placement in the previous release. Opossum, which was grouped within placental mammals, is still found in this clade, although in a different place. The orthophylogram is thus a useful tool for identifying inconsistencies in the proteome data and will hopefully spur genome annotators to improve gene predictions.
The average number of inparalogs per cluster ranged from 1.00 (between Cryptosporidium hominis and Cryptosporidium parvum) to 5.31 (Trichomonas vaginalis when compared with Giardia lamblia, both protozoans). This is in concordance with the early divergence of T. vaginalis and G. lamblia as well as with C. hominis and C. parvum being closely related (36). The overall mean number of inparalogs per species was 1.46, and the median was 1.27. The distribution of cluster sizes is shown in Figure 3.
Figure 3. Histogram of the average number of inparalogs/cluster per species for all species–species comparisons in InParanoid 7. Vertebrates and fungi generally have a lower number of inparalogs per cluster (always <3), whereas invertebrates, ...
The input sequences used by InParanoid often change with new releases. This can be due to a change in our sources for the data and/or changes in the genome annotations themselves. As this could result in different orthology assignments between versions, we examined whether each proteome differed from the corresponding proteome used in the previous version. For each species found in both versions, we compared sequences using checksums and identifiers. We computed a checksum for each sequence and counted the fraction of matching checksums between versions. Similarly, we counted the number of identifiers common to both versions. A large change in the number of proteins between versions (due to extensive genome reannotation, for example) could prevent a large fraction of sequences in one version from being matched in the other. We therefore calculated the fractions by dividing the number of matches by the number of sequences in the smaller of the two versions.
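A minimal Python sketch of this comparison is given below. The text does not specify the checksum algorithm, so SHA-256 is used here purely as an assumption. The function names and the representation of a proteome as a dictionary of identifier to sequence are likewise hypothetical.

```python
import hashlib


def sequence_checksums(proteome):
    """Return the set of checksums for all sequences in a proteome.

    `proteome` maps identifier -> amino acid sequence. Sequences are
    upper-cased first so that case differences do not count as changes.
    """
    return {
        hashlib.sha256(seq.upper().encode("ascii")).hexdigest()
        for seq in proteome.values()
    }


def consistency(old, new):
    """Fractions of shared sequences and shared identifiers between versions.

    Both fractions are normalised by the size of the smaller proteome, so a
    large change in protein count alone does not force a low score.
    """
    smaller = min(len(old), len(new))
    if smaller == 0:
        raise ValueError("both proteomes must contain at least one sequence")
    shared_sequences = len(sequence_checksums(old) & sequence_checksums(new))
    shared_identifiers = len(old.keys() & new.keys())
    return shared_sequences / smaller, shared_identifiers / smaller


if __name__ == "__main__":
    # Toy proteomes: one sequence kept its identifier, one was renamed,
    # and one was dropped from the newer version.
    old = {"p1": "MKTAYIAK", "p2": "MAAGLLV", "p3": "MCCPWQ"}
    new = {"p1": "MKTAYIAK", "x2": "MAAGLLV"}
    print(consistency(old, new))
```

Using sets means that identical sequences occurring more than once within a proteome are counted a single time, which is a simplification of this sketch.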
Most proteomes showed a large fraction of shared identical sequences, while a minority were drastically changed. The source for some species was changed between releases 6 and 7 of InParanoid, while in other cases all identifiers were changed by the source. A comparison of identifiers was therefore not possible in most cases, but where identifiers were comparable, the consistency between the versions was generally high.
The changes to the proteomes with a low fraction of shared identical sequences could potentially be large enough to affect the orthology assignment. In order to determine if this was the case, we performed whole-proteome BLAST comparisons of the proteomes with a low consistency between versions. Using the version with the fewest sequences as query and the version with the most sequences as database, we computed the average match identity as the number of identical residues in the best HSP divided by the length of the query. The results ranged from 63% for A. mellifera to 96% for Cryptococcus neoformans, with most being above 90%. These changes should reflect improvements in proteome quality. For example, the A. mellifera proteome previously used has been deprecated and removed from Ensembl, so the orthology assignment in the new version should be more accurate.
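The average match identity can be sketched in Python as follows. Two points are assumptions rather than details given in the text: the best HSP for a query is taken to be the one with the highest bit score, and queries without any hit contribute an identity of zero. The `Hsp` record and the function name are hypothetical, and parsing of BLAST output is left out.

```python
from typing import Iterable, Mapping, NamedTuple


class Hsp(NamedTuple):
    query_id: str
    identical: int    # number of identical residues in this HSP
    bit_score: float


def average_match_identity(hsps: Iterable[Hsp], query_lengths: Mapping[str, int]) -> float:
    """Mean over all queries of (identical residues in best HSP) / (query length)."""
    if not query_lengths:
        raise ValueError("query_lengths must not be empty")
    # Keep only the highest-scoring HSP for each query.
    best = {}
    for hsp in hsps:
        current = best.get(hsp.query_id)
        if current is None or hsp.bit_score > current.bit_score:
            best[hsp.query_id] = hsp
    total = 0.0
    for query_id, length in query_lengths.items():
        if query_id in best:
            total += best[query_id].identical / length
        # Assumption: a query with no hit adds 0 to the total.
    return total / len(query_lengths)


if __name__ == "__main__":
    # Toy input: q1 has two HSPs, q2 has one.
    hsps = [Hsp("q1", 90, 200.0), Hsp("q1", 50, 80.0), Hsp("q2", 40, 100.0)]
    lengths = {"q1": 100, "q2": 50}
    print(round(average_match_identity(hsps, lengths), 3))
```

Dividing by the full query length rather than the alignment length penalises partial matches, so a truncated or extended gene model lowers the score even when the aligned region is identical.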