The complete sequencing of the human genome represented a major breakthrough for the genome era [1]. Since then, a number of genome-wide experimental and computational analyses have been performed that capture different aspects of the biology of the human cell. These analyses include, among many others, those of the so-called transcriptome [3], proteome [4], interactome [5] and metabolome [6]. The availability of such large datasets has added new dimensions to the study of the human organism; not only are they useful in elucidating the function of otherwise uncharacterized proteins, but they also provide information on the system-level properties of the cell [7]. The reconstruction of the evolutionary histories of all genes encoded in a genome, the so-called phylome [8], constitutes another source of genome-wide information. Analyses of complete phylomes, however, have traditionally been hampered by their large demands on time and computer power. Only recently have faster computers and algorithms paved the way for the application of phylogenetics to whole genomes. Such analyses have proven to be a very useful tool for the detection of specific evolutionary scenarios [9] and for the functional characterization of genes and biological systems [10]. Other large-scale phylogenetic analyses have focused on the establishment of orthology relationships among genes in model species. Most remarkably, the Ensembl database now includes phylogenetic trees [12], and the TreeFam [13] and HOVERGEN [14] databases provide automatically derived and curated phylogenies of animal gene families. Other such databases focus on specific aspects of the evolution of gene families, such as the detection of adaptive events [15]. These databases follow a family-based approach: they first group the genes into families and subsequently build a single phylogeny for each family.
Using a different, gene-based approach that aims at maximizing both the coverage of the human genome and the taxon sampling among fully sequenced eukaryotic genomes, we have developed a fully automated pipeline (Figure 1) to reconstruct the phylogenies of every protein encoded in the human genome and its homologs in 39 eukaryotic species. The pipeline is designed to resemble, as closely as possible, the manual procedure used by phylogeneticists while remaining fully automated. In the search for a compromise between time and reliability, we always tried to shift the balance towards the latter, thus ensuring high quality in the resulting phylogenies. In contrast to the above-mentioned TreeFam and Ensembl phylogenetic pipelines, our approach includes evolutionary model testing using maximum likelihood (ML), model parameter estimation and alignment trimming steps. Moreover, besides using neighbor joining (NJ) and ML approaches for phylogenetic reconstruction, our pipeline also implements a Bayesian phylogenetic reconstruction approach to provide posterior probabilities for every partition in the tree. As a result, building the human phylome presented here took two months on a total of 140 64-bit processors, which is roughly equivalent to 23 years on a single processor. To our knowledge, this represents the most sophisticated phylome reconstruction pipeline and the largest investment of computing time for a single phylome reported to date.
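The single-processor equivalent quoted above can be checked with a short calculation. The sketch below assumes that all 140 processors were fully occupied for the whole two months, which the text does not state explicitly; the function name is illustrative.

```python
# Back-of-the-envelope check of the computing time quoted in the text.
# Assumption (not stated in the text): every processor was busy for the
# entire wall-clock period.
PROCESSORS = 140         # from the text
WALL_CLOCK_MONTHS = 2    # from the text
MONTHS_PER_YEAR = 12


def single_processor_years(processors: int, months: float) -> float:
    """Total processor time, expressed as years on one processor."""
    return processors * months / MONTHS_PER_YEAR


cpu_years = single_processor_years(PROCESSORS, WALL_CLOCK_MONTHS)
print(f"{cpu_years:.1f} processor-years")  # prints "23.3 processor-years"
```

Under that assumption, 280 processor-months correspond to about 23.3 years, consistent with the rounded figure of 23 years.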
Figure 1. Schematic representation of the phylogenetic pipeline used to reconstruct the human phylome. Each protein sequence encoded in the human genome is compared against a database of proteins from 39 fully sequenced eukaryotic genomes (Table 1) to select putative ...
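The difference between the family-based and the gene-based strategies can be illustrated with a toy example. The sketch below is not the pipeline itself: the sequence identifiers and the homology hits are invented, family grouping is approximated by connected components of the hit graph, and the gene-based set is simply each human seed plus its direct hits. The actual selection criteria are not given in this section.

```python
from collections import defaultdict

# Hypothetical homology hits between sequences (identifiers are invented).
HITS = [
    ("Hsa_A", "Mmu_A"),
    ("Mmu_A", "Dme_A"),
    ("Hsa_B", "Dme_A"),
    ("Hsa_C", "Sce_C"),
]


def neighbours(hits):
    """Build an undirected graph: sequence -> set of sequences it hits."""
    graph = defaultdict(set)
    for a, b in hits:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def family_based(hits):
    """One sequence set per family (assumed: connected components)."""
    graph = neighbours(hits)
    seen, families = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, family = [start], set()
        while stack:
            node = stack.pop()
            if node in family:
                continue
            family.add(node)
            stack.extend(graph[node] - family)
        seen |= family
        families.append(family)
    return families


def gene_based(hits, seed_prefix="Hsa_"):
    """One sequence set per seed gene: the seed plus its direct hits."""
    graph = neighbours(hits)
    return {
        seed: {seed} | graph[seed]
        for seed in sorted(graph)
        if seed.startswith(seed_prefix)
    }


print(len(family_based(HITS)))  # 2 families, hence 2 trees
print(len(gene_based(HITS)))    # 3 human seeds, hence 3 trees
```

In this toy data set the family-based grouping yields two trees, whereas the gene-based strategy yields one tree per human protein, each built from that protein's own set of homologs.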
The availability of such a comprehensive collection of evolutionary histories of protein-coding human genes constitutes a valuable source of information that allows us to test several evolutionary hypotheses. For this purpose, we investigated the consistency of the individual phylogenies within the phylome with alternative evolutionary scenarios, namely those involving the relative positions of rodents and primates, amoebozoans and opisthokonts and, finally, insects, nematodes and chordates. We also scanned the human phylome for cases of putative horizontally transferred genes and found that such topologies are never highly supported, indicating that they are more likely the result of phylogenetic artifacts. Moreover, we provide estimates for the number of gene duplications that have occurred at different evolutionary stages in the eukaryotic lineages leading to hominids and identify several over-represented functional classes in the different duplication events. Finally, we explored an alternative, fully automated algorithm to infer orthology relationships from phylogenetic trees that does not require a fully resolved species phylogeny and, therefore, is less sensitive to topological variations. The choice of this novel methodology for orthology prediction is based on the fact that alternative tree reconciliation methods have difficulties in accounting for inherent phylogenetic noise, divergences in evolutionary histories for different genes and the low resolution level of available species trees. As will be shown below, the high degree of topological variation found in the human phylome for all scenarios considered also supports the choice of alternatives to classic tree reconciliation methods.
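The paragraph above does not spell out how orthology is inferred without a species tree. One rule that fits the description, assumed here purely for illustration, is a species-overlap test: a node of the gene tree is treated as a duplication if the species found on the seed's side of the node also occur on the other side, and as a speciation otherwise. The sketch below applies that rule to a small binary tree written as nested tuples; the leaf naming scheme (`Species_gene`) and the example tree are hypothetical.

```python
def species(leaf):
    """Species code of a leaf named 'Species_gene' (assumed convention)."""
    return leaf.split("_", 1)[0]


def leaves(node):
    """All leaf names below a node (a leaf is a string, a node a tuple)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def classify(tree, seed):
    """Return (orthologs, paralogs) of `seed` using species overlap."""
    orthologs, paralogs = set(), set()

    def visit(node):
        # Returns True if the seed sequence lies below this node.
        if isinstance(node, str):
            return node == seed
        flags = [visit(child) for child in node]
        if not any(flags):
            return False
        seed_side = [c for c, f in zip(node, flags) if f][0]
        others = [l for c, f in zip(node, flags) if not f for l in leaves(c)]
        seed_species = {species(l) for l in leaves(seed_side)}
        other_species = {species(l) for l in others}
        if seed_species & other_species:
            paralogs.update(others)    # shared species: duplication node
        else:
            orthologs.update(others)   # no shared species: speciation node
        return True

    visit(tree)
    return orthologs, paralogs


# Hypothetical gene tree with one duplication predating the Hsa/Mmu split.
TREE = ((("Hsa_A1", "Mmu_A1"), ("Hsa_A2", "Mmu_A2")), "Dme_A")
orthologs, paralogs = classify(TREE, "Hsa_A1")
print(sorted(orthologs))  # ['Dme_A', 'Mmu_A1']
print(sorted(paralogs))   # ['Hsa_A2', 'Mmu_A2']
```

Because the rule only compares the sets of species on either side of each node, it needs no reference species topology, which is why an approach of this kind would be less affected by the topological variation discussed above than reconciliation against a fixed species tree.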
All in all, the results presented here constitute a preliminary but broad overview of the evolutionary history of the human genome, which is not taken as an average or represented by a limited number of genes, but instead is regarded as a complex mosaic of thousands of individual phylogenies.