Seventeen new species have been added since TreeFam v1 (4). TreeFam v4 contains predicted protein sequences from the fully sequenced genomes of 25 animal species: human, chimpanzee, macaque, mouse, rat, cow, dog, opossum, chicken, frog, two pufferfish (Takifugu), zebrafish, medaka, stickleback, sea squirts (Ciona intestinalis and C. savignyi), two fruit-flies (Drosophila melanogaster and D. pseudoobscura), two mosquitoes (Aedes aegypti and Anopheles gambiae), the flatworm Schistosoma mansoni, and the nematodes Caenorhabditis elegans, C. briggsae and C. remanei. In addition, four outgroup genomes are included: baker's yeast, fission yeast, rice and thale cress (Arabidopsis).
The C. briggsae and C. remanei proteins were downloaded from WormBase (16), D. pseudoobscura proteins from FlyBase (17), fission yeast and flatworm proteins from GeneDB (18), thale cress proteins from TIGR (19), rice proteins from the Beijing Genomics Institute (20) and the remaining sequences from Ensembl (15). In addition to these species, TreeFam includes UniProt (21) proteins from animal species whose genomes have not been fully sequenced. For TreeFam v4, all sequences were downloaded in October 2006.
TreeFam is a two-part database: TreeFam-B consists of automatically generated trees, and TreeFam-A consists of manually curated trees.
Automatically generating trees for TreeFam-B
TreeFam v1 used clusters of genes from PhIGs (11) as seeds for B families. However, for TreeFam v4, each B seed consists of genes from ‘core’ species from the corresponding TreeFam-3 family. ‘Core’ species were selected because they have high-quality reference genome sequences and gene predictions, and because together they give good phylogenetic representation of the phyla of biological or phylogenetic importance. These were human, mouse, opossum, chicken, frog, pufferfish (Takifugu), zebrafish, sea squirt (C. savignyi), flatworm (22), D. melanogaster, C. elegans, baker's yeast, fission yeast, thale cress and rice. This change allowed TreeFam to use new gene sets that are absent from PhIGs, and to ensure that families remain stable from one release to the next.
Each seed family in TreeFam-B is expanded by using blast and hmmer to search for sequence matches among the animal and outgroup protein data sets, including animal sequences from UniProt. In TreeFam v1, we expanded each seed to form a full family. In TreeFam v4, we also made a ‘clean’ family from each seed, which only contains genes from fully sequenced genomes. The reasons for making a clean family were that (i) truncated proteins from UniProt sometimes cause problems for tree-building algorithms, and (ii) the algorithms we use to build trees (described subsequently) perform best when given both DNA and protein sequences, but many UniProt proteins lack easily identifiable DNA sequences.
Furthermore, for TreeFam v4, we employed a new approach to ensure that each animal gene appears in only one family. First, we assigned each transcript to the B or A family for which it had the highest-scoring hmmer match. Second, for each family, we kept only one transcript from each gene: the transcript with the highest-scoring hmmer match to the family. The one situation in which a gene is allowed to belong to more than one family is where the gene has transcripts with highest-scoring matches to different families. This can occur because Ensembl treats all overlapping transcripts as one gene, whereas bad gene predictions or true gene fusion events may lead to transcripts that share only short fragments at the DNA level and have different functions.
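The two-step assignment described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the TreeFam pipeline itself: the `(gene, transcript, family, score)` tuple layout, the function name `assign_genes` and the identifiers in the example are all hypothetical.

```python
from collections import defaultdict


def assign_genes(hits):
    """Assign transcripts to families, keeping one transcript per gene per family.

    `hits` is an iterable of (gene, transcript, family, score) tuples, where
    `score` stands for the hmmer match score of the transcript against the
    family. This input layout is an assumption made for illustration.
    """
    # Step 1: each transcript goes to the family with its highest-scoring match.
    best_for_transcript = {}
    for gene, transcript, family, score in hits:
        current = best_for_transcript.get(transcript)
        if current is None or score > current[2]:
            best_for_transcript[transcript] = (gene, family, score)

    # Step 2: within each family, keep only the best-scoring transcript per gene.
    best_for_gene = {}
    for transcript, (gene, family, score) in best_for_transcript.items():
        key = (family, gene)
        current = best_for_gene.get(key)
        if current is None or score > current[1]:
            best_for_gene[key] = (transcript, score)

    families = defaultdict(dict)
    for (family, gene), (transcript, _score) in best_for_gene.items():
        families[family][gene] = transcript
    return dict(families)


# Hypothetical hits: geneB has transcripts whose best matches are to
# different families, so it ends up in both.
example_hits = [
    ("geneA", "A-1", "TF1", 310.0),
    ("geneA", "A-1", "TF2", 120.0),
    ("geneA", "A-2", "TF1", 250.0),
    ("geneB", "B-1", "TF1", 90.0),
    ("geneB", "B-2", "TF2", 400.0),
]
assignment = assign_genes(example_hits)
```

In this example `geneA` contributes only its best transcript to one family, while `geneB` appears in two families because its transcripts have their highest-scoring matches in different families, which is the one exception the text allows.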
After expanding the seed to a full family and a clean family, the protein sequences in each full or clean family are aligned using Muscle version 3.6 (23). The alignment is then filtered to retain only conserved regions, as described in Li et al. For TreeFam v1, the filtered alignment was used as input to a neighbour-joining (NJ) algorithm, which constructed a phylogenetic tree based on amino acid mismatch distances. Since TreeFam v1, we have greatly refined our tree-building process so that the automatic trees are substantially more accurate (24). We describe the improvements to the tree-building method used in TreeFam v4 below.
For TreeFam v4, for each B ‘clean’ family five trees were built:
- (i) a maximum likelihood (ML) tree built using phyml (25), based on the protein alignment with the WAG model;
- (ii) an ML tree built using phyml, based on the codon alignment with the HKY model;
- (iii) an NJ tree using p-distance, based on the codon alignment;
- (iv) an NJ tree using dN distance, based on the codon alignment; and
- (v) an NJ tree using dS distance, based on the codon alignment.
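As an illustration of the simplest of the distances listed above, the p-distance between two aligned sequences is the proportion of compared sites at which they differ. The sketch below assumes that sites with a gap in either sequence are skipped; the source does not say how gaps were handled, and the function names and toy alignment are hypothetical.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites among positions with no gap in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: gapped sites are ignored
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


def distance_matrix(alignment):
    """Pairwise p-distances for a dict mapping sequence name to aligned sequence."""
    names = sorted(alignment)
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x in names
        for y in names
    }


# Toy nucleotide alignment (hypothetical data).
toy_alignment = {"g1": "ATGGCC", "g2": "ATGGCA", "g3": "AT-GTA"}
matrix = distance_matrix(toy_alignment)
```

A matrix of this kind is the input that an NJ algorithm would cluster into a tree; the NJ step itself is not shown here.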
For (i) and (ii), we used a modified version of phyml release 2.4.5 (Heng Li, unpublished manuscript) which takes an input species tree, and tries to build a gene tree that is consistent with the topology of the species tree. This ‘species-guided’ phyml uses the original phyml tree-search algorithm (25). However, the objective function maximized during the tree-search is multiplied by an extra likelihood factor not found in the original phyml. This extra likelihood factor reflects the number of duplications and losses inferred in a gene tree, given the topology of the species tree. The species-guided phyml allows the gene tree to have a topology that is inconsistent with the species tree if the alignment strongly supports this. The species tree was based on the NCBI taxonomy tree (see ‘Orthologue Inference’ section subsequently).
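The effect of multiplying the likelihood by an extra factor is easiest to see in log space, where the factor becomes an additive penalty. The sketch below is only a toy model: the source does not give the form of the extra factor, so the fixed per-event probabilities `p_dup` and `p_loss`, the function name and the example numbers are all assumptions made for illustration.

```python
import math


def guided_log_objective(log_lik_alignment, n_dup, n_loss, p_dup=0.1, p_loss=0.1):
    """Log of (alignment likelihood x duplication/loss factor).

    Assumption: each inferred duplication or loss contributes a fixed
    probability to the extra factor. The real factor used by the
    species-guided phyml is not specified in the text.
    """
    return (
        log_lik_alignment
        + n_dup * math.log(p_dup)
        + n_loss * math.log(p_loss)
    )


# A gene tree that matches the species tree: no duplications or losses inferred.
consistent = guided_log_objective(-1000.0, n_dup=0, n_loss=0)

# An inconsistent tree whose alignment likelihood is only slightly better.
weakly_supported = guided_log_objective(-998.0, n_dup=1, n_loss=2)

# An inconsistent tree whose alignment likelihood is much better.
strongly_supported = guided_log_objective(-980.0, n_dup=1, n_loss=2)
```

Under these assumed numbers the consistent tree beats the weakly supported alternative, but the strongly supported alternative beats the consistent tree, which mirrors the behaviour described in the text.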
The final tree for a B clean family is made by merging the five trees into one consensus tree using a novel ‘tree merging’ algorithm (24). This allows us to take advantage of the fact that DNA-based trees often are more accurate for closely related parts of trees and protein-based trees for distant relationships, and that some algorithms may outperform others under certain scenarios. The algorithm simultaneously merges the five input trees into a consensus tree. The consensus topology contains clades found in any of the input trees, where the clades chosen are those that minimize the number of duplications and losses inferred, and have the highest bootstrap support. Branch lengths are estimated for the final consensus tree based on the DNA alignment, using phyml with the HKY model.
We cannot use tree merging for the B full families, because it requires DNA sequences, which many UniProt proteins in full families lack. Instead, for each B full family we built an ML tree that was based on the protein alignment, and was constrained to be consistent with the tree for the corresponding clean family. The constrained ML tree was built using a modified version of phyml release 2.4.5 (Heng Li, unpublished manuscript) that can take the topology of an input gene tree as a soft constraint.
The species-guided version of phyml, the ‘constrained phyml’, and the tree merging algorithm are available as part of the TreeBest software from http://treesoft.sourceforge.net/.
Manually curating TreeFam-B trees
During curation, experts manually correct errors in the automatic trees for TreeFam-B families (4). Since TreeFam v1, one of our in-house curation tools, tctool (Lachlan Coin, manuscript in preparation), has been significantly improved to allow curation of larger trees and to speed up curation.
TreeFam now supports curation from outside the Sanger Institute, and this is currently being tested with a number of groups who are collaborating on the TreeFam project. We have recruited and trained external curators at the University of Southern Denmark in Odense, the University of Aarhus and the Beijing Genomics Institute, who have contributed many curations to TreeFam.
When a B tree has been curated, it becomes the seed tree for an A family, and is removed from TreeFam-B. Each seed family is expanded into a full and a clean family. If a new gene prediction set has been released since the last build of the TreeFam-A database, blast and hmmer are used to identify sequence matches in this gene set, which are added to the clean and/or full family. A filtered alignment is made for each full or clean family.
Trees of clean A families are built by using the tree merging algorithm to find the consensus of seven trees:
- (i) a constrained ML tree built using phyml, based on the protein alignment with the WAG model;
- (ii) a constrained ML tree built using phyml, based on the codon alignment with the HKY model;
- (iii) an unconstrained NJ tree using p-distance, based on the codon alignment;
- (iv) an unconstrained NJ tree using dN distance, based on the codon alignment;
- (v) an unconstrained NJ tree using dS distance, based on the codon alignment;
- (vi) a constrained NJ tree using dN distance, based on the codon alignment; and
- (vii) a constrained NJ tree using p-distance, based on the codon alignment.
Trees (i) and (ii) were built using the ‘species-guided phyml’, with the topologies of the curated seed tree and of an input species tree as soft constraints. Trees (vi) and (vii) were built using the ‘constrained NJ algorithm’ described in Li et al., which uses the topology of the curated seed tree as a hard constraint.
For each full A family, we used constrained phyml to build an ML tree based on the protein alignment, constraining the tree to be consistent with that for the corresponding clean family.
Orthologue inference
For both A and B families, orthologues and paralogues are inferred from the clean tree. We first use the ‘Duplication/Loss Inference’ (DLI) algorithm (4) to identify duplication and speciation nodes. We then assume that genes belonging to different child clades of a duplication node are paralogues, while genes belonging to different child clades of a speciation node are orthologues.
Since TreeFam v1, we have introduced one change in the way that we infer orthologues. We infer that a duplication node is ‘dubious’ if there is no intersection between the sets of species that belong to its two child clades. A ‘dubious duplication’ is probably a tree-building artefact, and we assume that the genes belonging to the different child clades of the node are actually orthologues (not paralogues).
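The pairing rules, including the ‘dubious duplication’ exception, can be sketched as follows. This is a minimal illustration that assumes a binary gene tree whose internal nodes have already been labelled as duplication (`"D"`) or speciation (`"S"`); the DLI algorithm that produces those labels is not implemented here, and the `Node` class, function name and gene identifiers are hypothetical.

```python
from itertools import product


class Node:
    """Gene-tree node: internal nodes carry an event, leaves carry a gene and species."""

    def __init__(self, event=None, children=(), gene=None, species=None):
        self.event = event              # "D" or "S" for internal nodes, None for leaves
        self.children = list(children)
        self.gene = gene
        self.species = species

    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def infer_pairs(root):
    """Return (orthologue_pairs, paralogue_pairs) as sets of frozensets of gene names."""
    orthologues, paralogues = set(), set()
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:
            continue
        left, right = node.children     # assumption: every internal node is binary
        stack.extend((left, right))
        left_leaves, right_leaves = left.leaves(), right.leaves()
        shared = {l.species for l in left_leaves} & {l.species for l in right_leaves}
        # A duplication whose child clades share no species is treated as dubious.
        dubious = node.event == "D" and not shared
        target = orthologues if node.event == "S" or dubious else paralogues
        for a, b in product(left_leaves, right_leaves):
            target.add(frozenset((a.gene, b.gene)))
    return orthologues, paralogues


def leaf(gene, species):
    return Node(gene=gene, species=species)


# Genuine duplication: both child clades contain human and mouse genes.
genuine = Node("D", [
    Node("S", [leaf("H1", "human"), leaf("M1", "mouse")]),
    Node("S", [leaf("H2", "human"), leaf("M2", "mouse")]),
])
orth_genuine, para_genuine = infer_pairs(genuine)

# Dubious duplication: the two child clades share no species.
dubious_tree = Node("D", [leaf("H3", "human"), leaf("M3", "mouse")])
orth_dubious, para_dubious = infer_pairs(dubious_tree)
```

In the first tree the pairs across the root are paralogues and the pairs within each speciation clade are orthologues; in the second tree the single pair is reported as orthologous because the duplication is dubious.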
The DLI algorithm requires a species tree, and for this we use the NCBI taxonomy tree (7), with two exceptions. We consider two parts of the tree as multifurcations because their topology is controversial: (i) the fungi, metazoans and plants and (ii) the chordates, arthropods, nematodes and schistosomes.
TreeFam database content
Release 4 of TreeFam contains curated trees for 1314 families and automatically generated trees for another 14 351 families. The number of curated families has increased since TreeFam v1, which contained 690 curated families. The 15 665 families represent 348 531 genes from 25 fully sequenced animal genomes and 78 209 genes from four outgroups and UniProt. TreeFam v4 includes 84.5% of the 22 855 protein-coding human genes, 84.8% of the 24 438 mouse genes, 71.6% of the 14 039 D. melanogaster genes and 66.2% of the 20 060 genes from C. elegans. The accompanying table shows the numbers of genes and human orthologues for each fully sequenced species in TreeFam v4.
The number of genes from each fully sequenced animal species that have human orthologues in TreeFam
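The totals and percentages quoted for release 4 can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are arbitrary, and because the published percentages are rounded, the per-species gene counts it derives are approximations rather than reported values.

```python
# Family counts stated in the text.
curated_families = 1314
automatic_families = 14351
total_families = curated_families + automatic_families

# Gene counts stated in the text.
genes_fully_sequenced_animals = 348531
genes_outgroups_and_uniprot = 78209
total_genes = genes_fully_sequenced_animals + genes_outgroups_and_uniprot

# (percentage of genes in TreeFam v4, total genes in the species), as quoted.
coverage = {
    "human": (84.5, 22855),
    "mouse": (84.8, 24438),
    "D. melanogaster": (71.6, 14039),
    "C. elegans": (66.2, 20060),
}

# Approximate number of genes per species included in TreeFam v4.
approx_included = {
    species: round(pct / 100 * total)
    for species, (pct, total) in coverage.items()
}

for species, count in approx_included.items():
    print(f"{species}: about {count} genes in TreeFam v4")
```

The sum of curated and automatic families reproduces the 15 665 families given in the text.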
TreeFam allows users to search for their genes of interest using accession numbers from the source sequence databases or GenBank accessions, or text searches of the gene and TreeFam family names, symbols and descriptions. Since TreeFam v1, we have added the ability for users to search using GO term identifiers, Pfam domain identifiers and identifiers from many other databases (the complete list can be found at http://www.treefam.org/cgi-bin/misc_page.pl?faq#u1).
The webpage for a B family displays the clean tree, while the webpage for an A family displays both the clean tree and the curated seed tree. Since TreeFam v1, we have added a link from the family page to the TreeView applet, with which users can view the full, clean or seed tree. Next to the phylogenetic tree, TreeView displays Pfam protein domains and intron positions in the family members, mapped onto the family protein alignment. The user can click on a gene name in TreeView to see the hmmer score for the match between the gene and the family.
All the data can be freely downloaded from ftp://ftp.sanger.ac.uk/pub/treefam. This includes sequences, alignments, trees, orthologues and within-species paralogues.
Since TreeFam v1, we have made the MySQL database publicly accessible (host: db.treefam.org, port: 3308, user: ‘anonymous’). We have also developed a Perl API for interacting with the database, which allows users to fetch alignments and manipulate trees. The API and examples of using it are found at http://treesoft.sourceforge.net/.
Since TreeFam v1, we have helped the developers of the UCSC browser (26), WormBase (16) and HGNC (27) to add links to TreeFam and TreeFam orthologue information to their databases.