PANTHER (Protein ANalysis THrough Evolutionary Relationships) is a database of phylogenetic trees of protein-coding gene families from all kingdoms of life (1). Ancestral genes (representing most recent common ancestors of extant genes) are annotated with ontology terms describing gene function, and likely functional divergence events are identified and used to divide protein families into subfamilies of genes with similar function. Hidden Markov models (HMMs) are constructed for all families and subfamilies and can be used for genome annotation projects, alone or as part of the InterPro database (2), which includes PANTHER as well as several other well-known protein annotation resources.
The main goal of PANTHER is to infer the evolution of gene function across as many genes in as many genomes as possible, and to apply these inferences to predict the functions of genes that have not been directly characterized by experiment. In particular, there are large communities of researchers elucidating gene function for so-called ‘model organisms’ (e.g. those listed in the table of genome sources below) and these results provide a basis for inferring the functions of related genes in humans and other organisms. PANTHER applies both software tools and manual curation to perform these inferences as accurately as possible, and to keep them up-to-date as new experimental results accumulate. Gene function (or, more commonly, the function of gene products such as proteins) is described using terms from the Gene Ontology (GO) (3), or from representations of molecular pathways.
Sources for complete sets of protein-coding genes in PANTHER version 7
We have made several major modifications to the most recent version of PANTHER. One of the main developments is collaboration with the GO Consortium, in which PANTHER trees are being annotated with GO terms as part of the GO Reference Genome project (5). For PANTHER version 7, all previous associations of PANTHER subfamilies with function terms have been updated to GO terms. Ongoing annotation within the Reference Genome Project includes a complete evidence trail for inferred annotations all the way to the experimental results (literature articles) and evolutionary events upon which the inferences are based. Other important developments include improvements to the phylogenetic trees, inference of inter-species orthologs, inclusion of more genomes and support for several alternate database identifier types.
Improved hidden Markov Models and phylogenetic trees, and ortholog identification
Gene families covering fully sequenced genomes
Previous versions of PANTHER focused on identifying subfamilies and the underlying functional divergence events. PANTHER 7 expands upon this focus by supporting accurate ortholog identification, and annotation of gene families ‘at any point in gene family evolution’, not just the major divergences. In order to meet these requirements, we made several important improvements to PANTHER. First, PANTHER trees aim to represent ‘all’ protein-coding genes from a phylogenetically diverse set of organisms. For PANTHER 7 trees, complete protein-coding gene sets for 48 different organisms were carefully constructed from a number of different sources, in collaboration with the GO Consortium, with an effort to use curated sources for model organism genomes (see the table of genome sources above). These sets can be downloaded at ftp://ftp.pantherdb.org/genome/pthr7.0. We were careful to maintain stable PANTHER family and subfamily accession numbers from the previous version 6.1 to 7.0. To define protein family membership, each PANTHER 7 protein sequence was scored against the HMMs from version 6.1 and assigned to the family with the highest HMM score. If the resulting protein family contained over 1000 sequences, we attempted to manually divide it into smaller families to facilitate web browsing. We divided a total of 20 families from PANTHER 6.1 that have expanded dramatically through numerous gene (or domain) duplication events, such as G protein-coupled receptors (GPCRs), ATP binding cassette (ABC) transporters, protein kinases, cytochrome P450s (CYP), and proteins containing ankyrin repeats, leucine-rich repeats (LRR), zinc finger and homeobox domains. Figure 1 shows the distribution of family sizes in terms of the number of distinct genes (A) and the number of distinct genomes (B) they contain.
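The family-assignment rule described above (each sequence goes to the family whose version 6.1 HMM scores it highest, and families above 1000 members become candidates for manual division) can be sketched as follows. This is an illustration only, not the PANTHER pipeline: the function names, the score table and the identifiers `protA`, `FAM_X`, etc. are hypothetical, and real scores would come from an HMM scoring program.

```python
from collections import defaultdict

MAX_FAMILY_SIZE = 1000  # size threshold quoted in the text


def assign_families(scores):
    """Assign each protein to the family whose HMM gives the highest score.

    `scores` maps protein id -> {family id: HMM score}; a higher score is
    assumed to mean a better match. Proteins with no scores stay unassigned.
    """
    families = defaultdict(list)
    for protein, family_scores in scores.items():
        if not family_scores:
            continue
        best_family = max(family_scores, key=family_scores.get)
        families[best_family].append(protein)
    return dict(families)


def oversized(families, limit=MAX_FAMILY_SIZE):
    """Return the ids of families with more than `limit` members."""
    return sorted(f for f, members in families.items() if len(members) > limit)


# Hypothetical scores, for illustration only.
example_scores = {
    "protA": {"FAM_X": 120.5, "FAM_Y": 310.2},
    "protB": {"FAM_X": 95.0},
    "protC": {},
}
example_families = assign_families(example_scores)
```

In this toy input `protA` lands in `FAM_Y` because that model scores it highest, and `protC` stays unassigned. The division of oversized families was done manually according to the text, so the sketch only flags them.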
Figure 1. Distribution of protein family sizes in PANTHER version 7. (A) The distribution of the total number of genes (in all 48 genomes) per family. The N50 is about 150, i.e. about half the genes are in families larger than 150 members, and half are in smaller ...
Improved multiple sequence alignments and HMMs
A multiple sequence alignment was constructed for each family using the MAFFT program (6) and a phylogenetic tree was estimated from the protein multiple alignment. Subfamily identifiers from version 6.1 were then ‘forward tracked’ to ancestral nodes in the version 7.0 trees whenever possible. In addition, in many cases, due to improvements in the phylogenetic trees in PANTHER 7 (see below), subfamily boundaries were refined during manual curation. After manual review and correction, if necessary, of the locations of both forward tracked and new subfamilies, a new HMM was constructed for each family and subfamily. We modified our existing HMM construction process (7) to make use of the multiple alignment from MAFFT. For PANTHER 7, we took the relevant sequences in the MAFFT alignment, trimmed it to include as match states only those columns aligned by ≥30% of the sequences in the subalignment [sequences were weighted using the same technique as in (1)], and used it to construct an initial model using the modelfromalign program in SAM3.1. We then used this initial model as input, in addition to the sequences themselves, to the buildmodel program using the same parameters as in (7). As a result, unlike in previous versions of PANTHER, the HMMs can have different lengths for different subfamilies, and now model any domains that are conserved across a single subfamily but not found in other subfamilies.
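The column-trimming step can be illustrated with a short sketch. It assumes that ‘aligned by ≥30% of the sequences’ means that the summed weight of the sequences with a residue (not a gap) in a column is at least 30% of the total weight. That reading, the function names and the toy alignment are assumptions for illustration, and the sequence-weighting scheme itself is not reproduced here.

```python
GAP_CHARS = set("-.")


def match_columns(alignment, weights, min_fraction=0.30):
    """Return the indices of columns kept as match states.

    `alignment` is a list of equal-length aligned strings and `weights`
    holds one non-negative weight per sequence. A column is kept when the
    weighted fraction of sequences with a residue in it reaches `min_fraction`.
    """
    if len(alignment) != len(weights):
        raise ValueError("need exactly one weight per sequence")
    total = float(sum(weights))
    kept = []
    for col in range(len(alignment[0])):
        aligned = sum(w for seq, w in zip(alignment, weights)
                      if seq[col] not in GAP_CHARS)
        if aligned / total >= min_fraction:
            kept.append(col)
    return kept


def trim(alignment, columns):
    """Keep only the given columns of every aligned sequence."""
    return ["".join(seq[c] for c in columns) for seq in alignment]


# Toy subalignment, for illustration only.
toy = ["MK-LV", "MKALV", "M--L-", "MK-L-"]
uniform = match_columns(toy, [1.0, 1.0, 1.0, 1.0])
reweighted = match_columns(toy, [1.0, 2.0, 1.0, 1.0])
```

With uniform weights the third column is dropped, because only one of four sequences has a residue there. Doubling the weight of the second sequence raises that column to 40% and keeps it, which shows how the weighting can change model length between subfamilies.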
New algorithm for phylogenetic trees
PANTHER trees aim to accurately represent ‘all’ of the evolutionary events in the gene family; for PANTHER 7, this means accurately inferring speciation and gene duplication events. For the gene trees, we use a novel algorithm, GIGA (Gene tree Inference in the Genomic Age). GIGA makes use of the known species tree and the presumably complete gene sets to infer accurate gene trees and locate gene duplication events relative to speciation events. If more than one gene duplication event took place between given consecutive speciation events, this appears as a single, multifurcating duplication node (e.g. node ‘2’ in Figure 2). The algorithm also performs a fast, approximate reconstruction of ancestral protein sequences at each node in the tree, using an iterative procedure starting at the leaves of the tree (modern day sequences) that considers the descendant sequences and the nearest outgroup.
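The representation of successive duplications as one multifurcating node can be made concrete with a small sketch. It does not implement GIGA. It only shows, for a tree whose internal nodes are already labeled as speciation or duplication events, how a chain of duplication nodes with no intervening speciation collapses into a single node. The `Node` class and the function name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    event: str = "leaf"  # "leaf", "speciation" or "duplication"
    children: List["Node"] = field(default_factory=list)


def collapse_duplications(node):
    """Merge directly nested duplication nodes into one multifurcating node.

    Works bottom-up and in place: by the time a duplication node is
    processed, each of its duplication children is already collapsed, so
    lifting that child's children is enough.
    """
    for child in node.children:
        collapse_duplications(child)
    if node.event == "duplication":
        merged = []
        for child in node.children:
            if child.event == "duplication":
                merged.extend(child.children)
            else:
                merged.append(child)
        node.children = merged
    return node


# Two nested duplications become one node with three children.
nested = Node("dup1", "duplication", [
    Node("dup2", "duplication", [Node("geneA"), Node("geneB")]),
    Node("geneC"),
])
collapse_duplications(nested)
```

After the call, `dup1` has the three children `geneA`, `geneB` and `geneC`. The relative order of the two duplications is deliberately not represented.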
Figure 2. Example of human orthologs and LDO of the yeast RSP5 gene, identified using a phylogenetic tree. The figure shows part of the tree for PTHR11254 (HECT domain ubiquitin–protein ligase family), tracing the evolutionary relationship between RSP5 ...
Orthologs: identification of complete set of orthologs and best one-to-one (least diverged) ortholog
These improved gene trees provide the basis for accurate inference of orthologs, pairs of genes whose most recent common ancestor (MRCA) diverged due to a speciation event (8). Orthologs of each gene can be viewed on PANTHER gene pages, and the entire set of pairwise ortholog inferences can be downloaded from the PANTHER website (http://www.pantherdb.org/downloads). For orthologs, PANTHER reports not only one-to-one but also one-to-many (i.e. when gene duplication has occurred in one lineage following speciation) and many-to-many orthologs (i.e. when gene duplication has occurred in both lineages following speciation). In the case of multiple orthologs, PANTHER identifies the one-to-one relationship that has ‘diverged the least’ following any gene duplication events. The ‘least diverged ortholog’ (LDO) pairs therefore represent the most nearly ‘equivalent’ gene pairs between different organisms based on the phylogenetic tree. Following gene duplication, the most common fates of the copies are thought to be neofunctionalization (in which one copy retains the ancestral function, while the other adapts to a new function) and subfunctionalization (in which each copy specializes in a subset of the ancestral functions) (9). If neofunctionalization has occurred, the LDO is the copy predicted to retain the ancestral function, i.e. the ‘same gene’ as the ancestor. An example of ortholog and LDO identification is shown in Figure 2.
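The ortholog definition above translates directly into a tree traversal: two genes are orthologs when their MRCA is a speciation node. The sketch below enumerates such pairs on a toy tree and picks a ‘least diverged’ partner. The text does not state which divergence measure is used, so summed branch length between the two genes is an assumption here, as are the `Node` class, the function names and the gene names.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List, Optional


@dataclass
class Node:
    name: str
    event: str = "leaf"            # "leaf", "speciation" or "duplication"
    species: Optional[str] = None  # set on leaves only
    length: float = 0.0            # branch length to the parent node
    children: List["Node"] = field(default_factory=list)


def leaves(node, dist):
    """Yield (leaf, distance) pairs; `dist` is the distance already covered."""
    if not node.children:
        yield node, dist
    for child in node.children:
        yield from leaves(child, dist + child.length)


def orthologs(root):
    """Return (gene_a, gene_b, divergence) for pairs whose MRCA is a speciation."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if node.event != "speciation":
            continue
        # Leaves under different children of this node have it as their MRCA.
        for i, left in enumerate(node.children):
            for right in node.children[i + 1:]:
                for (a, da), (b, db) in product(leaves(left, left.length),
                                                leaves(right, right.length)):
                    found.append((a, b, da + db))
    return found


def least_diverged_ortholog(root, gene, target_species):
    """Name of the ortholog of `gene` in `target_species` with the smallest divergence."""
    best = None
    for a, b, d in orthologs(root):
        for query, other in ((a, b), (b, a)):
            if query.name == gene and other.species == target_species:
                if best is None or d < best[1]:
                    best = (other.name, d)
    return best[0] if best else None


# Toy tree: one yeast gene, and two human genes from a later duplication.
tree = Node("root", "speciation", children=[
    Node("yeast_g", species="yeast", length=1.0),
    Node("dup", "duplication", length=0.5, children=[
        Node("human_g1", species="human", length=0.2),
        Node("human_g2", species="human", length=0.9),
    ]),
])
ldo = least_diverged_ortholog(tree, "yeast_g", "human")
```

Here `yeast_g` has two human orthologs (a one-to-many relationship), and the shorter-branched `human_g1` is selected. The two human genes are not reported as orthologs of each other because their MRCA is a duplication node. Queried from the human side, `human_g2` still returns `yeast_g`, so this per-gene lookup does not by itself enforce the one-to-one pairing described in the text.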
Expanded sets of genomes and sequence identifiers for PANTHER tools
Since its inception, the PANTHER website has provided, for a limited set of ‘fully supported’ genomes (human, mouse, rat and fruit fly), the following functionality: (i) stored classifications for all protein-coding genes, including family, subfamily, molecular function, biological process and pathway, (ii) visualization tools such as the whole genome pie chart view of gene functions and (iii) analysis tools such as the Gene Expression Analysis Tool (10) for analyzing user-generated data relative to PANTHER classifications. For version 7, we have increased the number of fully supported genomes from 4 to 12 organisms, those participating in the GO Reference Genome Project (5), listed at the beginning of the table of genome sources.
Figure 3. Annotating a PANTHER tree with GO terms, and inferring GO terms for other genes by homology. The tree is the same as in Figure 2. The ‘x’ marks in the adjoining table (right panel) show the experimental GO annotations for each gene in ...
In addition, we have increased the number of different database identifiers supported by PANTHER tools and in searches of the PANTHER database. Previously, for genes, only identifiers from NCBI Entrez Gene (17) or FlyBase (15) were supported; for proteins, only RefSeq (24) or FlyBase identifiers. In PANTHER 7, we now also support identifiers from Ensembl (23), model organism databases, the International Protein Index (IPI) (25) and UniProt (18). All of these identifiers are obtained through the mapping files provided by UniProt (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/).
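As a rough illustration of how such mapping files support lookups by alternate identifiers, the sketch below builds a reverse index from a three-column, tab-separated mapping (UniProt accession, identifier type, external identifier). The column layout, the type labels `GeneID`, `RefSeq` and `Ensembl`, and every accession in the sample are assumptions made for this example and should be checked against the README distributed with the files. How PANTHER itself loads the mappings is not described in the text.

```python
import io
from collections import defaultdict


def load_id_mapping(handle, wanted_types):
    """Build {(id_type, external_id): set of UniProt accessions}.

    `handle` yields lines of the form accession<TAB>id_type<TAB>external_id.
    Lines without exactly three fields, or with other id types, are skipped.
    """
    index = defaultdict(set)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        accession, id_type, external_id = fields
        if id_type in wanted_types:
            index[(id_type, external_id)].add(accession)
    return index


# Made-up rows, for illustration only (not real database entries).
sample = io.StringIO(
    "ACC0001\tGeneID\t1111\n"
    "ACC0001\tRefSeq\tNP_0000001.1\n"
    "ACC0002\tGeneID\t2222\n"
    "ACC0001\tEnsembl\tENSG_FAKE_0001\n"
    "malformed line\n"
)
index = load_id_mapping(sample, wanted_types={"GeneID", "RefSeq"})
```

A set is used for the values because one external identifier can map to more than one protein accession.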
Pathway diagrams using SBGN
PANTHER 7 has adopted the Systems Biology Graphical Notation (SBGN) standard (26) for the 165 pathway diagrams currently available on the PANTHER website. This standard was recently released at http://sbgn.org and provides a consistent semantics for symbols used in pathway diagrams.
Collaboration with GO Consortium
For almost 2 years now, there has been a formal collaboration between the Gene Ontology Consortium and the PANTHER database (5). As a result, in PANTHER 7, all molecular function, biological process and cellular component terms are exclusively GO terms [previous versions of PANTHER used the PANTHER/X ontology (1), though a mapping file to GO was provided]. The PANTHER/X biological process ontology has been retired, but we have retained the PANTHER/X molecular function ontology and renamed it ‘Protein Class’ since many terms are quite different from those in GO, and we have received considerable feedback from users about its utility.
As part of the GO Reference Genome Project, GO curators are annotating trees from the PANTHER database with GO terms describing molecular function, biological process and cellular component. As described in (5), the goal of this project is to provide accurate, complete and consistent GO annotations for all genes in 12 model organism genomes. GO terms based on experimental data from the scientific literature are used to annotate ancestral genes in the phylogenetic tree; thus, unannotated descendants of these ancestral genes are inferred to have inherited these same GO annotations by descent. An example of this annotation process is shown in Figure 3.
This rigorous process for evolutionary inference provides a means for accurate inference of GO annotations by homology, as well as a means for comparing and consistency-checking annotations for related genes. While earlier versions of PANTHER have allowed annotation of ‘subfamily nodes’ (i.e. ancestral genes that founded a particular subfamily), this more generalized GO annotation process requires all ancestral genes to be annotatable in principle, which is supported only as of PANTHER 7. For most end users, perhaps the most relevant outcomes of this collaboration will be: (i) an increased number of GO annotations, especially those inferred by homology and (ii) the ability to trace all of the evidence behind each homology-based annotation. This evidence includes not only the gene that was experimentally demonstrated to perform a particular function (and the scientific publication reporting the experiment), but also the ancestral gene in which the function was inferred to have evolved. In the long term, all PANTHER ontology annotations will be migrated to this new standard.
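The inheritance of annotations by descent, together with the evidence trail, can be sketched as a top-down pass over the tree. A curator attaches a term to an ancestral node along with the genes that support it experimentally, and every leaf below that node receives the term with a record of the ancestor and the supporting genes. This is a simplified illustration under stated assumptions: the `Node` class, the function name and the placeholder terms `TERM_A` and `TERM_B` are hypothetical, and it does not model curator judgments such as excluding particular descendants.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    # term -> genes with experimental support, as curated at this ancestral node
    curated: Dict[str, List[str]] = field(default_factory=dict)


def propagate(node, inherited=None, out=None):
    """Return {leaf name: {term: {"ancestor": ..., "evidence": [...]}}}.

    A term curated at several nested ancestors is attributed to the oldest
    one, i.e. the node in which the function is inferred to have evolved.
    """
    inherited = dict(inherited or {})
    out = {} if out is None else out
    for term, evidence_genes in node.curated.items():
        inherited.setdefault(
            term, {"ancestor": node.name, "evidence": list(evidence_genes)})
    if not node.children:
        out[node.name] = inherited
    for child in node.children:
        propagate(child, inherited, out)
    return out


# Toy tree with placeholder terms and gene names.
root = Node("AN0", curated={"TERM_A": ["gene1"]}, children=[
    Node("gene1"),
    Node("AN1", curated={"TERM_B": ["gene2"]}, children=[
        Node("gene2"),
        Node("gene3"),
    ]),
])
annotations = propagate(root)
```

In this example `gene3` has no experimental annotation of its own but inherits `TERM_A` from `AN0` (supported by `gene1`) and `TERM_B` from `AN1` (supported by `gene2`), so each inferred annotation can be traced to both an ancestral node and an experimentally characterized gene.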