PANTHER is a large collection of protein families that have been subdivided into functionally related subfamilies, using human expertise. These subfamilies model the divergence of specific functions within protein families, allowing more accurate association with function (ontology terms and pathways), as well as inference of amino acids important for functional specificity. Hidden Markov models (HMMs) are built for each family and subfamily for classifying additional protein sequences. The latest version, 5.0, contains 6683 protein families, divided into 31705 subfamilies, covering ~90% of mammalian protein-coding genes. PANTHER 5.0 includes a number of significant improvements over previous versions, most notably (i) representation of pathways (primarily signaling pathways) and association with subfamilies and individual protein sequences; (ii) an improved methodology for defining the PANTHER families and subfamilies, and for building the HMMs; (iii) resources for scoring sequences against PANTHER HMMs both over the web and locally; and (iv) a number of new web resources to facilitate analysis of large gene lists, including data generated from high-throughput expression experiments. Efforts are underway to add PANTHER to the InterPro suite of databases, and to make PANTHER consistent with the PIRSF database. PANTHER is now publicly available without restriction at http://panther.appliedbiosystems.com.
The philosophy, as well as the basic methodology, behind the PANTHER database has been described previously (1,2); therefore, we focus here on the recent improvements to the database and to the functionality available on the website. In brief, there are two main parts to PANTHER: PANTHER/LIB, a library of protein families and subfamilies; and PANTHER/X, a set of ontology terms describing protein function. The database's main advantage is in the curator-defined grouping of protein sequences into functional subfamilies, allowing more detailed and accurate association with the ontology terms, and now biological pathways. Each family and subfamily is represented by a phylogenetic tree of ‘training sequences’, and a hidden Markov model (HMM) that represents these sequences as a statistical model. The HMM library can be searched to classify new sequences, or to provide a score to predict the likely functional consequence of a mutation (1). PANTHER is quite comprehensive for the annotation of protein sequences encoded by metazoan genomes: ~90% of mammalian protein-coding genes, and nearly two-thirds of Drosophila genes, are hit by a PANTHER HMM.
The PANTHER database has recently been expanded to include associations between protein sequences and the biological pathways they participate in. Like the molecular function and biological process ontology terms, these pathways are associated with individual protein sequences, and when possible with PANTHER subfamily HMMs, by expert curators.
We have also improved the methodology used to define protein families and subfamilies. These improvements are mainly in two areas: global clustering of protein sequence space to allow definition of family boundaries, and new algorithms that make use of ontology terms to provide a guide for curators to define both families and subfamilies.
There are also a number of significant improvements to the website. Perhaps most importantly for users, the site is now free of the previous restrictions on its use (3). In addition, HMMs can be downloaded, and/or searched interactively using a protein sequence as a query. Pathways can be interactively browsed and queried. Gene lists (e.g. from mRNA expression data) can be uploaded to the site and analyzed relative to molecular functions, biological processes and pathways.
PANTHER/LIB (library of protein family and subfamily HMMs), version 5.0 contains 256413 training sequences, grouped into 6683 families. These families were then divided further into 31705 subfamilies.
PANTHER HMMs have been used to annotate the protein-coding genes of the human, mouse, rat and Drosophila melanogaster genomes. The fractions of these genes that were given a functional annotation by PANTHER 5.0 are shown in Table 1.
Several resources are now available at the PANTHER website.
PANTHER has been mapped to existing InterPro (14) entries, and the mapping file is available from http://panther.appliedbiosystems.com/downloads/. PANTHER will be incorporated into the InterPro suite of databases incrementally. PANTHER HMMs have also been mapped to existing PIRSF (15) entries, and a collaboration is currently underway to make PANTHER and PIRSF consistent and cross-referenced.
For version 5.0, we implemented a number of improvements to the previously described PANTHER library-building procedure (1). At the end of this process, we evaluated the HMM classifications of a test set of over 10000 sequences from SWISS-PROT to make sure that the new process did not lower the previously reported classification accuracy (16). We found that the classification accuracy was nearly identical, and the coverage was slightly improved in 5.0, probably due to the new HMM building process outlined below.
PANTHER version 3.0 (1,2) used seed-based clustering to define protein families. The advantage of this approach was its modularity: new families could be easily added in areas that were inadequately covered in previous versions. However, the seed-based clustering resulted in significant redundancy for a number of large protein families, such as protein kinases and G-protein-coupled receptors, which were covered by a number of families that overlapped to varying degrees.
The current version, PANTHER version 5.0, addresses this issue by implementing a global clustering of proteins. Proteins from PANTHER version 4.0 were clustered using a similarity metric derived from the pairwise BLASTP scores:
\[ \mathrm{sim}(a, b) = \frac{S(a, b)}{\max\{S(a, a),\, S(b, b)\}} \qquad (1) \]
where S(a, b) is the BLASTP raw score for the alignment of sequences a and b using the BLOSUM62 matrix and masked for low-complexity segments. The denominator is the largest self-alignment score, and therefore, the similarity is the fraction of the maximum score possible for an alignment of sequences a and b. In cases where there were multiple high-scoring pairs (HSPs; i.e. partial alignments), S(a, b) was set equal to the sum of the scores for the maximal set of non-overlapping HSPs.
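The similarity calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the PANTHER implementation: the function names, the HSP tuple layout, and the choice to judge overlap on query coordinates only are all assumptions, and 'maximal set of non-overlapping HSPs' is interpreted here as the set with the highest total score.

```python
from typing import List, Tuple

# Hypothetical HSP layout: (query_start, query_end, raw_score), inclusive coordinates.
HSP = Tuple[int, int, float]


def summed_hsp_score(hsps: List[HSP]) -> float:
    """Sum of raw scores over the best-scoring set of non-overlapping HSPs.

    Assumption: overlap is judged on query coordinates only, and the set is
    chosen by weighted interval scheduling (dynamic programming).
    """
    ordered = sorted(hsps, key=lambda h: h[1])  # sort by end coordinate
    best = [0.0] * (len(ordered) + 1)           # best[i]: optimum over first i HSPs
    for i, (start, _end, score) in enumerate(ordered, start=1):
        # Find the last earlier HSP that ends before this one starts.
        j = i - 1
        while j > 0 and ordered[j - 1][1] >= start:
            j -= 1
        best[i] = max(best[i - 1], best[j] + score)
    return best[-1]


def similarity(s_ab: float, s_aa: float, s_bb: float) -> float:
    """S(a, b) divided by the larger of the two self-alignment scores."""
    denom = max(s_aa, s_bb)
    return s_ab / denom if denom > 0 else 0.0


if __name__ == "__main__":
    # Three partial alignments between a and b; the first two overlap.
    hsps = [(1, 50, 100.0), (40, 90, 80.0), (60, 120, 70.0)]
    s_ab = summed_hsp_score(hsps)
    print(s_ab, similarity(s_ab, s_aa=400.0, s_bb=340.0))
```

Dividing by the larger self-alignment score keeps the value between 0 and 1 and penalizes pairs in which a short sequence matches only part of a much longer one.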
This pairwise similarity was used to define single-linkage clusters (maximal clusters in which each protein is connected to at least one other protein in the cluster by a non-zero similarity score). A dendrogram was built for each single-linkage cluster using the UPGMA algorithm (17). The family labels from the PANTHER version 4.0 library were then used to define the optimal cut of each UPGMA dendrogram into family clusters, to maximize the correspondence to previous versions of PANTHER. In the great majority of cases, the PANTHER version 5.0 family was almost identical to the corresponding family in the previous version of the library. Only about 40 subtrees in the UPGMA dendrograms, primarily those that were represented by overlapping clusters in the previous version, had to be broken further into functionally homogeneous clusters using manual curation. Overall, the family clusters identified from the UPGMA dendrograms covered over 96% of the version 4.0 training sequences. The rest of the sequences were either singletons according to Equation 1 (often due to low-complexity masking), or lay outside the family boundaries defined by PANTHER version 4.0 family labels on the UPGMA dendrograms. Each of these ‘leftover’ sequences (unmasked) was scored against SAM HMMs built for the family clusters, and was brought into the family of the best scoring HMM if the NLL-NULL score was less than −50. Those leftovers not meeting this criterion were added as singleton families if they were from a primate or rodent species; otherwise they were removed from the library.
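Two steps in this paragraph lend themselves to a short sketch: grouping proteins into single-linkage clusters, and deciding what happens to each 'leftover' sequence. The code below is an illustration under stated assumptions, not the pipeline used to build PANTHER: the function names, the dictionary-based input and the returned tuples are hypothetical, and the UPGMA step and the label-guided cut of the dendrogram are omitted. Only the −50 NLL-NULL cutoff and the primate/rodent rule come from the text.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

NLL_NULL_CUTOFF = -50.0  # threshold stated in the text; lower scores are better


def single_linkage_clusters(
    ids: Sequence[str], similarities: Dict[Tuple[str, str], float]
) -> List[Set[str]]:
    """Connected components of the graph whose edges are non-zero similarities."""
    parent = {x: x for x in ids}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), sim in similarities.items():
        if sim > 0:
            parent[find(a)] = find(b)

    clusters: Dict[str, Set[str]] = {}
    for x in ids:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())


def triage_leftover(
    best_family: str, best_score: float, is_primate_or_rodent: bool
) -> Tuple[str, Optional[str]]:
    """Apply the leftover rule: join a family, become a singleton, or be removed."""
    if best_score < NLL_NULL_CUTOFF:
        return ("join", best_family)
    if is_primate_or_rodent:
        return ("singleton", None)
    return ("removed", None)


if __name__ == "__main__":
    sims = {("a", "b"): 0.4, ("b", "c"): 0.1, ("c", "d"): 0.0}
    print(single_linkage_clusters(["a", "b", "c", "d"], sims))
    print(triage_leftover("FAM1", -72.3, is_primate_or_rodent=False))
```

Because any non-zero similarity links two proteins, single-linkage clusters can be very large; this is why each one is subsequently broken into families by cutting its UPGMA dendrogram.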
The UPGMA-derived family clusters allow us to simplify the HMM-building process detailed previously (1). Rather than building ‘initial’ and ‘extended’ HMMs, for PANTHER 5.0, we built the family HMM directly from the UPGMA family cluster in a single step. Because the HMM training sequences are of varying lengths, we pre-set the SAM buildmodel -modellength option to be 1.1 times the maximum sequence length in the cluster, and also added the option -sw2, to create a local HMM. Similar to previous versions of the library, this temporary HMM was used to create an alignment (using the SAM align2model procedure with the -sw2 option) that could be used to estimate the weights of the sequences in the initial HMM. A weighted model was then constructed followed by a weighted alignment.
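As a small worked example, the model length passed to buildmodel could be computed as follows. The function name is hypothetical, and rounding up to the next integer is an assumption; the text specifies only the factor of 1.1.

```python
from typing import Iterable


def model_length(sequences: Iterable[str]) -> int:
    """Return 1.1 times the longest sequence length, rounded up (assumed rounding).

    Integer arithmetic is used because 1.1 * 100 evaluates to slightly more
    than 110 in floating point, which would round up to 111.
    """
    longest = max(len(seq) for seq in sequences)
    return (longest * 11 + 9) // 10  # ceil(longest * 11 / 10)


if __name__ == "__main__":
    cluster = ["A" * 100, "A" * 42]  # toy sequences standing in for a family cluster
    print(model_length(cluster))
```

Setting the model somewhat longer than the longest training sequence leaves room for match states that cover every member of a cluster whose sequences vary in length.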
In PANTHER 5.0, we used a faster version of TIPS (version 2.0, available from the Downloads section of the PANTHER website) to create the phylogenetic trees (18). As in previous versions, the MSA was used as input to the new TIPS2 algorithm, along with the following parameters: -prior uprior.9.com, -score_matrix BLOSUM 62, -cut_using_distance 0.5, -pair_type 1 and -use_are_as_branch_length 0.
Because the subfamily labels and associated ontology terms were expanded and reviewed by curators for both versions 3.0 and 4.0, and shown to have a high rate of accuracy (16), we developed an algorithm for optimally dividing a tree into subfamilies given subfamily labels on each sequence (18). These divisions were then reviewed once again by expert curators, and adjusted if necessary. This methodology will allow regular updates to PANTHER training sequences with minimal curation effort.
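The algorithm itself is described in (18) and is not reproduced here. To make the idea concrete, the sketch below uses a deliberately simple stand-in: it cuts a tree into the largest subtrees whose leaves all carry the same subfamily label. The nested-tuple tree representation, the function name and the labels are illustrative assumptions, and the published algorithm may handle conflicting or missing labels quite differently.

```python
from typing import Dict, List, Optional, Tuple, Union

# Hypothetical tree encoding: a leaf is a sequence ID, an internal node is a tuple of children.
Tree = Union[str, tuple]


def split_by_label(tree: Tree, labels: Dict[str, str]) -> List[Tuple[str, List[str]]]:
    """Return (label, leaves) for each maximal subtree with a single label."""
    groups: List[Tuple[str, List[str]]] = []

    def visit(node: Tree) -> Tuple[List[str], Optional[str]]:
        # Returns the leaves under node and their shared label (None if mixed).
        if isinstance(node, str):
            return [node], labels[node]
        results = [visit(child) for child in node]
        child_labels = {label for _, label in results}
        if len(child_labels) == 1 and None not in child_labels:
            leaves = [leaf for child_leaves, _ in results for leaf in child_leaves]
            return leaves, child_labels.pop()
        # Mixed node: every homogeneous child becomes its own subfamily.
        for child_leaves, label in results:
            if label is not None:
                groups.append((label, child_leaves))
        return [], None

    leaves, label = visit(tree)
    if label is not None:  # the whole tree carries one label
        groups.append((label, leaves))
    return groups


if __name__ == "__main__":
    tree = ((("s1", "s2"), "s3"), ("s4", "s5"))
    labels = {"s1": "SF1", "s2": "SF1", "s3": "SF2", "s4": "SF3", "s5": "SF3"}
    for label, members in split_by_label(tree, labels):
        print(label, members)
```

With this rule, a label that appears in two separate parts of the tree yields two groups with the same name, which is the kind of case a curator would need to review.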
Another significant advantage of this approach is that any arbitrary grouping of sequences can be superimposed on our phylogenetic trees to define subfamilies (and associated HMMs). This approach will allow straightforward incorporation of external annotations such as those produced by single protein family databases, or from large ontology association projects such as GOA (19,20).