The PANTHER database was designed for high-throughput functional analysis of large sets of protein sequences (1). It has been used to annotate the human genome (2) as well as the Drosophila genome (3). Like databases such as Pfam (4) and SMART (5), PANTHER uses a library of Hidden Markov Models (HMMs) to annotate sequences with information from homologous sequences. However, unlike these databases, the goal of PANTHER is not to annotate individual domains, but the overall biological function(s) of the molecule. Also unlike these other databases, because many protein families have branches that have diverged in function during evolution, the PANTHER library contains HMMs not only for families, but also for functionally distinct subfamilies. In these cases, subfamily annotation allows a much more precise definition of nomenclature and biological function.
PANTHER is composed of two main components: the PANTHER library (PANTHER/LIB) and the PANTHER index (PANTHER/X). PANTHER/LIB is a collection of ‘books’, each representing a protein family as a multiple sequence alignment, an HMM and a family tree. Functional divergence within the family is represented by first dividing the tree into subtrees (subfamilies) based on shared function, and then constructing a distinct HMM for each subfamily. PANTHER/X is an abbreviated ontology for summarizing and navigating molecular (biochemical) functions and biological processes (such as pathways, cellular roles or even physiological functions). Families and subfamilies are defined and named by biologist curators, who then associate each group of sequences with terms in the PANTHER/X ontology.
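As a rough illustration of how a 'book' and its subfamilies relate, the organization described above can be sketched as a pair of Python data classes. This is only a minimal sketch: the class names, field names, identifier scheme and example values are all hypothetical and do not reflect the actual PANTHER/LIB file formats, and the family tree and the HMMs themselves are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subfamily:
    """A subtree of the family tree whose members share a function."""
    subfamily_id: str                 # hypothetical identifier scheme
    name: str                         # curator-assigned name
    member_ids: List[str]             # training sequences in this subtree
    function_terms: List[str] = field(default_factory=list)  # molecular function terms
    process_terms: List[str] = field(default_factory=list)   # biological process terms


@dataclass
class Book:
    """One protein family: an alignment plus its functional subfamilies."""
    family_id: str                    # hypothetical identifier scheme
    name: str                         # curator-assigned name
    alignment: Dict[str, str]         # sequence id -> aligned sequence
    subfamilies: List[Subfamily] = field(default_factory=list)

    def hmm_ids(self) -> List[str]:
        """One HMM for the whole family plus one per subfamily."""
        return [self.family_id] + [sf.subfamily_id for sf in self.subfamilies]


# Made-up example data, for illustration only.
example_book = Book(
    family_id="FAM0001",
    name="example family",
    alignment={"seqA": "MK-LV", "seqB": "MKQLV", "seqC": "MR-LI"},
    subfamilies=[
        Subfamily("FAM0001:SF1", "example subfamily 1", ["seqA", "seqB"],
                  function_terms=["example function"],
                  process_terms=["example process"]),
        Subfamily("FAM0001:SF2", "example subfamily 2", ["seqC"]),
    ],
)
```

Keeping the ontology terms on each subfamily mirrors the curation step in which each group of sequences is associated with PANTHER/X terms.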
Protein query sequences can then be scored against the functionally-labelled family and subfamily HMMs. Query sequences are classified with the name and functional assignments of the best-scoring HMM, with the HMM score providing an estimate of the confidence level of the classification. Like other HMM-based approaches, PANTHER classification scales well for genome projects: the curated functional assignment is performed up-front on sets of training sequences that span many organisms, and can then be transferred to other organisms using the labelled HMMs. As a result, the PANTHER database classifies a significantly larger fraction of human genes than does LocusLink (Table ).
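The best-scoring-HMM rule described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the scores are taken as already computed by some HMM scoring tool, a higher score is assumed to mean a better match, and the `min_score` cutoff, the function name and the identifiers are hypothetical rather than part of PANTHER.

```python
from typing import Dict, NamedTuple, Optional


class Classification(NamedTuple):
    hmm_id: str    # identifier of the best-scoring family or subfamily HMM
    name: str      # curator-assigned name transferred to the query
    score: float   # HMM score, used as a rough confidence estimate


def classify(scores: Dict[str, float],
             names: Dict[str, str],
             min_score: float = 0.0) -> Optional[Classification]:
    """Assign a query to its best-scoring HMM.

    `scores` maps HMM id -> score for one query sequence (assumed
    precomputed; higher is assumed better). `min_score` is a hypothetical
    cutoff below which the query is left unclassified.
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    if scores[best] < min_score:
        return None
    return Classification(best, names[best], scores[best])


# Made-up scores for one query against a family HMM and two subfamily HMMs.
query_scores = {"FAM0001": 12.5, "FAM0001:SF1": 48.0, "FAM0001:SF2": 7.25}
hmm_names = {
    "FAM0001": "example family",
    "FAM0001:SF1": "example subfamily 1",
    "FAM0001:SF2": "example subfamily 2",
}
result = classify(query_scores, hmm_names, min_score=10.0)
```

Because the names and functional assignments are attached to the HMMs rather than to individual sequences, the same lookup applies unchanged to query sequences from any organism.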
The percentage of human genes (approximated by LocusLink entries) having functional ontology classifications from PANTHER and from LocusLink GO associations
PANTHER has been available to Celera Discovery System (CDS) (7) subscribers for almost two years, and is now publicly available to academic users at http://panther.celera.com. The public version uses the GenBank non-redundant protein database to define sets of training sequences for HMMs. These HMMs are used to classify human gene products from LocusLink, and Drosophila melanogaster gene products from FlyBase (http://www.fruitfly.org/sequence/release3download.shtml). The CDS version includes training proteins from the sets curated at Celera, with additional HMM scoring of Celera-curated human and mouse gene products.