The InterPro database (http://www.ebi.ac.uk/interpro/) is a freely available resource that can be used to classify sequences into protein families and to predict the presence of important domains and sites. Central to the InterPro database are predictive models, known as signatures, from a range of different protein family databases that have different biological focuses and use different methodological approaches to classify protein families and domains. InterPro integrates these signatures, capitalizing on the respective strengths of the individual databases, to produce a powerful protein classification resource. Here, we report on the status of InterPro as it enters its 15th year of operation, and give an overview of new developments with the database and its associated Web interfaces and software. In particular, the new domain architecture search tool is described and the process of mapping of Gene Ontology terms to InterPro is outlined. We also discuss the challenges faced by the resource given the explosive growth in sequence data in recent years. InterPro (version 48.0) contains 36 766 member database signatures integrated into 26 238 InterPro entries, an increase of over 3993 entries (5081 signatures), since 2012.
Given the rapid increase of species with a sequenced genome, the need to identify orthologous genes between them has emerged as a central bioinformatics task. Many different methods exist for orthology detection, which makes it difficult to decide which one to choose for a particular application.
Here, we review the latest developments and issues in the orthology field, and summarize the most recent results reported at the third ‘Quest for Orthologs’ meeting. We focus on community efforts such as the adoption of reference proteomes, standard file formats and benchmarking. Progress in these areas is good, and they are already beneficial to both orthology consumers and providers. However, a major current issue is that the massive increase in complete proteomes poses computational challenges to many of the ortholog database providers, as most orthology inference algorithms scale at least quadratically with the number of proteomes.
The Quest for Orthologs consortium is an open community with a number of working groups that join efforts to enhance various aspects of orthology analysis, such as defining standard formats and datasets, documenting community resources and benchmarking.
Availability and implementation: All such materials are available at http://questfororthologs.org.
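To make the scaling concern above concrete, the following minimal sketch counts the proteome pairs in an all-against-all comparison. It assumes, as a simplification not stated in the abstract, that a method compares every unordered pair of proteomes exactly once; the function name is illustrative only.

```python
def proteome_pairs(n_proteomes: int) -> int:
    """Number of unordered proteome pairs in an all-against-all comparison."""
    if n_proteomes < 0:
        raise ValueError("n_proteomes must be non-negative")
    # n choose 2: grows quadratically with the number of proteomes
    return n_proteomes * (n_proteomes - 1) // 2


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, proteome_pairs(n))
```

Under this assumption, a tenfold increase in proteomes gives roughly a hundredfold increase in pairwise comparisons.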
The goal of the Gene Ontology (GO) project is to provide a uniform way to describe the functions of gene products from organisms across all kingdoms of life and thereby enable analysis of genomic data. Protein annotations are either based on experiments or predicted from protein sequences. Since most sequences have not been experimentally characterized, most available annotations need to be based on predictions. To make inferences as accurate as possible, the GO Consortium's Reference Genome Project is using an explicit evolutionary framework to infer annotations of proteins from a broad set of genomes from experimental annotations in a semi-automated manner. Most components in the pipeline, such as selection of sequences, building multiple sequence alignments and phylogenetic trees, retrieving experimental annotations and depositing inferred annotations, are fully automated. However, the most crucial step in our pipeline relies on software-assisted curation by an expert biologist. This curation tool, the Phylogenetic Annotation and INference Tool (PAINT), helps curators to infer annotations among members of a protein family. PAINT allows curators to make precise assertions as to when functions were gained and lost during evolution and record the evidence (e.g. experimentally supported GO annotations and phylogenetic information including orthology) for those assertions. In this article, we describe how we use PAINT to infer protein function in a phylogenetic context with emphasis on its strengths, limitations and guidelines. We also discuss specific examples showing how PAINT annotations compare with those generated by other highly used homology-based methods.
gene ontology; genome annotation; reference genome; gene function prediction; phylogenetics
PortEco (http://porteco.org) aims to collect, curate and provide data and analysis tools to support basic biological research in Escherichia coli (and eventually other bacterial systems). PortEco is implemented as a ‘virtual’ model organism database that provides a single unified interface to the user, while integrating information from a variety of sources. The main focus of PortEco is to enable broad use of the growing number of high-throughput experiments available for E. coli, and to leverage community annotation through the EcoliWiki and GONUTS systems. Currently, PortEco includes curated data from hundreds of genome-wide RNA expression studies, from high-throughput phenotyping of single-gene knockouts under hundreds of annotated conditions, from chromatin immunoprecipitation experiments for tens of different DNA-binding factors and from ribosome profiling experiments that yield insights into protein expression. Conditions have been annotated with a consistent vocabulary, and data have been consistently normalized to enable users to find, compare and interpret relevant experiments. PortEco includes tools for data analysis, including clustering, enrichment analysis and exploration via genome browsers. PortEco search and data analysis tools are extensively linked to the curated gene, metabolic pathway and regulation content at its sister site, EcoCyc.
We utilized a cohort of 828 treatment-seeking self-identified white cigarette smokers (50% female) to rank candidate gene single nucleotide polymorphisms (SNPs) associated with the Fagerström Test for Nicotine Dependence (FTND), a measure of nicotine dependence which assesses quantity of cigarettes smoked and time- and place-dependent characteristics of the respondent’s smoking behavior. 1123 SNPs at 55 autosomal candidate genes, nicotinic acetylcholine receptors and genes involved in dopaminergic function, were tested for association with baseline FTND scores adjusted for age, depression, education, sex and study site. SNP P values were adjusted for the number of transmission models, the number of SNPs tested per candidate gene, and their intragenic correlation. DRD2, SLC6A3 and NR4A2 SNPs with adjusted P values < 0.10 were considered sufficiently noteworthy to justify further genetic, bioinformatic and literature analyses. Each independent signal among the top-ranked SNPs accounted for ~1% of the FTND variance in this sample. The DRD2 SNP appears to represent a novel association with nicotine dependence. The SLC6A3 SNPs have previously been shown to be associated with SLC6A3 transcription or dopamine transporter density in vitro, in vivo and ex vivo. Analysis of SLC6A3 and NR4A2 SNPs identified a statistically significant gene-gene interaction (P=0.001), consistent with in vitro evidence that the NR4A2 protein product (NURR1) regulates SLC6A3 transcription. A community cohort of N=175 multiplex ever-smoking pedigrees (N=423 ever smokers) provided nominal evidence for association with the FTND at these top-ranked SNPs, uncorrected for multiple comparisons.
dopamine transporter; Fagerström Test for Nicotine Dependence; single nucleotide polymorphism; candidate gene association scan; gene-gene interaction
Motivation: BioPAX is a standard language for representing and exchanging models of biological processes at the molecular and cellular levels. It is widely used by different pathway databases and genomics data analysis software. Currently, the primary source of BioPAX data is direct exports from the curated pathway databases. It is still uncommon for wet-lab biologists to share and exchange pathway knowledge using BioPAX. Instead, pathways are usually represented as informal diagrams in the literature. In order to encourage formal representation of pathways, we describe a software package that allows users to create pathway diagrams using CellDesigner, a user-friendly graphical pathway-editing tool, and save the pathway data in BioPAX Level 3 format.
Availability: The plug-in is freely available and can be downloaded at ftp://ftp.pantherdb.org/CellDesigner/plugins/BioPAX/
Supplementary Information: Supplementary data are available at Bioinformatics online.
The data and tools in PANTHER—a comprehensive, curated database of protein families, trees, subfamilies and functions available at http://pantherdb.org—have undergone continual, extensive improvement for over a decade. Here, we describe the current PANTHER process as a whole, as well as the website tools for analysis of user-uploaded data. The main goals of PANTHER remain essentially unchanged: the accurate inference (and practical application) of gene and protein function over large sequence databases, using phylogenetic trees to extrapolate from the relatively sparse experimental information from a few model organisms. Yet the focus of PANTHER has continually shifted toward more accurate and detailed representations of evolutionary events in gene family histories. The trees are now designed to represent gene family evolution, including inference of evolutionary events, such as speciation and gene duplication. Subfamilies are still curated and used to define HMMs, but gene ontology functional annotations can now be made at any node in the tree, and are designed to represent gain and loss of function by ancestral genes during evolution. Finally, PANTHER now includes stable database identifiers for inferred ancestral genes, which are used to associate inferred gene attributes with particular genes in the common ancestral genomes of extant species.
dopamine receptor D2; PharmGKB; rs1799732; rs1800497; rs6277; rs1801028
A recent paper (Nehrt et al., PLoS Comput. Biol. 7:e1002073, 2011) has proposed a metric for the “functional similarity” between two genes that uses only the Gene Ontology (GO) annotations directly derived from published experimental results. Applying this metric, the authors concluded that paralogous genes within the mouse genome or the human genome are more functionally similar on average than orthologous genes between these genomes, an unexpected result with broad implications if true. We suggest, based on both theoretical and empirical considerations, that this proposed metric should not be interpreted as a functional similarity, and therefore cannot be used to support any conclusions about the “ortholog conjecture” (or, more properly, the “ortholog functional conservation hypothesis”). First, we reexamine the case studies presented by Nehrt et al. as examples of orthologs with divergent functions, and come to a very different conclusion: they actually exemplify how GO annotations for orthologous genes provide complementary information about conserved biological functions. We then show that there is a global ascertainment bias in the experiment-based GO annotations for human and mouse genes: particular types of experiments tend to be performed in different model organisms. We conclude that the reported statistical differences in annotations between pairs of orthologous genes do not reflect differences in biological function, but rather complementarity in experimental approaches. 
Our results underscore two general considerations for researchers proposing novel types of analysis based on the GO: 1) that GO annotations are often incomplete, potentially in a biased manner, and subject to an “open world assumption” (absence of an annotation does not imply absence of a function), and 2) that conclusions drawn from a novel, large-scale GO analysis should whenever possible be supported by careful, in-depth examination of examples, to help ensure the conclusions have a justifiable biological basis.
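The "open world assumption" described above can be illustrated with a small sketch. The set-overlap score here is a generic stand-in chosen for illustration, not the metric of Nehrt et al., and the term labels and gene annotations are hypothetical placeholders rather than real GO data.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap; implicitly treats every missing annotation as a difference."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def explicit_conflicts(pos_a: set[str], neg_a: set[str],
                       pos_b: set[str], neg_b: set[str]) -> set[str]:
    """Terms asserted for one gene and explicitly negated (NOT) for the other."""
    return (pos_a & neg_b) | (pos_b & neg_a)


# Hypothetical ortholog pair studied with different experimental approaches.
mouse_terms = {"process_X"}   # e.g. inferred from a knockout phenotype
human_terms = {"activity_Y"}  # e.g. inferred from a biochemical assay

# A closed-world reading scores the pair as completely dissimilar...
closed_world_score = jaccard(mouse_terms, human_terms)

# ...whereas under the open world assumption only explicit negative
# annotations count as evidence of a functional difference (none here).
conflicts = explicit_conflicts(mouse_terms, set(), human_terms, set())

print(closed_world_score, conflicts)
```

In this toy case the overlap score is zero even though nothing in the annotations says the two genes differ in function; the annotations are complementary rather than contradictory.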
Understanding gene function—how individual genes contribute to the biology of an organism at the molecular, cellular and organism levels—is one of the primary aims of biomedical research. It has been a longstanding tenet of model organism research that experimental knowledge obtained in one organism is often applicable to other organisms, particularly if the organisms share the relevant genes because they inherited them from their common ancestor. Nevertheless this tenet is, like any hypothesis, not beyond question. A recent paper has termed this hypothesis a “conjecture,” and performed a statistical analysis, the results of which were interpreted as evidence against the hypothesis. This statistical analysis relied on a computational representation of gene function, the Gene Ontology (GO). As representatives of the international consortium that produces the GO, we show how the apparent evidence against the “ortholog conjecture” can be better explained as an artifact of how molecular biology knowledge is accumulated. In short, a complementarity between knowledge obtained in mouse and human experimental systems was incorrectly interpreted as a disagreement. We discuss the proper interpretation of GO annotations and potential sources of bias, with an eye toward enhancing the informed use of the GO by the scientific community.
InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
Ontologies and standards are very important parts of today's bioscience research. With the rapid increase of biological knowledge, they provide mechanisms to better store and represent data in a controlled and structured way, so that scientists can share the data, and utilize a wide variety of software and tools to manage and analyze the data. Most of these standards are initially designed for computers to access large amounts of data that are difficult for human biologists to handle, and it is important to keep in mind that ultimately biologists are going to produce and interpret the data. While ontologies and standards must follow strict semantic rules that may not be familiar to biologists, effort must be spent to lower the learning barrier by involving biologists in the process of development, and by providing software and tool support. A standard will not succeed without support from the wider bioscience research community. Thus, it is crucial that these standards be designed not only for machines to read, but also to be scientifically accurate and intuitive to human biologists.
ontology; standard; systems biology
Pythium ultimum is a ubiquitous oomycete plant pathogen responsible for a variety of diseases on a broad range of crop and ornamental species.
The P. ultimum genome (42.8 Mb) encodes 15,290 genes and has extensive sequence similarity and synteny with related Phytophthora species, including the potato blight pathogen Phytophthora infestans. Whole transcriptome sequencing revealed expression of 86% of genes, with detectable differential expression of suites of genes under abiotic stress and in the presence of a host. The predicted proteome includes a large repertoire of proteins involved in plant pathogen interactions, although, surprisingly, the P. ultimum genome does not encode any classical RXLR effectors and relatively few Crinkler genes in comparison to related phytopathogenic oomycetes. A lower number of enzymes involved in carbohydrate metabolism were present compared to Phytophthora species, with the notable absence of cutinases, suggesting a significant difference in virulence mechanisms between P. ultimum and more host-specific oomycete species. Although we observed a high degree of orthology with Phytophthora genomes, there were novel features of the P. ultimum proteome, including an expansion of genes involved in proteolysis and genes unique to Pythium. We identified a small gene family of cadherins, proteins involved in cell adhesion, the first report of these in a genome outside the metazoans.
Access to the P. ultimum genome has revealed not only core pathogenic mechanisms within the oomycetes but also lineage-specific genes associated with the alternative virulence and lifestyles found within the pythiaceous lineages compared to the Peronosporaceae.
Phylogenetic relationships between genes are not only of theoretical interest: they enable us to learn about human genes through the experimental work on their relatives in numerous model organisms from bacteria to fruit flies and mice. Yet the most commonly used computational algorithms for reconstructing gene trees can be inaccurate for numerous reasons, both algorithmic and biological. Additional information beyond gene sequence data has been shown to improve the accuracy of reconstructions, though at great computational cost.
We describe a simple, fast algorithm for inferring gene phylogenies, which makes use of information that was not available prior to the genomic age: namely, a reliable species tree spanning much of the tree of life, and knowledge of the complete complement of genes in a species' genome. The algorithm, called GIGA, constructs trees agglomeratively from a distance matrix representation of sequences, using simple rules to incorporate this genomic age information. GIGA makes use of a novel conceptualization of gene trees as being composed of orthologous subtrees (containing only speciation events), which are joined by other evolutionary events such as gene duplication or horizontal gene transfer. An important innovation in GIGA is that, at every step in the agglomeration process, the tree is interpreted/reinterpreted in terms of the evolutionary events that created it. Remarkably, GIGA performs well even when using a very simple distance metric (pairwise sequence differences) and no distance averaging over clades during the tree construction process.
GIGA is efficient, allowing phylogenetic reconstruction of very large gene families and determination of orthologs on a large scale. It is exceptionally robust to adding more gene sequences, opening up the possibility of creating stable identifiers for referring to not only extant genes, but also their common ancestors. We compared trees produced by GIGA to those in the TreeFam database, and they were very similar in general, with most differences likely due to poor alignment quality. However, some remaining differences are algorithmic, and can be explained by the fact that GIGA tends to put a larger emphasis on minimizing gene duplication and deletion events.
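The "very simple distance metric (pairwise sequence differences)" mentioned above can be sketched as the fraction of aligned positions at which two sequences differ. This covers only the distance-matrix step and is not the GIGA implementation: it assumes the sequences are already aligned to equal length, skips any column with a gap in either sequence (an assumption made here), and omits the species-tree rules entirely. The function names are illustrative.

```python
from __future__ import annotations

from itertools import combinations


def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Fraction of ungapped aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(seq_a, seq_b):
        if x == gap or y == gap:
            continue  # ignore columns with a gap in either sequence
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


def distance_matrix(aligned: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise distances keyed by (name_a, name_b) with name_a < name_b."""
    return {
        (a, b): p_distance(aligned[a], aligned[b])
        for a, b in combinations(sorted(aligned), 2)
    }
```

An agglomerative method would then repeatedly join the closest pair in such a matrix; how GIGA interprets each join in terms of speciation or duplication is not reproduced here.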
Computational predictions of the functional impact of genetic variation play a critical role in human genetics research. For nonsynonymous coding variants, most prediction algorithms make use of patterns of amino acid substitutions observed among homologous proteins at a given site. In particular, substitutions observed in orthologous proteins from other species are often assumed to be tolerated in the human protein as well. We examined this assumption by evaluating a panel of nonsynonymous mutants of a prototypical human enzyme, methylenetetrahydrofolate reductase (MTHFR), in a yeast cell-based functional assay. As expected, substitutions in human MTHFR at sites that are well-conserved across distant orthologs result in an impaired enzyme, while substitutions present in recently diverged sequences (including a 9-site mutant that “resurrects” the human-macaque ancestor) result in a functional enzyme. We also interrogated 30 sites with varying degrees of conservation by creating substitutions in the human enzyme that are accepted in at least one ortholog of MTHFR. Quite surprisingly, most of these substitutions were deleterious to the human enzyme. The results suggest that selective constraints vary between phylogenetic lineages such that inclusion of distant orthologs to infer selective pressures on the human enzyme may be misleading. We propose that homologous proteins are best used to reconstruct ancestral sequences and infer amino acid conservation among only direct lineal ancestors of a particular protein. We show that such an “ancestral site preservation” measure outperforms other prediction methods, not only in our selected set for MTHFR, but also in an exhaustive set of E. coli LacI mutants.
The rapid pace of technological advances in DNA sequencing methods is leading to the discovery of genetic variants at a remarkable rate. Indeed, it is conceivable that entire individual genomes will be sequenced routinely in the near future. While these platforms greatly increase our ability to catalog variation, they are also creating a downstream need to efficiently process and filter this information to ultimately identify genetic causes underlying human disease. Since empirical evaluation of the biological effects of mutation is not practical at such a scale, computational methods that predict such effects are needed. In this paper, we describe a novel methodology to predict whether mutations that lead to amino acid substitutions in proteins will impact protein function and, therefore, may be more likely to have physiological consequences. Specifically, we use orthologous proteins to reconstruct the likely sequences of ancestral proteins in the human lineage. We found that the longer a position has been preserved from direct ancestors in the lineage leading to the human enzyme, the more likely that mutation at that site will have a deleterious effect. We demonstrated that the method should be generally applicable to all proteins.
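The idea of scoring a site by how long it has been preserved among direct ancestors can be sketched as follows. This is a simplified illustration under stated assumptions, not the published measure: it assumes reconstructed ancestral sequences are already available, aligned to the human sequence without gaps, and ordered from most recent to most ancient, and it measures depth as a count of ancestral nodes rather than evolutionary time. All names are hypothetical.

```python
from __future__ import annotations


def preservation_depth(human: str, ancestors: list[str]) -> list[int]:
    """For each site, count consecutive direct ancestors (most recent first)
    that carry the same residue as the human sequence."""
    for anc in ancestors:
        if len(anc) != len(human):
            raise ValueError("ancestral sequences must match the human length")
    depths = []
    for i, residue in enumerate(human):
        depth = 0
        for anc in ancestors:
            if anc[i] != residue:
                break  # preservation ends at the first differing ancestor
            depth += 1
        depths.append(depth)
    return depths


if __name__ == "__main__":
    # Toy three-residue example with three reconstructed ancestors.
    print(preservation_depth("MKV", ["MKV", "MRV", "LRV"]))  # [2, 1, 3]
```

Under this reading, a substitution at the third site (preserved through all three ancestors) would be ranked as more likely deleterious than one at the second site.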
Protein Analysis THrough Evolutionary Relationships (PANTHER) is a comprehensive software system for inferring the functions of genes based on their evolutionary relationships. Phylogenetic trees of gene families form the basis for PANTHER and these trees are annotated with ontology terms describing the evolution of gene function from ancestral to modern day genes. One of the main applications of PANTHER is in accurate prediction of the functions of uncharacterized genes, based on their evolutionary relationships to genes with functions known from experiment. The PANTHER website, freely available at http://www.pantherdb.org, also includes software tools for analyzing genomic data relative to known and inferred gene functions. Since 2007, there have been several new developments to PANTHER: (i) improved phylogenetic trees, explicitly representing speciation and gene duplication events, (ii) identification of gene orthologs, including least diverged orthologs (best one-to-one pairs), (iii) coverage of more genomes (48 genomes, up to 87% of genes in each genome; see http://www.pantherdb.org/panther/summaryStats.jsp), (iv) improved support for alternative database identifiers for genes, proteins and microarray probes and (v) adoption of the SBGN standard for display of biological pathways. In addition, PANTHER trees are being annotated with gene function as part of the Gene Ontology Reference Genome project, resulting in an increasing number of curated functional annotations.
The InterPro database (http://www.ebi.ac.uk/interpro/) integrates predictive models or ‘signatures’ representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Integration is performed manually and approximately half of the total ∼58 000 signatures available in the source databases belong to an InterPro entry. Recently, we have started to also display the remaining un-integrated signatures via our web interface. Other developments include the provision of non-signature data, such as structural data, in new XML files on our FTP site, as well as the inclusion of matchless UniProtKB proteins in the existing match XML files. The web interface has been extended and now links out to the ADAN predicted protein–protein interaction database and the SPICE and Dasty viewers. The latest public release (v18.0) covers 79.8% of UniProtKB (v14.1) and consists of 16 549 entries. InterPro data may be accessed either via the web address above, via web services, by downloading files by anonymous FTP or by using the InterProScan search software (http://www.ebi.ac.uk/Tools/InterProScan/).
Although the efficacy of pharmacotherapy for tobacco dependence has been previously demonstrated, there is substantial variability among individuals in treatment response. We performed a systems-based candidate gene study of 1295 single nucleotide polymorphisms (SNPs) in 58 genes within the neuronal nicotinic receptor and dopamine systems to investigate their role in smoking cessation in a bupropion placebo-controlled randomized clinical trial. Putative functional variants were supplemented with tagSNPs within each gene. We used global tests of main effects and treatment interactions, adjusting the P-values for multiple correlated tests. An SNP (rs2072661) in the 3′ UTR region of the β2 nicotinic acetylcholine receptor subunit (CHRNB2) has an impact on abstinence rates at the end of treatment (adjusted P = 0.01) and after a 6-month follow-up period (adjusted P = 0.0002). This latter P-value is also significant with adjustment for the number of genes tested. Independent of treatment at 6-month follow-up, individuals carrying the minor allele have substantially decreased odds of quitting (OR = 0.31; 95% CI 0.18–0.55). Effect estimates indicate that the treatment is more effective for individuals with the wild-type (OR = 2.14, 95% CI 1.20–3.81) compared with individuals carrying the minor allele (OR = 0.83, 95% CI 0.32–2.19), although this difference is only suggestive (P = 0.10). Furthermore, this SNP demonstrated a role in the time to relapse (P = 0.0002) and an impact on withdrawal symptoms at target quit date (TQD) (P = 0.0009). Overall, while our results indicate strong evidence for CHRNB2 in ability to quit smoking, these results require replication in an independent sample.
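For readers less familiar with the odds ratios and confidence intervals quoted above, the sketch below shows how an unadjusted odds ratio and a Wald 95% interval are computed from a 2×2 table. It is a generic textbook calculation, not the study's analysis (the reported estimates may come from adjusted regression models), and the counts used in the example are made up purely for illustration.

```python
import math


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: quit, minor-allele carriers      b: did not quit, minor-allele carriers
    c: quit, wild-type                  d: did not quit, wild-type
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts, not taken from the trial.
    print(odds_ratio_ci(10, 20, 30, 40))
```

An interval that excludes 1, as for the 0.18–0.55 interval reported above, indicates an association at the corresponding significance level.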
InterPro is an integrated resource for protein families, domains and functional sites, which integrates the following protein signature databases: PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER. The latter two new member databases have been integrated since the last publication in this journal. There have been several new developments in InterPro, including an additional reading field, new database links, extensions to the web interface and additional match XML files. InterPro has always provided matches to UniProtKB proteins on the website and in the match XML file on the FTP site. Additional matches to proteins in UniParc (UniProt archive) are now available for download in the new match XML files only. The latest InterPro release (13.0) contains more than 13 000 entries, covering over 78% of all proteins in UniProtKB. The database is available for text- and sequence-based searches via a webserver (), and for download by anonymous FTP (). The InterProScan search tool is now also available via a web service at .
PANTHER is a freely available, comprehensive software system for relating protein sequence evolution to the evolution of specific protein functions and biological roles. Since 2005, there have been three main improvements to PANTHER. First, the sequences used to create evolutionary trees are carefully selected to provide coverage of phylogenetic as well as functional information. Second, PANTHER is now a member of the InterPro Consortium, and the PANTHER hidden Markov models (HMMs) are distributed as part of InterProScan. Third, we have dramatically expanded the number of pathways associated with subfamilies in PANTHER. Pathways provide a detailed, structured representation of protein function in the context of biological reaction networks. PANTHER pathways were generated in the emerging Systems Biology Markup Language (SBML) standard using pathway network editing software called CellDesigner. The pathway collection currently contains ∼1500 reactions in 130 pathways, curated by expert biologists with authorship attribution. The curation environment is designed to be easy to use, and the number of pathways is growing steadily. Because the reaction participants are linked to subfamilies and corresponding HMMs, reactions can be inferred across numerous different organisms. The HMMs can be downloaded by FTP, and tools for analyzing data in the context of pathways and function ontologies are available at .
The vast amount of protein sequence data now available, together with accumulating experimental knowledge of protein function, enables modeling of protein sequence and function evolution. The PANTHER database was designed to model evolutionary sequence–function relationships on a large scale. There are a number of applications for these data, and we have implemented web services that address three of them. The first is a protein classification service. Proteins can be classified, using only their amino acid sequences, to evolutionary groups at both the family and subfamily levels. Specific subfamilies, and often families, are further classified when possible according to their functions, including molecular function and the biological processes and pathways they participate in. The second application, then, is an expression data analysis service, where functional classification information can help find biological patterns in the data obtained from genome-wide experiments. The third application is a coding single-nucleotide polymorphism scoring service. In this case, information about evolutionarily related proteins is used to assess the likelihood of a deleterious effect on protein function arising from a single substitution at a specific amino acid position in the protein. All three web services are available at .
The human genome contains an estimated 100,000 to 300,000 DNA variants that alter an amino acid in an encoded protein. However, our ability to predict which of these variants are functionally significant is limited. We used a bioinformatics approach to define the functional significance of genetic variation in the ABCA1 gene, a cholesterol transporter crucial for the metabolism of high density lipoprotein cholesterol. To predict the functional consequence of each coding single nucleotide polymorphism and mutation in this gene, we calculated a substitution position-specific evolutionary conservation score for each variant, which considers site-specific variation among evolutionarily related proteins. To test the bioinformatics predictions experimentally, we evaluated the biochemical consequence of these sequence variants by examining the ability of cell lines stably transfected with the ABCA1 alleles to elicit cholesterol efflux. Our bioinformatics approach correctly predicted the functional impact of greater than 94% of the naturally occurring variants we assessed. The bioinformatics predictions were significantly correlated with the degree of functional impairment of ABCA1 mutations (r² = 0.62, p = 0.0008). These results have allowed us to define the impact of genetic variation on ABCA1 function and to suggest that the in silico evolutionary approach we used may be a useful tool in general for predicting the effects of DNA variation on gene function. In addition, our data suggest that considering patterns of positive selection, along with patterns of negative selection such as evolutionary conservation, may improve our ability to predict the functional effects of amino acid variation.
A major goal of human genetics research is to understand how genetic variation leads to differences in the function of genes. Genome sequencing projects have generated large amounts of sequence data, yet our ability to predict which specific sequence variants will result in functional differences is currently limited. To address this problem, the authors use an evolutionary model to predict the functional significance of genetic variation in the ABCA1 gene. To predict the functional impact of genetic variation in this gene, the authors compare the specific sites at which the variants occur in evolutionarily related proteins and generate a likelihood score of functional impairment. These predictions are then compared to actual functional measurements of each variant. The authors show that it is possible to accurately predict which specific variants will affect ABCA1 function and to what extent. These results suggest that the evolutionary approach used may be a useful method in general for determining the functional consequence of genetic variation, which should aid in the study of how genetic variation contributes to phenotypic differences.
The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human and Drosophila melanogaster. PANTHER is publicly available on the web at http://panther.celera.com.
The Celera Discovery System™ (CDS) is a web-accessible research workbench for mining genomic and related biological information. Users have access to the human and mouse genome sequences with annotation presented in summary form in BioMolecule Reports for genes, transcripts and proteins. Over 40 additional databases are available, including sequence, mapping, mutation, genetic variation, mRNA expression, protein structure, motif and classification data. Data are accessible by browsing reports, through a variety of interactive graphical viewers, and by advanced query capability provided by the LION SRS™ search engine. A growing number of sequence analysis tools are available, including sequence similarity, pattern searching, multiple sequence alignment and Hidden Markov Model search. A user workspace keeps track of queries and analyses. CDS is widely used by the academic research community and requires a subscription for access. The system and academic pricing information are available at http://cds.celera.com.