There are three services currently available in the ‘Tools’ section of the PANTHER website (http://www.pantherdb.org/tools): the protein sequence classification service, the expression data analysis service and the nsSNP scoring service. Before describing these services in detail, we first list the newest features for those already familiar with the PANTHER website.
Since the previous publication mentioning the interactive tools at the PANTHER website [Mi et al.], several new features have been added to the PANTHER web services.
The expression data analysis service can now be used for any set of genes or proteins from any organism, not just the organisms stored on the PANTHER website (human, mouse, rat and Drosophila melanogaster). Users interested in analyzing other datasets, such as protein expression data mapped to UniProt, or gene expression data in canines or yeast, can now do so using the new ‘PANTHER generic mapping file’ format described below. This format enables users to upload any proteins or genes, after scoring the associated protein sequences against the PANTHER HMM library using the scoring script available at http://www.pantherdb.org/downloads.
The expression data analysis statistics now include a Bonferroni correction for multiple testing. The Bonferroni correction is important because we are performing many statistical tests (one for each pathway or each ontology term) at the same time. This correction multiplies the single-test P-value by the number of independent tests to obtain an expected error rate. For pathways, we now correct the reported P-values by multiplying by the number of pathways tested. Some proteins participate in multiple pathways, so the tests are not completely independent of each other and the Bonferroni correction is conservative. For ontology terms, the simple Bonferroni correction becomes extremely conservative because parent (more general) and child (more specific) terms are not independent at all: any gene or protein associated with a child term is also associated with the parent (and grandparent, etc.) terms as well. We therefore modify the Bonferroni correction to account for the nesting of child terms below parent terms. Because at more specific levels it takes a larger number of independent terms to span the ontology, the number of independent tests can be seen as depending on the level of specificity in the ontology. Level 1 terms are treated as independent of other level 1 terms, but since all level 2 and 3 terms are subsumed by all of the level 1 terms they are not independent tests. We apply the same basic idea for lower level terms, though we need to adjust the count slightly to span the ontology, since not all level 1 terms are subdivided into level 2 terms, nor all level 2 terms into level 3 terms. For example, for level 2 terms, we treat each test as independent of the tests for other level 2 terms, and as independent of any level 1 terms that have no children.
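The simple form of the correction described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used on the PANTHER website: the function name `bonferroni`, the `n_tests` argument and the example P-values are hypothetical, and capping the corrected value at 1.0 is our assumption. Passing a smaller `n_tests` stands in for the modified correction, in which only the terms treated as independent at a given ontology level are counted.

```python
def bonferroni(p_values, n_tests=None):
    """Multiply each single-test P-value by the number of independent tests.

    n_tests defaults to the number of P-values supplied (e.g. one test per
    pathway). Corrected values are capped at 1.0 (an assumption of this sketch).
    """
    if n_tests is None:
        n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical single-test P-values for four pathways.
raw = [0.0004, 0.012, 0.03, 0.6]
corrected = bonferroni(raw)

# Hypothetical ontology case: the same raw values, but only three of the
# tests are treated as independent at this level of the ontology.
corrected_nested = bonferroni(raw, n_tests=3)
```

Because the multiplier is the number of tests assumed to be independent, overestimating that number (as happens for nested ontology terms) makes the correction more conservative than necessary.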
Users can now visualize the data underlying the binomial test statistics (see below). The counts of genes or proteins in each functional grouping can be viewed as pie or bar charts, and fold differences from the reference list can be graphed with statistically significant differences highlighted (Figure 1).
Figure 1 Viewing data underlying statistical analysis of gene lists with respect to function. The data are from Cho et al. (16); each list comprises genes up-regulated at a given stage of the human cell cycle. (A) Overlay chart of four datasets (M, G1, S and G2 ...)
The coding SNP scoring service now provides a link to a combined view of the family tree and the multiple sequence alignment, where the data used to calculate the amino acid probabilities are highlighted. The column of the multiple alignment that corresponds to the position of the amino acid substitution is highlighted in red, while the subtree over which the amino acid is conserved is highlighted in gray (Figure 2). In addition, the subPSEC score (1) is now converted to a more readily interpretable probability of deleterious effect on protein function, Pdeleterious, as described below.
Figure 2 Graphical view of the evolutionary data used to calculate coding SNP scores. The multiple sequence alignment of UniProt sequences (right) is displayed next to the protein family tree that shows the relationships between functionally distinct subfamilies.
Finally, a new version of the PANTHER library (6.0) has been released. All protein subfamilies have been reviewed and updated by expert curators, who have also updated and corrected the ontology terms. UniProt sequences are now used for building the library of families, simplifying the links to detailed functional information for individual proteins. The HMMs underlying the sequence classification service and coding SNP scoring statistics therefore incorporate the most recent protein sequence and annotation data. Version 1.2 of PANTHER Pathway has also been released, containing 107 pathways (primarily signaling pathways) that can be viewed interactively.
Protein sequence classification service
The protein sequence classification service is available at http://www.pantherdb.org/tools/hmmScoreForm.jsp. The user inputs a protein sequence (as a string of amino acid single letter codes or in FASTA format) and presses the ‘submit’ button. The sequence is then scored against the entire PANTHER ‘library’ of family and subfamily HMMs (1), and if the top hit is statistically significant (E-value < 0.001), that hit is returned. The alignment and E-value (Bonferroni-corrected P-value for the match) are shown, as well as the HMM name and links to additional information about the HMM, including the molecular function, biological process and pathway component associations, the training sequences used to build the HMM and the protein sequence family tree.
The top scoring HMM can be either a family or subfamily HMM. Family HMMs are generally associated with less specific functional information than subfamilies. The E-value of the hit is also important for interpreting the results, and in addition to the actual E-value, we provide a simple icon that shows the empirically derived confidence level of the classification. The icon shows three filled circles if the classification confidence level is high (E-value < 10^-23), two filled circles if the confidence is medium (E-value < 10^-11), and one filled circle if the confidence level is low. More detailed information is available by clicking on the help icon on the results page.
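For users who post-process downloaded scoring results, the icon thresholds above can be reproduced with a short function. The sketch below is illustrative only: the function name `confidence_circles` is hypothetical, and returning 0 for hits that fail the significance cutoff (E-value of 0.001 or more, for which the service returns no hit) is our own convention.

```python
def confidence_circles(e_value):
    """Return the number of filled circles for a top-hit E-value.

    3 = high confidence (E < 1e-23), 2 = medium (E < 1e-11), 1 = low.
    0 marks a hit that is not statistically significant (E >= 0.001);
    this last value is a convention of this sketch, not of the website.
    """
    if e_value >= 1e-3:
        return 0
    if e_value < 1e-23:
        return 3
    if e_value < 1e-11:
        return 2
    return 1


# Hypothetical E-values for four query sequences.
levels = [confidence_circles(e) for e in (3e-40, 5e-15, 2e-5, 0.2)]
```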
The PANTHER website also allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat and D.melanogaster genomes. These data can be accessed using the text search box on the home page, or the batch search page accessible from the Genes section of the website (http://www.pantherdb.org/genes).
Tools for finding statistically significant functional associations in genomic experimental data (Expression data analysis service)
There are two tools available in this section of the website (http://www.pantherdb.org/tools/genexAnalysis.jsp). Both are designed to uncover statistically significant relationships between input data and gene or protein functions. The main application for this type of analysis has been finding functional trends in mRNA microarray data or protein expression data from mass spectrometry, although it has also been used to aid data interpretation in a number of genome-wide studies, such as gene essentiality screens (12), and in comparative genomics studies, such as tests for positive selection across many genes (9).
The first tool is for analysis of gene or protein lists with respect to function. The test is the conceptually simple binomial test described in reference (6). Each input list is divided into groups based on the functional classification (either molecular function, biological process, or pathway). A reference list (all of the genes/proteins from which the list was drawn) is divided into groups in the same way. Then, for each functional category, the binomial test is applied to determine whether there is a statistical over- or under-representation of genes/proteins in the input list relative to the reference list.
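As an illustration of the list-based test, the following sketch computes a one-sided binomial tail probability for a single functional category. It is not the implementation of reference (6): the function names and the choice to report the tail in the direction of the observed deviation are assumptions made for this example. The probability of drawing a gene from the category is taken as the category's fraction of the reference list.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n draws with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def representation_test(k, n, ref_k, ref_n):
    """Test one category for over- or under-representation.

    k, n         : genes in the category / total genes in the input list
    ref_k, ref_n : the same counts for the reference list
    Returns ("+", P) for over-representation or ("-", P) for
    under-representation, where P is the one-sided binomial tail.
    """
    p = ref_k / ref_n          # expected fraction, from the reference list
    expected = n * p
    if k >= expected:
        tail = sum(binomial_pmf(i, n, p) for i in range(k, n + 1))
        return "+", tail
    tail = sum(binomial_pmf(i, n, p) for i in range(0, k + 1))
    return "-", tail


# Hypothetical counts: 10 of 10 input genes fall in a category that holds
# half of the reference list.
direction, p_value = representation_test(10, 10, 500, 1000)
```

The P-value obtained for each category would then be corrected for multiple testing as described earlier.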
The input is from one to four lists of genes or proteins (plus, optionally, a ‘reference list’), which are uploaded onto the website. These lists are not stored on the site once the user ends the session. One of two formats must be used, depending on whether the user wishes to use the pre-calculated HMM scoring data stored on the PANTHER website, or to use a file they have generated by scoring their own protein sequence set against the PANTHER HMMs (available for download at http://www.pantherdb.org/downloads). For using the pre-calculated PANTHER classification data, the format is simply a single column file of identifiers that can specify records in the PANTHER database. Currently the pre-calculated data cover only the human, mouse, rat and Drosophila genomes. The supported identifiers are listed on the list upload page; they include gene identifiers [Entrez Gene (5) identifiers for human, mouse and rat, or FlyBase (13) FBgn numbers for Drosophila], protein identifiers (RefSeq or FlyBase) and gene symbols. For the user-generated data, the ‘PANTHER generic mapping file’ format must be used instead. This format consists of two columns. The first column is an arbitrary identifier that uniquely specifies each record in the dataset; the website stores it temporarily (again only for the session) so that the user can track each record on the website. The second column is the PANTHER HMM identifier (e.g. PTHR19266, or PTHR19266:SF40), which is used to look up molecular function, biological process and pathway associations.
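A file in the two-column generic mapping format can be checked before upload with a short script such as the one below. This is a sketch under stated assumptions: the text above does not specify the column delimiter, so tab separation is assumed; the identifier pattern is inferred from the two example identifiers; and the function name `read_generic_mapping` and the example records are hypothetical.

```python
import csv
import io
import re

# Pattern inferred from the examples PTHR19266 and PTHR19266:SF40 (assumption).
HMM_ID = re.compile(r"^PTHR\d+(:SF\d+)?$")


def read_generic_mapping(handle):
    """Read a two-column file: tracking identifier, PANTHER HMM identifier.

    Returns {identifier: (family, subfamily or None)}.
    Assumes tab-delimited columns, which the format description does not state.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row}")
        identifier, hmm = (field.strip() for field in row)
        if not HMM_ID.match(hmm):
            raise ValueError(f"unrecognized PANTHER HMM identifier: {hmm}")
        family, _, subfamily = hmm.partition(":")
        mapping[identifier] = (family, subfamily or None)
    return mapping


# Hypothetical records; in practice the handle would be an opened file.
example = io.StringIO("geneA\tPTHR19266:SF40\ngeneB\tPTHR19266\n")
mapping = read_generic_mapping(example)
```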
The output of the tool is a list of P-values for under- or over-representation of each functional category in each of the input lists. From this output page, the user can export the statistics, or follow links to graphically view (as pie charts or bar graphs) the data that were used to compute the P-values, or to look at the list of genes/proteins in any functional group. When pathways are chosen as the functional categories, clicking on the pathway name brings up pathway diagrams colored according to preferences specified by the user.
The second tool in this section is for analysis of a complete list of genes/proteins that have numerical data associated with each gene/protein. The most commonly used numerical data are probably the fold-change value for each gene in a differential expression experiment, but the statistical test is general enough to handle any numerical data, continuous or discontinuous. The statistical tool builds a distribution of values for all input data in the list (this becomes the reference distribution), and then divides the input data into functional categories and builds a distribution of values for each category. The probability that the functional category distribution was drawn randomly from the reference distribution is estimated using the Mann–Whitney Rank-Sum Test (U-test), as described in (9). Using the whole distribution of values has been shown (9) to provide a more sensitive test than the simple list-based test described above.
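The rank-sum comparison can be illustrated with the standard textbook form of the U-test. The sketch below is not the procedure of reference (9): it uses average ranks for ties and a normal approximation without tie or continuity correction, which is only reasonable for moderately large samples, and the function names and example values are hypothetical.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(category, reference):
    """Return (U, two-sided P) comparing a category against a reference.

    P uses the normal approximation to the U distribution, with no tie
    or continuity correction (a simplification of this sketch).
    """
    n1, n2 = len(category), len(reference)
    ranks = average_ranks(list(category) + list(reference))
    rank_sum = sum(ranks[:n1])
    u = rank_sum - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, erfc(abs(z) / sqrt(2))


# Hypothetical fold-change values for one category and for the comparison set.
u_stat, p_value = mann_whitney_u([1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0])
```

How the comparison set is assembled (all input values, as in the reference distribution described above, or only the values outside the category) is left to the caller in this sketch.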
For the numerical data test, the user inputs a single file. Like the list comparison tool, there are two formats for the uploaded file, depending on the desired source of the PANTHER classification data: either the pre-calculated classifications available on the PANTHER site, or a user-generated file. For using the pre-calculated PANTHER data, the file must contain two columns: the first is the gene or protein identifier, and the second is the numerical value. For user-specified data, the file must contain three columns: an arbitrary tracking identifier (e.g. a UniProt identifier or gene symbol); the PANTHER HMM identifier indicating the classification of the gene/protein; and the numerical value.
The output of the tool is a list of P-values for each comparison between a functional category distribution and the reference distribution. Each distribution, and how it compares with the reference distribution, can be viewed graphically from the output page. We find that this is critical for interpreting any deviation between the functional category distribution and the overall distribution. The genes/proteins in each category can also be viewed from the output page by clicking on the listed counts. In addition, for pathways, clicking on the pathway name will bring up an interactive Java applet that colors the pathway using a ‘heat map’ derived from the input values (Figure 3).
Figure 3 Expression data analysis and visualization on the PANTHER website. (A) Mann–Whitney U-test results, and (B) CellDesigner (15) diagram of the T-cell activation signaling pathway from the PANTHER Pathway database (accession P00053, author Adam Douglass).
Coding SNP scoring service
The non-synonymous SNP scoring service is available at http://www.pantherdb.org/tools/csnpScoreForm.jsp. The methodology used to generate the scores is described in detail in (1) and summarized in (14). Briefly, the method uses a multiple alignment of a family of protein sequences, together with information about functional subfamilies within that family, to estimate the probabilities of different amino acids occurring at different positions in the protein family. High probability amino acids are likely to result in a functional protein, while low probability amino acids are likely to have a deleterious effect on protein function. We quantify the likely functional effect with a substitution position-specific evolutionary conservation (subPSEC) score, calculated as simply the log of the ratio of the probabilities of the substituted and wild-type amino acids: ln(Psub/Pwt), where Psub is the probability of the substituted amino acid and Pwt is the probability of the wild-type amino acid. Smaller (more negative) subPSEC scores indicate a higher likelihood of being deleterious. We have recently added a third parameter to the subPSEC score: the number of independent counts nic, a measure of the (global) diversity of sequences over which a position has been conserved. In effect, this parameter gives a greater probability of functional impact for positions that have been conserved over greater evolutionary distances. Based on calibration using a large set of known disease-causing non-synonymous mutations, as well as a large set of randomly sampled non-synonymous human SNPs (1), we can express the probability of a nsSNP being deleterious (Pdeleterious) as a function of the subPSEC score [details are given in reference (17)], in which subPSEC = −0.88 ln Pa ..., where Pa is the larger and Pb the smaller of the two amino acid probabilities, and nic is the number of independent counts. Pa, Pb and nic are all calculated in a position-specific manner, using the largest subtree of the family tree that both (1) conserves the same amino acid as the input sequence and (2) contains the PANTHER subfamily that had the best score to the input sequence.
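The simple two-probability form of the score, the natural log of the ratio of the substituted to the wild-type amino acid probability, is easy to reproduce. The sketch below covers only that form: it does not include the nic parameter or the conversion to Pdeleterious, whose calibrated expressions are given in reference (17). The function name `simple_subpsec` and the example probabilities are hypothetical.

```python
from math import log


def simple_subpsec(p_wt, p_sub):
    """Return ln(Psub / Pwt) for one substitution.

    p_wt  : position-specific probability of the wild-type amino acid
    p_sub : position-specific probability of the substituted amino acid
    Only the simple log-ratio form is computed; nic is not used here.
    """
    if p_wt <= 0 or p_sub <= 0:
        raise ValueError("amino acid probabilities must be positive")
    return log(p_sub / p_wt)


# Hypothetical probabilities: a well-conserved wild-type residue replaced
# by a residue that is rare at this position gives a strongly negative score.
score = simple_subpsec(p_wt=0.8, p_sub=0.01)
```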
To use the coding SNP scoring service, there are two boxes on the input form that must be filled out. The first is the sequence of the protein, in the same one-letter code format as for the classification service above. The second is a list of amino acid substitutions, in the standard mutation notation, e.g. D432A, where the wild-type amino acid is D (D must appear at position 432 in the sequence entered into the first box), and A is the substituted amino acid. Multiple substitutions may be entered into the second box, separated by a <return> character. The exact number of substitutions that can be handled per query (i.e. before the page times out) ranges from a minimum of 10 to a maximum of hundreds, depending on the length of the query protein sequence and the size of the PANTHER family it matches.
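Because the wild-type residue in each substitution must match the submitted sequence, it can be useful to validate a substitution list before pasting it into the form. The following sketch parses entries in the D432A style, one per line, and checks each against the sequence. The function name `parse_substitutions`, the restriction to upper-case single-letter codes and the short example sequence are assumptions made for illustration.

```python
import re

# Wild-type letter, 1-based position, substituted letter (e.g. D432A).
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitutions(sequence, text):
    """Parse newline-separated substitutions and check them against sequence.

    Returns a list of (wild_type, position, substituted) tuples.
    Raises ValueError for malformed entries, out-of-range positions, or a
    wild-type residue that does not match the sequence at that position.
    """
    parsed = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue
        match = SUBSTITUTION.match(entry)
        if match is None:
            raise ValueError(f"not in standard mutation notation: {entry}")
        wild_type, position, substituted = (
            match.group(1), int(match.group(2)), match.group(3))
        if not 1 <= position <= len(sequence):
            raise ValueError(f"position out of range: {entry}")
        if sequence[position - 1] != wild_type:
            raise ValueError(
                f"{entry}: sequence has {sequence[position - 1]} "
                f"at position {position}")
        parsed.append((wild_type, position, substituted))
    return parsed


# Hypothetical four-residue sequence and two substitutions.
substitutions = parse_substitutions("MKDA", "D3A\nK2R\n")
```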
The wild-type protein sequence is then searched against the PANTHER HMM library to find the highest-scoring (statistically significant) PANTHER HMM, using the same methods as the classification service above. This search specifies the multiple sequence alignment, subfamily (if possible) and tree that will be used to estimate the substitution probabilities, and also specifies the position of the substitution in the multiple alignment (by aligning the user's sequence to the existing multiple alignment).
The output is a list of the substitutions entered by the user, with the amino acid probabilities derived from the multiple sequence alignment, Pwt and Psub, nic, the subPSEC score, and the predicted probability that the substitution is deleterious, Pdeleterious. The alignment and tree used to generate the amino acid probabilities can be viewed by clicking on the link from the ‘position’ column of the output data. If the substitution occurs at a position that does not appear in the multiple alignment, a subPSEC score cannot be generated and the output will return the text string ‘does not align to HMM’, indicating that the substitution occurs at a position that is inserted relative to the consensus HMM for the given family. In most cases, these positions are not modeled by the HMMs simply because they do not appear in most of the related sequences; as a result, substitutions at inserted positions are not generally likely to be deleterious.