Derivation of biological meaning from large sets of proteins or genes is a frequent task in genomic and proteomic studies. Such sets often arise from experimental methods including large-scale gene expression experiments and mass spectrometry (MS) proteomics. Large sets of genes or proteins are also the outcome of computational methods such as BLAST search and homology-based classifications. We have developed the PANDORA web server, which functions as a platform for the advanced biological analysis of sets of genes, proteins, or proteolytic peptides. First, the input set is mapped to a set of corresponding proteins. Then, an analysis of the protein set produces a graph-based hierarchy which highlights intrinsic relations amongst biological subsets, in light of their different annotations from multiple annotation resources. PANDORA integrates a large collection of annotation sources (GO, UniProt Keywords, InterPro, Enzyme, SCOP, CATH, Gene-3D, NCBI taxonomy and more) that comprise ~200 000 different annotation terms associated with ~3.2 million sequences from UniProtKB. Statistical enrichment based on a binomial approximation of the hypergeometric distribution and corrected for multiple hypothesis tests is calculated using several background sets, including major gene-expression DNA-chip platforms. Users can also visualize either standard or user-defined binary and quantitative properties alongside the proteins. PANDORA 4.2 is available at http://www.pandora.cs.huji.ac.il.
Due to advances in biological, experimental and computational methodologies, scientists are able to conduct high-throughput genomic and proteomic experiments. In most of these, biologists need to extract meaningful biological insights from a large set of proteins or genes (1). A common approach for extracting such insights is to manually examine the set of proteins and attempt to derive biological conclusions. However, this method relies heavily on the expertise of the biologist examining the data and often produces a partial and biased view of the protein set (2). Another approach is to use annotation-based computational methods. These methods enable the biologist to reach a global and more objective view of the data (3).
Typically, computational methods use a single annotation source, most commonly the Gene Ontology (GO) (4), and automatically detect annotations that appear at a frequency that is significantly greater than expected (5). However, the strong dependency of such methods on a single source restricts the biological information they can extract. Furthermore, these methods often provide only a limited biological view of the data set and are unable to detect groups that are characterized by sharing multiple biological properties in common. There are some exceptions, however, such as the DAVID (6) and EASE (7) resources, which provide statistical analysis of annotation subsets for the purpose of extracting biological knowledge.
We have developed a web server called PANDORA (Protein ANnotation Diagram ORiented Analysis) whose goal is the biological analysis of protein sets (8). Many protein and gene-annotation systems either explicitly, or implicitly, correspond to some hierarchical structure. For example, being annotated as localizing to the nucleolus necessarily implies localization to the nucleus, though the converse does not hold. Thus, several tools have been developed to address the visualization task for hierarchical annotations (9). We take this concept one step further by dynamically integrating multiple annotation sources into the natural hierarchy deriving from a particular set of user-defined proteins. PANDORA shows the protein set as a graph, which we refer to as the Concept DAG (Directed Acyclic Graph). The Concept DAG is a directed graph whose nodes represent protein subsets that share a unique combination of one or more biological annotations, and whose directed edges represent subset/superset relations between nodes [for further information on the graph construction see (8)]. Importantly, the graph still retains the annotation information for each protein while providing a richer and more accurate view of the data. Furthermore, PANDORA is based on the annotations as extracted from UniProtKB protein entry files. For each file, the annotation provided by UniProtKB and the mapping from external annotation resources encompassing extensive biological aspects is extracted. The rich collection of annotation resources covers biological functions at various levels: participation in biological processes, 3D structural classification, cellular localization, taxonomy, and more (see ‘Databases’ section). This overcomes the limitation of a single annotation source and permits helpful comparisons between various biological aspects.
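To make the structure of the Concept DAG concrete, the following is a minimal Python sketch, not the PANDORA implementation [the actual construction is described in (8)]. It assumes a simplified reading of the definition above: each annotation term defines the subset of input proteins carrying it, terms covering identical subsets are merged into one node, the whole input set is the root, and an edge links a node to each of its immediate subsets. The function name `build_concept_dag` and the input layout are hypothetical.

```python
from collections import defaultdict

def build_concept_dag(annotations):
    """Build a simplified Concept DAG.

    annotations: dict mapping protein ID -> set of annotation terms.
    Returns (nodes, edges): nodes maps a frozenset of proteins to the
    terms shared by exactly that subset; edges are (parent, child) pairs.
    """
    # Collect the proteins that carry each annotation term.
    proteins_by_term = defaultdict(set)
    for protein, terms in annotations.items():
        for term in terms:
            proteins_by_term[term].add(protein)

    # Terms covering exactly the same proteins collapse into one node.
    nodes = defaultdict(set)
    for term, proteins in proteins_by_term.items():
        nodes[frozenset(proteins)].add(term)
    nodes[frozenset(annotations)]  # root node: the whole input set

    # Add an edge parent -> child when child is a proper subset of parent
    # and no other node lies strictly between them.
    edges = set()
    for child in nodes:
        supersets = [n for n in nodes if child < n]
        for parent in supersets:
            if not any(child < mid < parent for mid in supersets):
                edges.add((parent, child))
    return dict(nodes), edges

# Toy input: nucleolar proteins form a subset of nuclear proteins.
example = {
    "P1": {"nucleus", "nucleolus"},
    "P2": {"nucleus"},
    "P3": {"nucleus", "kinase"},
    "P4": {"kinase"},
}
nodes, edges = build_concept_dag(example)
```

In this toy input the nucleolus node hangs below the nucleus node rather than directly below the root, mirroring the nucleolus/nucleus example given above.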
We have previously described the underlying logic behind PANDORA and have demonstrated that PANDORA is useful in extracting meaningful and previously overlooked data from protein sets (3,8). PANDORA was valuable in interpretation of large-scale experiments as demonstrated in (10). PANDORA 4.2 is expanded to include most UniProtKB protein sequences and their associated annotations. In this article, we describe new and improved features in PANDORA 4.2 that further extend the power of biological analysis of sets through our system. These features include: (i) User Properties—PANDORA allows incorporation of external user properties, such as differential expression levels or quantitative information from mass spectrometry (MS) proteomics experiments. These custom properties can be included in the PANDORA analysis to further enhance the discovery of biological knowledge; (ii) Statistical evaluation of the input relative to several different background databases; (iii) Using the hit list of protein matches of NCBI-BLAST as an input set and using the BLAST e-values as quantitative properties; (iv) Incorporating PANDORA into external biological servers such as ProtoNet, which provides thousands of homology-based clusters for analysis; (v) Expanding PANDORA to handle MS proteomics data—PANDORA now also allows peptides as input for major model organisms. Peptides are mapped to peptide lists representing in-silico cleavage by proteases that are commonly used in MS proteomics research.
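The peptide-to-protein mapping in (v) can be illustrated with a short Python sketch. It is not the PANDORA code. It assumes trypsin as the protease and the commonly used rule of cleavage after K or R unless the next residue is P, and it ignores missed cleavages. The function names, the `min_length` cutoff and the toy sequences are hypothetical.

```python
from collections import defaultdict

def tryptic_peptides(sequence, min_length=6):
    """In-silico digestion: cleave after K or R unless followed by P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return [p for p in peptides if len(p) >= min_length]

def map_peptides_to_proteins(observed, proteome, min_length=6):
    """Map each observed peptide to the proteins whose digest contains it.

    proteome: dict mapping accession -> amino-acid sequence.
    """
    index = defaultdict(set)
    for accession, sequence in proteome.items():
        for peptide in tryptic_peptides(sequence, min_length):
            index[peptide].add(accession)
    return {pep: sorted(index.get(pep, ())) for pep in observed}

# Toy proteome with made-up accessions and sequences.
proteome = {"X1": "MAAKGGRPLLKEE", "X2": "SSKGGRPLLK"}
hits = map_peptides_to_proteins(["GGRPLLK", "MAAK"], proteome, min_length=4)
```

The union of the protein lists returned for all observed peptides would then serve as the protein set passed on to the analysis.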
The user starts using the PANDORA server either by entering a user-defined set of proteins (‘User Set’ menu), entering a list of proteolytic peptides to be mapped to the proteins from which they are derived (‘Peptides’), searching for proteins with a particular annotation (‘Keyword’), or considering the proteins detected in a BLAST homology search (‘Blast’). Ultimately, these inputs are all transformed into the set of corresponding proteins and the process continues from there. Subsequently, pre-defined quantitative properties can be selected, as desired. Finally, the proteins being analyzed are displayed in their annotation-derived hierarchy, where each node represents a subset of proteins with particular biological properties. In addition, a statistical evaluation of annotation enrichment is provided.
PANDORA 4.2 supports almost 10 times as many proteins as previous versions (Table 1), covering ~3.2 million sequences from UniProtKB (11). A sample list of the keywords that are supported is shown in Table 2. On average, each protein is covered by 24 different annotation types (excluding taxonomy). PANDORA is based mainly on annotations extracted from UniProtKB (the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL databases). The mapping to UniProtKB Keywords, ENZYME, GO annotations, InterPro and Taxonomy is based on the XML file for each protein sequence entry. For structural annotations (CATH, SCOP, GENE3D), a direct mapping was completed from the original resources or through the InterPro compendium. The individual sources underlying InterPro entries are maintained, which allows the analysis to focus on any of the family- and domain-based resources (e.g. PROSITE, PRINTS, Pfam, SMART, SUPERFAMILY). All the information is stored locally in the ProtoNet database (12). The database supporting PANDORA 4.2 occupies 88 GB. Several of the databases are structured and hierarchical (such as ENZYME, SCOP, CATH). For these resources, each level of the hierarchy can be selected separately, resulting in ~40 levels of annotations that can be selected for analysis (Figure 1). Note that the coverage of the annotation resources ranges from 8 to 78% (excluding taxonomy) (Table 2) but this level is higher for the main model organisms.
There are four methods of selecting an initial set for PANDORA to work with:
Generally, PANDORA receives a protein set as input, derives all information on the proteins from its integrated database and uses that information to build the Concept DAG (see example in Figure 1). However, in many cases, it would be helpful to let the user introduce external supplementary information about the proteins into the analysis. Examples of such external information are relative change in expression levels (which are typical for microarray experiments), a user-defined division of the protein set into several sets (allowing comparison of the sets from repeated experiments), or even an alignment score such as BLAST e-values (see ‘Input methods and integrated BLAST’ section). To this end, we have developed the ‘user properties’ option. Generally, properties introduced by the user can be divided into three categories:
In order to deal with a quantitative property, PANDORA ignores the property when building the graph, and then examines the distribution of the property on the graph (Figure 1). The PANDORA graph consists of nodes, where each node represents a subset of proteins that share certain biological properties. Each node, therefore, has a distribution of the quantitative property for its proteins. The distribution of each node is displayed as a histogram below the node. This allows the user to easily recognize nodes with distinct quantitative patterns. For example, if the quantitative property is change in expression level, we could easily identify subsets of proteins that are both related biologically and share similar expression patterns. Of course the user is not limited to any specific kind of quantitative property and could make creative use of this feature. For example, the integrated BLAST feature uses the BLAST e-values as a quantitative property (see ‘Input methods and integrated BLAST’ section) in order to facilitate the detection of biological groups that have statistically significant sequence similarity to an input sequence. For simplicity, the user may display up to three quantitative properties simultaneously, enabling the search for correlation between different orthogonal properties. We added pre-calculated quantitative properties for each protein in the database, including pI, molecular weight (in Dalton) and length (in amino acids). Experimental MS proteomics is a rich source of protein and peptide sets. We thus added quantitative data that include the number of detectable peptides with various commonly used proteases and the number of validated phosphorylation sites. The latter were extracted from UniProtKB XML under ‘amino-acid modification’. A further refinement is achieved by partitioning the phosphorylation type into Phosphothreonine, Phosphoserine and Phosphotyrosine.
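The per-node distribution described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the equal-width binning and the number of bins are assumptions, and the text does not specify how PANDORA bins its histograms. The graph nodes are taken as given, since the quantitative property plays no part in building them.

```python
def node_histograms(nodes, values, n_bins=5):
    """Count, for every node, how its proteins fall into shared bins.

    nodes:  dict mapping node name -> set of protein IDs.
    values: dict mapping protein ID -> quantitative property (float).
    All nodes share the same equal-width bins so they can be compared.
    """
    lo, hi = min(values.values()), max(values.values())
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all values are equal
    histograms = {}
    for name, proteins in nodes.items():
        counts = [0] * n_bins
        for protein in proteins:
            if protein in values:  # proteins without a value are skipped
                index = min(int((values[protein] - lo) / width), n_bins - 1)
                counts[index] += 1
        histograms[name] = counts
    return histograms

# Toy data: hypothetical fold-change values for four proteins.
fold_change = {"P1": 0.0, "P2": 1.0, "P3": 2.0, "P4": 4.0}
groups = {"all": {"P1", "P2", "P3", "P4"}, "nucleus": {"P1", "P2"}}
histograms = node_histograms(groups, fold_change, n_bins=4)
```

Using the same bin edges for every node is a design choice that keeps the histograms of different nodes directly comparable.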
One critical aspect in the evaluation of biological results is their statistical significance. PANDORA addresses this by coloring each node according to its sensitivity: the node’s color represents the highest sensitivity of the node to any of its annotations. A white node has a sensitivity of 1 and a red node has a sensitivity of 0. For some nodes the sensitivity is not well-defined and these nodes appear as a red–white swirl (undetermined).
In addition, PANDORA provides an evaluation table together with each graph. The table gives P-values for the appearance of the annotations on the current protein set, estimating the probability that an annotation would randomly appear as frequently as it did. The P-value calculation is based on a binomial approximation of the hypergeometric distribution, followed by a Bonferroni correction. An additional correction for multiple hypothesis testing, based on the FDR adjustment (13), is also included in the table. In conjunction with the ability to use several different annotation sources, this evaluation can label statistically significant enrichments (Figure 1). Of course, to properly estimate these P-values, it is necessary to know from which background pool of proteins the input proteins were taken and evaluate how frequent each annotation is in that background set. Although PANDORA generally does not assume anything about the origin of the protein set which is analyzed, it allows a selection of background models that fit various experimental models. For microarray experiments, PANDORA offers a variety of background sets, such as the most commonly used Affymetrix microarrays. For proteomic experiments, PANDORA offers background sets of proteomes of several model organisms and proteins according to their partition to SwissProt or TrEMBL. For other purposes, PANDORA simply uses the whole SwissProt+TrEMBL database as its background. Researchers who require background sets that are not currently included in PANDORA are encouraged to contact the authors.
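The enrichment calculation can be illustrated with the following Python sketch, which is not the PANDORA code. It assumes that the P-value for a term is the upper binomial tail P(X ≥ k), where k of the n input proteins carry the term and the success probability p is the frequency of the term in the chosen background set. It further assumes that the FDR adjustment is the Benjamini-Hochberg step-up procedure, which the text does not state explicitly. All function names are hypothetical.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with p the background frequency."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def bonferroni(pvalues):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, pv * m) for pv in pvalues]

def benjamini_hochberg(pvalues):
    """FDR-adjusted P-values (assumed here to be Benjamini-Hochberg)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy case: 8 of 20 input proteins carry a term found in 5% of the background.
p_raw = binomial_tail(8, 20, 0.05)
```

The binomial tail treats the input proteins as independent draws, which approximates the hypergeometric distribution well when the input set is small relative to the background set.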
PANDORA results can be saved in different formats (including FASTA format and accession ID lists). In addition, PANDORA can present the group of proteins unified by an annotation node as a multiple sequence alignment (i.e. CLUSTALW representation). PANDORA can easily interface with other biological servers that deal with protein set analysis. A web server that has recently been linked to PANDORA is ProtoNet (12), which uses PANDORA to gain biological insight into large protein clusters. Web server developers who are interested in interfacing directly with PANDORA may contact the authors.
PANDORA is based on an extensive database which integrates several biological databases. An underlying protein database is used as a basis for information on the protein entities, and several annotation sources whose annotations are mapped to the protein databases are used in conjunction.
The underlying protein database initially used by PANDORA (8) has been changed from SwissProt (114 035 proteins) to UniProtKB (3 188 835 proteins), giving a greatly enhanced representation of the proteomes of several model organisms (see examples in Table 1). The annotation sources used by PANDORA have also been updated, and now offer ~200 000 different annotations, spanning several different biological domains. All underlying protein and annotation databases are periodically updated in order to keep up with the most recent biological knowledge available. We are currently planning to add additional annotation sources to PANDORA in order to improve protein set analysis in further biological aspects such as protein–protein interactions.
Prospects consortium (EU framework VII) and the BSF (grant number 2007219); Sudarsky Center for Computational Biology (to N.R., M.F. and R.S.). Funding for open access charge: Prospects consortium (EU framework VII) and the BSF (grant number 2007219).
Conflict of interest statement. None declared.
The authors would like to thank Solange Karsenty for her support in maintaining and designing the Web site. The authors thank Michael Dvorkin for support in managing the immense database and the ProtoNet team.