The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact email@example.com
PISCES is a database server for producing lists of sequences from the Protein Data Bank (PDB) using a number of entry- and chain-specific criteria and mutual sequence identity. Our goal in culling the PDB is to provide the longest list possible of the highest resolution structures that fulfill the sequence identity and structural quality cut-offs. The new PISCES server uses a combination of PSI-BLAST and structure-based alignments to determine sequence identities. Structure alignment produces more complete alignments and therefore more accurate sequence identities than PSI-BLAST. PISCES now allows a user to cull the PDB by-entry in addition to the standard culling by individual chains. In this scenario, a list will contain only entries that do not have a chain that has a sequence identity to any chain in any other entry in the list over the sequence identity cut-off. PISCES also provides fully annotated sequences including gene name and species. The server allows a user to cull an input list of entries or chains, so that other criteria, such as function, can be used. Results from a search on the re-engineered RCSB's site for the PDB can be entered into the PISCES server by a single click, combining the powerful searching abilities of the PDB with PISCES's utilities for sequence culling. The server's data are updated weekly. The server is available at http://dunbrack.fccc.edu/pisces.
For many purposes, it is useful to have a list of protein sequences of known structure from entries in the Protein Data Bank (PDB) (1) that fit certain user-defined criteria. These purposes include statistical analysis of certain structural features, such as side-chain rotamer distributions (2), benchmarking sequence alignment and structure prediction methods (3), and analysis of a set of proteins, such as those with a certain function (e.g. DNA-binding) (4). The criteria for limiting the list of sequences can be applied to each chain in a PDB file, to each entry as a whole or to the relationships between protein chains and entries. Chain-specific criteria include function, species, sequence length and whether the chain is presented with only Cα atoms in the PDB file or all atoms. Entry-specific criteria include the function of a complex, resolution, R-factor, experiment type (NMR or X-ray) and other structure quality criteria. Pairwise criteria include the sequence or structural similarity between proteins in the set.
There are a number of servers that provide such lists, including the REPRDB (5), ASTRAL (6) and UniqueProt (7) sites and the PDB site itself. Each provides a variety of services for specific purposes. They vary in a number of respects, including how sequence identities are determined and the flexibility and extent of services available to the user. In this paper, we describe extensions to our PISCES server that provide some advantages over other such servers, depending on the purposes of the user.
First, we believe that local alignments are generally better than global alignments for the purpose of selecting sequences from the PDB dataset, since many proteins share homology in one domain that may represent only a portion of the complete sequence. Servers that use CLUSTAL W (8), which performs global alignments, may underestimate sequence identity for shared domains by aligning over unrelated portions of two sequences. It is also important that sequence alignments cover the full length of the related regions of each protein pair. BLAST [and to a lesser extent PSI-BLAST (9)] may overestimate sequence identity for distantly related proteins by providing incomplete alignments over only the most-conserved regions of the sequence pair. Overestimating sequence identities results in lists that are shorter than necessary, since some pairs are removed because of the inaccurate estimation of sequence identity. Structure-based alignments are likely to provide more complete and accurate alignments at low sequence identity than any sequence-based alignment method (10). As described below, PISCES now uses a combination of structure alignments at low sequence identity and sequence alignments using PSI-BLAST at high sequence identity. Criteria other than strict sequence identity are sometimes used (7).
Second, sequence culling servers may provide services for a variety of purposes. [The verb to cull can be used in different senses. The original meaning (Webster's 1913) was to select or gather, as in flowers. More recently, the word is used in two ways: culling the weakest members of a herd, or culling the herd itself, leaving behind the healthiest animals. We are generally using it in the second sense, to cull the PDB by removing redundant or low-resolution structures.] In addition to culling the entire PDB, a user may wish to cull a subset of the PDB. For instance, a user may have a list of PDB entries that contain antibody structures and may want a subset of these with certain structural quality and mutual sequence identity. One can also apply the sequence identity ‘by-entry’ rather than ‘by-chain’. We define this as removing a PDB entry (that may contain more than one chain) from the list if another entry already in the list has at least one chain that has a sequence identity, higher than the cut-off, to any chain in the proposed entry. This might be useful if one wants a list of unique complexes in the PDB. PISCES provides both these services.
Another aspect of such a server is the kind of annotation provided in the results. Each server returns a list of chains and some servers are able to return a list of FASTA-formatted sequences. Usually, the annotation is simply the chain ID and sometimes what the RCSB provides in its ‘pdb_seqres.txt’ file (ftp://ftp.rcsb.org/pub/pdb/derived_data/pdb_seqres.txt). However, the RCSB does not provide consistent annotation for each chain in each PDB file with multiple chains. For instance, PDB entry 2HLA has two chains. Chain A is the human histocompatibility antigen HLA-Aw68.1. Chain B is β2-microglobulin. In the FASTA-formatted sequence database from the PDB, these two chains are given as:
In fact, the lines are unfortunately truncated, but even the annotation from the ‘Sequence Details’ resource for the entry 2HLA on RCSB's site has only this for Chain B: ‘HUMAN CLASS I HISTOCOMPATIBILITY ANTIGEN AW 68.1’. There is, however, more information in the full mmCIF file, including the name ‘beta-2-microglobulin’ as well as the Swiss-Prot entry accession number from which additional information may be obtained.
Our PISCES server was developed from a previous version referred to as ‘CulledPDB’, begun in 1999. CulledPDB used BLAST to determine sequence identities, while PISCES has used PSI-BLAST to determine sequence identities (11) by building a position-specific scoring matrix or profile for each unique sequence in the PDB from a multi-round search of the non-redundant protein sequence database (12). PISCES also provides a resource for culling a user-input list of chains or entries from the PDB or from a set of non-PDB sequences provided as either a list of GenBank IDs or in the FASTA format. We have shown that the use of the PSI-BLAST profiles resulted in much longer lists than using BLAST (11), while not exhibiting the errors associated with using a global alignment algorithm, such as CLUSTAL W.
We have made several important improvements in PISCES that are described in this paper:
PISCES uses the mmCIF-format files from RCSB to determine sequences, experiment type, resolution, R-factors and other features of PDB entries and chains. These mmCIF files are a result of the Uniformity Project, an effort by the RCSB to standardize and correct information across all the PDB files (15,16). Some missing values for resolution and R-factors are obtained from the PDBFINDER database (17). The PDB data used by the server are updated weekly.
PISCES works from a FASTA-formatted database of all sequences in the PDB called pdbaa, which is distinct from NCBI's database of the PDB sequences with the same name (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/pdbaa.gz). Our pdbaa database is available to users who want a complete set of sequences in the PDB. It is also used to provide the sequences and annotations returned by PISCES for subsets of the PDB. pdbaa provides basic information on chain length, experiment type (X-ray, NMR, etc.), resolution, R-factor and free R-factor as appropriate. These pieces of information are useful if a user wants to use a structure as a template for homology modeling, and wants to choose the best template by searching the whole PDB.
In order to integrate as much useful information as possible into the output of PISCES, we have gathered annotations on each PDB sequence from a number of sources. The goal is to have gene names for each sequence, Swiss-Prot or other database identifiers and species information. These are obtained from the following sources: the mmCIF files themselves, Swiss-Prot's Index of PDB entries (PDBTOSP.TXT, http://us.expasy.org/cgi-bin/lists?pdbtosp.txt), the Protein Identification Resource (PIR) and NCBI's non-redundant protein sequence database, in that order to obtain the desired information.
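The ordered lookup described above can be sketched as a simple fallback: each source is consulted in turn and the first one that supplies the requested field wins. This is only an illustration; the function name, the dictionary layout and the example values are hypothetical and do not reflect the server's actual data structures.

```python
# Sources in the priority order given in the text.
SOURCE_ORDER = ("mmCIF", "Swiss-Prot PDBTOSP", "PIR", "NCBI nr")


def first_available(field, records_by_source):
    """Return (value, source) from the first source that supplies `field`.

    `records_by_source` maps a source name to a dict of annotation fields
    (a hypothetical layout used only for this sketch).
    """
    for source in SOURCE_ORDER:
        value = records_by_source.get(source, {}).get(field)
        if value:
            return value, source
    return None, None


# Hypothetical records for one chain: the mmCIF file lacks a gene name,
# so the lookup falls through to the next source that has one.
records = {
    "mmCIF": {"gene": None, "species": "Homo sapiens"},
    "PIR": {"gene": "GENE1"},
}
gene, gene_source = first_available("gene", records)
species, species_source = first_available("species", records)
```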
Our goal in culling the PDB is to provide the longest list possible of the highest resolution structures that fulfill the sequence identity and structural quality cut-offs. For this purpose, we need to identify potential evolutionary relationships and complete sequence alignment between related regions of each identified pair. Historically, 98% of requests to PISCES have been for culled lists at 25% identity or higher. We have determined that PSI-BLAST identifies 99.9% of such relationships with E-value of 1.0 or better (data not shown), and thus we do not use structure alignment to identify evolutionary relationships. Rather, we use the structure alignment program CE (13) to align structure pairs which PSI-BLAST determines as having a sequence identity of 50% or less or for which the PSI-BLAST alignment covers <80% of the shorter sequence in each pair. PSI-BLAST is used as described previously (11).
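The rule for sending a pair to structure alignment can be written as a short predicate. The sketch below assumes that coverage is the number of aligned residues divided by the length of the shorter sequence; the function and parameter names are illustrative and not taken from the PISCES code.

```python
def needs_structure_alignment(psiblast_identity, aligned_length, len_a, len_b,
                              identity_threshold=50.0, coverage_threshold=0.80):
    """Return True if a PSI-BLAST pair should be realigned structurally.

    A pair qualifies when its PSI-BLAST sequence identity is at or below
    50%, or when the alignment covers less than 80% of the shorter sequence.
    Coverage = aligned_length / min(len_a, len_b) is an assumption of this
    sketch.
    """
    coverage = aligned_length / min(len_a, len_b)
    return psiblast_identity <= identity_threshold or coverage < coverage_threshold
```

For example, a pair at 90% identity whose alignment spans only 100 residues of a 200-residue chain would still be realigned, because its coverage is 50%.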
We have found some cases where structure alignment sequence identities are much lower than those calculated with PSI-BLAST even though the alignment lengths are comparable. This may occur when one single-domain protein is homologous to each of the two domains in another protein. CE may align the first protein to either of the two domains of the second protein, but not necessarily the more closely related one. To account for this, we use the sequence identity obtained by either PSI-BLAST or CE, whichever has the larger number of identically aligned residue pairs.
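The selection between the two methods can be illustrated as follows. The `Alignment` record and the tie-breaking choice (keeping the PSI-BLAST result when both methods find the same number of identical pairs) are assumptions of this sketch; the text does not say how ties are handled.

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    method: str            # "PSI-BLAST" or "CE"
    identical_pairs: int   # number of identically aligned residue pairs
    aligned_length: int    # number of aligned positions

    @property
    def percent_identity(self) -> float:
        return 100.0 * self.identical_pairs / self.aligned_length


def choose_alignment(psiblast: Alignment, ce: Optional[Alignment]) -> Alignment:
    """Keep the alignment with more identically aligned residue pairs.

    Ties go to PSI-BLAST (an arbitrary choice made for this sketch).
    """
    if ce is None:
        return psiblast
    return ce if ce.identical_pairs > psiblast.identical_pairs else psiblast
```

In the two-domain case described above, a CE alignment to the less related domain yields fewer identical pairs than PSI-BLAST, so the PSI-BLAST identity is retained.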
With the sequence identities in hand, PISCES uses the method of Hobohm and Sander (18,19) to cull the sequences that pass the chain and entry criteria input by the user. The details have been described previously (11).
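As a rough illustration of the kind of greedy selection this involves, the sketch below visits chains from best to worst resolution and keeps a chain only if its identity to every chain already kept does not exceed the cut-off. This is one simple greedy variant written for illustration, not necessarily the exact procedure used by PISCES, whose details are given in (11). The chain labels and data layout are hypothetical.

```python
def cull(chains, identity, cutoff):
    """Greedy culling sketch.

    chains   -- list of (chain_id, resolution) tuples; lower resolution
                value means a better structure
    identity -- dict mapping frozenset({id_a, id_b}) to percent identity;
                missing pairs are treated as unrelated (0%)
    cutoff   -- maximum allowed percent identity between kept chains
    """
    kept = []
    for chain_id, _resolution in sorted(chains, key=lambda c: c[1]):
        if all(identity.get(frozenset((chain_id, k)), 0.0) <= cutoff for k in kept):
            kept.append(chain_id)
    return kept


# Hypothetical input: chains "a" and "b" are 90% identical.
chains = [("a", 2.5), ("b", 1.5), ("c", 2.0)]
identity = {
    frozenset(("a", "b")): 90.0,
    frozenset(("b", "c")): 20.0,
    frozenset(("a", "c")): 30.0,
}
culled_25 = cull(chains, identity, cutoff=25.0)
```

Because the best-resolution chains are considered first, the lower-resolution member of a redundant pair is the one that is dropped.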
Culling can be performed on a chain- or entry-level basis. Culling by-chain means treating each chain in each PDB entry as a separate entity. This is the standard procedure for creating culled PDB sequence lists. Based on requests from a number of users of PISCES, we have added another function to PISCES: culling ‘by-entry’. For this procedure, the sequence identity between any two entries is defined as the highest sequence identity of any one chain in one entry with any one chain in the other entry. In this way, no two entries will appear in the same list if they share chains with sequence identity over the cut-off. PISCES further allows the user to choose whether to cull within each entry and to use a separate sequence identity cut-off for this culling procedure. For instance, if an entry is a homodimer, the user can choose to have both sequences returned (no culling within entry) or just one of them (culling within entry by some value <100%).
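The by-entry definition amounts to taking a maximum over all chain pairs drawn from the two entries. A minimal sketch, with hypothetical chain labels and a hypothetical lookup table of pairwise chain identities:

```python
from itertools import product


def entry_identity(chains_a, chains_b, chain_identity):
    """Highest identity of any chain in one entry to any chain in the other.

    chain_identity maps frozenset({chain_x, chain_y}) to percent identity;
    pairs absent from the table are treated as unrelated (0%).
    """
    return max(
        (chain_identity.get(frozenset((a, b)), 0.0)
         for a, b in product(chains_a, chains_b)),
        default=0.0,
    )


# Hypothetical two-chain entry compared with a single-chain entry.
entry1 = ["e1_A", "e1_B"]
entry2 = ["e2_A"]
chain_identity = {
    frozenset(("e1_A", "e2_A")): 15.0,
    frozenset(("e1_B", "e2_A")): 72.0,
}
between_entries = entry_identity(entry1, entry2, chain_identity)
```

Here the two entries could not both appear in a list culled at any cut-off below 72%, even though one of the chain pairs is only 15% identical.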
PISCES first gives the user the option to: (i) cull the entire PDB; (ii) cull from a search at the PDB's re-engineered search engine site—this option takes the user to the PDB's web page (http://pdbbeta.rcsb.org) before returning to PISCES; (iii) cull from a user-input list of chains or entries; (iv) cull from a list of GenBank entries; and (v) cull from a FASTA-formatted set of sequences or from BLAST or PSI-BLAST output. For options (iv) and (v), sequence alignments are obtained by using PSI-BLAST on the list of input sequences to determine sequence identities. It is assumed that these sequences are not in the PDB and, therefore, structural criteria are not used.
Option (i) takes a user directly to a page for inputting structural quality criteria (experiment type, resolution, R-factor, Cα-status, chain lengths, etc.) and sequence identity cut-offs. Option (ii) takes a user to the RCSB site, or the user may start at the RCSB site directly. Once the RCSB server has returned a list of entries that satisfy the search parameters, the user can click on the External Sites→PISCES menu on the RCSB page to return to PISCES and to a page displaying the list of hits (all or selected) from RCSB. Option (iii) takes a user to a page for inputting a list of PDB entries or chains. After the list of PDB entries or chains to be searched is confirmed by the user for Options (ii) and (iii), he/she is asked for the structural criteria to be used for culling. PISCES then confirms the input data and asks for user name, institution and email address. When the results are ready, almost always within minutes, the server sends an email directing the user to a page from which the results can be downloaded. The results include a list of the input sequences (if used), the input cut-offs, the list of chains or entries that result from culling and a FASTA-formatted file of the selected sequences. These results are stored for 15 days.
In addition to the sequence culling service, PISCES also provides databases and programs that may prove useful to the user:
We investigated the effects of using structure-based alignments on the results of PISCES. On the PDB dataset, the number of related pairs was: 637 115 according to BLAST (E-value better than 1.0, alignment length >30) and 758 411 according to PSI-BLAST (E-value better than 1.0, alignment length >30). In Figure 1, we have plotted sequence identities and alignment lengths for PSI-BLAST versus BLAST and CE versus PSI-BLAST. The bottom two plots show that the PSI-BLAST alignments are generally longer than the BLAST alignments, and the CE alignments are longer than the PSI-BLAST alignments. The top two plots show that the sequence identities are lower in PSI-BLAST than in BLAST and lower in CE than in PSI-BLAST. As noted earlier, this is due to the fact that incomplete alignments result in artificially higher sequence identities, because the alignment occurs over the most-conserved regions of the two sequences. We have boxed a region of the top two plots to show pairwise relationships that would be used to cull at 30% sequence identity. For instance, in the upper right figure these points would be used to cull using PSI-BLAST-determined sequence identities but not CE-determined identities, since the points in the box have PSI-BLAST sequence identity >30% but CE sequence identity <30%. There are far more points in this box than in the corresponding rectangular box for inclusion by PSI-BLAST but not by CE (to the left and above the green box).
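The effect of alignment completeness on percent identity can be seen with a small arithmetic example. The numbers below are invented for illustration and are not taken from Figure 1: a partial alignment restricted to the conserved core reports a higher identity than a complete alignment of the same pair, which moves the pair across a 30% cut-off.

```python
def percent_identity(identical, aligned_length):
    """Percent identity = identical aligned pairs / aligned positions."""
    return 100.0 * identical / aligned_length


# Hypothetical pair: the partial alignment covers only the conserved core.
partial = percent_identity(identical=40, aligned_length=100)
# The complete alignment adds 100 more positions but only 10 more identities.
complete = percent_identity(identical=50, aligned_length=200)

cutoff = 30.0
removed_with_partial = partial > cutoff    # pair treated as redundant
removed_with_complete = complete > cutoff  # pair retained in the list
```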
In Table 1, we show the lengths of lists under various sequence identity cut-offs as determined by using BLAST, PSI-BLAST and the REPRDB server (5), which uses CLUSTAL W and structure alignments. CE within PISCES provides longer lists at low sequence identity than the sequence-based methods of BLAST and PSI-BLAST. At very high sequence identity, the results are based primarily on PSI-BLAST calculations and thus are very similar. The difference between CE and PSI-BLAST lies not only in the list length but also in the content of the lists. For example, for lists with a 25% sequence identity cut-off, the CE list has only 57 more chains than the PSI-BLAST list, but 157 chains are not shared between the two lists. REPRDB produces very short lists at low sequence identity for unknown reasons. Its lists are longer at high sequence identity, probably because the global alignment algorithm produces alignments over non-homologous regions of sequence pairs, thus lowering the sequence identity compared with a local alignment algorithm.
We now show an example of the improved annotations available in our pdbaa database. As shown in the Introduction, chain-specific information is sometimes missing in the FASTA-formatted database available from the RCSB. For 2HLA, our pdbaa provides these annotations:
After the chain identifier are the sequence length, experiment type, resolution, R-factor, free R-factor (not available in this case), the gene name, the database reference in angle brackets (in this case, Swiss-Prot entry B2MG_HUMAN) and the species in square brackets. While this information is available in the mmCIF file for 2HLA, it is not available in the FASTA sequence database provided by the RCSB.
Finally, we expect the link to PDB's search capabilities to be very useful. Since the establishment of this link, almost 10% of the requests to PISCES have come via the PDB search page. The re-engineered PDB page is only in beta test phase, and so we expect that PISCES will find additional utility when RCSB releases the final version of their search site.
We thank Dr Adrian A. Canutescu for advice on the web server implementation of PISCES. This work was funded by NIH Grants R01-HG02302 to R.L.D. and CA06972 to Fox Chase Cancer Center, as well as the Pennsylvania Tobacco Settlement and an appropriation from the Commonwealth of Pennsylvania. Funding to pay the Open Access publication charges for this article was provided by NIH.
Conflict of interest statement. None declared.