For many purposes, it is useful to have a list of protein sequences of known structure from entries in the Protein Data Bank (PDB) (1) that fit certain user-defined criteria. These purposes include statistical analysis of certain structural features, such as side-chain rotamer distributions (2), benchmarking sequence alignment and structure prediction methods (3), and analysis of a set of proteins, such as those with a certain function (e.g. DNA-binding) (4). The criteria for limiting the list of sequences can be applied to each chain in a PDB file, to each entry as a whole or to the relationships between protein chains and entries. Chain-specific criteria include function, species, sequence length and whether the chain is presented with only Cα atoms in the PDB file or all atoms. Entry-specific criteria include the function of a complex, resolution, R-factor, experiment type (NMR or X-ray) and other structure quality criteria. Pairwise criteria include the sequence or structural similarity between proteins in the set.
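The distinction between entry-specific and chain-specific criteria can be illustrated with a minimal sketch. The class names, field names, default thresholds and the entry IDs (`hyp1` to `hyp4`) below are all hypothetical and chosen only for illustration; they are not the data model or the defaults of any particular server.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chain:
    chain_id: str
    length: int
    ca_only: bool = False            # True if only C-alpha atoms are present


@dataclass
class Entry:
    pdb_id: str
    method: str                      # "X-ray" or "NMR"
    resolution: Optional[float]      # in Angstroms; None when not applicable
    r_factor: Optional[float]
    chains: List[Chain] = field(default_factory=list)


def entry_passes(entry, max_resolution=2.0, max_r_factor=0.25, allow_nmr=False):
    """Entry-specific criteria: experiment type, resolution and R-factor."""
    if entry.method != "X-ray":
        return allow_nmr and entry.method == "NMR"
    if entry.resolution is None or entry.r_factor is None:
        return False
    return entry.resolution <= max_resolution and entry.r_factor <= max_r_factor


def chain_passes(chain, min_length=40, allow_ca_only=False):
    """Chain-specific criteria: sequence length and C-alpha-only status."""
    if chain.ca_only and not allow_ca_only:
        return False
    return chain.length >= min_length


def select_chains(entries):
    """Apply entry-level filters first, then chain-level filters."""
    return [
        f"{e.pdb_id}_{c.chain_id}"
        for e in entries if entry_passes(e)
        for c in e.chains if chain_passes(c)
    ]


# Hypothetical entries, invented for this example.
entries = [
    Entry("hyp1", "X-ray", 1.8, 0.19, [Chain("A", 120), Chain("B", 25)]),
    Entry("hyp2", "X-ray", 3.1, 0.28, [Chain("A", 300)]),
    Entry("hyp3", "NMR", None, None, [Chain("A", 80)]),
    Entry("hyp4", "X-ray", 1.5, 0.20, [Chain("A", 150, ca_only=True)]),
]
print(select_chains(entries))  # ['hyp1_A']
```

Pairwise criteria are not covered by this sketch, since they depend on a comparison between chains rather than on the properties of a single chain or entry.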
There are a number of servers that provide such lists, including the REPRDB (5), ASTRAL (6) and UniqueProt (7) sites and the PDB site itself. Each provides a variety of services for specific purposes. They vary in a number of respects, including how sequence identities are determined and the flexibility and extent of services available to the user. In this paper, we describe extensions to our PISCES server that provide some advantages over other such servers, depending on the purposes of the user.
First, we believe that local alignments are generally better than global alignments for selecting sequences from the PDB, since many proteins share homology in one domain that may represent only a portion of the complete sequence. Servers that use CLUSTAL W (8), which performs global alignments, may underestimate sequence identity for shared domains by aligning over unrelated portions of two sequences. It is also important that sequence alignments cover the full length of the related regions of each protein pair. BLAST [and to a lesser extent PSI-BLAST (9)] may overestimate sequence identity for distantly related proteins by providing incomplete alignments over only the most-conserved regions of the sequence pair. Overestimating sequence identities results in lists that are shorter than necessary, since some pairs are removed because of the inaccurate estimate of sequence identity. At low sequence identity, structure-based alignments are likely to be more complete and accurate than those from any sequence-based alignment method (10). As described below, PISCES now uses a combination of structure alignments at low sequence identity and sequence alignments using PSI-BLAST at high sequence identity. Criteria other than strict sequence identity are sometimes used (7).
Second, sequence culling servers may provide services for a variety of purposes. [The verb to cull can be used in different senses. The original meaning (Webster's 1913) was to select or gather, as in flowers. More recently, the word is used in two ways: culling the weakest members of a herd, or culling the herd itself, leaving behind the healthiest animals. We are generally using it in the second sense, to cull the PDB by removing redundant or low-resolution structures.] In addition to culling the entire PDB, a user may wish to cull a subset of the PDB. For instance, a user may have a list of PDB entries that contain antibody structures and may want a subset of these with certain structural quality and mutual sequence identity. One can also apply the sequence identity cut-off ‘by-entry’ rather than ‘by-chain’. We define this as removing a PDB entry (which may contain more than one chain) from the list if another entry already in the list has at least one chain whose sequence identity to any chain in the proposed entry is higher than the cut-off. This might be useful if one wants a list of unique complexes in the PDB. PISCES provides both these services.
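The difference between culling by-chain and by-entry can be sketched as a greedy pass over an ordered list. This is only an illustration of the definition given above, not the PISCES implementation: the assumption that entries arrive already ranked best-first, the function names, the entry IDs and the identity values are all hypothetical.

```python
def make_identity(table):
    """Build a symmetric lookup; unlisted pairs default to 0% identity."""
    lookup = {frozenset(pair): value for pair, value in table.items()}
    return lambda a, b: lookup.get(frozenset((a, b)), 0.0)


def cull_by_chain(entries, identity, cutoff):
    """Keep a chain unless it exceeds the cut-off against a kept chain."""
    kept = []
    for _, chains in entries:
        for chain in chains:
            if all(identity(chain, k) <= cutoff for k in kept):
                kept.append(chain)
    return kept


def cull_by_entry(entries, identity, cutoff):
    """Drop a whole entry if any of its chains exceeds the cut-off
    against any chain of an entry that is already in the list."""
    kept = []
    for entry_id, chains in entries:
        redundant = any(
            identity(chain, kept_chain) > cutoff
            for _, kept_chains in kept
            for kept_chain in kept_chains
            for chain in chains
        )
        if not redundant:
            kept.append((entry_id, chains))
    return [entry_id for entry_id, _ in kept]


# Hypothetical data: E1 and E2 share one identical chain (B) but have
# dissimilar A chains; E3 is unrelated to both.
entries = [
    ("E1", ["E1_A", "E1_B"]),
    ("E2", ["E2_A", "E2_B"]),
    ("E3", ["E3_A"]),
]
identity = make_identity({
    ("E1_A", "E2_A"): 30.0,
    ("E1_B", "E2_B"): 100.0,
})

print(cull_by_chain(entries, identity, cutoff=90.0))
# ['E1_A', 'E1_B', 'E2_A', 'E3_A']
print(cull_by_entry(entries, identity, cutoff=90.0))
# ['E1', 'E3']
```

In this example, culling by-chain retains chain A of E2 because only its partner chain is redundant, whereas culling by-entry removes E2 altogether, which is the behaviour one would want when compiling a list of unique complexes.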
Another aspect of such a server is the kind of annotation provided in the results. Each server returns a list of chains and some servers are able to return a list of FASTA-formatted sequences. Usually, the annotation is simply the chain ID and sometimes what the RCSB provides in its ‘pdb_seqres.txt’ file (ftp://ftp.rcsb.org/pub/pdb/derived_data/pdb_seqres.txt). However, the RCSB does not provide consistent annotation for each chain in each PDB file with multiple chains. For instance, PDB entry 2HLA has two chains. Chain A is the human histocompatibility antigen HLA-Aw68.1. Chain B is β2-microglobulin. In the FASTA-formatted sequence database from the PDB, these two chains are given as:
- 2hla_A mol:protein length:270 Human Class I Histocompatibility Antigen Aw
- 2hla_B mol:protein length:99 Human Class I Histocompatibility Antigen Aw 6
In fact, the lines are unfortunately truncated, but even the annotation from the ‘Sequence Details’ resource for the entry 2HLA on RCSB's site has only this for Chain B: ‘HUMAN CLASS I HISTOCOMPATIBILITY ANTIGEN AW 68.1’. There is, however, more information in the full mmCIF file, including the name ‘beta-2-microglobulin’ as well as the Swiss-Prot entry accession number from which additional information may be obtained.
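The limitation of the header annotation can be seen by parsing the two lines quoted above. The following is a minimal sketch that assumes the header layout is exactly as shown in those two lines (optionally preceded by the FASTA ‘>’ character); the regular expression and the function name are ours and are not part of any RCSB tool.

```python
import re

# Layout assumed from the two quoted lines:
#   <pdb id>_<chain id> mol:<type> length:<n> <free-text description>
HEADER = re.compile(
    r"^>?(?P<pdb>[0-9A-Za-z]{4})_(?P<chain>[0-9A-Za-z]+)\s+"
    r"mol:(?P<mol>\S+)\s+length:(?P<length>\d+)\s*(?P<desc>.*)$"
)


def parse_seqres_header(line):
    """Split one header line into its fields; raise if it does not match."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {line!r}")
    fields = match.groupdict()
    fields["length"] = int(fields["length"])
    return fields


# The two header lines exactly as quoted in the text (already truncated).
lines = [
    "2hla_A mol:protein length:270 Human Class I Histocompatibility Antigen Aw",
    "2hla_B mol:protein length:99 Human Class I Histocompatibility Antigen Aw 6",
]
for line in lines:
    record = parse_seqres_header(line)
    print(record["chain"], record["length"], record["desc"])
```

Both chains receive essentially the same description, so nothing in the header of chain B identifies it as β2-microglobulin; chain-specific names have to come from another source, such as the full mmCIF file.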
Our PISCES server was developed from a previous version, referred to as ‘CulledPDB’, begun in 1999. CulledPDB used BLAST to determine sequence identities, while PISCES has used PSI-BLAST (11), building a position-specific scoring matrix or profile for each unique sequence in the PDB from a multi-round search of the non-redundant protein sequence database (12). PISCES also provides a resource for culling a user-input list of chains or entries from the PDB, or a set of non-PDB sequences provided either as a list of GenBank IDs or in FASTA format. We have shown that the use of PSI-BLAST profiles resulted in much longer lists than the use of BLAST (11), while avoiding the errors associated with a global alignment algorithm such as CLUSTAL W.
We have made several important improvements in PISCES that are described in this paper:
- We now use the structure alignment program CE (13) to refine the relationships between proteins found by PSI-BLAST and to recalculate their sequence identities.
- PISCES now uses CE-type sequence identity, i.e. the number of identical pairs divided by the number of aligned pairs, excluding gaps. PSI-BLAST, in contrast, calculates sequence identity as the ratio of the number of identical pairs to the full length of the alignment, including gaps. This change means that if two closely related proteins are aligned but one has a large insertion, the sequence identity will remain high. This may occur, for instance, if a long disordered loop is engineered out of a protein to facilitate crystallization. The new definition is a more conservative criterion for sequence culling, since sequence identities are higher than the PSI-BLAST values used previously, and more sequences are removed for a given target sequence identity.
- PISCES is now able to cull by-entry in addition to by-chain, as described above.
- PISCES now provides more annotation for each chain in the PDB than the PDB itself does. This annotation includes chain-specific gene names and functional and species information retrieved from the full mmCIF files of the PDB, and the Swiss-Prot and GenBank databases.
- From the re-engineered search engine of the PDB (now in beta release, http://pdbbeta.rcsb.org/pdb) (14), a single click on the External Sites→PISCES menu on the PDB's search result page transmits all the hits, or a selected subset of them, to the PISCES site for culling according to the user's criteria for structural quality and sequence identity cut-off.
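The two definitions of sequence identity contrasted in the list above can be compared on a toy alignment. The sketch below assumes a pairwise alignment given as two equal-length strings with ‘-’ marking gaps; the function name and the example sequences are invented for illustration and are not taken from CE or PSI-BLAST output.

```python
def identity_measures(aligned_a, aligned_b, gap="-"):
    """Return (ce_type, full_length) percent identities for one alignment.

    ce_type     = identical pairs / aligned pairs (gap columns excluded)
    full_length = identical pairs / alignment length (gap columns included)
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = 0
    aligned_pairs = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue                      # gap column: not an aligned pair
        aligned_pairs += 1
        if a == b:
            identical += 1
    if aligned_pairs == 0:
        return 0.0, 0.0
    ce_type = 100.0 * identical / aligned_pairs
    full_length = 100.0 * identical / len(aligned_a)
    return ce_type, full_length


# Toy case: the second sequence lacks a 10-residue loop present in the first.
seq_a = "ACDEFGHIKLMNPQRSTVWY"
seq_b = "ACDEF----------STVWY"
print(identity_measures(seq_a, seq_b))  # (100.0, 50.0)
```

With a cut-off of, say, 90%, the gap-excluding definition treats the two sequences as redundant, while the full-length definition does not. This is why the new definition removes more sequences for a given target sequence identity.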