Identifying nsSNPs in known genes
The necessary first step in the analysis of nsSNPs is to identify whether a given SNP is indeed non-synonymous. For this purpose we map SNPs onto known proteins on the basis of SNP DNA flanking sequences. Flanking genomic sequences of SNPs from HGVbase (13), each 25 bp long, were translated in all six possible frames and searched for in the human proteins subset of the SWALL database (15). Protein sequences and genomic fragments were pre-processed with the SEG (16), XNU (17), RepeatMasker (18) and DUST programs, which filter out areas of low compositional complexity, regions containing internal repeats of short periodicity and known human genomic repeat sequences. ALU subfamily proteins were also excluded from the set. We required that at least one translated flanking sequence have an exact match with a database protein sequence. If this match was detected, we further required that the second flanking sequence either had an exact match with the protein sequence or matched it in all positions until the end of the protein or a conventional exon/intron border was reached. The resulting mapping of a SNP onto a protein sequence is always unique.
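The six-frame translation and exact-match search described above can be sketched as follows. This is a minimal illustration, not the snp2prot implementation: the function names are hypothetical, the standard genetic code is assumed, and the filtering steps and the exon/intron border rule for the second flank are omitted.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def six_frame_translations(dna):
    """Return translations of the three forward and three reverse frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(s[off:]) for s in (dna, reverse) for off in range(3)]


def flank_matches_protein(flank, protein):
    """True if any stop-free frame translation occurs exactly in the protein."""
    return any(
        pep and "*" not in pep and pep in protein
        for pep in six_frame_translations(flank)
    )
```

A 25 bp flank yields peptides of seven or eight residues per frame, so an exact substring test against each protein sequence is sufficient for this step.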
The above procedure is available as a stand-alone World Wide Web-based program, snp2prot, which is linked from the main PolyPhen page. We also provide a link to the SNP annotation tool HNP (Y.Yuan, unpublished results).
After processing HGVbase v.12 (983 589 SNP entries), we obtained a set of 20 462 coding SNPs. Of these, 11 152 were non-synonymous, whereas 9310 were synonymous SNPs and do not produce any change in the amino acid sequence. The nsSNPs formed our dataset, which can be downloaded as one text file or searched against with a straightforward World Wide Web-based engine. The search results contain links to the other databases that provide additional information, e.g. chromosomal location of a nsSNP.
PolyPhen analysis of nsSNPs
Sequence-based characterisation of the substitution site. The substitution may occur at a specific site, e.g. active or binding, or in a non-globular, e.g. transmembrane, region. A query identifies the protein by its SWALL accession number or ID or by the sequence itself. In the latter case, PolyPhen tries to find the given sequence in the human subset of the SWALL database and use the FT (feature table) section of the corresponding entry. If the sequence cannot be found in the human subset of SWALL, this step is skipped. PolyPhen checks if the amino acid replacement occurs at a site that is annotated in the SWALL database feature table as DISULFID, THIOLEST or THIOETH bond, BINDING, ACT_SITE, LIPID, METAL, SITE or MOD_RES site or as a site located in a TRANSMEM, SIGNAL or PROPEP region.
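The feature table check can be illustrated with a short sketch. The feature keys are those listed above; the representation of a feature as a (key, start, end) tuple and the function name are assumptions made for illustration. Bond features are assumed to list the two bonded residues as their endpoints, so only the endpoints are tested for them.

```python
# Feature keys named in the text; the tuple layout is an assumption.
BOND_KEYS = {"DISULFID", "THIOLEST", "THIOETH"}
SITE_KEYS = {"BINDING", "ACT_SITE", "LIPID", "METAL", "SITE", "MOD_RES"}
REGION_KEYS = {"TRANSMEM", "SIGNAL", "PROPEP"}


def annotations_at(features, position):
    """Return the keys of all relevant features covering a residue position.

    `features` is a list of (key, start, end) tuples with 1-based,
    inclusive residue numbers.
    """
    hits = []
    for key, start, end in features:
        if key in BOND_KEYS:
            # A bond involves only its two endpoint residues.
            covered = position in (start, end)
        elif key in SITE_KEYS or key in REGION_KEYS:
            covered = start <= position <= end
        else:
            covered = False  # feature types not used at this step
        if covered:
            hits.append(key)
    return hits
```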
PolyPhen also uses the TMHMM (19) algorithm to predict transmembrane regions, the Coils2 (20) program to predict coiled coil regions and the SignalP (21) program to predict signal peptide regions of the protein sequences.
For a substitution in a transmembrane region, PolyPhen uses the PHAT (22) transmembrane-specific matrix score to evaluate the possible functional effect of the nsSNP.
At this step PolyPhen memorises all positions that are annotated in the query protein as BINDING, ACT_SITE, LIPID or METAL. At a later stage, if the search for a homologous protein with known 3D structure is successful, it is checked whether the substitution site is in spatial contact with these critical residues.
Profile analysis of homologous sequences.
The amino acid replacement may be incompatible with the spectrum of substitutions observed at that position in a family of homologous proteins. PolyPhen identifies homologues of the input sequences via a BLAST (23) search of the NRDB database. The set of aligned sequences with sequence identity to the input sequence in the range 30–94% (inclusive) is used by the new version of the PSIC (position-specific independent counts) software (24) to calculate the so-called profile matrix (http://strand.imb.ac.ru/PSIC/). Elements of the matrix (profile scores) are logarithmic ratios of the likelihood of a given amino acid occurring at a particular site to the likelihood of this amino acid occurring at any site (background frequency). PolyPhen computes the absolute value of the difference between profile scores of both allelic variants in the polymorphic position. PolyPhen also shows the number of aligned sequences at the query position; this may be used to assess the reliability of profile score calculations.
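The definition of a profile score and of the score difference can be written out in a few lines. This is a simplified sketch: the likelihoods are taken as given inputs, whereas PSIC derives them from the alignment with its own sequence weighting, and the natural logarithm is an assumption. The function names are hypothetical.

```python
import math


def profile_score(site_likelihood, background_frequency):
    """Log ratio of the likelihood of an amino acid at a given site to its
    background frequency (natural logarithm assumed)."""
    return math.log(site_likelihood / background_frequency)


def profile_score_difference(score_allele_1, score_allele_2):
    """Absolute difference between the profile scores of the two alleles."""
    return abs(score_allele_1 - score_allele_2)
```

A variant that is rare at a position where the other allele is strongly preferred gives a large difference, irrespective of which allele is taken as the reference.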
Mapping of the substitution site to known protein 3-dimensional structures. Mapping of an amino acid replacement to a known 3D structure reveals whether the replacement is likely to destroy the hydrophobic core of a protein, electrostatic interactions, interactions with ligands or other important features of a protein. If the spatial structure of a query protein is unknown, one can use a homologous protein of known structure.
PolyPhen carries out a BLAST query of a sequence against a protein structure database [PDB (25) or PQS (26), see below] and retains all hits that meet the given criteria. For instance, the default sequence identity threshold is set to 50%, since this value guarantees the conservation of basic structural characteristics. Minimal hit length and maximal length of gaps are by default set to 100 and 20, respectively. The position of the substitution is then mapped onto the corresponding positions in all retained hits. By default, a hit with 3D structure is rejected if its amino acid at the position under study differs from the amino acid in the input sequence. Hits are sorted according to the sequence identity or E-value of the sequence alignment with the input protein.
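The default hit selection criteria can be summarised in a small filter. The sketch below is illustrative only: the Hit record and its field names are hypothetical, the thresholds are treated as inclusive (an assumption), and sorting is shown by sequence identity only.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    structure_id: str      # hypothetical identifier of the structure entry
    identity: float        # sequence identity to the query, in percent
    length: int            # length of the aligned hit, in residues
    max_gap: int           # length of the longest gap in the alignment
    residue_at_site: str   # amino acid of the hit at the mapped position


def select_hits(hits, query_residue, min_identity=50.0, min_length=100, max_gap=20):
    """Keep hits meeting the default criteria, best identity first."""
    kept = [
        h for h in hits
        if h.identity >= min_identity
        and h.length >= min_length
        and h.max_gap <= max_gap
        and h.residue_at_site == query_residue  # reject mismatching residue
    ]
    return sorted(kept, key=lambda h: h.identity, reverse=True)
```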
Structural parameters used to evaluate the effect of amino acid substitution.
Structural analysis performed by PolyPhen is based on the use of several structural parameters, as suggested previously (7). Importantly, although all parameters are reported in the output, only some of them are used in the final decision rules.
PolyPhen uses the DSSP (27) database to obtain the following structural parameters for the mapped amino acid residues: secondary structure (according to the DSSP nomenclature); solvent accessible surface area (absolute value in Å²); φ–ψ dihedral angles.
The following values are also calculated by PolyPhen: normalised accessible surface area [the absolute value divided by the maximal area defined as the 99% quantile of surface area distribution for this particular amino acid type in PDB (25)]; change in accessible surface propensity (knowledge-based hydrophobic ‘potentials’) resulting from the substitution; change in residue side chain volume (in Å³); region of the φ–ψ map (Ramachandran map) derived from the dihedral angles (9); normalised B factor (temperature factor) for the residue [following Chasman and Adams (9)]; loss of a hydrogen bond [following Wang and Moult (8)] according to the HBplus program (28).
By default, the parameters above are calculated for the first hit only.
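Two of the derived parameters, the normalised accessible surface area and the change in side chain volume, are simple enough to sketch. The code below assumes that the surface areas observed for one amino acid type are available as a list, uses the default interpolation of Python's statistics.quantiles for the 99% quantile, and takes the volume change as variant minus wild type; all three choices, and the function names, are assumptions.

```python
import statistics


def max_area_99(observed_areas):
    """99% quantile of the surface areas observed for one amino acid type."""
    # quantiles(..., n=100) returns 99 cut points; the last is the 99% one.
    return statistics.quantiles(observed_areas, n=100)[98]


def normalised_asa(absolute_area, max_area):
    """Absolute accessible surface area divided by the maximal area."""
    return absolute_area / max_area


def volume_change(volume_wild_type, volume_variant):
    """Change in side chain volume caused by the substitution."""
    return volume_variant - volume_wild_type
```

Because the maximal area is a quantile rather than the true maximum, a small fraction of residues will have a normalised area slightly above 1.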
Contacts with ‘critical sites’, ligands and other polypeptide chains.
The presence of specific spatial contacts of a residue may reveal its role in protein function. PolyPhen checks three types of contacts for a variable amino acid residue. First, contacts with ligands (defined as all heteroatoms excluding water and ‘non-biological’ crystallographic ligands). Second, interactions between subunits of the protein molecule. Technically these are defined as contacts of a polymorphic residue with residues from other polypeptide chains present in the PDB (PQS) file. For this particular type of interaction, it is more advantageous to use the PQS (Protein Quaternary Structure) database (26) rather than PDB, since PQS entries are supposed to provide a more adequate picture of protein quaternary structure architecture.
The third type of contact analysed by PolyPhen is represented by contacts with ‘critical’ residues, where the latter are derived from the sequence annotation. The suggested default threshold for all contacts to be displayed in the output is 6 Å. However, a value of 3 Å is used in the decision rule. To evaluate a contact between two residues or between a residue and a ligand molecule, PolyPhen finds the minimal distance among all possible pairs of atoms of the two partners. By default, contacts are calculated for all hits with structure. This is essential for cases where several structures correspond to one protein but carry different information about complexes with other macromolecules and ligands (see for example figure in ref. 7).
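The contact evaluation described above reduces to a minimal inter-atomic distance compared against two cut-offs. The sketch below assumes that atoms are given as (x, y, z) coordinate tuples in Å and that the cut-offs are inclusive; the function and constant names are hypothetical.

```python
import math

DISPLAY_CUTOFF = 6.0   # Å, contacts shown in the output
DECISION_CUTOFF = 3.0  # Å, contacts used in the decision rule


def minimal_distance(atoms_a, atoms_b):
    """Smallest distance over all atom pairs between two residues or between
    a residue and a ligand; atoms are (x, y, z) tuples in Å."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def classify_contact(atoms_a, atoms_b):
    """Report the minimal distance and which cut-offs it satisfies."""
    d = minimal_distance(atoms_a, atoms_b)
    return {
        "distance": d,
        "displayed": d <= DISPLAY_CUTOFF,
        "used_in_rule": d <= DECISION_CUTOFF,
    }
```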
Figure 2. Results of the PolyPhen analysis of the HGVbase database v.12. hs_swall denotes the Homo sapiens subset of the SWALL database. snp2prot is an in-house command line tool to map HGVbase SNPs onto sequences of known human proteins. 11 152 nsSNPs ...
Prediction rules.
PolyPhen uses empirically derived rules (Table ) to predict that an nsSNP is damaging, i.e. is supposed to affect protein function, or benign, i.e. most likely lacking any phenotypic effect. The rules are based on an analysis of the ability of various structural parameters and profile scores to discriminate between disease mutations and substitutions between human proteins and closely related mammalian orthologues (7). We introduced two categories of prediction: nsSNPs possibly damaging protein function/structure and nsSNPs probably damaging protein function/structure. The scheme presented in Table successfully predicts ~82% (~57% for the more stringent set of rules) of disease-causing mutations annotated in the SwissProt database (14) and produces ~8% (~3% for the more stringent set of rules) false positives given the control set of between-species substitutions. We note that many parameters, though computed by the server, were excluded from the decision rule: owing to their correlation with other parameters, they did not help to increase sensitivity without a significant loss of specificity. Multiple alignment-based profile scores provided the major contribution to the prediction. Therefore, even in the case of proteins with no homologue of known 3D structure, predictions remain reasonably reliable.
Rules used by PolyPhen to predict effect of nsSNPs on protein function and structure