Delta alignment score
In pairwise sequence alignments, alignment scores can be used as a measure of sequence similarity to assess how likely the sequence pairs are homologous or related. In keeping with this idea, one can interpret a change in the alignment score caused by an amino acid variation as the impact of the variation on protein function. Specifically, given a protein A, let us assume there is a homologous protein B which is functional. To measure the effect of a variation on protein A, we can measure the similarity of protein A to B before and after the introduction of the variation. Our assumption is that a variation that reduces the similarity of protein A to the functional homolog protein B is more likely to cause a damaging effect. For this purpose, we suggest a change in the “alignment score” to be used as a measure of change in “similarity” caused by a variation.
To quantify the degree of impact of a variation on protein function, we define a delta alignment score
(or simply delta score
) of a protein query sequence
and its variation
with respect to another protein subject sequence
as the change in semi-global alignment score (i.e., no penalty on end gaps in global alignment 
. More formally,
is the variant sequence of
is the semi-global alignment score between two protein sequences
, which is computed based on a given amino acid substitution matrix (e.g. BLOSUM62) and gap penalties.
The delta score can be used to measure the effect of a variation. That is, low delta scores are interpreted as amino acid variations leading to a deleterious effect on protein function (), while high delta scores are interpreted as variations with neutral effect on protein function (). Since the delta score is computed from alignment scores and that the alignment scores are computed based on a substitution matrix, the delta score approach has advantages over other tools as described below.
Figure 1 Examples of computing and interpreting delta alignment scores for six different known variations, (A) deleterious substitution (MIM:151623), (B) neutral substitution (dbSNP:rs1042522), (C) deleterious deletion (MIM:219700), (D) neutral deletion (dbSNP:rs72471101), (more ...)
First, the delta score approach naturally utilizes a substitution matrix which implicitly captures information on the substitution frequency and chemical properties of 20 amino acid residues. Given a single amino acid substitution, if the reference amino acid residue is found to be conserved or similar to the aligned amino acid in a homologous sequence and that the frequency of substitution from the reference residue to the variant residue in question is low based on the substitution matrix score, then the substitution will produce a low delta score which suggests a deleterious effect of the substitution (). Conversely, if the variant amino acid residue instead of the reference residue is found to be similar to the aligned amino acid in the homologous sequence, then the substitution will produce a high delta score to suggest a neutral effect of the variation (, Homolog 1).
Second, the delta score is not only determined by the amino acid position where the variation is observed but can also be determined by the neighborhood that surrounds the site of variation (i.e., sequence context). The delta score is computed from alignment scores that encompass regions flanking both sides of the site of variation. In the scenario when an amino acid variation does not cause a change in the flanking sequence alignment (e.g. in ungapped regions, , Homolog 1), the delta score is simply determined by looking up two values from the substitution matrix scores and computing their differences (e.g. a BLOSUM62 score of “6” for a G→G change and a score of “-3” for a C→G change as shown in ). In a different scenario when an amino acid variation causes a change in the sequence alignment in the neighborhood area of the site of variation (e.g. in gapped regions, , Homolog 2) or when the neighborhood area is aligned with gaps (, Homolog 3), the delta score is determined by the alignment scores derived from the flanking regions. In such cases, existing tools which base on frequency distribution or identity count of the aligned amino acids can be misled by the poorly aligned residues in a gapped alignment (, Homolog 2), or simply cannot make use of the homologous protein alignment because no amino acid can be aligned to derive count statistics (, Homolog 3).
Finally, the most important advantage of our method is that the delta score approach considers alignment scores derived from the neighborhood regions and therefore can be directly extended to all classes of sequence variations including indels and multiple amino acid replacements. That is, the delta scores for other types of amino acid variations are computed in the same way as for single amino acid substitutions. In the case of amino acid insertion or deletion, the amino acids are inserted into or removed respectively from the variant sequence prior to performing the pair-wise sequence alignment and computing the alignment scores and delta score (). Using the delta alignment score approach, PROVEAN was developed to predict the effect of amino acid variations on protein function. An overview of the PROVEAN procedure is shown in . The algorithm consists of (1) collection of homologous sequences, and (2) computation of an “unbiased averaged delta score” for making a prediction (See Methods
for details). As an example, PROVEAN scores were computed for the human protein TP53 for all possible single amino acid substitutions, deletions, and insertions along the entire length of the protein sequence to demonstrate that PROVEAN scores indeed reflect and negatively correlate with amino acid conservation (Figure S1
An overview of the PROVEAN procedure.
New prediction tool PROVEAN
To test the predictive ability of PROVEAN, reference datasets were obtained from annotated protein variations available from the UniProtKB/Swiss-Prot database. For single amino acid substitutions, the “Human Polymorphisms and Disease Mutations” dataset (Release 2011_09) was used (will be referred to as the “humsavar”). In this dataset, single amino acid substitutions have been classified as disease variants (n
20,821), common polymorphisms (n
36,825), or unclassified. For the reference dataset, we assumed that the human disease variants will have deleterious effects on protein function and common polymorphisms will have neutral effects. Since the UniProt humsavar dataset only contains single amino acid substitutions, additional types of natural variation, including deletions, insertions, and replacements (in-frame substitution of multiple amino acids) of length up to 6 amino acids, were collected from the UniProtKB/Swiss-Prot database. Each variant in this dataset was annotated in-house as deleterious, neutral, or unknown based on keywords found in the description provided in the UniProt record (see Methods
). A total of 729, 171, and 138 human protein variations of deletions, insertions, and replacements were collected, respectively. The number of UniProt human protein variants used in the predictability test is shown in .
Number of human protein variations collected from the UniProt/Swiss-Prot database.
The PROVEAN tool was applied to the above dataset to generate a PROVEAN score for each variant. As shown in , the score distribution shows a distinct separation between the deleterious and neutral variants for all classes of variations. This result shows that the PROVEAN score can be used as a measure to distinguish disease variants and common polymorphisms.
PROVEAN score distribution for deleterious and neutral human protein variations.
To optimize the predictive ability of PROVEAN for binary classification (the classification property is being deleterious), a PROVEAN score threshold was chosen to allow for the best balanced separation between the deleterious and neutral classes, that is, a threshold that maximizes the minimum of sensitivity and specificity. In the UniProt human variant dataset described above, the maximum balanced separation is achieved at the score threshold of −2.282. With this threshold the overall balanced accuracy was 79% (i.e., the average of sensitivity and specificity) (). The balanced separation and balanced accuracy were used so that threshold selection and performance measurement will not be affected by the sample size difference between the two classes of deleterious and neutral variations. The default score threshold and other parameters for PROVEAN (e.g. sequence identity for clustering, number of clusters) were determined using the UniProt human protein variant dataset (see Methods
Prediction accuracy for the UniProt human protein variations given a PROVEAN score threshold of −2.282.
To determine whether the same parameters can be used generally, non-human protein variants available in the UniProtKB/Swiss-Prot database including viruses, fungi, bacteria, plants, etc. were collected. Each non-human variant was annotated in-house as deleterious, neutral, or unknown based on keywords in descriptions available in the UniProt record. When applied to our UniProt non-human variant dataset, the balanced accuracy of PROVEAN was about 77%, which is as high as that obtained with the UniProt human variant dataset ().
Number of the UniProt non-human protein variations and prediction accuracy given a PROVEAN score threshold of −2.282.
As an additional validation of the PROVEAN parameters and score threshold, indels of length up to 6 amino acids were collected from the Human Gene Mutation Database (HGMD) and the 1000 Genomes Project (, see Methods
). The HGMD and 1000 Genomes indel dataset provides additional validation since it is more than four times larger than the human indels represented in the UniProt human protein variant dataset (), which were used for parameter selection. The average and median allele frequencies of the indels collected from the 1000 Genomes were 10% and 2%, respectively, which are high compared to the normal cutoff of 1–5% for defining common variations found in the human population. Therefore, we expected that the two datasets HGMD and 1000 Genomes will be well separated using the PROVEAN score with the assumption that the HGMD dataset represents disease-causing mutations and the 1000 Genomes dataset represents common polymorphisms. As expected, the indel variants collected from the HGMD and 1000 genome datasets showed a different PROVEAN score distribution (). Using the default score threshold (−2.282), the majority of HGMD indel variants were predicted as deleterious, which included 94.0% of deletion variants and 87.4% of insertion variants. In contrast, for the 1000 Genome dataset, a much lower fraction of indel variants was predicted as deleterious, which included 40.1% of deletion variants and 22.5% of insertion variants.
Number of deletions and insertions collected from the Human Gene Mutation Database and the 1000 Genomes Project and used for validation of PROVEAN.
PROVEAN score distribution of deletions and insertions collected from the Human Gene Mutation Database (HGMD) and the 1000 Genomes Project.
Comparison with other tools for single amino acid substitutions
Many tools exist to predict the damaging effects of single amino acid substitutions, but PROVEAN is the first to assess multiple types of variation including indels. Here we compared the predictive ability of PROVEAN for single amino acid substitutions with existing tools (SIFT, PolyPhen-2, and Mutation Assessor). For this comparison, we used the datasets of UniProt human and non-human protein variants, which were introduced in the previous section, and experimental datasets from mutagenesis experiments previously carried out for the E.coli LacI protein and the human tumor suppressor TP53 protein.
For the combined UniProt human and non-human protein variant datasets containing 57,646 human and 30,615 non-human single amino acid substitutions, PROVEAN shows a performance similar to the three prediction tools tested. In the ROC (Receiver Operating Characteristic) analysis, the AUC (Area Under Curve) values for all tools including PROVEAN are ~0.85 (). The performance accuracy for the human and non-human datasets was computed based on the prediction results obtained from each tool (, see Methods
). As shown in , for single amino acid substitutions, PROVEAN performs as well as other prediction tools tested. PROVEAN achieved a balanced accuracy of 78–79%. As noted in the column of “No prediction”, unlike other tools which may fail to provide a prediction in cases when only few homologous sequences exist or remain after filtering, PROVEAN can still provide a prediction because a delta score can be computed with respect to the query sequence itself even if there is no other homologous sequence in the supporting sequence set.
ROC curves of four different prediction tools for single amino acid substitutions found in human and non-human proteins. All tools show a similar predictive ability with the AUC value of ~0.85.
Binary classification performance of four different tools for single amino acid substitutions in human and non-human proteins.
We also compared PROVEAN with other prediction tools using two experimental datasets obtained from mutagenesis experiments that had been performed on LacI and TP53 (see Methods
). shows that PROVEAN provides a good overall performance consistently and is among the top two best performers for both datasets based on the AUC values.
ROC curves of different prediction tools for single amino acid substitutions in (A) E. coli lac repressor protein and (B) human TP53 tumor suppressor protein.
In addition, we also tested Condel, a tool which provides a consensus prediction by combining results from multiple prediction tools, for the human protein variant dataset used in this study (see Methods
). Condel provided a balanced accuracy of 70% and 76% respectively, when using the published default threshold (0.469) and a threshold selected to maximize the balanced accuracy (0.790) (Table S1
Delta score correlates with biological activity of TP53 and ABCA1 variations
In addition to binary classification, we also investigated if the PROVEAN score can be used for predicting the degree of deleteriousness of a protein variation. Two experimental datasets obtained from the TP53 and ABCA1 (ATP-binding cassette transporter 1 protein) genes were examined, both of which also included the corresponding functional activity levels measured for each mutation found in the protein sequence 
For the TP53 variation dataset, the single amino acid variations were divided into 15 classes based on a functional assay which measured the median transactivation level of the mutant protein. The distribution of the PROVEAN scores was computed for each class. As shown in and Figure S2
, the PROVEAN score increases and correlates with the reported transactivation level, especially for those classified as either non-functional or partially functional.
PROVEAN score distribution of TP53 variations binned into 15 classes based on transactivation levels.
For the ABCA1 variation dataset, cholesterol efflux was measured in 17 mutants and wild-type to assess ABCA1 functional activity. PROVEAN scores were generated for all ABCA1 mutants. A total of 15 mutants (88%) were correctly predicted by PROVEAN with reference to the UniProt disease versus common polymorphism classification of the amino acid variants (Table S2
). shows that the PROVEAN score increases and correlates with the level of cholesterol efflux (Pearson's correlation coefficient of 0.74). In general, an increase in score correlates with an increased cholesterol efflux activity.
Correlation of cholesterol efflux values with the PROVEAN scores for ABCA1 variations.