UDP glucuronosyltransferases (UGTs) are an important class of Phase II enzymes involved in the metabolism and detoxification of numerous xenobiotics, including therapeutic drugs, and of endogenous compounds (e.g. bilirubin). To date, 21 human UGT genes have been identified, and most of them contain single-nucleotide polymorphisms (SNPs). Non-synonymous SNPs (nsSNPs) of the human UGT genes may cause absent or reduced enzyme activity, and UGT polymorphisms have been found to be closely related to altered drug clearance and/or drug response, hyperbilirubinemia, Gilbert’s syndrome, and Crigler-Najjar syndrome. However, it is impractical to study the functional impact of all identified nsSNPs in humans using laboratory approaches because of their large number. We have investigated the potential of a bioinformatics approach for predicting phenotype from known nsSNPs. We identified a total of 248 nsSNPs in human UGT genes. Two algorithms, sorting intolerant from tolerant (SIFT) and polymorphism phenotyping (PolyPhen), were used to predict the impact of these nsSNPs on protein function. SIFT classified 35.5% of the UGT nsSNPs as “deleterious”, while PolyPhen identified 46.0% of the UGT nsSNPs as “potentially damaging” or “damaging”. The results from the two algorithms were highly associated. Among 63 functionally characterized nsSNPs in the UGTs, 24 showed altered enzyme expression/activities and 45 were associated with disease susceptibility. SIFT and PolyPhen had correct prediction rates of 57.1% and 66.7%, respectively. These findings demonstrate the potential use of bioinformatics techniques to predict genotype–phenotype relationships, which may constitute the basis for future functional studies.
The online version of this article (doi:10.1208/s12248-009-9126-z) contains supplementary material, which is available to authorized users.
The UDP glucuronosyltransferases (UGTs) represent a superfamily of endoplasmic-reticulum-bound enzymes that catalyze glucuronidation, a process that increases the polarity of xenobiotics and some endogenous compounds to facilitate their excretion via the bile or urine (1). Glucuronidation accounts for ~35% of all drugs metabolized by Phase II drug-metabolizing enzymes and therefore plays an important role in the detoxification and excretion of drugs and/or their metabolites (1,2). Additionally, UGTs play a critical role in the disposition of several important endogenous substrates, including bilirubin, bile acids, steroids, thyroxine, biogenic amines, and fat-soluble vitamins (3,4).
To date, 21 human UGTs have been identified (http://www.ugtalleles.ulaval.ca, access date: 11 June 2009). Based on sequence identities, human UGTs comprise the UGT1, 2, 3, and 8 families (3,5). The UGT1 locus in humans has been mapped to chromosome 2q37 (3,6). All UGT1 genes contain unique first exons, which are subsequently spliced to common exons 2 through 5, leading to different N-terminal halves but identical C-terminal halves of the gene products (3). Unlike the UGT1 family, the human UGT2 mRNAs are transcribed from separate genes and are divided into the UGT2A and 2B subfamilies (4,7). In humans, the genes encoding several UGT2 enzymes form a cluster on chromosome 4 (UGT2B7-UGT2B4-UGT2B15; 4). The UGT3 family has two members, namely UGT3A1 and 3A2, and the UGT8 family has one member.
In the human genome, the most frequent type of DNA variation is the single base change, namely the single-nucleotide polymorphism (SNP) (8,9). Currently, a total of 14,708,752 SNPs in the human genome have been identified and deposited in NCBI dbSNP (http://www.ncbi.nlm.nih.gov/sites/entrez, Build 129). A non-synonymous SNP (nsSNP), or missense variant, is a single base change in a coding region that leads to an amino acid substitution in the corresponding protein (8,10,11). Many nsSNPs do not cause any functional change in the corresponding protein and are regarded as tolerated nsSNPs (9,12). On the other hand, nsSNPs that do cause clinical phenotypic consequences in individuals (e.g. disease or biochemical abnormality) are termed deleterious nsSNPs (9,12).
In the case of UGTs, deleterious nsSNPs have been shown to alter bilirubin metabolism, drug disposition, and predisposition to diseases in individuals carrying the polymorphism (13,14). Variation in UGT1A coding sequences may lead to deficiency of UGT1A1 activity, often resulting in high levels of unconjugated bilirubin, which is closely related to diseases such as Crigler-Najjar syndrome (CN; 15) and Gilbert’s syndrome (16). Nonsense mutations in the UGT1 gene resulting in a premature termination codon and absent UGT1A1 activity have been reported in CN patients (17,18). Furthermore, a genetic variation in the UGT1A1 promoter (*28) is associated with increased drug toxicity and hyperbilirubinemia in cancer patients receiving irinotecan-based chemotherapy (19–22), and reports also show an increased risk of tranilast-induced hyperbilirubinemia (23). UGT1A7*3, which leads to low enzyme activity, has been linked to an increased risk of cancer of the colon (24), mouth (25,26), esophagus (26), stomach (26), and liver (27). A number of UGT2 genetic variants have been shown to relate to altered drug metabolism and disease risk (1,28,29). For example, a UGT2B7 promoter variant is associated with significantly reduced glucuronidation of morphine in sickle cell disease and contributes to the variability in hepatic clearance of morphine in these patients (28). A UGT2B15 variant has been shown to relate to prostate cancer (30).
The identification of deleterious nsSNPs in the human genome is important. This process relates human phenotypes to variation at the DNA level and links genetic and phenotypic differences in individuals; it is believed that nsSNPs and SNPs in regulatory regions together have the highest impact on phenotype (8). The identification and characterization of SNPs of major drug-metabolizing enzymes such as UGTs are crucial for understanding individual differences in drug metabolism, therapeutic efficacy, and inherited diseases, and also for the predisposition toward diseases such as cancer (13). However, humans have vast numbers of SNPs, and at present it is impractical to investigate the functional impact of all of them experimentally. As an alternative, bioinformatics approaches are increasingly used to predict the functional effect of SNPs. Two algorithms, sorting intolerant from tolerant (SIFT) and polymorphism phenotyping (PolyPhen), are powerful bioinformatic tools that enable high-throughput prediction of the potential impact of nsSNPs and large-scale polymorphism analyses. This process has the potential to reduce the number of SNPs that may have clinical implications and need to be evaluated in future clinical studies.
The underlying theory of SIFT has been developed and illustrated by Ng and Henikoff (9,31,32); the algorithm predicts the effect of an amino acid substitution on protein function from sequence homology and the physical properties of amino acids. SIFT examines sequence similarity among related genes and does not require protein structural or functional information; therefore, it can be applied to a much larger number of proteins (9). The PolyPhen algorithm is a structure- and sequence-based amino acid substitution prediction method; it utilizes the data available in Swiss-Prot (http://au.expasy.org/sprot/userman.html) and predicts the possible impact of amino acid substitutions on the structure and function of a human protein (8). Mapping of an amino acid replacement to a known 3D structure reveals whether the replacement is likely to destroy the hydrophobic core of a protein, electrostatic interactions, interactions with ligands, or other important features of a protein (8,33). Several programs, including the TMHMM algorithm, the Coils2 program, the SignalP program, and PHAT transmembrane-specific matrix scores, are also used to predict the possible functional effect of nsSNPs (8,34–37). Multiple alignment-based profile scores provide the major contribution to the prediction; therefore, reliable prediction can reasonably be achieved even if the protein has no known 3D structure (8). In this study, we investigated the potential effect of known human UGT nsSNPs on protein function using both the SIFT and PolyPhen algorithms.
The human UGT genes examined in this study were named in agreement with the UGT Nomenclature Committee (http://www.ugtalleles.ulaval.ca/, access date: 11 June 2009). The data on human UGT genes were collected from Entrez Gene on the NCBI website (http://www.ncbi.nlm.nih.gov/sites/entrez, access date: 11 June 2009). Expired and merged gene names were excluded from the study. The majority of the variants included in this analysis were identified during the screening of 21 human UGT genes in MutDB and PolyDoms. MutDB integrates genetic variants from Swiss-Prot and dbSNP and links to functional disruption prediction scores. PolyDoms incorporates the results of multiple algorithmic procedures and functional criteria applied to the entire Entrez dbSNP dataset. Information including gene symbol, gene name, mRNA accession number (NM), protein accession number (NP), SNP ID, amino acid residue 1 (wild-type), amino acid position, and amino acid residue 2 (missense) was collected. Most of the data and information describing the genes and variants, including Entrez Gene ID, are available at http://www.mutdb.org and http://www.polydoms.cchmc.org/polydoms. Supplementary variants were identified from Entrez Gene on NCBI (http://www.ncbi.nlm.nih.gov/sites/entrez, access date: 11 June 2009) and added to the dataset after cross-examination. The information on the effect of the nsSNPs on enzyme activity and on the correlation between the nsSNPs and disease was extracted from published in vivo and in vitro experiments (e.g. site-directed mutagenesis analysis).
The effect of each variant amino acid substitution on protein function was predicted for the identified UGT nsSNPs using SIFT (http://blocks.fhcrc.org/sift/SIFT.html) and PolyPhen (http://genetics.bwh.harvard.edu/pph/); the nsSNPs and predictions are listed in Supplementary Table 1.
In this study, SIFT version 3 was used. This algorithm uses a query sequence to search for similar sequences that may share similar function, generates an alignment of the chosen sequences, and calculates a probability score for the impact of an amino acid substitution on protein function. SIFT scores range from 0 to 1; scores of 0.00–0.05 are classified as intolerant, 0.051–0.10 as potentially tolerant, 0.101–0.20 as borderline, and 0.201–1.00 as tolerant (31,38). Information such as the dbSNP ID and GI number of an amino acid substitution can be supplied to SIFT to predict the effect on protein function. The algorithm and instructions for analysis of amino acid substitutions are available at http://www.blocks.fhcrc.org/sift/SIFT.html. The data employed in the SIFT analysis were the UGT gene sequences available in the NCBI non-redundant database (http://www.ncbi.nlm.nih.gov, access date: 11 June 2009). Orthologous sequences were used because paralogous sequences confound predictions (32).
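The SIFT score bands described above can be expressed as a small lookup. The following Python sketch is illustrative only: the function name `classify_sift` is hypothetical and is not part of SIFT itself, and the handling of values that fall between the published bands (e.g. 0.0505) is an assumption.

```python
def classify_sift(score: float) -> str:
    """Map a SIFT score in [0, 1] to the category bands quoted in the text.

    Hypothetical helper for illustration; SIFT itself only reports the score.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"SIFT scores lie in [0, 1], got {score}")
    if score <= 0.05:
        return "intolerant"
    if score <= 0.10:
        return "potentially tolerant"
    if score <= 0.20:
        return "borderline"
    return "tolerant"


if __name__ == "__main__":
    # Made-up scores, one per band.
    for s in (0.00, 0.07, 0.15, 0.60):
        print(s, classify_sift(s))
```

Upper bounds are treated as inclusive so that a score reported as exactly 0.05 falls in the "intolerant" band, matching the 0.00–0.05 range given above.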
PolyPhen uses empirically derived rules, based on previous research in protein structure, interaction, and evolution, to predict automatically whether a replacement is likely to be deleterious for the protein on the basis of three-dimensional structure and multiple alignments of homologous sequences (8,33). PolyPhen input is a protein amino acid sequence or a SWALL database (http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-page+LibInfo+-newId+-lib+SWALL) ID or accession number, together with the sequence position and the two amino acid variants characterizing the polymorphism (8). PolyPhen scores are non-negative; scores of 0.00–0.99 are classified as benign, 1.00–1.24 as borderline, 1.25–1.49 as potentially damaging, 1.50–1.99 as possibly damaging, and ≥2 as damaging (38). Additional details of the algorithm and instructions for analysis of amino acid substitutions are available at http://www.bork.embl-heidelberg.de/PolyPhen/.
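The PolyPhen score bands can be written as a lookup in the same way. This Python sketch is a hedged illustration: `classify_polyphen` is a hypothetical name, not a PolyPhen function, and it assumes that the "potentially damaging" band ends just below 1.50, where the "possibly damaging" band begins.

```python
def classify_polyphen(score: float) -> str:
    """Map a non-negative PolyPhen score to the category bands in the text.

    Hypothetical helper; assumes the bands are contiguous, so
    "potentially damaging" covers scores from 1.25 up to (not including) 1.50.
    """
    if score < 0.0:
        raise ValueError(f"PolyPhen scores are non-negative, got {score}")
    if score < 1.00:
        return "benign"
    if score < 1.25:
        return "borderline"
    if score < 1.50:
        return "potentially damaging"
    if score < 2.00:
        return "possibly damaging"
    return "damaging"


if __name__ == "__main__":
    # Made-up scores, one per band.
    for s in (0.40, 1.10, 1.30, 1.75, 2.60):
        print(s, classify_polyphen(s))
```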
nsSNPs with experimental evidence of altered enzyme activity or disease association were regarded as deleterious. The phenotypic data are from both in vivo and in vitro studies, in which site-directed mutagenesis or enzymatic analyses often provide direct evidence of the functional impact of nsSNPs. Prediction accuracy was assessed against the positive findings from these experiments.
As a test for the ability of both SIFT and PolyPhen algorithms to identify substitutions impacting enzymatic activity of UGT proteins, scores were obtained and compared for the collected nsSNPs of human UGT genes related to loss of enzyme activity and disease based on experimental and clinical studies.
Given that SIFT and PolyPhen employ different approaches and different datasets as foundations for their analysis, it is important to determine the concordance of the two prediction tools on the functional consequences of each nsSNP. Concordance between the SIFT and PolyPhen predictions for each nsSNP was assessed with Spearman’s rank correlation coefficient ρ using SPSS 15. Prediction scores of each nsSNP were plotted on scatter graphs and analyzed using linear trend lines.
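For readers who wish to reproduce this kind of concordance test outside SPSS, Spearman’s ρ can be computed as the Pearson correlation of the ranked scores. The Python sketch below is a minimal standard-library illustration, not the procedure used in the study; the function names and the example scores are made up for demonstration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical scores: low SIFT scores paired with high PolyPhen scores.
    sift = [0.00, 0.02, 0.10, 0.35, 1.00]
    polyphen = [2.4, 1.9, 1.6, 0.8, 0.1]
    print(f"rho = {spearman_rho(sift, polyphen):.3f}")
```

Because low SIFT scores and high PolyPhen scores both indicate a damaging substitution, agreement between the tools appears as a negative ρ.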
Two hundred and forty-eight amino acid substitution variants were identified in the systematic screening of 21 human UGT genes for the analysis of the potential impact of all nsSNPs in human UGT genes. As bioinformatics data have been developed and updated, some SNPs previously reported in dbSNP have been identified as invalid by later studies because of sequencing and alignment errors. These incorrect SNPs have either been terminated or merged into other SNPs. We cross-examined the databases and removed the invalid SNPs. The numbers of variants from the four UGT families (UGT1, UGT2, UGT3, and UGT8) are shown in Table I. Three SNPs were identified in the screening of UGT2B17, but none was non-synonymous; they were therefore not included in this study.
SIFT scores were obtained for all 248 nsSNPs identified in this study. PolyPhen classified UGT1A1 (L15R) as benign but did not provide a prediction score; therefore, 247 nsSNPs were scored by both SIFT and PolyPhen, and these nsSNPs were used in the statistical concordance test.
As shown in Table II, 88 of 248 (35.5%) identified UGT nsSNPs exhibited SIFT scores of <0.05 and were classified as “intolerant” variants by SIFT. One hundred and fourteen of 248 (46.0%) identified UGT nsSNPs had prediction scores of ≥1.5 and were classified as “probably damaging” by PolyPhen. SIFT predicted no deleterious nsSNPs for a few UGT genes, namely UGT2A3, 2B15, 2B28, 3A1, and 8 (Table III); the nsSNPs in these genes include Ala496Thr, Asp84Tyr, Lys522Thr, Leu364His, Asn441Ser, Cys120Gly, and Pro225Leu. PolyPhen also predicted no deleterious nsSNPs for some UGT genes, namely UGT2A3, 2B15, and 8 (Table III). Thus, UGT2A3, 2B15, and 8 had no deleterious nsSNPs predicted by either SIFT or PolyPhen (Table III); the nsSNPs concerned are Ala496Thr, Asp84Tyr, Lys522Thr, and Ile367Met.
Representative deleterious nsSNPs and the corresponding amino acid substitutions of various UGT genes are listed in Table IV. As many as 31 UGT1A1 nsSNPs of 80 UGT nsSNPs were predicted as deleterious by both algorithms. For UGT1A3, 1A4, 1A5, 1A6, 1A7, 1A8, 1A9, and 1A10, the deleterious nsSNPs predicted by both SIFT and PolyPhen include Arg45Trp, Ile322Thr, Leu60His, Ser69Tyr, Ile318Thr, Thr202Ala, Val264Glu, and Ile211Thr. Thirteen UGT nsSNPs (Gly307Arg, Gly309Arg, Phe396Leu, Asp457Glu, Leu46Pro, Ser299Phe, Ala381Thr, Leu497Pro, Cys155Arg, Pro288Leu, and Lys367Ile) predicted as deleterious by both SIFT and PolyPhen were from the UGT2 family, including UGT2A1, 2A2, 2B4, 2B7, 2B10, and 2B11. For UGT3A2, Ala343Thr was predicted as deleterious by both SIFT and PolyPhen.
Table V shows the concordance analysis between the functional consequences for 247 nsSNPs predicted by SIFT and PolyPhen. Raw scores rather than arbitrarily defined categories were used for the correlation analysis. The row percentage of PolyPhen scores that fall into each SIFT category was calculated. Remarkably, 85.7% of “benign” predictions from PolyPhen fall into the SIFT “tolerant” category and 84.2% of “probably damaging” predictions from PolyPhen fall into the SIFT “intolerant” category. Scatter graphs plotted using the 247 prediction scores from SIFT and PolyPhen showed a negative correlation (Fig. 1). Spearman’s rank correlation coefficient (ρ=−0.709, P≤0.01) illustrates the significant concordance between the prediction scores from the SIFT and PolyPhen algorithms.
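The row-percentage calculation behind a table of this kind can be sketched as a cross-tabulation of paired category labels. The Python example below is illustrative only: the function name `row_percentages` and the toy category pairs are hypothetical and do not reproduce the data in Table V.

```python
from collections import Counter, defaultdict


def row_percentages(pairs):
    """Cross-tabulate (polyphen_category, sift_category) pairs.

    Returns {polyphen_category: {sift_category: percent of that row}}.
    """
    counts = defaultdict(Counter)
    for polyphen_cat, sift_cat in pairs:
        counts[polyphen_cat][sift_cat] += 1
    table = {}
    for polyphen_cat, row in counts.items():
        total = sum(row.values())
        table[polyphen_cat] = {cat: 100.0 * n / total for cat, n in row.items()}
    return table


if __name__ == "__main__":
    # Toy data, not the study's: four "benign" calls, three of them "tolerant".
    toy = [("benign", "tolerant")] * 3 + [("benign", "intolerant")]
    for polyphen_cat, row in row_percentages(toy).items():
        for sift_cat, pct in row.items():
            print(f"{polyphen_cat} -> {sift_cat}: {pct:.1f}%")
```

Each row sums to 100%, so a figure such as "85.7% of benign predictions fall into the tolerant category" is read along a single PolyPhen row.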
The confirmed phenotypes of nsSNPs manifest as altered enzyme activity and susceptibility to diseases such as CN I, CN II, and Gilbert’s syndrome. To date, a total of 63 nsSNPs of human UGT genes have been shown, in experimental and clinical studies, to be related to decreased enzyme activity, loss of enzyme activity, or susceptibility to disease. Using the positive findings from these experiments, a prediction was considered correct if the variant was predicted to be deleterious and incorrect (an error prediction) if the variant was predicted to be tolerant.
The confirmed variants were collected from results derived from site-directed mutagenesis studies of the enzyme using biochemical characterization (18,39–47) or clinical data from family-based and association studies (48–56). The biochemical/in vitro and in vivo UGT variants and their functional impact scores predicted by SIFT and PolyPhen are displayed in Table VI. If a confirmed UGT nsSNP was predicted as “intolerant” by SIFT and “probably damaging” by PolyPhen, it was termed a “true positive finding”. If a confirmed UGT nsSNP was predicted as “tolerated” by SIFT and “benign” by PolyPhen, it was termed a “false negative finding”. According to these criteria, approximately 57% and 67% of characterized UGT nsSNPs were correctly predicted as deleterious (“true positive predictions”) by SIFT and PolyPhen, respectively (Table VII). SIFT predicted 27 confirmed UGT alleles as “tolerated” and PolyPhen predicted 21 confirmed UGT alleles as “benign”; on this basis, the false negative error for SIFT and PolyPhen was 43% and 33%, respectively (Table VII). Ninety-seven percent of confirmed allelic variants of human UGTs were distributed in the UGT1A family, with two additional variants from the UGT2B family. Notably, UGT1A1 alone contributes 74.6% of the confirmed allelic variants of human UGTs; this correlates with the fact that UGT1A1 was the UGT gene with the most deleterious nsSNPs predicted by both SIFT and PolyPhen.
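The prediction rates follow directly from the counts quoted in this section (63 confirmed variants; 27 called “tolerated” by SIFT and 21 called “benign” by PolyPhen). The Python sketch below simply redoes that arithmetic; the function name `prediction_rates` is hypothetical, and the sketch assumes that every confirmed variant not called neutral counts as a correct prediction.

```python
def prediction_rates(n_confirmed: int, n_called_neutral: int):
    """Return (correct rate, false negative rate) as fractions.

    Assumes each experimentally confirmed variant is either called neutral
    (a false negative) or counted as correctly predicted.
    """
    if not 0 <= n_called_neutral <= n_confirmed or n_confirmed == 0:
        raise ValueError("need 0 <= n_called_neutral <= n_confirmed, n_confirmed > 0")
    correct = (n_confirmed - n_called_neutral) / n_confirmed
    false_negative = n_called_neutral / n_confirmed
    return correct, false_negative


if __name__ == "__main__":
    for tool, neutral in (("SIFT", 27), ("PolyPhen", 21)):
        correct, fn = prediction_rates(63, neutral)
        print(f"{tool}: {100 * correct:.1f}% correct, {100 * fn:.1f}% false negative")
    # SIFT: 57.1% correct, 42.9% false negative
    # PolyPhen: 66.7% correct, 33.3% false negative
```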
Based on the results of in vivo and in vitro research, among 63 functionally characterized nsSNPs in the UGTs, 24 showed altered enzyme expression/activities and 45 were associated with disease susceptibility. In agreement with earlier studies, reduced levels of UGT enzyme activity were observed in CN II and Gilbert’s syndrome, whilst absence of UGT enzyme activity was observed only in CN I patients. Additional deleterious nsSNPs of human UGTs were predicted by SIFT and PolyPhen, but the predicted phenotypes of these nsSNPs have not yet been confirmed experimentally.
More than two-thirds of confirmed variants were UGT1A1 variants (Table VI). In the confirmed variants of UGT1A1, the associated phenotypes were CN I, CN II (15), and Gilbert’s syndrome (16) which are all closely related to bilirubin levels. To date, 47 mutant UGT1A1 alleles have been identified (Table VI; 17,18,39,42–51,53–59). Many of these SNPs were predicted to have phenotypical effects by the algorithms and the correct prediction rates were 68% and 81% by SIFT and PolyPhen, respectively.
A number of the confirmed allelic variants of UGT1A1 are associated with Gilbert’s syndrome, including UGT1A1*27 (Pro229Gln), *29 (Arg367Gly), *62 (Phe83Leu), *69 (Ile159Thr), *70 (Ala321Gly), *72 (Asp359Asn), and *73 (Pro364Leu). Both SIFT and PolyPhen correctly predicted the phenotype of Pro364Leu. In addition, UGT1A1*7 (Tyr486Asp) and *9 (Gln331Arg), which are closely associated with Crigler-Najjar II syndrome (3,18,45,59), were predicted as probably damaging with high PolyPhen scores. Furthermore, another allelic variant, UGT1A1*11 (Gly308Glu), identified in Crigler-Najjar I syndrome (3,43), was also correctly predicted by both the SIFT and PolyPhen algorithms.
Only two variant alleles have been confirmed in the UGT2 gene family, namely Asp67Tyr and Asp85Tyr (60,61). Hyperbilirubinemia can have a number of causes, including liver disease and hemolysis (62). The rate of bilirubin production also needs to be considered when the bilirubin level is abnormal, e.g. in inherited hemolytic syndromes such as glucose 6-phosphate dehydrogenase deficiency and sickle cell anemia. The UGT2B7 promoter variant 840G>A is associated with significantly reduced glucuronidation of morphine in sickle cell disease and contributes to the variability in hepatic clearance of morphine in these patients (28). The UGT2B15 variant Asp85Tyr has been shown to relate to prostate cancer risk (30,63).
Humans have vast numbers of SNPs, and at present there are more than a million SNPs in dbSNP that can be screened for association with diseases. SIFT was tested against unbiased experimental datasets in which mutagenesis was performed throughout the entire protein and both wild-type and negative phenotypes were assayed; only three datasets fit these criteria, and the scarcity of unbiased datasets indicates how difficult characterization of mutant proteins on a large scale can be (31). Therefore, determining disease-causing SNPs via site-directed mutagenesis and gene knockout/knockin experiments is complex, lengthy, and unrealistic given the sheer number of SNPs (64). Using bioinformatic tools to predict the nsSNPs most likely to be damaging acts as a filter and reduces the number of SNPs that must be screened for association with disease to those most likely to alter gene function (8,9). By predicting the effects of UGT nsSNPs, we are able to identify mutations that are unlikely to affect protein function (9), reducing the number of nsSNPs for experimentation and thus saving time and resources. Based on the predicted deleterious nsSNPs, experiments and clinical studies can then screen, on a much smaller scale, for UGT polymorphisms that may cause disease or increase drug toxicity. Moreover, the detection of individual single-nucleotide polymorphisms that alter enzyme function can be a useful tool for the identification of disease risk or for personalizing drug regimens (65,66).
A total of 248 nsSNPs were identified through the screening of 21 human UGT genes in NCBI dbSNP, MutDB, and PolyDoms. However, according to published results from in vivo and in vitro studies, only about a quarter of the nsSNPs in the dataset of validated human UGT genes were found to contribute to altered enzyme activity or to correlate with disease. These confirmed phenotypes of nsSNPs were related to reduced or absent enzyme activity and correlated with susceptibility to diseases such as CN, Gilbert’s syndrome, and hyperbilirubinemia related to certain therapeutic drugs.
There are 49 non-synonymous SNPs in exons 1–5 of UGT1A1 (Supplementary Table 1). To date, 47 mutant UGT1A1 alleles have been identified (17,18,39,42–51,53–59). Deleterious nsSNPs of UGT1A1 such as Ser375Phe, Gly308Glu, Ala291Val, and His39Asp, which present phenotypically as CN I, were all correctly predicted by both SIFT and PolyPhen. The Tyr486Asp (UGT1A1*7) mutation in exon 1, the most abundant mutation in CN II in Japanese patients (67), was correctly predicted as a deleterious nsSNP by both SIFT and PolyPhen. The variant UGT1A1*27 (Pro229Gln), identified in CN II with reduced UGT enzyme activities in vitro and in vivo, was not predicted to be deleterious by SIFT or PolyPhen. Of the seven confirmed alleles associated with Gilbert’s syndrome (UGT1A1*27, *29, *62, *69, *70, *72, and *73), only Arg367Gly and Pro364Leu were correctly predicted as deleterious nsSNPs of UGT1A1. Other examples of incorrect predictions include UGT1A1*6 (Gly71Arg) and UGT1A1*15 (Cys177Arg). Overall, SIFT and PolyPhen predicted UGT1A1 SNPs to have phenotypical effects with correct prediction rates of 68% and 81%, respectively.
SIFT and PolyPhen are in silico tools that use protein sequence alignment, physicochemical differences, and mapping to known protein 3-D structures to predict the functional impact of nsSNPs on protein structure and activity (8,9,31,33). Even though SIFT and PolyPhen employ different approaches, different types of reference data, and different scoring scales, significant concordance was observed between their predictions of the functional consequences of each nsSNP in human UGTs (Spearman’s ρ=–0.709, P≤0.01). SIFT and PolyPhen can discriminate disease-associated variants from neutral variants. The results of this study are consistent with the report by Ng and Henikoff (9), who predicted 757 of 3,084 (25%) nsSNPs from dbSNP to be damaging and to affect protein activity. Ramensky et al. (8) correctly predicted 27.6% of nsSNPs to affect protein function in the human genome variation database. Xi et al. (38) reported that 30–50% of over 500 amino acid substitution variants identified in DNA repair genes were predicted to exhibit reduced activity. Zhang et al. (68) studied variants in the H+/peptide cotransporter (PEPT1), which is involved in drug transport; SIFT correctly predicted that the SNP that reduced transport capacity would “affect protein function”, supporting the accuracy of SIFT for predictions on individual proteins. Our recent study on phenotype prediction of 791 validated nsSNPs in human cytochrome P450s using SIFT and PolyPhen found that 70% of nsSNPs were correctly predicted as damaging (69).
Furthermore, the scatter graph plotted using the functional consequence predictions from SIFT and PolyPhen showed a negative correlation between the two sets of scores. Low SIFT scores, which indicate an “intolerant” prediction, correlated with high PolyPhen scores, which indicate a “probably damaging” prediction. High SIFT scores, which indicate a “tolerant” prediction, correlated with low PolyPhen scores, which indicate a “benign” prediction. This correlation was further established by the row percentages calculated from the scores of each of the 247 nsSNPs predicted by SIFT and PolyPhen. The two algorithms agree at the two extremes of the scores: a high percentage (85.7%) of “benign” predictions from PolyPhen was predicted “tolerant” by SIFT, and a high percentage (84.2%) of “probably damaging” predictions from PolyPhen was predicted “intolerant” by SIFT. Scores in between were only slightly correlated, showing less agreement across the intermediate prediction categories of the two algorithms; overall, however, the two algorithms were strongly correlated in their predictions of the potential effect of known human UGT nsSNPs on protein function. The concordance between SIFT and PolyPhen prediction scores suggests that they can be used in combination in the future to improve the accuracy of prediction studies.
Although bioinformatics tools show potential for reducing the number of nsSNPs for disease association studies by filtering for the nsSNPs most likely to be disease related, error predictions do occur. In this study, based on data collected for 63 nsSNPs from in vitro and in vivo studies, the false negative rates for SIFT and PolyPhen were 43% and 33%, respectively. Several factors affect the prediction accuracy of tools such as SIFT and PolyPhen. Firstly, SIFT and PolyPhen rely on several different databases for SNP information; databases contaminated with erroneous SNP reports and biased towards disease-related allelic variants are likely to lead to overprediction of the number of deleterious nsSNPs (8). For example, Ramensky et al. (8) compared the fraction of nsSNPs predicted to be damaging among HGV-base entries and found that the overall prediction rate was 31.4% for “Suspected” nsSNPs, 28.9% for “Proven” nsSNPs, and 27.6% for “Proven” nsSNPs from systematic studies on healthy individuals. Furthermore, programs that identify SNPs may detect base differences between the functional gene and a pseudogene and erroneously report these differences as SNPs in the functional protein. Including nsSNPs erroneously mapped from pseudogenes in a SNP database will affect the accuracy of prediction tools that use SNP information from that database (9). The growth of public SNP data and improvements in the quality of SNP databases will help in acquiring correct information on SNPs and thus in improving the prediction accuracy of bioinformatics tools. Secondly, SIFT has a weighted false positive error of 19% and PolyPhen of 9%, indicating that if all of the nsSNPs from dbSNP were functionally neutral, 19% and 9% would have been predicted as damaging by SIFT and PolyPhen, respectively (9).
Additionally, some factors are overlooked when predictions are made. SIFT makes predictions from the amino acid substitution in the protein product and does not take into account mutations that affect transcription, translation, splicing, and other possible pretranslational alterations (9); erroneous predictions may then arise. Furthermore, a prediction might appear incorrect because of the lack of association with an obvious altered phenotype. SIFT and PolyPhen may be sensitive to a mutation and predict it to be damaging to protein function, but if the phenotype is undiagnosed or has not yet been assayed for, the prediction is deemed an “error prediction”. On the other hand, SIFT and PolyPhen may predict a deleterious mutation to be “tolerated”, and because the phenotype is not obvious, the prediction is taken to be correct when in fact the mutation is deleterious but recessive or undiagnosed. Therefore, it is important to identify the association of SNPs with various phenotypes/diseases. Identifying deleterious nsSNPs using bioinformatics tools such as SIFT and PolyPhen may lay the foundation for, and initiate the process of, identifying SNPs that have clinical implications. This approach can reduce the number and/or the field of possible deleterious nsSNPs that need to be evaluated in clinical studies.
In summary, in silico analysis predicted that 35.5–46% of over 200 amino acid substitution variants currently identified in the human UGT genes might cause reduced enzyme activity and/or elevated bilirubin levels. The identification of these variants using bioinformatic tools such as SIFT and PolyPhen narrows the field of study and is the first step in the process of evaluating their possible phenotypic importance in costly clinical studies.
The authors appreciate the technical assistance of Dr Lin-Lin Wang of Institute of Reproductive and Child Health, Peking University, Beijing, China. Ms. Yuan Ming Di is a holder of RMIT University PhD Scholarship. The authors appreciate the support of RMIT Health Innovations Research Institute, RMIT University, Bundoora, Victoria, Australia. We also would like to thank Associate Professor Clifford Da Costa (Department of Mathematics and Statistics, RMIT University, Melbourne, Australia) for his assistance in the statistical analysis of the data in this paper.