|Home | About | Journals | Submit | Contact Us | Français|
A critical task in pharmacogenomics is identifying genes that may be important modulators of drug response. High-throughput experimental methods are often plagued by false positives and do not take advantage of existing knowledge. Candidate gene lists can usefully summarize existing knowledge, but they are expensive to generate manually and may therefore have incomplete coverage. We have developed a method that ranks 12,460 genes in the human genome on the basis of their potential relevance to a specific query drug and its putative indications. Our method uses known gene–drug interactions, networks of gene–gene interactions, and available measures of drug–drug similarity. It ranks genes by building a local network of known interactions and assessing the similarity of the query drug (by both structure and indication) with drugs that interact with gene products in the local network. In a comprehensive benchmark, our method achieves an overall area under the curve of 0.82. To showcase our method, we found novel gene candidates for warfarin, gefitinib, carboplatin, and gemcitabine, and we provide the molecular hypotheses for these predictions.
Pharmacogenomics is the large-scale study of how genes and their variations impact drug response. A critical step in pharmacogenomics is identifying genes that are important for drug response. A given drug may have pharmacogenes—genes important for its pharmacology—that are involved in its absorption, distribution, metabolism, and excretion (pharmacokinetics) or that are involved in its mechanism of action (pharmacodynamics). However, it may take many years of study to identify these pharmacokinetic or pharmacodynamic genes completely. This delay in understanding impedes our ability to identify, evaluate, and use genetics to optimize drug selection and dose.1
Recently, high-throughput genomic technologies have been successful in identifying new pharmacogenetic (PGx) interactions.2 RNA expression arrays can measure differential gene expression related to drug response;3 genes showing significant changes in expression are more likely to be relevant to the drug response. Similarly, genome-wide association studies (GWASs) analyze data related to patients with different drug-response phenotypes and seek polymorphisms (most often single-nucleotide polymorphisms, SNPs) that correlate with drug response;4 genes with variants that correlate with drug response are more likely to be pharmacogenetically relevant.
One of the challenges with these high-throughput technologies is the statistical analysis required for the assessment of genome-scale data. The large number of hypotheses that undergo testing and the multiple sources of experimental noise may lead to the overlooking of important genes or the embracing of irrelevant ones.5,6 In expression studies, some genes may show expression changes that are not pertinent to the drug response. In GWASs, many of the correlated polymorphisms may be chance occurrences, not indicative of any biological connection. There is therefore an acute need for principled ways of applying current biological knowledge to the interpretation and ranking of these data sets.5 By combining the weak statistical signals in these data sets with an existing model favoring genes for which we have some molecular evidence, we can find pharmacogenes more reliably.
One way to use current biological knowledge is to focus on candidates—genes for which there is some biological or chemical reason to hypothesize their possible involvement in the response to a particular drug. Candidate gene lists are typically compiled from pathway databases and gene annotations as well as by mining the literature.7 These methods can be successful, but their quality varies depending on the care with which they are created and our understanding of the relevant biology.8–11 Nevertheless, the notion of using automated and systematic methods to create candidate gene lists is very appealing. Given a candidate gene list, we can (using Bayes’ Rule) mathematically encode methods for combining our candidate list with high-throughput data.
In this article, we describe a method that takes as input (i) a query drug and (ii) a query indication for its use and creates a ranking of 12,460 genes in the human genome with respect to the likelihood that each gene is an important modulator for the input drug and indication for its use. Our method can, in principle, be applied to the entire genome, but this requires for all genes that at least some information be available about their primary protein–protein interactions. The method uses a high-quality protein–protein interaction network to describe the context (subnetwork of genes) in which each gene operates. For each gene-specific subnetwork, our method compares the drugs and indications associated with its subnetwork with the query drug and indication. If the gene and its subnetwork interact with drugs that are similar to the query drug/indication, then the gene is assigned a high score as a potential pharmacogene. This assessment is made for every gene in the protein network in order to produce a rank-ordered list. The list is imperfect and does not replace the need to perform experiments, but it provides an independent source of information that can be used to inform the analysis of the data from such experiments.
Our method is logically derived from the following assumptions: (i) proteins tend to work collectively in modules to carry out cellular tasks; (ii) drugs interacting with these modules provide a clue to the chemistry and disease–treatment phenotypes pertinent to that module; (iii) chemical similarity provides information about similarity with respect to pharmacology; and (iv) similarity in indication of use provides information about similarity in the molecular mechanisms (within modules or pathways) targeted by the drugs. These four assumptions are not universally true, but their degree of truth determines the accuracy and utility of our method.
Figure 1 shows the inputs and the subnetwork we use to derive features for each gene–drug combination. We determine the interacting partners for a gene using the InWeb human gene–gene interaction network.12 We associate each gene with particular drugs on the basis of a corpus of genetic interactions registered in the Pharmacogenetics and Pharmacogenomics Knowledge Base13 (PharmGKB) and a corpus of known physical drug–target interactions (DTIs) from DrugBank.14 We define a similarity metric based on chemical structure that allows us to assess whether the query drug is similar to the drugs interacting with the gene or its network neighbors. We also define a metric for similarity of indication of use that allows us to assess whether the query indication is similar to the indications of use for drugs that interact with the query gene or its network neighbors. We used the set of known pharmacogenes in PharmGKB as a benchmark to train a logistic regression that takes the similarity of chemistry and indication for both PharmGKB and DrugBank interactions and produces an overall score. We apply this regression to all genes and rank them according to score. We have validated our model on a fivefold cross-validation, a genome-scale prediction, and an independent test set, and we demonstrate sample results for warfarin, gefitinib, carboplatin, and gemcitabine.
We extracted 2,488 primarily genetic interactions from PharmGKB and complemented these with 3,600 physical DTIs extracted from DrugBank. A total of 1,139 drugs and 1,546 genes were involved in the extracted interactions. Each of the genes was mapped to the InWeb interactome that encodes 313,524 physical interactions among 12,460 human gene products. We mapped all drugs and drug classes to chemical structures and indications of use as described in the Methods section.
In a set of preliminary investigations, described in the Supplementary Data online, we showed that (i) gene products that interact in a protein–protein interaction network are more likely to interact with a common small molecule (Supplementary Tables S1–S4 and Figures S1 and S2 online) and (ii) that drugs with a similar indication of use are more likely to interact with each other in a protein–protein interaction network (Supplementary Figure S3 online). These two observations are not perfectly predictive, but they adjust probabilities. This is all that our method requires, because the goal is simply to rank genes on the basis of higher and lower aggregate probabilities.
Integrating the data from InWeb, PharmGKB, and DrugBank, we created a gene–gene–drug network (an example of the data structure can be seen in Figure 1). On average, each gene interacts with 24.9 other genes. The 844 pharmacogenes each interact with an average of 2.95 drugs. The structural similarity between any two drugs in the data set is measured by the Tanimoto coefficient (ranging between 0 and 1) and has an average value of 0.31 in our data set. The similarity in indication of use (also measured by the Tanimoto coefficient) spanned the same range but had an average coefficient of 0.02 (distributions of pairwise similarities can be found in Supplementary Figures S4 and S5 online). We defined a set of four features that can be computed for any combination of a drug and an indication of application, given a gene (see Figure 1 and the Methods section for more details). These features capture structural similarity of drugs with respect to both genetic and physical interactions and similarity in indication of use with respect to both genetic and physical interactions. We calculated these features for each drug–gene combination in the gene–gene–drug network.
We used PharmGKB as a resource for training gene–drug interactions. As described in the Methods section, we trained a logistic regression model using fivefold cross-validation. The resulting model correctly classified 81.2% of the instances and had an area under the receiver operating characteristic curve of 0.82.
In order to test our model, we used it to make whole-genome predictions for all drugs with PGx interactions registered in PharmGKB. We removed the known interactions for each drug as it was tested. The known interacting genes for each drug constituted the gold standard. We scored all genes using the logistic regression algorithm and assessed the rank of the gold-standard genes. We also assessed the ability to recall these gold-standard genes with different cutoffs. Figure 2 shows sensitivity (true-positive detection rate) of the model as a function of the specificity (true-negative detection rate). The area under the curve is 0.86. Importantly, our method associates 81% of the drugs in PharmGKB with at least one gene scoring above zero. At that score, we can recall 41% of the known pharmacogenes in PharmGKB. Overall, nearly 7,000 genes have scores above zero, suggesting that there are many potential candidate genes with a relatively high likelihood of interacting with the tested drugs. Therefore, as expected, our method orders the human genome so as to produce a list that is enriched at the top for true pharmacogenes.
While we built our model, PharmGKB staff registered 225 new articles manually, thereby providing an independent set of 199 gene–drug interactions for testing. We ranked the genome for each of the drugs in this new set and made a receiver operating characteristic curve for performance on this external validation set (Figure 2). The area under the curve for the external validation was 0.81. Figure 2 shows that the performance on the external validation set is comparable with the performance on the cross-validation training set, and that we have not overfitted our model to the available examples.
As an illustration, consider warfarin. The literature on the pharmacogenetics of warfarin is quite mature.15,16 This anticoagulant drug is diffcult to dose because it depends strongly on a patient’s genetic background. In particular, two genes affect the appropriate dosage: vitamin K epoxide reductase complex 1 (VKORC1) and cytochrome P450 2C9.17,18 Leaving out all known warfarin–gene interactions, we ranked all genes in the genome using warfarin and all its indications, as recorded in the PharmGKB. VKORC1 and cytochrome P450 2C9 ranked 11 and 16, respectively (out of 12,460). The data for VKORC1 is shown in Figure 3. The recently discovered warfarin pharmacogene cytochrome P450 4F219 had weaker support, ranking in the top 500 (4%) of the genome.
Experimental evidence for novel candidates provides independent validation of our method. To show the broad spectrum of applicability of our method, we obtained: (i) a GWAS of warfarin by Cooper et al.,20 (ii) an expression study of gefitinib-treated cancer cell lines by Coldren et al.,21 (iii) a study of carboplatin response in cancer cell lines from Huang et al.,22 (iv) a study of the correlation between gene expression and cytotoxicity of gemcitabine from Li et al.,23 and (v) another study of gene expression and cytotoxicity of gemcitabine by Jarjanazi et al.24 For each of these studies, we ranked genes and compared the top five predictions with the experimental data (Supplementary Table S7 online); we also combined the overall rank to the experimental ranks to generate a new, combined ranking (Supplementary Table S8 online). We summarize our findings here.
First, one of our top five genes, NSUN6, has never before been described in the literature as a pharmacogene for warfarin, but it is associated with response to warfarin in a GWAS (P < 3.1 × 10−3). Second, we combined our rankings with the experimental rank, which put the well-known VKORC1 and cytochrome P450 2C9 right at the top; more interestingly perhaps, the third gene is the novel candidate HSD3B7 that scores highly in both our ranking and the GWAS data. Finally, high-scoring candidates rank significantly higher in the experimental rank than may be expected on the basis of chance (P < 1.1 × 10−2).
First, of the top five genes, SH3BGRL was significantly underexpressed (P < 3.0 × 10−2) in cells showing sensitivity to gefitinib. AMH and PTPRS were significantly overexpressed (P < 4.1 × 10−3 and P < 2.9 × 10−2). Second, when we combined our ranks and the experimental data, the gene with the best rank sum (i.e., supported by both) was FAM96B (evidence shown in Figure 3). Finally, high-scoring genes in our ranking had overall significantly higher expression in the experiment (P < 6.5 × 10−8).
Of the top five genes, only MAK3 was expressed in the cell lines studied by Huang et al.22 The expression was correlated with a sensitivity to carboplatin in the Yoruban population (P < 5.6 × 10−5), but not in the Ceph population.
First, of the top five genes, the gene expressions of AICDA, SLC29A1, and SH2D5 were significantly correlated with cytotoxicity according to the study by Li et al.23 (P < 6.8 × 10−4, P < 1.9 × 10−3, and P < 3.2 × 10−2, respectively). To our knowledge, of these three validated top-ranking candidates, AICDA and SH2D5 are novel pharmacogenes not previously associated with drug–response phenotypes. The indirect evidence supporting the rank of AICDA is represented in Figure 3. Second, when we combined the experimental ranks and the ranks obtained by our method, the novel pharmacogene candidate NT5C3 had the best combined rank (supporting evidence shown in Figure 3). In addition, we found novel genes, GPM6A and BNIP3. Finally, an overall comparison showed high correlation between the predicted and experimental ranks (P < 4.2 × 10−5).
We compared our novel predictions with a second data set provided by Jarjanazi et al.24 Although there were no data for NT5C3, we were able to replicate one of the remaining two, BNIP3 (P < 4.9 × 10−3). This prediction therefore looks solid and replicable.
Finally, we assessed the impact of improved (i.e., larger) training sets on our performance. Figure 4a shows the anticipated rate of accumulation of drug–gene interactions in the literature, and Figure 4b shows the correlation between the performance of our method and the size of the training set. The performance improves substantially as our training knowledge increases.
Our validation experiments demonstrate that our method can rank 12,460 genes in the human genome (i.e., those appearing in the InWeb protein–protein interaction map) and that the top of the list is enriched for true pharmacogenes. Because the PharmGKB is manually curated, the quality of the annotations is fairly high, but the volume is low. In fact, PharmGKB associates only 844 genes with PGx effects. Using protein–protein interactions, we have been able to extend predictions to cover all of the InWeb interactome with 12,460 genes.
Recently, Yamanishi et al. successfully developed a method for predicting physical DTIs.25 Although physical interactions could indicate a phenotypic drug–gene relationship, this method did not attempt to predict PGx interactions that extend beyond the physical interactions. A recently published drug–target network shows how drug targets and disease genes cluster in protein–protein interaction networks.26 The drug–target network gives a comprehensive framework for analysis of networks related to drug response (see comparison with the PGx pipeline data structure in Supplementary Table S11 online), but it does not quantify, automate, or focus on the process of prioritizing pharmacogene candidates.
While scientists carrying out pharmacogenomics studies are likely to use any or all available informatics resources to assess “hits,” we have automated this process to allow quantitative genome-scale integration of our rankings (and associated scores) with experimental data sets. This is, to our knowledge, the first automated method for prediction of pharmacogenes.
Our method is most useful as a source of information that can be integrated with other data, as illustrated by combining our ranking results with the ranks from the warfarin GWAS and the gefitinib expression data. By examining Figure 3, one can see that the predictions include both expected and novel candidates. For expected candidates, the evidence comes from related drugs interacting directly with the gene of interest, as with the interaction between warfarin and VKORC1. In contrast, for novel predictions, the candidate gene often links a set of known interactants together, as with gemcitabine and AICDA in Figure 3. Importantly, our method provides molecular explanations (on the basis of interacting genes and drug structure) that can generate new hypotheses and guide further experimentation.
Our method requires as input both a drug and an indication, and it can yield very different results for the same drug, given different indications. For example, consider the nucleoside analog gemcitabine, a cancer chemotherapeutic. In addition to targeting cancer, gemcitabine has side effects such as neutropenia, hypertension, and proteinuria. Ranking the genome for gemcitabine in combination with the cancer indication yields a very different result from ranking the genome for gemcitabine and its side effects. Whereas the cancer query ranks CDA and ABCB1 in the top 20, the side-effect query includes both ACE and ACE2, important parts of the blood pressure control pathways (Supplementary Tables S9 and S10 online).
There are important limitations to this approach to pharmacogene prediction. First, the predictions are limited to the parts of the genome that have protein–protein interactions. Second, the indication-for-use associations for some drugs are not complete. Third, we do not have sufficient data about the extensive off-label use of drugs and general-use approved indications.27 Fourth, our structural-similarity methods assume small organic molecular structures and do not work with biological agents such as antibodies or RNA therapeutics. Finally, our method does not explicitly predict whether genes are involved in the pharmacokinetics or pharmacodynamics of a drug. In some cases, this can be inferred by inspection, on the basis of the types of genes that are implicated. We are investigating ways to distinguish pharmacokinetics and pharmacodynamics predictions.
Because our method is modular, the system can be adapted when better protein–protein interactions become available. Likewise, novel DTIs or PGx interactions can easily be added to improve the performance of the model. The method is made available online at http://pgxpipeline.stanford.edu/. The validation of the PGx pipeline shows that it is capable of recovering known pharmacogenes as well as predicting novel candidates. Most important, we can apply the method to novel molecules and their putative indications in order to generate an initial genome-scale candidate gene list.
Full details of the method are provided in the Supplementary Data online.
From the PharmGKB core tables, we extracted 2,488 PGx (“genetic”) interactions between drugs and genes. Although PharmGKB encodes some direct physical interactions, it is dominated by genetic interactions, and so we believe DrugBank to contain fundamentally complementary information about physical interactions. Similarly, we extracted 3,600 drug–target (“physical”) interactions and 1,173 chemical structures from DrugBank. The InWeb interactome12 is a human protein–protein interaction network created with data from experiments on both humans and model organisms. In total, the interactome includes data on 12,278 proteins. We added proteins with corresponding gene–drug interactions in PharmGKB as unconnected nodes, bringing the total number of proteins in the interactome to 12,460.
For each protein in the InWeb, we looked for direct interaction partners (first-degree partners)—for a virtual pull-down experiment.12 For each of the proteins in the pull-down, we looked for interactions between the corresponding gene and drugs. The gene–drug interactions can be classified in two categories: genetic interactions (PGx) derived from PharmGKB and physical DTIs derived from DrugBank. The resulting networks link the query gene to a group of first-degree interacting genes that, in turn, connect to a set of associated drugs (as shown in Figure 1). We separately considered both the structural similarity and the indication similarity of direct physical interactions (from DrugBank) and indirect genetic interactions (from PharmGKB), leading to the use of four features, as shown in Figure 1. We computed the structural similarity of drugs using MACCS fingerprints and a Tanimoto coefficient over 166 predefined features.28,29 We computed similarity in indications of use, using a similar bit string of 2,000 diseases in PharmGKB and the Tanimoto coefficient. We defined four features of the gene–gene–drug network that describe the similarity between a drug (D) and indication (I) and drugs/indications interacting with the local neighborhood of a gene (G) (Figure 1).
We take the average score of structural similarities in the subnetwork to the query drug for associated drugs that come from PharmGKB gene–drug interactions.
We take the average score of similarity of indication for all the drugs in the subnetwork that come from PharmGKB.
We take the average score of structural similarities in the subnetwork to the query drug for associated drugs that come from DrugBank gene–drug interactions.
We take the average score of similarity of indication for all the drugs in the subnetwork that come from DrugBank.
The PharmGKB has evidence of 2,412 interactions between drugs and genes. In order to provide negative interactions, each positive interaction from PharmGKB was matched with interactions between the same drug and three randomly selected genes that were not present in the positive set. We trained a logistic model on the other examples, using four-way cross-validation in the WEKA software.30 A summary of the performances of other classifiers can be found in Supplementary Tables S5 and S6 online. We validated the method by running the classifier on all drugs with at least one PGx interaction manually registered in PharmGKB and for all 12,460 genes in the InWeb network. During all validations, we always removed all information related to the drug from the interaction database before running our prediction algorithm.
As an independent validation set, we tested the method on recently discovered drug–gene interactions that were added to the PharmGKB after the training data set was downloaded. This validation set included 84 drugs, 94 genes, and a total of 199 interactions. We used receiver operating characteristic curves to compare the genome-wide performance on the training and external validation set. We found that a cutoff of −1.075 gave the best separation of positives and negatives of the genome-wide validation.
We used three published prospective data sets and one database example to validate the PGx pipeline. For warfarin, we used the P values of all 500,000 SNPs in the dose-dependence study of warfarin by Cooper et al.20 The SNPs were mapped to associated genes and, where there were multiple SNPs, a gene was associated with the SNP with the lowest P value. For gemcitabine, we used two prospective experimental data sets.23,24 For gefitinib, we used normalized data from a study of gene expression in 45 non-small-cell lung cancer cell lines with varying resistance to gefitinib, through the Gene Expression Omnibus.21,31 For each of the drugs warfarin, carboplatin, gemcitabine, and gefitinib, we carried out the method and evaluated the top five ranking genes. Supplementary Table S7 online includes the five highest scoring predictions for each of the four drugs. For gemcitabine, we summed the ranks of the predictions with the experimental results from Li et al.23 and validated three genes with the lowest summed ranks in the experiment by Jarjanazi et al.24 For each of the drugs warfarin, gemcitabine and gefitinib, the 10 genes with the highest summed ranks are summarized in Supplementary Table S8 online.
From the genes in Supplementary Tables S7 and S8 online, we chose four examples to illustrate the molecular data we integrate for each PGx pipeline prediction. In Figure 3, we split each feature into components showing the contributions both from genes and from drugs to a given prediction. The contributions were calculated as percentages of the total logistic score, after correcting for the intercept.
The drug–gene interaction data for calculation of the four features stem from both PharmGKB and DrugBank. We simulated more sparse data by removing 0 to 100% of the PGx interactions from PharmGKB in 5% intervals. For each simulation, we evaluated the performance using whole-genome ranks, calculating the enrichment at recall rates of 0 to 100% in 5% intervals. We fitted a curve using linear regression.
We thank G. Cooper for providing us with genome-wide data for warfarin, M.E. Dolan and A. Stark for data on carboplatin, K. Lage for access to InWeb interactome, R. Whaley for PharmGKB data, and B. Daigle and Y. Garten for their valuable input on the manuscript. We also acknowledge support from the National Institutes of Health (grant GM61374 and LM05652), the National Science Foundation (award CNS-0619926), the Villum Kann Rasmussen Foundation, and the NABIIT (Nanoscience and Technology, Biotechnology and IT) program.