Drug targets, drug indications and canonical simplified molecular input line entry specification (SMILES) (Weininger, 1988) of the drugs were extracted from DrugBank (Wishart et al, 2008). Additional drug targets were obtained from the DCDB (Liu et al, 2010), Matador (Gunther et al, 2008) and KEGG DRUG (Kanehisa and Goto, 2000) databases. FDA drug labels and additional drug indications were downloaded from the DailyMed site (http://dailymed.nlm.nih.gov) and from http://drugs.com. Drug side effects were obtained from SIDER (Kuhn et al, 2010). Human PPIs were compiled from experimental and literature-curated data (Xenarios et al, 2002; Rual et al, 2005; Stelzl et al, 2005; Ewing et al, 2007; Breitkreutz et al, 2008). Protein sequences and GO annotations (Ashburner et al, 2000) were downloaded from UniProt (Jain et al, 2009). Clinical trials data were downloaded from a registry of federally and privately supported clinical trials (http://clinicaltrials.gov/). Disease signatures were downloaded from the Gene Expression Atlas of ArrayExpress (Parkinson et al, 2009).
Assembling drug–disease associations
We assembled associations between diseases listed in the OMIM database (Hamosh et al, 2002) and their indicated drugs, registered in DrugBank (Wishart et al, 2008). OMIM is a comprehensive disease phenotype database, encompassing thousands of phenotypic descriptions of diseases and disorders, including single-gene as well as complex multiple-gene disorders, allowing for the construction of phenotypic similarity measures.
In order to handle variations in disease names between OMIM and drug indication sources, we mapped disease names to UMLS concepts (Bodenreider, 2004), which serve as a standardized terminology. We began by mapping OMIM disease names to UMLS concepts, exploiting the integrated curated mapping between OMIM and UMLS concepts, and augmented the mapping using the MetaMap tool (Aronson, 2001). The MetaMap tool uses symbolic natural language processing and computational linguistic techniques to map biomedical text to the UMLS metathesaurus, and was previously reported to exceed human mapping (to UMLS concepts) capabilities (Pratt and Yetisgen-Yildiz, 2003). We permitted only disease-related UMLS concept types (e.g., ‘Disease or Syndrome', ‘Anatomical Abnormality' or ‘Neoplastic Process') and filtered out ambiguous or overly generic UMLS concepts (e.g., ‘Infections' or ‘Syndrome'). We further augmented the MetaMap-mapped UMLS concepts with the rich synonymy relationships embedded in the UMLS. Supplementary Table S6 lists the mapping between OMIM disease names and UMLS concepts.
The associations between drugs and UMLS disease concepts were integrated from four different sources using three different methods: (i) direct mapping to drugs, exploiting embedded UMLS links between concepts and drugs; (ii) drug–condition associations downloaded from http://drugs.com, where conditions were mapped to UMLS concepts using MetaMap; and (iii) indication-based mapping. For the latter, we extracted UMLS concepts using the MetaMap tool from textual drug indications downloaded from FDA package inserts (available on the DailyMed website, http://dailymed.nlm.nih.gov) and from DrugBank. In addition, we manually added 44 associations occurring in phase IV (post-marketing) clinical trials.
In order to deal with variations in drug names, we used generic and synonymous drug names; if no match was found, we also used brand names retrieved from DrugBank. Using the ICD-10, we removed UMLS concepts matching hard-to-treat congenital anomalies such as congenital malformations, chromosomal abnormalities and inborn errors of metabolism. We filtered disease names classified under ICD-10 chapter XVII (Congenital malformations, deformations and chromosomal abnormalities (Q00–Q99)) and metabolic disorders (E70–E90) by mapping them to UMLS concepts using MetaMap, followed by manual curation of the list, resulting in the removal of 263 OMIM diseases. Finally, through manual curation of the UMLS concepts extracted from textual descriptions of drug indications, we observed that these are more prone to false positives. We thus required that associations extracted from drug indications also appear in at least one more source.
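The drug-name matching fallback described above (generic names, then synonyms, then brand names) can be sketched as a simple lookup cascade. The function name and the dictionary inputs below are hypothetical; real matching would additionally normalize salts and spelling variants:

```python
def map_drug_name(name, generic, synonyms, brands):
    """Resolve a drug name to a DrugBank ID by trying generic names,
    then synonyms, then brand names. All three lookup tables are
    hypothetical {lower-cased name -> DrugBank ID} dicts."""
    key = name.strip().lower()
    for table in (generic, synonyms, brands):
        if key in table:
            return table[key]
    return None  # unmapped; would fall through to manual curation
```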
We defined and computed five drug–drug similarity measures and two disease–disease similarity measures. Three of the drug–drug similarities (3–5) are gene related, being based on the drug targets downloaded from DrugBank. For drugs associated with more than one gene, the maximal similarities between the associated genes were averaged (averaging over the contribution of each member of a drug pair for symmetry). All similarity measures were normalized to lie in the range (0, 1).
We used the following drug–drug similarity measures:
(1) Chemical based:
Canonical SMILES (Weininger, 1988) of the drug molecules were downloaded from DrugBank (Wishart et al, 2008). Hashed fingerprints were computed using the Chemistry Development Kit with default parameters (Steinbeck et al, 2006). The similarity score between two drugs is computed based on their fingerprints according to the two-dimensional Tanimoto score (Tanimoto, 1957), which is equivalent to the Jaccard score (Jaccard, 1908) of their fingerprints, that is, the size of the intersection over the size of the union when viewing each fingerprint as specifying a set of elements.
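Viewing each hashed fingerprint as the set of its on-bit positions, the Tanimoto/Jaccard score reduces to set arithmetic. A minimal sketch (the bit positions below are invented; real fingerprints would come from a cheminformatics toolkit such as the CDK):

```python
def tanimoto(fp_a, fp_b):
    """Two-dimensional Tanimoto (= Jaccard) score between two hashed
    fingerprints, each represented as the set of its on-bit positions."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Hypothetical on-bit positions for two drugs:
fp1 = {1, 2, 3, 5}
fp2 = {2, 3, 5, 8}
print(tanimoto(fp1, fp2))  # 3 shared bits / 5 total bits = 0.6
```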
(2) Side effect based:
Drug side effects were obtained from SIDER (Kuhn et al, 2010), an online database containing drug–side effect associations extracted from package inserts using text-mining methods. We augmented this list with side-effect predictions, based on chemical properties, for drugs not included in SIDER (Atias and Sharan, 2011). Following this latter work, we defined the similarity between two drugs as the Jaccard score between their known side effects, or between their top 10 predicted side effects when the known ones are unavailable.
(3) Sequence based:
Based on a Smith–Waterman sequence alignment score (Smith et al, 1985) between the corresponding drug-related genes. Following the normalization suggested in Bleakley and Yamanishi (2009), we divide the Smith–Waterman score by the geometric mean of the scores obtained from aligning each sequence against itself.
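A toy illustration of this normalization, using a flat match/mismatch scoring scheme in place of the substitution matrices that would be used for real protein sequences:

```python
import math

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score with a linear gap penalty.
    (Real protein comparisons would use a substitution matrix instead
    of a flat match/mismatch score.)"""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

def normalized_sw(a, b):
    """Normalization of Bleakley and Yamanishi (2009): divide by the
    geometric mean of the two self-alignment scores."""
    return smith_waterman(a, b) / math.sqrt(
        smith_waterman(a, a) * smith_waterman(b, b))
```

By construction, a sequence aligned against itself gets a normalized score of exactly 1.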
(4) Closeness in a PPI network:
The distances between each pair of drug-related genes were calculated over their corresponding proteins using an all-pairs shortest paths algorithm on the human PPI network. Distances were transformed to similarity values using the formula described in Perlman et al (2011): S(p, p′) = A × e^(−b·D(p, p′)), where S(p, p′) is the computed similarity value between two proteins, D(p, p′) is the shortest path between these proteins in the PPI network, and A and b were chosen according to Perlman et al (2011) to be 0.9 × e and 1, respectively. Self-similarity was assigned a value of 1.
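Assuming an unweighted PPI network stored as an adjacency dict, the distance-to-similarity transform can be sketched as follows (assigning similarity 0 to unreachable pairs is an assumption of this sketch):

```python
import math
from collections import deque

A, B = 0.9 * math.e, 1.0  # the parameters A and b of Perlman et al (2011)

def shortest_path_len(adj, src, dst):
    """BFS shortest-path length in an unweighted PPI network given as an
    adjacency dict; returns None if dst is unreachable from src."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None

def ppi_similarity(adj, p, q):
    """Distance-to-similarity transform A * exp(-b * D); self-similarity
    is 1, and unreachable pairs get 0 (an assumption of this sketch)."""
    if p == q:
        return 1.0
    d = shortest_path_len(adj, p, q)
    return 0.0 if d is None else A * math.exp(-B * d)
```

With these parameters, direct interactors (D = 1) get a similarity of exactly 0.9, decaying by a factor of e per additional hop.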
(5) GO based:
Semantic similarity scores between drug-related genes were calculated according to Resnik (1999), using the csbl.go R package (Ovaska et al, 2008) with the option to use all three ontologies selected.
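A minimal Python sketch of the Resnik measure over a toy ontology (csbl.go is an R package; the term names and probabilities below are invented for illustration):

```python
import math

def ancestors(term, parents):
    """All ancestors of a GO term (the term itself included) in the DAG."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def resnik(t1, t2, parents, prob):
    """Resnik (1999) similarity: the information content -log p(a) of the
    most informative common ancestor a of the two terms, where p(a) is
    the annotation frequency of a in the corpus."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    if not common:
        return 0.0
    return max(-math.log(prob[a]) for a in common)
```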
For diseases, we employed two sets of measures, depending on whether we wished to exploit phenotypic information (measures 1–2) or gene signature information (measures 3–6). As with the drug similarities, maximal values between the two lists of associated genes were averaged (taking both sides into account for symmetry). The disease–disease similarity measures we used include:
(1) Phenotype based:
We used the phenotypic similarity constructed by van Driel et al (2006). The phenotypic similarity was constructed by identifying similarity between MeSH terms (Lipscomb, 2000) appearing in the medical descriptions of diseases in the OMIM database (Hamosh et al, 2002).
(2) Semantic phenotypic similarity:
We used the hierarchical structure of the HPO (Robinson and Mundlos, 2010), together with the mapping provided by HPO between ontology nodes and OMIM diseases, to construct a semantic similarity score based on Resnik (1999). Applying this semantic similarity to the HPO data was previously shown to provide consistent clustering of diseases (Robinson et al, 2008).
(3) Genetic based: Given genetic signatures of diseases obtained from gene expression experiments, we used a Jaccard score between every pair of signatures, taking into account the direction of the response of each gene; that is, the total number of mutually upregulated genes plus mutually downregulated genes, over the unified list of all signature genes. Signature genes with inconsistent regulation directionality for the same disease (i.e., registered as both upregulated and downregulated across various experiments for the same disease) were filtered out, allowing for up to 10% expression measurement errors.
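The direction-aware Jaccard score for a pair of signatures can be sketched as set arithmetic over the up- and downregulated gene sets:

```python
def signature_similarity(up_a, down_a, up_b, down_b):
    """Direction-aware Jaccard score between two disease signatures:
    (shared upregulated + shared downregulated genes) divided by the
    union of all genes appearing in either signature."""
    shared = len(up_a & up_b) + len(down_a & down_b)
    union = len(up_a | down_a | up_b | down_b)
    return shared / union if union else 0.0
```

For example, two signatures that agree on one upregulated and one downregulated gene out of four distinct genes overall score 2/4 = 0.5.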
(4–6) Signature sequence based, signature PPI based and signature GO based: Similar to the drug–drug sequence-, PPI- and GO-based similarities, respectively, we used these similarities between the genes participating in the two signatures.
Combining measures into classification features
The classification features that we used were constructed from association scores calculated on pairs of drug–drug and disease–disease similarity measures, resulting in 10 features when using the phenotypic disease similarities and 20 features when using the signature-based similarities. For a given similarity measure pair (i.e., feature), the score of a given drug–disease association (dr, di) is calculated by considering the similarity, according to the given pair, of all known drug–disease associations to this association. The computation is done as follows: first, for each known association (dr′, di′) we compute the drug–drug similarity S(dr, dr′) and the disease–disease similarity S′(di, di′). Next, we follow the method of Perlman et al (2011) to combine the two similarities into a single score by computing their weighted geometric mean. Here, we chose to use a simple (unweighted) geometric mean, as the resulting AUC score was robust to the weighting parameter over a wide range (AUC difference <0.01, data not shown). Thus:

Score(dr, di) = max over known associations (dr′, di′) ≠ (dr, di) of √(S(dr, dr′) × S′(di, di′))
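A sketch of this feature computation, under the assumption that the per-candidate score aggregates over the known associations by taking the maximum geometric mean (the similarity matrices and their dict-of-dicts layout are illustrative):

```python
import math

def feature_score(dr, di, known, drug_sim, disease_sim):
    """Feature value for a candidate pair (dr, di): the best geometric
    mean of drug-drug and disease-disease similarity over all known
    drug-disease associations. Aggregation by maximum is an assumption
    of this sketch."""
    best = 0.0
    for dr2, di2 in known:
        if (dr2, di2) == (dr, di):
            continue  # exclude the pair itself when scoring a known pair
        s = math.sqrt(drug_sim[dr][dr2] * disease_sim[di][di2])
        best = max(best, s)
    return best
```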
We used a 10-fold cross-validation scheme to evaluate the accuracy of our prediction algorithm. To concentrate on ‘hard' cases, we hid 10% of the drugs in each iteration rather than 10% of the associations. The training set used for the cross-validation included 1933 true drug–disease associations and a randomly generated set of drug–disease pairs (not part of the positive set), twice as large as the positive set. To obtain robust AUC score estimates, we performed 100 independent cross-validation runs, in each of which a different random partition of the training set into 10 parts was used; we then averaged the resulting AUC scores. We note that taking a negative set of the same size as the positive set, three-fold larger or even 50 times as large had a negligible effect on the resulting AUC score (<0.01). In order to test the effect of redundant drugs on prediction accuracy, we created non-redundant sets of drugs filtered by three different methods: (i) chemical similarity above a Tanimoto coefficient ranging from 0.85 to 0.4; (ii) normalized sequence similarity between their targets ranging from 1 to 0.7; and (iii) target sharing. To this end, we iteratively selected the most similar pair and randomly removed one of the pair's drugs. We repeated this procedure 50 times for each similarity threshold to construct different non-redundant sets and averaged over the resulting AUC scores (reporting also the AUC standard deviation).
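The drug-wise fold construction (hiding whole drugs rather than individual associations) can be sketched as follows; the round-robin fold assignment is an illustrative choice:

```python
import random

def drugwise_folds(pairs, n_folds=10, seed=0):
    """Split drug-disease pairs into cross-validation folds by DRUG, so
    that each held-out fold hides ALL associations of roughly 1/n_folds
    of the drugs (the 'hard' cases), not a random 1/n_folds of pairs."""
    drugs = sorted({dr for dr, _ in pairs})
    random.Random(seed).shuffle(drugs)          # deterministic shuffle
    fold_of = {dr: i % n_folds for i, dr in enumerate(drugs)}
    folds = [[] for _ in range(n_folds)]
    for dr, di in pairs:
        folds[fold_of[dr]].append((dr, di))
    return folds
```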
To evaluate the benefit of using a feature selection scheme, we employed both forward feature selection and backward feature elimination in a cross-validation setting to select the best-performing features. We found that the difference between the best feature set according to each feature selection method and using all the features was negligible (AUC difference <0.005). We thus retained the entire set of features.
We used the MATLAB implementation of the logistic regression classifier (the glmfit function with a binomial distribution and a logit link function) for the prediction task.
Comparison with other methods
We implemented the GBA method of Chiang and Butte (2009) and applied it to the set of drugs and diseases appearing in our data. Since it cannot be tested using a cross-validation scheme that involves the removal of entire drugs, we tested it by removing associations instead.
In order to compare our results with the CMap, we obtained disease gene signatures from the Gene Expression Atlas of ArrayExpress (Parkinson et al, 2009). We mapped the ArrayExpress disease titles to UMLS concepts using the same filtering procedures described above. These concepts were then matched to the UMLS concepts mapped from OMIM diseases. We queried the online CMap web tool with the signatures and, following the CMap recommendation, selected drugs obtaining negative enrichment (i.e., the signature has the opposite effect to the drug expression profile) and a non-null P-value. An ROC curve was obtained by choosing different P-value cutoffs.
Hu and Agarwal (2009) suggested a method similar to CMap, whereby they predicted drug–disease associations based on disease and drug profiles downloaded from GEO. The authors' published set of predictions contains 43 profiles tagged as diseases or infections and 45 profiles tagged as agents. We mapped disease names to OMIM identifiers using the same procedure used for constructing the gold standard, resulting in 17 mapped OMIM diseases. To map drug names, we matched generic, synonymous and brand names from DrugBank, followed by additional manual mapping of unmapped names, obtaining 13 drugs successfully mapped to DrugBank entries. However, only five diseases and five drugs, connected by 10 predicted associations (out of the 17 diseases and 13 drugs), intersected our collection, preventing a proper comparison with this method.
To predict novel indications for drugs, we used a training set that included all the known associations and a two-fold larger randomly generated set of drug–disease pairs that are not known to be associated (i.e., associations that do not appear in our drug indication gold standard). We applied the trained classifier to all remaining drug–disease pairs to form our prediction set. We repeated the analysis with another randomly picked negative set, distinct from the first one, to assign prediction scores also to the random negative set that we initially used for training. Overall, we obtained classification scores for all 183 676 unknown drug–disease pairs.
We validated the predictions in two ways: (i) by comparing predicted drug–disease associations with drug–disease associations being investigated in clinical trials (phases I–IV) and (ii) by testing their agreement with tissue-specific gene expression information—that is, by comparing the tissues where the drug targets are expressed with tissues associated with the diseases predicted to be treated by the drug. All P-values were computed using a hypergeometric test.
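The hypergeometric (upper-tail) P-value used in these validations can be computed directly with the standard library; scientific libraries provide equivalent routines:

```python
from math import comb

def hypergeom_pval(k, K, n, N):
    """Upper-tail hypergeometric P-value: the probability of observing
    at least k successes when drawing n items without replacement from
    a population of N items containing K successes."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / total
```

For example, drawing all 5 marked items when sampling 5 out of 10 has probability 1/C(10, 5) = 1/252.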
For the clinical trial validation, we downloaded phases I–IV clinical trials from http://clinicaltrials.gov/ and mapped the disease names to UMLS concepts using the MetaMap tool, and the drug names to DrugBank entries, using the same procedures described above. For the tissue-related validation, we assembled two drug–tissue association matrices. The first matrix was constructed based on the tissue-specific gene expression data of Su et al (2004). A drug was declared to affect a tissue if at least one of its targets is expressed in that tissue above a certain tunable threshold (see below). The second drug–tissue matrix was assembled by exploiting the disease–tissue associations computed by Lage et al (2008), who scored disease–tissue associations according to co-mentioning of diseases and tissues across PubMed (Korbel et al, 2005). Thus, a drug was associated with a tissue if at least one of its indicated diseases was associated with that tissue. We manually matched the tissue names specified by Su et al (2004) to the tissue names specified by Lage et al (2008), resulting in 67 common tissues.
Both the gene expression–tissue associations and the disease–tissue associations require thresholds to define true associations. In order to decouple the prediction task from tuning these thresholds, we randomly split the gold standard set of drugs into two disjoint groups (two thirds and one third), providing predictions based only on the larger group and using the smaller group for tuning the thresholds. We used the Wilcoxon rank-sum test to test whether the drug targets are significantly expressed in tissues related to the diseases these drugs are predicted to treat. According to the learned thresholds, genes with an expression value <380 (interestingly, corresponding to the average expression over the entire data set) were declared unassociated with a given tissue, and diseases with a co-occurrence score <18 for a given tissue were declared unrelated to that tissue. During learning, the Wilcoxon rank-sum test yielded a P-value of 0.003 on the smaller group of known drug indications. Using the same tuned thresholds on the predictions made using the second group, we obtained a significant P-value of 0.006.