Details relating to DrugBank's overall design, querying capabilities, curation protocols, quality assurance and drug selection criteria have been described previously (8). These have largely remained the same between release 1.0 and 2.0. Here, we focus primarily on the changes and enhancements made to the database and to the annotation processes for release 2.0. More specifically, we describe: (i) enhancements to DrugBank's size and coverage; (ii) expanded database linkages; (iii) data field additions; (iv) improvements in data querying and data viewing and (v) improvements to DrugBank's data handling processes.
Expanded database size and coverage
A detailed content comparison between DrugBank release 1.0 and DrugBank release 2.0 is provided in the accompanying table. As seen there, the latest release of DrugBank now has detailed information on 1467 FDA-approved drugs corresponding to 28 447 brand names and synonyms. This represents an expansion of nearly 60% over what was previously contained in the database. The latest DrugBank release also includes 123 biotech (peptide or protein) drugs and 69 nutraceuticals (nutritional supplements), which corresponds to an increase of ~10% over the previous DrugBank release. While many of these additions represent newly approved drugs (about 50 new drugs are approved each year), a number of the new entries are little-known, hard-to-find or infrequently prescribed drugs that are not contained in most drug databases. To the best of our knowledge, DrugBank now contains all (or almost all) drugs that have been approved in North America, Europe and Asia. In addition, DrugBank's collection of experimental or unapproved drug (or drug-like) compounds, which is primarily derived from the PDB's Ligand database, has expanded to include 3116 compounds, compared to 2896 compounds in the first release. We are pleased to note that these experimental drugs have now been more completely annotated, via BioSpider (14), than in the previous DrugBank release.
Comparison between the data content in DrugBank (release 1.0) versus DrugBank (release 2.0)
In response to many user requests, we have also added two new drug categories: (i) Withdrawn drugs and (ii) Illicit drugs. Withdrawn drugs are those that have been withdrawn from the market or certain market segments due to safety concerns (such as Vioxx and Bextra). Illicit drugs include those that are legally banned or selectively banned in most developed nations (such as cocaine and heroin). Chemical, pharmaceutical and biological information about these classes of drugs is extremely important, not only in understanding their adverse reactions, but also in being able to predict whether a new drug entity may have unexpected chemical or functional similarities to a dangerous drug. The number of drugs in the ‘Withdrawn’ category is 57, while the number of drugs in the ‘Illicit’ category is 188. The same level of drug, drug target and drug action information has been collected for these drugs as for all other entries in DrugBank. If one counts all drug entries in DrugBank (FDA-approved, Experimental, Biotech, Nutraceutical, Withdrawn, Illicit), the total number of drugs or drug-like molecules comes to 4897, which represents an increase of 25% over the previous release.
The number (and coverage) of identified drug targets has increased significantly in this release, with 1565 non-redundant protein/DNA targets identified for FDA-approved drugs compared to 524 non-redundant targets in release 1.0. The identification of so many more targets was aided by PolySearch (http://wishart.biology.ualberta.ca/polysearch/), a text-mining tool developed in our laboratory to facilitate these kinds of searches. Additional details about PolySearch appear later in this article. All of these newly identified protein targets are fully referenced, with an average of four PubMed citations each.
Of particular interest to many is DrugBank's list of drug targets. Several other drug target lists have been compiled or presented including those in TTD (3), as well as others by Hopkins et al., Drews and Ryser (16), Imming et al. and Overington et al. These report 578 molecular targets (out of 1512 total targets including disease and organism targets), 248 protein targets (out of 399 molecular targets), 483 molecular targets, 218 molecular targets and 324 molecular targets, respectively. DrugBank's list of drug targets is 3–4 times larger than these. The primary reasons are: (i) DrugBank has a much larger collection of small molecule drugs (approximately two times larger than any other resource), (ii) DrugBank includes biotech drugs and nutraceuticals (which average 5–10 unique target proteins per drug), (iii) most other drug target lists only include a single ‘primary’ target rather than all targets that have been found to have physiological or pharmaceutical effects, (iv) DrugBank fully accounts for the fact that many drug targets are protein complexes composed of multiple subunits or combinations of subunits and (v) DrugBank annotators identify molecules as drug targets if they play a critical role in the transport, delivery or activation of the drug.
As a general rule, when more than one drug target is listed in DrugBank, the ordering of the drug targets corresponds ‘approximately’ to their order of physiological effect or their importance regarding the drug's therapeutic indication(s).
Expanded database linkages
DrugBank contains extensive links to almost all major bioinformatics and biomedical databases (GenBank, SwissProt/UniProt, PDB, ChEBI, KEGG, PubChem and PubMed). It also contains links to numerous drug and pharmaceutical databases (RxList, PharmGKB and FDA labels). Over the past year, DrugBank has also been reciprocally linked by SwissProt/UniProt, Wikipedia, BioMOBY (19) and PubChem (October 2007). Because of DrugBank's appeal as an educational or public information resource, we are actively seeking to expand these reciprocal linkages with other databases and online resources. For example, all drug entries in Wikipedia are now linked to DrugBank and most drug ‘fact boxes’ in Wikipedia are actually generated from DrugBank tables. For the latest release of DrugBank, several new database links have been added including hyperlinks to Wikipedia, PDRHealth, the Drug Product Database (DPD), the HUGO Gene Nomenclature Committee (HGNC), GeneCards (20) and GeneAtlas (21).
Data field additions
As seen in the content comparison table, DrugBank now contains 107 data fields, compared to 88 data fields in release 1.0. Some of these data fields have arisen to facilitate cataloging, but most have been added in response to user needs and user requests. Specifically, these new data fields include: (i) a primary accession number; (ii) a secondary accession number; (iii) drug synonyms; (iv) a compound description; (v) drug brand names; (vi) SwissProt name (if the drug is a peptide/protein drug); (vii) monoisotopic molecular weight; (viii) isomeric SMILES string; (ix) water solubility predicted via ALOGPS (22); (x) LogP predicted via ALOGPS; (xi) Caco-2 permeability; (xii) experimental water solubility (LogS); (xiii) drug–drug interactions; (xiv) food–drug interactions; (xv) Human Protein Reference Database ID; (xvi) HGNC ID; (xvii) GeneCards ID and (xviii) GeneAtlas ID. A total of 194 experimental LogS values and 82 experimental Caco-2 permeability values were obtained from the UCSD ADME databases (23). These values, along with the structural and physico-chemical data in DrugBank, are particularly useful for computational ADMET (Absorption, Distribution, Metabolism, Excretion and Toxicity) prediction. Additionally, 714 food–drug interactions and 13 242 drug–drug interactions were compiled (through a variety of web and textbook resources), checked by an accredited pharmacist and entered manually. As far as we are aware, these drug–drug and food–drug compilations represent the most complete, publicly accessible collections of their kind. This interaction information is particularly useful for physicians, pharmacists and patients. However, it is also of increasing interest to those involved in pharmacogenomics and nutrigenomics.
Enhanced querying and viewing capabilities
A key feature that distinguishes DrugBank from other online drug resources is its extensive support for higher level database searching and selecting functions. In addition to standard data viewing and sorting features, DrugBank also offers a generic text search, a local BLAST search (SeqSearch), a higher level Boolean text search (TextQuery), a chemical structure search utility (ChemQuery) and a relational data extraction tool (Data Extractor). Each of these search utilities has a number of useful bioinformatics or cheminformatic applications, many of which were described in the first DrugBank publication (8). For the latest release of DrugBank, we have added a number of improvements to both the generic text search and ChemQuery. In particular, the generic text search has been enhanced so that users now have the option of clicking on check boxes to limit their search to a drug's common name, its synonyms/brand names or all text fields. Because the vast majority of queries to DrugBank are related to drug names/synonyms, the default query always has the first two of these boxes checked. Users wishing to search through the other 100+ data fields in DrugBank can select the ‘all text fields’ box. This change has also substantially improved the query response times for most DrugBank text searches.
A screenshot montage of some of DrugBank's new or modified querying tools including ChemQuery, TextQuery and an example of the new generic text query output.
Because the spelling of many drug names, chemical compound names and protein names is often difficult or non-intuitive, DrugBank now supports an ‘intelligent’ text search, where alternative spellings to misspelled or incompletely entered names are automatically provided. In addition to this change, the results from text queries have also been enhanced so that the standard tabular output (primary accession number, generic drug name, chemical formula and molecular weight) is supplemented with the query word highlighted in the selected DrugCard field(s) from which it was retrieved.
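The idea of offering alternative spellings can be illustrated with a minimal sketch. It is not DrugBank's implementation, which is not described here; it assumes a small in-memory name list (`DRUG_NAMES`, hypothetical) and uses the approximate string matching in Python's standard `difflib` module.

```python
import difflib

# Hypothetical name index for illustration only; a real system would load
# generic names, brand names and synonyms from the database.
DRUG_NAMES = ["acetaminophen", "acetazolamide", "ibuprofen", "rofecoxib", "valdecoxib"]

def suggest_spellings(query, names=DRUG_NAMES, limit=3, cutoff=0.6):
    """Return the query if it is a known name, otherwise close alternatives."""
    normalized = query.strip().lower()
    if normalized in names:
        return [normalized]
    # get_close_matches ranks candidates by a similarity ratio in [0, 1]
    # and drops anything scoring below the cutoff.
    return difflib.get_close_matches(normalized, names, n=limit, cutoff=cutoff)

if __name__ == "__main__":
    print(suggest_spellings("ibuprofin"))
    print(suggest_spellings("Rofecoxib"))
```

The cutoff of 0.6 is simply the `difflib` default; a production search would likely tune it and index synonyms as well.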
To accommodate a variety of user requests and preferences, the ChemQuery tool has been modified for release 2.0 to allow two different types of chemical drawing applets to be used: the MarvinSketch (http://www.chemaxon.com) structure drawing tool (new) and the ACD structure drawing tool (old). The MarvinSketch applet is somewhat more intuitive and easier to use, while the ChemSketch (ACD) applet is somewhat more complex but offers more structural drawing options. The default ChemQuery tool for this release is the MarvinSketch applet. DrugBank's structure querying capabilities have also been enhanced with the addition of a ‘Show Similar Structure(s)’ button located at the top of every DrugCard. This allows users to rapidly search for structurally similar small molecules, without having to redraw the molecule and search the database through the ChemQuery interface. Users can also limit their structure similarity search to selected DrugBank subdatabases (Approved drugs, Nutraceuticals, Illicit drugs, etc.) through a pull-down menu located by the ‘Show Similar Structure(s)’ button. Both ‘Show Similar Structure(s)’ and ChemQuery use a locally developed SMILES string comparison method to identify related structures and to perform structure similarity searches. All structures are converted to SMILES strings and a substring-matching program (similar to BLAST) is used to identify similar structures. The scoring scheme is based simply on the number of character matches for the longest matching substring.
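The scoring idea described above (the length of the longest matching substring between two SMILES strings) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the locally developed program itself: it uses a plain dynamic-programming longest-common-substring routine rather than a BLAST-like heuristic, and the function names and the small example library are hypothetical.

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the match that ended at the previous character pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def rank_similar(query_smiles, library):
    """Rank library entries (name -> SMILES) by longest shared substring."""
    scored = [(longest_common_substring(query_smiles, smi), name)
              for name, smi in library.items()]
    # Highest score first; ties are broken alphabetically by name.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

if __name__ == "__main__":
    aspirin = "CC(=O)Oc1ccccc1C(=O)O"
    library = {"salicylic acid": "OC(=O)c1ccccc1O", "ethanol": "CCO"}
    for score, name in rank_similar(aspirin, library):
        print(name, score)
```

Because one molecule can be written as many different SMILES strings, a comparison of this kind is only meaningful if every structure is written out by the same canonicalisation procedure; the sketch assumes that has already been done.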
Improved data handling (entry, export and annotation)
For most of the past 5 years, DrugBank has existed as a series of text files that were manually edited or flat files that were populated by writing Perl scripts to reformat existing text to the DrugBank file format. Most of the annotation in DrugBank (release 1.0) was assembled, entered and validated manually. With the rapid growth in the size and scope of DrugBank, along with the continuing need for updates, we have had to become far more efficient in our data management. Specifically, we have had to streamline our methods for data entry, data export and database annotation. However, we have continued to maintain the same rigorous standards for manual data validation.
To facilitate manual data entry and export for release 2.0, we have developed a customized scientific data management system (SDMS) called DrugBank–SDMS. This web-enabled database system was built using the open source Ruby-on-Rails web application framework. This SDMS overlays a MySQL database that contains all of the DrugBank data. The publicly viewable version of DrugBank is directly linked to the DrugBank–SDMS such that every night the SDMS data is automatically exported to the DrugBank server. This ‘near synchrony’ between the SDMS and DrugBank allows our database annotators to remotely access the SDMS, to add data, to check entries or to make corrections in real time, without the need to write (or wait for) custom Perl scripts for data uploads. The use of an SDMS also allows for more extensive error checking. This is done both at the time of entry (via automated format and spelling checks) and later (once a week), through the use of a ‘sanity checker’ (Supplementary Table 1) that checks the consistency of chemical structure files, chemical formulae and chemical properties using a variety of custom-built prediction and file-formatting programs (8). The development of a custom SDMS has also facilitated the export of publicly downloadable DrugBank files. In particular, our SDMS allows rapid generation of all of DrugBank's flat file (text) downloads and facile creation of XML-formatted DrugBank files, all of which are available through DrugBank's download link.
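One kind of consistency check mentioned above, agreement between a chemical formula and a stored chemical property, can be illustrated with a short sketch. It is not the DrugBank ‘sanity checker’ itself: the record layout (`formula`, `weight`), the tolerance and the abbreviated table of average atomic masses are assumptions made for the example, and only simple formulae without parentheses are handled.

```python
import re

# Abbreviated table of average atomic masses (g/mol); a real checker would
# cover the full periodic table.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999,
               "S": 32.06, "F": 18.998, "Cl": 35.45}

def formula_weight(formula):
    """Average molecular weight of a simple formula such as 'C9H8O4'."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unparseable formula: {formula!r}")
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in ATOMIC_MASS:
            raise ValueError(f"element not in table: {symbol}")
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total

def check_entry(entry, tolerance=0.05):
    """Return a list of problems found in one record (empty if consistent)."""
    try:
        expected = formula_weight(entry["formula"])
    except ValueError as exc:
        return [str(exc)]
    if abs(expected - entry["weight"]) > tolerance:
        return [f"weight {entry['weight']} does not match formula ({expected:.3f})"]
    return []

if __name__ == "__main__":
    print(check_entry({"formula": "C9H8O4", "weight": 180.16}))
    print(check_entry({"formula": "C9H8O4", "weight": 151.16}))
```

Checks of this form are cheap to run in batch, which fits the weekly schedule described above; flagged records would still need to be reviewed by an annotator.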
To improve our manual annotation efficiency and coverage, the programming staff at DrugBank has developed several automated text and web-mining tools including BioSpider (14) and PolySearch. BioSpider is a web spider that automatically gathers biological, chemical and pharmacological data from approximately 30 trusted, content-rich web sites using only a compound name, SMILES string or Chemical Abstract Service (CAS) number as input. It then combines this data with a variety of in-house molecular structure and property prediction tools to generate data tables that correspond to many of the data fields in DrugBank. BioSpider allows many of the tedious, error-prone or repetitive annotation activities in DrugBank to be handled by a computer, allowing our annotation team to concentrate on higher level annotation tasks (such as gathering data on pharmacology, mechanism of action, metabolism or drug interactions). BioSpider has been extensively evaluated (14) and has been found to perform much better and much faster than skilled human annotators in these low-level annotation tasks. To complement BioSpider's role in low-level annotation, we have also developed PolySearch to enhance higher level annotation and research. PolySearch is a text-mining tool designed to mine data from abstracts in PubMed. It is similar in concept and design to EBIMed (25) and MedGene (26), but has been modified to facilitate the extraction of informative sentences or informative abstracts related to drugs, drug targets, drug metabolites, diseases, proteins and drug–protein interactions. PolySearch is used as an adjunct to our manual annotation efforts and has greatly aided the identification of numerous or little-known drug targets.
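The notion of extracting ‘informative sentences’ can be illustrated with a simple sentence-level co-occurrence filter. This is a generic sketch and should not be read as PolySearch's algorithm, which is not detailed here; the function names, the naive sentence splitter and the placeholder text (`DrugX`, `ProteinA`, etc.) are all invented for the example.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by space."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def cooccurring_sentences(abstract, drug, candidates):
    """Map each candidate term to the sentences mentioning it with the drug."""
    hits = {}
    drug_re = re.compile(r"\b" + re.escape(drug) + r"\b", re.IGNORECASE)
    for sentence in split_sentences(abstract):
        if not drug_re.search(sentence):
            continue  # only sentences that name the drug are of interest
        for term in candidates:
            if re.search(r"\b" + re.escape(term) + r"\b", sentence, re.IGNORECASE):
                hits.setdefault(term, []).append(sentence)
    return hits

if __name__ == "__main__":
    # Placeholder text, not taken from any real abstract.
    text = ("DrugX binds ProteinA in vitro. ProteinB levels were unchanged. "
            "DrugX did not affect ProteinC, but DrugX reduced ProteinA activity.")
    for term, sentences in cooccurring_sentences(
            text, "DrugX", ["ProteinA", "ProteinB", "ProteinC"]).items():
        print(term, len(sentences))
```

In the placeholder text, ProteinC is returned even though its sentence reports no effect. Co-occurrence alone does not establish a drug–target relationship, which is one reason output of this kind serves only as an adjunct to manual annotation.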
All textual data acquired from the BioSpider and PolySearch annotation programs are manually inspected by a minimum of two individuals, with at least one individual having an MD or a life science PhD. Additional spot checks are routinely performed on each entry by senior members of the curation group, including a physician, an accredited pharmacist and two PhD-level biochemists. While most information listed in the ‘Drug Description’, ‘Pharmacology’, ‘Mechanisms of Action’, ‘Half Life’, ‘Biotransformation Data’, ‘Protein Binding’, ‘Toxicity’, ‘Absorption’ and ‘Indications’ data fields is manually entered, those entries that are acquired from our automated annotation tools are all manually verified and edited (or rewritten) for readability and consistency. All PolySearch-derived drug target data, in particular, have been verified through multiple text sources (PubMed, drug references, online sequence databases, online drug databases and FDA labels) by at least two members of the DrugBank curatorial staff. Drugs with near-identical structures and modes of action are cross-checked to ensure that their drug target lists are nearly identical. In addition to these manual checks, nearly 40 automated data consistency checks are performed to ensure a uniformly high level of data integrity (Supplementary Table 1). Even with these added checks and references, we still recommend that users carefully study the data sources before making decisions based on the data.