The study of how drugs affect cellular and physiological processes has been aided by major advances in genomic technologies. One such advance is the development of RNA gene expression microarrays1. Ten years after their invention, these arrays are commonly used in biomedical research, primarily because they allow the expression of tens of thousands of genes to be measured quantitatively and simultaneously. The high-throughput capability of gene expression arrays makes them particularly attractive for pharmacogenomic studies.
Two recent studies highlight the utility of gene expression arrays for chemical and therapeutic discoveries. In one study, Stegmaier et al.2 used gene expression signatures to develop a high-throughput screening assay for 1,660 chemical compounds involved in inducing terminal differentiation in cellular models of acute myeloid leukemia (AML). In the other, Lamb et al. compiled the gene expression signatures of 164 small-molecule compounds, measured on 564 arrays, into a reference database. Query signatures from other drugs and diseases were then pattern-matched against the reference signatures to find “connections” among drugs, genes, and diseases. In this fashion, Lamb et al. successfully identified new mechanisms of action and new indications for existing drugs.
Although these two large-scale studies are innovative and high-impact, the vast resources they required preclude the participation of most laboratories. Given the rapidly increasing amount of gene expression data in international repositories, we propose automatic methods for identifying drug-related microarray experiments in gene expression databases by exploiting the semantic connections among these data resources. The data extracted using these methods could be used for meta-analysis and could enable the discovery of novel drug indications and classifications.
There is currently an abundance of public gene expression repositories. The Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI)4, ArrayExpress at the European Bioinformatics Institute5, and the Stanford Microarray Database (SMD) at Stanford University6 are a few examples of such publicly available databases. Unfortunately, annotations in all of these repositories are stored as free text, which makes the identification of desired experiments difficult. For our study, we chose to use NCBI GEO. As of this writing, GEO holds 108,371 samples from 5,037 experiment sets over 3,070 types of microarrays, and it is tripling in size annually.
One major drawback of GEO is the lack of a controlled vocabulary for describing the context of gene expression experiments: annotations are stored as free text. While these contextual annotations can be parsed to identify drugs and other experimental details, parsing is fraught with inaccuracy7. We previously showed that some GEO experiments are linked to a corresponding publication by a PubMed identifier, and that each of those publications is manually assigned Medical Subject Headings (MeSH) terms from a controlled vocabulary8. We hypothesize that enough annotations exist in GEO, MeSH, and the Unified Medical Language System (UMLS) to enable a comprehensive extraction of pharmacogenomic experiments in GEO. Our overall goal is to identify drug signatures that are shared among drugs.
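To make the linkage described above concrete, the following is a minimal sketch of the idea of following a GEO experiment to its publication and then to that publication's MeSH terms. It is an illustration under stated assumptions, not the method used in this work: the function name `find_drug_experiments`, the accessions, the PubMed identifiers, and the hand-picked set of drug-related terms are all hypothetical, and the records are held in memory rather than retrieved from GEO or PubMed.

```python
from typing import Dict, List, Optional, Set


def find_drug_experiments(
    geo_to_pmid: Dict[str, Optional[int]],
    pmid_to_mesh: Dict[int, Set[str]],
    drug_terms: Set[str],
) -> Dict[str, List[str]]:
    """Return experiments whose linked publication carries a drug MeSH term.

    Hypothetical helper: maps each experiment accession to the sorted list
    of drug-related MeSH terms found on its linked publication.
    """
    hits: Dict[str, List[str]] = {}
    for series, pmid in geo_to_pmid.items():
        if pmid is None:
            continue  # no linked publication, so no MeSH terms to use
        # Controlled-vocabulary terms allow exact set intersection,
        # which avoids free-text parsing of the experiment annotation.
        matched = pmid_to_mesh.get(pmid, set()) & drug_terms
        if matched:
            hits[series] = sorted(matched)
    return hits


# Toy records; accessions and PubMed identifiers are placeholders.
geo_to_pmid = {"GSE_A": 1001, "GSE_B": 1002, "GSE_C": None}
pmid_to_mesh = {
    1001: {"Tretinoin", "Leukemia, Myeloid, Acute", "Gene Expression Profiling"},
    1002: {"Gene Expression Profiling", "Liver"},
}
# Assumed to be a precomputed set of drug-related MeSH headings.
drug_terms = {"Tretinoin", "Doxorubicin"}

drug_experiments = find_drug_experiments(geo_to_pmid, pmid_to_mesh, drug_terms)
print(drug_experiments)  # {'GSE_A': ['Tretinoin']}
```

In this toy example only the first experiment is retained: the second has a linked publication with no drug-related term, and the third has no linked publication at all, which illustrates why only some GEO experiments can be reached through this route.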