Adverse drug reactions (ADRs) can have severe consequences, such that the ability to predict ADRs prior to market introduction is desirable. Computational approaches applied to pre-clinical data might be one way to inform drug labeling and marketing with respect to potential ADRs.
Based on the premise that some of the molecular actors of ADRs involve interactions detectable in large, and increasingly public, compound screening campaigns, we generated logistic regression models that correlate post-marketing ADRs with screening data from the PubChem BioAssay database. These models analyze ADRs at the level of organ systems, the System Organ Classes (SOCs). Nine of the 19 SOCs under consideration were found to be significantly correlated with pre-clinical screening data. For 6 of the 8 established drugs for which we could retropredict SOC-specific adversities, prior knowledge was found that supports these predictions. We conclude by predicting SOC-specific adversities for three unapproved or recently introduced drugs.
doi:10.1038/clpt.2011.81
PMCID: PMC3464971
PMID: 21613989
Adverse drug reactions; prediction; machine learning; compound screening; pharmacovigilance
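The logistic regression idea in the abstract above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two binary assay features, the toy labels, and the plain gradient-descent fit are invented and are not the authors' model or data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Hypothetical toy data: one row per drug, one column per screening assay
# (1 = active, 0 = inactive); y marks whether an ADR in one SOC was reported.
X = [[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [1, 1], [0, 0]]
y = [1, 1, 1, 0, 0, 0, 1, 0]

w, b = fit_logistic(X, y)
p_active = predict_proba(w, b, [1, 0])    # drug active in the first assay
p_inactive = predict_proba(w, b, [0, 0])  # drug inactive in both assays
```

In this toy setting only the first assay carries signal, so the fitted model assigns a higher ADR probability to drugs active in it. A real analysis would fit one such model per SOC and assess significance on held-out data.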
doi:10.1038/ng.2355
PMCID: PMC3593099
PMID: 22836096
Clinically recorded pain scores are abundant in patient health records but are rarely used in research. The use of this information could help improve clinical outcomes. For example, a recent report by the Institute of Medicine stated that ineffective use of clinical information contributes to under-treatment of patient subpopulations — especially women. This study used diagnosis-associated pain scores from a large hospital database to document sex differences in reported pain. We used de-identified electronic medical records from Stanford Hospital and Clinics for more than 72,000 patients. Each record contained at least one disease-associated pain score. We found over 160,000 pain scores in more than 250 primary diagnoses, and analyzed differences in disease-specific pain reported by men and women. After filtering for diagnoses with minimum encounter numbers, we found diagnosis-specific sex differences in reported pain. The most significant differences occurred in patients with disorders of the musculoskeletal, circulatory, respiratory and digestive systems, followed by infectious diseases, and injury and poisoning. We also discovered sex-specific differences in pain intensity in previously unreported diseases, including disorders of the cervical region, and acute sinusitis (p = 0.01, 0.017, respectively). Pain scores were collected during hospital encounters. No information about the use of pre-encounter over-the-counter medications was available. To our knowledge, this is the largest data-driven study documenting sex differences of disease-associated pain. It highlights the utility of EMR data to corroborate and expand on results of smaller clinical studies. Our findings emphasize the need for future research examining the mechanisms underlying differences in pain.
doi:10.1016/j.jpain.2011.11.002
PMCID: PMC3293998
PMID: 22245360
electronic medical records; sex differences; pain intensity; data mining
Background
Crohn’s disease (CD), an inflammatory disease of the bowel, affects millions of people around the world. Evidence suggests that disease onset and pathogenesis differ between males and females. Yet no comprehensive efforts exist to assess the sex-specific genetic architecture of CD.
Methods
We used genotyping data from a cohort of 1748 CD cases and 2938 controls to investigate 71 meta-analysis-confirmed CD risk loci for sex differences in disease risk. We further validated the significant results in separate cohorts of 968 CD cases and 2809 controls, and performed a meta-analysis across datasets.
Results
The SNP rs3792106 (C/T) in ATG16L1 showed a significant sex effect with p-value 6.9×10⁻¹³ and allelic odds ratio 1.48 in females, and p-value 0.013 and odds ratio 1.22 in males (odds ratio heterogeneity p-value 0.037). Surprisingly, the difference was found to arise from a discrepancy in allele frequencies between male and female controls (p-value 0.0045) rather than cases. We found similar results for this SNP in the separate validation data sets. Using 155 HapMap 3 trios, we detected significant maternal over-transmission of the T allele at rs3792106 (p-value 0.027).
Conclusions
Our results indicate that different transmission patterns between sexes may sustain the disparate allele frequencies at rs3792106 in healthy populations, and furthermore that a virus-risk variant mechanism implicated in CD alters the distribution in diseased patients. To our knowledge, this is the first report of sex-specific CD association in ATG16L1. The possible implications in Crohn’s disease and basic human biology present interesting areas for future investigation.
doi:10.1002/ibd.21781
PMCID: PMC3165065
PMID: 21618365
Inflammatory bowel disease; ATG16L1; transmission distortion; sexual dimorphism
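As a worked illustration of the quantities reported above, the sketch below computes an allelic odds ratio from a 2×2 table of allele counts and compares two odds ratios with a z-test on the difference of their logarithms. The allele counts are invented toy numbers, and the z-test is an assumed stand-in: the abstract does not state which heterogeneity test the authors used.

```python
import math

def allelic_odds_ratio(case_risk, case_other, ctrl_risk, ctrl_other):
    """Return the allelic odds ratio and the standard error of its log."""
    odds_ratio = (case_risk * ctrl_other) / (case_other * ctrl_risk)
    se_log = math.sqrt(1 / case_risk + 1 / case_other
                       + 1 / ctrl_risk + 1 / ctrl_other)
    return odds_ratio, se_log

def heterogeneity_test(or_a, se_a, or_b, se_b):
    """Two-sided z-test for a difference between two log odds ratios."""
    z = (math.log(or_a) - math.log(or_b)) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p

# Hypothetical allele counts (risk allele, other allele) for cases and controls.
or_f, se_f = allelic_odds_ratio(600, 400, 500, 500)  # females
or_m, se_m = allelic_odds_ratio(550, 450, 500, 500)  # males
z, p = heterogeneity_test(or_f, se_f, or_m, se_m)
```

With these toy counts the female odds ratio is 1.5 and the male odds ratio is about 1.22; the heterogeneity test asks whether that gap is larger than sampling noise would explain.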
Drug repositioning refers to the discovery of alternative uses for a drug that differ from its original intended indication. One challenge in these efforts lies in choosing the indication for which to prospectively test a drug of interest. We systematically evaluated a drug treatment-based view of diseases in order to address this challenge. Suggested novel drug uses were generated using a guilt-by-association approach. Compared with control drug uses, the suggested novel drug uses were significantly enriched in clinical trials.
doi:10.1038/clpt.2009.103
PMCID: PMC2836384
PMID: 19571805
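A guilt-by-association approach of the kind named above might look like the following sketch: two diseases that share enough treatments are treated as similar, and each then "inherits" the other's remaining drugs as suggested uses. The Jaccard similarity, the 0.3 threshold, and all disease and drug names are assumptions chosen for illustration; the abstract does not specify them.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of drugs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def suggest_uses(disease_drugs, min_similarity=0.3):
    """Suggest (drug, disease) pairs from diseases with similar treatments."""
    suggestions = set()
    names = sorted(disease_drugs)
    for d1 in names:
        for d2 in names:
            if d1 == d2:
                continue
            if jaccard(disease_drugs[d1], disease_drugs[d2]) >= min_similarity:
                # Drugs used for d2 but not yet for d1 become candidates for d1.
                for drug in disease_drugs[d2] - disease_drugs[d1]:
                    suggestions.add((drug, d1))
    return suggestions

# Hypothetical known drug uses.
known_uses = {
    "disease_A": {"drug1", "drug2", "drug3"},
    "disease_B": {"drug2", "drug3", "drug4"},
    "disease_C": {"drug5"},
}
suggestions = suggest_uses(known_uses)
```

Here disease_A and disease_B share two of four drugs (similarity 0.5), so drug4 is suggested for disease_A and drug1 for disease_B, while disease_C contributes nothing.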
doi:10.1038/nmeth1107-879
PMCID: PMC2716375
PMID: 17971777
Publicly available molecular datasets can be used for independent verification or investigative repurposing, but this depends on the presence, consistency and quality of descriptive annotations. Annotation and indexing of molecular datasets using well-defined controlled vocabularies or ontologies enables accurate and systematic data discovery, yet the majority of molecular datasets available through public data repositories lack such annotations. A number of automated annotation methods have been developed; however, few systematic evaluations of the quality of annotations supplied by these methods have been performed using annotations from standing public data repositories. Here, we compared manually-assigned Medical Subject Heading (MeSH) annotations associated with experiments by data submitters in the PRoteomics IDEntification (PRIDE) proteomics data repository to automated MeSH annotations derived through the National Center for Biomedical Ontology Annotator and National Library of Medicine MetaMap programs. These programs were applied to free-text annotations for experiments in PRIDE. As many submitted datasets were referenced in publications, we used the manually curated MeSH annotations of those linked publications in MEDLINE as the “gold standard”. Annotator and MetaMap exhibited recall performance 3-fold greater than that of the manual annotations. We connected PRIDE experiments in a network topology according to shared MeSH annotations and found 373 distinct clusters, many of which were found to be biologically coherent by network analysis. The results of this study suggest that both Annotator and MetaMap are capable of annotating public molecular datasets with a quality comparable to, and often exceeding, that of the annotations supplied by the actual data submitters, highlighting a continuing need to improve and apply automated methods to molecular datasets in public data repositories to maximize their value and utility.
doi:10.1016/j.jbi.2011.03.007
PMCID: PMC3155012
PMID: 21420508
Proteomics; Annotations; Ontologies; Concept Identification; Natural Language Processing; MEDLINE
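The two evaluation steps described above, recall against a gold standard and clustering of experiments by shared annotations, can be sketched as follows. The term sets and experiment identifiers are invented, and linking experiments that share at least one term is an assumed rule; the abstract does not give the exact network construction.

```python
from collections import deque

def recall(predicted, gold):
    """Fraction of gold-standard terms recovered by an annotation set."""
    return len(predicted & gold) / len(gold) if gold else 0.0

def cluster_by_shared_terms(annotations):
    """Connected components of a graph linking experiments that share a term."""
    ids = sorted(annotations)
    seen, clusters = set(), []
    for start in ids:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            current = queue.popleft()
            component.add(current)
            for other in ids:
                if other not in seen and annotations[current] & annotations[other]:
                    seen.add(other)
                    queue.append(other)
        clusters.append(component)
    return clusters

# Hypothetical annotation sets for one experiment.
gold = {"Proteomics", "Liver", "Humans", "Mass Spectrometry"}
manual_recall = recall({"Liver"}, gold)
automated_recall = recall({"Liver", "Humans"}, gold)

# Hypothetical experiments and their terms.
experiments = {
    "exp1": {"Liver", "Humans"},
    "exp2": {"Liver", "Mice"},
    "exp3": {"Plasma"},
}
clusters = cluster_by_shared_terms(experiments)
```

In this toy case exp1 and exp2 fall into one cluster through the shared term "Liver", and exp3 forms a cluster of its own.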
Summary: We introduce ProfileChaser, a web server that allows for querying the Gene Expression Omnibus based on genome-wide patterns of differential expression. Using a novel, content-based approach, ProfileChaser retrieves expression profiles that match the differentially regulated transcriptional programs in a user-supplied experiment. This analysis identifies statistical links to similar expression experiments from the vast array of publicly available data on diseases, drugs, phenotypes and other experimental conditions.
Availability: http://profilechaser.stanford.edu
Contact: abutte@stanford.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr548
PMCID: PMC3223361
PMID: 21967760
The volume of publicly available genomic scale data is increasing. Genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample. These annotations are not mapped to concepts in any ontology, making it difficult to integrate these datasets across repositories. We have previously developed methods to map text-annotations of tissue microarrays to concepts in the NCI thesaurus and SNOMED-CT.
In this work we generalize our methods to map text annotations of gene expression datasets to concepts in the UMLS. We demonstrate the utility of our methods by processing annotations of datasets in the Gene Expression Omnibus. We demonstrate that we enable ontology-based querying and integration of tissue and gene expression microarray data. We enable identification of datasets on specific diseases across both repositories. Our approach provides the basis for ontology-driven data integration for translational research on gene and protein expression data.
Based on this work we have built a prototype system for ontology-based annotation and indexing of biomedical data. The system processes the text metadata of diverse resource elements such as gene expression data sets, descriptions of radiology images, clinical-trial reports, and PubMed article abstracts to annotate and index them with concepts from appropriate ontologies. The key functionality of this system is to enable users to locate biomedical data resources related to particular ontology concepts.
doi:10.1186/1471-2105-10-S2-S1
PMCID: PMC2646250
PMID: 19208184
The application of established drug compounds to novel therapeutic indications, known as drug repositioning, offers several advantages over traditional drug development, including reduced development costs and shorter paths to approval. Recent approaches to drug repositioning employ high-throughput experimental approaches to assess a compound’s potential therapeutic qualities. Here we present a systematic computational approach to predict novel therapeutic indications based on comprehensive testing of molecular signatures in drug-disease pairs. We integrated gene expression measurements from 100 diseases and gene expression measurements on 164 drug compounds, yielding predicted therapeutic potentials for these drugs. We demonstrate the ability to recover many known drug and disease relationships using computationally derived therapeutic potentials, and also predict many new indications for these drugs. We experimentally validated a prediction for the anti-ulcer drug cimetidine as a candidate therapeutic in the treatment of lung adenocarcinoma, and demonstrate its efficacy both in vitro and in vivo using mouse xenograft models. This computational method provides a novel and systematic approach to reposition established drugs to treat a wide range of human diseases.
doi:10.1126/scitranslmed.3001318
PMCID: PMC3502016
PMID: 21849665
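The signature-matching idea in the abstract above can be sketched as scoring each drug by how strongly its expression signature opposes the disease signature. Pearson correlation is used here purely as an assumed stand-in, since the abstract does not give the scoring function; gene names, drug names, and values are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_drugs(disease_sig, drug_sigs):
    """Rank drugs from most opposing to most similar to the disease signature."""
    genes = sorted(disease_sig)
    disease = [disease_sig[g] for g in genes]
    scores = {name: pearson(disease, [sig[g] for g in genes])
              for name, sig in drug_sigs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])

# Hypothetical log fold changes (disease vs. normal, treated vs. untreated).
disease_sig = {"g1": 2.0, "g2": -1.5, "g3": 1.0, "g4": -0.5}
drug_sigs = {
    "drug_x": {"g1": -1.8, "g2": 1.2, "g3": -0.9, "g4": 0.4},  # reverses disease
    "drug_y": {"g1": 1.9, "g2": -1.0, "g3": 0.8, "g4": -0.6},  # mimics disease
}
ranking = rank_drugs(disease_sig, drug_sigs)
```

Under this assumed score, a strongly negative value marks a drug whose expression effect runs opposite to the disease and is therefore a repositioning candidate, while a positive value marks a drug that mimics the disease state.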
Dudley, Joel T. | Sirota, Marina | Shenoy, Mohan | Pai, Reetesh | Roedder, Silke | Chiang, Annie P. | Morgan, Alex A. | Sarwal, Minnie | Pasricha, Pankaj Jay | Butte, Atul J.
Inflammatory Bowel Disease (IBD) is a chronic inflammatory disorder of the gastrointestinal tract for which there are few safe and effective therapeutic options for long-term treatment and disease maintenance. In this study, we applied a computational approach to discover novel drug therapies for IBD in silico using publicly available molecular data measuring gene expression in IBD samples and 164 small-molecule drug compounds. Among the top compounds predicted to be therapeutic for IBD by our approach were prednisolone, a corticosteroid known to treat IBD, and topiramate, an anticonvulsant drug not previously described to demonstrate efficacy for IBD or any related disorders of inflammation or the gastrointestinal tract. We experimentally validated our topiramate prediction in vivo using a trinitrobenzenesulfonic acid (TNBS) induced rodent model of IBD. The experimental results demonstrate that oral administration of topiramate is able to significantly reduce gross pathological signs and microscopic damage in primary affected colon tissue in a TNBS-induced rodent model of IBD. These findings suggest that topiramate might serve as a novel therapeutic option for IBD in humans, and support the use of public molecular data and computational approaches to discover novel therapeutic options for IBD.
doi:10.1126/scitranslmed.3002648
PMCID: PMC3479650
PMID: 21849664
The diagnosis and treatment of cancers, which rank among the leading causes of mortality in developed nations, present substantial clinical challenges. The genetic and epigenetic heterogeneity of tumors can lead to differential response to therapy and gross disparities in patient outcomes, even for tumors originating from similar tissues. High-throughput DNA sequencing technologies hold promise to improve the diagnosis and treatment of cancers through efficient and economical profiling of complete tumor genomes, paving the way for approaches to personalized oncology that consider the unique genetic composition of the patient’s tumor. Here we present a novel method to leverage the information provided by cancer genome sequencing to match an individual tumor genome with commercial cell lines, which might be used as clinical surrogates to inform prognosis or therapeutic strategy. We evaluate the method using a published lung cancer genome and genetic profiles of commercial cancer cell lines. The results support the general plausibility of this matching approach, thereby offering a first step in translational bioinformatics approaches to personalized oncology using established cancer cell lines.
PMCID: PMC3477496
PMID: 21121052
doi:10.1038/clpt.2011.120
PMCID: PMC3476839
PMID: 21716268
The future of neonatal informatics will be driven by the availability of increasingly vast amounts of clinical and genetic data. The field of translational bioinformatics is concerned with linking and learning from these data and applying new findings to clinical care to transform the data into proactive, predictive, preventive, and participatory health. As a result of advances in translational informatics, the care of neonates will become more data driven, evidence based, and personalized.
doi:10.1542/neo.13-5-e281
PMCID: PMC3424284
PMID: 22924023
doi:10.1136/amiajnl-2011-000343
PMCID: PMC3128419
PMID: 21672904
Modeling physiologic and disease processes; Linking the genotype and phenotype; identifying genome and protein structure and function; visualization of data and knowledge; collaborative technologies; applications that link biomedical knowledge from diverse primary sources (includes automated indexing); knowledge representations; statistical analysis of large datasets; methods for integration of information from disparate sources; discovery, and text and data mining methods, automated learning; ontologies; knowledge bases
Finding new uses for existing drugs, or drug repositioning, has been used as a strategy for decades to get drugs to more patients. As the ability to measure molecules in high-throughput ways has improved over the past decade, it is logical that such data might be useful for enabling drug repositioning through computational methods. Many computational predictions for new indications have been borne out in cellular model systems, though extensive animal model and clinical trial-based validation are still pending. In this review, we show that computational methods for drug repositioning can be classified in two axes: drug based, where discovery initiates from the chemical perspective, or disease based, where discovery initiates from the clinical perspective of disease or its pathology. Newer algorithms for computational drug repositioning will likely span these two axes, will take advantage of newer types of molecular measurements, and will certainly play a role in reducing the global burden of disease.
doi:10.1093/bib/bbr013
PMCID: PMC3137933
PMID: 21690101
bioinformatics; drug repositioning; drug development; microarrays; gene expression; systems biology; genomics
High-resolution image guidance for resection of residual tumor cells would enable more precise and complete excision for more effective treatment of cancers, such as medulloblastoma, the most common pediatric brain cancer. Numerous studies have shown that brain tumor patient outcomes correlate with the precision of resection. To enable guided resection with molecular specificity and cellular resolution, molecular probes that effectively delineate brain tumor boundaries are essential. Therefore, we developed a bioinformatics approach to analyze micro-array datasets for the identification of transcripts that encode candidate cell surface biomarkers that are highly enriched in medulloblastoma. The results identified 380 genes with greater than a two-fold increase in the expression in the medulloblastoma compared with that in the normal cerebellum. To enrich for targets with accessibility for extracellular molecular probes, we further refined this list by filtering it with gene ontology to identify genes with protein localization on, or within, the plasma membrane. To validate this meta-analysis, the top 10 candidates were evaluated with immunohistochemistry. We identified two targets, fibrillin 2 and EphA3, which specifically stain medulloblastoma. These results demonstrate a novel bioinformatics approach that successfully identified cell surface and extracellular candidate markers enriched in medulloblastoma versus adjacent cerebellum. These two proteins are high-value targets for the development of tumor-specific probes in medulloblastoma. This bioinformatics method has broad utility for the identification of accessible molecular targets in a variety of cancers and will enable probe development for guided resection.
PMCID: PMC3421962
PMID: 22904683
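The two-stage filter described in the abstract above (more than a two-fold expression increase, then restriction to plasma membrane localization) can be sketched as below. Gene names, expression values, and the membrane gene set are invented placeholders; a real analysis would take localization from Gene Ontology annotations.

```python
def candidate_surface_markers(tumor_expr, normal_expr, membrane_genes,
                              min_fold=2.0):
    """Return (gene, fold change) for membrane genes enriched in tumor."""
    hits = []
    for gene, tumor in tumor_expr.items():
        normal = normal_expr.get(gene)
        if not normal:  # skip genes missing or unexpressed in normal tissue
            continue
        fold = tumor / normal
        if fold > min_fold and gene in membrane_genes:
            hits.append((gene, fold))
    # Highest fold change first, so the top candidates can go to validation.
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Hypothetical mean expression values (linear scale).
tumor_expr = {"GENE_A": 80.0, "GENE_B": 30.0, "GENE_C": 12.0, "GENE_D": 50.0}
normal_expr = {"GENE_A": 10.0, "GENE_B": 10.0, "GENE_C": 10.0, "GENE_D": 20.0}
membrane_genes = {"GENE_A", "GENE_C", "GENE_D"}

candidates = candidate_surface_markers(tumor_expr, normal_expr, membrane_genes)
```

GENE_B passes the fold-change threshold but is dropped for lacking membrane localization, and GENE_C is localized correctly but not enriched, which leaves GENE_A and GENE_D as candidates.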
Motivation: Complex diseases, such as Type 2 Diabetes Mellitus (T2D), result from the interplay of both environmental and genetic factors. However, most studies investigate either the genetics or the environment, and few study their possible interaction in the context of disease. One key challenge in documenting interactions between genes and environment is choosing which of each to test jointly. Here, we attempt to address this challenge through a data-driven integration of epidemiological and toxicological studies. Specifically, we derive lists of candidate interacting genetic and environmental factors by integrating findings from genome-wide and environment-wide association studies. Next, we search for evidence of toxicological relationships between these genetic and environmental factors that may have an etiological role in the disease. We illustrate our method by selecting candidate interacting factors for T2D.
Contact: abutte@stanford.edu
doi:10.1093/bioinformatics/bts229
PMCID: PMC3371861
PMID: 22689751
Most GWASs were performed using study populations with Caucasian ethnicity or ancestry, and findings from one ethnic subpopulation might not always translate to another. We curated 4,573 genetic studies on 763 human diseases and identified 3,461 disease-susceptible SNPs with genome-wide significance; only 10% of these had been validated in at least two different ethnic populations. SNPs for autoimmune diseases demonstrated the lowest percentage of cross-ethnicity validation. We used mortality data from the Centers for Disease Control and Prevention and identified 19 diseases killing over 10,000 Americans per year that were still lacking publications of even a single cross-ethnic SNP. Fifteen of these diseases had never been studied in large GWAS in non-Caucasian populations, including chronic liver diseases and cirrhosis, leukemia, and non-Hodgkin’s lymphoma. Our results demonstrate that the diseases killing the most Americans are still lacking genetic studies across ethnicities.
PMCID: PMC3392055
PMID: 22779041
Expression quantitative trait loci (eQTL), or genetic variants associated with changes in gene expression, have the potential to assist in interpreting results of genome-wide association studies (GWAS). eQTLs also have varying degrees of tissue specificity. By correlating the statistical significance of eQTLs mapped in various tissue types to their odds ratios reported in a large GWAS by the Wellcome Trust Case Control Consortium (WTCCC), we discovered that there is a significant association between diseases studied genetically and their relevant tissues. This suggests that eQTL data sets can be used to determine tissues that play a role in the pathogenesis of a disease, thereby highlighting these tissue types for further post-GWAS functional studies.
PMCID: PMC3392070
PMID: 22779046
Background: Both genetic and environmental factors contribute to triglyceride, low-density lipoprotein-cholesterol (LDL-C), and high-density lipoprotein-cholesterol (HDL-C) levels. Although genome-wide association studies are currently testing the genetic factors systematically, testing and reporting one or a few factors at a time can lead to fragmented literature for environmental chemical factors. We screened for correlation between environmental factors and lipid levels, utilizing four independent surveys with information on 188 environmental factors from the Centers for Disease Control and Prevention's National Health and Nutrition Examination Survey, collected between 1999 and 2006.
Methods: We used linear regression to correlate each environmental chemical factor to triglycerides, LDL-C and HDL-C, adjusting for age, age², sex, ethnicity, socio-economic status and body mass index. Final estimates were adjusted for waist circumference, diabetes status, blood pressure and survey. Multiple comparisons were controlled for by estimating the false discovery rate, and significant findings were tentatively validated in an independent survey.
Results: We identified and validated 29, 9 and 17 environmental factors correlated with triglycerides, LDL-C and HDL-C levels, respectively. Findings include hydrocarbons and nicotine associated with lower HDL-C and vitamin E (γ-tocopherol) associated with unfavourable lipid levels. Higher triglycerides and lower HDL-C were correlated with higher levels of fat-soluble contaminants (e.g. polychlorinated biphenyls and dibenzofurans). Nutrients and vitamin markers (e.g. vitamins B, D and carotenes) were associated with favourable triglyceride and HDL-C levels.
Conclusions: Our systematic association study has enabled us to postulate about broad environmental correlation to lipid levels. Although subject to confounding and reverse causality bias, these findings merit evaluation in additional cohorts.
doi:10.1093/ije/dys003
PMCID: PMC3396318
PMID: 22421054
Lipids; cholesterol; environment; pollutants; nutrients; GWAS; EWAS
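The multiple-comparison step in the Methods above can be illustrated with a short sketch of false discovery rate control. The Benjamini-Hochberg procedure is an assumption here, since the abstract says only that the false discovery rate was estimated; the factor names and p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues

# Hypothetical per-factor p-values from separate regressions on one lipid.
factors = ["factor_a", "factor_b", "factor_c", "factor_d", "factor_e"]
pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]

qvalues = benjamini_hochberg(pvalues)
significant = [f for f, q in zip(factors, qvalues) if q < 0.05]
```

With these toy values only factor_a and factor_b stay below an FDR of 0.05. factor_c and factor_d have raw p-values under 0.05 but adjusted values just above it, which is the kind of finding a screen across 188 factors needs to filter out.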
Pathway analysis has become the first choice for gaining insight into the underlying biology of differentially expressed genes and proteins, as it reduces complexity and has increased explanatory power. We discuss the evolution of knowledge base–driven pathway analysis over its first decade, distinctly divided into three generations. We also discuss the limitations that are specific to each generation, and how they are addressed by successive generations of methods. We identify a number of annotation challenges that must be addressed to enable development of the next generation of pathway analysis methods. Furthermore, we identify a number of methodological challenges that the next generation of methods must tackle to take advantage of the technological advances in genomics and proteomics in order to improve specificity, sensitivity, and relevance of pathway analysis.
doi:10.1371/journal.pcbi.1002375
PMCID: PMC3285573
PMID: 22383865
Genome-wide association studies (GWAS) have revealed novel genes and pathways involved in lung disease, many of which are potential targets for therapy. However, despite numerous successes, a large proportion of the genetic variance in disease risk remains unexplained, and the function of the associated genetic variations identified by GWAS and the mechanisms by which they alter individual risk for disease or pathogenesis are still largely unknown. The National Heart, Lung, and Blood Institute (NHLBI) convened a 2-day workshop to address these shortcomings and to make recommendations for future research areas that will move the scientific community beyond gene discovery. Topics of individual sessions ranged from data integration and systems genetics to functional validation of genetic variations in humans and model systems. There was broad consensus among the participants for five high-priority areas for future research, including the following: (1) integrated approaches to characterize the function of genetic variations, (2) studies on the role of environment and mechanisms of transcriptional and post-transcriptional regulation, (3) development of model systems to study gene function in complex biological systems, (4) comparative phenomic studies across lung diseases, and (5) training in and applications of bioinformatic approaches for comprehensive mining of existing data sets. Last, it was agreed that future research on lung diseases should integrate approaches across “-omic” technologies and include ethnically/racially diverse populations in human studies of lung disease whenever possible.
doi:10.1164/rccm.201002-0180PP
PMCID: PMC2949401
PMID: 20558629
genetics; epigenetics; genomics; bioinformatics; lung disease
Asthma is considered a Th2 cell–associated disorder. Despite this, both the Th1 cell–associated cytokine IFN-γ and airway neutrophilia have been implicated in severe asthma. To investigate the relative contributions of different immune system components to the pathogenesis of asthma, we previously developed a model that exhibits several features of severe asthma in humans, including airway neutrophilia and increased lung IFN-γ. In the present studies, we tested the hypothesis that IFN-γ regulates mast cell function in our model of chronic asthma. Engraftment of mast cell–deficient Kit(W-sh/W-sh) mice, which develop markedly attenuated features of disease, with wild-type mast cells restored disease pathology in this model of chronic asthma. However, disease pathology was not fully restored by engraftment with either IFN-γ receptor 1–null (Ifngr1–/–) or Fcε receptor 1γ–null (Fcer1g–/–) mast cells. Additional analysis, including gene array studies, showed that mast cell expression of IFN-γR contributed to the development of many FcεRIγ-dependent and some FcεRIγ-independent features of disease in our model, including airway hyperresponsiveness, neutrophilic and eosinophilic inflammation, airway remodeling, and lung expression of several cytokines, chemokines, and markers of an alternatively activated macrophage response. These findings identify a previously unsuspected IFN-γ/mast cell axis in the pathology of chronic allergic inflammation of the airways in mice.
doi:10.1172/JCI43598
PMCID: PMC3148724
PMID: 21737883
Bioinformatics methods that leverage the vast amounts of clinical data promise to provide insights into underlying molecular mechanisms that help explain human physiological processes. One of these processes is adolescent development. The utility of predictive aging models generated from cross-sectional cohorts and their applicability to separate populations, including the clinical population, has yet to be completely explored. In order to address this, we built regression models predictive of adolescent chronological age from 2001–2002 National Health and Nutrition Examination Survey (NHANES) data and validated them against independent 2003–2004 NHANES data and clinical data from an academic tertiary-care pediatric hospital. The results indicate distinct differences between male and female models, with both alkaline phosphatase and creatinine as predictive biomarkers for both sexes, hematocrit and mean cell volume for males, and total serum globulin for females. We also suggest that the models are generalizable, are clinically relevant, and imply underlying molecular and clinical differences between males and females that may affect prediction accuracy. The integration of both epidemiological and clinical data promises to create more robust models that shed new light on physiological processes.
doi:10.1016/j.jbi.2009.11.007
PMCID: PMC2878870
PMID: 19958842
aging; pediatric; biomarker; translational bioinformatics; age prediction; electronic medical record; adolescent development