Though genome-wide technologies, such as microarrays, are widely used, data from these methods are considered noisy; there is still varied success in downstream biological validation. We report a method that increases the likelihood of successfully validating microarray findings using real-time RT-PCR, including genes at low expression levels and with small differences. We use a Bayesian network to identify the most relevant sources of noise based on the successes and failures in validation for an initial set of selected genes, and then improve our subsequent selection of genes for validation by eliminating these sources of noise. The network displays the significant sources of noise in an experiment, and scores the likelihood of validation for every gene. We show how the method can significantly increase validation success rates. In conclusion, we have added a new automated step that identifies the sources of noise contributing to successful or unsuccessful downstream biological validation.
Bioinformatics; Bayesian network; Microarray; RT-PCR; Microarray data
Adverse drug reactions (ADRs) can have severe consequences, such that the ability to predict ADRs prior to market introduction is desirable. Computational approaches applied to pre-clinical data might be one way to inform drug labeling and marketing with respect to potential ADRs.
Based on the premise that some of the molecular actors of ADRs involve interactions detectable in large, and increasingly public, compound screening campaigns, we generated logistic regression models that correlate post-marketing ADRs with screening data from the PubChem BioAssay database. These models analyze ADRs at the level of organ systems, the System Organ Classes (SOCs). Nine of the 19 SOCs under consideration were found to be significantly correlated with pre-clinical screening data. For 6 of the 8 established drugs for which we could retropredict SOC-specific adversities, prior knowledge was found that supports these predictions. We conclude by predicting SOC-specific adversities for three unapproved or recently introduced drugs.
Adverse drug reactions; prediction; machine learning; compound screening; pharmacovigilance
Cancer-associated fibroblasts (CAFs) have been reported to support tumor progression by a variety of mechanisms. However, their role in the progression of non-small cell lung cancer (NSCLC) remains poorly defined. In addition, the extent to which specific proteins secreted by CAFs contribute directly to tumor growth is unclear. To study the role of CAFs in NSCLC, a cross-species functional characterization of mouse and human lung CAFs was performed. CAFs supported the growth of lung cancer cells in vivo by secretion of soluble factors that directly stimulate the growth of tumor cells. Gene expression analysis comparing normal mouse lung fibroblasts (NFs) and mouse lung CAFs identified multiple genes that correlate with the CAF phenotype. A gene signature of secreted genes upregulated in CAFs was an independent marker of poor survival in NSCLC patients. This secreted gene signature was upregulated in NFs after long-term exposure to tumor cells, demonstrating that NFs are “educated” by tumor cells to acquire a CAF-like phenotype. Functional studies identified important roles for CLCF1-CNTFR and IL6-IL6R signaling in promoting growth of NSCLC cells. This study identifies novel soluble factors contributing to the CAF protumorigenic phenotype in NSCLC and suggests new avenues for the development of therapeutic strategies.
Carcinoma-associated fibroblasts; lung cancer; cytokines; IL6; CLCF1
Drug repositioning refers to the discovery of alternative uses for a drug that differ from its original indication. One challenge in these efforts lies in choosing the indication for which to prospectively test a drug of interest. We systematically evaluated a drug treatment-based view of diseases in order to address this challenge. Suggested novel drug uses were generated using a guilt-by-association approach. Compared with control drug uses, the suggested novel drug uses were significantly enriched in clinical trials.
The volume of publicly available genomic scale data is increasing. Genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample. These annotations are not mapped to concepts in any ontology, making it difficult to integrate these datasets across repositories. We have previously developed methods to map text-annotations of tissue microarrays to concepts in the NCI thesaurus and SNOMED-CT.
In this work we generalize our methods to map text annotations of gene expression datasets to concepts in the UMLS. We demonstrate the utility of our methods by processing annotations of datasets in the Gene Expression Omnibus. We demonstrate that we enable ontology-based querying and integration of tissue and gene expression microarray data. We enable identification of datasets on specific diseases across both repositories. Our approach provides the basis for ontology-driven data integration for translational research on gene and protein expression data.
Based on this work we have built a prototype system for ontology based annotation and indexing of biomedical data. The system processes the text metadata of diverse resource elements such as gene expression data sets, descriptions of radiology images, clinical-trial reports, and PubMed article abstracts to annotate and index them with concepts from appropriate ontologies. The key functionality of this system is to enable users to locate biomedical data resources related to particular ontology concepts.
Genome-wide disease association studies contrast genetic variation between disease cohorts and healthy populations to discover single nucleotide polymorphisms (SNPs) and other genetic markers revealing underlying genetic architectures of human diseases. Despite scores of efforts over the past decade, many reproducible genetic variants that explain substantial proportions of the heritable risk of common human diseases remain undiscovered. We have conducted a multispecies genomic analysis of 5,831 putative human risk variants for more than 230 disease phenotypes reported in 2,021 studies. We find that the current approaches show a propensity for discovering disease-associated SNPs (dSNPs) at conserved genomic positions because the effect size (odds ratio) and allelic P value of genetic association of an SNP relate strongly to the evolutionary conservation of its genomic position. We propose a new measure for ranking SNPs that integrates evolutionary conservation scores and the P value (E-rank). Using published data from a large case-control study, we demonstrate that the E-rank method prioritizes SNPs with a greater likelihood of bona fide and reproducible genetic disease associations, many of which may explain greater proportions of genetic variance. Therefore, long-term evolutionary histories of genomic positions offer key practical utility in reassessing data from existing disease association studies, and in the design and analysis of future studies aimed at revealing the genetic basis of common human diseases.
phylomedicine, GWAS, heritability
Personal genome resequencing has provided a promising lead toward personalized medicine. However, due to the limited samples and the lack of case/control design, current interpretation of personal genome sequences has mainly focused on the identification and functional annotation of the DNA variants that differ from the reference genome. The reference genome was deduced from a collection of DNAs from anonymous individuals, some of whom might be carriers of disease risk alleles. We queried the reference genome against a large high-quality disease-SNP association database and found 3,556 disease-susceptible variants, including 15 rare variants. We assessed the likelihood ratio for risk for the reference genome on 104 diseases and found high risk for type 1 diabetes (T1D) and hypertension. We further demonstrated that the risk of T1D was significantly higher in the reference genome than in a healthy individual with a whole human genome sequence. We found that the high T1D risk was mainly driven by an R260W mutation in PTPN22 in the reference genome. Therefore, we recommend that the disease-susceptible variants in the reference genome be taken into consideration and that future genome sequences be interpreted with curated and predicted disease-susceptible loci to assess personal disease risk.
We describe cell type–specific significance analysis of microarrays (csSAM) for analyzing differential gene expression for each cell type in a biological sample from microarray data and relative cell-type frequencies. First, we validated csSAM with predesigned mixtures and then applied it to whole-blood gene expression datasets from stable post-transplant kidney transplant recipients and those experiencing acute transplant rejection, which revealed hundreds of differentially expressed genes that were otherwise undetectable.
We sought to identify serological markers capable of diagnosing preeclampsia (PE). We performed serum peptide analysis (liquid chromatography mass spectrometry) of 62 unique samples from 31 PE patients and 31 healthy pregnant controls, with two-thirds used as a training set and the other third as a testing set. Differential serum peptide profiling identified 52 significant serum peptides, and a 19-peptide panel that collectively discriminated PE in training sets (n = 21 PE, n = 21 control; specificity = 85.7% and sensitivity = 100%) and testing sets (n = 10 PE, n = 10 control; specificity = 80% and sensitivity = 100%). The panel peptides were derived from 6 different protein precursors: 13 from fibrinogen alpha (FGA), 1 from alpha-1-antitrypsin (A1AT), 1 from apolipoprotein L1 (APO-L1), 1 from inter-alpha-trypsin inhibitor heavy chain H4 (ITIH4), 2 from kininogen-1 (KNG1), and 1 from thymosin beta-4 (TMSB4). We concluded that serum peptides can accurately discriminate active PE. Measurement of the 19-peptide panel could be performed quickly on a quantitative mass spectrometric platform available in clinical laboratories. This serum peptide panel quantification could provide clinical utility in predicting PE or in the differential diagnosis of PE from confounding chronic hypertension.
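The reported sensitivity and specificity can be checked with a short calculation. The sketch below is illustrative only: the confusion-matrix counts are not given in the abstract and are inferred here from the reported group sizes and percentages (an assumption), and the function name is hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of PE cases called PE
    specificity = tn / (tn + fp)  # fraction of controls called control
    return sensitivity, specificity


# ASSUMPTION: counts inferred from the reported percentages, not stated
# in the abstract. Training set: 21 PE, 21 controls.
train = sensitivity_specificity(tp=21, fn=0, tn=18, fp=3)

# Testing set: 10 PE, 10 controls (same assumption).
test = sensitivity_specificity(tp=10, fn=0, tn=8, fp=2)

for name, (sens, spec) in (("training", train), ("testing", test)):
    print(f"{name}: sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```

Under these inferred counts, 18 of 21 correctly classified controls is the only whole-number split that rounds to the reported 85.7% training specificity.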
Motivation: Biological analysis has shifted from identifying genes and transcripts to mapping these genes and transcripts to biological functions. The ENCODE Project has generated hundreds of ChIP-Seq experiments spanning multiple transcription factors and cell lines for public use, but tools for a biomedical scientist to analyze these data are either non-existent or tailored to narrow biological questions. We present the ENCODE ChIP-Seq Significance Tool, a flexible web application leveraging public ENCODE data to identify enriched transcription factors in a gene or transcript list for comparative analyses.
Supplementary material is available at Bioinformatics online.
Genetic diversity across different human populations can enhance understanding of the genetic basis of disease. We calculated the genetic risk of 102 diseases in 1,043 unrelated individuals across 51 populations of the Human Genome Diversity Panel. We found that genetic risk for type 2 diabetes and pancreatic cancer decreased as humans migrated toward East Asia. In addition, biliary liver cirrhosis, alopecia areata, bladder cancer, inflammatory bowel disease, membranous nephropathy, systemic lupus erythematosus, systemic sclerosis, ulcerative colitis, and vitiligo have undergone genetic risk differentiation. This analysis represents a large-scale attempt to characterize genetic risk differentiation in the context of migration. We anticipate that our findings will enable detailed analysis pertaining to the driving forces behind genetic risk differentiation.
The environment humans inhabit has changed many times in the last 100,000 years. Migration and dynamic local environments can lead to genetic adaptations favoring beneficial traits. Many genes responsible for these adaptations can alter disease susceptibility. Genes can also affect disease susceptibility by varying randomly across different populations. We have studied genetic variants that are known to modify disease susceptibility in the context of worldwide migration. We found that variants associated with 11 diseases have been affected to an extent that is not explained by random variation. We also found that the genetic risk of type 2 diabetes has steadily decreased along the worldwide human migration trajectory from Africa to America.
The rise of personalized medicine has reminded us that each patient must be treated as an individual. One factor in making treatment decisions is the physiological state of each patient, but definitions of relevant states and methods to visualize state-related physiologic changes are scarce. We constructed correlation networks from physiologic data to demonstrate changes associated with pressor use in the intensive care unit.
We collected 29 physiological variables at one-minute intervals from nineteen trauma patients in the intensive care unit of an academic hospital and grouped each minute of data as receiving or not receiving pressors. For each group we constructed Spearman correlation networks of pairs of physiologic variables. To visualize drug-associated changes we split the networks into three components: an unchanging network, a network of connections with changing correlation sign, and a network of connections only present in one group.
Out of a possible 406 connections between the 29 physiological measures, 64, 39, and 48 were present in the three component networks, respectively. The static network confirms expected physiological relationships, while the network of associations with changed correlation sign suggests putative changes due to the drugs. The network of associations present only with pressors suggests new relationships that could be worthy of study.
We demonstrated that visualizing physiological relationships using correlation networks provides insight into underlying physiologic states while also showing that many of these relationships change when the state is defined by the presence of drugs. This method applied to targeted experiments could change the way critical care patients are monitored and treated.
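The network decomposition described above can be sketched in a few lines. This is a minimal illustration, not the study's implementation: the abstract does not say how a connection was declared present, so the absolute-correlation threshold, the variable names, and the toy values are all assumptions.

```python
from itertools import combinations
from math import comb, sqrt


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)  # raises ZeroDivisionError for a constant variable


def network(data, threshold):
    """Map each variable pair with |rho| >= threshold to its rho."""
    edges = {}
    for a, b in combinations(sorted(data), 2):
        rho = spearman(data[a], data[b])
        if abs(rho) >= threshold:  # ASSUMPTION: simple cutoff defines an edge
            edges[(a, b)] = rho
    return edges


def split_networks(on, off):
    """Split edges into unchanged, sign-changed, and one-group-only sets."""
    shared = on.keys() & off.keys()
    static = {e for e in shared if on[e] * off[e] > 0}
    sign_changed = shared - static
    one_group_only = on.keys() ^ off.keys()
    return static, sign_changed, one_group_only


# Number of possible pairs among 29 variables.
possible = comb(29, 2)

# Hypothetical toy data: four variables, five minutes per group.
pressors_off = {"hr": [1, 2, 3, 4, 5], "map": [1, 2, 3, 4, 5],
                "rr": [5, 4, 3, 2, 1], "temp": [3, 1, 4, 2, 5]}
pressors_on = {"hr": [1, 2, 3, 4, 5], "map": [2, 3, 4, 5, 6],
               "rr": [1, 2, 3, 4, 5], "temp": [1, 2, 3, 4, 5]}

static, sign_changed, one_group_only = split_networks(
    network(pressors_on, 0.8), network(pressors_off, 0.8))
print(possible, sorted(static), sorted(sign_changed), sorted(one_group_only))
```

The value of `possible` reproduces the 406 candidate connections quoted in the results, since 29 variables form 29 × 28 / 2 pairs.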
We outline a paradigm for meta-microarray database creation and integration with clinical variables. We use as our implementation example a breast cancer database linking RNA expression measurements (by microarray) and clinical variables, such as survival metrics and tumor size. Such an endeavor involves integrating across different microarray datasets as well as clinical parameters. To this end, we created a data curation and processing pipeline, formal database ontology, and SQL schema to optimally query, analyze and visualize data from over 30 publicly available breast cancer microarray studies listed in the Gene Expression Omnibus (GEO). We demonstrate several pilot examples using this database. This methodology serves as a model for future meta-analyses of complex public clinical datasets, in particular those in the field of cancer.
Clinically recorded pain scores are abundant in patient health records but are rarely used in research. The use of this information could help improve clinical outcomes. For example, a recent report by the Institute of Medicine stated that ineffective use of clinical information contributes to under-treatment of patient subpopulations — especially women. This study used diagnosis-associated pain scores from a large hospital database to document sex differences in reported pain. We used de-identified electronic medical records from Stanford Hospital and Clinics for more than 72,000 patients. Each record contained at least one disease-associated pain score. We found over 160,000 pain scores in more than 250 primary diagnoses, and analyzed differences in disease-specific pain reported by men and women. After filtering for diagnoses with minimum encounter numbers, we found diagnosis-specific sex differences in reported pain. The most significant differences occurred in patients with disorders of the musculoskeletal, circulatory, respiratory and digestive systems, followed by infectious diseases, and injury and poisoning. We also discovered sex-specific differences in pain intensity in previously unreported diseases, including disorders of the cervical region, and acute sinusitis (p = 0.01, 0.017, respectively). Pain scores were collected during hospital encounters. No information about the use of pre-encounter over-the-counter medications was available. To our knowledge, this is the largest data-driven study documenting sex differences of disease-associated pain. It highlights the utility of EMR data to corroborate and expand on results of smaller clinical studies. Our findings emphasize the need for future research examining the mechanisms underlying differences in pain.
electronic medical records; sex differences; pain intensity; data mining
Crohn’s disease (CD), an inflammatory disease of the bowel, affects millions of people around the world. Evidence suggests that disease onset and pathogenesis differ between males and females. Yet no comprehensive efforts exist to assess the sex-specific genetic architecture of CD.
We used genotyping data from a cohort of 1748 CD cases and 2938 controls to investigate 71 meta-analysis-confirmed CD risk loci for sex differences in disease risk. We further validated the significant results in separate cohorts of 968 CD cases and 2809 controls, and performed a meta-analysis across datasets.
The SNP rs3792106 (C/T) in ATG16L1 showed a significant sex effect with p-value 6.9×10^−13 and allelic odds ratio 1.48 in females, and p-value 0.013 and odds ratio 1.22 in males (odds ratio heterogeneity p-value 0.037). Surprisingly, the difference was found to arise from a discrepancy in allele frequencies between male and female controls (p-value 0.0045) rather than cases. We found similar results for this SNP in the separate validation data sets. Using 155 HapMap 3 trios, we detected significant maternal over-transmission of the T allele at rs3792106 (p-value 0.027).
Our results indicate that different transmission patterns between sexes may sustain the disparate allele frequencies at rs3792106 in healthy populations, and furthermore that a virus-risk variant mechanism implicated in CD alters the distribution in diseased patients. To our knowledge, this is the first report of sex-specific CD association in ATG16L1. The possible implications in Crohn’s disease and basic human biology present interesting areas for future investigation.
Inflammatory bowel disease; ATG16L1; transmission distortion; sexual dimorphism
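For readers unfamiliar with the quantities reported in this abstract, the sketch below shows how an allelic odds ratio and a test for odds ratio heterogeneity between two groups can be computed. It is only one common formulation (Woolf standard errors with a two-sided z-test); the abstract does not state which test the authors used, and the allele counts and function names are hypothetical.

```python
from math import erf, exp, log, sqrt


def log_odds_ratio(case_a, case_b, ctrl_a, ctrl_b):
    """Log allelic odds ratio of allele A vs. B, with its Woolf standard error."""
    log_or = log((case_a * ctrl_b) / (case_b * ctrl_a))
    se = sqrt(1 / case_a + 1 / case_b + 1 / ctrl_a + 1 / ctrl_b)
    return log_or, se


def heterogeneity_p(log_or_1, se_1, log_or_2, se_2):
    """Two-sided z-test for a difference between two independent log odds ratios."""
    z = (log_or_1 - log_or_2) / sqrt(se_1 ** 2 + se_2 ** 2)
    return 1 - erf(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))


# HYPOTHETICAL allele counts, chosen only to illustrate the arithmetic.
female = log_odds_ratio(case_a=600, case_b=400, ctrl_a=500, ctrl_b=500)
male = log_odds_ratio(case_a=550, case_b=450, ctrl_a=500, ctrl_b=500)

print(f"female OR = {exp(female[0]):.2f}, male OR = {exp(male[0]):.2f}")
print(f"heterogeneity p = {heterogeneity_p(*female, *male):.3f}")
```

Because an odds ratio depends on allele counts in both cases and controls, a sex difference in control allele frequencies alone can produce heterogeneity, which is the situation the results describe.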
Diseases such as type 2 diabetes (T2D) result from environmental and genetic factors, and risk varies considerably in the population. T2D-related genetic loci discovered to date explain only a small portion of the T2D heritability. Some heritability may be due to gene–environment interactions. However, documenting these interactions has been difficult due to low availability of concurrent genetic and environmental measures, selection bias, and challenges in controlling for multiple hypothesis testing. Through genome-wide association studies (GWAS), investigators have identified over 90 single nucleotide polymorphisms (SNPs) associated with T2D. Using a method analogous to GWAS [environment-wide association study (EWAS)], we found five environmental factors associated with the disease. By focusing on risk factors that emerge from GWAS and EWAS, it is possible to overcome difficulties in uncovering gene–environment interactions. Using data from the National Health and Nutrition Examination Survey (NHANES), we screened 18 SNPs and 5 serum-based environmental factors for interaction in association with T2D. We controlled for multiple hypotheses using false discovery rate (FDR) and Bonferroni correction and found four interactions with FDR <20 %. The interaction between rs13266634 (SLC30A8) and trans-β-carotene withstood Bonferroni correction (corrected p = 0.006, FDR <1.5 %). The per-risk-allele effect sizes in subjects with low levels of trans-β-carotene were 40 % greater than the marginal effect size [odds ratio (OR) 1.8, 95 % CI 1.3–2.6]. We hypothesize that impaired function driven by rs13266634 increases T2D risk when combined with serum levels of nutrients. Unbiased consideration of environmental and genetic factors may help identify larger and more relevant effect sizes for disease associations.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-012-1258-z) contains supplementary material, which is available to authorized users.
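The two multiple-testing corrections named in the abstract above can be sketched as follows. This is a generic illustration using the Benjamini-Hochberg procedure for FDR, which is an assumption: the abstract does not say which FDR estimator was used. The p-values are made up, and the count of 90 tests assumes every SNP-factor pair was tested exactly once.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is the minimum of p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# ASSUMPTION: one interaction test per SNP-factor pair.
n_tests = 18 * 5

# Made-up p-values for a four-test toy example.
toy = [0.01, 0.04, 0.03, 0.005]
print(n_tests, bonferroni(toy), benjamini_hochberg(toy))
```

Bonferroni controls the chance of any false positive and is therefore stricter, which is why only one interaction survives it in the abstract while four pass the looser FDR threshold.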
Publicly available molecular datasets can be used for independent verification or investigative repurposing, but this depends on the presence, consistency and quality of descriptive annotations. Annotation and indexing of molecular datasets using well-defined controlled vocabularies or ontologies enables accurate and systematic data discovery, yet the majority of molecular datasets available through public data repositories lack such annotations. A number of automated annotation methods have been developed; however, few systematic evaluations of the quality of annotations supplied by application of these methods have been performed using annotations from standing public data repositories. Here, we compared manually-assigned Medical Subject Heading (MeSH) annotations associated with experiments by data submitters in the PRoteomics IDEntification (PRIDE) proteomics data repository to automated MeSH annotations derived through the National Center for Biomedical Ontology Annotator and National Library of Medicine MetaMap programs. These programs were applied to free-text annotations for experiments in PRIDE. As many submitted datasets were referenced in publications, we used the manually curated MeSH annotations of those linked publications in MEDLINE as “gold standard”. Annotator and MetaMap exhibited recall performance 3-fold greater than that of the manual annotations. We connected PRIDE experiments in a network topology according to shared MeSH annotations and found 373 distinct clusters, many of which were found to be biologically coherent by network analysis. The results of this study suggest that both Annotator and MetaMap are capable of annotating public molecular datasets with a quality comparable to, and often exceeding, that of the actual data submitters, highlighting a continuous need to improve and apply automated methods to molecular datasets in public data repositories to maximize their value and utility.
Proteomics; Annotations; Ontologies; Concept Identification; Natural Language Processing; MEDLINE
Summary: We introduce ProfileChaser, a web server that allows for querying the Gene Expression Omnibus based on genome-wide patterns of differential expression. Using a novel, content-based approach, ProfileChaser retrieves expression profiles that match the differentially regulated transcriptional programs in a user-supplied experiment. This analysis identifies statistical links to similar expression experiments from the vast array of publicly available data on diseases, drugs, phenotypes and other experimental conditions.
Supplementary Information: Supplementary data are available at Bioinformatics online.
The application of established drug compounds to novel therapeutic indications, known as drug repositioning, offers several advantages over traditional drug development, including reduced development costs and shorter paths to approval. Recent approaches to drug repositioning employ high-throughput experimental approaches to assess a compound’s potential therapeutic qualities. Here we present a systematic computational approach to predict novel therapeutic indications based on comprehensive testing of molecular signatures in drug-disease pairs. We integrated gene expression measurements from 100 diseases and gene expression measurements on 164 drug compounds, yielding predicted therapeutic potentials for these drugs. We demonstrate the ability to recover many known drug and disease relationships using computationally derived therapeutic potentials, and also predict many new indications for these drugs. We experimentally validated a prediction for the anti-ulcer drug cimetidine as a candidate therapeutic in the treatment of lung adenocarcinoma, and demonstrate its efficacy both in vitro and in vivo using mouse xenograft models. This computational method provides a novel and systematic approach to reposition established drugs to treat a wide range of human diseases.
Inflammatory Bowel Disease (IBD) is a chronic inflammatory disorder of the gastrointestinal tract for which there are few safe and effective therapeutic options for long-term treatment and disease maintenance. In this study, we applied a computational approach to discover novel drug therapies for IBD in silico using publicly available molecular data measuring gene expression in IBD samples and 164 small-molecule drug compounds. Among the top compounds predicted to be therapeutic for IBD by our approach were prednisolone, a corticosteroid known to treat IBD, and topiramate, an anticonvulsant drug not previously described to demonstrate efficacy for IBD or any related disorders of inflammation or the gastrointestinal tract. We experimentally validated our topiramate prediction in vivo using a trinitrobenzenesulfonic acid (TNBS) induced rodent model of IBD. The experimental results demonstrate that oral administration of topiramate is able to significantly reduce gross pathological signs and microscopic damage in primary affected colon tissue in a TNBS-induced rodent model of IBD. These findings suggest that topiramate might serve as a novel therapeutic option for IBD in humans, and support the use of public molecular data and computational approaches to discover novel therapeutic options for IBD.