Genome-Wide Association Studies (GWAS) have identified genetic variants for thousands of diseases and traits. In this study, we evaluated the relationships between specific risk factors (for example, blood cholesterol level) and diseases on the basis of their shared genetic architecture in a comprehensive human disease-SNP association database (VARIMED), analyzing the findings from 8,962 published association studies. Similarity between traits and diseases was statistically evaluated based on their association with shared gene variants. We identified 120 disease-trait pairs that were statistically similar, and of these we tested and validated five previously unknown disease-trait associations by searching electronic medical records (EMR) from 3 independent medical centers for evidence of the trait appearing in patients within one year of first diagnosis of the disease. We validated that mean corpuscular volume is elevated before diagnosis of acute lymphoblastic leukemia; both have associated variants in the gene IKZF1. Platelet count is decreased before diagnosis of alcohol dependence; both are associated with variants in the gene C12orf51. Alkaline phosphatase level is elevated in patients with venous thromboembolism; both share variants in ABO. Similarly, we found prostate specific antigen and serum magnesium levels were altered before the diagnosis of lung cancer and gastric cancer, respectively. Disease-trait associations identify traits that can potentially serve a prognostic function clinically; validating disease-trait associations through EMR can help determine whether these candidates are risk factors for complex diseases.
Human blood glucose levels have likely evolved toward their current point of stability over hundreds of thousands of years. The robust population stability of this trait is called canalization. It has been represented by a hyperbolic function of two variables: insulin sensitivity and insulin response. Environmental changes due to global migration may have pushed some human subpopulations to different points of stability. We hypothesized that there may be ethnic differences in the optimal states in the relationship between insulin sensitivity and insulin response.
RESEARCH DESIGN AND METHODS
We identified studies that measured the insulin sensitivity index (SI) and acute insulin response to glucose (AIRg) in three major ethnic groups: Africans, Caucasians, and East Asians. We identified 74 study cohorts comprising 3,813 individuals (19 African cohorts, 31 Caucasian, and 24 East Asian). We calculated the hyperbolic relationship using the mean values of SI and AIRg in the healthy cohorts with normal glucose tolerance.
We found that Caucasian subpopulations were located around the middle point of the hyperbola, while African and East Asian subpopulations were located around unstable extreme points, where a small change in one variable is associated with a large nonlinear change in the other variable.
Our findings suggest that the genetic background of Africans and East Asians makes them more and differentially susceptible to diabetes than Caucasians. This ethnic stratification could be implicated in the different natural courses of diabetes onset.
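The hyperbolic relationship described above can be sketched numerically. The following is a minimal illustration, assuming the simplest form AIRg = k / SI; the function names and the example cohort means are hypothetical and are not taken from the study.

```python
import math


def fit_constant(cohort_means):
    """Estimate k in AIRg = k / SI from (SI, AIRg) cohort means.

    With the exponent fixed at -1, the least-squares fit in log space
    reduces to the geometric mean of the SI * AIRg products.
    """
    logs = [math.log(si * airg) for si, airg in cohort_means]
    return math.exp(sum(logs) / len(logs))


def predicted_airg(si, k):
    """AIRg expected on the hyperbola for a given SI."""
    return k / si


def local_slope(si, k):
    """Derivative d(AIRg)/d(SI) = -k / SI**2 on the hyperbola."""
    return -k / si ** 2


# Hypothetical cohort means (SI, AIRg), for illustration only.
example_means = [(2.0, 500.0), (4.0, 250.0)]
k = fit_constant(example_means)
for si in (1.0, 4.0, 10.0):
    print(si, predicted_airg(si, k), local_slope(si, k))
```

On such a curve the slope magnitude grows as SI falls, which is one way to picture why points near the extremes respond to small changes in one variable with large changes in the other.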
A set of 11 genes, termed the common rejection module, predicts acute graft rejection in solid organ transplant patients and may help to identify novel drug targets in transplantation.
Using meta-analysis of eight independent transplant datasets (236 graft biopsy samples) from four organs, we identified a common rejection module (CRM) consisting of 11 genes that were significantly overexpressed in acute rejection (AR) across all transplanted organs. The CRM genes could diagnose AR with high specificity and sensitivity in three additional independent cohorts (794 samples). In another two independent cohorts (151 renal transplant biopsies), the CRM genes correlated with the extent of graft injury and predicted future injury to a graft using protocol biopsies. Inferred drug mechanisms from the literature suggested that two FDA-approved drugs (atorvastatin and dasatinib), approved for nontransplant indications, could regulate specific CRM genes and reduce the number of graft-infiltrating cells during AR. We treated mice with HLA-mismatched mouse cardiac transplant with atorvastatin and dasatinib and showed reduction of the CRM genes, significant reduction of graft-infiltrating cells, and extended graft survival. We further validated the beneficial effect of atorvastatin on graft survival by retrospective analysis of electronic medical records of a single-center cohort of 2,515 renal transplant patients followed for up to 22 yr. In conclusion, we identified a CRM in transplantation that provides new opportunities for diagnosis, drug repositioning, and rational drug design.
Biomedicine is undergoing a revolution driven by high throughput and connective computing that is transforming medical research and practice. Using oncology as an example, the speed and capacity of genomic sequencing technologies are advancing the utility of individual genetic profiles for anticipating risk and targeting therapeutics. The goal is to enable an era of “P4” medicine that will become increasingly more predictive, personalized, preemptive, and participative over time. This vision hinges on leveraging potentially innovative and disruptive technologies in medicine to accelerate discovery and to reorient clinical practice for patient-centered care. Based on a panel discussion at the Medicine 2.0 conference in Boston with representatives from the National Cancer Institute, Moffitt Cancer Center, and Stanford University School of Medicine, this paper explores how emerging sociotechnical frameworks, informatics platforms, and health-related policy can be used to encourage data liquidity and innovation. This builds on the Institute of Medicine’s vision for a “rapid learning health care system” to enable an open source, population-based approach to cancer prevention and control.
biomedical research; crowdsourcing; health information technology; innovation; precision medicine
Crohn’s disease (CD), an inflammatory disease of the bowel, affects millions of people around the world. Evidence suggests that disease onset and pathogenesis differ between males and females. Yet no comprehensive efforts exist to assess the sex-specific genetic architecture of CD.
We used genotyping data from a cohort of 1748 CD cases and 2938 controls to investigate 71 meta-analysis-confirmed CD risk loci for sex differences in disease risk. We further validated the significant results in separate cohorts of 968 CD cases and 2809 controls, and performed a meta-analysis across datasets.
The SNP rs3792106 (C/T) in ATG16L1 showed a significant sex effect with p-value 6.9×10−13 and allelic odds ratio 1.48 in females, and p-value 0.013 and odds ratio 1.22 in males (odds ratio heterogeneity p-value 0.037). Surprisingly, the difference was found to arise from a discrepancy in allele frequencies between male and female controls (p-value 0.0045) rather than cases. We found similar results for this SNP in the separate validation data sets. Using 155 HapMap 3 trios, we detected significant maternal over-transmission of the T allele at rs3792106 (p-value 0.027).
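The allelic odds ratios and the heterogeneity comparison reported above can be illustrated with a short sketch. It assumes a standard 2x2 allele-count table, the Woolf standard error of the log odds ratio, and a simple z statistic for the difference between two log odds ratios; this is not necessarily the procedure the study used, and all counts below are made up.

```python
import math


def allelic_odds_ratio(case_risk, case_other, ctrl_risk, ctrl_other):
    """Return (odds ratio, standard error of log odds ratio).

    Inputs are allele counts: risk and non-risk alleles in cases and
    in controls. The standard error uses Woolf's formula.
    """
    odds_ratio = (case_risk * ctrl_other) / (case_other * ctrl_risk)
    se_log_or = math.sqrt(
        1 / case_risk + 1 / case_other + 1 / ctrl_risk + 1 / ctrl_other
    )
    return odds_ratio, se_log_or


def heterogeneity_z(or_a, se_a, or_b, se_b):
    """z statistic for the difference between two log odds ratios."""
    return (math.log(or_a) - math.log(or_b)) / math.sqrt(se_a ** 2 + se_b ** 2)


# Hypothetical allele counts for two strata (for example, females and males).
or_f, se_f = allelic_odds_ratio(60, 40, 50, 50)
or_m, se_m = allelic_odds_ratio(55, 45, 50, 50)
print(or_f, or_m, heterogeneity_z(or_f, se_f, or_m, se_m))
```

A z statistic near zero indicates similar odds ratios in the two strata; a large absolute value suggests a stratum-specific effect.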
Our results indicate that different transmission patterns between sexes may sustain the disparate allele frequencies at rs3792106 in healthy populations, and furthermore that a virus-risk variant mechanism implicated in CD alters the distribution in diseased patients. To our knowledge, this is the first report of sex-specific CD association in ATG16L1. The possible implications in Crohn’s disease and basic human biology present interesting areas for future investigation.
Inflammatory bowel disease; ATG16L1; transmission distortion; sexual dimorphism
Publicly available molecular datasets can be used for independent verification or investigative repurposing, but this depends on the presence, consistency and quality of descriptive annotations. Annotation and indexing of molecular datasets using well-defined controlled vocabularies or ontologies enables accurate and systematic data discovery, yet the majority of molecular datasets available through public data repositories lack such annotations. A number of automated annotation methods have been developed; however, few systematic evaluations of the quality of annotations supplied by application of these methods have been performed using annotations from standing public data repositories. Here, we compared manually-assigned Medical Subject Heading (MeSH) annotations associated with experiments by data submitters in the PRoteomics IDEntification (PRIDE) proteomics data repository to automated MeSH annotations derived through the National Center for Biomedical Ontology Annotator and National Library of Medicine MetaMap programs. These programs were applied to free-text annotations for experiments in PRIDE. As many submitted datasets were referenced in publications, we used the manually curated MeSH annotations of those linked publications in MEDLINE as “gold standard”. Annotator and MetaMap exhibited recall performance 3-fold greater than that of the manual annotations. We connected PRIDE experiments in a network topology according to shared MeSH annotations and found 373 distinct clusters, many of which were found to be biologically coherent by network analysis. The results of this study suggest that both Annotator and MetaMap are capable of annotating public molecular datasets with a quality comparable to, and often exceeding, that of the actual data submitters, highlighting a continuous need to improve and apply automated methods to molecular datasets in public data repositories to maximize their value and utility.
Proteomics; Annotations; Ontologies; Concept Identification; Natural Language Processing; MEDLINE
Adverse drug reactions (ADRs) can have severe consequences, such that the ability to predict ADRs prior to market introduction is desirable. Computational approaches applied to pre-clinical data might be one way to inform drug labeling and marketing with respect to potential ADRs.
Based on the premise that some of the molecular actors of ADRs involve interactions detectable in large, and increasingly public, compound screening campaigns, we generated logistic regression models that correlate post-marketing ADRs with screening data from the PubChem BioAssay database. These models analyze ADRs at the level of organ systems, the System Organ Classes (SOCs). Nine of the 19 SOCs under consideration were found to be significantly correlated with pre-clinical screening data. For 6 of the 8 established drugs for which we could retropredict SOC-specific adversities, prior knowledge was found that supports these predictions. We conclude by predicting SOC-specific adversities for three unapproved or recently introduced drugs.
Adverse drug reactions; prediction; machine learning; compound screening; pharmacovigilance
Identification of maternal environmental factors influencing preterm birth risks is important to understand the reasons for the increase in prematurity since 1990. Here, we utilized a health survey, the US National Health and Nutrition Examination Survey (NHANES), to search for personal environmental factors associated with preterm birth. 201 urine and blood markers of environmental factors, such as allergens, pollutants, and nutrients, were assayed in mothers (range of N: 49 to 724) who answered questions about any children born preterm (delivery <37 weeks). We screened each of the 201 factors for association with any child born preterm, adjusting for age, race/ethnicity, education, and household income. We attempted to verify the top finding, urinary bisphenol A, in an independent study of pregnant women attending Lucile Packard Children’s Hospital. We conclude that the association between maternal urinary levels of bisphenol A and preterm birth should be evaluated in a larger epidemiological investigation.
environmental exposure; environment-wide association study; preterm birth
Modeling physiologic and disease processes; Linking the genotype and phenotype; identifying genome and protein structure and function; visualization of data and knowledge; collaborative technologies; applications that link biomedical knowledge from diverse primary sources (includes automated indexing); knowledge representations; statistical analysis of large datasets; methods for integration of information from disparate sources; discovery, and text and data mining methods, automated learning; ontologies; knowledge bases
Deregulation in lysine methylation signaling has emerged as a common etiologic factor in cancer pathogenesis, with inhibitors of several histone lysine methyltransferases (KMTs) being developed as chemotherapeutics1. The largely cytoplasmic KMT SMYD3 (SET and MYND domain containing protein 3) is overexpressed in numerous human tumors2-4. However, the molecular mechanism by which SMYD3 regulates cancer pathways and its relationship to tumorigenesis in vivo are largely unknown. Here we show that methylation of MAP3K2 by SMYD3 increases MAP Kinase signaling and promotes the formation of Ras-driven carcinomas. Using mouse models for pancreatic ductal adenocarcinoma (PDAC) and lung adenocarcinoma (LAC), we found that abrogating SMYD3 catalytic activity inhibits tumor development in response to oncogenic Ras. We employed protein array technology to identify the MAP3K2 kinase as a target of SMYD3. In cancer cell lines, SMYD3-mediated methylation of MAP3K2 at lysine 260 potentiates activation of the Ras/Raf/MEK/ERK signaling module. Finally, the PP2A phosphatase complex, a key negative regulator of the MAP Kinase pathway, binds to MAP3K2 and this interaction is blocked by methylation. Together, our results elucidate a new role for lysine methylation in integrating cytoplasmic kinase-signaling cascades and establish a pivotal role for SMYD3 in the regulation of oncogenic Ras signaling.
The endoplasmic reticulum-associated degradation (ERAD) pathway is responsible for the translocation of misfolded proteins across the ER membrane into the cytosol for subsequent degradation by the proteasome. In order to understand the spectrum of clinical and molecular findings in a complex neurological syndrome, we studied a series of eight patients with inherited deficiency of N-glycanase 1 (NGLY1), a novel disorder of cytosolic ERAD dysfunction.
Whole-genome, whole-exome or standard Sanger sequencing techniques were employed. Retrospective chart reviews were performed in order to obtain clinical data.
All patients had global developmental delay, a movement disorder, and hypotonia. Other common findings included hypo- or alacrima (7/8), elevated liver transaminases (6/7), microcephaly (6/8), diminished reflexes (6/8), hepatocyte cytoplasmic storage material or vacuolization (5/6), and seizures (4/8). The nonsense mutation c.1201A>T (p.R401X) was the most common deleterious allele.
NGLY1 deficiency is a novel autosomal recessive disorder of the ERAD pathway associated with neurological dysfunction, abnormal tear production, and liver disease. The majority of patients detected to date carry a specific nonsense mutation that appears to be associated with severe disease. The phenotypic spectrum is likely to enlarge as cases with a broader range of mutations are detected.
NGLY1; alacrima; choreoathetosis; seizures; liver disease
Advanced age is associated with an increased risk of vascular morbidity, attributable in part to impairments in new blood vessel formation. Mesenchymal stem cells (MSCs) have previously been shown to play an important role in neovascularization and deficiencies in these cells have been described in aged patients. Here we utilize single cell transcriptional analysis to determine the effect of aging on MSC population dynamics. We identify an age-related depletion of a subpopulation of MSCs characterized by a pro-vascular transcriptional profile. Supporting this finding, we demonstrate that aged MSCs are also significantly compromised in their ability to support vascular network formation in vitro and in vivo. Finally, aged MSCs are unable to rescue age-associated impairments in cutaneous wound healing. Taken together, these data suggest that age-related changes in MSC population dynamics result in impaired therapeutic potential of aged progenitor cells. These findings have critical implications for therapeutic cell source decisions (autologous versus allogeneic) and indicate the necessity of strategies to improve functionality of aged MSCs.
Drug repositioning refers to the discovery of alternative uses for a drug that differ from its original intended use. One challenge in these efforts lies in choosing the indication for which to prospectively test a drug of interest. We systematically evaluated a drug treatment-based view of diseases in order to address this challenge. Suggested novel drug uses were generated using a guilt-by-association approach. Compared with control drug uses, the suggested novel drug uses were significantly enriched in clinical trials.
The volume of publicly available genomic scale data is increasing. Genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample. These annotations are not mapped to concepts in any ontology, making it difficult to integrate these datasets across repositories. We have previously developed methods to map text-annotations of tissue microarrays to concepts in the NCI thesaurus and SNOMED-CT.
In this work we generalize our methods to map text annotations of gene expression datasets to concepts in the UMLS. We demonstrate the utility of our methods by processing annotations of datasets in the Gene Expression Omnibus. We demonstrate that we enable ontology-based querying and integration of tissue and gene expression microarray data. We enable identification of datasets on specific diseases across both repositories. Our approach provides the basis for ontology-driven data integration for translational research on gene and protein expression data.
Based on this work we have built a prototype system for ontology based annotation and indexing of biomedical data. The system processes the text metadata of diverse resource elements such as gene expression data sets, descriptions of radiology images, clinical-trial reports, and PubMed article abstracts to annotate and index them with concepts from appropriate ontologies. The key functionality of this system is to enable users to locate biomedical data resources related to particular ontology concepts.
Both genetic and environmental factors contribute to human diseases. Most common diseases are influenced by a large number of genetic and environmental factors, most of which individually have only a modest effect on the disease. Though genetic contributions are relatively well characterized for some monogenetic diseases, there has been no effort at curating the extensive list of environmental etiological factors.
From a comprehensive search of the MeSH annotation of MEDLINE articles, we identified 3,342 environmental etiological factors associated with 3,159 diseases. We also identified 1,100 genes associated with 1,034 complex diseases from the NIH Genetic Association Database (GAD), a database of genetic association studies. 863 diseases have both genetic and environmental etiological factors available. Integrating genetic and environmental factors results in the "etiome", which we define as the comprehensive compendium of disease etiology. Clustering of environmental factors may alert clinicians to the risks of added exposures, or to synergy in interventions to alter these factors. Clustering of both genetic and environmental etiological factors puts genes in the context of environment in a quantitative manner.
In this paper, we obtained a comprehensive list of associations between disease and environmental factors using MeSH annotation of MEDLINE articles. It serves as a summary of current knowledge between etiological factors and diseases. By combining the environmental etiological factors and genetic factors from GAD, we computed the "etiome" profile for 863 diseases. Comparing diseases across these profiles may have utility for clinical medicine, basic science research, and population-based science.
Monitoring of renal graft status through peripheral blood (PB) rather than invasive biopsy is important as it will lessen the risk of infection and other stresses, while reducing the costs of rejection diagnosis. Blood gene biomarker panels were discovered by microarrays at a single center and subsequently validated and cross-validated by Q-PCR in the NIH SNSO1 randomized study from 12 US pediatric transplant programs. A total of 367 unique human PB samples, each paired with a graft biopsy for centralized, blinded phenotype classification, were analyzed (115 acute rejection (AR), 180 stable (STA) and 72 other causes of graft injury). Of the differentially expressed genes by microarray, Q-PCR analysis of a five gene-set (DUSP1, PBEF1, PSEN1, MAPK9 and NKTR) classified AR with high accuracy. A logistic regression model was built on independent training-set (n=47) samples and validated on independent test-set (n=198) samples, discriminating AR from STA with 91% sensitivity and 94% specificity and AR from all other non-AR phenotypes with 91% sensitivity and 90% specificity. The 5-gene set can diagnose AR, potentially avoiding the need for invasive renal biopsy. These data support the conduct of a prospective study to validate the clinical predictive utility of this diagnostic tool.
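For readers less familiar with the sensitivity and specificity figures quoted above, the following minimal sketch shows how both are derived from true labels and classifier calls. The function names and the toy labels are hypothetical and do not reproduce the study data.

```python
def confusion_counts(truth, predicted):
    """Count TP, FP, TN, FN for binary labels (True = acute rejection)."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of true rejection samples that the classifier flags."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-rejection samples that the classifier clears."""
    return tn / (tn + fp)


# Toy example: biopsy-confirmed status versus blood-test call.
truth = [True, True, True, False, False, False, False, False]
calls = [True, True, False, False, False, False, True, False]
tp, fp, tn, fn = confusion_counts(truth, calls)
print(sensitivity(tp, fn), specificity(tn, fp))
```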
acute allograft rejection; transplantation genomics; transplantation; transplant rejection; translational research; renal transplantation; renal allograft rejection; biomarker; bioinformatics
Whole exome sequencing by high-throughput sequencing of target-enriched genomic DNA (exome-seq) has become common in basic and translational research as a means of interrogating the interpretable part of the human genome at relatively low cost. Presented here is a comparison of three major commercial exome sequencing platforms from Agilent, Illumina and Nimblegen applied to the same human blood sample. The Nimblegen platform, which is the only one to use high-density overlapping baits, provides increased efficiency of enrichment and sensitivity for detecting variants but covers fewer genomic regions than the other platforms. As a result, Nimblegen requires the least amount of sequencing to sensitively detect small variants, but Agilent and Illumina are able to detect a greater total number of variants with additional sequencing. Illumina in particular captures the untranslated regions, which are missing from the Nimblegen and Agilent platforms. Exome sequencing and whole genome sequencing (WGS) of the same sample were also compared, demonstrating that exome-seq allows for the detection of additional small variants missed by WGS. These data suggest that WGS experiments benefit from being supplemented with targeted exome-seq data. This study serves to assist the community in selecting the optimal exome-seq platform for their experiments, as well as proving that exome-seq is capable of identifying important coding variations that are missed by a typical WGS experiment.
A nutrient-wide approach may be useful to comprehensively test and validate associations between nutrients (derived from foods and supplements) and blood pressure (BP) in an unbiased manner.
Methods and Results
Data from 4,680 participants ages 40–59 in the cross-sectional International Study of Macro/Micro-nutrients and Blood Pressure (INTERMAP) were stratified randomly into training and testing sets. NHANES cross-sectional cohorts of 1999–2000 to 2005–2006 were used for external validation. We performed multiple linear regression analyses associating each of 82 nutrients and 3 urine electrolytes with systolic and diastolic BP in the INTERMAP training set. Significant findings were validated in the INTERMAP testing set and further in the NHANES cohorts (False Discovery Rate <5% in training, p<0.05 for internal and external validation). Among the validated nutrients, alcohol and urinary sodium-to-potassium ratio were directly associated with systolic BP, and dietary phosphorus, magnesium, iron, thiamin, folacin, and riboflavin were inversely associated with systolic BP. In addition, dietary folacin and riboflavin were inversely associated with diastolic BP. The absolute effect sizes in the validation data (NHANES) ranged from 0.97 mmHg lower systolic BP (phosphorus) to 0.39 mmHg lower systolic BP (thiamin) per 1 SD difference in nutrient variable. Inclusion of nutrient intake from supplements in addition to foods gave similar results for some nutrients, though it attenuated the associations of folacin, thiamin and riboflavin intake with BP.
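The False Discovery Rate screen applied in the training set can be illustrated with the Benjamini-Hochberg step-up procedure. This is a sketch under the assumption that a Benjamini-Hochberg style correction is used; the text above does not name the exact FDR method, and the p-values below are invented.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans marking which tests pass FDR control.

    Step-up rule: find the largest rank k (1-based, ascending p-values)
    with p_(k) <= k / m * alpha, then reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            largest_passing_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            rejected[idx] = True
    return rejected


# Invented p-values, one per nutrient-BP regression.
example_p = [0.01, 0.04, 0.03, 0.5]
print(benjamini_hochberg(example_p, alpha=0.05))
```

Findings that pass this screen would then be carried forward to the testing and external validation sets, where a conventional p<0.05 threshold applies.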
We identified significant inverse associations between B vitamins and BP, relationships hitherto poorly investigated. Our analyses represent a systematic unbiased approach to the evaluation and validation of nutrient-BP associations.
blood pressure; diet; epidemiology; nutrition
Whole-genome sequencing is becoming commonplace, but the accuracy and completeness of variant calling by the most widely used platforms from Illumina and Complete Genomics have not been reported. Here we sequenced the genome of an individual with both technologies to a high average coverage of ~76×, and compared their performance with respect to sequence coverage and calling of single-nucleotide variants (SNVs), insertions and deletions (indels). Although 88.1% of the ~3.7 million unique SNVs were concordant between platforms, there were tens of thousands of platform-specific calls located in genes and other genomic regions. In contrast, 26.5% of indels were concordant between platforms. Target enrichment validated 92.7% of the concordant SNVs, whereas validation by genotyping array revealed a sensitivity of 99.3%. The validation experiments also suggested that >60% of the platform-specific variants were indeed present in the genome. Our results have important implications for understanding the accuracy and completeness of the genome sequencing platforms.
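The concordance figures above are expressed relative to the unique variants called by either platform. A minimal sketch of that calculation follows, assuming each call is keyed by chromosome, position, reference allele, and alternate allele; the key layout, function names, and example calls are hypothetical.

```python
def concordance(calls_a, calls_b):
    """Fraction of unique variants (union) called by both platforms."""
    union = calls_a | calls_b
    if not union:
        return 0.0
    return len(calls_a & calls_b) / len(union)


def platform_specific(calls_a, calls_b):
    """Variants called only by the first or only by the second platform."""
    return calls_a - calls_b, calls_b - calls_a


# Hypothetical calls as (chromosome, position, ref, alt) tuples.
platform_one = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
platform_two = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"), ("chr3", 75, "T", "C")}
print(concordance(platform_one, platform_two))
print(platform_specific(platform_one, platform_two))
```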
Small cell lung cancer (SCLC) is an aggressive neuroendocrine subtype of lung cancer with high mortality. We used a systematic drug-repositioning bioinformatics approach querying a large compendium of gene expression profiles to identify candidate FDA-approved drugs to treat SCLC. We found that tricyclic antidepressants and related molecules potently induce apoptosis in both chemonaïve and chemoresistant SCLC cells in culture, in mouse and human SCLC tumors transplanted into immunocompromised mice, and in endogenous tumors from a mouse model for human SCLC. The candidate drugs activate stress pathways and induce cell death in SCLC cells, at least in part by disrupting autocrine survival signals involving neurotransmitters and their G protein-coupled receptors. The candidate drugs inhibit the growth of other neuroendocrine tumors, including pancreatic neuroendocrine tumors and Merkel cell carcinoma. These experiments identify novel targeted strategies that can be rapidly evaluated in patients with neuroendocrine tumors through the repurposing of approved drugs.
Small Cell Lung Cancer (SCLC); drug repositioning; Tricyclic Antidepressants (TCAs); imipramine; G-protein coupled receptors (GPCRs)
Though genome-wide technologies, such as microarrays, are widely used, data from these methods are considered noisy; there is still varied success in downstream biological validation. We report a method that increases the likelihood of successfully validating microarray findings using real time RT-PCR, including genes at low expression levels and with small differences. We use a Bayesian network to identify the most relevant sources of noise based on the successes and failures in validation for an initial set of selected genes, and then improve our subsequent selection of genes for validation based on eliminating these sources of noise. The network displays the significant sources of noise in an experiment, and scores the likelihood of validation for every gene. We show how the method can significantly increase validation success rates. In conclusion, in this study, we have successfully added a new automated step to determine the contributory sources of noise that determine successful or unsuccessful downstream biological validation.
Bioinformatics; Bayesian network; Microarray; RT-PCR; Microarray data
Cancer-associated fibroblasts (CAFs) have been reported to support tumor progression by a variety of mechanisms. However, their role in the progression of non-small cell lung cancer (NSCLC) remains poorly defined. In addition, the extent to which specific proteins secreted by CAFs contribute directly to tumor growth is unclear. To study the role of CAFs in NSCLC, a cross-species functional characterization of mouse and human lung CAFs was performed. CAFs supported the growth of lung cancer cells in vivo by secretion of soluble factors that directly stimulate the growth of tumor cells. Gene expression analysis comparing normal mouse lung fibroblasts (NFs) and mouse lung CAFs identified multiple genes that correlate with the CAF phenotype. A gene signature of secreted genes upregulated in CAFs was an independent marker of poor survival in NSCLC patients. This secreted gene signature was upregulated in NFs after long-term exposure to tumor cells, demonstrating that NFs are “educated” by tumor cells to acquire a CAF-like phenotype. Functional studies identified important roles for CLCF1-CNTFR and IL6-IL6R signaling in promoting growth of NSCLC cells. This study identifies novel soluble factors contributing to the CAF protumorigenic phenotype in NSCLC and suggests new avenues for the development of therapeutic strategies.
Carcinoma-associated fibroblasts; lung cancer; cytokines; il6; clcf1