Though gene and protein measurements have been used to determine therapeutic action,1 develop diagnostic tests,2 and distinguish disease subtypes,3,4 they do not characterize a sample's entire phenotypic, environmental, or experimental context. Here we comprehensively consider associations between components of phenotype and environment and genes to identify genes that may govern phenotype and our response to the environment. Context from the annotations of gene expression data sets in the Gene Expression Omnibus is represented using the Unified Medical Language System, a compendium of biomedical vocabularies with nearly one million concepts. After showing how data sets can be clustered by annotative concepts, we find a network of relations between phenotypic, disease, environmental, and experimental context and genes with differential expression associated with these concepts. We identify novel genes related to concepts such as aging. Comprehensively identifying genes related to phenotype and environment is a step towards the Human Phenome Project.5
In considering a sample of cancer, such as one extracted from a patient's lung tumor, there remain other aspects to the patient and sample besides its gene expression and proteomic pattern, such as phenotype and clinical history (e.g. chief complaint of hemoptysis, family history, or tumor size), environmental exposures (e.g. duration of exposure to asbestos or cigarette smoke), and experimental conditions (e.g. anesthesia or sample preparation). It is well known that these snapshots of genomic or physiological state cannot by themselves represent the entire envirome and phenome of the sample and organism. The envirome was initially defined by Anthony, et al., as the totality of equivalent environmental influences contributing to mental disorders, and is broadened here to include all disorders and organisms;6 the phenome is the physical totality of all traits of an organism, as defined by Mahner and Kary.5,7
Relations between enviromic concepts and phenomic concepts have been invaluable to medicine. For example, one such relation is the association of environmental exposure to cigarette smoke with the phenotype of lung cancer development. Comprehensively relating specific concepts in the envirome and phenome to specific genes could lead to the identification of new disease-associated genes.5 Though some phenomic data is available,8 it is greatly overshadowed in size by the 50,000+ microarray measurements in repositories such as the Gene Expression Omnibus (GEO).9 Even for microarray data stored in standardized formats like MIAME and MAGE-ML,10,11 contextual annotations are represented by unstructured narrative text; determining the phenotype and environmental context is no longer a tractable manual process. A question we have sought to answer is whether prior investments in biomedical ontologies can provide leverage in finding phenome-genome and envirome-genome relations. We show here that a large set of phenome-genome and envirome-genome relations can be found within a public repository of transcriptome measurements, if the phenotypes and environmental context can be ascertained for each experiment, along with the expression measurements. We addressed this by creating a system that extracts contextual concepts from the sample annotations in GEO, represents these concepts using the Unified Medical Language System (UMLS), unifies the gene expression measurements across data sets using LocusLink, and finally relates the gene expression measurements to the contextual concepts (Fig. 1). 
UMLS is the largest available compendium of biomedical vocabularies, containing over 60 biomedical vocabularies with approximately one million inter-related concepts.12 UMLS already unifies vocabularies used in molecular biology and genomics, such as the Medical Subject Headings (MeSH), NCBI Taxonomy, and the Gene Ontology, with medical vocabularies including the International Classification of Diseases and SNOMED International.9,13,14
After manual elimination of incorrectly assigned concepts, described in Methods and Supplementary Note online, mappings to 4,127 UMLS concepts remained (from 296,843 mappings to 5,115 strings). Concepts were from 18 source vocabularies, with MeSH (23%), Read Codes (17%), and SNOMED International (14%) contributing the most. The GEO series description annotation was the most information-rich, as it provided the most unique concepts (Supplementary Table online). This was likely because GEO series descriptions are often dissimilar to each other, compared to sample descriptions, which are often repeated. As expected, the concepts mapping to the most annotations are Cells and RNA (Table 1).
Parsing failed on too short annotations, containing only laboratory identifiers and few recognizable words, or too long for parsing to complete. Regardless, over 99% of GEO samples were successfully directly mapped to at least one UMLS concept; the remainder were subordinate to a GEO series with annotations mapping to concepts. Thus, every GEO sample can be mapped either directly to concepts, or indirectly through its parent GEO series. Similarly, every GEO data set could be mapped to concepts, either directly through the GEO data set annotations, or indirectly through subordinate series and sample annotations.
We created a taxonomy of eleven types of errors seen in determining concepts from annotations, in three broad categories: GEO annotations, UMLS vocabularies, or text processing (Table 2). Three significant errors are listed here. Several important concepts in molecular biology experimentation are missing in UMLS, including “phosphate-buffered saline”, “transgenic”, and “growth medium”. Company names and protocol details falsely map to concepts. For example, the phrase “Axon Instruments (Foster City, California)” maps to Axon and Fostering, related to foster homes. Many annotations include the entire MIAME check-list,10 literally beginning with “The MIAME Checklist...”, so the salient details cannot be parsed.
To verify that data sets with similar context could now be considered similar mathematically, we clustered the data sets using the directly and indirectly mapped concepts (Fig. 2a). Qualitative examination of the tree revealed multiple branches where similar concepts were used in annotations from a single submitter, while other branches indicated successful clustering of similar data sets from multiple submitters. For example, muscle expression data used to study inflammatory myopathy, muscular dystrophy, aging, and dermatomyositis clustered together, even though these were from three different submitters (Fig. 2b).
We then sought relationships between specific genes and annotative concepts, where a gene shows a statistically significant difference in expression level between data sets annotated with that concept and data sets not annotated with the concept. We controlled for multiple comparison testing by calculating q-values for each relation, or the proportion of false positive relations expected if the given relation were significant.15 We used two separate methods to reduce the noise associated with errors in concept assignment: (1) keeping only those gene-concept relations demonstrated in two or more species, and (2) keeping only those relations with a very low q-value.
Our first method to heighten stringency was to consider only those gene-concept relations where a second relation was present between the same concept and an orthologous gene, assuming functional identity for orthologous genes. This resulted in relations between 444 genes and 46 concepts, which were graphically visualized as a phenome-genome network (Fig. 3a). Four of the top five connected concepts were related to muscle, such as Skeletal Muscle, with 128 genes (38 human, 40 mouse, and 50 rat) showing significantly increased expression level in data sets annotated with these concepts and 40 genes (6 human, 18 mouse, and 16 rat) showing decreased expression level (Fig. 3b). Many of these genes are known to be uniquely expressed in muscle, including Tnni1, Mybpc1, Mybph, and Pgam2; Pnliprp1, Chad, and Fgb are expressed in a few tissues including muscle. Human Pdlim3 shows the greatest differential expression between the 8 data sets annotated with the concept Muscle Cells (average normalized expression level 0.968) and the 34 data sets measuring this transcript without such an annotation (expression level 0.340), a highly significant difference by t-test with p < 1.2 × 10^−16, with q < 0.0002 across 100 permutations of mappings between data sets and the annotated concept (Fig. 3c). A similar pattern is seen for mouse Pdlim3 (Fig. 3d, p < 6 × 10^−8, q = 0.0025). Pdlim3 is thought to play a role in skeletal muscle development.16 Mouse Pdlim3 is most highly expressed in skeletal muscle, compared to 21 other tissues with EST count data in the NCBI UniGene,9 and 60 other tissues in the GNF SymAtlas panel of microarray expression measurements by Su, et al.17
To quantitate the sensitivity and validate the significance of these relations, we compared the list of genes related to the four muscle-related concepts with two lists of transcripts previously measured through sequencing or microarrays as being expressed in muscle. Of the 40 mouse transcripts found to have increased expression in data sets associated with the muscle concepts, 27 were also highly expressed transcripts in muscle in microarray measurements in the GNF SymAtlas,17 and 18 were sequenced from the mouse skeletal muscle library dbEST 8902.9 In total, at least 31 (78%) of the 40 mouse transcripts related to muscle could be independently validated.
However, only considering relations between concepts and ortholog families can be too restrictive. Few diseases have been studied in both human and model animals, and microarray data may not yet be available for both. Thus, we separately used a second approach to heighten stringency by considering only the most reliable gene-concept relations, defined as having a q-value ≤ 1%. This resulted in 64,003 relations between 281 biomedical concepts and 7,466 genes.
We explain here several of these relations involving phenotypic or environmental concepts. These relations were manually validated to ensure each of these concepts was mapped correctly to an annotation. We found 11 genes related to Aging. The mean normalized expression level of H6pd, the H form of glucose-6-phosphate dehydrogenase, drops from 0.90 in the 4 data sets annotated with Aging to 0.71 in the 35 data sets without the annotation (Fig. 4a, p = 1.2 × 10^−6, q = 0.01). Glucose-6-phosphate dehydrogenase activity is known to increase with age in rat brain and other tissues.18,19 Some individuals with G6PDH deficiency have been noted to have reduced mortality from cardiovascular disease and have increased longevity, though this is associated with the X-linked gene.20
Bdnf was less expressed in the 4 data sets annotated with Aging compared with 47 non-annotated data sets (Fig. 4b, p = 7.3 × 10^−10, q < 0.0001); Bdnf has been previously shown to have a significant drop in expression in human skin fibroblasts associated with advancing age.21 The other nine genes, including Tnnt1, Synj1, Tada2l, Slc7a2, Mgat2, and Kctd2, have no previously established association with aging, though Tnnt1 and Mgat2 have been associated with nemaline myopathy and type 2a congenital disorder of glycosylation, respectively. A decrease in Tnnt1 and an increase in Synj1 have been found to be associated with differentiation from human embryonic and hematopoietic stem cells, respectively.22,23
We found gene-concept relations for other phenotypic concepts, including diseases. The only gene relating to Leukemia is Ddx24; mean normalized expression of Ddx24 drops from 0.71 in the 6 data sets measuring this gene not associated with Leukemia to 0.44 in the 4 data sets with the annotation (Fig. 4c, p = 0.007, q = 0.01). Interestingly, this gene was first cloned from a leukemia cDNA library.24
The concept of Injury represents an environmental annotation and was related only to Gpx3 (plasma glutathione peroxidase) and Mapk14 (mitogen-activated protein kinase 14). Both demonstrate a significant increase in mean normalized expression levels in the four data sets associated with Injury compared with over 100 other data sets measuring these genes without this annotation (Fig. 4d and 4e, Gpx3 increases from 0.56 to 0.97; Mapk14 increases from 0.52 to 0.89; for both, p < 1 × 10^−15 and q < 0.0002). Both genes have been shown to be related to injury of various forms. Increased expression of plasma glutathione peroxidase has been shown to be protective against toxic injury to the liver,25 presumably by preventing damaging effects of reactive oxygen species. Plasma glutathione peroxidase activity significantly drops after burn injury, and rises after spinal cord injury.26,27 The mitogen-activated protein kinase pathway is activated in a variety of processes, including in response to environmental stresses and injury. Mapk14 is activated during wound healing and increases after ischemic myocardial injury.28,29
We have created and validated a system that identifies and represents phenotypic, environmental, and experimental context for every microarray sample and data set stored in GEO by mapping annotation phrases to biomedical concepts in UMLS. In addition to merely holding series of gene measurements, data sets can now be considered by their phenotypic, environmental, and experimental labels and even clustered based on their shared annotative concepts (Fig. 2), similar to how transcriptome measurements are commonly clustered based on expression measurements.
After finding annotative concepts in UMLS for every GEO data set, we have shown a network of relations between phenotype (e.g. aging), disease (e.g. leukemia), environmental (e.g. injury), and experimental context (e.g. muscle cells) and genes with differential expression associated with these concepts. Some of these relations exist even across orthologous genes in two or more species, and we have validated several of these phenome-genome and envirome-genome relations (Fig. 3). It is traditionally difficult to find genes associated with components of phenotype and environment. The example relations we have shown here were found because they were seen across multiple data sets (and types of microarrays) that were originally created to study many different processes, and only by integrating the data sets across the phenome and envirome were we able to find these associations. These relations immediately suggest further targeted cellular, genetic, and epidemiological study of how these genes may influence or are influenced by phenotype and environment. For example, mice that are homozygous null for Slc7a2 show a reduction in nitric oxide production in macrophages and fibroblasts, important in immune response and wound healing.30,31 Mice homozygous null for Mgat2 show early post-natal lethality and a number of other defects.32 Decreases in Tnnt1 and increases in Synj1 have been found to be markers for stem cell differentiation.22,23 Our relations link these four genes and others to aging; it is possible subtle changes in expression of these genes may impact aging of an organism. Further studies should be carried out to elucidate the roles of these genes in aging.
In addition, our findings suggest that UMLS is missing concepts important in molecular and cell biology experimentation, but is sufficient to represent many of the phenotypic, environmental, and experimental concepts held in the text-based annotations of transcriptome data. Though the representation of molecular biology concepts in UMLS could be improved and the parsing made more robust, improving the quality of the annotations is now the crucial step towards the automated determination of context. As more journals call for microarray data to be made publicly available, there is an increasing level of detail being placed in the annotations, with the goal of aiding others in reproducing the findings. In general, as more text is entered having little to do with the experimental design, or as more words are abbreviated, the potential for errors in understanding greatly rises, whether during manual or automated text processing. Findings have already been published in journals that were later judged as inaccurate because of incorrect annotations.33,34 Parsimony and precision in annotations will lead to better manual and computational understanding.
As we can now show that UMLS concepts can be used to represent the experimental context of a genomics experiment, even if only at a coarse resolution, we now have an opportunity to organize our research community to start the call for submitters of genomic data to describe their experiments using UMLS concepts. This could be done in a manner similar to how journals require that microarray data described in manuscripts be available in public repositories in standard formats.10,11,35 Software tools could become available to assist authors in choosing the correct concepts, allowing mappings to UMLS to be more accurate than those determined through automation. Finally, future successful phenome-genome and envirome-genome studies are assured, but the ease and accuracy of automating inferences across data are crucially dependent on the accuracy and consistency of the human annotation process, which will only happen when every investigator has a better prospective understanding of the long-term value of the time invested in improving annotations.
Source code for software and all relevant data sets are available at http://genotext.stanford.edu.
The Gene Expression Omnibus (GEO) is an international repository for gene expression data, developed and maintained by the National Library of Medicine.9 GEO samples (abbreviated GSM) relate expression measurements of multiple RNA transcripts with platform-specific identifiers. Though GEO holds samples from many types of parallel expression measurement systems, hereafter we will use the term “microarray” interchangeably with “sample”. The mappings between gene identifiers used in the sample and external gene identifiers, names and symbols are stored in GEO platforms (abbreviated GPL). Each GEO series (abbreviated GSE), usually corresponding to a single experiment, relates to multiple GSM. A subset of the GSE have previously been manually validated as containing internally comparable data; these are represented as GEO data sets (abbreviated GDS). The relationship between GSM, GSE, GPL and GDS is illustrated in Figure 1.
We accessed the Gene Expression Omnibus site on March 24, 2004 and downloaded each GEO series. At the time of downloading, the GEO FTP site contained 8,519 GSM in 524 GSE measured using 195 GPL, as well as 448 GDS. We wrote software in Perl that extracts seven types of annotations from the Gene Expression Omnibus and stores them in a MySQL relational database: GEO sample title, description, source, and keyword; GEO series title and description; and GEO data set title.
To accurately model the context of a sample, we need a vocabulary that spans the domains to be joined, including terms describing genes and proteins and their functions, organism phenotypes, species, medical conditions, and the experiments themselves. The interlinked set of controlled vocabularies that best serves this role is the Unified Medical Language System (UMLS). UMLS is currently freely available to academic researchers, needing only a signed license agreement. UMLS has three components. The Metathesaurus contains a catalog of unified biomedical concepts, relations between concepts, and text strings mapped to each concept. The Semantic Network contains a catalog of 135 higher level categories for all concepts in the Metathesaurus, as well as relations between these categories. The SPECIALIST lexicon and other resources provide data and tools for processing the text strings associated with UMLS concepts.
The UMLS Metathesaurus relates source vocabularies by creating concepts that span the vocabularies. Each unique concept relates to one or more source vocabularies. A concept may have multiple listed synonyms and terms; each term is uniquely specified in the Metathesaurus. Two types of relations between concepts are stored in the Metathesaurus: asserted structural or hierarchical relations, and statistical relations (determined by co-occurrence of concepts in MEDLINE records). An example of the basic relations in the Metathesaurus is shown in the Supplementary Figure online.
The 2003AC release of UMLS was obtained through the National Library of Medicine (http://www.nlm.nih.gov/research/umls/). Using the METAMORPHOSYS tool, we created a subset dropping vocabularies considered less relevant, including translations of terms into alternate languages. All subsequent analysis was performed using the remaining subset, which contains 878,496 concepts described by 1,724,070 text strings. A table of all mappings between annotations and concepts is available at the web-site given above.
MetaMap is a software program written to take free text and generate a list of potentially matching concepts from the UMLS Metathesaurus.36 Using the MetaMap programming libraries, we created a software system called GENOTEXT (GENOmic conTEXT) in Java that processes each of the seven types of GEO annotations and stores in a relational database the UMLS string unique identifier (SUI) of all candidate matches, the score of the match, and the phrase of original text in which the string was found. As concept unique identifiers (CUI) are needed in subsequent analyses, they are automatically determined using the SUI to CUI relations in the UMLS concepts table. We define gross mapping errors as those caused by the incorrect interpretation of abbreviations leading to multiple strings, such that it is highly unlikely that there is any proper reference for these mapped strings. We wrote a program that takes a manually specified list of SUI and text fragments (described as regular expressions) and eliminates the matching mappings (additional details may be found in the Supplementary Note online).
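The elimination step can be pictured with a minimal Python sketch. It is not the GENOTEXT code: the record layout, the function name `drop_gross_errors`, the SUI values, and the patterns are all hypothetical, chosen only to show how a manually curated block list of SUI and regular expressions could filter candidate mappings.

```python
import re

def drop_gross_errors(mappings, blocked_suis, blocked_patterns):
    """Remove mappings whose SUI is block-listed or whose source phrase
    matches any manually specified regular expression."""
    compiled = [re.compile(p, re.IGNORECASE) for p in blocked_patterns]
    return [
        m for m in mappings
        if m["sui"] not in blocked_suis
        and not any(rx.search(m["phrase"]) for rx in compiled)
    ]

# Hypothetical candidate mappings; the SUI values are placeholders, not real UMLS identifiers.
mappings = [
    {"sui": "S0000001", "phrase": "Axon Instruments (Foster City, California)"},
    {"sui": "S0000002", "phrase": "skeletal muscle biopsy"},
    {"sui": "S0000003", "phrase": "dm"},
]
kept = drop_gross_errors(
    mappings,
    blocked_suis={"S0000003"},                 # an ambiguous abbreviation, blocked by SUI
    blocked_patterns=[r"\bFoster City\b"],     # a vendor address, blocked by phrase
)
```

Blocking by SUI removes a string wherever it occurs, while blocking by pattern removes only mappings drawn from a particular kind of phrase, so the two lists serve different purposes.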
The value of clustering samples from multiple expression data sets by expression values has been previously demonstrated.37 Here we show how gene expression data sets themselves can be clustered by phenotypic, environmental, and experimental context, when this context is represented using a standardized vocabulary. Each data set is considered as a binary vector reflecting the presence or absence of a mapping from that data set or its subordinate GEO series and samples to a UMLS concept. Comprehensive pair-wise binary distances are then computed between vectors. Hierarchical clustering is performed with complete linkage in R.38
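As an illustration of this clustering step, the following Python sketch represents each data set by its set of concepts (equivalent to a binary vector) and merges clusters by complete linkage. It assumes that "binary distance" means the measure used by R's `dist(method = "binary")`, that is, the fraction of concepts present in only one of two data sets among those present in at least one; the data set names and concept assignments are invented.

```python
from itertools import combinations

def binary_distance(a, b):
    """Share of concepts found in only one set, among concepts found in either."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def complete_linkage(items):
    """Agglomerative clustering; returns the merges as (left, right, distance)."""
    clusters = [frozenset([name]) for name in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for x, y in combinations(clusters, 2):
            # Complete linkage: cluster distance is the largest pairwise distance.
            d = max(binary_distance(items[i], items[j]) for i in x for j in y)
            if best is None or d < best[0]:
                best = (d, x, y)
        d, x, y = best
        clusters = [c for c in clusters if c not in (x, y)] + [x | y]
        merges.append((sorted(x), sorted(y), d))
    return merges

# Hypothetical data sets and concept assignments, for illustration only.
datasets = {
    "GDS_A": {"Muscle", "Aging"},
    "GDS_B": {"Muscle", "Muscular Dystrophy"},
    "GDS_C": {"Leukemia", "Cells"},
}
merges = complete_linkage(datasets)
```

In this toy input the two muscle data sets merge first because they share a concept, and the leukemia data set joins last at the maximum distance of 1.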
Co-expression networks have been previously shown to relate genes by similar function.39,40 Here we show how we have comprehensively related gene expression, the transcriptome, to the rest of the envirome and phenome that remains external to these values in an automated manner. A mapping was manually created from 40% of the nearly 2.5 million GEO platform identifiers to LocusLink identifiers, allowing nearly 55 million expression measurements to be referenced using LocusLink. Mappings to genes were created using GEO platform files, and not by parsing annotations. Each gene expression measurement from each GEO sample is rank-normalized to between 0 and 1, depending on the relative ranking of the gene's expression level compared to other genes measured from the sample. A mean rank-normalized expression level is calculated for each gene across every subordinate GEO sample within a GEO data set, and the average normalized measurements are assigned to that GEO data set. Thus, each GEO data set has a panel of genes measured in its samples and a single average rank-normalized expression measurement for each gene.
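The normalization described above can be sketched in Python as follows. The exact scaling is an assumption: here the lowest-ranked gene in a sample maps to 0 and the highest to 1, and ties are broken by input order rather than averaged. The gene symbols are taken from the text, but the expression values are invented.

```python
from collections import defaultdict

def rank_normalize(sample):
    """Scale each gene's rank within one sample into the interval [0, 1]."""
    ordered = sorted(sample, key=sample.get)  # genes from lowest to highest value
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 0.5}              # a single gene has no relative rank
    return {gene: i / (n - 1) for i, gene in enumerate(ordered)}

def dataset_means(samples):
    """Average the rank-normalized level of each gene over a data set's samples."""
    totals, counts = defaultdict(float), defaultdict(int)
    for sample in samples:
        for gene, value in rank_normalize(sample).items():
            totals[gene] += value
            counts[gene] += 1
    return {gene: totals[gene] / counts[gene] for gene in totals}

# Two hypothetical samples from one data set; raw values are made up.
samples = [
    {"Pdlim3": 900.0, "Bdnf": 10.0, "Gpx3": 200.0},
    {"Pdlim3": 50.0, "Bdnf": 5.0, "Gpx3": 70.0},
]
means = dataset_means(samples)
```

Because only ranks are used, the two samples contribute equally even though their raw intensity scales differ by an order of magnitude, which is what allows measurements from different platforms to be averaged.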
We then iterate over the set of 55,619 measured genes, G, defined by LocusLink identifiers. For each gene g, we determine the GEO data sets in which the gene was measured, D, and then determine the UMLS concepts assigned to any of those GEO data sets, C. For each concept c in C, we determine the subset of D annotated with c, called d1, and those data sets within D not annotated with c, called d2. Concept c is only considered if subsets d1 and d2 have a minimum of 4 data sets each. We then determine whether the rank-normalized expression measurements of g are significantly different between d1 and d2. Significance is determined using a Student's t-test with unpaired values. An f-test for significantly different variances is first performed for each comparison; if positive (α ≤ 0.005), the t-test is performed with Welch correction for unequal variance.41 The threshold p-value for determining significance is determined using 100 random permutations: for each gene g and concept c, the assignment of GEO data sets D between d1 and d2 is randomly shuffled and the t-test is repeated. A q-value is then calculated for each relation between gene g and concept c, based on the proportion of false positive concepts in C we would expect mapped to the gene if concept c were significant. The table of all gene-concept relations (without additional filtering) is available at the web-site given above.
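The test for a single gene-concept pair can be sketched in Python under several simplifying assumptions, none of which come from the original software. The Welch statistic is used for every comparison rather than only after a positive f-test, significance is summarized as an empirical permutation p-value instead of a q-value, and the function names and expression levels are hypothetical.

```python
import random
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic for two independent groups with unequal variances."""
    se = (variance(x) / len(x) + variance(y) / len(y)) ** 0.5
    if se == 0:
        return 0.0  # both groups constant: treat as no measurable difference
    return (mean(x) - mean(y)) / se

def permutation_p(d1, d2, n_perm=100, seed=0, min_size=4):
    """Empirical p-value from shuffling data sets between d1 and d2."""
    if len(d1) < min_size or len(d2) < min_size:
        return None  # the concept is skipped when either group is too small
    observed = abs(welch_t(d1, d2))
    pooled = list(d1) + list(d2)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(welch_t(pooled[:len(d1)], pooled[len(d1):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids reporting p = 0

# Hypothetical mean rank-normalized levels of one gene across data sets.
annotated = [0.95, 0.97, 0.99, 0.96]                  # d1: data sets with concept c
unannotated = [0.30, 0.35, 0.32, 0.36, 0.31, 0.34]    # d2: data sets without it
p_value = permutation_p(annotated, unannotated)
```

With 100 permutations the smallest attainable value of this estimate is 1/101, so distinguishing among very strong relations would need either more permutations or the parametric p-value from the t-distribution.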
After all gene-concept relations are determined, those with q-values above 0.1 are discarded. Two independent methods are then used to study the remaining relations. In the first method, we study only those relations between gene g and concept c if there is another gene g’ mapped to the same concept c, such that gene g and g’ are in different species yet are in the same homology family, as defined by Homologene.9 In other words, a concept has to be strongly related to orthologous genes in two species to be included. These relations are made into a graph, additional UMLS hierarchical relations between concepts are added, and the graph is formatted using the yEd Java Graph Editor (yWorks GmbH, Tübingen, Germany). In the second method, we study only the most reliable relations, defined as those with q-values under 0.01. With both methods, we manually verify the concept assignments to each annotation before interpretation.
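The first, ortholog-based filter can be illustrated with a short Python sketch. The gene identifiers, homology family labels, and lookup tables below are hypothetical stand-ins for LocusLink and Homologene data, and the function name `ortholog_supported` is invented for this example.

```python
from collections import defaultdict

def ortholog_supported(relations, homology_family, species_of):
    """Keep (gene, concept) relations backed by a gene from the same homology
    family, in a different species, related to the same concept."""
    species_seen = defaultdict(set)  # (family, concept) -> species with a relation
    for gene, concept in relations:
        family = homology_family.get(gene)
        if family is not None:
            species_seen[(family, concept)].add(species_of[gene])
    return [
        (gene, concept)
        for gene, concept in relations
        if gene in homology_family
        and len(species_seen[(homology_family[gene], concept)]) >= 2
    ]

# Hypothetical relations that already passed the q-value cutoff.
relations = [
    ("human:PDLIM3", "Muscle Cells"),
    ("mouse:Pdlim3", "Muscle Cells"),
    ("rat:Fgb", "Skeletal Muscle"),
    ("human:DDX24", "Leukemia"),
]
homology_family = {
    "human:PDLIM3": "fam_pdlim3",
    "mouse:Pdlim3": "fam_pdlim3",
    "rat:Fgb": "fam_fgb",
    "human:DDX24": "fam_ddx24",
}
species_of = {
    "human:PDLIM3": "human",
    "mouse:Pdlim3": "mouse",
    "rat:Fgb": "rat",
    "human:DDX24": "human",
}
supported = ortholog_supported(relations, homology_family, species_of)
```

Only the two Pdlim3 relations survive here, since the other genes have no ortholog in a second species related to the same concept.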
Two sources were used to biologically validate the genes we find related to concepts representing Skeletal (C0521324), Skeletal Muscle (C0242695), Muscle Cells (C0596981), and Muscle (C0026845). First, a symbol list was obtained from the Genomics Institute of the Novartis Research Foundation Mouse SymAtlas GNF1M for genes that were expressed in a muscle cDNA library (obtained from pooled male and female 6–11 week old C57BL/6J mice) at 2 or more times the median expression level across all measured tissues.17 Second, a list of UniGene identifiers was obtained from UniGene for clusters of cDNAs sequenced from dbEST library 8902, the adult mouse skeletal muscle library from the University of Texas Southwestern Medical Center.9 Both of these lists were compared to the list of mouse genes with increased expression in GEO data sets containing annotations mapped to any of the four concepts. Neither of these data sets was included in the version of the Gene Expression Omnibus we used.
We thank Tarangini Deshpande for critical comments on and suggestions for the manuscript. The authors thank Partners Healthcare Research Computing for use of and assistance with the Linux High Performance Computing Cluster. The work was supported by grants from the Lucile Packard Foundation for Children's Health, NIH National Center for Biomedical Computing (U54 LM008748), the National Library of Medicine (K22 LM008261), National Institute of Diabetes and Digestive and Kidney Diseases (K12 DK63696, R01 DK62948, and R01 DK060837), the Harvard-MIT Division of Health Sciences and Technology, and the Lawson Wilkins Pediatric Endocrine Society.
Supplementary Figure: An illustrative subset of concepts and relations from UMLS. Concepts, given a serial number (e.g. C0011849) differ from the multitude of terms (e.g. “diabetes mellitus”) and abbreviations (e.g. “dm”) used to describe them. Each concept is contributed from a source vocabulary (e.g. the International Classification of Diseases, 9th revision, or ICD-9), which may have its own original code or designation for the concept or term (e.g. “250”). Asserted relationships, such as hierarchical relations, are available for many, but not all concepts. Statistical relationships (based on co-occurrence in MEDLINE records) are only available for those concepts that are also represented in the MeSH vocabularies.
Supplementary Table: Number of concepts mapped to each of the seven GEO free-text annotations.