Mouse gene expression data are complex and voluminous. To maximize the utility of these data, they must be made readily accessible through databases, and those resources need to place the expression data in the larger biological context. Here we describe two community resources that approach these problems in different but complementary ways: BioGPS and the Mouse Gene Expression Database (GXD). BioGPS connects its large and homogeneous microarray gene expression reference data sets via plugins with a heterogeneous collection of external gene-centric resources, thus casting a wide but loose net. GXD acquires different types of expression data from many sources and integrates these data tightly with other types of data in the Mouse Genome Informatics (MGI) resource, with a strong emphasis on consistency checks and manual curation. We describe and contrast the “loose” and “tight” data integration strategies employed by BioGPS and GXD, respectively, and discuss the challenges and benefits of data integration. BioGPS is freely available at http://biogps.org. GXD is freely available through the Mouse Genome Informatics (MGI) web site (www.informatics.jax.org), or directly at www.informatics.jax.org/expression.shtml.
data integration; gene expression; database
Structured gene annotations are a foundation upon which many bioinformatics and statistical analyses are built. However, the structured annotations available in public databases are a sparse representation of biological knowledge as a whole. The rate of biomedical data generation is such that centralized biocuration efforts struggle to keep up. New models for gene annotation need to be explored that increase the pace at which we are able to structure biomedical knowledge. Recently, online games have emerged as an effective way to recruit, engage and organize large numbers of volunteers to help address difficult biological challenges. For example, games have been successfully developed for protein folding (Foldit), multiple sequence alignment (Phylo) and RNA structure design (EteRNA). Here we present Dizeez, a simple online game built with the purpose of structuring knowledge of gene-disease associations. Preliminary results from game play online and at scientific conferences suggest that Dizeez is producing valid gene-disease annotations not yet present in any public database. These early results provide a basic proof of principle that online games can be successfully applied to the challenge of gene annotation. Dizeez is available at http://genegames.org.
The Gene Ontology and its associated annotations are critical tools for interpreting lists of genes. Here, we introduce a method for evaluating the Gene Ontology annotations and structure based on the impact they have on gene set enrichment analysis, along with an example implementation. This task-based approach yields quantitative assessments grounded in experimental data and anchored tightly to the primary use of the annotations.
Applied to specific areas of biological interest, our framework allowed us to understand the progress of annotation and structural ontology changes from 2004 to 2012. Our framework was also able to determine that the quality of annotations and structure in the area under test has been improving in its ability to recall underlying biological traits. Furthermore, we were able to distinguish between the impact of changes to the annotation sets and ontology structure.
Our framework and implementation lay the groundwork for a powerful tool in evaluating the usefulness of the Gene Ontology. We demonstrate both the flexibility and the power of this approach in evaluating the current and past state of the Gene Ontology as well as its applicability in developing new methods for creating gene annotations.
The Gene Wiki is an open-access and openly editable collection of Wikipedia articles about human genes. Initiated in 2008, it has grown to include articles about more than 10 000 genes that, collectively, contain more than 1.4 million words of gene-centric text with extensive citations back to the primary scientific literature. This growing body of useful, gene-centric content is the result of the work of thousands of individuals throughout the scientific community. Here, we describe recent improvements to the automated system that keeps the structured data presented on Gene Wiki articles in sync with the data from trusted primary databases. We also describe the expanding contents, editors and users of the Gene Wiki. Finally, we introduce a new automated system, called WikiTrust, which can effectively compute the quality of Wikipedia articles, including Gene Wiki articles, at the word level. All articles in the Gene Wiki can be freely accessed and edited at Wikipedia, and additional links and information can be found at the project's Wikipedia portal page: http://en.wikipedia.org/wiki/Portal:Gene_Wiki.
The protein folding game Foldit shows that games are an effective way to recruit, engage and organize ordinary citizens to help solve difficult scientific problems.
Fast-evolving technologies have enabled researchers to easily generate data at genome scale, and using these technologies to compare biological states typically results in a list of candidate genes. Researchers are then faced with the daunting task of prioritizing these candidate genes for follow-up studies. There are hundreds, possibly even thousands, of web-based gene annotation resources available, but it quickly becomes impractical to manually access and review all of these sites for each gene in a candidate gene list. BioGPS (http://biogps.org) was created as a centralized gene portal for aggregating distributed gene annotation resources, emphasizing community extensibility and user customizability. BioGPS serves as a convenient tool for users to access known gene-centric resources, as well as a mechanism to discover new resources that were previously unknown to the user. This article describes updates to BioGPS made after its initial release in 2008. We summarize recent additions of features and data, as well as the robust user activity that underlies this community intelligence application. Finally, we describe MyGene.info (http://mygene.info) and related web services that provide programmatic access to BioGPS.
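As a rough illustration of what programmatic access to a gene annotation service can look like, the sketch below only builds a query URL and does not contact any server. The `/v3/query` path, the `q`, `species` and `fields` parameter names, and the helper name `build_gene_query_url` are assumptions made for illustration; the service documentation is the authority on the actual interface.

```python
from urllib.parse import urlencode

# Assumed base URL and path for the MyGene.info query service. The abstract
# only gives http://mygene.info; the "/v3/query" path is an assumption.
BASE_URL = "http://mygene.info/v3/query"


def build_gene_query_url(symbol, species="human",
                         fields=("symbol", "name", "entrezgene")):
    """Return a query URL for one gene symbol without contacting the server.

    The parameter names used here (q, species, fields) are assumptions.
    """
    params = {
        "q": f"symbol:{symbol}",       # restrict the search to the symbol field
        "species": species,            # limit results to one species
        "fields": ",".join(fields),    # ask only for the fields we need
    }
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # CDK2 is an arbitrary example gene symbol.
    print(build_gene_query_url("CDK2"))
```

Building the URL separately from sending the request keeps the example testable offline; the returned string could then be passed to any HTTP client.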
A variety of topic-focused wikis are used in the biomedical sciences to enable the mass-collaborative synthesis and distribution of diverse bodies of knowledge. To address complex problems such as defining the relationships between genes and disease, it is important to bring the knowledge from many different domains together. Here we show how advances in wiki technology and natural language processing can be used to automatically assemble ‘meta-wikis’ that present integrated views over the data collaboratively created in multiple source wikis.
We produced a semantic meta-wiki called the Gene Wiki+ that automatically mirrors and integrates data from the Gene Wiki and SNPedia. The Gene Wiki+, available at http://genewikiplus.org/, captures 8,047 distinct gene-disease relationships. SNPedia accounts for 4,149 of the gene-disease pairs, the Gene Wiki provides 4,377 and only 479 appear independently in both sources. All of this content is available to query and browse and is provided as linked open data.
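The counts reported above are consistent with simple inclusion-exclusion over the two sources. The following minimal check uses only the figures stated in the text; the function and variable names are illustrative.

```python
def union_size(size_a, size_b, size_both):
    """Size of the union of two sets, given their sizes and their overlap."""
    return size_a + size_b - size_both


# Figures taken from the text above.
snpedia_pairs = 4149     # gene-disease pairs found in SNPedia
gene_wiki_pairs = 4377   # gene-disease pairs found in the Gene Wiki
shared_pairs = 479       # pairs that appear in both sources

distinct_pairs = union_size(snpedia_pairs, gene_wiki_pairs, shared_pairs)
only_snpedia = snpedia_pairs - shared_pairs
only_gene_wiki = gene_wiki_pairs - shared_pairs

if __name__ == "__main__":
    print(distinct_pairs, only_snpedia, only_gene_wiki)
```

The small overlap relative to either source is what makes the integrated view informative: most pairs are contributed by only one of the two wikis.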
Wikis contain increasing amounts of diverse, biological information useful for elucidating the connections between genes and disease. The Gene Wiki+ shows how wiki technology can be used in concert with natural language processing to provide integrated views over diverse underlying data sources.
Wikipedia is increasingly used as a platform for collaborative data curation, but its current technical implementation has significant limitations that hinder its use in biocuration applications. Specifically, while editors can easily link between two articles in Wikipedia to indicate a relationship, there is no way to indicate the nature of that relationship in a way that is computationally accessible to the system or to external developers. For example, in addition to noting a relationship between a gene and a disease, it would be useful to differentiate the cases where genetic mutation or altered expression causes the disease. Here, we introduce a straightforward method that allows Wikipedia editors to embed computable semantic relations directly in the context of current Wikipedia articles. In addition, we demonstrate two novel applications enabled by the presence of these new relationships. The first is a dynamically generated information box that can be rendered on all semantically enhanced Wikipedia articles. The second is a prototype gene annotation system that draws its content from the gene-centric articles on Wikipedia and exposes the new semantic relationships to enable previously impossible, user-defined queries.
Ontology-based gene annotations are important tools for organizing and analyzing genome-scale biological data. Collecting these annotations is a valuable but costly endeavor. The Gene Wiki makes use of Wikipedia as a low-cost, mass-collaborative platform for assembling text-based gene annotations. The Gene Wiki comprises more than 10,000 review articles, each describing one human gene. The goal of this study is to define and assess a computational strategy for translating the text of Gene Wiki articles into ontology-based gene annotations. We specifically explore the generation of structured annotations using the Gene Ontology and the Human Disease Ontology.
Our system produced 2,983 candidate gene annotations using the Disease Ontology and 11,022 candidate annotations using the Gene Ontology from the text of the Gene Wiki. Based on manual evaluations and comparisons to reference annotation sets, we estimate a precision of 90-93% for the Disease Ontology annotations and 48-64% for the Gene Ontology annotations. We further demonstrate that this data set can systematically improve the results from gene set enrichment analyses.
The Gene Wiki is a rapidly growing corpus of text focused on human gene function. Here, we demonstrate that the Gene Wiki can be a powerful resource for generating ontology-based gene annotations. These annotations can be used immediately to improve workflows for building curated gene annotation databases and knowledge-based statistical analyses.
We analyzed the gene expression patterns of 138 Non-Small Cell Lung Cancer (NSCLC) samples and developed a new algorithm called Coverage Analysis with Fisher’s Exact Test (CAFET) to identify molecular pathways that are differentially activated in squamous cell carcinoma (SCC) and adenocarcinoma (AC) subtypes. Analysis of the lung cancer samples demonstrated hierarchical clustering according to the histological subtype and revealed a strong enrichment for the Wnt signaling pathway components in the cluster consisting predominantly of SCC samples. The specific gene expression pattern observed correlated with enhanced activation of the Wnt Planar Cell Polarity (PCP) pathway and inhibition of the canonical Wnt signaling branch. Further real time RT-PCR follow-up with additional primary tumor samples and lung cancer cell lines confirmed enrichment of Wnt/PCP pathway associated genes in the SCC subtype. Dysregulation of the canonical Wnt pathway, characterized by increased levels of β-catenin and epigenetic silencing of negative regulators, has been reported in adenocarcinoma of the lung. Our results suggest that SCC and AC utilize different branches of the Wnt pathway during oncogenesis.
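The abstract does not spell out how CAFET works, so the sketch below is not that algorithm. It shows only the underlying building block it is named after, a one-sided Fisher's exact test for whether a pathway is over-represented in a gene list, under the assumption of a plain 2x2 contingency table. All counts in the usage example are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    a: pathway genes in the list      b: pathway genes not in the list
    c: other genes in the list        d: other genes not in the list
    Returns P(overlap >= a) under the hypergeometric null distribution.
    """
    pathway_total = a + b
    list_total = a + c
    n = a + b + c + d
    max_overlap = min(pathway_total, list_total)
    favourable = sum(
        comb(pathway_total, k) * comb(n - pathway_total, list_total - k)
        for k in range(a, max_overlap + 1)
    )
    return favourable / comb(n, list_total)


if __name__ == "__main__":
    # Hypothetical counts: 12 of 40 pathway genes fall in a 200-gene
    # signature drawn from a 10,000-gene background.
    p_value = fisher_exact_greater(12, 28, 188, 9772)
    print(f"one-sided p = {p_value:.3g}")
```

Exact integer arithmetic via `math.comb` avoids the rounding problems of factorial-based formulas, at the cost of being slow for very large backgrounds.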
Animal models of human behavioral endophenotypes, such as the Tail Suspension Test (TST) and the Open Field assay (OF), have proven to be essential tools in revealing the genetics and mechanisms of psychiatric diseases. As in the human disorders they model, the measurements generated in these behavioral assays are significantly impacted by the genetic background of the animals tested. In order to better understand the strain-dependent phenotypic variability endemic to this type of work, and better inform future studies that rely on the data generated by these models, we phenotyped 33 inbred mouse strains for immobility in the TST, a mouse model of behavioral despair, and for activity in the OF, a model of general anxiety and locomotor activity.
We identified significant strain-dependent differences in TST immobility, and in thigmotaxis and distance traveled in the OF. These results were replicable over multiple testing sessions and exhibited high heritability. We exploited the heritability of these behavioral traits by using in silico haplotype-based association mapping to identify candidate genes for regulating TST behavior. Two significant loci (-logp >7.0, gFWER adjusted p value <0.05) of approximately 300 kb each on MMU9 and MMU10 were identified. The MMU10 locus is syntenic to a major human depressive disorder QTL on human chromosome 12 and contains several genes that are expressed in brain regions associated with behavioral despair.
We report the results of phenotyping a large panel of inbred mouse strains for depression and anxiety-associated behaviors. These results show significant, heritable strain-specific differences in behavior, and should prove to be a valuable resource for the behavioral and genetics communities. Additionally, we used haplotype mapping to identify several loci that may contain genes that regulate behavioral despair.
The study of expression quantitative trait loci (eQTL) is a powerful way of detecting transcriptional regulators at a genomic scale and for elucidating how natural genetic variation impacts gene expression. Power and genetic resolution are heavily affected by the study population: whereas recombinant inbred (RI) strains yield greater statistical power with low genetic resolution, using diverse inbred or outbred strains improves genetic resolution at the cost of lower power. In order to overcome the limitations of both individual approaches, we combine data from RI strains with genetically more diverse strains and analyze hippocampus eQTL data obtained from mouse RI strains (BXD) and from a panel of diverse inbred strains (Mouse Diversity Panel, MDP). We perform a systematic analysis of the consistency of eQTL independently obtained from these two populations and demonstrate that a significant fraction of eQTL can be replicated. Based on existing knowledge from pathway databases we assess different approaches for using the high-resolution MDP data for fine mapping BXD eQTL. Finally, we apply this framework to an eQTL hotspot on chromosome 1 (Qrr1), which has been implicated in a range of neurological traits. Here we present the first systematic examination of the consistency between eQTL obtained independently from the BXD and MDP populations. Our analysis of fine-mapping approaches is based on ‘real life’ data as opposed to simulated data and it allows us to propose a strategy for using MDP data to fine map BXD eQTL. Application of this framework to Qrr1 reveals that this eQTL hotspot is not caused by just one (or few) ‘master regulators’, but actually by a set of polymorphic genes specific to the central nervous system.
Two decades of research identified more than a dozen clock genes and defined a biochemical feedback mechanism of circadian oscillator function. To identify additional clock genes and modifiers, we conducted a genome-wide siRNA screen in a human cellular clock model. Knockdown of nearly a thousand genes reduced rhythm amplitude. Potent effects on period length or increased amplitude were less frequent; we found hundreds of these and confirmed them in secondary screens. Characterization of a subset of these genes demonstrated a dosage-dependent effect on oscillator function. Protein interaction network analysis showed that dozens of gene products directly or indirectly associate with known clock components. Pathway analysis revealed these genes are overrepresented for components of insulin and hedgehog signaling, the cell cycle, and folate metabolism. Coupled with data showing many of these pathways are clock-regulated, we conclude the clock is interconnected with many aspects of cellular function.
In Huntington's disease (HD), an expanded CAG repeat produces characteristic striatal neurodegeneration. Interestingly, the HD CAG repeat, whose length determines age at onset, undergoes tissue-specific somatic instability, predominant in the striatum, suggesting that tissue-specific CAG length changes could modify the disease process. Therefore, understanding the mechanisms underlying the tissue specificity of somatic instability may provide novel routes to therapies. However, progress in this area has been hampered by the lack of sensitive high-throughput instability quantification methods and global approaches to identify the underlying factors.
Here we describe a novel approach to gain insight into the factors responsible for the tissue specificity of somatic instability. Using accurate genetic knock-in mouse models of HD, we developed a reliable, high-throughput method to quantify tissue HD CAG repeat instability and integrated this with genome-wide bioinformatic approaches. Using tissue instability quantified in 16 tissues as a phenotype and tissue microarray gene expression as a predictor, we built a mathematical model and identified a gene expression signature that accurately predicted tissue instability. Using the predictive ability of this signature we found that somatic instability was not a consequence of pathogenesis. In support of this, genetic crosses with models of accelerated neuropathology failed to induce somatic instability. In addition, we searched for genes and pathways that correlated with tissue instability. We found that expression levels of DNA repair genes did not explain the tissue specificity of somatic instability. Instead, our data implicate other pathways, particularly cell cycle, metabolism and neurotransmitter pathways, acting in combination to generate tissue-specific patterns of instability.
Our study clearly demonstrates that multiple tissue factors reflect the level of somatic instability in different tissues. In addition, our quantitative, genome-wide approach is readily applicable to high-throughput assays and opens the door to widespread applications with the potential to accelerate the discovery of drugs that alter tissue instability.
Annotating the function of all human genes is a critical, yet formidable, challenge. Current gene annotation efforts focus on centralized curation resources, but it is increasingly clear that this approach does not scale with the rapid growth of the biomedical literature. The Gene Wiki utilizes an alternative and complementary model based on the principle of community intelligence. Directly integrated within the online encyclopedia, Wikipedia, the goal of this effort is to build a gene-specific review article for every gene in the human genome, where each article is collaboratively written, continuously updated and community reviewed. Previously, we described the creation of Gene Wiki ‘stubs’ for approximately 9000 human genes. Here, we describe ongoing systematic improvements to these articles to increase their utility. Moreover, we retrospectively examine the community usage and improvement of the Gene Wiki, providing evidence of a critical mass of users and editors. Gene Wiki articles are freely accessible within the Wikipedia web site, and additional links and information are available at http://en.wikipedia.org/wiki/Portal:Gene_Wiki.
Glyoxalase 1 (Glo1) has been implicated in anxiety-like behavior in mice and in multiple psychiatric diseases in humans. We used mouse Affymetrix exon arrays to detect copy number variants (CNV) among inbred mouse strains and thereby identified a ∼475 kb tandem duplication on chromosome 17 that includes Glo1 (30,174,390–30,651,226 bp; mouse genome build 36). We developed a PCR-based strategy and used it to detect this duplication in 23 of 71 inbred strains tested, and in various outbred and wild-caught mice. Presence of the duplication is associated with a cis-acting expression QTL for Glo1 (LOD>30) in BXD recombinant inbred strains. However, evidence for an eQTL for Glo1 was not obtained when we analyzed single SNPs or 3-SNP haplotypes in a panel of 27 inbred strains. We conclude that association analysis in the inbred strain panel failed to detect an eQTL because the duplication was present on multiple highly divergent haplotypes. Furthermore, we suggest that non-allelic homologous recombination has led to multiple reversions to the non-duplicated state among inbred strains. We show associations between multiple duplication-containing haplotypes, Glo1 expression and anxiety-like behavior in both inbred strain panels and outbred CD-1 mice. Our findings provide a molecular basis for differential expression of Glo1 and further implicate Glo1 in anxiety-like behavior. More broadly, these results identify problems with commonly employed tests for association in inbred strains when CNVs are present. Finally, these data provide an example of biologically significant phenotypic variability in model organisms that can be attributed to CNVs.
This manuscript describes the creation of a comprehensive gene wiki, seeded with data from public domain sources, which will enable and encourage community annotation of gene function.
Monocytes and macrophages express an extensive repertoire of G Protein-Coupled Receptors (GPCRs) that regulate inflammation and immunity. In this study we performed a systematic micro-array analysis of GPCR expression in primary mouse macrophages to identify family members that are either enriched in macrophages compared to a panel of other cell types, or are regulated by an inflammatory stimulus, the bacterial product lipopolysaccharide (LPS).
Several members of the P2RY family had striking expression patterns in macrophages; P2ry6 mRNA was essentially expressed in a macrophage-specific fashion, whilst P2ry1 and P2ry5 mRNA levels were strongly down-regulated by LPS. Expression of several other GPCRs was either restricted to macrophages (e.g. Gpr84) or to both macrophages and neural tissues (e.g. P2ry12, Gpr85). The GPCR repertoire expressed by bone marrow-derived macrophages and thioglycollate-elicited peritoneal macrophages had some commonality, but there were also several GPCRs preferentially expressed by either cell population.
The constitutive or regulated expression in macrophages of several GPCRs identified in this study has not previously been described. Future studies on such GPCRs and their agonists are likely to provide important insights into macrophage biology, as well as novel inflammatory pathways that could be future targets for drug discovery.
Adipose tissue renewal and obesity-driven expansion of fat cell number are dependent on proliferation and differentiation of adipose progenitors that reside in the vasculature that develops in coordination with adipose depots. The transcriptional events that regulate commitment of progenitors to the adipose lineage are poorly understood. Because expression of the nuclear receptor PPARγ defines the adipose lineage, isolation of elements that control PPARγ expression in adipose precursors may lead to discovery of transcriptional regulators of early adipocyte determination. Here, we describe the identification and validation in transgenic mice of 5 highly conserved non-coding sequences from the PPARγ locus that can drive expression of a reporter gene in a manner that recapitulates the tissue-specific pattern of PPARγ expression. Surprisingly, these 5 elements appear to control PPARγ expression in adipocyte precursors that are associated with the vasculature of adipose depots, but not in mature adipocytes. Characterization of these five PPARγ regulatory sequences may enable isolation of the transcription factors that bind these cis elements and provide insight into the molecular regulation of adipose tissue expansion in normal and pathological states.
A search of the literature reveals that researchers study relatively few genes out of the total human genome.
Gene annotation, as measured by links to the biomedical literature and funded grants, is governed by a power law, indicating that researchers favor the extensive study of relatively few genes. This emphasizes the need for data-driven science to accomplish genome-wide gene annotation.
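To make the power-law claim concrete, the sketch below estimates an exponent by ordinary least squares on log-transformed values. This is a common but crude estimator and is not necessarily the method used in the study. The data are synthetic and generated inside the example, not taken from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = c * x**(-alpha) by least squares on log(x) and log(y).

    Returns (alpha, c). All xs and ys must be strictly positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


if __name__ == "__main__":
    # Synthetic example: genes ranked by number of linked publications,
    # with counts that follow an exact power law (hypothetical values).
    gene_ranks = list(range(1, 51))
    publication_counts = [1000.0 * rank ** -1.5 for rank in gene_ranks]
    alpha, c = fit_power_law(gene_ranks, publication_counts)
    print(f"alpha = {alpha:.3f}, c = {c:.1f}")
```

On real count data, a straight line on a log-log plot is only suggestive; a maximum-likelihood fit with a goodness-of-fit test is the more defensible choice.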
Finding the genetic causes of quantitative traits is a complex and difficult task. Classical methods for mapping quantitative trait loci (QTL) in mice use an F2 cross between two strains with substantially different phenotypes and an interval mapping method to compute confidence intervals at each position in the genome. This process requires significant resources for breeding and genotyping, and the data generated are usually only applicable to one phenotype of interest. Recently, we reported the application of a haplotype association mapping method which utilizes dense genotyping data across a diverse panel of inbred mouse strains and a marker association algorithm that is independent of any specific phenotype. As the availability of genotyping data grows in size and density, analysis of these haplotype association mapping methods should be of increasing value to the statistical genetics community.
We describe a detailed comparative analysis of variations on our marker association method. In particular, we describe the use of inferred haplotypes from adjacent SNPs, parametric and nonparametric statistics, and control of multiple testing error. These results show that nonparametric methods are slightly better in the test cases we study, although the choice of test statistic may often be dependent on the specific phenotype and haplotype structure being studied. The use of multi-SNP windows to infer local haplotype structure is critical to the use of a diverse panel of inbred strains for QTL mapping. Finally, because the marginal effect of any single gene in a complex disease is often relatively small, these methods require the use of sensitive methods for controlling family-wise error. We also report our initial application of this method to phenotypes cataloged in the Mouse Phenome Database.
The use of inbred strains of mice for QTL mapping has many advantages over traditional methods. However, there are also limitations in comparison to the traditional linkage analysis from F2 and RI lines. Application of these methods requires careful consideration of algorithmic choices based on both theoretical and practical factors. Our findings suggest general guidelines, though a complete evaluation of these methods can only be performed as more genetic data in complex diseases becomes available.
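The combination of a nonparametric statistic with family-wise error control can be sketched as follows. This is an assumption-laden illustration, not the authors' algorithm: each marker is reduced to a binary allele per strain (standing in for an inferred multi-SNP haplotype), the statistic is the absolute difference in mean phenotype rank, and family-wise error is controlled with a max-statistic permutation test. All names and data are hypothetical.

```python
import random


def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_statistic(genotype, phenotype_ranks):
    """Absolute difference in mean phenotype rank between allele groups."""
    group0 = [r for g, r in zip(genotype, phenotype_ranks) if g == 0]
    group1 = [r for g, r in zip(genotype, phenotype_ranks) if g == 1]
    if not group0 or not group1:
        return 0.0  # marker is not polymorphic in this panel
    return abs(sum(group1) / len(group1) - sum(group0) / len(group0))


def fwer_adjusted_pvalues(genotypes, phenotype, n_perm=1000, seed=0):
    """Max-statistic permutation p-values, one per marker."""
    rng = random.Random(seed)
    ranks = average_ranks(phenotype)
    observed = [rank_statistic(g, ranks) for g in genotypes]
    exceed = [0] * len(genotypes)
    shuffled = ranks[:]
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the strain-to-phenotype link
        max_stat = max(rank_statistic(g, shuffled) for g in genotypes)
        for m, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[m] += 1
    # Add-one correction keeps the estimate strictly above zero.
    return [(e + 1) / (n_perm + 1) for e in exceed]


if __name__ == "__main__":
    # Twelve hypothetical strains: marker 0 separates low from high
    # phenotypes, marker 1 is unrelated to the phenotype.
    phenotype = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
    genotypes = [[0] * 6 + [1] * 6, [0, 1] * 6]
    print(fwer_adjusted_pvalues(genotypes, phenotype))
```

Comparing each marker against the permutation distribution of the genome-wide maximum accounts for correlation between markers, which a Bonferroni correction would ignore.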
Rab GTPases and SNARE fusion proteins direct cargo trafficking through the exocytic and endocytic pathways of eukaryotic cells. We have used steady state mRNA expression profiling and computational hierarchical clustering methods to generate a global overview of the distribution of Rabs, SNAREs, and coat machinery components, as well as their respective adaptors, effectors, and regulators in 79 human and 61 mouse nonredundant tissues. We now show that this systems biology approach can be used to define building blocks for membrane trafficking based on Rab-centric protein activity hubs. These Rab-regulated hubs provide a framework for an integrated coding system, the membrome network, which regulates the dynamics of the specialized membrane architecture of differentiated cells. The distribution of Rab-regulated hubs illustrates a number of facets that guides the overall organization of subcellular compartments of cells and tissues through the activity of dynamic protein interaction networks. An interactive website for exploring datasets comprising components of the Rab-regulated hubs that define the membrome of different cell and organ systems in both human and mouse is available at http://www.membrome.org/.