The model organism Encyclopedia of DNA Elements (modENCODE) project is a National Human Genome Research Institute (NHGRI) initiative designed to characterize the genomes of Drosophila melanogaster and Caenorhabditis elegans. A Data Coordination Center (DCC) was created to collect, store and catalog modENCODE data. An effective DCC must gather, organize and provide all primary, interpreted and analyzed data, and ensure the community is supplied with the knowledge of the experimental conditions, protocols and verification checks used to generate each primary data set. We present here the design principles of the modENCODE DCC, and describe the ramifications of collecting thorough and deep metadata for describing experiments, including the use of a wiki for capturing protocol and reagent information, and the BIR-TAB specification for linking biological samples to experimental results. modENCODE data can be found at http://www.modencode.org.
Database URL: http://www.modencode.org.
Funded by the National Institutes of Health (NIH), the aim of the Model Organism ENCyclopedia of DNA Elements (modENCODE) project is to provide the biological research community with a comprehensive encyclopedia of functional genomic elements for both model organisms C. elegans (worm) and D. melanogaster (fly). With a total size of just under 10 terabytes of data collected and released to the public, one of the challenges faced by researchers is to extract biologically meaningful knowledge from this large data set. While the basic quality control, pre-processing, and analysis of the data has already been performed by members of the modENCODE consortium, many researchers will wish to reinterpret the data set using modifications and enhancements of the original protocols, or combine modENCODE data with other data sets. Unfortunately this can be a time consuming and logistically challenging proposition.
In recognition of this challenge, the modENCODE DCC has released uniform computing resources for analyzing modENCODE data on Galaxy (https://github.com/modENCODE-DCC/Galaxy), on the public Amazon Cloud (http://aws.amazon.com), and on the private Bionimbus Cloud for genomic research (http://www.bionimbus.org). In particular, we have released Galaxy workflows for interpreting ChIP-seq data which use the same quality control (QC) and peak calling standards adopted by the modENCODE and ENCODE communities. For convenience of use, we have created Amazon and Bionimbus Cloud machine images containing Galaxy along with all the modENCODE data, software and other dependencies.
Using these resources provides a framework for running consistent and reproducible analyses on modENCODE data, ultimately allowing researchers to use more of their time using modENCODE data, and less time moving it around.
The functional repertoire of long intergenic noncoding RNA (lincRNA) molecules has begun to be elucidated in mammals. Determining the biological relevance and potential gene regulatory mechanisms of these enigmatic molecules would be expedited in a more tractable model organism, such as Drosophila melanogaster. To this end, we defined a set of 1,119 putative lincRNA genes in D. melanogaster using modENCODE whole transcriptome (RNA-seq) data. A large majority (1.1 of 1.3 Mb; 85%) of these bases were not previously reported by modENCODE as being transcribed. Significant selective constraint on the sequences of these loci predicts that virtually all have sustained functionality across the Drosophila clade. We observe biases in lincRNA genomic locations and expression profiles that are consistent with some of these lincRNAs being involved in the regulation of neighboring protein-coding genes with developmental functions. We identify lincRNAs that may be important in the developing nervous system and in male-specific organs, such as the testes. LincRNA loci were also identified whose positions, relative to nearby protein-coding loci, are equivalent between D. melanogaster and mouse. This study predicts that the genomes of not only vertebrates, such as mammals, but also an invertebrate (fruit fly) harbor large numbers of lincRNA loci. Our findings now permit exploitation of Drosophila genetics for the investigation of lincRNA mechanisms, including lincRNAs with potential functional analogues in mammals.
long intergenic noncoding RNAs; modENCODE; transcriptional regulation; evolution; development
FlyBase (http://flybase.org) is the leading database and web portal for genetic and genomic information on the fruit fly Drosophila melanogaster and related fly species. Whether you use the fruit fly as an experimental system or want to apply Drosophila biological knowledge to another field of study, FlyBase can help you successfully navigate the wealth of available Drosophila data. Here, we review the FlyBase web site with novice and less-experienced users of FlyBase in mind and point out recent developments stemming from the availability of genome-wide data from the modENCODE project. The first section of this paper explains the organization of the web site and describes the report pages available on FlyBase, focusing on the most popular, the Gene Report. The next section introduces some of the search tools available on FlyBase, in particular, our heavily used and recently redesigned search tool QuickSearch, found on the FlyBase homepage. The final section concerns genomic data, including recent modENCODE (http://www.modencode.org) data, available through our Genome Browser, GBrowse.
Motivation: The highly coordinated expression of thousands of genes in an organism is regulated by the concerted action of transcription factors, chromatin proteins and epigenetic mechanisms. High-throughput experimental data for genome wide in vivo protein–DNA interactions and epigenetic marks are becoming available from large projects, such as the model organism ENCyclopedia Of DNA Elements (modENCODE) and from individual labs. Dissemination and visualization of these datasets in an explorable form is an important challenge.
Results: To support research on Drosophila melanogaster transcription regulation and make the genome wide in vivo protein–DNA interactions data available to the scientific community as a whole, we have developed a system called Flynet. Currently, Flynet contains 101 datasets for 38 transcription factors and chromatin regulator proteins in different experimental conditions. These factors exhibit different types of binding profiles ranging from sharp localized peaks to broad binding regions. The protein–DNA interaction data in Flynet was obtained from the analysis of chromatin immunoprecipitation experiments on one color and two color genomic tiling arrays as well as chromatin immunoprecipitation followed by massively parallel sequencing. A web-based interface, integrated with an AJAX based genome browser, has been built for queries and presenting analysis results. Flynet also makes available the cis-regulatory modules reported in literature, known and de novo identified sequence motifs across the genome, and other resources to study gene regulation.
Availability: Flynet is available at https://www.cistrack.org/flynet/.
Supplementary information: Supplementary data are available at Bioinformatics online.
To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- and tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation.
WormBase (www.wormbase.org) has been serving the scientific community for over 11 years as the central repository for genomic and genetic information for the soil nematode Caenorhabditis elegans. The resource has evolved from its beginnings as a database housing the genomic sequence and genetic and physical maps of a single species, and now represents the breadth and diversity of nematode research, currently serving genome sequence and annotation for around 20 nematodes. In this article, we focus on WormBase’s role of genome sequence annotation, describing how we annotate and integrate data from a growing collection of nematode species and strains. We also review our approaches to sequence curation, and discuss the impact on annotation quality of large functional genomics projects such as modENCODE.
Caenorhabditis elegans; annotation; community resource; genome; model organism database; nematode; parasitic nematode; sequence curation
We have implemented aggregation and correlation toolbox (ACT), an efficient, multifaceted toolbox for analyzing continuous signal and discrete region tracks from high-throughput genomic experiments, such as RNA-seq or ChIP-chip signal profiles from the ENCODE and modENCODE projects, or lists of single nucleotide polymorphisms from the 1000 genomes project. It is able to generate aggregate profiles of a given track around a set of specified anchor points, such as transcription start sites. It is also able to correlate related tracks and analyze them for saturation–i.e. how much of a certain feature is covered with each new succeeding experiment. The ACT site contains downloadable code in a variety of formats, interactive web servers (for use on small quantities of data), example datasets, documentation and a gallery of outputs. Here, we explain the components of the toolbox in more detail and apply them in various contexts.
Availability: ACT is available at http://act.gersteinlab.org
Systematic annotation of gene regulatory elements is a major challenge in genome science. Direct mapping of chromatin modification marks and transcriptional factor binding sites genome-wide 1,2 has successfully identified specific subtypes of regulatory elements 3. In Drosophila several pioneering studies have provided genome-wide identification of Polycomb-Response Elements 4, chromatin states 5, transcription factor binding sites (TFBS) 6–9, PolII regulation 8, and insulator elements 10; however, comprehensive annotation of the regulatory genome remains a significant challenge. Here we describe results from the modENCODE cis-regulatory annotation project. We produced a map of the Drosophila melanogaster regulatory genome based on more than 300 chromatin immuno-precipitation (ChIP) datasets for eight chromatin features, five histone deacetylases (HDACs) and thirty-eight site-specific transcription factors (TFs) at different stages of development. Using these data we inferred more than 20,000 candidate regulatory elements and we validated a subset of predictions for promoters, enhancers, and insulators in vivo. We also identified nearly 2,000 genomic regions of dense TF binding associated with chromatin activity and accessibility. We discovered hundreds of new TF co-binding relationships and defined a TF network with over 800 potential regulatory relationships.
Advances in sequencing technology have boosted population genomics and made it possible to map the positions of transcription factor binding sites (TFBSs) with high precision. Here we investigate TFBS variability by combining transcription factor binding maps generated by ENCODE, modENCODE, our previously published data and other sources with genomic variation data for human individuals and Drosophila isogenic lines.
We introduce a metric of TFBS variability that takes into account changes in motif match associated with mutation and makes it possible to investigate TFBS functional constraints instance-by-instance as well as in sets that share common biological properties. We also take advantage of the emerging per-individual transcription factor binding data to show evidence that TFBS mutations, particularly at evolutionarily conserved sites, can be efficiently buffered to ensure coherent levels of transcription factor binding.
Our analyses provide insights into the relationship between individual and interspecies variation and show evidence for the functional buffering of TFBS mutations in both humans and flies. In a broad perspective, these results demonstrate the potential of combining functional genomics and population genetics approaches for understanding gene regulation.
Salivary gland polytene chromosomes demonstrate banding pattern, genetic meaning of which is an enigma for decades. Till now it is not known how to mark the band/interband borders on physical map of DNA and structures of polytene chromosomes are not characterized in molecular and genetic terms. It is not known either similar banding pattern exists in chromosomes of regular diploid mitotically dividing nonpolytene cells. Using the newly developed approach permitting to identify the interband material and localization data of interband-specific proteins from modENCODE and other genome-wide projects, we identify physical limits of bands and interbands in small cytological region 9F13-10B3 of the X chromosome in D. melanogaster, as well as characterize their general molecular features. Our results suggests that the polytene and interphase cell line chromosomes have practically the same patterns of bands and interbands reflecting, probably, the basic principle of interphase chromosome organization. Two types of bands have been described in chromosomes, early and late-replicating, which differ in many aspects of their protein and genetic content. As appeared, origin recognition complexes are located almost totally in the interbands of chromosomes.
In D. melanogaster polytene chromosomes, intercalary heterochromatin (IH) appears as large dense bands scattered in euchromatin and comprises clusters of repressed genes. IH displays distinctly low gene density, indicative of their particular regulation. Genes embedded in IH replicate late in the S phase and become underreplicated. We asked whether localization and organization of these late-replicating domains is conserved in a distinct cell type. Using published comprehensive genome-wide chromatin annotation datasets (modENCODE and others), we compared IH organization in salivary gland cells and in a Kc cell line. We first established the borders of 60 IH regions on a molecular map, these regions containing underreplicated material and encompassing ∼12% of Drosophila genome. We showed that in Kc cells repressed chromatin constituted 97% of the sequences that corresponded to IH bands. This chromatin is depleted for ORC-2 binding and largely replicates late. Differences in replication timing between the cell types analyzed are local and affect only sub-regions but never whole IH bands. As a rule such differentially replicating sub-regions display open chromatin organization, which apparently results from cell-type specific gene expression of underlying genes. We conclude that repressed chromatin organization of IH is generally conserved in polytene and non-polytene cells. Yet, IH domains do not function as transcription- and replication-regulatory units, because differences in transcription and replication between cell types are not domain-wide, rather they are restricted to small “islands” embedded in these domains. IH regions can thus be defined as a special class of domains with low gene density, which have narrow temporal expression patterns, and so displaying relatively conserved organization.
Identifying a bicluster, or submatrix of a gene expression dataset wherein the genes express similar behavior over the columns, is useful for discovering novel functional gene interactions. In this article, we introduce a new algorithm for finding biClusters with Linear Patterns (CLiP). Instead of solely maximizing Pearson correlation, we introduce a fitness function that also considers the correlation of complementary genes and conditions. This eliminates the need for a priori determination of the bicluster size. We employ both greedy search and the genetic algorithm in optimization, incorporating resampling for more robust discovery. When applied to both real and simulation datasets, our results show that CLiP is superior to existing methods. In analyzing RNA-seq fly and worm time-course data from modENCODE, we uncover a set of similarly expressed genes suggesting maternal dependence. Supplementary Material is available online (at www.liebertonline.com/cmb).
algorithms; gene clusters; probability
Despite many efforts, little is known about distribution and interactions of chromatin proteins which contribute to the specificity of chromomeric organization of interphase chromosomes. To address this issue, we used publicly available datasets from several recent Drosophila genome-wide mapping and annotation projects, in particular, those from modENCODE project, and compared molecular organization of 13 interband regions which were accurately mapped previously.
Here we demonstrate that in interphase chromosomes of Drosophila cell lines, the interband regions are enriched for a specific set of proteins generally characteristic of the "open" chromatin (RNA polymerase II, CHRIZ (CHRO), BEAF-32, BRE1, dMI-2, GAF, NURF301, WDS and TRX). These regions also display reduced nucleosome density, histone H1 depletion and pronounced enrichment for ORC2, a pre-replication complex component. Within the 13 interband regions analyzed, most were around 3-4 kb long, particularly those where many of said protein features were present. We estimate there are about 3500 regions with similar properties in chromosomes of D. melanogaster cell lines, which fits quite well the number of cytologically observed interbands in salivary gland polytene chromosomes.
Our observations suggest strikingly similar organization of interband chromatin in polytene chromosomes and in chromosomes from cell lines thereby reflecting the existence of a universal principle of interphase chromosome organization.
Mitochondria contain their own circular genome, with mitochondria-specific transcription and replication systems and corresponding regulatory proteins. All of these proteins are encoded in the nuclear genome and are post-translationally imported into mitochondria. In addition, several nuclear transcription factors have been reported to act in mitochondria, but there has been no comprehensive mapping of their occupancy patterns and it is not clear how many other factors may also be found in mitochondria. Here we address these questions by using ChIP-seq data from the ENCODE, mouseENCODE and modENCODE consortia for 151 human, 31 mouse and 35 C. elegans factors. We identified 8 human and 3 mouse transcription factors with strong localized enrichment over the mitochondrial genome that was usually associated with the corresponding recognition sequence motif. Notably, these sites of occupancy are often the sites with highest ChIP-seq signal intensity within both the nuclear and mitochondrial genomes and are thus best explained as true binding events to mitochondrial DNA, which exist in high copy number in each cell. We corroborated these findings by immunocytochemical staining evidence for mitochondrial localization. However, we were unable to find clear evidence for mitochondrial binding in ENCODE and other publicly available ChIP-seq data for most factors previously reported to localize there. As the first global analysis of nuclear transcription factors binding in mitochondria, this work opens the door to future studies that probe the functional significance of the phenomenon.
Repeat sequences are abundant in eukaryotic genomes but many are excluded from genome assemblies. In Drosophila melanogaster classical studies of repeat content suggested variability between individuals, but they lacked the precision of modern high throughput sequencing technologies. Genome-wide profiling of chromatin features such as histone tail modifications and DNA-binding proteins relies on alignment to the reference genome and hence excludes highly repetitive sequences.
By analyzing repeat libraries, sequence complexity and k-mer counts we determined the abundances of different D. melanogaster repeat classes in flies in two public datasets, DGRP and modENCODE. We found that larval DNA was depleted of all repeat classes relative to adult and embryonic DNA, as expected from the known depletion of repeat-rich pericentromeric regions during polytenization of larval tissues. By applying a method that is independent of alignment to the genome assembly, we found that satellite repeats associate with distinct H3 tail modifications, such as H3K9me2 and H3K9me3 for short repeats and H3K9me1 for 359 bp repeats. Short AT-rich repeats however are depleted of nucleosomes and hence all histone modifications and associated chromatin proteins.
The total repeat content and association of repeat sequences with chromatin modifications can be determined despite repeats being excluded from genome assemblies, revealing unexpected distinctions in chromatin features based on sequence composition.
DNA satellites; Next-generation sequencing; ChIP-seq; Histone modification
Sequence-specific transcription factors (TFs) are critical for specifying patterns and levels of gene expression, but target DNA elements are not sufficient to specify TF binding in vivo. In eukaryotes, the binding of a TF is in competition with a constellation of other proteins, including histones, which package DNA into nucleosomes. We used the ChIP-seq assay to examine the genome-wide distribution of Drosophila Heat Shock Factor (HSF), a TF whose binding activity is mediated by heat shock-induced trimerization. HSF binds to 464 sites after heat shock, the vast majority of which contain HSF Sequence-binding Elements (HSEs). HSF-bound sequence motifs represent only a small fraction of the total HSEs present in the genome. ModENCODE ChIP-chip datasets, generated during non-heat shock conditions, were used to show that inducibly bound HSE motifs are associated with histone acetylation, H3K4 trimethylation, RNA Polymerase II, and coactivators, compared to HSE motifs that remain HSF-free. Furthermore, directly changing the chromatin landscape, from an inactive to an active state, permits inducible HSF binding. There is a strong correlation of bound HSEs to active chromatin marks present prior to induced HSF binding, indicating that an HSE's residence in “active” chromatin is a primary determinant of whether HSF can bind following heat shock.
Many Transcription Factors (TFs) have been shown to bind DNA in a sequence-specific manner. However, only a sub-set of possible binding sites are occupied in vivo, and it remains unclear how TFs discriminate between sequences of equal predicted binding affinity. We set out to determine how a specific TF, Heat Shock Factor (HSF), distinguishes between utilized and unused potential binding sites. HSF is uniquely qualified to study this problem, because HSF is inactive and lowly bound to DNA in unstressed cells and upon stress HSF becomes active and strongly binds to DNA. We compared the properties of the unstressed chromatin between the sites that become HSF-bound or remain HSF-free following stress activation. We find that sites that are destined to be bound strongly by HSF after stress are associated with distinct chromatin marks compared to sites that are unoccupied by HSF after heat shock. Furthermore, chromatin landscape can be changed from a restrictive to a permissive state, allowing inducible HSF binding. These finding suggest that TF binding sites can be predicted based on the chromatin signatures present prior to induced TF recruitment.
Chromatin immunoprecipitation (ChIP) followed by microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq) allows genome-wide discovery of protein-DNA interactions such as transcription factor bindings and histone modifications. Previous reports only compared a small number of profiles, and little has been done to compare histone modification profiles generated by the two technologies or to assess the impact of input DNA libraries in ChIP-seq analysis. Here, we performed a systematic analysis of a modENCODE dataset consisting of 31 pairs of ChIP-chip/ChIP-seq profiles of the coactivator CBP, RNA polymerase II (RNA PolII), and six histone modifications across four developmental stages of Drosophila melanogaster.
Both technologies produce highly reproducible profiles within each platform, ChIP-seq generally produces profiles with a better signal-to-noise ratio, and allows detection of more peaks and narrower peaks. The set of peaks identified by the two technologies can be significantly different, but the extent to which they differ varies depending on the factor and the analysis algorithm. Importantly, we found that there is a significant variation among multiple sequencing profiles of input DNA libraries and that this variation most likely arises from both differences in experimental condition and sequencing depth. We further show that using an inappropriate input DNA profile can impact the average signal profiles around genomic features and peak calling results, highlighting the importance of having high quality input DNA data for normalization in ChIP-seq analysis.
Our findings highlight the biases present in each of the platforms, show the variability that can arise from both technology and analysis methods, and emphasize the importance of obtaining high quality and deeply sequenced input DNA libraries for ChIP-seq analysis.
The Drosophila MSL complex mediates dosage compensation by increasing transcription of the single X chromosome in males approximately two-fold. This is accomplished through recognition of the X chromosome and subsequent acetylation of histone H4K16 on X-linked genes. Initial binding to the X is thought to occur at “entry sites” that contain a consensus sequence motif (“MSL recognition element” or MRE). However, this motif is only ∼2 fold enriched on X, and only a fraction of the motifs on X are initially targeted. Here we ask whether chromatin context could distinguish between utilized and non-utilized copies of the motif, by comparing their relative enrichment for histone modifications and chromosomal proteins mapped in the modENCODE project. Through a comparative analysis of the chromatin features in male S2 cells (which contain MSL complex) and female Kc cells (which lack the complex), we find that the presence of active chromatin modifications, together with an elevated local GC content in the surrounding sequences, has strong predictive value for functional MSL entry sites, independent of MSL binding. We tested these sites for function in Kc cells by RNAi knockdown of Sxl, resulting in induction of MSL complex. We show that ectopic MSL expression in Kc cells leads to H4K16 acetylation around these sites and a relative increase in X chromosome transcription. Collectively, our results support a model in which a pre-existing active chromatin environment, coincident with H3K36me3, contributes to MSL entry site selection. The consequences of MSL targeting of the male X chromosome include increase in nucleosome lability, enrichment for H4K16 acetylation and JIL-1 kinase, and depletion of linker histone H1 on active X-linked genes. Our analysis can serve as a model for identifying chromatin and local sequence features that may contribute to selection of functional protein binding sites in the genome.
The genomes of complex organisms encompass hundreds of millions of base pairs of DNA, and regulatory molecules must distinguish specific targets within this vast landscape. In general, regulatory factors find target genes through sequence-specific interactions with the underlying DNA. However, sequence-specific factors typically bind only a fraction of the candidate genomic regions containing their specific target sequence motif. Here we identify potential roles for chromatin environment and flanking sequence composition in helping regulatory factors find their appropriate binding sites, using targeting of the Drosophila dosage compensation complex as a model. The initial stage of dosage compensation involves binding of the Male Specific Lethal (MSL) complex to a sequence motif called the MSL recognition element . Using data from a large chromatin mapping effort (the modENCODE project), we successfully identify an active chromatin environment as predictive of selective MRE binding by the MSL complex. Our study provides a framework for using genome-wide datasets to analyze and predict functional protein–DNA binding site selection.
An accurate, comprehensive, non-redundant and up-to-date bibliography is a crucial component of any Model Organism Database (MOD). Principally, the bibliography provides a set of references that are specific to the field served by the MOD. Moreover, it serves as a backbone to which all curated biological data can be attributed. Here, we describe the organization and main features of the bibliography in FlyBase (flybase.org), the MOD for Drosophila melanogaster. We present an overview of the current content of the bibliography, the pipeline for identifying and adding new references, the presentation of data within Reference Reports and effective methods for searching and retrieving bibliographic data. We highlight recent improvements in these areas and describe the advantages of using the FlyBase bibliography over alternative literature resources. Although this article is focused on bibliographic data, many of the features and tools described are applicable to browsing and querying other datasets in FlyBase.
Model organisms are widely used for understanding basic biology, and have significantly contributed to the study of human disease. In recent years, genomic analysis has provided extensive evidence of widespread conservation of gene sequence and function amongst eukaryotes, allowing insights from model organisms to help decipher gene function in a wider range of species. The InterMOD consortium is developing an infrastructure based around the InterMine data warehouse system to integrate genomic and functional data from a number of key model organisms, leading the way to improved cross-species research. So far including budding yeast, nematode worm, fruit fly, zebrafish, rat and mouse, the project has set up data warehouses, synchronized data models, and created analysis tools and links between data from different species. The project unites a number of major model organism databases, improving both the consistency and accessibility of comparative research, to the benefit of the wider scientific community.
Motivation: As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models.
Results: We show that PhyloCSF's classification performance in 12-species Drosophila genome alignments exceeds all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, often initially recognized by their lack of protein coding potential rather than conserved RNA secondary structures.
Availability and Implementation: The Objective Caml source code and executables for GNU/Linux and Mac OS X are freely available at http://compbio.mit.edu/PhyloCSF
Contact: firstname.lastname@example.org; email@example.com
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) provides high-quality curated genomic, genetic, and molecular information on the genes and their products of the budding yeast Saccharomyces cerevisiae. To accommodate the increasingly complex, diverse needs of researchers for searching and comparing data, SGD has implemented InterMine (http://www.InterMine.org), an open source data warehouse system with a sophisticated querying interface, to create YeastMine (http://yeastmine.yeastgenome.org). YeastMine is a multifaceted search and retrieval environment that provides access to diverse data types. Searches can be initiated with a list of genes, a list of Gene Ontology terms, or lists of many other data types. The results from queries can be combined for further analysis and saved or downloaded in customizable file formats. Queries themselves can be customized by modifying predefined templates or by creating a new template to access a combination of specific data types. YeastMine offers multiple scenarios in which it can be used such as a powerful search interface, a discovery tool, a curation aid and also a complex database presentation format.
A major challenge for functional and comparative genomics resource development is the extraction of data from the biomedical literature. Although text mining for biological data is an active research field, few applications have been integrated into production literature curation systems such as those of the model organism databases (MODs). Not only are most available biological natural language (bioNLP) and information retrieval and extraction solutions difficult to adapt to existing MOD curation workflows, but many also have high error rates or are unable to process documents available in those formats preferred by scientific journals.
In September 2008, Mouse Genome Informatics (MGI) at The Jackson Laboratory initiated a search for dictionary-based text mining tools that we could integrate into our biocuration workflow. MGI has rigorous document triage and annotation procedures designed to identify appropriate articles about mouse genetics and genome biology. We currently screen ∼1000 journal articles a month for Gene Ontology terms, gene mapping, gene expression, phenotype data and other key biological information. Although we do not foresee that curation tasks will ever be fully automated, we are eager to implement named entity recognition (NER) tools for gene tagging that can help streamline our curation workflow and simplify gene indexing tasks within the MGI system. Gene indexing is an MGI-specific curation function that involves identifying which mouse genes are being studied in an article, then associating the appropriate gene symbols with the article reference number in the MGI database.
Here, we discuss our search process, performance metrics and success criteria, and how we identified a short list of potential text mining tools for further evaluation. We provide an overview of our pilot projects with NCBO's Open Biomedical Annotator and Fraunhofer SCAI's ProMiner. In doing so, we prove the potential for the further incorporation of semi-automated processes into the curation of the biomedical literature.
Many host-adapted bacterial pathogens contain DNA methyltransferases (mod genes) that are subject to phase-variable expression (high-frequency reversible ON/OFF switching of gene expression). In Haemophilus influenzae, the random switching of the modA gene controls expression of a phase-variable regulon of genes (a “phasevarion”), via differential methylation of the genome in the modA ON and OFF states. Phase-variable mod genes are also present in Neisseria meningitidis and Neisseria gonorrhoeae, suggesting that phasevarions may occur in these important human pathogens. Phylogenetic studies on phase-variable mod genes associated with type III restriction modification (R-M) systems revealed that these organisms have two distinct mod genes—modA and modB. There are also distinct alleles of modA (abundant: modA11, 12, 13; minor: modA4, 15, 18) and modB (modB1, 2). These alleles differ only in their DNA recognition domain. ModA11 was only found in N. meningitidis and modA13 only in N. gonorrhoeae. The recognition site for the modA13 methyltransferase in N. gonorrhoeae strain FA1090 was identified as 5′-AGAAA-3′. Mutant strains lacking the modA11, 12 or 13 genes were made in N. meningitidis and N. gonorrhoeae and their phenotype analyzed in comparison to a corresponding mod ON wild-type strain. Microarray analysis revealed that in all three modA alleles multiple genes were either upregulated or downregulated, some of which were virulence-associated. For example, in N. meningitidis MC58 (modA11), differentially expressed genes included those encoding the candidate vaccine antigens lactoferrin binding proteins A and B. Functional studies using N. gonorrhoeae FA1090 and the clinical isolate O1G1370 confirmed that modA13 ON and OFF strains have distinct phenotypes in antimicrobial resistance, in a primary human cervical epithelial cell model of infection, and in biofilm formation. This study, in conjunction with our previous work in H. influenzae, indicates that phasevarions may be a common strategy used by host-adapted bacterial pathogens to randomly switch between “differentiated” cell types.
The pathogenic Neisseria are bacterial pathogens that cause meningitis and gonorrhoea. They have adapted to life exclusively in humans and have developed unique strategies to colonize the host and to evade the immune response. Central among these strategies are genetic switches that randomly turn genes on and off. In most cases, the genes controlled by these switches, contingency genes, are required for making bacterial surface structures. Recently we described a new class of contingency gene that methylates DNA. Rather than affecting the synthesis of a single surface structure, on/off switching of this DNA-methyltransferase gene leads to random switching of multiple genes. In this study, we have shown that this mechanism exists in all pathogenic Neisseria, and alters expression of multiple genes in all cases we examined. The two distinct populations of bacteria generated by this process had different behavior in model systems of colonization and infection. Understanding this process is key to understanding these human pathogens, and to developing strategies for treatment and prevention of the diseases they cause.