In an effort to comprehensively characterize the functional elements within the genomes of the important model organisms Drosophila melanogaster and Caenorhabditis elegans, the NHGRI model organism Encyclopaedia of DNA Elements (modENCODE) consortium has generated an enormous library of genomic data along with detailed, structured information on all aspects of the experiments. The modMine database (http://intermine.modencode.org) described here has been built by the modENCODE Data Coordination Center to allow the broader research community to (i) search for and download data sets of interest among the thousands generated by modENCODE; (ii) access the data in an integrated form together with non-modENCODE data sets; and (iii) facilitate fine-grained analysis of the above data. The sophisticated search features are possible because of the collection of extensive experimental metadata by the consortium. Interfaces are provided to allow both biologists and bioinformaticians to exploit these rich modENCODE data sets now available via modMine.
The Model Organism ENCyclopedia of DNA Elements (modENCODE) project, funded by the National Institutes of Health (NIH), aims to provide the biological research community with a comprehensive encyclopedia of functional genomic elements for the model organisms C. elegans (worm) and D. melanogaster (fly). With just under 10 terabytes of data collected and released to the public, one of the challenges faced by researchers is to extract biologically meaningful knowledge from this large data set. While the basic quality control, pre-processing, and analysis of the data have already been performed by members of the modENCODE consortium, many researchers will wish to reinterpret the data set using modifications and enhancements of the original protocols, or to combine modENCODE data with other data sets. Unfortunately, this can be a time-consuming and logistically challenging proposition.
In recognition of this challenge, the modENCODE DCC has released uniform computing resources for analyzing modENCODE data on Galaxy (https://github.com/modENCODE-DCC/Galaxy), on the public Amazon Cloud (http://aws.amazon.com), and on the private Bionimbus Cloud for genomic research (http://www.bionimbus.org). In particular, we have released Galaxy workflows for interpreting ChIP-seq data which use the same quality control (QC) and peak calling standards adopted by the modENCODE and ENCODE communities. For convenience of use, we have created Amazon and Bionimbus Cloud machine images containing Galaxy along with all the modENCODE data, software and other dependencies.
These resources provide a framework for running consistent and reproducible analyses on modENCODE data, ultimately allowing researchers to spend more of their time using modENCODE data and less time moving it around.
Motivation: The highly coordinated expression of thousands of genes in an organism is regulated by the concerted action of transcription factors, chromatin proteins and epigenetic mechanisms. High-throughput experimental data for genome wide in vivo protein–DNA interactions and epigenetic marks are becoming available from large projects, such as the model organism ENCyclopedia Of DNA Elements (modENCODE) and from individual labs. Dissemination and visualization of these datasets in an explorable form is an important challenge.
Results: To support research on Drosophila melanogaster transcription regulation and make the genome wide in vivo protein–DNA interactions data available to the scientific community as a whole, we have developed a system called Flynet. Currently, Flynet contains 101 datasets for 38 transcription factors and chromatin regulator proteins in different experimental conditions. These factors exhibit different types of binding profiles ranging from sharp localized peaks to broad binding regions. The protein–DNA interaction data in Flynet was obtained from the analysis of chromatin immunoprecipitation experiments on one color and two color genomic tiling arrays as well as chromatin immunoprecipitation followed by massively parallel sequencing. A web-based interface, integrated with an AJAX based genome browser, has been built for queries and presenting analysis results. Flynet also makes available the cis-regulatory modules reported in literature, known and de novo identified sequence motifs across the genome, and other resources to study gene regulation.
Availability: Flynet is available at https://www.cistrack.org/flynet/.
Supplementary information: Supplementary data are available at Bioinformatics online.
To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- and tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation.
The functional repertoire of long intergenic noncoding RNA (lincRNA) molecules has begun to be elucidated in mammals. Determining the biological relevance and potential gene regulatory mechanisms of these enigmatic molecules would be expedited in a more tractable model organism, such as Drosophila melanogaster. To this end, we defined a set of 1,119 putative lincRNA genes in D. melanogaster using modENCODE whole transcriptome (RNA-seq) data. A large majority (1.1 of 1.3 Mb; 85%) of these bases were not previously reported by modENCODE as being transcribed. Significant selective constraint on the sequences of these loci predicts that virtually all have sustained functionality across the Drosophila clade. We observe biases in lincRNA genomic locations and expression profiles that are consistent with some of these lincRNAs being involved in the regulation of neighboring protein-coding genes with developmental functions. We identify lincRNAs that may be important in the developing nervous system and in male-specific organs, such as the testes. LincRNA loci were also identified whose positions, relative to nearby protein-coding loci, are equivalent between D. melanogaster and mouse. This study predicts that the genomes of not only vertebrates, such as mammals, but also an invertebrate (fruit fly) harbor large numbers of lincRNA loci. Our findings now permit exploitation of Drosophila genetics for the investigation of lincRNA mechanisms, including lincRNAs with potential functional analogues in mammals.
long intergenic noncoding RNAs; modENCODE; transcriptional regulation; evolution; development
WormBase (www.wormbase.org) has been serving the scientific community for over 11 years as the central repository for genomic and genetic information for the soil nematode Caenorhabditis elegans. The resource has evolved from its beginnings as a database housing the genomic sequence and genetic and physical maps of a single species, and now represents the breadth and diversity of nematode research, currently serving genome sequence and annotation for around 20 nematodes. In this article, we focus on WormBase’s role of genome sequence annotation, describing how we annotate and integrate data from a growing collection of nematode species and strains. We also review our approaches to sequence curation, and discuss the impact on annotation quality of large functional genomics projects such as modENCODE.
Caenorhabditis elegans; annotation; community resource; genome; model organism database; nematode; parasitic nematode; sequence curation
FlyBase (http://flybase.org) is the leading database and web portal for genetic and genomic information on the fruit fly Drosophila melanogaster and related fly species. Whether you use the fruit fly as an experimental system or want to apply Drosophila biological knowledge to another field of study, FlyBase can help you successfully navigate the wealth of available Drosophila data. Here, we review the FlyBase web site with novice and less-experienced users of FlyBase in mind and point out recent developments stemming from the availability of genome-wide data from the modENCODE project. The first section of this paper explains the organization of the web site and describes the report pages available on FlyBase, focusing on the most popular, the Gene Report. The next section introduces some of the search tools available on FlyBase, in particular, our heavily used and recently redesigned search tool QuickSearch, found on the FlyBase homepage. The final section concerns genomic data, including recent modENCODE (http://www.modencode.org) data, available through our Genome Browser, GBrowse.
Chromatin immunoprecipitation (ChIP) followed by microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq) allows genome-wide discovery of protein-DNA interactions such as transcription factor bindings and histone modifications. Previous reports only compared a small number of profiles, and little has been done to compare histone modification profiles generated by the two technologies or to assess the impact of input DNA libraries in ChIP-seq analysis. Here, we performed a systematic analysis of a modENCODE dataset consisting of 31 pairs of ChIP-chip/ChIP-seq profiles of the coactivator CBP, RNA polymerase II (RNA PolII), and six histone modifications across four developmental stages of Drosophila melanogaster.
Both technologies produce highly reproducible profiles within each platform, but ChIP-seq generally produces profiles with a better signal-to-noise ratio and allows detection of more peaks and narrower peaks. The set of peaks identified by the two technologies can be significantly different, but the extent to which they differ varies depending on the factor and the analysis algorithm. Importantly, we found that there is significant variation among multiple sequencing profiles of input DNA libraries and that this variation most likely arises from differences in both experimental condition and sequencing depth. We further show that using an inappropriate input DNA profile can affect the average signal profiles around genomic features and peak calling results, highlighting the importance of having high quality input DNA data for normalization in ChIP-seq analysis.
Our findings highlight the biases present in each of the platforms, show the variability that can arise from both technology and analysis methods, and emphasize the importance of obtaining high quality and deeply sequenced input DNA libraries for ChIP-seq analysis.
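The role of the input DNA profile in normalization can be illustrated with a small sketch. This is not the procedure used in the study: the function name `log2_enrichment`, the depth-ratio scaling, and the pseudocount are assumptions chosen to show one common, simple way of comparing binned ChIP read counts with binned input read counts.

```python
import math
from typing import List, Sequence


def log2_enrichment(chip: Sequence[int], inp: Sequence[int],
                    pseudocount: float = 1.0) -> List[float]:
    """Per-bin log2(ChIP / depth-scaled input) for two binned count tracks.

    Hypothetical helper: the input track is rescaled so that its total
    read count matches the ChIP track before the ratio is taken.
    """
    if len(chip) != len(inp):
        raise ValueError("tracks must have the same number of bins")
    chip_total, input_total = sum(chip), sum(inp)
    if chip_total == 0 or input_total == 0:
        raise ValueError("both tracks must contain reads")
    scale = chip_total / input_total  # corrects for sequencing depth only
    return [
        math.log2((c + pseudocount) / (i * scale + pseudocount))
        for c, i in zip(chip, inp)
    ]


# Made-up counts: one enriched bin, and the same input sequenced twice as deep.
chip_counts = [10, 50, 10, 10]
shallow_input = [20, 20, 20, 20]
deep_input = [40, 40, 40, 40]

print(log2_enrichment(chip_counts, shallow_input))
print(log2_enrichment(chip_counts, deep_input))
```

In this toy case the two calls give the same enrichment values, because depth scaling removes a pure depth difference. It cannot remove differences in the shape of the input profile, which is the kind of variation the abstract warns about.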
Advances in sequencing technology have boosted population genomics and made it possible to map the positions of transcription factor binding sites (TFBSs) with high precision. Here we investigate TFBS variability by combining transcription factor binding maps generated by ENCODE, modENCODE, our previously published data and other sources with genomic variation data for human individuals and Drosophila isogenic lines.
We introduce a metric of TFBS variability that takes into account changes in motif match associated with mutation and makes it possible to investigate TFBS functional constraints instance-by-instance as well as in sets that share common biological properties. We also take advantage of the emerging per-individual transcription factor binding data to show evidence that TFBS mutations, particularly at evolutionarily conserved sites, can be efficiently buffered to ensure coherent levels of transcription factor binding.
Our analyses provide insights into the relationship between individual and interspecies variation and show evidence for the functional buffering of TFBS mutations in both humans and flies. In a broad perspective, these results demonstrate the potential of combining functional genomics and population genetics approaches for understanding gene regulation.
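The idea of scoring a mutation by its effect on motif match can be sketched as follows. The abstract does not give the formula of the variability metric, so this example shows only a generic building block: the change in a position weight matrix (PWM) log-odds score between a reference and a variant site. The matrix values, the uniform background, and the function names are made up for illustration.

```python
import math
from typing import Dict, List

# Hypothetical 4-position motif: per-position base probabilities (made up).
PWM: List[Dict[str, float]] = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # uniform background frequency, an assumption


def log_odds(site: str) -> float:
    """Sum of per-position log2(p / background) for a site of motif length."""
    if len(site) != len(PWM):
        raise ValueError("site length must equal motif length")
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, site))


def score_change(ref_site: str, alt_site: str) -> float:
    """Motif-score difference (variant minus reference).

    A negative value means the variant allele matches the motif less well.
    """
    return log_odds(alt_site) - log_odds(ref_site)


# G>T substitution at the second position of the consensus site.
delta = score_change("AGCT", "ATCT")
print(round(delta, 3))
```

A per-site variability measure could then weight observed variants by such score changes, but how the published metric combines them is not specified in the abstract.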
We have implemented aggregation and correlation toolbox (ACT), an efficient, multifaceted toolbox for analyzing continuous signal and discrete region tracks from high-throughput genomic experiments, such as RNA-seq or ChIP-chip signal profiles from the ENCODE and modENCODE projects, or lists of single nucleotide polymorphisms from the 1000 genomes project. It is able to generate aggregate profiles of a given track around a set of specified anchor points, such as transcription start sites. It is also able to correlate related tracks and analyze them for saturation–i.e. how much of a certain feature is covered with each new succeeding experiment. The ACT site contains downloadable code in a variety of formats, interactive web servers (for use on small quantities of data), example datasets, documentation and a gallery of outputs. Here, we explain the components of the toolbox in more detail and apply them in various contexts.
Availability: ACT is available at http://act.gersteinlab.org
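The aggregation described above, a profile of one track averaged around a set of anchor points, can be sketched in a few lines. This is an illustration of the idea rather than ACT's own code: the function name `aggregate_profile`, the list-based track, and the choice to skip anchors near the track ends are assumptions, and strand orientation is ignored.

```python
from typing import List, Sequence


def aggregate_profile(signal: Sequence[float], anchors: Sequence[int],
                      flank: int) -> List[float]:
    """Average a per-base signal in a window of +/- flank around each anchor.

    Hypothetical helper: anchors whose window would run off either end of
    the track are skipped rather than padded.
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for anchor in anchors:
        start, end = anchor - flank, anchor + flank + 1
        if start < 0 or end > len(signal):
            continue  # window not fully inside the track
        for offset in range(width):
            totals[offset] += signal[start + offset]
        used += 1
    if used == 0:
        raise ValueError("no anchor has a complete window")
    return [total / used for total in totals]


# Toy track with peaks at positions 3 and 8 (made-up numbers).
track = [0, 0, 1, 5, 1, 0, 0, 1, 5, 1, 0, 0]
profile = aggregate_profile(track, anchors=[3, 8], flank=2)
print(profile)
```

With real data the anchors would be, for example, transcription start sites, and the window would typically span hundreds to thousands of bases.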
In D. melanogaster polytene chromosomes, intercalary heterochromatin (IH) appears as large dense bands scattered in euchromatin and comprises clusters of repressed genes. IH displays distinctly low gene density, indicative of its particular regulation. Genes embedded in IH replicate late in the S phase and become underreplicated. We asked whether the localization and organization of these late-replicating domains are conserved in a distinct cell type. Using published comprehensive genome-wide chromatin annotation datasets (modENCODE and others), we compared IH organization in salivary gland cells and in a Kc cell line. We first established the borders of 60 IH regions on a molecular map; these regions contain underreplicated material and encompass ∼12% of the Drosophila genome. We showed that in Kc cells repressed chromatin constituted 97% of the sequences that corresponded to IH bands. This chromatin is depleted for ORC-2 binding and largely replicates late. Differences in replication timing between the cell types analyzed are local and affect only sub-regions but never whole IH bands. As a rule, such differentially replicating sub-regions display open chromatin organization, which apparently results from cell-type-specific expression of the underlying genes. We conclude that the repressed chromatin organization of IH is generally conserved in polytene and non-polytene cells. Yet IH domains do not function as transcription- and replication-regulatory units, because differences in transcription and replication between cell types are not domain-wide; rather, they are restricted to small “islands” embedded in these domains. IH regions can thus be defined as a special class of domains with low gene density, whose genes have narrow temporal expression patterns, and which therefore display relatively conserved organization.
Salivary gland polytene chromosomes display a banding pattern whose genetic meaning has been an enigma for decades. It is still not known how to mark the band/interband borders on the physical map of DNA, and the structures of polytene chromosomes have not been characterized in molecular and genetic terms. Nor is it known whether a similar banding pattern exists in the chromosomes of regular diploid, mitotically dividing non-polytene cells. Using a newly developed approach that permits identification of interband material, together with localization data for interband-specific proteins from modENCODE and other genome-wide projects, we identify the physical limits of bands and interbands in the small cytological region 9F13-10B3 of the X chromosome in D. melanogaster and characterize their general molecular features. Our results suggest that polytene chromosomes and interphase chromosomes of cell lines have practically the same patterns of bands and interbands, probably reflecting a basic principle of interphase chromosome organization. Two types of bands have been described in chromosomes, early- and late-replicating, which differ in many aspects of their protein and genetic content. Origin recognition complexes turn out to be located almost exclusively in the interbands of chromosomes.
Despite many efforts, little is known about the distribution and interactions of the chromatin proteins that contribute to the specificity of chromomeric organization of interphase chromosomes. To address this issue, we used publicly available datasets from several recent Drosophila genome-wide mapping and annotation projects, in particular those from the modENCODE project, and compared the molecular organization of 13 interband regions that had previously been mapped accurately.
Here we demonstrate that in interphase chromosomes of Drosophila cell lines, the interband regions are enriched for a specific set of proteins generally characteristic of the "open" chromatin (RNA polymerase II, CHRIZ (CHRO), BEAF-32, BRE1, dMI-2, GAF, NURF301, WDS and TRX). These regions also display reduced nucleosome density, histone H1 depletion and pronounced enrichment for ORC2, a pre-replication complex component. Within the 13 interband regions analyzed, most were around 3-4 kb long, particularly those where many of said protein features were present. We estimate there are about 3500 regions with similar properties in chromosomes of D. melanogaster cell lines, which fits quite well the number of cytologically observed interbands in salivary gland polytene chromosomes.
Our observations suggest strikingly similar organization of interband chromatin in polytene chromosomes and in chromosomes from cell lines thereby reflecting the existence of a universal principle of interphase chromosome organization.
Motivation: As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models.
Results: We show that PhyloCSF's classification performance in 12-species Drosophila genome alignments exceeds all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, often initially recognized by their lack of protein coding potential rather than conserved RNA secondary structures.
Availability and Implementation: The Objective Caml source code and executables for GNU/Linux and Mac OS X are freely available at http://compbio.mit.edu/PhyloCSF
Systematic annotation of gene regulatory elements is a major challenge in genome science. Direct mapping of chromatin modification marks and transcriptional factor binding sites genome-wide [1,2] has successfully identified specific subtypes of regulatory elements [3]. In Drosophila several pioneering studies have provided genome-wide identification of Polycomb-Response Elements [4], chromatin states [5], transcription factor binding sites (TFBS) [6–9], PolII regulation [8], and insulator elements [10]; however, comprehensive annotation of the regulatory genome remains a significant challenge. Here we describe results from the modENCODE cis-regulatory annotation project. We produced a map of the Drosophila melanogaster regulatory genome based on more than 300 chromatin immuno-precipitation (ChIP) datasets for eight chromatin features, five histone deacetylases (HDACs) and thirty-eight site-specific transcription factors (TFs) at different stages of development. Using these data we inferred more than 20,000 candidate regulatory elements and we validated a subset of predictions for promoters, enhancers, and insulators in vivo. We also identified nearly 2,000 genomic regions of dense TF binding associated with chromatin activity and accessibility. We discovered hundreds of new TF co-binding relationships and defined a TF network with over 800 potential regulatory relationships.
Identifying a bicluster, or submatrix of a gene expression dataset wherein the genes express similar behavior over the columns, is useful for discovering novel functional gene interactions. In this article, we introduce a new algorithm for finding biClusters with Linear Patterns (CLiP). Instead of solely maximizing Pearson correlation, we introduce a fitness function that also considers the correlation of complementary genes and conditions. This eliminates the need for a priori determination of the bicluster size. We employ both greedy search and the genetic algorithm in optimization, incorporating resampling for more robust discovery. When applied to both real and simulation datasets, our results show that CLiP is superior to existing methods. In analyzing RNA-seq fly and worm time-course data from modENCODE, we uncover a set of similarly expressed genes suggesting maternal dependence. Supplementary Material is available online (at www.liebertonline.com/cmb).
algorithms; gene clusters; probability
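The Pearson-correlation component that the CLiP abstract refers to can be sketched as follows. The full fitness function, which also accounts for complementary genes and conditions, is not given in the abstract, so this example covers only a baseline score: the mean pairwise Pearson correlation between the selected genes over the selected conditions. The names `pearson` and `bicluster_score` and the toy matrix are assumptions.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (sd_x * sd_y)


def bicluster_score(matrix: Sequence[Sequence[float]],
                    rows: Sequence[int], cols: Sequence[int]) -> float:
    """Mean pairwise Pearson correlation of the chosen rows over the chosen columns."""
    sub: List[List[float]] = [[matrix[r][c] for c in cols] for r in rows]
    pairs = list(combinations(sub, 2))
    if not pairs:
        raise ValueError("need at least two rows")
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Made-up expression matrix: genes 0 and 1 are linearly related on conditions 0-2.
expr = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 4.0, 6.0, 0.0],
    [3.0, 1.0, 2.0, 5.0],
]
print(bicluster_score(expr, rows=[0, 1], cols=[0, 1, 2]))
print(bicluster_score(expr, rows=[0, 1], cols=[0, 1, 2, 3]))
```

Restricting the columns to the conditions where the linear pattern holds gives a score close to 1, while adding the fourth condition lowers it, which is the kind of signal a search over row and column subsets would try to maximize.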
Sequence-specific transcription factors (TFs) are critical for specifying patterns and levels of gene expression, but target DNA elements are not sufficient to specify TF binding in vivo. In eukaryotes, the binding of a TF is in competition with a constellation of other proteins, including histones, which package DNA into nucleosomes. We used the ChIP-seq assay to examine the genome-wide distribution of Drosophila Heat Shock Factor (HSF), a TF whose binding activity is mediated by heat shock-induced trimerization. HSF binds to 464 sites after heat shock, the vast majority of which contain HSF Sequence-binding Elements (HSEs). HSF-bound sequence motifs represent only a small fraction of the total HSEs present in the genome. ModENCODE ChIP-chip datasets, generated during non-heat shock conditions, were used to show that inducibly bound HSE motifs are associated with histone acetylation, H3K4 trimethylation, RNA Polymerase II, and coactivators, compared to HSE motifs that remain HSF-free. Furthermore, directly changing the chromatin landscape, from an inactive to an active state, permits inducible HSF binding. There is a strong correlation of bound HSEs to active chromatin marks present prior to induced HSF binding, indicating that an HSE's residence in “active” chromatin is a primary determinant of whether HSF can bind following heat shock.
Many transcription factors (TFs) have been shown to bind DNA in a sequence-specific manner. However, only a subset of possible binding sites is occupied in vivo, and it remains unclear how TFs discriminate between sequences of equal predicted binding affinity. We set out to determine how a specific TF, Heat Shock Factor (HSF), distinguishes between utilized and unused potential binding sites. HSF is uniquely suited to the study of this problem because it is inactive and only weakly bound to DNA in unstressed cells, whereas upon stress it becomes active and binds DNA strongly. We compared the properties of the unstressed chromatin between the sites that become HSF-bound and those that remain HSF-free following stress activation. We find that sites that are destined to be bound strongly by HSF after stress are associated with distinct chromatin marks compared to sites that are unoccupied by HSF after heat shock. Furthermore, the chromatin landscape can be changed from a restrictive to a permissive state, allowing inducible HSF binding. These findings suggest that TF binding sites can be predicted based on the chromatin signatures present prior to induced TF recruitment.
The Drosophila MSL complex mediates dosage compensation by increasing transcription of the single X chromosome in males approximately two-fold. This is accomplished through recognition of the X chromosome and subsequent acetylation of histone H4K16 on X-linked genes. Initial binding to the X is thought to occur at “entry sites” that contain a consensus sequence motif (“MSL recognition element” or MRE). However, this motif is only ∼2 fold enriched on X, and only a fraction of the motifs on X are initially targeted. Here we ask whether chromatin context could distinguish between utilized and non-utilized copies of the motif, by comparing their relative enrichment for histone modifications and chromosomal proteins mapped in the modENCODE project. Through a comparative analysis of the chromatin features in male S2 cells (which contain MSL complex) and female Kc cells (which lack the complex), we find that the presence of active chromatin modifications, together with an elevated local GC content in the surrounding sequences, has strong predictive value for functional MSL entry sites, independent of MSL binding. We tested these sites for function in Kc cells by RNAi knockdown of Sxl, resulting in induction of MSL complex. We show that ectopic MSL expression in Kc cells leads to H4K16 acetylation around these sites and a relative increase in X chromosome transcription. Collectively, our results support a model in which a pre-existing active chromatin environment, coincident with H3K36me3, contributes to MSL entry site selection. The consequences of MSL targeting of the male X chromosome include increase in nucleosome lability, enrichment for H4K16 acetylation and JIL-1 kinase, and depletion of linker histone H1 on active X-linked genes. Our analysis can serve as a model for identifying chromatin and local sequence features that may contribute to selection of functional protein binding sites in the genome.
The genomes of complex organisms encompass hundreds of millions of base pairs of DNA, and regulatory molecules must distinguish specific targets within this vast landscape. In general, regulatory factors find target genes through sequence-specific interactions with the underlying DNA. However, sequence-specific factors typically bind only a fraction of the candidate genomic regions containing their specific target sequence motif. Here we identify potential roles for chromatin environment and flanking sequence composition in helping regulatory factors find their appropriate binding sites, using targeting of the Drosophila dosage compensation complex as a model. The initial stage of dosage compensation involves binding of the Male Specific Lethal (MSL) complex to a sequence motif called the MSL recognition element (MRE). Using data from a large chromatin mapping effort (the modENCODE project), we successfully identify an active chromatin environment as predictive of selective MRE binding by the MSL complex. Our study provides a framework for using genome-wide datasets to analyze and predict functional protein–DNA binding site selection.
The ENCODE project is an international consortium with a goal of cataloguing all the functional elements in the human genome. The ENCODE Data Coordination Center (DCC) at the University of California, Santa Cruz serves as the central repository for ENCODE data. In this role, the DCC offers a collection of high-throughput, genome-wide data generated with technologies such as ChIP-Seq, RNA-Seq, DNA digestion and others. This data helps illuminate transcription factor-binding sites, histone marks, chromatin accessibility, DNA methylation, RNA expression, RNA binding and other cell-state indicators. It includes sequences with quality scores, alignments, signals calculated from the alignments, and in most cases, element or peak calls calculated from the signal data. Each data set is available for visualization and download via the UCSC Genome Browser (http://genome.ucsc.edu/). ENCODE data can also be retrieved using a metadata system that captures the experimental parameters of each assay. The ENCODE web portal at UCSC (http://encodeproject.org/) provides information about the ENCODE data and links for access.
Curation of biological data is a multi-faceted task whose goal is to create a structured, comprehensive, integrated, and accurate resource of current biological knowledge. These structured data facilitate the work of the scientific community by providing knowledge about genes or genomes and by generating validated connections between the data that yield new information and stimulate new research approaches. For the model organism databases (MODs), an important source of data is research publications. Every published paper containing experimental information about a particular model organism is a candidate for curation. All such papers are examined carefully by curators for relevant information. Here, four curators from different MODs describe the literature curation process and highlight approaches taken by the four MODs to address: (1) the decision process by which papers are selected, and (2) the identification and prioritization of the data contained in the paper. We will highlight some of the challenges that MOD biocurators face, and point to ways in which researchers and publishers can support the work of biocurators and the value of such support.
Annotation; Biocuration; Database; Genome; Literature; Model organism
Hyperferritinemia is associated with increased mortality in pediatric sepsis, multiple organ dysfunction syndrome (MODS), and critical illness. The International Histiocyte Society has recommended that children with hyperferritinemia and secondary hemophagocytic lymphohistiocytosis (HLH) or macrophage activation syndrome (MAS) should be treated with the same immunosuppressant/cytotoxic therapies used to treat primary HLH. We hypothesized that patients with hyperferritinemia associated secondary HLH/sepsis/MODS/MAS can be successfully treated with a less immunosuppressant approach than is recommended for primary HLH.
We conducted a multi-center cohort study of children in Turkish Pediatric Intensive Care units with hyperferritinemia associated secondary HLH/sepsis/MODS/MAS treated with less immunosuppression (plasma exchange and intravenous immunoglobulin or methyl prednisolone) or with the primary HLH protocol (plasma exchange and dexamethasone or cyclosporine A and/or etoposide). The primary outcome assessed was hospital survival.
Twenty-three children with hyperferritinemia and secondary HLH/sepsis/MODS/MAS were enrolled (median ferritin = 6341 μg/dL, median number of organ failures = 5). Univariate and multivariate analyses demonstrated that use of plasma exchange and methyl prednisolone or intravenous immunoglobulin (n = 17, survival 100%) was associated with improved survival compared to plasma exchange and dexamethasone and/or cyclosporine and/or etoposide (n = 6, survival 50%) (P = 0.002).
Children with hyperferritinemia and secondary HLH/sepsis/MODS/MAS can be successfully treated with plasma exchange, intravenous immunoglobulin, and methylprednisolone. Randomized trials are required to evaluate whether the HLH-94 protocol is helpful or harmful compared to this less immunosuppressive and cytotoxic approach in this specific population.
We develop a statistical framework to study the relationship between chromatin features and gene expression. This can be used to predict gene expression of protein coding genes, as well as microRNAs. We demonstrate the prediction in a variety of contexts, focusing particularly on the modENCODE worm datasets. Moreover, our framework reveals the positional contribution around genes (upstream or downstream) of distinct chromatin features to the overall prediction of expression levels.
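The notion of a positional contribution can be illustrated with a deliberately simple stand-in model. The abstract does not specify the statistical framework, so the sketch below is an assumption: it fits a one-variable least-squares line per positional bin and compares the resulting R² values. The function `fit_line`, the bin names, and all numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Closed-form least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    syy = sum((v - mean_y) ** 2 for v in y)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("predictor and response must both vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope, (sxy * sxy) / (sxx * syy)


# Made-up data: signal of one chromatin mark in two positional bins
# (upstream and downstream of the gene start) for four genes,
# together with a log-scale expression value per gene.
bins: Dict[str, List[float]] = {
    "upstream": [1.0, 2.0, 3.0, 4.0],
    "downstream": [2.0, 1.0, 2.0, 1.0],
}
expression = [2.1, 3.9, 6.2, 7.8]

r2_by_bin = {name: fit_line(sig, expression)[2] for name, sig in bins.items()}
best_bin = max(r2_by_bin, key=r2_by_bin.get)
print(r2_by_bin, best_bin)
```

A real analysis would use many bins, many features, and a model evaluated on held-out genes. The point here is only that fitting per position makes it possible to ask where around a gene a given feature is most informative.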
Many host-adapted bacterial pathogens contain DNA methyltransferases (mod genes) that are subject to phase-variable expression (high-frequency reversible ON/OFF switching of gene expression). In Haemophilus influenzae, the random switching of the modA gene controls expression of a phase-variable regulon of genes (a “phasevarion”), via differential methylation of the genome in the modA ON and OFF states. Phase-variable mod genes are also present in Neisseria meningitidis and Neisseria gonorrhoeae, suggesting that phasevarions may occur in these important human pathogens. Phylogenetic studies on phase-variable mod genes associated with type III restriction modification (R-M) systems revealed that these organisms have two distinct mod genes: modA and modB. There are also distinct alleles of modA (abundant: modA11, 12, 13; minor: modA4, 15, 18) and modB (modB1, 2). These alleles differ only in their DNA recognition domain. ModA11 was only found in N. meningitidis and modA13 only in N. gonorrhoeae. The recognition site for the modA13 methyltransferase in N. gonorrhoeae strain FA1090 was identified as 5′-AGAAA-3′. Mutant strains lacking the modA11, 12 or 13 genes were made in N. meningitidis and N. gonorrhoeae and their phenotypes analyzed in comparison to a corresponding mod ON wild-type strain. Microarray analysis revealed that for all three modA alleles multiple genes were either upregulated or downregulated, some of which were virulence-associated. For example, in N. meningitidis MC58 (modA11), differentially expressed genes included those encoding the candidate vaccine antigens lactoferrin binding proteins A and B. Functional studies using N. gonorrhoeae FA1090 and the clinical isolate O1G1370 confirmed that modA13 ON and OFF strains have distinct phenotypes in antimicrobial resistance, in a primary human cervical epithelial cell model of infection, and in biofilm formation. This study, in conjunction with our previous work in H. influenzae, indicates that phasevarions may be a common strategy used by host-adapted bacterial pathogens to randomly switch between “differentiated” cell types.
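As an illustration of how a recognition site such as 5′-AGAAA-3′ could be located in a sequence, the following minimal sketch scans both strands of a DNA string for the motif. It is not the method used in the study: the function names and the toy sequence are hypothetical, and the sketch makes no assumption about which base within the site is methylated.

```python
# Minimal sketch (hypothetical helper names, toy data): locate occurrences
# of a methyltransferase recognition motif on both strands of a sequence.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase or lowercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def find_sites(sequence: str, motif: str = "AGAAA"):
    """Return sorted (0-based start, strand) pairs for every motif occurrence.

    A '-' hit means the motif lies on the reverse strand; its start is the
    leftmost forward-strand coordinate covered by the site. Overlapping
    occurrences are reported.
    """
    sequence = sequence.upper()
    hits = []
    for strand, pattern in (("+", motif.upper()), ("-", reverse_complement(motif))):
        start = sequence.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = sequence.find(pattern, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    toy = "CCAGAAAGGTTTCTCC"  # invented sequence, not taken from strain FA1090
    for position, strand in find_sites(toy):
        print(position, strand)
```

Because AGAAA is not palindromic, forward- and reverse-strand hits are distinct; for a palindromic motif the same position would be reported on both strands.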
The pathogenic Neisseria are bacterial pathogens that cause meningitis and gonorrhoea. They have adapted to life exclusively in humans and have developed unique strategies to colonize the host and to evade the immune response. Central among these strategies are genetic switches that randomly turn genes on and off. In most cases, the genes controlled by these switches, contingency genes, are required for making bacterial surface structures. Recently we described a new class of contingency gene that methylates DNA. Rather than affecting the synthesis of a single surface structure, on/off switching of this DNA-methyltransferase gene leads to random switching of multiple genes. In this study, we have shown that this mechanism exists in all pathogenic Neisseria, and alters expression of multiple genes in all cases we examined. The two distinct populations of bacteria generated by this process had different behavior in model systems of colonization and infection. Understanding this process is key to understanding these human pathogens, and to developing strategies for treatment and prevention of the diseases they cause.
The Encyclopedia of DNA Elements (ENCODE) project is an international consortium of investigators funded to analyze the human genome with the goal of producing a comprehensive catalog of functional elements. The ENCODE Data Coordination Center at The University of California, Santa Cruz (UCSC) is the primary repository for experimental results generated by ENCODE investigators. These results are captured in the UCSC Genome Bioinformatics database and download server for visualization and data mining via the UCSC Genome Browser and companion tools (Rhead et al. The UCSC Genome Browser Database: update 2010, in this issue). The ENCODE web portal at UCSC (http://encodeproject.org or http://genome.ucsc.edu/ENCODE) provides information about the ENCODE data and convenient links for access.
DNA sequence analysis of the modABCD operon of Escherichia coli revealed the presence of four open reading frames. The first gene, modA, codes for a 257-amino-acid periplasmic binding protein, as indicated by the presence of a signal peptide-like sequence. The second gene (modB) encodes a 229-amino-acid protein with a potential membrane location, while the 352-amino-acid ModC protein (modC product) contains a nucleotide-binding motif. On the basis of sequence similarities with proteins from other transport systems and molybdate transport proteins from other organisms, these three proteins are proposed to constitute the molybdate transport system. The fourth open reading frame (modD) encodes a 231-amino-acid protein of unknown function. Plasmids containing different mod genes were used to map several molybdate-suppressible chlorate-resistant mutants; interestingly, none of the 40 mutants tested had a mutation in the modD gene. About 35% of these chlorate-resistant mutants were not complemented by mod operon DNA. These mutants, designated mol, contained mutations at unknown chromosomal location(s) and produced formate hydrogenlyase activity only when cultured in molybdate-supplemented glucose-minimal medium, not in L broth. This group of mol mutants constitutes a new class of molybdate utilization mutants distinct from other known mutants in molybdate metabolism. These results show that molybdate, after transport into cells by the ModABC proteins, is metabolized (activated?) by the products of the mol gene(s).