In an effort to comprehensively characterize the functional elements within the genomes of the important model organisms Drosophila melanogaster and Caenorhabditis elegans, the NHGRI model organism Encyclopaedia of DNA Elements (modENCODE) consortium has generated an enormous library of genomic data along with detailed, structured information on all aspects of the experiments. The modMine database (http://intermine.modencode.org) described here has been built by the modENCODE Data Coordination Center to allow the broader research community to (i) search for and download data sets of interest among the thousands generated by modENCODE; (ii) access the data in an integrated form together with non-modENCODE data sets; and (iii) facilitate fine-grained analysis of the above data. The sophisticated search features are possible because of the collection of extensive experimental metadata by the consortium. Interfaces are provided to allow both biologists and bioinformaticians to exploit these rich modENCODE data sets now available via modMine.
To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- and tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation.
The model organism Encyclopedia of DNA Elements (modENCODE) project is a National Human Genome Research Institute (NHGRI) initiative designed to characterize the genomes of Drosophila melanogaster and Caenorhabditis elegans. A Data Coordination Center (DCC) was created to collect, store and catalog modENCODE data. An effective DCC must gather, organize and provide all primary, interpreted and analyzed data, and ensure the community is supplied with the knowledge of the experimental conditions, protocols and verification checks used to generate each primary data set. We present here the design principles of the modENCODE DCC, and describe the ramifications of collecting thorough and deep metadata for describing experiments, including the use of a wiki for capturing protocol and reagent information, and the BIR-TAB specification for linking biological samples to experimental results. modENCODE data can be found at http://www.modencode.org.
Database URL: http://www.modencode.org.
WormBase (www.wormbase.org) has been serving the scientific community for over 11 years as the central repository for genomic and genetic information for the soil nematode Caenorhabditis elegans. The resource has evolved from its beginnings as a database housing the genomic sequence and genetic and physical maps of a single species, and now represents the breadth and diversity of nematode research, currently serving genome sequence and annotation for around 20 nematodes. In this article, we focus on WormBase’s role of genome sequence annotation, describing how we annotate and integrate data from a growing collection of nematode species and strains. We also review our approaches to sequence curation, and discuss the impact on annotation quality of large functional genomics projects such as modENCODE.
Caenorhabditis elegans; annotation; community resource; genome; model organism database; nematode; parasitic nematode; sequence curation
The functional repertoire of long intergenic noncoding RNA (lincRNA) molecules has begun to be elucidated in mammals. Determining the biological relevance and potential gene regulatory mechanisms of these enigmatic molecules would be expedited in a more tractable model organism, such as Drosophila melanogaster. To this end, we defined a set of 1,119 putative lincRNA genes in D. melanogaster using modENCODE whole transcriptome (RNA-seq) data. A large majority (1.1 of 1.3 Mb; 85%) of these bases were not previously reported by modENCODE as being transcribed. Significant selective constraint on the sequences of these loci predicts that virtually all have sustained functionality across the Drosophila clade. We observe biases in lincRNA genomic locations and expression profiles that are consistent with some of these lincRNAs being involved in the regulation of neighboring protein-coding genes with developmental functions. We identify lincRNAs that may be important in the developing nervous system and in male-specific organs, such as the testes. LincRNA loci were also identified whose positions, relative to nearby protein-coding loci, are equivalent between D. melanogaster and mouse. This study predicts that the genomes of not only vertebrates, such as mammals, but also an invertebrate (fruit fly) harbor large numbers of lincRNA loci. Our findings now permit exploitation of Drosophila genetics for the investigation of lincRNA mechanisms, including lincRNAs with potential functional analogues in mammals.
long intergenic noncoding RNAs; modENCODE; transcriptional regulation; evolution; development
Motivation: The highly coordinated expression of thousands of genes in an organism is regulated by the concerted action of transcription factors, chromatin proteins and epigenetic mechanisms. High-throughput experimental data for genome wide in vivo protein–DNA interactions and epigenetic marks are becoming available from large projects, such as the model organism ENCyclopedia Of DNA Elements (modENCODE) and from individual labs. Dissemination and visualization of these datasets in an explorable form is an important challenge.
Results: To support research on Drosophila melanogaster transcription regulation and make the genome wide in vivo protein–DNA interactions data available to the scientific community as a whole, we have developed a system called Flynet. Currently, Flynet contains 101 datasets for 38 transcription factors and chromatin regulator proteins in different experimental conditions. These factors exhibit different types of binding profiles ranging from sharp localized peaks to broad binding regions. The protein–DNA interaction data in Flynet was obtained from the analysis of chromatin immunoprecipitation experiments on one color and two color genomic tiling arrays as well as chromatin immunoprecipitation followed by massively parallel sequencing. A web-based interface, integrated with an AJAX based genome browser, has been built for queries and presenting analysis results. Flynet also makes available the cis-regulatory modules reported in literature, known and de novo identified sequence motifs across the genome, and other resources to study gene regulation.
Availability: Flynet is available at https://www.cistrack.org/flynet/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Funded by the National Institutes of Health (NIH), the aim of the Model Organism ENCyclopedia of DNA Elements (modENCODE) project is to provide the biological research community with a comprehensive encyclopedia of functional genomic elements for both model organisms C. elegans (worm) and D. melanogaster (fly). With a total size of just under 10 terabytes of data collected and released to the public, one of the challenges faced by researchers is to extract biologically meaningful knowledge from this large data set. While the basic quality control, pre-processing, and analysis of the data have already been performed by members of the modENCODE consortium, many researchers will wish to reinterpret the data set using modifications and enhancements of the original protocols, or combine modENCODE data with other data sets. Unfortunately, this can be a time-consuming and logistically challenging proposition.
In recognition of this challenge, the modENCODE DCC has released uniform computing resources for analyzing modENCODE data on Galaxy (https://github.com/modENCODE-DCC/Galaxy), on the public Amazon Cloud (http://aws.amazon.com), and on the private Bionimbus Cloud for genomic research (http://www.bionimbus.org). In particular, we have released Galaxy workflows for interpreting ChIP-seq data which use the same quality control (QC) and peak calling standards adopted by the modENCODE and ENCODE communities. For convenience of use, we have created Amazon and Bionimbus Cloud machine images containing Galaxy along with all the modENCODE data, software and other dependencies.
These resources provide a framework for running consistent and reproducible analyses on modENCODE data, ultimately allowing researchers to spend more of their time using modENCODE data and less time moving it around.
Despite many efforts, little is known about the distribution and interactions of the chromatin proteins that contribute to the specificity of chromomeric organization of interphase chromosomes. To address this issue, we used publicly available datasets from several recent Drosophila genome-wide mapping and annotation projects, in particular those from the modENCODE project, and compared the molecular organization of 13 interband regions that had previously been mapped accurately.
Here we demonstrate that in interphase chromosomes of Drosophila cell lines, the interband regions are enriched for a specific set of proteins generally characteristic of the "open" chromatin (RNA polymerase II, CHRIZ (CHRO), BEAF-32, BRE1, dMI-2, GAF, NURF301, WDS and TRX). These regions also display reduced nucleosome density, histone H1 depletion and pronounced enrichment for ORC2, a pre-replication complex component. Within the 13 interband regions analyzed, most were around 3-4 kb long, particularly those where many of said protein features were present. We estimate there are about 3500 regions with similar properties in chromosomes of D. melanogaster cell lines, which fits quite well the number of cytologically observed interbands in salivary gland polytene chromosomes.
Our observations suggest strikingly similar organization of interband chromatin in polytene chromosomes and in chromosomes from cell lines thereby reflecting the existence of a universal principle of interphase chromosome organization.
Systematic annotation of gene regulatory elements is a major challenge in genome science. Direct mapping of chromatin modification marks and transcriptional factor binding sites genome-wide 1,2 has successfully identified specific subtypes of regulatory elements 3. In Drosophila several pioneering studies have provided genome-wide identification of Polycomb-Response Elements 4, chromatin states 5, transcription factor binding sites (TFBS) 6–9, PolII regulation 8, and insulator elements 10; however, comprehensive annotation of the regulatory genome remains a significant challenge. Here we describe results from the modENCODE cis-regulatory annotation project. We produced a map of the Drosophila melanogaster regulatory genome based on more than 300 chromatin immuno-precipitation (ChIP) datasets for eight chromatin features, five histone deacetylases (HDACs) and thirty-eight site-specific transcription factors (TFs) at different stages of development. Using these data we inferred more than 20,000 candidate regulatory elements and we validated a subset of predictions for promoters, enhancers, and insulators in vivo. We also identified nearly 2,000 genomic regions of dense TF binding associated with chromatin activity and accessibility. We discovered hundreds of new TF co-binding relationships and defined a TF network with over 800 potential regulatory relationships.
In D. melanogaster polytene chromosomes, intercalary heterochromatin (IH) appears as large dense bands scattered in euchromatin and comprises clusters of repressed genes. IH displays distinctly low gene density, indicative of their particular regulation. Genes embedded in IH replicate late in the S phase and become underreplicated. We asked whether localization and organization of these late-replicating domains is conserved in a distinct cell type. Using published comprehensive genome-wide chromatin annotation datasets (modENCODE and others), we compared IH organization in salivary gland cells and in a Kc cell line. We first established the borders of 60 IH regions on a molecular map, these regions containing underreplicated material and encompassing ∼12% of Drosophila genome. We showed that in Kc cells repressed chromatin constituted 97% of the sequences that corresponded to IH bands. This chromatin is depleted for ORC-2 binding and largely replicates late. Differences in replication timing between the cell types analyzed are local and affect only sub-regions but never whole IH bands. As a rule such differentially replicating sub-regions display open chromatin organization, which apparently results from cell-type specific gene expression of underlying genes. We conclude that repressed chromatin organization of IH is generally conserved in polytene and non-polytene cells. Yet, IH domains do not function as transcription- and replication-regulatory units, because differences in transcription and replication between cell types are not domain-wide, rather they are restricted to small “islands” embedded in these domains. IH regions can thus be defined as a special class of domains with low gene density, which have narrow temporal expression patterns, and so displaying relatively conserved organization.
To assess whether the pattern of high rates of genome rearrangement, with a bias towards within-chromosome events, is true of nematodes in general, genome sequence was used to compare the model Caenorhabditis elegans and the filarial parasite Brugia malayi. It is suggested that intrachromosomal rearrangement is a major force driving chromosomal organization in nematodes.
Comparisons between the genomes of the closely related nematodes Caenorhabditis elegans and Caenorhabditis briggsae reveal high rates of rearrangement, with a bias towards within-chromosome events. To assess whether this pattern is true of nematodes in general, we have used genome sequence to compare two nematode species that last shared a common ancestor approximately 300 million years ago: the model C. elegans and the filarial parasite Brugia malayi.
An 83 kb region flanking the gene for Bm-mif-1 (macrophage migration inhibitory factor, a B. malayi homolog of a human cytokine) was sequenced. When compared to the complete genome of C. elegans, evidence for conservation of long-range synteny and microsynteny was found. Potential C. elegans orthologs for 11 of the 12 protein-coding genes predicted in the B. malayi sequence were identified. Ten of these orthologs were located on chromosome I, with eight clustered in a 2.3 Mb region. While several relatively local intrachromosomal rearrangements have occurred, the order, composition, and configuration of two gene clusters, each containing three genes, were conserved. Comparison of B. malayi BAC-end genome survey sequence to C. elegans also revealed a bias towards intrachromosomal rearrangements.
We suggest that intrachromosomal rearrangement is a major force driving chromosomal organization in nematodes, but is constrained by the interdigitation of functional elements of neighboring genes.
MicroRNAs (miRNAs) are non-coding RNAs with important roles in regulating gene expression. Recent studies indicate that transcription and cleavage of miRNA are coupled, and that chromatin structure may influence miRNA transcription. However, little is known about the relationship between the chromatin structure and cleavage of pre-miRNA from pri-miRNA.
By analysis of genome-wide nucleosome positioning data sets from human and Caenorhabditis elegans, we found an enrichment of positioned nucleosomes on pre-miRNA genomic sequences, which is highly correlated with GC content within pre-miRNA. In addition, clear enrichments of three histone modifications (H2BK5me1, H3K36me3 and H4K20me1) as well as RNA Polymerase II (RNAPII) were observed on pre-miRNA genomic sequences corresponding to the active-promoter miRNAs and expressed miRNAs.
Our results revealed the chromatin structure characteristics of pre-miRNA genomic sequences, and implied potential mechanisms that can recognize these characteristics, thus improving pre-miRNA cleavage.
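As an illustration of the kind of relationship reported above between GC content and nucleosome positioning, the following minimal sketch computes the GC fraction of a set of sequences and its Pearson correlation with an occupancy score. It is not the analysis used in the study: the sequences, the occupancy values and the function names `gc_content` and `pearson` are all hypothetical.

```python
import math


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in sequence) / len(sequence)


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy sequences and occupancy scores (not real pre-miRNA data).
sequences = ["ATATATAT", "ATGCATAT", "GCGCATAT", "GCGCGCAT", "GCGCGCGC"]
occupancy = [0.1, 0.3, 0.4, 0.8, 0.9]

gc_values = [gc_content(seq) for seq in sequences]
r = pearson(gc_values, occupancy)
print(f"GC fractions: {gc_values}, Pearson r = {r:.3f}")
```

On real data the sequences would be the pre-miRNA genomic regions and the occupancy values would come from a nucleosome positioning data set.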
The zinc finger (ZF) protein CTCF (CCCTC-binding factor) is highly conserved in Drosophila and vertebrates where it has been shown to mediate chromatin insulation at a genomewide level. A mode of genetic regulation that involves insulators and insulator binding proteins to establish independent transcriptional units is currently not known in nematodes including Caenorhabditis elegans. We therefore searched in nematodes for orthologs of proteins that are involved in chromatin insulation.
While orthologs for other insulator proteins were absent in all 35 analysed nematode species, we find orthologs of CTCF in a subset of nematodes. As an example of these, we cloned the Trichinella spiralis CTCF-like gene and revealed a genomic structure very similar to the Drosophila counterpart. To investigate the pattern of CTCF occurrence in nematodes, we performed phylogenetic analysis with the ZF protein sets of completely sequenced nematodes. We show that three ZF proteins from three basal nematodes cluster together with known CTCF proteins whereas no zinc finger protein of C. elegans and other derived nematodes does so.
Our findings show that CTCF and possibly chromatin insulation are present in basal nematodes. We suggest that the insulator protein CTCF has been secondarily lost in derived nematodes like C. elegans. We propose a switch in the regulation of gene expression during nematode evolution, from the common vertebrate and insect type involving distantly acting regulatory elements and chromatin insulation to a so far poorly characterised mode present in more derived nematodes. Here, all or some of these components are missing. Instead operons, polycistronic transcriptional units common in derived nematodes, seemingly adopted their function.
For more than ten years the nematode Caenorhabditis elegans has proven to be a valuable model for studies of the host response to various bacterial and fungal pathogens. When exposed to a pathogenic organism, a clear response is elicited in the nematode, which is characterized by specific alterations on the transcriptional and translational levels. Early on, researchers took advantage of the possibility to conduct large-scale investigations of the C. elegans immune response. Multiple studies demonstrated that C. elegans does indeed mount a protective response against invading pathogens, thus rendering this small nematode a very useful and simple host model for the study of innate immunity and host-pathogen interactions. Here, we provide an overview of key aspects of innate immunity in C. elegans revealed by recent whole-genome transcriptomics and proteomics studies of the global response of C. elegans to various bacterial and fungal pathogens.
innate immunity; C. elegans; transcriptomics; proteomics; model host
WormBase (http://www.wormbase.org/) is the central data repository for information about Caenorhabditis elegans and related nematodes. As a model organism database, WormBase extends beyond the genomic sequence, integrating experimental results with extensively annotated views of the genome. The WormBase Consortium continues to expand the biological scope and utility of WormBase with the inclusion of large-scale genomic analyses, through active data and literature curation, through new analysis and visualization tools, and through refinement of the user interface. Over the past year, the nearly complete genomic sequence and comparative analyses of the closely related species Caenorhabditis briggsae have been integrated into WormBase, including gene predictions, ortholog assignments and a new synteny viewer to display the relationships between the two species. Extensive site-wide refinement of the user interface now provides quick access to the most frequently accessed resources and a consistent browsing experience across the site. Unified single-page views now provide complete summaries of commonly accessed entries like genes. These advances continue to increase the utility of WormBase for C. elegans researchers, as well as for those researchers exploring problems in functional and comparative genomics in the context of a powerful genetic system.
The genomes of numerous parasitic nematodes are currently being sequenced, but their complexity and size, together with high levels of intra-specific sequence variation and a lack of reference genomes, make their assembly and annotation a challenging task. Haemonchus contortus is an economically significant parasite of livestock that is widely used for basic research as well as for vaccine development and drug discovery. It is one of many medically and economically important parasites within the strongylid nematode group. This group of parasites has the closest phylogenetic relationship with the model organism Caenorhabditis elegans, making comparative analysis a potentially powerful tool for genome annotation and functional studies. To investigate this hypothesis, we sequenced two contiguous fragments from the H. contortus genome and undertook detailed annotation and comparative analysis with C. elegans. The adult H. contortus transcriptome was sequenced using an Illumina platform and RNA-seq was used to annotate a 409 kb overlapping BAC tiling path relating to the X chromosome and a 181 kb BAC insert relating to chromosome I. In total, 40 genes and 12 putative transposable elements were identified. Of the annotated genes, 97.5% had detectable homologues in C. elegans, of which 60% had putative orthologues, significantly higher than previous analyses based on EST analysis. Gene density appears to be lower in H. contortus than in C. elegans, with annotated H. contortus genes being on average two to three times larger than their putative C. elegans orthologues due to a greater intron number and size. Synteny appears high but gene order is generally poorly conserved, although areas of conserved microsynteny are apparent. C. elegans operons appear to be partially conserved in H. contortus. Our findings suggest that a combination of RNA-seq and comparative analysis with C. elegans is a powerful approach for the annotation and analysis of strongylid nematode genomes.
While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets across 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase.
The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with unusually many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs posed the greatest difficulty for gene-finders.
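To make the gene-level accuracy figures above concrete, the following minimal sketch computes sensitivity and specificity for a set of predicted gene models against a reference set. It assumes the convention common in gene-finding evaluations, where specificity is the fraction of predictions that are correct (TP / (TP + FP)) and a prediction counts as correct only if it matches a reference gene exactly; the exact scoring rules used by nGASP are not given in the text above. The function name `gene_level_accuracy` and the coordinates are hypothetical.

```python
def gene_level_accuracy(predicted, reference):
    """Return (sensitivity, specificity) for exact-match gene predictions.

    Each gene is a hashable description of its structure, here a tuple of
    (start, end) coding exons. Assumed definitions:
      sensitivity = correct predictions / reference genes
      specificity = correct predictions / predicted genes
    """
    predicted = set(predicted)
    reference = set(reference)
    true_positives = len(predicted & reference)
    sensitivity = true_positives / len(reference) if reference else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical toy gene models (not taken from the nGASP test regions).
reference_genes = [
    ((100, 200), (300, 400)),
    ((1000, 1200),),
    ((5000, 5100), (5200, 5300)),
]
predicted_genes = [
    ((100, 200), (300, 400)),      # exact match
    ((1000, 1250),),               # wrong 3' boundary, counted as incorrect
    ((5000, 5100), (5200, 5300)),  # exact match
    ((9000, 9100),),               # no reference counterpart
]

sens, spec = gene_level_accuracy(predicted_genes, reference_genes)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Under this convention a gene-finder that predicts many extra genes can have high sensitivity and low specificity at the same time, which is the pattern reported for the combiners.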
This experiment establishes a baseline of gene prediction accuracy in Caenorhabditis genomes, and has guided the choice of gene-finders for the annotation of newly sequenced genomes of Caenorhabditis and other nematode species. We have created new gene sets for C. briggsae, C. remanei, C. brenneri, C. japonica, and Brugia malayi using some of the best-performing gene-finders.
RNA interference (RNAi) has emerged as a powerful tool to generate loss-of-function phenotypes in a variety of organisms. Combined with the sequence information of almost completely annotated genomes, RNAi technologies have opened new avenues to conduct systematic genetic screens for every annotated gene in the genome. As increasing large datasets of RNAi-induced phenotypes become available, an important challenge remains the systematic integration and annotation of functional information. Genome-wide RNAi screens have been performed both in Caenorhabditis elegans and Drosophila for a variety of phenotypes and several RNAi libraries have become available to assess phenotypes for almost every gene in the genome. These screens were performed using different types of assays from visible phenotypes to focused transcriptional readouts and provide a rich data source for functional annotation across different species. The GenomeRNAi database provides access to published RNAi phenotypes obtained from cell-based screens and maps them to their genomic locus, including possible non-specific regions. The database also gives access to sequence information of RNAi probes used in various screens. It can be searched by phenotype, by gene, by RNAi probe or by sequence and is accessible at
WormBase, a model organism database for Caenorhabditis elegans and other related nematodes, continues to evolve and expand. Over the past year WormBase has added new data on C. elegans, including data on classical genetics, cell biology and functional genomics; expanded the annotation of closely related nematodes with a new genome browser for Caenorhabditis remanei; and deployed new hardware for stronger performance. Several existing datasets including phenotype descriptions and RNAi experiments have seen a large increase in new content. New datasets such as the C. remanei draft assembly and annotations, the Vancouver Fosmid library and TEC-RED 5′ end sites are now available as well. Accessing and searching WormBase has become more dependable and flexible via multiple mirror sites and indexing through Google.
The Drosophila MSL complex mediates dosage compensation by increasing transcription of the single X chromosome in males approximately two-fold. This is accomplished through recognition of the X chromosome and subsequent acetylation of histone H4K16 on X-linked genes. Initial binding to the X is thought to occur at “entry sites” that contain a consensus sequence motif (“MSL recognition element” or MRE). However, this motif is only ∼2 fold enriched on X, and only a fraction of the motifs on X are initially targeted. Here we ask whether chromatin context could distinguish between utilized and non-utilized copies of the motif, by comparing their relative enrichment for histone modifications and chromosomal proteins mapped in the modENCODE project. Through a comparative analysis of the chromatin features in male S2 cells (which contain MSL complex) and female Kc cells (which lack the complex), we find that the presence of active chromatin modifications, together with an elevated local GC content in the surrounding sequences, has strong predictive value for functional MSL entry sites, independent of MSL binding. We tested these sites for function in Kc cells by RNAi knockdown of Sxl, resulting in induction of MSL complex. We show that ectopic MSL expression in Kc cells leads to H4K16 acetylation around these sites and a relative increase in X chromosome transcription. Collectively, our results support a model in which a pre-existing active chromatin environment, coincident with H3K36me3, contributes to MSL entry site selection. The consequences of MSL targeting of the male X chromosome include increase in nucleosome lability, enrichment for H4K16 acetylation and JIL-1 kinase, and depletion of linker histone H1 on active X-linked genes. Our analysis can serve as a model for identifying chromatin and local sequence features that may contribute to selection of functional protein binding sites in the genome.
The genomes of complex organisms encompass hundreds of millions of base pairs of DNA, and regulatory molecules must distinguish specific targets within this vast landscape. In general, regulatory factors find target genes through sequence-specific interactions with the underlying DNA. However, sequence-specific factors typically bind only a fraction of the candidate genomic regions containing their specific target sequence motif. Here we identify potential roles for chromatin environment and flanking sequence composition in helping regulatory factors find their appropriate binding sites, using targeting of the Drosophila dosage compensation complex as a model. The initial stage of dosage compensation involves binding of the Male Specific Lethal (MSL) complex to a sequence motif called the MSL recognition element (MRE). Using data from a large chromatin mapping effort (the modENCODE project), we successfully identify an active chromatin environment as predictive of selective MRE binding by the MSL complex. Our study provides a framework for using genome-wide datasets to analyze and predict functional protein–DNA binding site selection.
Chromatin immunoprecipitation (ChIP) followed by microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq) allows genome-wide discovery of protein-DNA interactions such as transcription factor bindings and histone modifications. Previous reports only compared a small number of profiles, and little has been done to compare histone modification profiles generated by the two technologies or to assess the impact of input DNA libraries in ChIP-seq analysis. Here, we performed a systematic analysis of a modENCODE dataset consisting of 31 pairs of ChIP-chip/ChIP-seq profiles of the coactivator CBP, RNA polymerase II (RNA PolII), and six histone modifications across four developmental stages of Drosophila melanogaster.
Both technologies produce highly reproducible profiles within each platform; ChIP-seq generally produces profiles with a better signal-to-noise ratio and allows detection of more peaks and narrower peaks. The sets of peaks identified by the two technologies can be significantly different, but the extent to which they differ varies depending on the factor and the analysis algorithm. Importantly, we found that there is significant variation among multiple sequencing profiles of input DNA libraries and that this variation most likely arises from differences in both experimental condition and sequencing depth. We further show that using an inappropriate input DNA profile can impact the average signal profiles around genomic features and peak calling results, highlighting the importance of having high quality input DNA data for normalization in ChIP-seq analysis.
Our findings highlight the biases present in each of the platforms, show the variability that can arise from both technology and analysis methods, and emphasize the importance of obtaining high quality and deeply sequenced input DNA libraries for ChIP-seq analysis.
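The dependence of ChIP-seq signal on the input DNA profile can be illustrated with a minimal sketch of one common normalization, a per-bin log2 ratio of ChIP to depth-scaled input. This is an assumption chosen for illustration, not the normalization used in the study; the function name `log2_enrichment`, the pseudocount and the read counts are hypothetical.

```python
import math


def log2_enrichment(chip_counts, input_counts, pseudocount=1.0):
    """Return per-bin log2(ChIP / input) after scaling input to ChIP depth."""
    if len(chip_counts) != len(input_counts):
        raise ValueError("ChIP and input profiles must cover the same bins")
    chip_total = sum(chip_counts)
    input_total = sum(input_counts)
    if chip_total == 0 or input_total == 0:
        raise ValueError("profiles must contain at least one read")
    # Scale the input so that both libraries have the same total read count.
    scale = chip_total / input_total
    return [
        math.log2((chip + pseudocount) / (inp * scale + pseudocount))
        for chip, inp in zip(chip_counts, input_counts)
    ]


# Hypothetical read counts in five genomic bins; the ChIP peak is in bin 2.
chip = [10, 12, 80, 11, 9]
input_flat = [10, 11, 12, 10, 9]    # input with even coverage
input_biased = [10, 11, 60, 10, 9]  # input with a local coverage bias in bin 2

enrich_flat = log2_enrichment(chip, input_flat)
enrich_biased = log2_enrichment(chip, input_biased)
print(f"bin 2 with flat input:   {enrich_flat[2]:.2f}")
print(f"bin 2 with biased input: {enrich_biased[2]:.2f}")
```

With the same ChIP profile, the apparent enrichment at the peak changes substantially depending on which input profile is used, which is why the choice and quality of the input library matter for peak calling.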
The soil nematodes Caenorhabditis briggsae and Caenorhabditis elegans diverged from a common ancestor roughly 100 million years ago and yet are almost indistinguishable by eye. They have the same chromosome number and genome sizes, and they occupy the same ecological niche. To explore the basis for this striking conservation of structure and function, we have sequenced the C. briggsae genome to a high-quality draft stage and compared it to the finished C. elegans sequence. We predict approximately 19,500 protein-coding genes in the C. briggsae genome, roughly the same as in C. elegans. Of these, 12,200 have clear C. elegans orthologs, a further 6,500 have one or more clearly detectable C. elegans homologs, and approximately 800 C. briggsae genes have no detectable matches in C. elegans. Almost all of the noncoding RNAs (ncRNAs) known are shared between the two species. The two genomes exhibit extensive colinearity, and the rate of divergence appears to be higher in the chromosomal arms than in the centers. Operons, a distinctive feature of C. elegans, are highly conserved in C. briggsae, with the arrangement of genes being preserved in 96% of cases. The difference in size between the C. briggsae (estimated at approximately 104 Mbp) and C. elegans (100.3 Mbp) genomes is almost entirely due to repetitive sequence, which accounts for 22.4% of the C. briggsae genome in contrast to 16.5% of the C. elegans genome. Few, if any, repeat families are shared, suggesting that most were acquired after the two species diverged or are undergoing rapid evolution. Coclustering the C. elegans and C. briggsae proteins reveals 2,169 protein families of two or more members. Most of these are shared between the two species, but some appear to be expanding or contracting, and there seem to be as many as several hundred novel C. briggsae gene families. The C. briggsae draft sequence will greatly improve the annotation of the C. elegans genome. Based on similarity to C. briggsae, we found strong evidence for 1,300 new C. elegans genes. In addition, comparisons of the two genomes will help to understand the evolutionary forces that mold nematode genomes.
With the Caenorhabditis briggsae genome now in hand, C. elegans biologists have a powerful new research tool to refine their knowledge of gene function in C. elegans and to study the path of genome evolution.
Advances in sequencing technology have boosted population genomics and made it possible to map the positions of transcription factor binding sites (TFBSs) with high precision. Here we investigate TFBS variability by combining transcription factor binding maps generated by ENCODE, modENCODE, our previously published data and other sources with genomic variation data for human individuals and Drosophila isogenic lines.
We introduce a metric of TFBS variability that takes into account changes in motif match associated with mutation and makes it possible to investigate TFBS functional constraints instance-by-instance as well as in sets that share common biological properties. We also take advantage of the emerging per-individual transcription factor binding data to show evidence that TFBS mutations, particularly at evolutionarily conserved sites, can be efficiently buffered to ensure coherent levels of transcription factor binding.
Our analyses provide insights into the relationship between individual and interspecies variation and show evidence for the functional buffering of TFBS mutations in both humans and flies. In a broad perspective, these results demonstrate the potential of combining functional genomics and population genetics approaches for understanding gene regulation.
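The idea of scoring a change in motif match caused by a mutation can be illustrated with a minimal sketch. This is not the metric defined in the study; it assumes a simple position weight matrix (PWM) with a uniform background and reports the difference in log-odds score between a reference and a variant site. The names `log_odds_score`, `motif_match_change`, and `TOY_PWM`, and all probabilities in the matrix, are hypothetical.

```python
import math
from typing import Dict, List

# Hypothetical 3-column PWM: each column maps a base to its probability.
# These values are invented for illustration only.
TOY_PWM: List[Dict[str, float]] = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]


def log_odds_score(site: str, pwm: List[Dict[str, float]],
                   background: float = 0.25) -> float:
    """Sum of per-position log2(p / background) for a site of PWM length."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal PWM length")
    return sum(math.log2(column[base] / background)
               for base, column in zip(site, pwm))


def motif_match_change(ref_site: str, alt_site: str,
                       pwm: List[Dict[str, float]]) -> float:
    """Variant score minus reference score; negative means a weaker match."""
    return log_odds_score(alt_site, pwm) - log_odds_score(ref_site, pwm)


# A C>T substitution at the second position of the toy consensus site.
delta = motif_match_change("ACG", "ATG", TOY_PWM)
```

Here a negative `delta` indicates that the variant allele matches the motif less well than the reference allele. A real analysis would need an empirically derived matrix, a background model, and a way to combine such changes across individuals or lines.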
Much of the morphological diversity in eukaryotes results from differential regulation of gene expression in which transcription factors (TFs) play a central role. The nematode Caenorhabditis elegans is an established model organism for the study of the roles of TFs in controlling the spatiotemporal pattern of gene expression. Using the fully sequenced genomes of three Caenorhabditid nematode species as well as genome information from additional more distantly related organisms (fruit fly, mouse, and human), we sought to identify orthologous TFs and characterize their patterns of evolution.
We identified 988 TF genes in C. elegans, and inferred corresponding sets in C. briggsae and C. remanei, containing 995 and 1093 TF genes, respectively. Analysis of the three gene sets revealed 652 3-way reciprocal 'best hit' orthologs (nematode TF set), approximately half of which are zinc finger (ZF-C2H2 and ZF-C4/NHR types) and HOX family members. Examination of the TF genes in C. elegans and C. briggsae identified the presence of significant tandem clustering on chromosome V, the majority of which belong to the ZF-C4/NHR family. We also found evidence for lineage-specific duplications and rapid evolution of many of the TF genes in the two species. A search of the TFs conserved among nematodes in Drosophila melanogaster, Mus musculus and Homo sapiens revealed 150 reciprocal orthologs, many of which are associated with important biological processes and human diseases. Finally, a comparison of the sequence, gene interactions and function indicates that nematode TFs conserved across phyla exhibit significantly more interactions and are enriched in genes with annotated mutant phenotypes compared to those that lack orthologs in other species.
Our study represents the first comprehensive genome-wide analysis of TFs across three nematode species and other organisms. The findings indicate substantial conservation of transcription factors even across distant evolutionary lineages and form the basis for future experiments to examine TF gene function in nematodes and other divergent phyla.
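The 3-way reciprocal 'best hit' criterion mentioned above can be sketched as follows. This is an illustrative outline, not the pipeline used in the study: it assumes that the best-hit tables have already been computed by some sequence-similarity search, and the species labels, gene identifiers, and the function name `three_way_reciprocal_best_hits` are hypothetical.

```python
from typing import Dict, List, Tuple

# best_hits[(src, dst)][gene] is the top-scoring match in species `dst`
# for `gene` from species `src`. The toy entries below are invented.
BestHits = Dict[Tuple[str, str], Dict[str, str]]

TOY_HITS: BestHits = {
    ("ce", "cb"): {"ce1": "cb1", "ce2": "cb2"},
    ("cb", "ce"): {"cb1": "ce1", "cb2": "ce2"},
    ("ce", "cr"): {"ce1": "cr1", "ce2": "cr2"},
    ("cr", "ce"): {"cr1": "ce1", "cr2": "ce2"},
    ("cb", "cr"): {"cb1": "cr1", "cb2": "cr9"},  # cb2 prefers cr9, not cr2
    ("cr", "cb"): {"cr1": "cb1", "cr2": "cb2"},
}


def three_way_reciprocal_best_hits(
    best_hits: BestHits, species: Tuple[str, str, str]
) -> List[Tuple[str, str, str]]:
    """Return gene triples that are mutual best hits in all three pairs."""
    a, b, c = species
    triples = []
    for gene_a, gene_b in best_hits[(a, b)].items():
        # The a/b pair must be reciprocal.
        if best_hits[(b, a)].get(gene_b) != gene_a:
            continue
        # The a/c pair must be reciprocal.
        gene_c = best_hits[(a, c)].get(gene_a)
        if gene_c is None or best_hits[(c, a)].get(gene_c) != gene_a:
            continue
        # The b/c pair must close the triangle in both directions.
        if best_hits[(b, c)].get(gene_b) != gene_c:
            continue
        if best_hits[(c, b)].get(gene_c) != gene_b:
            continue
        triples.append((gene_a, gene_b, gene_c))
    return sorted(triples)


orthologs = three_way_reciprocal_best_hits(TOY_HITS, ("ce", "cb", "cr"))
```

In the toy data only the first gene triple satisfies all six best-hit relations; the second is rejected because one of the pairwise hits is not reciprocal. Requiring agreement across all three species pairs is stricter than a pairwise reciprocal test, so it tends to exclude genes affected by lineage-specific duplication.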
Methylation of histone H3K36 in higher eukaryotes is mediated by multiple methyltransferases. Set2-related H3K36 methyltransferases are targeted to genes by association with RNA Polymerase II and are involved in preventing aberrant transcription initiation within the body of genes. The targeting and roles of the NSD family of mammalian H3K36 methyltransferases, known to be involved in human developmental disorders and oncogenesis, are not known. We used genome-wide chromatin immunoprecipitation (ChIP) to investigate the targeting and roles of the Caenorhabditis elegans NSD homolog MES-4, which is maternally provided to progeny and is required for the survival of nascent germ cells. ChIP analysis in early C. elegans embryos revealed that, consistent with immunostaining results, MES-4 binding sites are concentrated on the autosomes and the leftmost ∼2% (300 kb) of the X chromosome. MES-4 overlies the coding regions of approximately 5,000 genes, with a modest elevation in the 5′ regions of gene bodies. Although MES-4 is generally found over Pol II-bound genes, analysis of gene sets with different temporal-spatial patterns of expression revealed that Pol II association with genes is neither necessary nor sufficient to recruit MES-4. In early embryos, MES-4 associates with genes that were previously expressed in the maternal germ line, an interaction that does not require continued association of Pol II with those loci. Conversely, Pol II association with genes newly expressed in embryos does not lead to recruitment of MES-4 to those genes. These and other findings suggest that MES-4, and perhaps the related mammalian NSD proteins, provide an epigenetic function for H3K36 methylation that is novel and likely to be unrelated to ongoing transcription. We propose that MES-4 transmits the memory of gene expression in the parental germ line to offspring and that this memory role is critical for the primordial germ cells (PGCs) to execute a proper germline program.
Germ cells transmit the genome from one generation to the next. The identity and immortality of germ cells are crucial for the perpetuation of species, yet the mechanisms that regulate these properties remain elusive. In C. elegans, the histone methyltransferase MES-4 is required for survival of the primordial germ cells. MES-4 methylates histone H3 at lysine 36 (H3K36), a modification previously linked to transcription elongation and involved in preventing aberrant transcription initiation within the body of genes. Surprisingly, our genome-wide analysis of MES-4 binding sites in C. elegans embryos revealed that MES-4 is capable of associating with genes that were expressed in the germ line of the parent worms but are no longer being actively transcribed in embryos. To our knowledge, this is the first example of transcription-uncoupled H3K36 methylation. We suggest that MES-4-generated H3K36 methylation serves an “epigenetic role,” by marking germline-expressed genes and by carrying the memory of gene expression from one generation of germ cells to the next.