Despite the success of genome-wide association studies (GWAS) in identifying loci associated with common diseases, a significant proportion of the causality remains unexplained. Recent advances in genomic technologies have placed us in a position to initiate large-scale studies of human disease-associated epigenetic variation, specifically variation in DNA methylation (DNAm). Such Epigenome-Wide Association Studies (EWAS) present novel opportunities but also create new challenges that are not encountered in GWAS. We discuss EWAS study design, cohort and sample selections, statistical significance and power, confounding factors, and follow-up studies. We also discuss how integration of EWAS with GWAS can help to dissect complex GWAS haplotypes for functional analysis.
Epigenomics; Disease Genetics; DNA Methylation; Epigenetics; Quantitative Trait
The T-box transcription factor Brachyury (T) is essential for formation of the posterior mesoderm and the notochord in vertebrate embryos. Work in the frog and the zebrafish has identified some direct genomic targets of Brachyury, but little is known about Brachyury targets in the mouse.
Here we use chromatin immunoprecipitation and mouse promoter microarrays to identify targets of Brachyury in embryoid bodies formed from differentiating mouse ES cells. The targets we identify are enriched for sequence-specific DNA binding proteins and include components of signal transduction pathways that direct cell fate in the primitive streak and tailbud of the early embryo. Expression of some of these targets, such as Axin2, Fgf8 and Wnt3a, is down regulated in Brachyury mutant embryos and we demonstrate that they are also Brachyury targets in the human. Surprisingly, we do not observe enrichment of the canonical T-domain DNA binding sequence 5′-TCACACCT-3′ in the vicinity of most Brachyury target genes. Rather, we have identified an (AC)n repeat sequence, which is conserved in the rat but not in human, zebrafish or Xenopus. We do not understand the significance of this sequence, but speculate that it enhances transcription factor binding in the regulatory regions of Brachyury target genes in rodents.
Our work identifies the genomic targets of a key regulator of mesoderm formation in the early mouse embryo, thereby providing insights into the Brachyury-driven genetic regulatory network and allowing us to compare the function of Brachyury in different species.
Methylated DNA immunoprecipitation followed by high-throughput sequencing (MeDIP-seq) has the potential to identify changes in DNA methylation important in cancer development. In order to understand the role of epigenetic modulation in the development of acute myeloid leukemia (AML) we have applied MeDIP-seq to the DNA of 12 AML patients and 4 normal bone marrows. This analysis revealed leukemia-associated differentially methylated regions that included gene promoters, gene bodies, CpG islands and CpG island shores. Two genes (SPHKAP and DPP6) with significantly methylated promoters were of interest and further analysis of their expression showed them to be repressed in AML. We also demonstrated considerable cytogenetic subtype specificity in the methylomes affecting different genomic features. Significantly distinct patterns of hypomethylation of certain interspersed repeat elements were associated with cytogenetic subtypes. The methylation patterns of members of the SINE family tightly clustered all leukemic patients with an enrichment of Alu repeats with a high CpG density (P < 0.0001). We were able to demonstrate significant inverse correlation between intragenic interspersed repeat sequence methylation and gene expression with SINEs showing the strongest inverse correlation (R² = 0.7). We conclude that the alterations in DNA methylation that accompany the development of AML affect not only the promoters, but also the non-promoter genomic features, with significant demethylation of certain interspersed repeat DNA elements being associated with AML cytogenetic subtypes. MeDIP-seq data were validated using bisulfite pyrosequencing and the Infinium array.
Monozygotic (MZ) twin pair discordance for childhood-onset Type 1 Diabetes (T1D) is ∼50%, implicating roles for genetic and non-genetic factors in the aetiology of this complex autoimmune disease. Although significant progress has been made in elucidating the genetics of T1D in recent years, the non-genetic component has remained poorly defined. We hypothesized that epigenetic variation could underlie some of the non-genetic component of T1D aetiology and, thus, performed an epigenome-wide association study (EWAS) for this disease. We generated genome-wide DNA methylation profiles of purified CD14+ monocytes (an immune effector cell type relevant to T1D pathogenesis) from 15 T1D–discordant MZ twin pairs. This identified 132 different CpG sites at which the direction of the intra-MZ pair DNA methylation difference significantly correlated with the diabetic state, i.e. T1D–associated methylation variable positions (T1D–MVPs). We confirmed these T1D–MVPs display statistically significant intra-MZ pair DNA methylation differences in the expected direction in an independent set of T1D–discordant MZ pairs (P = 0.035). Then, to establish the temporal origins of the T1D–MVPs, we generated two further genome-wide datasets and established that, when compared with controls, T1D–MVPs are enriched in singletons both before (P = 0.001) and at (P = 0.015) disease diagnosis, and also in singletons positive for diabetes-associated autoantibodies but disease-free even after 12 years follow-up (P = 0.0023). Combined, these results suggest that T1D–MVPs arise very early in the etiological process that leads to overt T1D. Our EWAS of T1D represents an important contribution toward understanding the etiological role of epigenetic variation in type 1 diabetes, and it is also the first systematic analysis of the temporal origins of disease-associated epigenetic variation for any human complex disease.
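The idea of testing whether the direction of an intra-pair methylation difference tracks disease status can be illustrated with a simple exact sign test. This is a minimal sketch only: the abstract does not state which statistic was used, and the function name `sign_test_p` and the example values are hypothetical.

```python
from math import comb


def sign_test_p(diffs):
    """Two-sided exact sign test on paired differences.

    Each value is assumed to be (affected co-twin minus unaffected co-twin)
    methylation at one CpG site; zero differences are dropped.
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    k = sum(1 for d in nonzero if d > 0)   # pairs where the affected twin is higher
    tail = min(k, n - k)                   # size of the smaller tail
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)


# Hypothetical example: all 15 discordant pairs differ in the same direction.
p_value = sign_test_p([0.1] * 15)
```

With only 15 pairs the smallest attainable two-sided p-value under this test is 2/2^15, which is one reason per-site significance in small twin cohorts needs independent replication.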
Type 1 diabetes (T1D) is a complex autoimmune disease affecting >30 million people worldwide. It is caused by a combination of genetic and non-genetic factors, leading to destruction of insulin-secreting cells. Although significant progress has recently been made in elucidating the genetics of T1D, the non-genetic component has remained poorly defined. Epigenetic modifications, such as methylation of DNA, are indispensable for genomic processes such as transcriptional regulation and are frequently perturbed in human disease. We therefore hypothesized that epigenetic variation could underlie some of the non-genetic component of T1D aetiology, and we performed a genome-wide DNA methylation analysis of a specific subset of immune cells (monocytes) from monozygotic twins discordant for T1D. This revealed the presence of T1D–specific methylation variable positions (T1D–MVPs) in the T1D–affected co-twins. Since these T1D–MVPs were found in MZ twins, they cannot be due to genetic differences. Additional experiments revealed that some of these T1D–MVPs are found in individuals before T1D diagnosis, suggesting they arise very early in the process that leads to overt T1D and are not simply due to post-disease associated factors (e.g. medication or long-term metabolic changes). T1D–MVPs may thus potentially represent a previously unappreciated, and important, component of type 1 diabetes risk.
Genomic sequences obtained through high-throughput sequencing are not uniformly distributed across the genome. For example, sequencing data of total genomic DNA show significant, yet unexpected enrichments on promoters and exons. This systematic bias is a particular problem for techniques such as chromatin immunoprecipitation, where the signal for a target factor is plotted across genomic features. We have focused on data obtained from Illumina’s Genome Analyser platform, where at least three factors contribute to sequence bias: GC content, mappability of sequencing reads, and regional biases that might be generated by local structure. We show that relying on input control as a normalizer is not generally appropriate due to sample to sample variation in bias. To correct sequence bias, we present BEADS (bias elimination algorithm for deep sequencing), a simple three-step normalization scheme that successfully unmasks real binding patterns in ChIP-seq data. We suggest that this procedure be done routinely prior to data interpretation and downstream analyses.
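Of the three bias sources named above, GC content is the easiest to illustrate. The following is a minimal sketch of GC reweighting, not the BEADS implementation: it assumes reads are weighted by the ratio of expected to observed frequency in their GC bin, and the function names, the bin count, and the toy sequences are all hypothetical.

```python
from collections import Counter


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_bin(seq, n_bins=10):
    """Assign a sequence to one of n_bins equal-width GC bins."""
    return min(int(gc_fraction(seq) * n_bins), n_bins - 1)


def gc_weights(reads, background, n_bins=10):
    """Per-bin weights = expected (background) frequency / observed (read) frequency."""
    observed = Counter(gc_bin(r, n_bins) for r in reads)
    expected = Counter(gc_bin(b, n_bins) for b in background)
    weights = {}
    for b, count in observed.items():
        obs_freq = count / len(reads)
        exp_freq = expected[b] / len(background)
        weights[b] = exp_freq / obs_freq
    return weights


# Toy data: GC-rich reads are over-represented relative to the background.
reads = ["GGGG", "GGGG", "GGGG", "AAAA"]
background = ["GGGG", "AAAA"]
weights = gc_weights(reads, background)
```

Here the over-represented GC-rich bin receives a weight below 1 and the under-represented AT-rich bin a weight above 1, so the weighted coverage matches the background GC distribution.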
DNA methylation constitutes the most stable type of epigenetic modification modulating the transcriptional plasticity of mammalian genomes. Using bisulfite DNA sequencing, we report high-resolution methylation reference profiles of human chromosomes 6, 20 and 22, providing a resource of about 1.9 million CpG methylation values derived from 12 different tissues. Analysis of six annotation categories revealed evolutionarily conserved regions to be the predominant sites for differential DNA methylation and a core region surrounding the transcriptional start site to be an informative surrogate for promoter methylation. We find 17% of the 873 analyzed genes differentially methylated in their 5′-untranslated regions (5′-UTR) and about one third of the differentially methylated 5′-UTRs to be inversely correlated with transcription. While our study was controlled for factors reported to affect DNA methylation such as sex and age, we did not find any significant attributable effects. Our data suggest DNA methylation to be ontogenetically more stable than previously thought.
Summary: Dalliance is a new genome viewer which offers a high level of interactivity while running within a web browser. All data is fetched using the established distributed annotation system (DAS) protocol, making it easy to customize the browser and add extra data.
Computational methods attempting to identify instances of cis-regulatory modules (CRMs) in the genome face a challenging problem of searching for potentially interacting transcription factor binding sites while knowledge of the specific interactions involved remains limited. Without a comprehensive comparison of their performance, the reliability and accuracy of these tools remain unclear. Faced with a large number of different tools that address this problem, we summarized and categorized them based on search strategy and input data requirements. Twelve representative methods were chosen and applied to predict CRMs from the Drosophila CRM database REDfly, and across the human ENCODE regions. Our results show that the optimal choice of method varies depending on species and composition of the sequences in question. When discriminating CRMs from non-coding regions, those methods considering evolutionary conservation have a stronger predictive power than methods designed to be run on a single genome. Different CRM representations and search strategies rely on different CRM properties, and different methods can complement one another. For example, some favour homotypical clusters of binding sites, while others perform best on short CRMs. Furthermore, most methods appear to be sensitive to the composition and structure of the genome to which they are applied. We analyze the principal features that distinguish the methods that performed well, identify weaknesses leading to poor performance, and provide a guide for users. We also propose key considerations for the development and evaluation of future CRM-prediction methods.
Transcriptional regulation involves multiple transcription factors binding to DNA sequences. A limited repertoire of transcription factors performs this complex regulatory step through various spatial and temporal interactions between themselves and their binding sites. These transcription factor binding interactions are clustered as distinct modules: cis-regulatory modules (CRMs). Computational methods attempting to identify instances of CRMs in the genome face a challenging problem because a majority of these interactions between transcription factors remain unknown. To investigate the reliability and accuracy of these methods, we chose twelve representative methods and applied them to predict CRMs on both the fly and human genomes. Our results show that the optimal choice of method varies depending on species and composition of the sequences in question. Different CRM representations and search strategies rely on different CRM properties, and different methods can complement one another. We provide a guide for users and key considerations for developers. We also expect that, along with new technology generating new types of genomic data, future CRM prediction methods will be able to reveal transcription factor binding interactions in three-dimensional space.
Recent multi-dimensional approaches to the study of complex disease have revealed powerful insights into how genetic and epigenetic factors may underlie their aetiopathogenesis. We examined genotype-epigenotype interactions in the context of Type 2 Diabetes (T2D), focussing on known regions of genomic susceptibility. We assayed DNA methylation in 60 females, stratified according to disease susceptibility haplotype using previously identified association loci. CpG methylation was assessed using methylated DNA immunoprecipitation on a targeted array (MeDIP-chip) and absolute methylation values were estimated using a Bayesian algorithm (BATMAN). Absolute methylation levels were quantified across LD blocks, and we identified increased DNA methylation on the FTO obesity susceptibility haplotype, tagged by the rs8050136 risk allele A (p = 9.40×10⁻⁴, permutation p = 1.0×10⁻³). Further analysis across the 46 kb LD block using sliding windows localised the most significant difference to be within a 7.7 kb region (p = 1.13×10⁻⁷). Sequence level analysis, followed by pyrosequencing validation, revealed that the methylation difference was driven by the co-ordinated phase of CpG-creating SNPs across the risk haplotype. This 7.7 kb region of haplotype-specific methylation (HSM) encapsulates a Highly Conserved Non-Coding Element (HCNE) that has previously been validated as a long-range enhancer, supported by the histone H3K4me1 enhancer signature. This study demonstrates that integration of Genome-Wide Association (GWA) SNP and epigenomic DNA methylation data can identify potential novel genotype-epigenotype interactions within disease-associated loci, thus providing a novel route to aid unravelling common complex diseases.
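The sliding-window localisation step can be sketched as follows. This is an illustrative outline only: it reports the mean methylation difference per window rather than a p-value, and the window size, step, function names, and toy values are assumptions that do not come from the study.

```python
def sliding_windows(start, end, size, step):
    """Yield half-open (window_start, window_end) intervals that fit inside [start, end)."""
    pos = start
    while pos + size <= end:
        yield pos, pos + size
        pos += step


def window_mean_diff(positions, risk, nonrisk, start, end, size, step):
    """Mean (risk minus non-risk) methylation difference for every non-empty window.

    positions: probe or CpG coordinates; risk / nonrisk: group-mean methylation
    at each position for carriers of the risk and non-risk haplotypes.
    """
    results = []
    for w_start, w_end in sliding_windows(start, end, size, step):
        idx = [i for i, p in enumerate(positions) if w_start <= p < w_end]
        if not idx:
            continue  # no probes in this window
        diff = sum(risk[i] - nonrisk[i] for i in idx) / len(idx)
        results.append((w_start, w_end, diff))
    return results


# Toy example: the difference is confined to the second half of the region.
windows = window_mean_diff(
    positions=[0, 5, 10, 15],
    risk=[0.5, 0.5, 0.9, 0.9],
    nonrisk=[0.5, 0.5, 0.5, 0.5],
    start=0, end=20, size=10, step=10,
)
top_window = max(windows, key=lambda w: abs(w[2]))
```

In a real analysis the per-window mean difference would be replaced by a statistical test across individuals, with correction for the number of overlapping windows examined.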
Mouse Embryonic Stem (ES) cells express a unique set of microRNAs (miRNAs), the miR-290-295 cluster. Elucidating the role of these miRNAs and how they integrate into the ES cell regulatory network requires identification of their direct regulatory targets. The difficulty, however, arises from the limited complementarity of metazoan miRNAs to their targets, with the interaction requiring as few as six nucleotides of the miRNA seed sequence. To identify miR-294 targets, we used Dicer1-null ES cells, which lack all endogenous mature miRNAs, and introduced just miR-294 into these ES cells. We then employed two approaches to discover miR-294 targets in mouse ES cells: transcriptome profiling using microarrays and a biochemical approach to isolate mRNA targets associated with the Argonaute2 (Ago2) protein of the RISC (RNA Induced Silencing Complex) effector, followed by RNA-sequencing. In the absence of Dicer1, the RISC complexes are largely devoid of mature miRNAs and should therefore contain only transfected miR-294 and its base-paired targets. Our data suggest that miR-294 may promote pluripotency by regulating a subset of c-Myc target genes and upregulating pluripotency-associated genes such as Lin28.
Stem cells in plants and animals contain many small RNAs, which help to regulate differentiation into diverse cell types. Mutation in a gene necessary for the maturation of small RNAs in plants causes the stem cells (called meristem cells) to remain in an indeterminate, overproliferating state. Similarly in worms, a small RNA called lin-4 miRNA prevents “stem cell–like cells” appearing at inappropriate times. Thus, it is important to determine the precise functions of key individual small RNAs in embryonic stem cells. To address this, we created embryonic stem cells lacking all miRNAs into which we introduced a single miRNA. We discovered that a single miRNA could affect the expression of many genes in stem cells, which in turn regulate key properties of stem cells. These together help establish an intricate network of gene regulation in stem cells that defines their properties. Our findings are of broad interest because different miRNAs have critical functions in diverse cell types in developing embryos. It is important to understand the function of these molecules also because misregulation of miRNA function underlies some human diseases, including cancers.
DNA methylation can regulate gene expression by modulating the interaction between DNA and proteins or protein complexes. Conserved consensus motifs exist across the human genome ("predicted transcription factor binding sites": "predicted TFBS") but the large majority of these are proven by chromatin immunoprecipitation and high throughput sequencing (ChIP-seq) not to be biological transcription factor binding sites ("empirical TFBS"). We hypothesize that DNA methylation at conserved consensus motifs prevents promiscuous or disorderly transcription factor binding.
Using genome-wide methylation maps of the human heart and sperm, we found that all conserved consensus motifs, as well as the subset of those that reside outside CpG islands, have an aggregate profile of hyper-methylation. In contrast, empirical TFBS with conserved consensus motifs have a profile of hypo-methylation. 40% of empirical TFBS with conserved consensus motifs resided in CpG islands whereas only 7% of all conserved consensus motifs were in CpG islands. Finally, we identified a minority subset of TF whose profiles are either hypo-methylated or neutral at their respective conserved consensus motifs, implying that these TF may be responsible for establishing or maintaining an un-methylated DNA state, or that their binding is not regulated by DNA methylation.
Our analysis supports the hypothesis that at least for a subset of TF, empirical binding to conserved consensus motifs genome-wide may be controlled by DNA methylation.
Development of high-throughput methods for measuring DNA interactions of transcription factors together with computational advances in short motif inference algorithms is expanding our understanding of transcription factor binding site motifs. The consequential growth of sequence motif data sets makes it important to systematically group and categorise regulatory motifs. It has been shown that there are familial tendencies in DNA sequence motifs that are predictive of the family of factors that binds them. Further development of methods that detect and describe familial motif trends has the potential to help in measuring the similarity of novel computational motif predictions to previously known data and sensitively detecting regulatory motifs similar to previously known ones from novel sequence.
We propose a probabilistic model for position weight matrix (PWM) sequence motif families. The model, which we call the 'metamotif', describes recurring familial patterns in a set of motifs. The metamotif framework models variation within a family of sequence motifs. It allows for simultaneous estimation of a series of independent metamotifs from input position weight matrix (PWM) motif data and does not assume that all input motif columns contribute to a familial pattern. We describe an algorithm for inferring metamotifs from weight matrix data. We then demonstrate the use of the model in two practical tasks: in the Bayesian NestedMICA model inference algorithm as a PWM prior to enhance motif inference sensitivity, and in a motif classification task where motifs are labelled according to their interacting DNA binding domain.
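One way to make a probability model over PWM columns concrete is to treat each metamotif column as a Dirichlet distribution over the four nucleotide weights and score an aligned PWM by summing per-column log densities. The sketch below assumes that formulation and a fixed, gap-free alignment; the function names and parameter values are hypothetical and this is not the NestedMICA code.

```python
import math


def dirichlet_logpdf(column, alpha):
    """Log density of one PWM column (weights summing to 1) under Dirichlet(alpha)."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(x) for a, x in zip(alpha, column))


def metamotif_loglik(pwm, metamotif):
    """Sum of per-column Dirichlet log densities for a PWM aligned to a metamotif.

    pwm: list of columns, each a list of four strictly positive weights (A, C, G, T).
    metamotif: list of Dirichlet parameter vectors, one per column.
    """
    if len(pwm) != len(metamotif):
        raise ValueError("PWM and metamotif must have the same number of columns")
    return sum(dirichlet_logpdf(col, alpha) for col, alpha in zip(pwm, metamotif))


# Hypothetical two-column metamotif: strongly prefers A, then is uninformative.
metamotif = [[10.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
a_rich = [[0.85, 0.05, 0.05, 0.05], [0.25, 0.25, 0.25, 0.25]]
c_rich = [[0.05, 0.85, 0.05, 0.05], [0.25, 0.25, 0.25, 0.25]]
family_like = metamotif_loglik(a_rich, metamotif) > metamotif_loglik(c_rich, metamotif)
```

Under this reading, a column with all parameters equal to 1 places no constraint on the motif, which is one way a model can avoid assuming that every input column contributes to the familial pattern.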
We show that metamotifs can be used as PWM priors in the NestedMICA motif inference algorithm to dramatically increase the sensitivity of motif inference. Metamotifs were also successfully applied to a motif classification problem where sequence motif features were used to predict the family of protein DNA binding domains that would interact with the motif. The metamotif based classifier is shown to compare favourably to previous related methods. The metamotif has great potential for further use in machine learning tasks, especially those related to de novo computational sequence motif inference. The metamotif methods presented have been incorporated into the NestedMICA suite.
Motivation: Short sequence motifs are an important class of models in molecular biology, used most commonly for describing transcription factor binding site specificity patterns. High-throughput methods have recently been developed for detecting regulatory factor binding sites in vivo and in vitro, and consequently high-quality binding site motif data are becoming available for an increasing number of organisms and regulatory factors. Development of intuitive tools for the study of sequence motifs is therefore important.
iMotifs is a graphical motif analysis environment that allows visualization of annotated sequence motifs and scored motif hits in sequences. It also offers motif inference with the sensitive NestedMICA algorithm, as well as overrepresentation and pairwise motif matching capabilities. All of the analysis functionality is provided without the need to convert between file formats or learn different command line interfaces.
The application includes a bundled and graphically integrated version of the NestedMICA motif inference suite that has no outside dependencies. Problems associated with local deployment of software are therefore avoided.
Availability: iMotifs is licensed with the GNU Lesser General Public License v2.0 (LGPL 2.0). The software and its source are available at http://wiki.github.com/mz2/imotifs and can be run on Mac OS X Leopard (Intel/PowerPC). We also provide a cross-platform (Linux, OS X, Windows) LGPL 2.0 licensed library libxms for the Perl, Ruby, R and Objective-C programming languages for input and output of XMS formatted annotated sequence motif set files.
Epigenetic mechanisms such as microRNA and histone modification are crucially responsible for dysregulated gene expression in heart failure. In contrast, the role of DNA methylation, another well-characterized epigenetic mark, is unknown. In order to examine whether human cardiomyopathies of different etiologies are connected by a unifying pattern of DNA methylation, we undertook profiling with ischaemic and idiopathic end-stage cardiomyopathic left ventricular (LV) explants from patients who had undergone cardiac transplantation compared to normal control. We performed a preliminary analysis using methylated-DNA immunoprecipitation-chip (MeDIP-chip), validated differential methylation loci by bisulfite-(BS) PCR and high throughput sequencing, and identified 3 angiogenesis-related genetic loci that were differentially methylated. Using quantitative RT-PCR, we found that the expression of these genes differed significantly between CM hearts and normal control (p<0.01). Moreover, for each individual LV tissue, differential methylation showed a predicted correlation to differential expression of the corresponding gene. Thus, differential DNA methylation exists in human cardiomyopathy. In this series of heterogeneous cardiomyopathic LV explants, differential DNA methylation was found in at least 3 angiogenesis-related genes. While in other systems, changes in DNA methylation at specific genomic loci usually precede changes in the expression of corresponding genes, our current findings in cardiomyopathy merit further investigation to determine whether DNA methylation changes play a causative role in the progression of heart failure.
The genome of extraembryonic tissue, such as the placenta, is hypomethylated relative to that in somatic tissues. However, the origin and role of this hypomethylation remains unclear. The DNA methyltransferases DNMT1, -3A, and -3B are the primary mediators of the establishment and maintenance of DNA methylation in mammals. In this study, we investigated promoter methylation-mediated epigenetic down-regulation of DNMT genes as a potential regulator of global methylation levels in placental tissue. Although DNMT3A and -3B promoters lack methylation in all somatic and extraembryonic tissues tested, we found specific hypermethylation of the maintenance DNA methyltransferase (DNMT1) gene and found hypomethylation of the DNMT3L gene in full term and first trimester placental tissues. Bisulfite DNA sequencing revealed monoallelic methylation of DNMT1, with no evidence of imprinting (parent of origin effect). In vitro reporter experiments confirmed that DNMT1 promoter methylation attenuates transcriptional activity in trophoblast cells. However, global hypomethylation in the absence of DNMT1 down-regulation is apparent in non-primate placentas and in vitro derived human cytotrophoblast stem cells, suggesting that DNMT1 down-regulation is not an absolute requirement for genomic hypomethylation in all instances. These data represent the first demonstration of methylation-mediated regulation of the DNMT1 gene in any system and demonstrate that the unique epigenome of the human placenta includes down-regulation of DNMT1 with concomitant hypomethylation of the DNMT3L gene. This strongly implicates epigenetic regulation of the DNMT gene family in the establishment of the unique epigenetic profile of extraembryonic tissue in humans.
Development Differentiation/Tissue; DNA/Methylation; DNA/Methyltransferase; Epigenetics; Gene Transcription; Extraembryonic Tissue; Placenta; Trophoblast
Variation in patterns of methylations of histone tails reflects and modulates chromatin structure and function [1-3]. To provide a framework for the analysis of chromatin function in C. elegans, we generated a genome-wide map of histone H3 tail methylations. We find that C. elegans genes show similarities in distributions of histone modifications to those of other organisms, with H3K4me3 near transcription start sites, H3K36me3 in the body of genes, and H3K9me3 enriched on silent genes. Unexpectedly, we also observe a striking novel pattern: exons are preferentially marked with H3K36me3 relative to introns. H3K36me3 exon marking is dependent on transcription and its level is lower in alternatively spliced exons, supporting a splicing related marking mechanism. We further show that the difference in H3K36me3 marking between exons and introns is evolutionarily conserved in human and mouse. We propose that H3K36me3 exon marking in chromatin provides a dynamic link between transcription and splicing.
Plasma concentrations of biologically active vitamin D (1,25-(OH)2D) are tightly controlled via feedback regulation of renal 1α-hydroxylase (CYP27B1; positive) and 24-hydroxylase (CYP24A1; catabolic) enzymes. In pregnancy, this regulation is uncoupled, and 1,25-(OH)2D levels are significantly elevated, suggesting a role in pregnancy progression. Epigenetic regulation of CYP27B1 and CYP24A1 has previously been described in cell and animal models, and despite emerging evidence for a critical role of epigenetics in placentation generally, little is known about the regulation of enzymes modulating vitamin D homeostasis at the fetomaternal interface. In this study, we investigated the methylation status of genes regulating vitamin D bioavailability and activity in the placenta. No methylation of the VDR (vitamin D receptor) and CYP27B1 genes was found in any placental tissues. In contrast, the CYP24A1 gene is methylated in human placenta, purified cytotrophoblasts, and primary and cultured chorionic villus sampling tissue. No methylation was detected in any somatic human tissue tested. Methylation was also evident in marmoset and mouse placental tissue. All three genes were hypermethylated in choriocarcinoma cell lines, highlighting the role of vitamin D deregulation in this cancer. Gene expression analysis confirmed a reduced capacity for CYP24A1 induction with promoter methylation in primary cells and in vitro reporter analysis demonstrated that promoter methylation directly down-regulates basal promoter activity and abolishes vitamin D-mediated feedback activation. This study strongly suggests that epigenetic decoupling of vitamin D feedback catabolism plays an important role in maximizing active vitamin D bioavailability at the fetomaternal interface.
A central aim of cancer research has been to identify the mutated genes that are causally implicated in oncogenesis (‘cancer genes’). After two decades of searching, how many have been identified and how do they compare to the complete gene set that has been revealed by the human genome sequence? We have conducted a ‘census’ of cancer genes that indicates that mutations in more than 1% of genes contribute to human cancer. The census illustrates striking features in the types of sequence alteration, cancer classes in which oncogenic mutations have been identified and protein domains that are encoded by cancer genes.
The distributed annotation system (DAS) defines a communication protocol used to exchange biological annotations. It is motivated by the idea that annotations should not be provided by single centralized databases but instead be spread over multiple sites. Data distribution, performed by DAS servers, is separated from visualization, which is carried out by DAS clients. The original DAS protocol was designed to serve annotation of genomic sequences. We have extended the protocol to be applicable to macromolecular structures. Here we present SPICE, a new DAS client that can be used to visualize protein sequence and structure annotations.
DNA methylation is an indispensable epigenetic modification of mammalian genomes. Consequently there is great interest in strategies for genome-wide/whole-genome DNA methylation analysis, and immunoprecipitation-based methods have proven to be a powerful option. Such methods are rapidly shifting the bottleneck from data generation to data analysis, necessitating the development of better analytical tools. Until now, a major analytical difficulty associated with immunoprecipitation-based DNA methylation profiling has been the inability to estimate absolute methylation levels. Here we report the development of a novel cross-platform algorithm – Bayesian Tool for Methylation Analysis (Batman) – for analyzing Methylated DNA Immunoprecipitation (MeDIP) profiles generated using arrays (MeDIP-chip) or next-generation sequencing (MeDIP-seq). The latter is an approach we have developed to elucidate the first high-resolution whole-genome DNA methylation profile (DNA methylome) of any mammalian genome. MeDIP-seq/MeDIP-chip combined with Batman represent robust, quantitative, and cost-effective functional genomic strategies for elucidating the function of DNA methylation.
The Distributed Annotation System (DAS) is a widely adopted protocol for dynamically integrating a wide range of biological data from geographically diverse sources. DAS continues to expand its applicability and evolve in response to new challenges facing integrative bioinformatics.
Here we describe the various infrastructure components of DAS and present a new extended version of the DAS specification. Version 1.53E incorporates several recent developments, including its extension to serve new data types and an ontology for protein features.
Our extensions to the DAS protocol have facilitated the integration of new data types, and our improvements to the existing DAS infrastructure have addressed recent challenges. The steadily increasing number of available data sources demonstrates further adoption of the DAS protocol.
Discovering overrepresented patterns in amino acid sequences is an important step in protein functional element identification. We adapted and extended NestedMICA, an ab initio motif finder originally developed for finding transcription factor binding site motifs, to find short protein signals, and compared its performance with another popular protein motif finder, MEME. NestedMICA, an open source protein motif discovery tool written in Java, is driven by a Monte Carlo technique called Nested Sampling. It uses multi-class sequence background models to represent different "uninteresting" parts of sequences that do not contain motifs of interest. In order to assess NestedMICA as a protein motif finder, we have tested it on synthetic datasets produced by spiking instances of known motifs into a randomly selected set of protein sequences. NestedMICA was also tested using a biologically authentic test set, where we evaluated its performance with respect to varying sequence length.
Generally NestedMICA recovered most of the short (3–9 amino acid long) test protein motifs spiked into a test set of sequences at different frequencies. We also showed that it can be used to find multiple motifs at the same time. In all the assessment experiments we carried out, its overall motif discovery performance was better than that of MEME.
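The construction of a spiked synthetic dataset can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the background sequences here are generated uniformly at random rather than sampled from real proteins, the motif is a fixed string rather than an instance drawn from a motif model, and the names `random_protein` and `spike_motif` are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein(length, rng):
    """Return a uniformly random amino acid string (assumed stand-in for a real protein)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def spike_motif(sequences, motif, fraction, rng):
    """Overwrite a random window with `motif` in a given fraction of the sequences.

    Returns (spiked_sequences, positions), where positions[i] is the offset of
    the spiked motif in sequence i, or None if that sequence was left untouched.
    """
    n_spiked = round(fraction * len(sequences))
    chosen = set(rng.sample(range(len(sequences)), n_spiked))
    spiked, positions = [], []
    for i, seq in enumerate(sequences):
        if i in chosen and len(seq) >= len(motif):
            start = rng.randrange(len(seq) - len(motif) + 1)
            # Replace residues in place so the sequence length is unchanged
            seq = seq[:start] + motif + seq[start + len(motif):]
            positions.append(start)
        else:
            positions.append(None)
        spiked.append(seq)
    return spiked, positions


rng = random.Random(0)  # fixed seed so the dataset is reproducible
background = [random_protein(100, rng) for _ in range(50)]
# "KDEL" is an arbitrary example motif; 0.2 spikes it into 20% of the sequences
spiked, positions = spike_motif(background, "KDEL", 0.2, rng)
```

Keeping the recorded positions alongside the sequences makes it possible to check afterwards whether a motif finder recovered the spiked instances, and varying `fraction` mimics motifs that occur at different frequencies.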
NestedMICA proved itself to be a robust and sensitive ab initio protein motif finder, even for relatively short motifs that exist in only a small fraction of sequences.
NestedMICA is available under the Lesser GPL open-source license from:
The Distributed Annotation System (DAS) is a network protocol for exchanging biological data. It is frequently used to share annotations of genomes and protein sequence.
Here we present several extensions to the current DAS 1.5 protocol. These provide new commands to share alignments and three-dimensional molecular structure data, add the possibility of registering and discovering DAS servers, and define a convention for providing different types of data plots. We present examples of web sites and applications that use the new extensions. We operate a public registry of DAS sources, which now includes entries for more than 250 distinct sources.
Our DAS extensions are essential for the management of the growing number of services and exchange of diverse biological data sets. In addition the extensions allow new types of applications to be developed and scientific questions to be addressed. The registry of DAS sources is available at
A key step in understanding gene regulation is to identify the repertoire of transcription factor binding motifs (TFBMs) that form the building blocks of promoters and other regulatory elements. Identifying these experimentally is very laborious, and the number of TFBMs discovered remains relatively small, especially when compared with the hundreds of transcription factor genes predicted in metazoan genomes. We have used a recently developed statistical motif discovery approach, NestedMICA, to detect candidate TFBMs from a large set of Drosophila melanogaster promoter regions. Of the 120 motifs inferred in our initial analysis, 25 were statistically significant matches to previously reported motifs, while 87 appeared to be novel. Analysis of sequence conservation and motif positioning suggested that the great majority of these discovered motifs are predictive of functional elements in the genome. Many motifs showed associations with specific patterns of gene expression in the D. melanogaster embryo, and we were able to obtain confident annotation of expression patterns for 25 of our motifs, including eight of the novel motifs. The motifs are available through Tiffin, a new database of DNA sequence motifs. We have discovered many new motifs that are overrepresented in D. melanogaster promoter regions, and offer several independent lines of evidence that these are novel TFBMs. Our motif dictionary provides a solid foundation for further investigation of regulatory elements in Drosophila, and demonstrates techniques that should be applicable in other species. We suggest that further improvements in computational motif discovery should narrow the gap between the set of known motifs and the total number of transcription factors in metazoan genomes.
In contrast to the genomic sequences that encode proteins, little is known about the regulatory elements that instruct the cell as to when and where a given gene should be active. Regulatory elements are thought to consist of clusters of short DNA words (motifs), each of which acts as a binding site for a sequence-specific DNA-binding protein. Thus, building a comprehensive dictionary of such motifs is an important step towards a broader understanding of gene regulation. Using the recently published NestedMICA method for detecting overrepresented motifs in a set of sequences, we build a dictionary of 120 motifs from regulatory sequences in the fruitfly genome, 87 of which are novel. Analysis of positional biases, conservation across species, and association with specific patterns of gene expression in fruitfly embryos suggests that the great majority of these newly discovered motifs represent functional regulatory elements. In addition to providing an initial motif dictionary for one of the most intensively studied model organisms, this work provides an analytical framework for the comprehensive discovery of regulatory motifs in complex animal genomes.
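One common way to use an entry from such a motif dictionary is to scan a sequence with a position weight matrix and report the best-scoring window. The sketch below shows this generic log-odds scan; it is not the NestedMICA algorithm, and the count matrix, the uniform background of 0.25 per base, the pseudocount and the function names are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def best_hit(sequence, matrix):
    """Return (start, score) of the highest-scoring window, or None if the sequence is too short.

    The sequence is assumed to contain only the characters A, C, G and T.
    """
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Hypothetical counts for a four-base motif that strongly prefers T, A, T, A
counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
matrix = log_odds_matrix(counts)
hit = best_hit("GGGTATAGG", matrix)
print(hit)
```

In practice a score threshold would be chosen per motif, and both strands of the DNA would be scanned; both are left out here to keep the sketch short.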
The splicing of RNA transcripts is thought to be partly promoted and regulated by sequences embedded within exons. Known sequences include binding sites for SR proteins, which are thought to mediate interactions between splicing factors bound to the 5' and 3' splice sites. It would be useful to identify further candidate sequences; however, identifying them computationally is hard, since exon sequences are also constrained by their functional role in coding for proteins.
This strategy identified a collection of motifs including several previously reported splice enhancer elements. Although only trained on coding exons, the model discriminates both coding and non-coding exons from intragenic sequence.
We have trained a computational model able to detect signals in coding exons which seem to be orthogonal to the sequences' primary function of coding for proteins. We believe that many of the motifs detected here represent binding sites for previously unrecognized proteins that influence RNA splicing, as well as other regulatory elements.