Characterizing the DNA-binding specificities of transcription factors is a key problem in computational biology that has been addressed by multiple algorithms. These usually take as input sequences that are putatively bound by the same factor and output one or more DNA motifs. A common practice is to apply several such algorithms simultaneously to improve coverage at the price of redundancy. In interpreting such results, two tasks are crucial: clustering of redundant motifs, and attributing the motifs to transcription factors by retrieval of similar motifs from previously characterized motif libraries. Both tasks inherently involve motif comparison. Here we present a novel method for comparing and merging motifs, based on Bayesian probabilistic principles. This method takes into account both the similarity in positional nucleotide distributions of the two motifs and their dissimilarity to the background distribution. We demonstrate the use of the new comparison method as a basis for motif clustering and retrieval procedures, and compare it to several commonly used alternatives. Our results show that the new method outperforms other available methods in accuracy and sensitivity. We incorporated the resulting motif clustering and retrieval procedures in a large-scale automated pipeline for analyzing DNA motifs. This pipeline integrates the results of various DNA motif discovery algorithms and automatically merges redundant motifs from multiple training sets into a coherent annotated library of motifs. Application of this pipeline to recent genome-wide transcription factor location data in S. cerevisiae successfully identified DNA motifs in a manner that is as good as semi-automated analysis reported in the literature. Moreover, we show how this analysis elucidates the mechanisms of condition-specific preferences of transcription factors.
Regulation of gene expression plays a central role in the activity of living cells and in their response to internal (e.g., cell division) or external (e.g., stress) stimuli. Key players in determining gene-specific regulation are transcription factors that bind sequence-specific sites on the DNA, modulating the expression of nearby genes. To understand the regulatory program of the cell, we need to identify these transcription factors, when they act, and on which genes. Transcription regulatory maps can be assembled by computational analysis of experimental data, by discovering the DNA recognition sequences (motifs) of transcription factors and their occurrences along the genome. Such an analysis usually results in a large number of overlapping motifs. To reconstruct regulatory maps, it is crucial to combine similar motifs and to relate them to transcription factors. To this end we developed an accurate, fully automated method, termed BLiC, based upon an improved similarity measure for comparing DNA motifs. By applying it to genome-wide data in yeast, we identified the DNA motifs of transcription factors and their putative target genes. Finally, we analyze motifs of transcription factors that alter their target genes under different conditions, and show how cells adjust their regulatory program in response to environmental changes.
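To make the idea of comparing positional nucleotide distributions concrete, the following minimal sketch scores two aligned motifs of equal length by the mean Jensen-Shannon divergence of their columns. This is only an illustration of column-wise comparison: it is not the Bayesian score described above, it ignores the background distribution and alignment offsets, and the function names are hypothetical.

```python
from math import log2


def js_divergence(p, q):
    """Jensen-Shannon divergence (in bits) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 wherever x_i > 0.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def motif_distance(motif1, motif2):
    """Mean column-wise JS divergence between two aligned, equal-length motifs.

    Each motif is a list of columns; each column is a list of four
    probabilities in the order A, C, G, T (an assumption of this sketch).
    """
    if len(motif1) != len(motif2):
        raise ValueError("motifs must have the same number of columns")
    return sum(js_divergence(c1, c2) for c1, c2 in zip(motif1, motif2)) / len(motif1)


if __name__ == "__main__":
    m1 = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]]
    m2 = [[0.6, 0.2, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
    print(motif_distance(m1, m1))  # identical motifs give 0.0
    print(motif_distance(m1, m2))  # larger values mean less similar motifs
```

A distance of this kind treats an uninformative column that matches the background as a perfect match, which is the weakness that a score accounting for dissimilarity to the background is meant to address.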
Motif discovery aims to detect short, highly conserved patterns in a collection of unaligned DNA or protein sequences. Discriminative motif finding algorithms aim to increase the sensitivity and selectivity of motif discovery by utilizing a second set of sequences, and searching only for patterns that can differentiate the two sets of sequences. Potential applications of discriminative motif discovery include discovering transcription factor binding site motifs in ChIP-chip data and finding protein motifs involved in thermal stability using sets of orthologous proteins from thermophilic and mesophilic organisms.
We describe DEME, a discriminative motif discovery algorithm for use with protein and DNA sequences. Input to DEME is two sets of sequences: a "positive" set and a "negative" set. DEME represents motifs using a probabilistic model, and uses a novel combination of global and local search to find the motif that optimally discriminates between the two sets of sequences. DEME is unique among discriminative motif finders in that it uses an informative Bayesian prior on protein motif columns, allowing it to incorporate prior knowledge of residue characteristics. We also introduce four synthetic discriminative motif discovery problems that are designed for evaluating discriminative motif finders in various biologically motivated contexts. We test DEME using these synthetic problems and on two biological problems: finding yeast transcription factor binding motifs in ChIP-chip data, and finding motifs that discriminate between groups of thermophilic and mesophilic orthologous proteins.
Using artificial data, we show that DEME is more effective than a non-discriminative approach when there are "decoy" motifs or when a variant of the motif is present in the "negative" sequences. With real data, we show that DEME is as good as, but not better than, non-discriminative algorithms at discovering yeast transcription factor binding motifs. We also show that DEME can find highly informative thermal-stability protein motifs. Binaries for the stand-alone program DEME are free for academic use and are available at
Identification of transcription factor binding sites (also called ‘motif discovery’) in DNA sequences is a basic step in understanding genetic regulation. Although many successful programs have been developed, the problem is far from being solved on account of diversity in gene expression/regulation and the low specificity of binding sites. State-of-the-art algorithms have their own constraints (e.g., high time or space complexity for finding long motifs, low precision in identification of weak motifs, or the OOPS constraint: one occurrence of the motif instance per sequence) which limit their scope of application.
In this paper, we present a novel and fast algorithm we call TFBSGroup. It is based on community detection from a graph and is used to discover long and weak (l, d) motifs under the ZOMOPS constraint (zero, one or multiple occurrence(s) of the motif instance(s) per sequence), where l is the length of a motif and d is the maximum number of mutations between a motif instance and the motif itself. First, TFBSGroup transforms the (l, d) motif search in sequences into the discovery of dense subgraphs within a graph. It identifies these subgraphs using a fast community detection method to obtain coarse-grained candidate motifs. Next, it greedily refines these candidate motifs towards the true motif within their own communities. Empirical studies on synthetic (l, d) samples have shown that TFBSGroup is very efficient (e.g., it can find true (18, 6) and (24, 8) motifs within 30 seconds). More importantly, the algorithm has succeeded in rapidly identifying motifs in a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB. The algorithm has also accurately identified motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal.
Our novel heuristic algorithm, TFBSGroup, is able to quickly identify nearly exact matches for long and weak (l, d) motifs in DNA sequences under the ZOMOPS constraint. It is also capable of finding motifs in real applications. The source code for TFBSGroup can be obtained from http://bioinformatics.bioengr.uic.edu/TFBSGroup/.
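The (l, d) definition and the ZOMOPS setting can be illustrated with a short sketch that, for a given motif, lists every position in each sequence where an l-mer lies within d mismatches. This only checks instances of a known motif; it is not the community-detection search used by TFBSGroup, and the function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def motif_instances(motif: str, sequences: list[str], d: int) -> list[list[int]]:
    """For each sequence, list start positions of l-mers within d mismatches of motif.

    A sequence may contribute zero, one, or several positions, which is the
    ZOMOPS setting (as opposed to exactly one occurrence per sequence).
    """
    l = len(motif)
    return [
        [i for i in range(len(seq) - l + 1) if hamming(seq[i:i + l], motif) <= d]
        for seq in sequences
    ]


if __name__ == "__main__":
    seqs = ["TTACGTAA", "GGGGGGGG", "ACGTACCT"]
    print(motif_instances("ACGT", seqs, 1))  # [[2], [], [0, 4]]
```

In the example, the second sequence holds no instance and the third holds two, one exact and one with a single mutation.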
SCOPE is an ensemble motif finder that uses three component algorithms in parallel to identify potential regulatory motifs by over-representation and motif position preference [1]. Each component algorithm is optimized to find a different kind of motif. By taking the best of these three approaches, SCOPE performs better than any single algorithm, even in the presence of noisy data [1]. In this article, we utilize a web version of SCOPE [2] to examine genes that are involved in telomere maintenance. SCOPE has been incorporated into at least two other motif finding programs [3,4] and has been used in other studies [5-8].
The three algorithms that comprise SCOPE are BEAM [9], which finds non-degenerate motifs (ACCGGT), PRISM [10], which finds degenerate motifs (ASCGWT), and SPACER [11], which finds longer bipartite motifs (ACCnnnnnnnnGGT). These three algorithms have been optimized to find their corresponding type of motif. Together, they allow SCOPE to perform extremely well.
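The three motif classes differ only in how many positions are degenerate, which a small matcher makes visible. The sketch below translates the standard IUPAC nucleotide codes into a regular expression and reports all (possibly overlapping) match positions; it illustrates the motif notation only and is not part of BEAM, PRISM, or SPACER. The function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif: str) -> str:
    """Translate an IUPAC motif (case-insensitive) into a regex string."""
    return "".join(IUPAC[ch] for ch in motif.upper())


def find_sites(motif: str, sequence: str) -> list[int]:
    """Start positions of all matches on the given strand, overlaps included."""
    # A zero-width lookahead lets overlapping sites be reported as well.
    pattern = re.compile(f"(?={iupac_to_regex(motif)})")
    return [m.start() for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    print(find_sites("ACCGGT", "GGACCGGTACCGGT"))  # non-degenerate: [2, 8]
    print(find_sites("ASCGWT", "TTACCGATGG"))      # degenerate: [2]
    bipartite = "ACC" + "N" * 8 + "GGT"            # two half-sites and a spacer
    print(find_sites(bipartite, "ACC" + "A" * 8 + "GGT"))  # [0]
```

Only the forward strand is scanned here; a real scan would also search the reverse complement.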
Once a gene set has been analyzed and candidate motifs identified, SCOPE can look for other genes that contain the motif which, when added to the original set, will improve the motif score. This can occur through over-representation or motif position preference. Working with partial gene sets that have biologically verified transcription factor binding sites, SCOPE was able to identify most of the rest of the genes also regulated by the given transcription factor.
Output from SCOPE shows candidate motifs, their significance, and other information both as a table and as a graphical motif map. FAQs and video tutorials are available at the SCOPE web site which also includes a "Sample Search" button that allows the user to perform a trial run.
SCOPE has a very friendly user interface that enables novice users to access the algorithm's full power without having to become an expert in the bioinformatics of motif finding. As input, SCOPE can take a list of genes, or FASTA sequences. These can be entered in browser text fields, or read from a file. The output from SCOPE contains a list of all identified motifs with their scores, number of occurrences, fraction of genes containing the motif, and the algorithm used to identify the motif. For each motif, result details include a consensus representation of the motif, a sequence logo, a position weight matrix, and a list of instances for every motif occurrence (with exact positions and "strand" indicated). Results are returned in a browser window and also optionally by email. Previous papers describe the SCOPE algorithms in detail [1,2,9-11].
Genetics; Issue 51; gene regulation; computational biology; algorithm; promoter sequence motif
MEME-ChIP is a web-based tool for analyzing motifs in large DNA or RNA data sets. It can analyze peak regions identified by ChIP-seq, cross-linking sites identified by cLIP-seq and related assays, as well as sets of genomic regions selected using other criteria. MEME-ChIP performs de novo motif discovery, motif enrichment analysis, motif location analysis and motif clustering, providing a comprehensive picture of the DNA or RNA motifs that are enriched in the input sequences. MEME-ChIP performs two complementary types of de novo motif discovery: weight matrix–based discovery for high accuracy; and word-based discovery for high sensitivity. Motif enrichment analysis using DNA or RNA motifs from human, mouse, worm, fly and other model organisms provides even greater sensitivity. MEME-ChIP’s interactive HTML output groups and aligns significant motifs to ease interpretation. This protocol takes less than 3 h, and it provides motif discovery approaches that are distinct and complementary to other online methods.
Detection of rare events happening in a set of DNA/protein sequences could lead to new biological discoveries. One kind of such rare event is the presence of patterns called motifs in DNA/protein sequences. Finding motifs is a challenging problem since the general version of motif search has been proven to be intractable. Motif discovery is an important problem in biology. For example, it is useful in the detection of transcription factor binding sites and transcriptional regulatory elements that are crucial in understanding gene function, human disease, drug design, etc. Many versions of the motif search problem have been proposed in the literature. One such version is the (l, d)-motif search (or Planted Motif Search (PMS)). A generalized version of the PMS problem, namely, Quorum Planted Motif Search (qPMS), is shown to accurately model motifs in real data. However, solving the qPMS problem is an extremely difficult task because a special case of it, the PMS problem, is already NP-hard, which means that any algorithm solving it can be expected to take exponential time in the worst-case scenario. In this paper, we propose a novel algorithm named qPMS7 that tackles the qPMS problem on real data as well as challenging instances. Experimental results show that our algorithm qPMS7 is on average 5 times faster than the state-of-the-art algorithm. The executable program of qPMS7 is freely available on the web at http://pms.engr.uconn.edu/downloads/qPMS7.zip. Our online motif discovery tools that use qPMS7 are freely available at http://pms.engr.uconn.edu or http://motifsearch.com.
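The quorum version of the problem can be stated precisely with an exhaustive search that is only feasible for very small l. The sketch below enumerates every length-l string and keeps those with an occurrence within d mismatches in at least a given number of sequences. It is a brute-force reference for the problem definition, not the qPMS7 algorithm, and it takes the quorum as a sequence count rather than a percentage (an assumption of this sketch).

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate: str, seq: str, d: int) -> bool:
    """True if some l-mer of seq is within d mismatches of candidate."""
    l = len(candidate)
    return any(hamming(seq[i:i + l], candidate) <= d
               for i in range(len(seq) - l + 1))


def qpms_bruteforce(sequences, l, d, quorum, alphabet="ACGT"):
    """All length-l strings supported by at least `quorum` sequences.

    Runs in time proportional to |alphabet|**l, so it is usable only as a
    reference on tiny instances.
    """
    found = []
    for letters in product(alphabet, repeat=l):
        candidate = "".join(letters)
        support = sum(occurs_within(candidate, s, d) for s in sequences)
        if support >= quorum:
            found.append(candidate)
    return found


if __name__ == "__main__":
    seqs = ["AACGT", "ACGAA", "TTTTT", "CACGT"]
    print(qpms_bruteforce(seqs, l=3, d=0, quorum=3))  # ['ACG']
```

Setting the quorum equal to the number of sequences recovers the plain PMS problem, in which every sequence must contain an instance.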
DMINDA (DNA motif identification and analyses) is an integrated web server for DNA motif identification and analyses, which is accessible at http://csbl.bmb.uga.edu/DMINDA/. This web site is freely available to all users and there is no login requirement. This server provides a suite of cis-regulatory motif analysis functions on DNA sequences, which are important to elucidation of the mechanisms of transcriptional regulation: (i) de novo motif finding for a given set of promoter sequences along with statistical scores for the predicted motifs derived based on information extracted from a control set, (ii) scanning motif instances of a query motif in provided genomic sequences, (iii) motif comparison and clustering of identified motifs, and (iv) co-occurrence analyses of query motifs in given promoter sequences. The server is powered by a backend computer cluster with over 150 computing nodes, and is particularly useful for motif prediction and analyses in prokaryotic genomes. We believe that DMINDA, as a new and comprehensive web server for cis-regulatory motif finding and analyses, will benefit the genomic research community in general and prokaryotic genome researchers in particular.
Transcription factors are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in promoters. The de-novo discovery of transcription factor binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology, which has not been fully solved yet. Here, we present a de-novo motif discovery tool called Dispom for finding differentially abundant transcription factor binding sites that models existing positional preferences of binding sites and adjusts the length of the motif in the learning process. Evaluating Dispom, we find that its prediction performance is superior to existing tools for de-novo motif discovery for 18 benchmark data sets with planted binding sites, and for a metazoan compendium based on experimental data from micro-array, ChIP-chip, ChIP-DSL, and DamID as well as Gene Ontology data. Finally, we apply Dispom to find binding sites differentially abundant in promoters of auxin-responsive genes extracted from Arabidopsis thaliana microarray data, and we find a motif that can be interpreted as a refined auxin responsive element predominately positioned in the 250-bp region upstream of the transcription start site. Using an independent data set of auxin-responsive genes, we find in genome-wide predictions that the refined motif is more specific for auxin-responsive genes than the canonical auxin-responsive element. In general, Dispom can be used to find differentially abundant motifs in sequences of any origin. However, the positional distribution learned by Dispom is especially beneficial if all sequences are aligned to some anchor point like the transcription start site in case of promoter sequences. We demonstrate that the combination of searching for differentially abundant motifs and inferring a position distribution from the data is beneficial for de-novo motif discovery. 
Hence, we make the tool freely available as a component of the open-source Java framework Jstacs and as a stand-alone application at http://www.jstacs.de/index.php/Dispom.
Binding of transcription factors to promoters of genes, and subsequent enhancement or repression of transcription, is one of the main steps of transcriptional gene regulation. Direct or indirect wet-lab experiments allow the identification of approximate regions potentially bound or regulated by a transcription factor. Subsequently, de-novo motif discovery tools can be used for detecting the precise positions of binding sites. Many traditional tools focus on motifs over-represented in the target regions, which often turn out to be similarly over-represented in the entire genome. In contrast, several recent tools focus on differentially abundant motifs in target regions compared to a control set. As binding sites are often located at some preferred distance to the transcription start site, it is favorable to include this information in de-novo motif discovery. Here, we present Dispom, a novel approach for learning differentially abundant motifs and their positional preferences simultaneously, which predicts binding sites with increased accuracy compared to many popular de-novo motif discovery tools. When applying Dispom to promoters of auxin-responsive genes of Arabidopsis thaliana, we find a binding motif slightly different from the canonical auxin-response element, which exhibits a strong positional preference and which is considerably more specific to auxin-responsive genes.
The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The set of sequences of interest can be concatenated to form a long sequence of length n. One of the successful approaches for motif discovery is to identify statistically over- or under-represented patterns in this long sequence. A pattern refers to a fixed word W over the alphabet. In the example of interest, W is a word in the set of patterns of the motif. Despite extensive studies on motif discovery, no studies have been carried out on the power of detecting statistically over- or under-represented patterns. Here we address the issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif. Let NW(n) be the number of possibly overlapping occurrences of a pattern W in the sequence that contains instances of a known motif; such a sequence is modeled here by a Hidden Markov Model (HMM). First, efficient computational methods for calculating the mean and variance of NW(n) are developed. Second, efficient computational methods for calculating parameters involved in the normal approximation of NW(n) for frequent patterns and compound Poisson approximation of NW(n) for rare patterns are developed. Third, an easy-to-use web program is developed to calculate the power of detecting patterns, and the program is used to study the power of detection in several interesting biological examples.
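As a point of reference for the occurrence count of a word W, the sketch below counts possibly overlapping occurrences in a sequence and computes the expected count under a much simpler null model than the HMM used above: independent, identically distributed letters. Under that assumption, linearity of expectation gives a mean of (n - |W| + 1) times the product of the letter probabilities of W. The variance is omitted because it depends on how W overlaps itself. The function names are hypothetical.

```python
from math import prod


def count_overlapping(word: str, sequence: str) -> int:
    """Number of possibly overlapping occurrences of word in sequence."""
    l = len(word)
    return sum(sequence[i:i + l] == word for i in range(len(sequence) - l + 1))


def expected_count_iid(word: str, n: int, freqs: dict[str, float]) -> float:
    """Expected occurrence count in a length-n sequence of i.i.d. letters.

    This is a simplifying assumption of the sketch; it does not model
    planted motif instances the way an HMM would.
    """
    positions = max(n - len(word) + 1, 0)
    return positions * prod(freqs[ch] for ch in word)


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(count_overlapping("ACA", "ACACA"))          # 2 (occurrences overlap)
    print(expected_count_iid("ACGT", 1003, uniform))  # 3.90625
```

Comparing an observed count with such an expectation is the starting point for calling a word over- or under-represented; the approximations named above are needed to turn that comparison into a significance statement.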
Hidden Markov model; motif; pattern recognition; statistical power
A central problem in the bioinformatics of gene regulation is to find the binding sites for regulatory proteins. One of the most promising approaches toward identifying these short and fuzzy sequence patterns is the comparative analysis of orthologous intergenic regions of related species. This analysis is complicated by various factors. First, one needs to take the phylogenetic relationship between the species into account in order to distinguish conservation that is due to the occurrence of functional sites from spurious conservation that is due to evolutionary proximity. Second, one has to deal with the complexities of multiple alignments of orthologous intergenic regions, and one has to consider the possibility that functional sites may occur outside of conserved segments. Here we present a new motif sampling algorithm, PhyloGibbs, that runs on arbitrary collections of multiple local sequence alignments of orthologous sequences. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors (TFs) can be assigned to the multiple sequence alignments. These binding site configurations are scored by a Bayesian probabilistic model that treats aligned sequences by a model for the evolution of binding sites and “background” intergenic DNA. This model takes the phylogenetic relationship between the species in the alignment explicitly into account. The algorithm uses simulated annealing and Monte Carlo Markov-chain sampling to rigorously assign posterior probabilities to all the binding sites that it reports. In tests on synthetic data and real data from five Saccharomyces species our algorithm performs significantly better than four other motif-finding algorithms, including algorithms that also take phylogeny into account. Our results also show that, in contrast to the other algorithms, PhyloGibbs can make realistic estimates of the reliability of its predictions. 
Our tests suggest that, running on the five-species multiple alignment of a single gene's upstream region, PhyloGibbs on average recovers over 50% of all binding sites in S. cerevisiae at a specificity of about 50%, and 33% of all binding sites at a specificity of about 85%. We also tested PhyloGibbs on collections of multiple alignments of intergenic regions that were recently annotated, based on ChIP-on-chip data, to contain binding sites for the same TF. We compared PhyloGibbs's results with the previous analysis of these data using six other motif-finding algorithms. For 16 of 21 TFs for which all other motif-finding methods failed to find a significant motif, PhyloGibbs did recover a motif that matches the literature consensus. In 11 cases where there was disagreement in the results we compiled lists of known target genes from the literature, and found that running PhyloGibbs on their regulatory regions yielded a binding motif matching the literature consensus in all but one of the cases. Interestingly, these literature gene lists had little overlap with the targets annotated based on the ChIP-on-chip data. The PhyloGibbs code can be downloaded from http://www.biozentrum.unibas.ch/~nimwegen/cgi-bin/phylogibbs.cgi or http://www.imsc.res.in/~rsidd/phylogibbs. The full set of predicted sites from our tests on yeast are available at http://www.swissregulon.unibas.ch.
Computational discovery of regulatory sites in intergenic DNA is one of the central problems in bioinformatics. Up until recently motif finders would typically take one of the following two general approaches. Given a known set of co-regulated genes, one searches their promoter regions for significantly overrepresented sequence motifs. Alternatively, in a “phylogenetic footprinting” approach one searches multiple alignments of orthologous intergenic regions for short segments that are significantly more conserved than expected based on the phylogeny of the species.
In this work the authors present an algorithm, PhyloGibbs, that combines these two approaches into one integrated Bayesian framework. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors can be assigned to arbitrary collections of multiple sequence alignments while taking into account the phylogenetic relations between the sequences.
The authors perform a number of tests on synthetic data and real data from Saccharomyces genomes in which PhyloGibbs significantly outperforms other existing methods. Finally, a novel anneal-and-track strategy allows PhyloGibbs to make accurate estimates of the reliability of its predictions.
An increasing number of cis-regulatory RNA elements have been found to regulate gene expression post-transcriptionally in various biological processes in bacterial systems. Effective computational tools for large-scale identification of novel regulatory RNAs are strongly desired to facilitate our exploration of gene regulation mechanisms and regulatory networks. We present a new computational program named RSSVM (RNA Sampler+Support Vector Machine), which employs Support Vector Machines (SVMs) for efficient identification of functional RNA motifs from random RNA secondary structures. RSSVM uses a set of distinctive features to represent the common RNA secondary structure and structural alignment predicted by RNA Sampler, a tool for accurate common RNA secondary structure prediction, and is trained with functional RNAs from a variety of bacterial RNA motif/gene families covering a wide range of sequence identities. When tested on a large number of known and random RNA motifs, RSSVM shows a significantly higher sensitivity than other leading RNA identification programs while maintaining the same false positive rate. RSSVM performs particularly well on sets with low sequence identities. The combination of RNA Sampler and RSSVM provides a new, fast, and efficient pipeline for large-scale discovery of regulatory RNA motifs. We applied RSSVM to multiple Shewanella genomes and identified putative regulatory RNA motifs in the 5′ untranslated regions (UTRs) in S. oneidensis, an important bacterial organism with extraordinary respiratory and metal reducing abilities and great potential for bioremediation and alternative energy generation. From 1002 sets of 5′-UTRs of orthologous operons, we identified 166 putative regulatory RNA motifs, including 17 of the 19 known RNA motifs from Rfam, an additional 21 RNA motifs that are supported by literature evidence, 72 RNA motifs overlapping predicted transcription terminators or attenuators, and other candidate regulatory RNA motifs. 
Our study provides a list of promising novel regulatory RNA motifs potentially involved in post-transcriptional gene regulation. Combined with the previous cis-regulatory DNA motif study in S. oneidensis, this genome-wide discovery of cis-regulatory RNA motifs may offer more comprehensive views of gene regulation at a different level in this organism. The RSSVM software, predictions, and analysis results on Shewanella genomes are available at http://ural.wustl.edu/resources.html#RSSVM.
RNA is remarkably versatile, acting not only as messengers to transfer genetic information from DNA to protein but also as critical structural components and catalytic enzymes in the cell. More intriguingly, RNA elements in messenger RNAs have been widely found in bacteria to control the expression of their downstream genes. The functions of these RNA elements are intrinsically linked to their secondary structures, which are usually conserved across multiple closely related species during evolution and often shared by genes in the same metabolic pathways. We developed a new computational approach to find putative functional RNA elements by looking for conserved RNA secondary structures that are distinguished from random RNA secondary structures in the orthologous RNA sequences from related species. We applied this approach to multiple Shewanella genomes and predicted putative regulatory RNA elements in Shewanella oneidensis, a bacterium that has extraordinary respiratory and metal reducing abilities and great potential for bioremediation and alternative energy generation. Our findings not only recovered many RNA elements that are known or supported by literature evidence but also included exciting novel RNA elements for further exploration.
SCOPE is a novel parameter-free method for the de novo identification of potential regulatory motifs in sets of coordinately regulated genes. The SCOPE algorithm combines the output of three component algorithms, each designed to identify a particular class of motifs. Using an ensemble learning approach, SCOPE identifies the best candidate motifs from its component algorithms. In tests on experimentally determined datasets, SCOPE identified motifs with a significantly higher level of accuracy than a number of other web-based motif finders run with their default parameters. Because SCOPE has no adjustable parameters, the web server has an intuitive interface, requiring only a set of gene names or FASTA sequences and a choice of species. The most significant motifs found by SCOPE are displayed graphically on the main results page with a table containing summary statistics for each motif. Detailed motif information, including the sequence logo, PWM, consensus sequence and specific matching sites can be viewed through a single click on a motif. SCOPE's efficient, parameter-free search strategy has enabled the development of a web server that is readily accessible to the practising biologist while providing results that compare favorably with those of other motif finders. The SCOPE web server is at .
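The per-motif details mentioned above (matching sites, PWM, consensus sequence) are related in a simple way, which the following sketch illustrates: aligned sites are turned into a position-specific frequency matrix, from which a consensus is read off. This is a generic illustration and not SCOPE's own code; the function names and the optional pseudocount are assumptions, and ties in the consensus are broken by alphabet order.

```python
def position_frequency_matrix(sites, alphabet="ACGT", pseudocount=0.0):
    """Column-wise letter frequencies for a list of equal-length aligned sites."""
    if not sites:
        raise ValueError("at least one site is required")
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for i in range(length):
        counts = {a: pseudocount for a in alphabet}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        matrix.append({a: counts[a] / total for a in alphabet})
    return matrix


def consensus(matrix):
    """Most frequent letter in each column (first in alphabet order on ties)."""
    return "".join(max(col, key=col.get) for col in matrix)


if __name__ == "__main__":
    sites = ["ACGT", "ACGA", "ACCT", "TCGT"]
    pfm = position_frequency_matrix(sites)
    print(pfm[0]["A"])     # 0.75
    print(consensus(pfm))  # ACGT
```

A log-odds weight matrix would additionally divide each frequency by a background frequency and take the logarithm, which is why a nonzero pseudocount is usually needed in practice.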
The identification of transcription factor binding motifs is important for the study of gene transcriptional regulation. The chromatin immunoprecipitation (ChIP), followed by massive parallel sequencing (ChIP-seq) experiments, provides an unprecedented opportunity to discover binding motifs. Computational methods have been developed to identify motifs from ChIP-seq data, while at the same time encountering several problems. For example, existing methods are often not scalable to the large number of sequences obtained from ChIP-seq peak regions. Some methods heavily rely on well-annotated motifs even though the number of known motifs is limited. To simplify the problem, de novo motif discovery methods often neglect underrepresented motifs in ChIP-seq peak regions. To address these issues, we developed a novel approach called SIOMICS to de novo discover motifs from ChIP-seq data. Tested on 13 ChIP-seq data sets, SIOMICS identified motifs of many known and new cofactors. Tested on 13 simulated random data sets, SIOMICS discovered no motif in any data set. Compared with two recently developed methods for motif discovery, SIOMICS shows advantages in terms of speed, the number of known cofactor motifs predicted in experimental data sets and the number of false motifs predicted in random data sets. The SIOMICS software is freely available at http://eecs.ucf.edu/∼xiaoman/SIOMICS/SIOMICS.html.
Protein sorting is the process by which newly synthesized proteins are transported to their target locations within or outside of the cell. This process is precisely regulated by protein sorting signals in different forms. A major category of sorting signals is amino acid sub-sequences usually located at the N-termini or C-termini of protein sequences. Genome-wide experimental identification of protein sorting signals is extremely time-consuming and costly. Effective computational algorithms for de novo discovery of protein sorting signals are needed to improve the understanding of protein sorting mechanisms.
We formulated the protein sorting motif discovery problem as a classification problem and proposed a Bayesian classifier based algorithm (BayesMotif) for de novo identification of a common type of protein sorting motif in which a highly conserved anchor is present along with a less conserved motif region. A false positive removal procedure is developed to iteratively remove sequences that are unlikely to contain true motifs so that the algorithm can identify motifs from impure input sequences.
Experiments on both implanted motif datasets and real-world datasets showed that the enhanced BayesMotif algorithm can identify anchored sorting motifs from pure or impure protein sequence datasets. They also show that the false positive removal procedure can help to identify true motifs even when only 20% of the input sequences contain true motif instances.
We proposed BayesMotif, a novel Bayesian classification based algorithm for de novo discovery of a special category of anchored protein sorting motifs from impure datasets. Compared to conventional motif discovery algorithms such as MEME, our algorithm can find less-conserved motifs with short highly conserved anchors. Our algorithm also has the advantage of easy incorporation of additional meta-sequence features such as hydrophobicity or charge of the motifs, which may help to overcome the limitations of the PWM (position weight matrix) motif model.
Computational methods for discovery of sequence elements that are enriched in a target set compared with a background set are fundamental in molecular biology research. One example is the discovery of transcription factor binding motifs that are inferred from ChIP–chip (chromatin immuno-precipitation on a microarray) measurements. Several major challenges in sequence motif discovery still require consideration: (i) the need for a principled approach to partitioning the data into target and background sets; (ii) the lack of rigorous models and of an exact p-value for measuring motif enrichment; (iii) the need for an appropriate framework for accounting for motif multiplicity; (iv) the tendency, in many of the existing methods, to report presumably significant motifs even when applied to randomly generated data. In this paper we present a statistical framework for discovering enriched sequence elements in ranked lists that resolves these four issues. We demonstrate the implementation of this framework in a software application, termed DRIM (discovery of rank imbalanced motifs), which identifies sequence motifs in lists of ranked DNA sequences. We applied DRIM to ChIP–chip and CpG methylation data and obtained the following results. (i) Identification of 50 novel putative transcription factor (TF) binding sites in yeast ChIP–chip data. The biological function of some of them was further investigated to gain new insights on transcription regulation networks in yeast. For example, our discoveries enable the elucidation of the network of the TF ARO80. Another finding concerns a systematic TF binding enhancement to sequences containing CA repeats. (ii) Discovery of novel motifs in human cancer CpG methylation data. Remarkably, most of these motifs are similar to DNA sequence elements bound by the Polycomb complex that promotes histone methylation. Our findings thus support a model in which histone methylation and CpG methylation are mechanistically linked. 
Overall, we demonstrate that the statistical framework embodied in the DRIM software tool is highly effective for identifying regulatory sequence elements in a variety of applications ranging from expression and ChIP–chip to CpG methylation data. DRIM is publicly available at http://bioinfo.cs.technion.ac.il/drim.
A computational problem with many applications in molecular biology is to identify short DNA sequence patterns (motifs) that are significantly overrepresented in a target set of genomic sequences relative to a background set of genomic sequences. One example is a target set that contains DNA sequences to which a specific transcription factor protein was experimentally measured as bound while the background set contains sequences to which the same transcription factor was not bound. Overrepresented sequence motifs in the target set may represent a subsequence that is molecularly recognized by the transcription factor. An inherent limitation of the above formulation of the problem lies in the fact that in many cases data cannot be clearly partitioned into distinct target and background sets in a biologically justified manner. We describe a statistical framework for discovering motifs in a list of genomic sequences that are ranked according to a biological parameter or measurement (e.g., transcription factor to sequence binding measurements). Our approach circumvents the need to partition the data into target and background sets using arbitrarily set parameters. The framework is implemented in a software tool called DRIM. The application of DRIM led to the identification of novel putative transcription factor binding sites in yeast and to the discovery of previously unknown motifs in CpG methylation regions in human cancer cell lines.
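The idea of scoring motif enrichment in a ranked list without a fixed target/background split can be illustrated with a small sketch. It assumes a minimum hypergeometric style statistic: for every cutoff n, the top n sequences are treated as the target set, a hypergeometric tail probability is computed, and the smallest value over all cutoffs is kept. The function names and the 0/1 occurrence encoding are illustrative assumptions, not DRIM's interface, and the minimum returned here is not itself a p-value: it still needs a correction for having tried every cutoff, which is the part the exact p-value described above addresses and which this sketch omits.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) when drawing n items from N, of which B carry the motif."""
    upper = min(n, B)
    total = comb(N, n)
    favourable = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return favourable / total


def min_hypergeom_score(occurrence):
    """Scan all cutoffs of a ranked 0/1 occurrence list (best-ranked first).

    Returns (smallest tail probability, cutoff at which it occurs).
    The returned minimum is uncorrected for the multiple cutoffs tried.
    """
    N = len(occurrence)
    B = sum(occurrence)
    best_score, best_cutoff = 1.0, 0
    hits = 0
    for n in range(1, N + 1):
        hits += occurrence[n - 1]  # motif-containing sequences among the top n
        tail = hypergeom_tail(N, B, n, hits)
        if tail < best_score:
            best_score, best_cutoff = tail, n
    return best_score, best_cutoff


# Hypothetical ranked list: the motif occurs only in the three top-ranked sequences.
score, cutoff = min_hypergeom_score([1, 1, 1, 0, 0, 0, 0, 0])
print(score, cutoff)
```

Because the cutoff is chosen by the data rather than set in advance, no arbitrary threshold on the ranking measurement is needed, which is the property the framework above emphasizes.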
Discovery of functionally significant short, statistically overrepresented subsequence patterns (motifs) in a set of sequences is a challenging problem in bioinformatics. Oftentimes, not all sequences in the set contain a motif. These non-motif-containing sequences complicate the algorithmic discovery of motifs. Filtering the non-motif-containing sequences from the larger set of sequences while simultaneously determining the identity of the motif is, therefore, desirable and a non-trivial problem in motif discovery research.
We describe MotifCatcher, a framework that extends the sensitivity of existing motif-finding tools by employing random sampling to effectively remove non-motif-containing sequences from the motif search. We developed two implementations of our algorithm, each built around a commonly used motif-finding tool, and applied our algorithm to three diverse chromatin immunoprecipitation (ChIP) data sets. In each case, the motif finder with the MotifCatcher extension demonstrated improved sensitivity over the motif finder alone. Our approach organizes candidate functionally significant discovered motifs into a tree, which allowed us to gain additional insights. In all cases, we were able to support our findings with experimental work from the literature.
Our framework demonstrates that additional processing at the sequence entry level can significantly improve the performance of existing motif-finding tools. For each biological data set tested, we were able to propose novel biological hypotheses supported by experimental work from the literature. Specifically, in Escherichia coli, we suggested binding site motifs for 6 non-traditional LexA protein binding sites; in Saccharomyces cerevisiae, we hypothesized 2 disparate mechanisms for novel binding sites of the Cse4p protein; and in Halobacterium sp. NRC-1, we discovered subtle differences in a general transcription factor (GTF) binding site motif across several data sets. We suggest that small differences in our discovered motif could confer specificity for one or more homologous GTF proteins. We offer a free implementation of the MotifCatcher software package at
Motif; Monte Carlo; ChIP-seq; ChIP-chip; Comparative genomics; MEME; STAMP; TFB
A key step in understanding gene regulation is to identify the repertoire of transcription factor binding motifs (TFBMs) that form the building blocks of promoters and other regulatory elements. Identifying these experimentally is very laborious, and the number of TFBMs discovered remains relatively small, especially when compared with the hundreds of transcription factor genes predicted in metazoan genomes. We have used a recently developed statistical motif discovery approach, NestedMICA, to detect candidate TFBMs from a large set of Drosophila melanogaster promoter regions. Of the 120 motifs inferred in our initial analysis, 25 were statistically significant matches to previously reported motifs, while 87 appeared to be novel. Analysis of sequence conservation and motif positioning suggested that the great majority of these discovered motifs are predictive of functional elements in the genome. Many motifs showed associations with specific patterns of gene expression in the D. melanogaster embryo, and we were able to obtain confident annotation of expression patterns for 25 of our motifs, including eight of the novel motifs. The motifs are available through Tiffin, a new database of DNA sequence motifs. We have discovered many new motifs that are overrepresented in D. melanogaster promoter regions, and offer several independent lines of evidence that these are novel TFBMs. Our motif dictionary provides a solid foundation for further investigation of regulatory elements in Drosophila, and demonstrates techniques that should be applicable in other species. We suggest that further improvements in computational motif discovery should narrow the gap between the set of known motifs and the total number of transcription factors in metazoan genomes.
In contrast to the genomic sequences that encode proteins, little is known about the regulatory elements that instruct the cell as to when and where a given gene should be active. Regulatory elements are thought to consist of clusters of short DNA words (motifs), each of which acts as a binding site for a sequence-specific DNA-binding protein. Thus, building a comprehensive dictionary of such motifs is an important step towards a broader understanding of gene regulation. Using the recently published NestedMICA method for detecting overrepresented motifs in a set of sequences, we build a dictionary of 120 motifs from regulatory sequences in the fruitfly genome, 87 of which are novel. Analysis of positional biases, conservation across species, and association with specific patterns of gene expression in fruitfly embryos suggests that the great majority of these newly discovered motifs represent functional regulatory elements. In addition to providing an initial motif dictionary for one of the most intensively studied model organisms, this work provides an analytical framework for the comprehensive discovery of regulatory motifs in complex animal genomes.
Accurate computational identification of cis-regulatory motifs is difficult, particularly in eukaryotic promoters, which typically contain multiple short and degenerate DNA sequences bound by several interacting factors. Enrichment in combinations of rare motifs in the promoter sequence of functionally or evolutionarily related genes among several species is an indicator of conserved transcriptional regulatory mechanisms. This provides a basis for the computational identification of cis-regulatory motifs.
We have used a discriminative seeding DNA motif discovery algorithm for an in-depth analysis of 54 seed storage protein (SSP) gene promoters from three plant families, namely Brassicaceae (mustards), Fabaceae (legumes) and Poaceae (grasses) using backgrounds based on complete sets of promoters from a representative species in each family, namely Arabidopsis (Arabidopsis thaliana (L.) Heynh.), soybean (Glycine max (L.) Merr.) and rice (Oryza sativa L.) respectively. We have identified three conserved motifs (two RY-like and one ACGT-like) in Brassicaceae and Fabaceae SSP gene promoters that are similar to experimentally characterized seed-specific cis-regulatory elements. Fabaceae SSP gene promoter sequences are also enriched in a novel, seed-specific E2Fb-like motif. Conserved motifs identified in Poaceae SSP gene promoters include a GCN4-like motif, two prolamin-box-like motifs and an Skn-1-like motif. Evidence of the presence of a variant of the TATA-box is found in the SSP gene promoters from the three plant families. Motifs discovered in SSP gene promoters were used to score whole-genome sets of promoters from Arabidopsis, soybean and rice. The highest-scoring promoters are associated with genes coding for different subunits or precursors of seed storage proteins.
Seed storage protein gene promoter motifs are conserved in diverse species, and different plant families are characterized by a distinct combination of conserved motifs. The majority of discovered motifs match experimentally characterized cis-regulatory elements. These results provide a good starting point for further experimental analysis of plant seed-specific promoters and our methodology can be used to unravel more transcriptional regulatory mechanisms in plants and other eukaryotes.
Traditional methods for computational motif discovery often suffer from poor performance. In particular, methods that search for sequence matches to known binding motifs tend to predict many non-functional binding sites because they fail to take into consideration the biological state of the cell. In recent years, genome-wide studies have generated a lot of data that has the potential to improve our ability to identify functional motifs and binding sites, such as information about chromatin accessibility and epigenetic states in different cell types. However, it is not always trivial to make use of this data in combination with existing motif discovery tools, especially for researchers who are not skilled in bioinformatics programming.
Here we present MotifLab, a general workbench for analysing regulatory sequence regions and discovering transcription factor binding sites and cis-regulatory modules. MotifLab supports comprehensive motif discovery and analysis by allowing users to integrate several popular motif discovery tools as well as different kinds of additional information, including phylogenetic conservation, epigenetic marks, DNase hypersensitive sites, ChIP-Seq data, positional binding preferences of transcription factors, transcription factor interactions and gene expression. MotifLab offers several data-processing operations that can be used to create, manipulate and analyse data objects, and complete analysis workflows can be constructed and automatically executed within MotifLab, including graphical presentation of the results.
We have developed MotifLab as a flexible workbench for motif analysis in a genomic context. The flexibility and effectiveness of this workbench have been demonstrated on selected test cases, in particular two previously published benchmark data sets for single motifs and modules, and a realistic example of genes responding to treatment with forskolin. MotifLab is freely available at http://www.motiflab.org.
DNA motifs are short sequences varying from 6 to 25 bp and can be highly variable and degenerate. One major approach for predicting transcription factor (TF) binding is to use a position weight matrix (PWM) to represent the information content of regulatory sites; however, when used as the sole means of identifying binding sites, this approach suffers from the limited amount of training data available and a high rate of false-positive predictions. The ChIPMotifs program is a de novo motif finding tool developed for ChIP-based high-throughput data, and W-ChIPMotifs is a Web application tool for ChIPMotifs. It combines various ab initio motif discovery tools such as MEME, MaMF and Weeder, and optimizes the significance of the detected motifs by using bootstrap re-sampling error estimation and a Fisher test. Using these techniques, we determined a PWM for OCT4 which is similar to the canonical OCT4 consensus sequence. In a separate study, we also used de novo motif discovery to suggest that ZNF263 binds to a 24-nt site that differs from the motif predicted by the zinc finger code in several positions.
Motif; ChIP; Position weight matrix; OCT4; ZNF263
MEME (Multiple EM for Motif Elicitation) is one of the most widely used tools for searching for novel ‘signals’ in sets of biological sequences. Applications include the discovery of new transcription factor binding sites and protein domains. MEME works by searching for repeated, ungapped sequence patterns that occur in the DNA or protein sequences provided by the user. Users can perform MEME searches via the web server hosted by the National Biomedical Computation Resource and several mirror sites. Through the same web server, users can also access the Motif Alignment and Search Tool to search sequence databases for matches to motifs encoded in several popular formats. By clicking on buttons in the MEME output, users can compare the motifs discovered in their input sequences with databases of known motifs, search sequence databases for matches to the motifs and display the motifs in various formats. This article describes the freely accessible web server and its architecture, and discusses ways to use MEME effectively to find new sequence patterns in biological sequences and analyze their significance.
The discovery of genetic networks and cis-acting DNA motifs underlying their regulation is a major objective of transcriptome studies. The recent release of the maize genome (Zea mays L.) has facilitated in silico searches for regulatory motifs. Several algorithms exist to predict cis-acting elements, but none have been adapted for maize.
A benchmark data set was used to evaluate the accuracy of three motif discovery programs: BioProspector, Weeder and MEME. Analysis showed that each motif discovery tool had limited accuracy and appeared to retrieve a distinct set of motifs. Therefore, using the benchmark, statistical filters were optimized to reduce the false discovery ratio, and then remaining motifs from all programs were combined to improve motif prediction. These principles were integrated into a user-friendly pipeline for motif discovery in maize called Promzea, available at http://www.promzea.org and on the Discovery Environment of the iPlant Collaborative website. Promzea was subsequently expanded to include rice and Arabidopsis. Within Promzea, a user enters cDNA sequences or gene IDs; corresponding upstream sequences are retrieved from the maize genome. Predicted motifs are filtered, combined and ranked. Promzea searches the chosen plant genome for genes containing each candidate motif, providing the user with the gene list and corresponding gene annotations. Promzea was validated in silico using a benchmark data set: the Promzea pipeline showed a 22% increase in nucleotide sensitivity compared to the best standalone program tool, Weeder, with equivalent nucleotide specificity. Promzea was also validated by its ability to retrieve the experimentally defined binding sites of transcription factors that regulate the maize anthocyanin and phlobaphene biosynthetic pathways. Promzea predicted additional promoter motifs, and genome-wide motif searches by Promzea identified 127 non-anthocyanin/phlobaphene genes that each contained all five predicted promoter motifs in their promoters, perhaps uncovering a broader co-regulated gene network. Promzea was also tested against tissue-specific microarray data from maize.
An online tool customized for promoter motif discovery in plants, called Promzea, has been developed. Promzea was validated in silico by its ability to retrieve benchmark motifs and experimentally defined motifs and was tested using tissue-specific microarray data. Promzea predicted broader networks of gene regulation associated with the historic anthocyanin and phlobaphene biosynthetic pathways. Promzea is a new bioinformatics tool for understanding transcriptional gene regulation in maize and has been expanded to include rice and Arabidopsis.
Promoter; cis-acting; Motif; Maize; Anthocyanin; Phlobaphene; Bioprospector; MEME; Weeder; C1; P
Motif enrichment analysis of transcription factor ChIP-seq data can help identify transcription factors that cooperate or compete. Previously, little attention has been given to comparative motif enrichment analysis of pairs of ChIP-seq experiments, where the binding of the same transcription factor is assayed under different conditions. Such comparative analysis could potentially identify the distinct regulatory partners/competitors of the assayed transcription factor under different conditions or at different stages of development.
We describe a new methodology for identifying sequence motifs that are differentially enriched in one set of DNA or RNA sequences relative to another set, and apply it to paired ChIP-seq experiments. We show that, using paired ChIP-seq data for a single transcription factor, differential motif enrichment analysis identifies all the known key transcription factors involved in the transformation of non-cancerous immortalized breast cells (MCF10A-ER-Src cells) into cancer stem cells whereas non-differential motif enrichment analysis does not. We also show that differential motif enrichment analysis identifies regulatory motifs that are significantly enriched at constrained locations within the bound promoters, and that these motifs are not identified by non-differential motif enrichment analysis. Our methodology differs from other approaches in that it leverages both comparative enrichment and positional enrichment of motifs in ChIP-seq peak regions or in the promoters of genes bound by the transcription factor.
We show that differential motif enrichment analysis of paired ChIP-seq experiments offers biological insights not available from non-differential analysis. In contrast to previous approaches, our method detects motifs that are enriched in a constrained region in one set of sequences, but not enriched in the same region in the comparative set. We have enhanced the web-based CentriMo algorithm to allow it to perform the constrained differential motif enrichment analysis described in this paper, and CentriMo’s on-line interface (http://meme.ebi.edu.au) provides dozens of databases of DNA- and RNA-binding motifs from a full range of organisms. All data and output files presented here are available at http://research.imb.uq.edu.au/t.bailey/supplementary_data/Lesluyes2014.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-752) contains supplementary material, which is available to authorized users.
Comparative ChIP-seq analysis; Constrained differential motif enrichment analysis; MCF10A-ER-Src cells; ChIP-seq; Regulation of transcription; Gene expression
Summary: A motif is a short DNA or protein sequence that contributes to the biological function of the sequence in which it resides. Over the past several decades, many computational methods have been described for identifying, characterizing and searching with sequence motifs. Critical to nearly any motif-based sequence analysis pipeline is the ability to scan a sequence database for occurrences of a given motif described by a position-specific frequency matrix.
Results: We describe Find Individual Motif Occurrences (FIMO), a software tool for scanning DNA or protein sequences with motifs described as position-specific scoring matrices. The program computes a log-likelihood ratio score for each position in a given sequence database, uses established dynamic programming methods to convert this score to a P-value and then applies false discovery rate analysis to estimate a q-value for each position in the given sequence. FIMO provides output in a variety of formats, including HTML, XML and several Santa Cruz Genome Browser formats. The program is efficient, allowing for the scanning of DNA sequences at a rate of 3.5 Mb/s on a single CPU.
Availability and Implementation: FIMO is part of the MEME Suite software toolkit. A web server and source code are available at http://meme.sdsc.edu.
Contact: firstname.lastname@example.org; email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
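The scanning steps described in the Results above (a log-likelihood ratio score per position, then a dynamic-programming conversion of score to P-value) can be sketched as follows. This is a minimal illustration and not FIMO's code: the function names, the integer scaling factor, the uniform background and the toy two-column motif are all assumptions made for the example, the frequencies are assumed to already contain pseudocounts (no zeros), and the q-value estimation step is omitted.

```python
import math

BASES = "ACGT"


def log_odds_matrix(freqs, background, scale=10):
    """Turn per-position base frequencies into integer-scaled log-odds scores.

    Integer scores keep the dynamic-programming table small (an assumption
    of this sketch); frequencies must be non-zero.
    """
    return [
        {b: round(scale * math.log2(col[b] / background[b])) for b in BASES}
        for col in freqs
    ]


def scan(sequence, pssm):
    """Score every window of the sequence; returns (start, score) pairs."""
    w = len(pssm)
    return [
        (i, sum(pssm[j][sequence[i + j]] for j in range(w)))
        for i in range(len(sequence) - w + 1)
    ]


def score_distribution(pssm, background):
    """Dynamic programming over columns: P(total score = s) under background."""
    dist = {0: 1.0}
    for col in pssm:
        nxt = {}
        for s, p in dist.items():
            for b in BASES:
                key = s + col[b]
                nxt[key] = nxt.get(key, 0.0) + p * background[b]
        dist = nxt
    return dist


def p_value(score, dist):
    """Probability that a random background window scores at least `score`."""
    return sum(p for s, p in dist.items() if s >= score)


# Hypothetical two-column motif preferring "AC", with a uniform background.
background = {b: 0.25 for b in BASES}
freqs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
pssm = log_odds_matrix(freqs, background)
dist = score_distribution(pssm, background)
for start, score in scan("GACT", pssm):
    print(start, score, p_value(score, dist))
```

The distribution is built once per motif and reused for every window, so the per-position cost of the P-value is a lookup rather than an enumeration of all possible words of the motif's width.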
Discovering transcription factor binding sites (TFBS) is one of the primary challenges in deciphering the complex gene regulatory networks encoded in a genome. A set of short DNA sequences recognized by a transcription factor (TF) is known as a motif, which can be expressed accurately in matrix form, such as a position-specific scoring matrix (PSSM) or a position frequency matrix (PFM). Very frequently, we need to query a database of motifs for motifs similar to a given one, merge similar TFBS motifs possibly recognized by the same TF, separate irrelevant motifs, or filter out spurious motifs. All these applications therefore require a metric that captures slight differences between irrelevant motifs and highlights the similarity between motifs of the same group. Although several metrics for motif similarity have been proposed, their performance is still far from satisfactory for these applications.
In this paper we propose a novel metric, named SPIC (Similarity with Position Information Contents), for measuring the similarity between a column of one motif and a column of another motif. When defining this similarity score, we consider the likelihood that the column of the first motif's PFM can be produced by the column of the second motif's PSSM, multiply this likelihood by the information content of the column of the second motif's PSSM, and vice versa. We evaluated the performance of SPIC combined with a local or a global alignment method with an affine gap penalty function for computing the similarity between two motifs. We also compared SPIC with seven existing state-of-the-art metrics for their capability of clustering motifs from the same group and retrieving motifs from a database on three datasets.
When used jointly with the Smith-Waterman local alignment method with an affine gap penalty function (gap open penalty equal to 1, gap extension penalty equal to 0.5), SPIC outperforms the seven existing state-of-the-art motif similarity metrics combined with their best alignments for matching motifs in database searches, clustering the same TF's sub-motifs, and distinguishing relevant motifs from a miscellaneous group of motifs.
We have developed a novel motif similarity metric that can more accurately match motifs in database searches, and more effectively cluster similar motifs and differentiate irrelevant motifs than do the other seven metrics we are aware of.
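The alignment step mentioned above (Smith-Waterman local alignment of motif columns with an affine gap penalty, gap open 1 and gap extension 0.5) can be sketched as below. Two things here are assumptions rather than the published method. First, the column score is a simple stand-in (one minus the L1 distance between two frequency columns), not SPIC, whose exact formula is not given in this text. Second, a gap of length k is charged gap_open + (k - 1) * gap_extend, which is one common convention; the text does not say which convention was used. All function names are hypothetical.

```python
def placeholder_column_score(c1, c2):
    """Stand-in column similarity in [-1, 1]; NOT the SPIC score."""
    return 1.0 - sum(abs(c1[b] - c2[b]) for b in "ACGT")


def local_align_score(m1, m2, col_score, gap_open=1.0, gap_extend=0.5):
    """Best Smith-Waterman local alignment score between two motifs.

    Motifs are lists of columns (dicts mapping base -> frequency).
    Affine gaps use three tables: H (best score ending at i, j),
    E (alignment ends with a gap in m1), F (ends with a gap in m2).
    """
    n, m = len(m1), len(m2)
    neg = float("-inf")
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Open a new gap from H, or extend an existing one.
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            diag = H[i - 1][j - 1] + col_score(m1[i - 1], m2[j - 1])
            # The 0.0 floor makes the alignment local.
            H[i][j] = max(0.0, diag, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def one_hot(base):
    """Helper for the toy example: a fully conserved column."""
    return {b: (1.0 if b == base else 0.0) for b in "ACGT"}


# Toy motifs: the second lacks the middle column of the first.
motif_a = [one_hot(b) for b in "ACGTA"]
motif_b = [one_hot(b) for b in "ACTA"]
print(local_align_score(motif_a, motif_b, placeholder_column_score))
```

Passing the column score as a parameter keeps the alignment independent of the metric, so a different column similarity could be substituted without touching the dynamic programming.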
gene regulatory networks; information contents; transcription factor binding site (TFBS); motif; similarity metric