
Results 1-8 (8)

1.  Multiple novel promoter-architectures revealed by decoding the hidden heterogeneity within the genome 
Nucleic Acids Research  2014;42(20):12388-12403.
An important question in biology is how different promoter-architectures contribute to the diversity in regulation of transcription initiation. A step forward has been the production of genome-wide maps of transcription start sites (TSSs) using high-throughput sequencing. However, the subsequent step of characterizing promoters and their functions is still largely done on the basis of previously established promoter-elements like the TATA-box in eukaryotes or the -10 box in bacteria. Unfortunately, a majority of promoters and their activities cannot be explained by these few elements. Traditional motif discovery methods that identify novel elements also fail here, because TSS neighborhoods are often highly heterogeneous, containing no overrepresented motif. We present a new, organism-independent method that explicitly models this heterogeneity while unraveling different promoter-architectures. For example, in five bacteria, we detect the presence of a pyrimidine preceding the TSS under very specific circumstances. In tuberculosis, we show for the first time that the spacing between the bacterial -10 motif and TSS is utilized by the pathogen for dynamic gene-regulation. In eukaryotes, we identify several new elements that are important for development. Identified promoter-architectures show differential patterns of evolution, chromatin structure and TSS spread, suggesting distinct regulatory functions. This work highlights the importance of characterizing heterogeneity within high-throughput genomic data rather than analyzing average patterns of nucleotide composition.
PMCID: PMC4227772  PMID: 25326324
2.  CLARE: Cracking the LAnguage of Regulatory Elements 
Bioinformatics  2011;28(4):581-583.
Summary: CLARE is a computational method designed to reveal sequence encryption of tissue-specific regulatory elements. Starting with a set of regulatory elements known to be active in a particular tissue/process, it learns the sequence code of the input set and builds a predictive model from features specific to those elements. The resulting model can then be applied to user-supplied genomic regions to identify novel candidate regulatory elements. CLARE's model also provides a detailed analysis of transcription factors that most likely bind to the elements, making it an invaluable tool for understanding mechanisms of tissue-specific gene regulation.
Availability: CLARE is freely accessible at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3278760  PMID: 22199387
3.  One size does not fit all: On how Markov model order dictates performance of genomic sequence analyses 
Nucleic Acids Research  2012;41(3):1416-1424.
The structural simplicity and ability to capture serial correlations make Markov models a popular modeling choice in several genomic analyses, such as identification of motifs, genes and regulatory elements. A critical, yet relatively unexplored, issue is the determination of the order of the Markov model. Most biological applications use a predetermined order for all data sets indiscriminately. Here, we show the vast variation in the performance of such applications with the order. To identify the ‘optimal’ order, we investigated two model selection criteria: Akaike information criterion (AIC) and Bayesian information criterion (BIC). The BIC optimal order delivers the best performance for mammalian phylogeny reconstruction and motif discovery. Importantly, this order is different from orders typically used by many tools, suggesting that a simple additional step determining this order can significantly improve results. Further, we describe a novel classification approach based on BIC optimal Markov models to predict functionality of tissue-specific promoters. Our classifier discriminates between promoters active across 12 different tissues with remarkable accuracy, yielding 3 times the precision expected by chance. Application to the metagenomics problem of identifying the taxon from a short DNA fragment yields accuracies at least as high as the more complex mainstream methodologies, while retaining conceptual and computational simplicity.
PMCID: PMC3562003  PMID: 23267010
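The order-selection step described in the abstract above can be illustrated with a minimal sketch. It assumes a DNA alphabet of four bases, maximum-likelihood transition estimates, and the textbook definitions AIC = -2 ln L + 2k and BIC = -2 ln L + k ln n. The function names (`markov_log_likelihood`, `bic`, `aic`, `best_order`) are hypothetical and are not taken from the paper or its software.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: plain DNA, no ambiguity codes


def markov_log_likelihood(seq, order, start):
    """Maximized log-likelihood of seq[start:] under an order-`order` Markov model.

    `start` must be >= order, so every candidate order is scored on exactly
    the same positions and the criteria are comparable across orders.
    """
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(start, len(seq)):
        context = seq[i - order:i]
        transition_counts[(context, seq[i])] += 1
        context_counts[context] += 1
    # Plug the maximum-likelihood estimates n(ctx, base) / n(ctx) back in.
    return sum(
        n * math.log(n / context_counts[context])
        for (context, _base), n in transition_counts.items()
    )


def n_free_parameters(order):
    """One distribution over 4 bases (3 free values) per context of length `order`."""
    return (len(ALPHABET) ** order) * (len(ALPHABET) - 1)


def bic(seq, order, start):
    n_obs = len(seq) - start
    log_lik = markov_log_likelihood(seq, order, start)
    return -2.0 * log_lik + n_free_parameters(order) * math.log(n_obs)


def aic(seq, order, start):
    log_lik = markov_log_likelihood(seq, order, start)
    return -2.0 * log_lik + 2.0 * n_free_parameters(order)


def best_order(seq, max_order, criterion=bic):
    """Return the order in 0..max_order with the smallest criterion value."""
    return min(range(max_order + 1), key=lambda k: criterion(seq, k, max_order))
```

For a strictly periodic sequence such as `"ACGT" * 50`, each base is fully determined by its predecessor, so `best_order("ACGT" * 50, 3)` selects order 1: higher orders fit no better but pay a larger parameter penalty.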
4.  MuMoD: a Bayesian approach to detect multiple modes of protein–DNA binding from genome-wide ChIP data 
Nucleic Acids Research  2012;41(1):21-32.
High-throughput chromatin immunoprecipitation has become the method of choice for identifying genomic regions bound by a protein. Such regions are then investigated for overrepresented sequence motifs, the assumption being that they must correspond to the binding specificity of the profiled protein. However, this approach often fails: many bound regions do not contain the ‘expected’ motif. This is because binding DNA directly at its recognition site is not the only way the protein can cause the region to immunoprecipitate. Its binding specificity can change through association with different co-factors, it can bind DNA indirectly, through intermediaries, or even enforce its function through long-range chromosomal interactions. Conventional motif discovery methods, though largely capable of identifying overrepresented motifs from bound regions, lack the ability to characterize such diverse modes of protein–DNA binding and binding specificities. We present a novel Bayesian method that identifies distinct protein–DNA binding mechanisms without relying on any motif database. The method successfully identifies co-factors of proteins that do not bind DNA directly, such as mediator and p300. It also predicts literature-supported enhancer–promoter interactions. Even for well-studied direct-binding proteins, this method provides compelling evidence for previously uncharacterized dependencies within positions of binding sites, long-range chromosomal interactions and dimerization.
PMCID: PMC3592440  PMID: 23093591
5.  Genome-wide analyses of transcription factor GATA3-mediated gene regulation in distinct T cell types 
Immunity  2011;35(2):299-311.
The transcription factor GATA3 plays an essential role during T cell development and T helper 2 (Th2) cell differentiation. To understand GATA3-mediated gene regulation, we identified genome-wide GATA3 binding sites in ten well-defined developmental and effector T lymphocyte lineages. In the thymus, GATA3 directly regulated many critical factors, including Th-POK, Notch1 and T cell receptor subunits. In the periphery, GATA3 induced a large number of Th2 cell-specific as well as Th2 cell non-specific genes, including several transcription factors. Our data also indicate that GATA3 regulates both active and repressive histone modifications of many target genes at their regulatory elements near GATA3 binding sites. Overall, although GATA3 binding exhibited both shared and cell-specific patterns among various T cell lineages, many genes were either positively or negatively regulated by GATA3 in a cell type-specific manner, suggesting that GATA3-mediated gene regulation depends strongly on co-factors existing in different T cells.
PMCID: PMC3169184  PMID: 21867929
6.  Identifying regulatory elements in eukaryotic genomes 
Proper development and functioning of an organism depends on precise spatial and temporal expression of all its genes. These coordinated expression-patterns are maintained primarily through the process of transcriptional regulation. Transcriptional regulation is mediated by proteins binding to regulatory elements on the DNA in a combinatorial manner, where particular combinations of transcription factor binding sites establish specific regulatory codes. In this review, we survey experimental and computational approaches geared towards the identification of proximal and distal gene regulatory elements in the genomes of complex eukaryotes. Available approaches that decipher the genetic structure and function of regulatory elements by exploiting various sources of information like gene expression data, chromatin structure, DNA-binding specificities of transcription factors, cooperativity of transcription factors, etc. are highlighted. We also discuss the relevance of regulatory elements in the context of human health through examples of mutations in some of these regions having serious implications in misregulation of genes and being strongly associated with human disorders.
PMCID: PMC2764519  PMID: 19498043
Keywords: transcriptional regulation; enhancers; silencers; tissue-specific regulatory elements; population variation; non-coding diseases; computational analysis of regulatory element sequence composition
7.  Finding regulatory DNA motifs using alignment-free evolutionary conservation information 
Nucleic Acids Research  2010;38(6):e90.
As an increasing number of eukaryotic genomes are being sequenced, comparative studies aimed at detecting regulatory elements in intergenic sequences are becoming more prevalent. Most comparative methods for transcription factor (TF) binding site discovery make use of global or local alignments of orthologous regulatory regions to assess whether a particular DNA site is conserved across related organisms, and thus more likely to be functional. Since binding sites are usually short, sometimes degenerate, and often independent of orientation, alignment algorithms may not align them correctly. Here, we present a novel, alignment-free approach for using conservation information for TF binding site discovery. We relax the definition of conserved sites: we consider a DNA site within a regulatory region to be conserved in an orthologous sequence if it occurs anywhere in that sequence, irrespective of orientation. We use this definition to derive informative priors over DNA sequence positions, and incorporate these priors into a Gibbs sampling algorithm for motif discovery. Our approach is simple and fast. It requires neither sequence alignments nor the phylogenetic relationships between the orthologous sequences, yet it is more effective on real biological data than methods that do.
PMCID: PMC2847231  PMID: 20047961
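The relaxed definition of conservation in the abstract above (a site counts as conserved in an orthologous sequence if it occurs anywhere in that sequence, in either orientation) can be sketched as follows. This is only an illustration of the definition. The per-position score used here, the fraction of orthologs containing the site, is an assumption made for the example and is not necessarily how the authors derive their priors; all function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(site):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return site.translate(_COMPLEMENT)[::-1]


def is_conserved(site, ortholog):
    """True if `site` occurs anywhere in `ortholog`, on either strand."""
    return site in ortholog or reverse_complement(site) in ortholog


def conservation_profile(reference, orthologs, width):
    """Score each width-`width` site of `reference` by the fraction of
    orthologous sequences in which it is conserved (alignment-free)."""
    if not orthologs:
        raise ValueError("at least one orthologous sequence is required")
    profile = []
    for start in range(len(reference) - width + 1):
        site = reference[start:start + width]
        hits = sum(is_conserved(site, ortholog) for ortholog in orthologs)
        profile.append(hits / len(orthologs))
    return profile
```

No alignment or phylogeny is needed: each site is checked by plain substring search, which is why a site that has moved or flipped strand in an ortholog still counts as conserved.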
8.  A Nucleosome-Guided Map of Transcription Factor Binding Sites in Yeast 
PLoS Computational Biology  2007;3(11):e215.
Finding functional DNA binding sites of transcription factors (TFs) throughout the genome is a crucial step in understanding transcriptional regulation. Unfortunately, these binding sites are typically short and degenerate, posing a significant statistical challenge: many more matches to known TF motifs occur in the genome than are actually functional. However, information about chromatin structure may help to identify the functional sites. In particular, it has been shown that active regulatory regions are usually depleted of nucleosomes, thereby enabling TFs to bind DNA in those regions. Here, we describe a novel motif discovery algorithm that employs an informative prior over DNA sequence positions based on a discriminative view of nucleosome occupancy. When a Gibbs sampling algorithm is applied to yeast sequence-sets identified by ChIP-chip, the correct motif is found in 52% more cases with our informative prior than with the commonly used uniform prior. This is the first demonstration that nucleosome occupancy information can be used to improve motif discovery. The improvement is dramatic, even though we are using only a statistical model to predict nucleosome occupancy; we expect our results to improve further as high-resolution genome-wide experimental nucleosome occupancy data becomes increasingly available.
Author Summary
Identifying transcription factor (TF) binding sites across the genome is an important problem in molecular biology. Large-scale discovery of TF binding sites is usually carried out by searching for short DNA patterns that appear often within promoter regions of genes that are known to be co-bound by a TF. In such problems, promoters have traditionally been treated as strings of nucleotide bases in which TF binding sites are assumed to be equally likely to occur at any position. In vivo, however, TFs localize to DNA binding sites as part of a complicated thermodynamic process of cooperativity and competition, both with one another and, importantly, with DNA packaging proteins called nucleosomes. In particular, TFs are more likely to bind DNA at sites that are not occupied by nucleosomes. In this paper, we show that it is possible to incorporate knowledge of the nucleosome landscape across the genome to aid binding site discovery; indeed, our algorithm incorporating nucleosome occupancy information is significantly more accurate than conventional methods. We use our algorithm to generate a condition-dependent, nucleosome-guided map of binding sites for 55 TFs in yeast.
PMCID: PMC2065891  PMID: 17997593
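The role of an informative positional prior in Gibbs sampling, as described in the abstract and author summary above, can be sketched for a single sampling step. The sketch assumes the usual site-sampler form, in which the weight of a candidate start position is its prior multiplied by the ratio of the motif-model likelihood to the background likelihood. The prior values would come from a nucleosome occupancy model in the paper; here they are simply an input list, and the function names are hypothetical rather than the authors' implementation.

```python
import random


def site_posterior(seq, pwm, background, prior):
    """Posterior over motif start positions in `seq`.

    pwm        -- list of dicts, one per motif column, mapping base -> probability
    background -- dict mapping base -> background probability
    prior      -- one non-negative weight per valid start position
                  (a uniform list reproduces the conventional sampler)
    """
    width = len(pwm)
    n_starts = len(seq) - width + 1
    if len(prior) != n_starts:
        raise ValueError("prior must have one entry per start position")
    weights = []
    for start in range(n_starts):
        ratio = 1.0
        for offset, column in enumerate(pwm):
            base = seq[start + offset]
            ratio *= column[base] / background[base]
        weights.append(prior[start] * ratio)
    total = sum(weights)
    return [w / total for w in weights]


def sample_site(seq, pwm, background, prior, rng=random):
    """Draw one start position from the posterior (one Gibbs update for `seq`)."""
    posterior = site_posterior(seq, pwm, background, prior)
    return rng.choices(range(len(posterior)), weights=posterior, k=1)[0]
```

With a uniform prior the step reduces to the conventional sampler. Setting the prior low where nucleosome occupancy is predicted to be high steers sampling toward accessible positions without changing the motif model itself.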
