1.  NFIX Regulates Neural Progenitor Cell Differentiation During Hippocampal Morphogenesis 
Cerebral Cortex (New York, NY)  2012;24(1):261-279.
Neural progenitor cells have the ability to give rise to neurons and glia in the embryonic, postnatal and adult brain. During development, the program regulating whether these cells divide and self-renew or exit the cell cycle and differentiate is tightly controlled, and imbalances to the normal trajectory of this process can lead to severe functional consequences. However, our understanding of the molecular regulation of these fundamental events remains limited. Moreover, processes underpinning development of the postnatal neurogenic niches within the cortex remain poorly defined. Here, we demonstrate that Nuclear factor one X (NFIX) is expressed by neural progenitor cells within the embryonic hippocampus, and that progenitor cell differentiation is delayed within Nfix−/− mice. Moreover, we reveal that the morphology of the dentate gyrus in postnatal Nfix−/− mice is abnormal, with fewer subgranular zone neural progenitor cells being generated in the absence of this transcription factor. Mechanistically, we demonstrate that the progenitor cell maintenance factor Sry-related HMG box 9 (SOX9) is upregulated in the hippocampus of Nfix−/− mice and demonstrate that NFIX can repress Sox9 promoter-driven transcription. Collectively, our findings demonstrate that NFIX plays a central role in hippocampal morphogenesis, regulating the formation of neuronal and glial populations within this structure.
doi:10.1093/cercor/bhs307
PMCID: PMC3862270  PMID: 23042739
glia; glial fibrillary acidic protein; neural progenitor cell; nuclear factor one X; SOX9
2.  Motif-based analysis of large nucleotide data sets using MEME-ChIP 
Nature protocols  2014;9(6):1428-1450.
MEME-ChIP is a web-based tool for analyzing motifs in large DNA or RNA data sets. It can analyze peak regions identified by ChIP-seq, cross-linking sites identified by cLIP-seq and related assays, as well as sets of genomic regions selected using other criteria. MEME-ChIP performs de novo motif discovery, motif enrichment analysis, motif location analysis and motif clustering, providing a comprehensive picture of the DNA or RNA motifs that are enriched in the input sequences. MEME-ChIP performs two complementary types of de novo motif discovery: weight matrix–based discovery for high accuracy; and word-based discovery for high sensitivity. Motif enrichment analysis using DNA or RNA motifs from human, mouse, worm, fly and other model organisms provides even greater sensitivity. MEME-ChIP’s interactive HTML output groups and aligns significant motifs to ease interpretation. This protocol takes less than 3 h, and it provides motif discovery approaches that are distinct and complementary to other online methods.
doi:10.1038/nprot.2014.083
PMCID: PMC4175909  PMID: 24853928
3.  Creating and validating cis-regulatory maps of tissue-specific gene expression regulation 
Nucleic Acids Research  2014;42(17):11000-11010.
Predicting which genomic regions control the transcription of a given gene is a challenge. We present a novel computational approach for creating and validating maps that associate genomic regions (cis-regulatory modules–CRMs) with genes. The method infers regulatory relationships that explain gene expression observed in a test tissue using widely available genomic data for ‘other’ tissues. To predict the regulatory targets of a CRM, we use cross-tissue correlation between histone modifications present at the CRM and expression at genes within 1 Mbp of it. To validate cis-regulatory maps, we show that they yield more accurate models of gene expression than carefully constructed control maps. These gene expression models predict observed gene expression from transcription factor binding in the CRMs linked to that gene. We show that our maps are able to identify long-range regulatory interactions and improve substantially over maps linking genes and CRMs based on either the control maps or a ‘nearest neighbor’ heuristic. Our results also show that it is essential to include CRMs predicted in multiple tissues during map-building, that H3K27ac is the most informative histone modification, and that CAGE is the most informative measure of gene expression for creating cis-regulatory maps.
doi:10.1093/nar/gku801
PMCID: PMC4176179  PMID: 25200088
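The linking step described in the abstract above (cross-tissue correlation between a histone mark at a CRM and expression of genes within 1 Mbp) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy data and the `min_r` cutoff are all assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat signal carries no cross-tissue information
    return cov / sqrt(var_x * var_y)


def link_crm_to_genes(crm_pos, crm_signal, genes, window=1_000_000, min_r=0.5):
    """Return (gene, r) pairs for genes within `window` bp of the CRM whose
    expression correlates with the CRM's histone signal across tissues.

    genes: dict mapping gene name -> (tss_position, expression_per_tissue)
    min_r: hypothetical correlation cutoff, not a value from the paper.
    """
    links = []
    for name, (tss, expression) in genes.items():
        if abs(tss - crm_pos) > window:
            continue  # outside the 1 Mbp neighbourhood
        r = pearson(crm_signal, expression)
        if r >= min_r:
            links.append((name, r))
    return sorted(links, key=lambda item: item[1], reverse=True)


# Toy data: histone signal at one CRM and gene expression in four tissues.
crm_signal = [1.0, 2.0, 3.0, 4.0]
genes = {
    "geneA": (500_000, [2.0, 4.0, 6.0, 8.0]),    # nearby and correlated
    "geneB": (900_000, [4.0, 3.0, 2.0, 1.0]),    # nearby but anti-correlated
    "geneC": (5_000_000, [1.0, 2.0, 3.0, 4.0]),  # correlated but too far away
}
links = link_crm_to_genes(0, crm_signal, genes)
```

In this toy case only `geneA` is linked: `geneB` fails the correlation cutoff and `geneC` lies outside the window.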
4.  Differential motif enrichment analysis of paired ChIP-seq experiments 
BMC Genomics  2014;15(1):752.
Background
Motif enrichment analysis of transcription factor ChIP-seq data can help identify transcription factors that cooperate or compete. Previously, little attention has been given to comparative motif enrichment analysis of pairs of ChIP-seq experiments, where the binding of the same transcription factor is assayed under different conditions. Such comparative analysis could potentially identify the distinct regulatory partners/competitors of the assayed transcription factor under different conditions or at different stages of development.
Results
We describe a new methodology for identifying sequence motifs that are differentially enriched in one set of DNA or RNA sequences relative to another set, and apply it to paired ChIP-seq experiments. We show that, using paired ChIP-seq data for a single transcription factor, differential motif enrichment analysis identifies all the known key transcription factors involved in the transformation of non-cancerous immortalized breast cells (MCF10A-ER-Src cells) into cancer stem cells whereas non-differential motif enrichment analysis does not. We also show that differential motif enrichment analysis identifies regulatory motifs that are significantly enriched at constrained locations within the bound promoters, and that these motifs are not identified by non-differential motif enrichment analysis. Our methodology differs from other approaches in that it leverages both comparative enrichment and positional enrichment of motifs in ChIP-seq peak regions or in the promoters of genes bound by the transcription factor.
Conclusions
We show that differential motif enrichment analysis of paired ChIP-seq experiments offers biological insights not available from non-differential analysis. In contrast to previous approaches, our method detects motifs that are enriched in a constrained region in one set of sequences, but not enriched in the same region in the comparative set. We have enhanced the web-based CentriMo algorithm to allow it to perform the constrained differential motif enrichment analysis described in this paper, and CentriMo’s on-line interface (http://meme.ebi.edu.au) provides dozens of databases of DNA- and RNA-binding motifs from a full range of organisms. All data and output files presented here are available at http://research.imb.uq.edu.au/t.bailey/supplementary_data/Lesluyes2014.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-752) contains supplementary material, which is available to authorized users.
doi:10.1186/1471-2164-15-752
PMCID: PMC4167127  PMID: 25179504
Comparative ChIP-seq analysis; Constrained differential motif enrichment analysis; MCF10A-ER-Src cells; ChIP-seq; Regulation of transcription; Gene expression
5.  NFIB-Mediated Repression of the Epigenetic Factor Ezh2 Regulates Cortical Development 
The Journal of Neuroscience  2014;34(8):2921-2930.
Epigenetic mechanisms are essential in regulating neural progenitor cell self-renewal, with the chromatin-modifying protein Enhancer of zeste homolog 2 (EZH2) emerging as a central player in promoting progenitor cell self-renewal during cortical development. Despite this, how Ezh2 is itself regulated remains unclear. Here, we demonstrate that the transcription factor nuclear factor IB (NFIB) plays a key role in this process. Nfib−/− mice exhibit an increased number of proliferative ventricular zone cells that express progenitor cell markers and upregulation of EZH2 expression within the neocortex and hippocampus. NFIB binds to the Ezh2 promoter and overexpression of NFIB represses Ezh2 transcription. Finally, key downstream targets of EZH2-mediated epigenetic repression are misregulated in Nfib−/− mice. Collectively, these results suggest that the downregulation of Ezh2 transcription by NFIB is an important component of the process of neural progenitor cell differentiation during cortical development.
doi:10.1523/JNEUROSCI.2319-13.2014
PMCID: PMC3931505  PMID: 24553933
cortex; Ezh2; hippocampus; neural progenitor cell; Nfib
6.  Triplex-Inspector: an analysis tool for triplex-mediated targeting of genomic loci 
Bioinformatics  2013;29(15):1895-1897.
Summary: At the heart of many modern biotechnological and therapeutic applications lies the need to target specific genomic loci with pinpoint accuracy. Although landmark experiments demonstrate technological maturity in manufacturing and delivering genetic material, the genomic sequence analysis to find suitable targets lags behind. We provide a computational aid for the sophisticated design of sequence-specific ligands and selection of appropriate targets, taking gene location and genomic architecture into account.
Availability: Source code and binaries are downloadable from www.bioinformatics.org.au/triplexator/inspector.
Contact: t.bailey@uq.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt315
PMCID: PMC3712220  PMID: 23740745
7.  An overlapping set of genes is regulated by both NFIB and the glucocorticoid receptor during lung maturation 
BMC Genomics  2014;15:231.
Background
Lung maturation is a late fetal developmental event in both mice and humans. Because of this, lung immaturity is a serious problem in premature infants. Disruption of genes for either the glucocorticoid receptor (Nr3c1) or the NFIB transcription factors results in perinatal lethality due to lung immaturity. In both knockouts, the phenotype includes excess cell proliferation, failure of saccularization and reduced expression of markers of epithelial differentiation. This similarity suggests that the two genes may co-regulate a specific set of genes essential for lung maturation.
Results
We analyzed the roles of these two transcription factors in regulating transcription using ChIP-seq data for NFIB, and RNA expression data and motif analysis for both. Our new ChIP-seq data for NFIB in lung at E16.5 shows that NFIB binds to a NFI motif. This motif is over-represented in the promoters of genes that are under-expressed in Nfib-KO mice at E18.5, suggesting an activator role for NFIB. Using available microarray data from Nr3c1-KO mice, we further identified 52 genes that are under-expressed in both Nfib and Nr3c1 knockouts, an overlap which is 13.1 times larger than what would be expected by chance. Finally, we looked for enrichment of 738 recently published transcription factor motifs in the promoters of these putative target genes and found that the NFIB and glucocorticoid receptor motifs were among the most enriched, suggesting that a subset of these genes may be directly activated by Nfib and Nr3c1.
Conclusions
Our data provide the first evidence for Nfib and Nr3c1 co-regulating genes related to lung maturation. They also establish that the in vivo DNA-binding specificity of NFIB is the same as previously seen in vitro, and highly similar to that of the other NFI-family members NFIA, NFIC and NFIX.
doi:10.1186/1471-2164-15-231
PMCID: PMC4023408  PMID: 24661679
Lung development; Nr3c1; Glucocorticoid receptor; Nfib; Regulation of transcription; ChIP-seq analysis; Expression analysis; Motif analysis; Transcription factor
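The "times larger than expected by chance" figure in the abstract above is a fold enrichment of an observed overlap over its chance expectation. A minimal sketch of that arithmetic, assuming two gene sets drawn independently from a common universe of genes; the set sizes below are hypothetical toy numbers, not those of the study.

```python
def expected_overlap(size_a, size_b, universe):
    """Expected overlap of two independent gene sets drawn from `universe` genes."""
    return size_a * size_b / universe


def fold_enrichment(observed, size_a, size_b, universe):
    """Observed overlap divided by the overlap expected by chance."""
    return observed / expected_overlap(size_a, size_b, universe)


# Hypothetical toy numbers: 100 and 200 genes out of 10,000, sharing 30.
expected = expected_overlap(100, 200, 10_000)
fold = fold_enrichment(30, 100, 200, 10_000)
```

Here the expected overlap is 2 genes, so an observed overlap of 30 corresponds to a 15-fold enrichment.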
8.  Potential in vivo roles of nucleic acid triple-helices 
RNA Biology  2011;8(3):427-439.
The ability of double-stranded DNA to form a triple-helical structure by hydrogen bonding with a third strand is well established, but the biological functions of these structures remain largely unknown. There is considerable albeit circumstantial evidence for the existence of nucleic triplexes in vivo and their potential participation in a variety of biological processes including chromatin organization, DNA repair, transcriptional regulation and RNA processing has been investigated in a number of studies to date. There is also a range of possible mechanisms to regulate triplex formation through differential expression of triplex-forming RNAs, alteration of chromatin accessibility, sequence unwinding and nucleotide modifications. With the advent of next generation sequencing technology combined with targeted approaches to isolate triplexes, it is now possible to survey triplex formation with respect to their genomic context, abundance and dynamical changes during differentiation and development, which may open up new vistas in understanding genome biology and gene regulation.
doi:10.4161/rna.8.3.14999
PMCID: PMC3218511  PMID: 21525785
triple-helix; RNA-DNA interaction; triplex-forming oligonucleotide; sequence-specific; transcriptional regulation
9.  Genome-wide in silico prediction of gene expression 
Bioinformatics  2012;28(21):2789-2796.
Motivation: Modelling the regulation of gene expression can provide insight into the regulatory roles of individual transcription factors (TFs) and histone modifications. Recently, Ouyang et al. (2009) modelled gene expression levels in mouse embryonic stem (mES) cells using in vivo ChIP-seq measurements of TF binding. ChIP-seq TF binding data, however, are tissue-specific and relatively difficult to obtain. This limits the applicability of gene expression models that rely on ChIP-seq TF binding data.
Results: In this study, we build regression-based models that relate gene expression to the binding of 12 different TFs, 7 histone modifications and chromatin accessibility (DNase I hypersensitivity) in two different tissues. We find that expression models based on computationally predicted TF binding can achieve similar accuracy to those using in vivo TF binding data and that including binding at weak sites is critical for accurate prediction of gene expression. We also find that incorporating histone modification and chromatin accessibility data results in additional accuracy. Surprisingly, we find that models that use no TF binding data at all, but only histone modification and chromatin accessibility data, can be as (or more) accurate than those based on in vivo TF binding data.
Availability and implementation: All scripts, motifs and data presented in this article are available online at http://research.imb.uq.edu.au/t.bailey/supplementary_data/McLeay2011a.
Contact: t.bailey@imb.uq.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts529
PMCID: PMC3476338  PMID: 22954627
10.  Epigenetic priors for identifying active transcription factor binding sites 
Bioinformatics  2011;28(1):56-62.
Motivation: Accurate knowledge of the genome-wide binding of transcription factors in a particular cell type or under a particular condition is necessary for understanding transcriptional regulation. Using epigenetic data such as histone modification and DNase I accessibility data has been shown to improve motif-based in silico methods for predicting such binding, but this approach has not yet been fully explored.
Results: We describe a probabilistic method for combining one or more tracks of epigenetic data with a standard DNA sequence motif model to improve our ability to identify active transcription factor binding sites (TFBSs). We convert each data type into a position-specific probabilistic prior and combine these priors with a traditional probabilistic motif model to compute a log-posterior odds score. Our experiments, using histone modifications H3K4me1, H3K4me3, H3K9ac and H3K27ac, as well as DNase I sensitivity, show conclusively that the log-posterior odds score consistently outperforms a simple binary filter based on the same data. We also show that our approach performs competitively with a more complex method, CENTIPEDE, and suggest that the relative simplicity of the log-posterior odds scoring method makes it an appealing and very general method for identifying functional TFBSs on the basis of DNA and epigenetic evidence.
Availability and implementation: FIMO, part of the MEME Suite software toolkit, now supports log-posterior odds scoring using position-specific priors for motif search. A web server and source code are available at http://meme.nbcr.net. Utilities for creating priors are at http://research.imb.uq.edu.au/t.bailey/SD/Cuellar2011.
Contact: t.bailey@uq.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr614
PMCID: PMC3244768  PMID: 22072382
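The log-posterior odds score named in the abstract above can be read as a motif log-likelihood ratio plus the log odds of a position-specific prior. The sketch below illustrates that combination under stated assumptions: the toy motif, the uniform background, the prior values and the function names are invented for the example, and natural logarithms are an arbitrary choice. It is not FIMO's code.

```python
from math import log


def log_likelihood_ratio(site, pwm, background):
    """Sum over positions of log(P(base | motif) / P(base | background))."""
    return sum(log(column[base] / background[base])
               for column, base in zip(pwm, site))


def log_posterior_odds(site, pwm, background, prior):
    """Motif log-likelihood ratio plus the log prior odds of a site here.

    prior: assumed probability (from epigenetic data) that a site starts
    at this position; must lie strictly between 0 and 1.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    return log_likelihood_ratio(site, pwm, background) + log(prior / (1.0 - prior))


# Toy two-column motif preferring "AG", with a uniform background.
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# The same sequence scores higher where the (hypothetical) prior is higher.
score_high_prior = log_posterior_odds("AG", pwm, background, prior=0.01)
score_low_prior = log_posterior_odds("AG", pwm, background, prior=0.0001)
```

Unlike a binary filter, which would either keep or discard a position, the prior here shifts the score continuously, so a strong motif match in a weakly supported region can still outrank a weak match in a strongly supported one.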
11.  Prediction of novel long non-coding RNAs based on RNA-Seq data of mouse Klf1 knockout study 
BMC Bioinformatics  2012;13:331.
Background
The study of long non-coding RNAs (lncRNAs) has been advanced by high-throughput RNA sequencing (RNA-Seq). However, identifying lncRNAs from RNA-Seq data is still not trivial, and uncovering their functions remains a challenge.
Results
We present a computational pipeline for detecting novel lncRNAs from RNA-Seq data. First, genome-guided transcriptome reconstruction is used to generate initially assembled transcripts. Possible partial transcripts and artefacts are then filtered according to their quantified expression level. After that, novel lncRNAs are detected by further filtering out known transcripts and those with high protein-coding potential, using a newly developed program called lncRScan. We applied our pipeline to a mouse Klf1 knockout dataset, and discussed the plausible functions of the novel lncRNAs we detected by differential expression analysis. We identified 308 novel lncRNA candidates, which have shorter transcripts, fewer exons and shorter putative open reading frames than known protein-coding transcripts. Of the lncRNAs, 52 large intergenic ncRNAs (lincRNAs) show lower expression levels than the protein-coding ones and 13 lncRNAs show significant differential expression between the wild-type and Klf1 knockout conditions.
Conclusions
Our method can predict a set of novel lncRNAs from RNA-Seq data. Some of the lncRNAs are shown to be differentially expressed between the wild-type and Klf1 knockout strains, suggesting that those novel lncRNAs can be given high priority in further functional studies.
doi:10.1186/1471-2105-13-331
PMCID: PMC3577497  PMID: 23237380
12.  Tissue-specific prediction of directly regulated genes 
Bioinformatics  2011;27(17):2354-2360.
Direct binding by a transcription factor (TF) to the proximal promoter of a gene is strong evidence that the TF regulates the gene. Assaying the genome-wide binding of every TF in every cell type and condition is currently impractical. Histone modifications correlate with tissue/cell/condition-specific (‘tissue specific’) TF binding, so histone ChIP-seq data can be combined with traditional position weight matrix (PWM) methods to make tissue-specific predictions of TF–promoter interactions.
Results: We use supervised learning to train a naïve Bayes predictor of TF–promoter binding. The predictor's features are the histone modification levels and a PWM-based score for the promoter. Training and testing uses sets of promoters labeled using TF ChIP-seq data, and we use cross-validation on 23 such datasets to measure the accuracy. A PWM+histone naïve Bayes predictor using a single histone modification (H3K4me3) is substantially more accurate than a PWM score or a conservation-based score (phylogenetic motif model). The naïve Bayes predictor is more accurate (on average) at all sensitivity levels, and makes only half as many false positive predictions at sensitivity levels from 10% to 80%. On average, it correctly predicts 80% of bound promoters at a false positive rate of 20%. Accuracy does not diminish when we test the predictor in a different cell type (and species) from training. Accuracy is barely diminished even when we train the predictor without using TF ChIP-seq data.
Availability: Our tissue-specific predictor of promoters bound by a TF is called Dr Gene and is available at http://bioinformatics.org.au/drgene.
Contact: t.bailey@imb.uq.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr399
PMCID: PMC3157924  PMID: 21724591
13.  Inferring direct DNA binding from ChIP-seq 
Nucleic Acids Research  2012;40(17):e128.
Genome-wide binding data from transcription factor ChIP-seq experiments is the best source of information for inferring the relative DNA-binding affinity of these proteins in vivo. However, standard motif enrichment analysis and motif discovery approaches sometimes fail to correctly identify the binding motif for the ChIP-ed factor. To overcome this problem, we propose ‘central motif enrichment analysis’ (CMEA), which is based on the observation that the positional distribution of binding sites matching the direct-binding motif tends to be unimodal, well centered and maximal in the precise center of the ChIP-seq peak regions. We describe a novel visualization and statistical analysis tool—CentriMo—that identifies the region of maximum central enrichment in a set of ChIP-seq peak regions and displays the positional distributions of predicted sites. Using CentriMo for motif enrichment analysis, we provide evidence that one transcription factor (Nanog) has different binding affinity in vivo than in vitro, that another binds DNA cooperatively (E2f1), and confirm the in vivo affinity of NFIC, rescuing a difficult ChIP-seq data set. In another data set, CentriMo strongly suggests that there is no evidence of direct DNA binding by the ChIP-ed factor (Smad1). CentriMo is now part of the MEME Suite software package available at http://meme.nbcr.net. All data and output files presented here are available at: http://research.imb.uq.edu.au/t.bailey/sd/Bailey2011a.
doi:10.1093/nar/gks433
PMCID: PMC3458523  PMID: 22610855
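The idea behind central motif enrichment analysis in the abstract above, that sites for the directly bound motif pile up at the centre of ChIP-seq peak regions, can be illustrated with a one-sided binomial test. This is a simplified sketch, not CentriMo itself: it tests a single window width, ignores motif width, and assumes a uniform null distribution of site positions. The function names and toy offsets are invented for the example.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def central_enrichment(site_offsets, region_length, window):
    """Count best sites within a central window and test against uniformity.

    site_offsets: signed distance (bp) of each region's best motif site
                  from the centre of that region.
    Returns (sites_in_window, p_value) under the assumption that a site is
    equally likely to fall anywhere in the region.
    """
    n = len(site_offsets)
    k = sum(1 for offset in site_offsets if abs(offset) <= window / 2)
    p_null = window / region_length  # chance of landing in the window
    return k, binomial_tail(k, n, p_null)


# Toy data: 10 regions of 100 bp; most best sites fall near the centre.
offsets = [-3, 2, 0, 5, -8, 40, 1, -2, 30, 4]
in_window, p_value = central_enrichment(offsets, region_length=100, window=20)
```

A tool that searches for the region of maximum central enrichment would repeat such a test over many window widths and correct for the multiple tests, which this sketch does not attempt.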
14.  Sorting the nuclear proteome 
Bioinformatics  2011;27(13):i7-i14.
Motivation: Quantitative experimental analyses of the nuclear interior reveal a morphologically structured yet dynamic mix of membraneless compartments. Major nuclear events depend on the functional integrity and timely assembly of these intra-nuclear compartments. Yet, unknown drivers of protein mobility ensure that they are in the right place at the time when they are needed.
Results: This study investigates determinants of associations between eight intra-nuclear compartments and their proteins in heterogeneous genome-wide data. We develop a model based on a range of candidate determinants, capable of mapping the intra-nuclear organization of proteins. The model integrates protein interactions, protein domains, post-translational modification sites and protein sequence data. The predictions of our model are accurate with a mean AUC (over all compartments) of 0.71.
We present a complete map of the association of 3567 mouse nuclear proteins with intra-nuclear compartments. Each decision is explained in terms of essential interactions and domains, and qualified with a false discovery assessment. Using this resource, we uncover the collective role of transcription factors in each of the compartments. We create diagrams illustrating the outcomes of a Gene Ontology enrichment analysis. Associated with an extensive range of transcription factors, the analysis suggests that PML bodies coordinate regulatory immune responses.
Contact: m.boden@uq.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr217
PMCID: PMC3117375  PMID: 21685104
15.  Inferring transcription factor complexes from ChIP-seq data 
Nucleic Acids Research  2011;39(15):e98.
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) allows researchers to determine the genome-wide binding locations of individual transcription factors (TFs) at high resolution. This information can be interrogated to study various aspects of TF behaviour, including the mechanisms that control TF binding. Physical interaction between TFs comprises one important aspect of TF binding in eukaryotes, mediating tissue-specific gene expression. We have developed an algorithm, spaced motif analysis (SpaMo), which is able to infer physical interactions between the given TF and TFs bound at neighbouring sites at the DNA interface. The algorithm predicts TF interactions in half of the ChIP-seq data sets we test, with the majority of these predictions supported by direct evidence from the literature or evidence of homodimerization. High resolution motif spacing information obtained by this method can facilitate an improved understanding of individual TF complex structures. SpaMo can assist researchers in extracting maximum information relating to binding mechanisms from their TF ChIP-seq data. SpaMo is available for download and interactive use as part of the MEME Suite (http://meme.nbcr.net).
doi:10.1093/nar/gkr341
PMCID: PMC3159476  PMID: 21602262
16.  DREME: motif discovery in transcription factor ChIP-seq data 
Bioinformatics  2011;27(12):1653-1659.
Motivation: Transcription factor (TF) ChIP-seq datasets have particular characteristics that provide unique challenges and opportunities for motif discovery. Most existing motif discovery algorithms do not scale well to such large datasets, or fail to report many motifs associated with cofactors of the ChIP-ed TF.
Results: We present DREME, a motif discovery algorithm specifically designed to find the short, core DNA-binding motifs of eukaryotic TFs, and optimized to analyze very large ChIP-seq datasets in minutes. Using DREME, we discover the binding motifs of the ChIP-ed TF and many cofactors in mouse ES cell (mESC), mouse erythrocyte and human cell line ChIP-seq datasets. For example, in mESC ChIP-seq data for the TF Esrrb, we discover the binding motifs for eight cofactor TFs important in the maintenance of pluripotency. Several other commonly used algorithms find at most two cofactor motifs in this same dataset. DREME can also perform discriminative motif discovery, and we use this feature to provide evidence that Sox2 and Oct4 do not bind in mES cells as an obligate heterodimer. DREME is much faster than many commonly used algorithms, scales linearly in dataset size, finds multiple, non-redundant motifs and reports a reliable measure of statistical significance for each motif found. DREME is available as part of the MEME Suite of motif-based sequence analysis tools (http://meme.nbcr.net).
Contact: t.bailey@uq.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr261
PMCID: PMC3106199  PMID: 21543442
17.  MEME-ChIP: motif analysis of large DNA datasets 
Bioinformatics  2011;27(12):1696-1697.
Motivation: Advances in high-throughput sequencing have resulted in rapid growth in large, high-quality datasets including those arising from transcription factor (TF) ChIP-seq experiments. While there are many existing tools for discovering TF binding site motifs in such datasets, most web-based tools cannot directly process such large datasets.
Results: The MEME-ChIP web service is designed to analyze ChIP-seq ‘peak regions’—short genomic regions surrounding declared ChIP-seq ‘peaks’. Given a set of genomic regions, it performs (i) ab initio motif discovery, (ii) motif enrichment analysis, (iii) motif visualization, (iv) binding affinity analysis and (v) motif identification. It runs two complementary motif discovery algorithms on the input data—MEME and DREME—and uses the motifs they discover in subsequent visualization, binding affinity and identification steps. MEME-ChIP also performs motif enrichment analysis using the AME algorithm, which can detect very low levels of enrichment of binding sites for TFs with known DNA-binding motifs. Importantly, unlike with the MEME web service, there is no restriction on the size or number of uploaded sequences, allowing very large ChIP-seq datasets to be analyzed. The analyses performed by MEME-ChIP provide the user with a varied view of the binding and regulatory activity of the ChIP-ed TF, as well as the possible involvement of other DNA-binding TFs.
Availability: MEME-ChIP is available as part of the MEME Suite at http://meme.nbcr.net.
Contact: t.bailey@uq.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr189
PMCID: PMC3106185  PMID: 21486936
18.  FIMO: scanning for occurrences of a given motif 
Bioinformatics  2011;27(7):1017-1018.
Summary: A motif is a short DNA or protein sequence that contributes to the biological function of the sequence in which it resides. Over the past several decades, many computational methods have been described for identifying, characterizing and searching with sequence motifs. Critical to nearly any motif-based sequence analysis pipeline is the ability to scan a sequence database for occurrences of a given motif described by a position-specific frequency matrix.
Results: We describe Find Individual Motif Occurrences (FIMO), a software tool for scanning DNA or protein sequences with motifs described as position-specific scoring matrices. The program computes a log-likelihood ratio score for each position in a given sequence database, uses established dynamic programming methods to convert this score to a P-value and then applies false discovery rate analysis to estimate a q-value for each position in the given sequence. FIMO provides output in a variety of formats, including HTML, XML and several Santa Cruz Genome Browser formats. The program is efficient, allowing for the scanning of DNA sequences at a rate of 3.5 Mb/s on a single CPU.
Availability and Implementation: FIMO is part of the MEME Suite software toolkit. A web server and source code are available at http://meme.sdsc.edu.
Contact: t.bailey@imb.uq.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr064
PMCID: PMC3065696  PMID: 21330290
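The scoring step described in the FIMO abstract above (a log-likelihood ratio score at each position of a sequence) can be sketched as follows. This covers only the scan itself on one strand. The conversion of scores to P-values and q-values is not shown, and the toy motif, uniform background and function names are assumptions for illustration rather than FIMO's actual code.

```python
from math import log2


def to_log_odds(pwm, background):
    """Convert per-column base probabilities to log2 odds against a background."""
    return [{base: log2(column[base] / background[base]) for base in column}
            for column in pwm]


def scan(sequence, log_odds):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(log_odds)
    return [
        (start, sum(log_odds[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


# Toy three-column motif preferring "AGC"; no zero entries, so logs are defined.
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

hits = scan("TTAGCTT", to_log_odds(pwm, background))
best_start, best_score = max(hits, key=lambda hit: hit[1])
```

The best-scoring window starts at index 2, where the sequence reads "AGC". Real motif matrices usually need pseudocounts before the log is taken so that unseen bases do not produce infinite scores.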
19.  Dual-functioning transcription factors in the developmental gene network of Drosophila melanogaster 
BMC Bioinformatics  2010;11:366.
Background
Quantitative models for transcriptional regulation have shown great promise for advancing our understanding of the biological mechanisms underlying gene regulation. However, all of the models to date assume a transcription factor (TF) to have either activating or repressing function towards all the genes it is regulating.
Results
In this paper we demonstrate, using the example of the developmental gene network in D. melanogaster, that the data-fit can be improved by up to 40% if the model allows certain TFs to have dual function, that is, to act as an activator for some genes and as a repressor for others. We demonstrate that the improvement is not due to additional flexibility in the model but rather derived from the data itself. We also found no evidence for the involvement of other known site-specific TFs in regulating this network. Finally, we propose SUMOylation as a candidate biological mechanism allowing TFs to switch their role when a small ubiquitin-like modifier (SUMO) is covalently attached to the TF. We strengthen this hypothesis by demonstrating that the TFs predicted to have dual function also contain the known SUMO consensus motif, while TFs predicted to have only one role lack this motif.
Conclusions
We argue that a SUMOylation-dependent mechanism allowing TFs to have dual function represents a promising area for further research and might be another step towards uncovering the biological mechanisms underlying transcriptional regulation.
doi:10.1186/1471-2105-11-366
PMCID: PMC2912886  PMID: 20594356
20.  The value of position-specific priors in motif discovery using MEME 
BMC Bioinformatics  2010;11:179.
Background
Position-specific priors have been shown to be a flexible and elegant way to extend the power of Gibbs sampler-based motif discovery algorithms. Information of many types–including sequence conservation, nucleosome positioning, and negative examples–can be converted into a prior over the location of motif sites, which then guides the sequence motif discovery algorithm. This approach has been shown to confer many of the benefits of conservation-based and discriminative motif discovery approaches on Gibbs sampler-based motif discovery methods, but has not previously been studied with methods based on expectation maximization (EM).
Results
We extend the popular EM-based MEME algorithm to utilize position-specific priors and demonstrate their effectiveness for discovering transcription factor (TF) motifs in yeast and mouse DNA sequences. Utilizing a discriminative, conservation-based prior dramatically improves MEME's ability to discover motifs in 156 yeast TF ChIP-chip datasets, more than doubling the number of datasets where it finds the correct motif. On these datasets, MEME using the prior has a higher success rate than eight other conservation-based motif discovery approaches. We also show that the same type of prior improves the accuracy of motifs discovered by MEME in mouse TF ChIP-seq data, and that the motifs tend to be of slightly higher quality than those found by a Gibbs sampling algorithm using the same prior.
Conclusions
We conclude that using position-specific priors can substantially increase the power of EM-based motif discovery algorithms such as MEME.
doi:10.1186/1471-2105-11-179
PMCID: PMC2868008  PMID: 20380693
21.  Motif Enrichment Analysis: a unified framework and an evaluation on ChIP data 
BMC Bioinformatics  2010;11:165.
Background
A major goal of molecular biology is determining the mechanisms that control the transcription of genes. Motif Enrichment Analysis (MEA) seeks to determine which DNA-binding transcription factors control the transcription of a set of genes by detecting enrichment of known binding motifs in the genes' regulatory regions. Typically, the biologist specifies a set of genes believed to be co-regulated and a library of known DNA-binding models for transcription factors, and MEA determines which (if any) of the factors may be direct regulators of the genes. Since the number of factors with known DNA-binding models is rapidly increasing as a result of high-throughput technologies, MEA is becoming increasingly useful. In this paper, we explore ways to make MEA applicable in more settings, and evaluate the efficacy of a number of MEA approaches.
Results
We first define a mathematical framework for Motif Enrichment Analysis that relaxes the requirement that the biologist input a selected set of genes. Instead, the input consists of all regulatory regions, each labeled with the level of a biological signal. We then define and implement a number of motif enrichment analysis methods. Some of these methods require a user-specified signal threshold, some identify an optimum threshold in a data-driven way and two of our methods are threshold-free. We evaluate these methods, along with two existing methods (Clover and PASTAA), using yeast ChIP-chip data. Our novel threshold-free method based on linear regression performs best in our evaluation, followed by the data-driven PASTAA algorithm. The Clover algorithm performs as well as PASTAA if the user-specified threshold is chosen optimally. Data-driven methods based on three statistical tests (Fisher Exact Test, rank-sum test, and multi-hypergeometric test) perform poorly, even when the threshold is chosen optimally. These methods (and Clover) perform even worse when unrestricted data-driven threshold determination is used.
Conclusions
Our novel, threshold-free linear regression method works well on ChIP-chip data. Methods using data-driven threshold determination can perform poorly unless the range of thresholds is limited a priori. The limits implemented in PASTAA, however, appear to be well-chosen. Our novel algorithms—AME (Analysis of Motif Enrichment)—are available at http://bioinformatics.org.au/ame/.
doi:10.1186/1471-2105-11-165
PMCID: PMC2868005  PMID: 20356413
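One of the statistical tests named in the Motif Enrichment Analysis abstract above, the Fisher Exact Test, can be sketched as follows for the thresholded setting: regions above a signal threshold form the selected set, and the test asks whether motif-containing regions are over-represented in it. This is a minimal sketch under assumptions, not the AME implementation; the function name, argument names and counts are hypothetical.

```python
from math import comb

def fisher_enrichment_p(hits_in_set, set_size, hits_total, population):
    """One-sided Fisher exact P-value: P(X >= hits_in_set) under the hypergeometric null.

    population  : number of regulatory regions considered
    hits_total  : regions that contain the motif
    set_size    : regions whose signal exceeds the chosen threshold
    hits_in_set : regions in that set that contain the motif
    """
    upper = min(set_size, hits_total)
    denominator = comb(population, set_size)
    tail = sum(
        comb(hits_total, k) * comb(population - hits_total, set_size - k)
        for k in range(hits_in_set, upper + 1)
    )
    return tail / denominator

# Hypothetical counts: 10 regions, 5 contain the motif, and all 5 selected regions do.
p_value = fisher_enrichment_p(hits_in_set=5, set_size=5, hits_total=5, population=10)
```

The P-value depends on where the signal threshold is placed, since the threshold determines `set_size` and `hits_in_set`; this is the dependence that the threshold-free methods in the abstract are designed to avoid.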
22.  Assigning roles to DNA regulatory motifs using comparative genomics 
Bioinformatics  2010;26(7):860-866.
Motivation: Transcription factors (TFs) are crucial during the lifetime of the cell. Their functional roles are defined by the genes they regulate. Uncovering these roles not only sheds light on the TF at hand but puts it into the context of the complete regulatory network.
Results: Here, we present an alignment- and threshold-free comparative genomics approach for assigning functional roles to DNA regulatory motifs. We incorporate our approach into the Gomo algorithm, a computational tool for detecting associations between a user-specified DNA regulatory motif [expressed as a position weight matrix (PWM)] and Gene Ontology (GO) terms. Incorporating multiple species into the analysis significantly improves Gomo's ability to identify GO terms associated with the regulatory targets of TFs. Including three comparative species in the process of predicting TF roles in Saccharomyces cerevisiae and Homo sapiens increases the number of significant predictions by 75 and 200%, respectively. The predicted GO terms are also more specific, yielding deeper biological insight into the role of the TF. Adjusting motif (binding) affinity scores for individual sequence composition proves to be essential for avoiding false positive associations. We describe a novel DNA sequence-scoring algorithm that compensates a thermodynamic measure of DNA-binding affinity for individual sequence base composition. Gomo's prediction accuracy proves to be relatively insensitive to how promoters are defined. Because Gomo uses a threshold-free form of gene set analysis, there are no free parameters to tune. Biologists can investigate the potential roles of DNA regulatory motifs of interest using Gomo via the web (http://meme.nbcr.net).
Contact: t.bailey@uq.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq049
PMCID: PMC2844991  PMID: 20147307
23.  Assessing phylogenetic motif models for predicting transcription factor binding sites 
Bioinformatics  2009;25(12):i339-i347.
Motivation: A variety of algorithms have been developed to predict transcription factor binding sites (TFBSs) within the genome by exploiting the evolutionary information implicit in multiple alignments of the genomes of related species. One such approach uses an extension of the standard position-specific motif model that incorporates phylogenetic information via a phylogenetic tree and a model of evolution. However, these phylogenetic motif models (PMMs) have never been rigorously benchmarked in order to determine whether they lead to better prediction of TFBSs than obtained using simple position weight matrix scanning.
Results: We evaluate three PMM-based prediction algorithms, each of which uses a different treatment of gapped alignments, and we compare their prediction accuracy with that of a non-phylogenetic motif scanning approach. Surprisingly, all of these algorithms appear to be inferior to simple motif scanning, when accuracy is measured using a gold standard of validated yeast TFBSs. However, the PMM scanners perform much better than simple motif scanning when we abandon the gold standard and consider the number of statistically significant sites predicted, using column-shuffled ‘random’ motifs to measure significance. These results suggest that the common practice of measuring the accuracy of binding site predictors using collections of known sites may be dangerously misleading since such collections may be missing ‘weak’ sites, which are exactly the type of sites needed to discriminate among predictors. We then extend our previous theoretical model of the statistical power of PMM-based prediction algorithms to allow for loss of binding sites during evolution, and show that it gives a more accurate upper bound on scanner accuracy. Finally, utilizing our theoretical model, we introduce a new method for predicting the number of real binding sites in a genome. The results suggest that the number of true sites for a yeast TF is in general several times greater than the number of known sites listed in the Saccharomyces cerevisiae Promoter Database (SCPD). Among the three scanning algorithms that we test, the MONKEY algorithm has the highest accuracy for predicting yeast TFBSs.
Contact: j.hawkins@imb.uq.edu.au
doi:10.1093/bioinformatics/btp201
PMCID: PMC2687955  PMID: 19478008
24.  MEME Suite: tools for motif discovery and searching 
Nucleic Acids Research  2009;37(Web Server issue):W202-W208.
The MEME Suite web server provides a unified portal for online discovery and analysis of sequence motifs representing features such as DNA binding sites and protein interaction domains. The popular MEME motif discovery algorithm is now complemented by the GLAM2 algorithm which allows discovery of motifs containing gaps. Three sequence scanning algorithms—MAST, FIMO and GLAM2SCAN—allow scanning numerous DNA and protein sequence databases for motifs discovered by MEME and GLAM2. Transcription factor motifs (including those discovered using MEME) can be compared with motifs in many popular motif databases using the motif database scanning algorithm Tomtom. Transcription factor motifs can be further analyzed for putative function by association with Gene Ontology (GO) terms using the motif-GO term association tool GOMO. MEME output now contains sequence LOGOS for each discovered motif, as well as buttons to allow motifs to be conveniently submitted to the sequence and motif database scanning algorithms (MAST, FIMO and Tomtom), or to GOMO, for further analysis. GLAM2 output similarly contains buttons for further analysis using GLAM2SCAN and for rerunning GLAM2 with different parameters. All of the motif-based tools are now implemented as web services via Opal. Source code, binaries and a web server are freely available for noncommercial use at http://meme.nbcr.net.
doi:10.1093/nar/gkp335
PMCID: PMC2703892  PMID: 19458158
25.  Optimizing static thermodynamic models of transcriptional regulation 
Bioinformatics  2009;25(13):1640-1646.
Motivation: Modeling transcriptional regulation using thermodynamic modeling approaches has become increasingly relevant as a way to gain a detailed understanding of transcriptional regulation. Thermodynamic models are able to model the interactions between transcription factors (TFs) and DNA that lead to a specific transcriptional output of the target gene. Such models can be ‘trained’ by fitting their free parameters to data on the transcription rate of a gene and the concentrations of its regulating factors. However, the parameter fitting process is computationally very expensive and this limits the number of alternative types of model that can be explored.
Results: In this study, we evaluate the ‘optimization landscape’ of a class of static, quantitative models of regulation and explore the efficiency of a range of optimization methods. We evaluate eight optimization methods: two variants of simulated annealing (SA), four variants of gradient descent (GD), a hybrid SA/GD algorithm and a genetic algorithm. We show that the optimization landscape has numerous local optima, resulting in poor performance for the GD methods. SA with a simple geometric cooling schedule performs best among all tested methods. In particular, we see no advantage to using the more sophisticated ‘LAM’ cooling schedule. Overall, a good approximate solution is achievable in minutes using SA with a simple cooling schedule.
Contact: d.bauer@uq.edu.au; t.bailey@imb.uq.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp283
PMCID: PMC2732318  PMID: 19398449
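The best-performing optimizer in the abstract above, simulated annealing with a simple geometric cooling schedule, can be sketched on a toy problem. This is an illustrative sketch, not the authors' code: the one-dimensional energy function, the parameter values (`t0`, `alpha`, `steps`, `step_size`) and the function names are assumptions, and the energy function merely stands in for a model's fitting error on a landscape with several local optima.

```python
import math
import random

def simulated_annealing(energy, start, t0=1.0, alpha=0.99, steps=2000,
                        step_size=0.5, seed=0):
    """Minimize `energy` with Metropolis acceptance and geometric cooling."""
    rng = random.Random(seed)
    current = best = start
    current_e = best_e = energy(start)
    temperature = t0
    for _ in range(steps):
        candidate = current + rng.uniform(-step_size, step_size)
        candidate_e = energy(candidate)
        delta = candidate_e - current_e
        # Always accept improvements; accept worse moves with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = candidate, candidate_e
            if current_e < best_e:
                best, best_e = current, current_e
        # Geometric cooling: multiply the temperature by a constant factor each step.
        temperature *= alpha
    return best, best_e

def toy_energy(x):
    """Hypothetical rugged landscape with several local minima."""
    return x * x + 3.0 * math.sin(5.0 * x)

start = 4.0
best_x, best_energy = simulated_annealing(toy_energy, start)
```

Because worse moves are accepted with a probability that shrinks as the temperature falls, the search can leave a local optimum early on and settles later, which is the property that plain gradient descent lacks on such landscapes.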
