Related Articles

1.  PhyloGibbs: A Gibbs Sampling Motif Finder That Incorporates Phylogeny 
PLoS Computational Biology  2005;1(7):e67.
A central problem in the bioinformatics of gene regulation is to find the binding sites for regulatory proteins. One of the most promising approaches toward identifying these short and fuzzy sequence patterns is the comparative analysis of orthologous intergenic regions of related species. This analysis is complicated by various factors. First, one needs to take the phylogenetic relationship between the species into account in order to distinguish conservation that is due to the occurrence of functional sites from spurious conservation that is due to evolutionary proximity. Second, one has to deal with the complexities of multiple alignments of orthologous intergenic regions, and one has to consider the possibility that functional sites may occur outside of conserved segments. Here we present a new motif sampling algorithm, PhyloGibbs, that runs on arbitrary collections of multiple local sequence alignments of orthologous sequences. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors (TFs) can be assigned to the multiple sequence alignments. These binding site configurations are scored by a Bayesian probabilistic model that treats aligned sequences by a model for the evolution of binding sites and “background” intergenic DNA. This model takes the phylogenetic relationship between the species in the alignment explicitly into account. The algorithm uses simulated annealing and Monte Carlo Markov-chain sampling to rigorously assign posterior probabilities to all the binding sites that it reports. In tests on synthetic data and real data from five Saccharomyces species our algorithm performs significantly better than four other motif-finding algorithms, including algorithms that also take phylogeny into account. Our results also show that, in contrast to the other algorithms, PhyloGibbs can make realistic estimates of the reliability of its predictions. 
Our tests suggest that, running on the five-species multiple alignment of a single gene's upstream region, PhyloGibbs on average recovers over 50% of all binding sites in S. cerevisiae at a specificity of about 50%, and 33% of all binding sites at a specificity of about 85%. We also tested PhyloGibbs on collections of multiple alignments of intergenic regions that were recently annotated, based on ChIP-on-chip data, to contain binding sites for the same TF. We compared PhyloGibbs's results with the previous analysis of these data using six other motif-finding algorithms. For 16 of 21 TFs for which all other motif-finding methods failed to find a significant motif, PhyloGibbs did recover a motif that matches the literature consensus. In 11 cases where there was disagreement in the results we compiled lists of known target genes from the literature, and found that running PhyloGibbs on their regulatory regions yielded a binding motif matching the literature consensus in all but one of the cases. Interestingly, these literature gene lists had little overlap with the targets annotated based on the ChIP-on-chip data. The PhyloGibbs code can be downloaded from http://www.biozentrum.unibas.ch/~nimwegen/cgi-bin/phylogibbs.cgi or http://www.imsc.res.in/~rsidd/phylogibbs. The full set of predicted sites from our tests on yeast is available at http://www.swissregulon.unibas.ch.
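As a rough illustration of the Gibbs sampling idea that PhyloGibbs builds on, the following Python sketch implements the classic one-site-per-sequence motif sampler. It is not the PhyloGibbs algorithm: it ignores phylogeny, multiple alignments, multiple TFs and simulated annealing. The function names (`column_counts`, `site_weight`, `gibbs_site_sampler`), the uniform background and the pseudocount of 1 are assumptions made for this example only.

```python
import random

BASES = "ACGT"

def column_counts(sites, width, pseudo=1.0):
    """Per-column base counts of the current sites, with a pseudocount (assumed value)."""
    counts = [{b: pseudo for b in BASES} for _ in range(width)]
    for site in sites:
        for i, base in enumerate(site):
            counts[i][base] += 1.0
    return counts

def site_weight(window, counts, background=0.25):
    """Likelihood ratio of `window` under the motif model versus a uniform background."""
    weight = 1.0
    for i, base in enumerate(window):
        total = sum(counts[i].values())
        weight *= (counts[i][base] / total) / background
    return weight

def gibbs_site_sampler(seqs, width, sweeps=200, seed=0):
    """Sample one motif start per sequence; returns the final list of start positions."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for k, seq in enumerate(seqs):
            # Build the motif model from every sequence except the one being resampled.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(seqs, starts)) if j != k]
            counts = column_counts(others, width)
            # Resample the start in sequence k in proportion to its likelihood ratio.
            weights = [site_weight(seq[p:p + width], counts)
                       for p in range(len(seq) - width + 1)]
            starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Because the sampler is stochastic, a single run gives one draw of site positions rather than a reliability estimate; the posterior probabilities described in the abstract require tracking sites over many samples, which this sketch does not attempt.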
Synopsis
Computational discovery of regulatory sites in intergenic DNA is one of the central problems in bioinformatics. Up until recently motif finders would typically take one of the following two general approaches. Given a known set of co-regulated genes, one searches their promoter regions for significantly overrepresented sequence motifs. Alternatively, in a “phylogenetic footprinting” approach one searches multiple alignments of orthologous intergenic regions for short segments that are significantly more conserved than expected based on the phylogeny of the species.
In this work the authors present an algorithm, PhyloGibbs, that combines these two approaches into one integrated Bayesian framework. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors can be assigned to arbitrary collections of multiple sequence alignments while taking into account the phylogenetic relations between the sequences.
The authors perform a number of tests on synthetic data and real data from Saccharomyces genomes in which PhyloGibbs significantly outperforms other existing methods. Finally, a novel anneal-and-track strategy allows PhyloGibbs to make accurate estimates of the reliability of its predictions.
doi:10.1371/journal.pcbi.0010067
PMCID: PMC1309704  PMID: 16477324
2.  Combinatorial depletion analysis to assemble the network architecture of the SAGA and ADA chromatin remodeling complexes 
A combinatorial depletion strategy is combined with biochemistry, quantitative proteomics and computational approaches to elucidate the structure of the SAGA/ADA complexes. The analysis reveals five connected functional modules capable of independent assembly.
A combinatorial approach of gene depletions with multiple bait proteins coupled with biochemical, proteomic and computational approaches can experimentally determine modules of stable multi-protein complexes. SAGA is a 19-subunit complex consisting of five connected modules with Spt20 being particularly important for the assembly of the intact complex. One of the modules, the HAT/Core module, is also shared with the distinct six-subunit complex ADA. Architectural models of large multi-protein complexes can be assembled using our approach, which is an alternative method to generate novel insight into the organization and architecture of multi-protein complexes.
Determining the architectures of protein complexes improves our understanding of protein cellular functions. In order to efficiently characterize the subunits of protein complexes assembled in vivo, affinity purification followed by proteomics mass spectrometry (APMS) strategies have been devised. Partial or whole protein complexes are first biochemically isolated using tagged components of the complex, followed by an identification of all co-purified proteins using mass spectrometry. However, those approaches are insufficient to provide information about the spatial arrangement and the interrelationship of the proteins of the respective complex.
In this study, we developed and applied a novel method utilizing biochemistry, quantitative proteomics and computational approaches in order to characterize the organization of proteins in a complex. The key of our method is the systematic purification of several tagged components of the protein complex in multiple genetic deletion strains, which serve to compromise the integrity of the complex. Using a series of computational methods, these raw quantitative values are next interpreted in order to determine the modular organization of the complex as well as the interrelationships between its subunits, which in turn can be used to predict a macromolecular model of the complex.
We tested this approach to obtain novel insights into the architecture of multi-protein complexes on the Saccharomyces cerevisiae Spt–Ada–Gcn5 histone acetyltransferase (HAT) (SAGA) and ADA complexes, which are conserved complexes involved in chromatin remodeling (Koutelou et al, 2010). Regular quantitative APMS strategies in wild-type backgrounds were not sufficient to separate tight protein complexes like SAGA/ADA into its distinct modules. However, after perturbing the system using genetic deletions of several subunits located in different topological parts of SAGA, hierarchical cluster analysis performed on 34 purifications (generated using 10 different TAP-tagged baits) resulted in a dissociation of the Gcn5 HAT complexes into five modules: (1) the SA_TAF module, (2) the SA_SPT module, (3) the DUB module, (4) the HAT/Core module and (5) the ADA module (Figure 2A and B).
The approach of purifying a protein in a deletion strain furthermore provides valuable information about the influence of the deleted subunit on the association and interdependency of the bait and the remaining preys. In order to quantify these associations, we calculated a probability between every prey and bait in the deletion strain purifications based on Bayes' theorem (Sardiu et al, 2008). In conjunction with preexisting interaction data obtained from yeast two-hybrid and genetic complementation assays, we finally used these probabilities to predict a low-resolution model for the architecture of the SAGA and ADA complexes (Figure 4).
This novel approach revealed that the SAGA/ADA complexes are composed of five distinct functional modules, of which two were not previously described (SA_SPT and SA_TAF). These modules, which are responsible for different functions of the SAGA complex, are capable of assembling independently from the remaining modules of the complex. Furthermore, we identified a novel subunit of the ADA complex, termed Ahc2, and characterized Sgf29 as an ADA family protein present in all Gcn5 HAT complexes. Compared with other structural studies, which mapped 9 of the 19 known SAGA subunits using single EM reconstruction (Wu et al, 2004) or resolved the structure of the 4 subunits of the DUB module using X-ray crystallography (Kohler et al, 2010; Samara et al, 2010), our approach is not limited to a maximum number of complex subunits. Consequently, we were able to construct a macromolecular model consisting of all 21 SAGA/ADA subunits, which bridges the gap between the previous limited EM analysis and focused X-ray crystallography analysis.
Despite the availability of several large-scale proteomics studies aiming to identify protein interactions on a global scale, little is known about how proteins interact and are organized within macromolecular complexes. Here, we describe a technique that consists of a combination of biochemistry approaches, quantitative proteomics and computational methods using wild-type and deletion strains to investigate the organization of proteins within macromolecular protein complexes. We applied this technique to determine the organization of two well-studied complexes, Spt–Ada–Gcn5 histone acetyltransferase (SAGA) and ADA, for which no comprehensive high-resolution structures exist. This approach revealed that SAGA/ADA is composed of five distinct functional modules, which can persist separately. Furthermore, we identified a novel subunit of the ADA complex, termed Ahc2, and characterized Sgf29 as an ADA family protein present in all Gcn5 histone acetyltransferase complexes. Finally, we propose a model for the architecture of the SAGA and ADA complexes, which predicts novel functional associations within the SAGA complex and provides mechanistic insights into phenotypical observations in SAGA mutants.
doi:10.1038/msb.2011.40
PMCID: PMC3159981  PMID: 21734642
ADA; architecture; protein interaction network; quantitative proteomics; SAGA
3.  Metamotifs - a generative model for building families of nucleotide position weight matrices 
BMC Bioinformatics  2010;11:348.
Background
Development of high-throughput methods for measuring DNA interactions of transcription factors together with computational advances in short motif inference algorithms is expanding our understanding of transcription factor binding site motifs. The consequential growth of sequence motif data sets makes it important to systematically group and categorise regulatory motifs. It has been shown that there are familial tendencies in DNA sequence motifs that are predictive of the family of factors that binds them. Further development of methods that detect and describe familial motif trends has the potential to help in measuring the similarity of novel computational motif predictions to previously known data and sensitively detecting regulatory motifs similar to previously known ones from novel sequence.
Results
We propose a probabilistic model for position weight matrix (PWM) sequence motif families. The model, which we call the 'metamotif', describes recurring familial patterns in a set of motifs. The metamotif framework models variation within a family of sequence motifs. It allows for simultaneous estimation of a series of independent metamotifs from input position weight matrix (PWM) motif data and does not assume that all input motif columns contribute to a familial pattern. We describe an algorithm for inferring metamotifs from weight matrix data. We then demonstrate the use of the model in two practical tasks: in the Bayesian NestedMICA model inference algorithm as a PWM prior to enhance motif inference sensitivity, and in a motif classification task where motifs are labelled according to their interacting DNA binding domain.
Conclusions
We show that metamotifs can be used as PWM priors in the NestedMICA motif inference algorithm to dramatically increase motif inference sensitivity. Metamotifs were also successfully applied to a motif classification problem where sequence motif features were used to predict the family of protein DNA binding domains that would interact with it. The metamotif based classifier is shown to compare favourably to previous related methods. The metamotif has great potential for further use in machine learning tasks, especially those related to de novo computational sequence motif inference. The metamotif methods presented have been incorporated into the NestedMICA suite.
doi:10.1186/1471-2105-11-348
PMCID: PMC2906491  PMID: 20579334
4.  Alteration of the carboxyl-terminal domain of Ada protein influences its inducibility, specificity, and strength as a transcriptional activator. 
Journal of Bacteriology  1988;170(11):5263-5271.
The ada gene of Escherichia coli K-12 encodes the regulatory protein for the adaptive response to alkylating agents. A set of plasmids carrying ordered deletions from the 3' end of the ada gene were isolated and characterized. These ada deletions encode fusion proteins that derive their amino termini from ada and their carboxyl termini from the downstream vector sequence that occurs before an in-frame stop codon. Several of these ada deletions encode Ada derivatives that constitutively activate ada transcription to very high levels. A second class of ada deletions encode Ada derivatives that are dominant inhibitors of the inducible transcription of ada but are inducible activators of alkA transcription. In addition, we found that two Ada derivatives containing the same ada sequences but fused to different vector-derived tails have strikingly different properties. One Ada derivative constitutively activates both ada and alkA expression to very high levels. In contrast, the other Ada derivative is an inducible activator of ada expression, like the wild-type Ada protein, but is not an inducible activator of alkA transcription. Our data suggest that the carboxyl terminus of the Ada protein plays a key role in modulating the ability of the Ada protein to function as a transcriptional activator.
PMCID: PMC211600  PMID: 3141384
5.  Tree-Based Position Weight Matrix Approach to Model Transcription Factor Binding Site Profiles 
PLoS ONE  2011;6(9):e24210.
Most of the position weight matrix (PWM) based bioinformatics methods developed to predict transcription factor binding sites (TFBS) assume each nucleotide in the sequence motif contributes independently to the interaction between protein and DNA sequence, usually producing many false-positive predictions. The increasing availability of TF enrichment profiles from recent ChIP-Seq methodology facilitates the investigation of dependent structure and accurate prediction of TFBSs. We develop a novel Tree-based PWM (TPWM) approach to accurately model the interaction between TF and its binding site. The whole tree-structured PWM could be considered as a mixture of different conditional-PWMs. We propose a discriminative approach, called TPD (TPWM based Discriminative Approach), to construct the TPWM from the ChIP-Seq data with a pre-existing PWM. To achieve the maximum discriminative power between the positive and negative datasets, the cutoff value is determined based on the Matthews correlation coefficient (MCC). The resulting TPWMs are evaluated with respect to accuracy on extensive synthetic datasets. We then apply our TPWM discriminative approach on several real ChIP-Seq datasets to refine the current TFBS models stored in the TRANSFAC database. Experiments on both the simulated and real ChIP-Seq data show that the proposed method starting from existing PWM has consistently better performance than existing tools in detecting the TFBSs. The improved accuracy is the result of modelling the complete dependent structure of the motifs and better prediction of true positive rate. The findings could lead to better understanding of the mechanisms of TF-DNA interactions.
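The cutoff selection mentioned above can be sketched in a few lines of Python. This is a minimal illustration of choosing a score threshold that maximizes the Matthews correlation coefficient between a positive and a negative set; it is not the TPD implementation, and the function names (`mcc`, `best_cutoff`) and the convention that a score at or above the cutoff counts as a predicted site are assumptions.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal is empty and the ratio is undefined.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def best_cutoff(pos_scores, neg_scores):
    """Return (best MCC, cutoff) over all observed scores used as thresholds."""
    best = (-2.0, None)
    for cutoff in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= cutoff for s in pos_scores)
        fn = len(pos_scores) - tp
        fp = sum(s >= cutoff for s in neg_scores)
        tn = len(neg_scores) - fp
        value = mcc(tp, fp, tn, fn)
        if value > best[0]:
            best = (value, cutoff)
    return best
```

For perfectly separated scores, for example positives `[5, 6, 7, 8]` and negatives `[1, 2, 3, 4]`, the sketch returns an MCC of 1.0 at cutoff 5.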
doi:10.1371/journal.pone.0024210
PMCID: PMC3166302  PMID: 21912677
6.  Increasing Coverage of Transcription Factor Position Weight Matrices through Domain-level Homology 
PLoS ONE  2012;7(8):e42779.
Transcription factor-DNA interactions, central to cellular regulation and control, are commonly described by position weight matrices (PWMs). These matrices are frequently used to predict transcription factor binding sites in regulatory regions of DNA to complement and guide further experimental investigation. The DNA sequence preferences of transcription factors, encoded in PWMs, are dictated primarily by select residues within the DNA binding domain(s) that interact directly with DNA. Therefore, the DNA binding properties of homologous transcription factors with identical DNA binding domains may be characterized by PWMs derived from different species. Accordingly, we have implemented a fully automated domain-level homology searching method for identical DNA binding sequences.
By applying the domain-level homology search to transcription factors with existing PWMs in the JASPAR and TRANSFAC databases, we were able to significantly increase coverage in terms of the total number of PWMs associated with a given species, assign PWMs to transcription factors that did not previously have any associations, and increase the number of represented species with PWMs over an order of magnitude. Additionally, using protein binding microarray (PBM) data, we have validated the domain-level method by demonstrating that transcription factor pairs with matching DNA binding domains exhibit comparable DNA binding specificity predictions to transcription factor pairs with completely identical sequences.
The increased coverage achieved herein demonstrates the potential for more thorough species-associated investigation of protein-DNA interactions using existing resources. The PWM scanning results highlight the challenging nature of transcription factors that contain multiple DNA binding domains, as well as the impact of motif discovery on the ability to predict DNA binding properties. The method is additionally suitable for identifying domain-level homology mappings to enable utilization of additional information sources in the study of transcription factors. The domain-level homology search method, resulting PWM mappings, web-based user interface, and web API are publicly available at http://dodoma.systemsbiology.net.
doi:10.1371/journal.pone.0042779
PMCID: PMC3428306  PMID: 22952610
7.  The Next Generation of Transcription Factor Binding Site Prediction 
PLoS Computational Biology  2013;9(9):e1003214.
Finding where transcription factors (TFs) bind to the DNA is of key importance to decipher gene regulation at a transcriptional level. Classically, computational prediction of TF binding sites (TFBSs) is based on basic position weight matrices (PWMs) which quantitatively score binding motifs based on the observed nucleotide patterns in a set of TFBSs for the corresponding TF. Such models make the strong assumption that each nucleotide participates independently in the corresponding DNA-protein interaction and do not account for flexible length motifs. We introduce transcription factor flexible models (TFFMs) to represent TF binding properties. Based on hidden Markov models, TFFMs are flexible, and can model both position interdependence within TFBSs and variable length motifs within a single dedicated framework. The availability of thousands of experimentally validated DNA-TF interaction sequences from ChIP-seq allows for the generation of models that perform as well as PWMs for stereotypical TFs and can improve performance for TFs with flexible binding characteristics. We present a new graphical representation of the motifs that convey properties of position interdependence. TFFMs have been assessed on ChIP-seq data sets coming from the ENCODE project, revealing that they can perform better than both PWMs and the dinucleotide weight matrix extension in discriminating ChIP-seq from background sequences. Under the assumption that ChIP-seq signal values are correlated with the affinity of the TF-DNA binding, we find that TFFM scores correlate with ChIP-seq peak signals. Moreover, using available TF-DNA affinity measurements for the Max TF, we demonstrate that TFFMs constructed from ChIP-seq data correlate with published experimentally measured DNA-binding affinities. Finally, TFFMs allow for the straightforward computation of an integrated TF occupancy score across a sequence. 
These results demonstrate the capacity of TFFMs to accurately model DNA-protein interactions, while providing a single unified framework suitable for the next generation of TFBS prediction.
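For readers unfamiliar with the baseline that TFFMs are compared against, the following Python sketch shows how a basic log-odds PWM is built from aligned binding sites and used to score a sequence. It illustrates the independence assumption criticized above (the score is a plain sum over positions) and is not the TFFM method. The function names (`build_pwm`, `score`, `best_hit`), the pseudocount of 1 and the uniform background are assumptions chosen for this example.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds PWM from equal-length aligned sites (one dict per column)."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(width):
        column = [site[i] for site in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def score(pwm, window):
    """Sum of per-position log-odds: each nucleotide contributes independently."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in `seq`."""
    width = len(pwm)
    return max((score(pwm, seq[p:p + width]), p)
               for p in range(len(seq) - width + 1))
```

Because every column is scored separately, such a model cannot express that a base at one position makes a base at another position more or less likely, and its width is fixed; these are the two limitations the abstract says TFFMs address.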
Author Summary
Transcription factors are critical proteins for sequence-specific control of transcriptional regulation. Finding where these proteins bind to DNA is of key importance for global efforts to decipher the complex mechanisms of gene regulation. Greater understanding of the regulation of transcription promises to improve human genetic analysis by specifying critical gene components that have eluded investigators. Classically, computational prediction of transcription factor binding sites (TFBS) is based on models giving weights to each nucleotide at each position. We introduce a novel statistical model for the prediction of TFBS tolerant of a broader range of TFBS configurations than can be conveniently accommodated by existing methods. The new models are designed to address the confounding properties of nucleotide composition, inter-positional sequence dependence and variable lengths (e.g. variable spacing between half-sites) observed in the more comprehensive experimental data now emerging. The new models generate scores consistent with DNA-protein affinities measured experimentally and can be represented graphically, retaining desirable attributes of past methods. It demonstrates the capacity of the new approach to accurately assess DNA-protein interactions. With the rich experimental data generated from chromatin immunoprecipitation experiments, a greater diversity of TFBS properties has emerged that can now be accommodated within a single predictive approach.
doi:10.1371/journal.pcbi.1003214
PMCID: PMC3764009  PMID: 24039567
8.  MYBS: a comprehensive web server for mining transcription factor binding sites in yeast 
Nucleic Acids Research  2007;35(Web Server issue):W221-W226.
Correct interactions between transcription factors (TFs) and their binding sites (TFBSs) are of central importance to gene regulation. Recently developed chromatin-immunoprecipitation DNA chip (ChIP-chip) techniques and the phylogenetic footprinting method provide ways to identify TFBSs with high precision. In this study, we constructed a user-friendly interactive platform for dynamic binding site mapping using ChIP-chip data and phylogenetic footprinting as two filters. MYBS (Mining Yeast Binding Sites) is a comprehensive web server that integrates an array of both experimentally verified and predicted position weight matrixes (PWMs) from eleven databases, including 481 binding motif consensus sequences and 71 PWMs that correspond to 183 TFs. MYBS users can search within this platform for motif occurrences (possible binding sites) in the promoters of genes of interest via simple motif or gene queries in conjunction with the above two filters. In addition, MYBS enables users to visualize in parallel the potential regulators for a given set of genes, a feature useful for finding potential regulatory associations between TFs. MYBS also allows users to identify target gene sets of each TF pair, which could be used as a starting point for further explorations of TF combinatorial regulation. MYBS is available at http://cg1.iis.sinica.edu.tw/~mybs/.
doi:10.1093/nar/gkm379
PMCID: PMC1933147  PMID: 17537814
9.  Assessment of clusters of transcription factor binding sites in relationship to human promoter, CpG islands and gene expression 
BMC Genomics  2004;5:16.
Background
Gene expression is regulated mainly by transcription factors (TFs) that interact with regulatory cis-elements on DNA sequences. To identify functional regulatory elements, computer searching can predict TF binding sites (TFBS) using position weight matrices (PWMs) that represent positional base frequencies of collected experimentally determined TFBS. A disadvantage of this approach is the large output of results for genomic DNA. One strategy to identify genuine TFBS is to utilize local concentrations of predicted TFBS. It is unclear whether there is a general tendency for TFBS to cluster at promoter regions, although this is the case for certain TFBS. Also unclear is the identification of TFs that have TFBS concentrated in promoters and to what level this occurs. This study hopes to answer some of these questions.
Results
We developed the cluster score measure to evaluate the correlation between predicted TFBS clusters and promoter sequences for each PWM. Non-promoter sequences were used as a control. Using the cluster score, we identified a PWM group called PWM-PCP, in which TFBS clusters positively correlate with promoters, and another PWM group called PWM-NCP, in which TFBS clusters negatively correlate with promoters. The PWM-PCP group comprises 47% of the 199 vertebrate PWMs, while the PWM-NCP group comprises 11%. After reducing the effect of CpG islands (CGI) against the clusters using partial correlation coefficients among three properties (promoter, CGI and predicted TFBS cluster), we identified two PWM groups including those strongly correlated with CGI and those not correlated with CGI.
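The partial correlation step can be illustrated with the standard first-order formula, which removes the linear effect of a third variable z from the correlation between x and y. The Python sketch below is a generic illustration under that assumption; the abstract does not give the exact computation used in the study, and the names `pearson` and `partial_corr`, along with the reading of x, y and z as promoter, TFBS cluster and CGI indicators, are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
```

When z is uncorrelated with both x and y, the partial correlation reduces to the ordinary correlation between x and y; when z explains most of their shared variation, it shrinks toward zero.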
Conclusion
Not all PWMs predict TFBS correlated with human promoter sequences. Two main PWM groups were identified: (1) those that show TFBS clustered in promoters associated with CGI, and (2) those that show TFBS clustered in promoters independent of CGI. Assessment of PWM matches will allow more positive interpretation of TFBS in regulatory regions.
doi:10.1186/1471-2164-5-16
PMCID: PMC375527  PMID: 15053842
promoter; tissue-specific gene expression; position weight matrix; regulatory motif
10.  Rule-Based Cell Systems Model of Aging using Feedback Loop Motifs Mediated by Stress Responses 
PLoS Computational Biology  2010;6(6):e1000820.
Investigating the complex systems dynamics of the aging process requires integration of a broad range of cellular processes describing damage and functional decline co-existing with adaptive and protective regulatory mechanisms. We evolve an integrated generic cell network to represent the connectivity of key cellular mechanisms structured into positive and negative feedback loop motifs centrally important for aging. The conceptual network is cast into a fuzzy-logic, hybrid-intelligent framework based on interaction rules assembled from a priori knowledge. Based upon a classical homeostatic representation of cellular energy metabolism, we first demonstrate how positive-feedback loops accelerate damage and decline consistent with a vicious cycle. This model is iteratively extended towards an adaptive response model by incorporating protective negative-feedback loop circuits. Time-lapse simulations of the adaptive response model uncover how transcriptional and translational changes, mediated by stress sensors NF-κB and mTOR, counteract accumulating damage and dysfunction by modulating mitochondrial respiration, metabolic fluxes, biosynthesis, and autophagy, crucial for cellular survival. The model allows consideration of lifespan optimization scenarios with respect to fitness criteria using a sensitivity analysis. Our work establishes a novel extendable and scalable computational approach capable of connecting tractable molecular mechanisms with cellular network dynamics underlying the emerging aging phenotype.
Author Summary
The global process of aging disturbs a broad range of cellular mechanisms in a complex fashion and is not well understood. One important goal of computational approaches in aging is to develop integrated models in terms of a unifying aging theory, predicting progression of aging phenotypes grounded on molecular mechanisms. However, current experimental data incoherently reflects many isolated processes from a large diversity of approaches, biological model systems, and species, which makes such integration a challenging task. In an attempt to close this gap, we iteratively develop a fuzzy-logic cell systems model considering the interplay of damage, metabolism, and signaling by positive and negative feedback-loop motifs using relationships drawn from literature data. Because cellular biodynamics may be considered a complex control system, this approach seems particularly suitable. Here, we demonstrate that rule-based fuzzy-logic models provide semi-quantitative predictions that enhance our understanding of complex and interlocked molecular mechanisms and their implications on the aging physiome.
doi:10.1371/journal.pcbi.1000820
PMCID: PMC2887462  PMID: 20585546
11.  PiDNA: predicting protein–DNA interactions with structural models 
Nucleic Acids Research  2013;41(Web Server issue):W523-W530.
Predicting binding sites of a transcription factor in the genome is an important, but challenging, issue in studying gene regulation. In the past decade, a large number of protein–DNA co-crystallized structures available in the Protein Data Bank have facilitated the understanding of interacting mechanisms between transcription factors and their binding sites. Recent studies have shown that both physics-based and knowledge-based potential functions can be applied to protein–DNA complex structures to deliver position weight matrices (PWMs) that are consistent with the experimental data. To further use the available structural models, the proposed Web server, PiDNA, aims at first constructing reliable PWMs by applying an atomic-level knowledge-based scoring function on numerous in silico mutated complex structures, and then using the PWM constructed by the structure models with small energy changes to predict the interaction between proteins and DNA sequences. With PiDNA, the users can easily predict the relative preference of all the DNA sequences with limited mutations from the native sequence co-crystallized in the model in a single run. More predictions on sequences with unlimited mutations can be realized by additional requests or file uploading. Three types of information can be downloaded after prediction: (i) the ranked list of mutated sequences, (ii) the PWM constructed by the favourable mutated structures, and (iii) any mutated protein–DNA complex structure models specified by the user. This study first shows that the constructed PWMs are similar to the annotated PWMs collected from databases or literature. Second, the prediction accuracy of PiDNA in detecting relatively high-specificity sites is evaluated by comparing the ranked lists against in vitro experiments from protein-binding microarrays. Finally, PiDNA is shown to be able to select the experimentally validated binding sites from 10 000 random sites with high accuracy. 
With PiDNA, the users can design biological experiments based on the predicted sequence specificity and/or request mutated structure models for further protein design. As well, it is expected that PiDNA can be incorporated with chromatin immunoprecipitation data to refine large-scale inference of in vivo protein–DNA interactions. PiDNA is available at: http://dna.bime.ntu.edu.tw/pidna.
doi:10.1093/nar/gkt388
PMCID: PMC3692134  PMID: 23703214
12.  Discovering Motifs in Ranked Lists of DNA Sequences 
PLoS Computational Biology  2007;3(3):e39.
Computational methods for discovery of sequence elements that are enriched in a target set compared with a background set are fundamental in molecular biology research. One example is the discovery of transcription factor binding motifs that are inferred from ChIP–chip (chromatin immuno-precipitation on a microarray) measurements. Several major challenges in sequence motif discovery still require consideration: (i) the need for a principled approach to partitioning the data into target and background sets; (ii) the lack of rigorous models and of an exact p-value for measuring motif enrichment; (iii) the need for an appropriate framework for accounting for motif multiplicity; (iv) the tendency, in many of the existing methods, to report presumably significant motifs even when applied to randomly generated data. In this paper we present a statistical framework for discovering enriched sequence elements in ranked lists that resolves these four issues. We demonstrate the implementation of this framework in a software application, termed DRIM (discovery of rank imbalanced motifs), which identifies sequence motifs in lists of ranked DNA sequences. We applied DRIM to ChIP–chip and CpG methylation data and obtained the following results. (i) Identification of 50 novel putative transcription factor (TF) binding sites in yeast ChIP–chip data. The biological function of some of them was further investigated to gain new insights on transcription regulation networks in yeast. For example, our discoveries enable the elucidation of the network of the TF ARO80. Another finding concerns a systematic TF binding enhancement to sequences containing CA repeats. (ii) Discovery of novel motifs in human cancer CpG methylation data. Remarkably, most of these motifs are similar to DNA sequence elements bound by the Polycomb complex that promotes histone methylation. Our findings thus support a model in which histone methylation and CpG methylation are mechanistically linked. 
Overall, we demonstrate that the statistical framework embodied in the DRIM software tool is highly effective for identifying regulatory sequence elements in a variety of applications ranging from expression and ChIP–chip to CpG methylation data. DRIM is publicly available at http://bioinfo.cs.technion.ac.il/drim.
Author Summary
A computational problem with many applications in molecular biology is to identify short DNA sequence patterns (motifs) that are significantly overrepresented in a target set of genomic sequences relative to a background set of genomic sequences. One example is a target set that contains DNA sequences to which a specific transcription factor protein was experimentally measured as bound while the background set contains sequences to which the same transcription factor was not bound. Overrepresented sequence motifs in the target set may represent a subsequence that is molecularly recognized by the transcription factor. An inherent limitation of the above formulation of the problem lies in the fact that in many cases data cannot be clearly partitioned into distinct target and background sets in a biologically justified manner. We describe a statistical framework for discovering motifs in a list of genomic sequences that are ranked according to a biological parameter or measurement (e.g., transcription factor to sequence binding measurements). Our approach circumvents the need to partition the data into target and background sets using arbitrarily set parameters. The framework is implemented in a software tool called DRIM. The application of DRIM led to the identification of novel putative transcription factor binding sites in yeast and to the discovery of previously unknown motifs in CpG methylation regions in human cancer cell lines.
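The ranked-list idea above can be illustrated with a small sketch. The published DRIM method is built on the minimum hypergeometric (mHG) statistic. The code below computes only the raw mHG score for a ranked 0/1 occurrence vector and is not the DRIM implementation. The function names and the toy input are illustrative assumptions. The raw score is a minimum over all cutoffs, so it is not itself a p-value and still needs a multiple-testing correction.

```python
from math import comb

def hypergeom_tail(N, B, n, b):
    """P(X >= b) when n items are drawn from N items, B of which contain the motif."""
    total = comb(N, n)
    upper = min(n, B)
    return sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1)) / total

def mhg_score(ranked_hits):
    """Smallest hypergeometric tail over all cutoffs of a ranked 0/1 list.

    ranked_hits[i] is 1 if the i-th ranked sequence contains the motif.
    Returns (score, cutoff); the cutoff is the data-driven target-set size.
    """
    N, B = len(ranked_hits), sum(ranked_hits)
    best, best_n, b = 1.0, 0, 0
    for n, hit in enumerate(ranked_hits, start=1):
        b += hit
        p = hypergeom_tail(N, B, n, b)
        if p < best:
            best, best_n = p, n
    return best, best_n

# Toy input (hypothetical): the motif occurs only in the three top-ranked sequences.
score, cutoff = mhg_score([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
```

Here the cutoff that minimises the tail probability falls right after the third sequence, which is how a ranked-list approach avoids a fixed, arbitrary split into target and background sets.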
doi:10.1371/journal.pcbi.0030039
PMCID: PMC1829477  PMID: 17381235
13.  Application of experimentally verified transcription factor binding sites models for computational analysis of ChIP-Seq data 
BMC Genomics  2014;15:80.
Background
ChIP-Seq is widely used to detect genomic segments bound by transcription factors (TF), either directly at DNA binding sites (BSs) or indirectly via other proteins. Currently, there are many software tools implementing different approaches to identify TFBSs within ChIP-Seq peaks. However, their use for the interpretation of ChIP-Seq data is usually complicated by the absence of direct experimental verification, making it difficult both to set a threshold to avoid recognition of too many false-positive BSs, and to compare the actual performance of different models.
Results
Using ChIP-Seq data for FoxA2 binding loci in mouse adult liver and human HepG2 cells, we compared FoxA binding-site predictions for four computational models of two fundamental classes: pattern matching based on an existing training set of experimentally confirmed TFBSs (oPWM and SiteGA) and de novo motif discovery (ChIPMunk and diChIPMunk). To properly select prediction thresholds for the models, we experimentally evaluated the affinity of 64 predicted FoxA BSs using EMSA, which reliably distinguishes sequences able to bind the TF. As a result, we identified thousands of reliable FoxA BSs within ChIP-Seq loci from mouse liver and human HepG2 cells. The performance of conventional position weight matrix (PWM) models was inferior, with the highest false positive rate. In contrast, the best recognition efficiency was achieved by the combination of SiteGA & diChIPMunk/ChIPMunk models, properly identifying FoxA BSs in up to 90% of loci for both mouse and human ChIP-Seq datasets.
Conclusions
The experimental study of TF binding to oligonucleotides corresponding to predicted sites increases the reliability of computational methods for TFBS-recognition in ChIP-Seq data analysis. Regarding ChIP-Seq data interpretation, basic PWMs have inferior TFBS recognition quality compared to the more sophisticated SiteGA and de novo motif discovery methods. A combination of models from different principles allowed identification of proper TFBSs.
doi:10.1186/1471-2164-15-80
PMCID: PMC4234207  PMID: 24472686
ChIP-Seq; EMSA; Transcription factor binding sites; FoxA; SiteGA; PWM; Transcription factor binding model; Dinucleotide frequencies
14.  Linear fuzzy gene network models obtained from microarray data by exhaustive search 
BMC Bioinformatics  2004;5:108.
Background
Recent technological advances in high-throughput data collection allow for experimental study of increasingly complex systems on the scale of the whole cellular genome and proteome. Gene network models are needed to interpret the resulting large and complex data sets. Rationally designed perturbations (e.g., gene knock-outs) can be used to iteratively refine hypothetical models, suggesting an approach for high-throughput biological system analysis. We introduce an approach to gene network modeling based on a scalable linear variant of fuzzy logic: a framework with greater resolution than Boolean logic models, but which, while still semi-quantitative, does not require the precise parameter measurement needed for chemical kinetics-based modeling.
Results
We demonstrated our approach with exhaustive search for fuzzy gene interaction models that best fit transcription measurements by microarray of twelve selected genes regulating the yeast cell cycle. Applying an efficient, universally applicable data normalization and fuzzification scheme, the search converged to a small number of models that individually predict experimental data within an error tolerance. Because only gene transcription levels are used to develop the models, they include both direct and indirect regulation of genes.
Conclusion
Biological relationships in the best-fitting fuzzy gene network models successfully recover direct and indirect interactions predicted from previous knowledge to result in transcriptional correlation. Fuzzy models fit on one yeast cell cycle data set robustly predict another experimental data set for the same system. Linear fuzzy gene networks and exhaustive rule search are the first steps towards a framework for an integrated modeling and experiment approach to high-throughput "reverse engineering" of complex biological systems.
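The abstract does not spell out its normalization and fuzzification scheme, so the following is only a minimal sketch of one common choice: min-max normalization followed by three triangular membership functions (low, medium, high). The function names and the toy expression profile are assumptions made for illustration, not the authors' procedure.

```python
def min_max_normalize(values):
    """Rescale one gene's expression profile to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat profile carries no information; place it mid-scale.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def fuzzify(x):
    """Map a normalized value to (low, medium, high) memberships.

    The triangular functions peak at 0, 0.5 and 1, and the three
    memberships always sum to 1.
    """
    low = max(0.0, 1.0 - 2.0 * x)
    medium = 1.0 - abs(2.0 * x - 1.0)
    high = max(0.0, 2.0 * x - 1.0)
    return (low, medium, high)

# Hypothetical expression profile for a single gene over four time points.
profile = [2.0, 4.0, 6.0, 10.0]
memberships = [fuzzify(x) for x in min_max_normalize(profile)]
```

A value can belong partly to two states at once (for example half low, half medium), which is what gives fuzzy models finer resolution than a Boolean on/off encoding.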
doi:10.1186/1471-2105-5-108
PMCID: PMC514698  PMID: 15304201
15.  Activity of the adenosine deaminase promoter in transgenic mice. 
Nucleic Acids Research  1988;16(21):10083-10097.
The promoter of the human gene for adenosine deaminase (ADA) is extremely G/C-rich, contains several G/C-box motifs (GGGCGGG) and lacks any apparent TATA or CAAT boxes. These features are commonly found in promoters of genes that lack a strong tissue specificity, and are referred to as "housekeeping genes". Like other housekeeping genes, the ADA gene is expressed in all tissues. However, there is a considerable variation in the levels of expression of the ADA protein in different tissues. In order to study the activity of the ADA promoter, transgenic mice were generated that harbor a chimeric gene composed of the ADA promoter linked to a reporter gene encoding the bacterial enzyme Chloramphenicol Acetyl Transferase (CAT). These mice reproducibly showed CAT expression in all tissues examined, including the hemopoietic organs (spleen, thymus and bone marrow). However, examination of the actual cell types expressing the CAT gene revealed the ADA promoter to be inactive in the hemopoietic cells. This was substantiated by a transplantation experiment in which bone marrow from ADA-CAT transgenic mice was used to reconstitute the hemopoietic compartment of lethally irradiated mice. The engrafted recipients revealed strongly reduced CAT activity in their hemopoietic organs. The lack of expression in hemopoietic cells was further shown to be correlated with a hypermethylated state of the transgene. Combined, our data suggest that the ADA promoter sequences tested can direct expression in a wide variety of tissues as expected for a regular housekeeping gene promoter. However, the activity of the ADA promoter fragment did not reflect the tissue-specific variations in expression levels of the endogenous ADA gene. Additional regulatory elements are needed for expression in the hemopoietic cells.
PMCID: PMC338838  PMID: 3057438
16.  CisMiner: Genome-Wide In-Silico Cis-Regulatory Module Prediction by Fuzzy Itemset Mining 
PLoS ONE  2014;9(9):e108065.
Eukaryotic gene control regions are known to be spread throughout non-coding DNA sequences which may appear distant from the gene promoter. Transcription factors are proteins that coordinately bind to these regions at transcription factor binding sites to regulate gene expression. Several tools can detect significant co-occurrences of closely located binding sites (cis-regulatory modules, CRMs). However, these tools have at least one of the following limitations: 1) their scope is limited to promoter or conserved regions of the genome; 2) they cannot identify combinations involving more than two motifs; 3) they require prior information about target motifs. In this work we present CisMiner, a novel methodology to detect putative CRMs by means of a fuzzy itemset mining approach able to operate at genome-wide scale. CisMiner performs a blind search of CRMs without any prior information about target CRMs and without any limit on the number of motifs. CisMiner tackles the combinatorial complexity of genome-wide cis-regulatory module extraction using a natural representation of motif combinations as itemsets and applying the Top-Down Fuzzy Frequent-Pattern Tree algorithm to identify significant itemsets. Fuzzy technology allows CisMiner to better handle the imprecision and noise inherent to regulatory processes. Results obtained for a set of well-known binding sites in the S. cerevisiae genome show that our method yields highly reliable predictions. Furthermore, CisMiner was also applied to putative in-silico predicted transcription factor binding sites to identify significant combinations in S. cerevisiae and D. melanogaster, proving that our approach can be further applied genome-wide to more complex genomes. CisMiner is freely accessible at: http://genome2.ugr.es/cisminer.
CisMiner can be queried for the results presented in this work and can also perform a customized cis-regulatory module prediction on a query set of transcription factor binding sites provided by the user.
doi:10.1371/journal.pone.0108065
PMCID: PMC4182448  PMID: 25268582
17.  Toward an Integrated Model of Capsule Regulation in Cryptococcus neoformans 
PLoS Pathogens  2011;7(12):e1002411.
Cryptococcus neoformans is an opportunistic fungal pathogen that causes serious human disease in immunocompromised populations. Its polysaccharide capsule is a key virulence factor which is regulated in response to growth conditions, becoming enlarged in the context of infection. We used microarray analysis of cells stimulated to form capsule over a range of growth conditions to identify a transcriptional signature associated with capsule enlargement. The signature contains 880 genes, is enriched for genes encoding known capsule regulators, and includes many uncharacterized sequences. One uncharacterized sequence encodes a novel regulator of capsule and of fungal virulence. This factor is a homolog of the yeast protein Ada2, a member of the Spt-Ada-Gcn5 Acetyltransferase (SAGA) complex that regulates transcription of stress response genes via histone acetylation. Consistent with this homology, the C. neoformans null mutant exhibits reduced histone H3 lysine 9 acetylation. It is also defective in response to a variety of stress conditions, demonstrating phenotypes that overlap with, but are not identical to, those of other fungi with altered SAGA complexes. The mutant also exhibits significant defects in sexual development and virulence. To establish the role of Ada2 in the broader network of capsule regulation we performed RNA-Seq on strains lacking either Ada2 or one of two other capsule regulators: Cir1 and Nrg1. Analysis of the results suggested that Ada2 functions downstream of both Cir1 and Nrg1 via components of the high osmolarity glycerol (HOG) pathway. To identify direct targets of Ada2, we performed ChIP-Seq analysis of histone acetylation in the Ada2 null mutant. These studies supported the role of Ada2 in the direct regulation of capsule and mating responses and suggested that it may also play a direct role in regulating capsule-independent antiphagocytic virulence factors. 
These results validate our experimental approach to dissecting capsule regulation and provide multiple targets for future investigation.
Author Summary
Cryptococcus neoformans is a fungal pathogen that causes serious disease in immunocompromised individuals, killing over 600,000 people per year worldwide. A major factor in the ability of this microbe to cause disease is an extensive polysaccharide capsule that surrounds the cell and interferes with the host immune response to infection. This capsule expands dramatically in certain growth conditions, including those found in the mammalian host. We grew cells in multiple conditions and assessed gene expression and capsule size. This allowed us to identify a ‘transcriptional signature’ of genes whose expression correlates with capsule size; we speculated that a subset of these genes acts in capsule regulation. To test this hypothesis, we characterized one previously unstudied gene in this signature and found it to be a novel regulator of capsule expansion, fungal virulence, and mating. This gene encodes cryptococcal Ada2, a well-conserved protein that regulates genes involved in stress response and development. We used phenotypic analysis, RNA sequencing, and chromatin-immunoprecipitation sequencing (ChIP-Seq) to situate Ada2 in the complex network of genes that regulate capsule and other cryptococcal virulence factors. This approach, which yielded insights into the regulation of a critical fungal virulence factor, is applicable to similar questions in other pathogens.
doi:10.1371/journal.ppat.1002411
PMCID: PMC3234223  PMID: 22174677
18.  Tissue-specific prediction of directly regulated genes 
Bioinformatics  2011;27(17):2354-2360.
Direct binding by a transcription factor (TF) to the proximal promoter of a gene is strong evidence that the TF regulates the gene. Assaying the genome-wide binding of every TF in every cell type and condition is currently impractical. Histone modifications correlate with tissue/cell/condition-specific (‘tissue specific’) TF binding, so histone ChIP-seq data can be combined with traditional position weight matrix (PWM) methods to make tissue-specific predictions of TF–promoter interactions.
Results: We use supervised learning to train a naïve Bayes predictor of TF–promoter binding. The predictor's features are the histone modification levels and a PWM-based score for the promoter. Training and testing uses sets of promoters labeled using TF ChIP-seq data, and we use cross-validation on 23 such datasets to measure the accuracy. A PWM+histone naïve Bayes predictor using a single histone modification (H3K4me3) is substantially more accurate than a PWM score or a conservation-based score (phylogenetic motif model). The naïve Bayes predictor is more accurate (on average) at all sensitivity levels, and makes only half as many false positive predictions at sensitivity levels from 10% to 80%. On average, it correctly predicts 80% of bound promoters at a false positive rate of 20%. Accuracy does not diminish when we test the predictor in a different cell type (and species) from training. Accuracy is barely diminished even when we train the predictor without using TF ChIP-seq data.
Availability: Our tissue-specific predictor of promoters bound by a TF is called Dr Gene and is available at http://bioinformatics.org.au/drgene.
Contact: t.bailey@imb.uq.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
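The PWM+histone predictor described in the Results can be sketched as a two-feature Gaussian naïve Bayes classifier. This is an illustrative sketch only and is not the Dr Gene code. The Gaussian class-conditional model, the function names and the toy promoter values are all assumptions. The abstract states only that the features are histone modification levels and a PWM-based promoter score.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate a class prior and a per-feature mean and variance for each label."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        prior = len(rows) / len(X)
        stats = []
        for j in range(len(X[0])):
            col = [r[j] for r in rows]
            mean = sum(col) / len(col)
            # A small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-6
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model

def log_posterior(model, x, label):
    """Unnormalized log posterior; features are treated as independent given the class."""
    prior, stats = model[label]
    lp = math.log(prior)
    for v, (mean, var) in zip(x, stats):
        lp += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
    return lp

def predict(model, x):
    """Return the label with the highest posterior."""
    return max(model, key=lambda label: log_posterior(model, x, label))

# Toy promoters (hypothetical values): (PWM score, H3K4me3 level); 1 = bound, 0 = unbound.
X = [(8.0, 5.0), (7.5, 6.0), (9.0, 5.5), (3.0, 1.0), (4.0, 0.5), (2.5, 1.5)]
y = [1, 1, 1, 0, 0, 0]
model = fit_gaussian_nb(X, y)
```

In practice the labels would come from TF ChIP-seq data and accuracy would be measured by cross-validation, as the abstract describes.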
doi:10.1093/bioinformatics/btr399
PMCID: PMC3157924  PMID: 21724591
19.  iRegulon: From a Gene List to a Gene Regulatory Network Using Large Motif and Track Collections 
PLoS Computational Biology  2014;10(7):e1003731.
Identifying master regulators of biological processes and mapping their downstream gene networks are key challenges in systems biology. We developed a computational method, called iRegulon, to reverse-engineer the transcriptional regulatory network underlying a co-expressed gene set using cis-regulatory sequence analysis. iRegulon implements a genome-wide ranking-and-recovery approach to detect enriched transcription factor motifs and their optimal sets of direct targets. We increase the accuracy of network inference by using very large motif collections of up to ten thousand position weight matrices collected from various species, and linking these to candidate human TFs via a motif2TF procedure. We validate iRegulon on gene sets derived from ENCODE ChIP-seq data with increasing levels of noise, and we compare iRegulon with existing motif discovery methods. Next, we use iRegulon on more challenging types of gene lists, including microRNA target sets, protein-protein interaction networks, and genetic perturbation data. In particular, we over-activate p53 in breast cancer cells, followed by RNA-seq and ChIP-seq, and could identify an extensive up-regulated network controlled directly by p53. Similarly we map a repressive network with no indication of direct p53 regulation but rather an indirect effect via E2F and NFY. Finally, we generalize our computational framework to include regulatory tracks such as ChIP-seq data and show how motif and track discovery can be combined to map functional regulatory interactions among co-expressed genes. iRegulon is available as a Cytoscape plugin from http://iregulon.aertslab.org.
Author Summary
Gene regulatory networks control developmental, homeostatic, and disease processes by governing precise levels and spatio-temporal patterns of gene expression. Determining their topology can provide mechanistic insight into these processes. Gene regulatory networks consist of interactions between transcription factors and their direct target genes. Each regulatory interaction represents the binding of the transcription factor to a specific DNA binding site near its target gene. Here we present a computational method, called iRegulon, to identify master regulators and direct target genes in a human gene signature, i.e. a set of co-expressed genes. iRegulon relies on the analysis of the regulatory sequences around each gene in the gene set to detect enriched TF motifs or ChIP-seq peaks, using databases of nearly 10,000 TF motifs and 1000 ChIP-seq data sets or “tracks”. Next, it associates enriched motifs and tracks with candidate transcription factors and determines the optimal subset of direct target genes. We validate iRegulon on ENCODE data, and use it in combination with RNA-seq and ChIP-seq data to map a p53 downstream network with new predicted co-factors and targets. iRegulon is available as a Cytoscape plugin, supporting human, mouse, and Drosophila genes, and provides access to hundreds of cancer-related TF-target subnetworks or “regulons”.
doi:10.1371/journal.pcbi.1003731
PMCID: PMC4109854  PMID: 25058159
20.  Optimized Position Weight Matrices in Prediction of Novel Putative Binding Sites for Transcription Factors in the Drosophila melanogaster Genome 
PLoS ONE  2013;8(8):e68712.
Position weight matrices (PWMs) have become a tool of choice for the identification of transcription factor binding sites in DNA sequences. DNA-binding proteins often show degeneracy in their binding requirement and thus the overall binding specificity of many proteins is unknown and remains an active area of research. Although existing PWMs are more reliable predictors than consensus string matching, they generally result in a high number of false positive hits. Our previous study introduced a promising approach to PWM refinement in which known motifs are used to computationally mine putative binding sites directly from aligned promoter regions using composition of similar sites. In the present study, we extended this technique originally tested on single examples of transcription factors (TFs) and showed its capability to optimize PWM performance to predict new binding sites in the fruit fly genome. We propose refined PWMs in mono- and dinucleotide versions similarly computed for a large variety of transcription factors of Drosophila melanogaster. Along with the addition of many auxiliary sites the optimization includes variation of the PWM motif length, the binding sites location on the promoters and the PWM score threshold. To assess the predictive performance of the refined PWMs we compared them to conventional TRANSFAC and JASPAR sources. The results have been verified using performed tests and literature review. Overall, the refined PWMs containing putative sites derived from real promoter content processed using optimized parameters had better general accuracy than conventional PWMs.
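For readers unfamiliar with how a PWM and its score threshold are used, here is a minimal sketch of a conventional mono-nucleotide log-odds PWM and a sequence scan. It shows the baseline technique only, not the refinement procedure of the study. The binding sites, pseudocount, uniform background and threshold are illustrative assumptions.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from aligned, equal-length sites."""
    pwm = []
    for i in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2(counts[base] / total / background) for base in "ACGT"})
    return pwm

def scan(seq, pwm, threshold):
    """Return (start, score) for every window whose summed log-odds reaches the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        score = sum(pwm[i][seq[start + i]] for i in range(width))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical training sites and promoter fragment.
sites = ["AGATAA", "TGATAA", "AGATAG", "AGATAA"]
pwm = build_pwm(sites)
hits = scan("CCCAGATAACCC", pwm, threshold=4.0)
```

Lowering the threshold admits more windows and therefore more false positives, which is the trade-off that the threshold optimization in the study addresses.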
doi:10.1371/journal.pone.0068712
PMCID: PMC3735551  PMID: 23936309
21.  Optimizing the GATA-3 position weight matrix to improve the identification of novel binding sites 
BMC Genomics  2012;13:416.
Background
Identifying binding sites for transcription factors is a key component of gene regulatory network analysis. This is often done using position-weight matrices (PWMs). Because of the importance of in silico mapping of tentative binding sites, we previously developed an approach for PWM optimization that substantially improves the accuracy of such mapping.
Results
The present work applies the optimization algorithm to the existing PWM for the GATA-3 transcription factor and builds a new di-nucleotide PWM. The existing PWM is based on experimental data adopted from JASPAR. The optimized PWM substantially improves the sensitivity and specificity of the TF mapping compared to the conventional applications. The refined PWM also facilitates in silico identification of novel binding sites that are supported by experimental data. We also describe uncommon positioning of binding motifs for several T-cell lineage specific factors in human promoters.
Conclusion
Our proposed di-nucleotide PWM approach outperforms the conventional mono-nucleotide PWM approach with respect to GATA-3. Therefore our new di-nucleotide PWM provides new insight into plausible transcriptional regulatory interactions in human promoters.
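A di-nucleotide PWM differs from a mono-nucleotide one in that each column scores a pair of adjacent bases, so dependencies between neighbouring positions are captured. The sketch below shows that generic idea and is not the authors' optimized GATA-3 matrix. The example sites, the pseudocount and the uniform 1/16 background are assumptions.

```python
import math
from itertools import product

# All 16 possible dinucleotides.
DINUCS = ["".join(pair) for pair in product("ACGT", repeat=2)]

def build_dinuc_pwm(sites, pseudocount=0.5):
    """Log-odds matrix with one column per adjacent base pair (site length minus one)."""
    matrix = []
    for i in range(len(sites[0]) - 1):
        counts = {d: pseudocount for d in DINUCS}
        for site in sites:
            counts[site[i:i + 2]] += 1
        total = sum(counts.values())
        # Uniform background: each dinucleotide is expected with probability 1/16.
        matrix.append({d: math.log2(counts[d] / total * 16) for d in DINUCS})
    return matrix

def score_site(site, matrix):
    """Sum the log-odds of the overlapping dinucleotides of a candidate site."""
    return sum(col[site[i:i + 2]] for i, col in enumerate(matrix))

# Hypothetical GATA-like training sites.
sites = ["AGATAA", "TGATAA", "AGATAG", "AGATAA"]
dinuc_pwm = build_dinuc_pwm(sites)
good = score_site("AGATAA", dinuc_pwm)
poor = score_site("CCCCCC", dinuc_pwm)
```

Because each column has 16 parameters instead of 4, a di-nucleotide matrix needs more training sites than a mono-nucleotide matrix to be estimated reliably.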
doi:10.1186/1471-2164-13-416
PMCID: PMC3481455  PMID: 22913572
Transcription factor; Binding sites; GATA-3; Human promoter; Position weight matrix; Optimization
22.  Coordination of frontline defense mechanisms under severe oxidative stress 
Inference of an environmental and gene regulatory influence network (EGRINOS) by integrating transcriptional responses to H2O2 and paraquat (PQ) has revealed a multi-tiered oxidative stress (OS)-management program to transcriptionally coordinate three peroxidase/catalase enzymes, two superoxide dismutases, production of rhodopsins, carotenoids and gas vesicles, metal trafficking, and various other aspects of metabolism.
ChIP-chip, microarray, and survival assays have validated important architectural aspects of this network, identified novel defense mechanisms (including two evolutionarily distant peroxidase enzymes), and showed that general transcription factors of the transcription factor B family have an important function in coordinating the OS response (OSR) despite their inability to directly sense ROS.
A comparison of transcriptional responses to sub-lethal doses of H2O2 and PQ with predictions of these responses made by an EGRIN model generated earlier from responses to other environmental factors has confirmed that a significant fraction of the OSR is made up of a generalized component that is also observed in response to other stressors.
Analysis of active regulons within environment and gene regulatory influence network for OS (EGRINOS) across diverse environmental conditions has identified the specialized component of oxidative stress response (OSR) that is triggered by sub-lethal OS, but not by other stressors, including sub-inhibitory levels of redox-active metals, extreme changes in oxygen tension, and a sub-lethal dose of γ rays.
Reactive oxygen species (ROS), such as hydrogen peroxide (H2O2), superoxide (O2−), and hydroxyl (OH−) radicals, are normal by-products of aerobic metabolism. Evolutionarily conserved mechanisms including detoxification enzymes (peroxidase/catalase and superoxide dismutase (SOD)) and free radical scavengers manage this endogenous production of ROS. OS is a condition reached when certain environmental stresses or genetic defects cause the production of ROS to exceed the management capacity. The damage to diverse cellular components including DNA, proteins, lipids, and carbohydrates resulting from OS (Imlay, 2003; Apel and Hirt, 2004; Perrone et al, 2008) is recognized as an important player in many diseases and in the aging process (Finkel, 2005).
We have applied a systems approach to characterize the OSR of an archaeal model organism, Halobacterium salinarum NRC-1. This haloarchaeon grows aerobically at 4.3 M salt concentration in which it routinely faces cycles of desiccation and rehydration, and increased ultraviolet radiation—both of which can increase the production of ROS (Farr and Kogoma, 1991; Oliver et al, 2001). We have reconstructed the physiological adjustments associated with management of excessive OS through the analysis of global transcriptional changes elicited by step exposure to growth sub-inhibitory and sub-lethal levels of H2O2 and PQ (a redox-cycling drug that produces O2−; Hassan and Fridovich, 1979) as well as during subsequent recovery from these stresses. We have integrated all of these data into a unified model for OSR to discover conditional functional links between protective mechanisms and normal aspects of metabolism. Subsequent phenotypic analysis of gene deletion strains has verified the conditional detoxification functions of three putative peroxidase/catalase enzymes, two SODs, and the protective function of rhodopsins under increased levels of H2O2 and PQ. Similarly, we have also validated ROS scavenging by carotenoids and flotation by gas vesicles as secondary mechanisms that may minimize OS.
Given the ubiquitous nature of OS, it is not entirely surprising that most organisms have evolved similar multiple lines of defense—both passive and active. Although such mechanisms have been extensively characterized using other model organisms, our integrated systems approach has uncovered additional protective mechanisms in H. salinarum (e.g. two evolutionarily distant peroxidase/catalase enzymes) and revealed a structure and hierarchy to the OSR through conditional regulatory associations among various components of the response. We have validated some aspects of the architecture of the regulatory network for managing OS by confirming physical protein–DNA interactions of six transcription factors (TFs) with promoters of genes they were predicted to influence in EGRINOS. Furthermore, we have also shown the consequence of deleting two of these TFs on transcript levels of genes they control and survival rate under OS. It is notable that these TFs are not directly associated with sensing ROS, but, rather, they have a general function in coordinating the overall response. This insight would not have been possible without constructing EGRINOS through systems integration of diverse datasets.
Although it has been known that OS is a component of diverse environmental stress conditions, we quantitatively show for the first time that much of the transcriptional responses induced by the two treatments could indeed have been predicted using a model constructed from the analysis of transcriptional responses to changes in other environmental factors (UV and γ-radiation, light, oxygen, and six metals). However, using specific examples we also reveal the specific components of the OSR that are triggered only under severe OS. Notably, this model of OSR gives a unified perspective of the interconnections among all of these generalized and OS-specific regulatory mechanisms.
Complexity of cellular response to oxidative stress (OS) stems from its wide-ranging damage to nucleic acids, proteins, carbohydrates, and lipids. We have constructed a systems model of OS response (OSR) for Halobacterium salinarum NRC-1 in an attempt to understand the architecture of its regulatory network that coordinates this complex response. This has revealed a multi-tiered OS-management program to transcriptionally coordinate three peroxidase/catalase enzymes, two superoxide dismutases, production of rhodopsins, carotenoids and gas vesicles, metal trafficking, and various other aspects of metabolism. Through experimental validation of interactions within the OSR regulatory network, we show that despite their inability to directly sense reactive oxygen species, general transcription factors have an important function in coordinating this response. Remarkably, a significant fraction of this OSR was accurately recapitulated by a model that was earlier constructed from cellular responses to diverse environmental perturbations—this constitutes the general stress response component. Notwithstanding this observation, comparison of the two models has identified the coordination of frontline defense and repair systems by regulatory mechanisms that are triggered uniquely by severe OS and not by other environmental stressors, including sub-inhibitory levels of redox-active metals, extreme changes in oxygen tension, and a sub-lethal dose of γ rays.
doi:10.1038/msb.2010.50
PMCID: PMC2925529  PMID: 20664639
gene regulatory network; microbiology; oxidative stress
23.  TRStalker: an efficient heuristic for finding fuzzy tandem repeats 
Bioinformatics  2010;26(12):i358-i366.
Motivation: Genomes in higher eukaryotic organisms contain a substantial amount of repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that originate via phenomena such as replication slippage and are characterized by close spatial contiguity. They play an important role in several molecular regulatory mechanisms, and also in several diseases (e.g. in the group of trinucleotide repeat disorders). While the current methods are rather effective for TRs with a low or medium level of divergence, the problem of detecting TRs with higher divergence (fuzzy TRs) is still open. Detecting fuzzy TRs is a prerequisite for enriching our view of their role in regulatory mechanisms and diseases. Fuzzy TRs are also important as tools to shed light on the evolutionary history of the genome, where higher divergence correlates with more remote duplication events.
Results: We have developed an algorithm (christened TRStalker) with the aim of efficiently detecting TRs that are hard to detect because of their inherent fuzziness, due to high levels of base substitutions, insertions and deletions. To attain this goal, we developed heuristics to solve a Steiner version of the problem, in which the fuzziness is measured with respect to a motif string not necessarily present in the input string. This problem is akin to the ‘generalized median string’ problem, which is known to be NP-hard. Experiments with both synthetic and biological sequences demonstrate that our method performs better than the current state of the art for fuzzy TRs and that the fuzzy TRs of the type we detect are indeed present in important biological sequences.
Availability: TRStalker will be integrated in the web-based TRs Discovery Service (TReaDS) at bioalgo.iit.cnr.it.
Contact: marco.pellegrini@iit.cnr.it
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq209
PMCID: PMC2881393  PMID: 20529928
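The "Steiner" formulation in the TRStalker abstract above, where fuzziness is measured against a motif that need not occur in the input, can be illustrated with a toy sketch. This is not the TRStalker heuristic: it assumes a known period, handles substitutions only (no insertions or deletions), and uses a column-wise majority vote as the motif. The function names `consensus_motif` and `fuzzy_tr_score` are hypothetical.

```python
from collections import Counter


def consensus_motif(copies):
    """Column-wise majority vote over equal-length repeat copies."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*copies))


def fuzzy_tr_score(seq, period):
    """Split seq into consecutive copies of length `period` and return the
    consensus motif plus the total number of mismatches against it.

    Assumption: substitutions only; trailing bases that do not fill a
    whole copy are ignored.
    """
    copies = [seq[i:i + period] for i in range(0, len(seq) - period + 1, period)]
    motif = consensus_motif(copies)
    mismatches = sum(a != b for copy in copies for a, b in zip(copy, motif))
    return motif, mismatches


# Three diverged copies: ACGA, ATGT, CCGT
motif, mismatches = fuzzy_tr_score("ACGAATGTCCGT", 4)
print(motif, mismatches)
```

In this example the consensus is a string that appears nowhere in the input, which is the situation the Steiner version of the problem is meant to capture.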
24.  Defining the Plasticity of Transcription Factor Binding Sites by Deconstructing DNA Consensus Sequences: The PhoP-Binding Sites among Gamma/Enterobacteria 
PLoS Computational Biology  2010;6(7):e1000862.
Transcriptional regulators recognize specific DNA sequences. Because these sequences are embedded in the background of genomic DNA, it is hard to identify the key cis-regulatory elements that determine disparate patterns of gene expression. The detection of the intra- and inter-species differences among these sequences is crucial for understanding the molecular basis of both differential gene expression and evolution. Here, we address this problem by investigating the target promoters controlled by the DNA-binding PhoP protein, which governs virulence and Mg2+ homeostasis in several bacterial species. PhoP is particularly interesting; it is highly conserved in different gamma/enterobacteria, regulating not only ancestral genes but also governing the expression of dozens of horizontally acquired genes that differ from species to species. Our approach consists of decomposing the DNA binding site sequences for a given regulator into families of motifs (termed submotifs) using a machine learning method inspired by the “Divide & Conquer” strategy. Partitioning a motif into sub-patterns produced computational advantages for classification, resulting in the discovery of new members of a regulon and alleviating the problem of distinguishing functional sites in chromatin immunoprecipitation and DNA microarray genome-wide analysis. Moreover, we found that certain partitions were useful in revealing biological properties of binding site sequences, including modular gains and losses of PhoP binding sites through evolutionary turnover events, as well as conservation in distant species. The high conservation of PhoP submotifs within gamma/enterobacteria, as well as of the regulatory protein that recognizes them, suggests that the major cause of divergence between related species is not due to the binding sites, as was previously suggested for other regulators. Instead, the divergence may be attributed to the fast evolution of orthologous target genes and/or the promoter architectures resulting from the interaction of those binding sites with the RNA polymerase.
Author Summary
The diversity of life forms frequently results from small changes in the regulatory systems that control gene expression. These changes often occur in cis-elements relevant to transcriptional regulation that are difficult to discern, as they are short, and are embedded in a genomic background that does not play a direct role in gene expression, or that consists of disparate sequences such as those from horizontally acquired genes. We devised a machine-learning method that significantly improves the identification of these elements, uncovering families of binding site motifs (i.e., “submotifs”), instead of a single consensus recognized by a transcriptional regulator. The method can also incorporate other cis-elements to fully describe promoter architectures. Far from being just a computational convenience, ChIP-chip and custom expression microarray experiments for the PhoP regulon validated the high conservation and modular evolution of submotifs throughout the gamma/enterobacteria. This suggests that the major cause of divergence between species is not due to the binding sites, as was previously suggested for other regulators. Instead, the divergence may be attributed to the fast evolution of orthologous and horizontally-acquired target genes, and/or to the uncovered promoter architectures governing the interaction between the regulator and the RNA polymerase.
doi:10.1371/journal.pcbi.1000862
PMCID: PMC2908699  PMID: 20661307
25.  Improved predictions of transcription factor binding sites using physicochemical features of DNA 
Nucleic Acids Research  2012;40(22):e175.
Typical approaches for predicting transcription factor binding sites (TFBSs) involve use of a position-specific weight matrix (PWM) to statistically characterize the sequences of the known sites. Recently, an alternative physicochemical approach, called SiteSleuth, was proposed. In this approach, a linear support vector machine (SVM) classifier is trained to distinguish TFBSs from background sequences based on local chemical and structural features of DNA. SiteSleuth appears to generally perform better than PWM-based methods. Here, we improve the SiteSleuth approach by considering both new physicochemical features and algorithmic modifications. New features are derived from Gibbs energies of amino acid–DNA interactions and hydroxyl radical cleavage profiles of DNA. Algorithmic modifications consist of inclusion of a feature selection step, use of a nonlinear kernel in the SVM classifier, and use of a consensus-based post-processing step for predictions. We also considered SVM classification based on letter features alone to distinguish performance gains from use of SVM-based models versus use of physicochemical features. The accuracy of each of the variant methods considered was assessed by cross validation using data available in the RegulonDB database for 54 Escherichia coli TFs, as well as by experimental validation using published ChIP-chip data available for Fis and Lrp.
doi:10.1093/nar/gks771
PMCID: PMC3526315  PMID: 22923524
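The PWM baseline mentioned in the abstract above can be sketched as follows. This is a minimal illustration of log-odds PWM scoring, not SiteSleuth or the SVM variants. It assumes a uniform background frequency of 0.25 per base and a pseudocount of 1, and the example sites are invented for illustration. The names `build_pwm`, `score`, and `best_hit` are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned, equal-length sites.

    Assumptions: uniform background and an additive pseudocount per base.
    """
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        pwm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores for one candidate window."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in sequence."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Made-up example sites, not taken from RegulonDB
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pwm = build_pwm(sites)
print(best_hit(pwm, "GGCGTATAATGCC"))
```

A PWM treats positions as independent and sees only letters, which is the limitation that feature-based classifiers such as the SVM approaches described in the abstract aim to address.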
