Cys2-His2 zinc finger proteins (ZFPs) are the largest family of transcription factors in higher metazoans. They also represent the most diverse family with regards to the composition of their recognition sequences. Although there are a number of ZFPs with characterized DNA-binding preferences, the specificity of the vast majority of ZFPs is unknown and cannot be directly inferred by homology due to the diversity of recognition residues present within individual fingers. Given the large number of unique zinc fingers and assemblies present across eukaryotes, a comprehensive predictive recognition model that could accurately estimate the DNA-binding specificity of any ZFP based on its amino acid sequence would have great utility. Toward this goal, we have used the DNA-binding specificities of 678 two-finger modules from both natural and artificial sources to construct a random forest-based predictive model for ZFP recognition. We find that our recognition model outperforms previously described determinant-based recognition models for ZFPs, and can successfully estimate the specificity of naturally occurring ZFPs with previously defined specificities.
Regulatory sites that control gene expression are essential to the proper functioning of cells, and identifying them is critical for modeling regulatory networks. We have developed Magma (Multiple Aligner of Genomic Multiple Alignments), a software tool for multiple species, multiple gene motif discovery. Magma identifies putative regulatory sites that are conserved across multiple species and occur near multiple genes throughout a reference genome. Magma takes as input multiple alignments that can include gaps. It uses efficient clustering methods that make it about 70 times faster than PhyloNet, a previous program for this task, with slightly greater sensitivity. We ran Magma on all non-coding DNA conserved between Caenorhabditis elegans and five additional species, about 70 Mbp in total, in <4 h. We obtained 2,309 motifs with lengths of 6–20 bp, each occurring at least 10 times throughout the genome, which collectively covered about 566 kbp of the genomes, approximately 0.8% of the input. Predicted sites occurred in all types of non-coding sequence but were especially enriched in the promoter regions. Comparisons to several experimental datasets show that Magma motifs correspond to a variety of known regulatory motifs.
ChIP analysis; cis-regulatory elements; eukaryotic motif-finding; fast motif-finding; genome-wide motif-finding; motif-expression association; motif redundancy; transcription factor binding site discovery
The specificity of protein-DNA interactions is most commonly modeled using position weight matrices (PWMs). First introduced in 1982, they have been adapted to many new types of data and many different approaches have been developed to determine the parameters of the PWM. New high-throughput technologies provide a large amount of data rapidly and offer an unprecedented opportunity to determine accurately the specificities of many transcription factors (TFs). But taking full advantage of the new data requires advanced algorithms that take into account the biophysical processes involved in generating the data. The new large datasets can also aid in determining when the PWM model is inadequate and must be extended to provide accurate predictions of binding sites. This article provides a general mathematical description of a PWM and how it is used to score potential binding sites, a brief history of the approaches that have been developed and the types of data that are used with an emphasis on algorithms that we have developed for analyzing high-throughput datasets from several new technologies. It also describes extensions that can be added when the simple PWM model is inadequate and further enhancements that may be necessary. It briefly describes some applications of PWMs in the discovery and modeling of in vivo regulatory networks.
Cilia are microtubule based organelles that project from cells. Cilia are found on almost every cell type of the human body and numerous diseases, collectively termed ciliopathies, are associated with defects in cilia, including respiratory infections, male infertility, situs inversus, polycystic kidney disease, retinal degeneration, and Bardet-Biedl Syndrome. Here we show that Illumina-based whole-genome transcriptome analysis in the biflagellate green alga Chlamydomonas reinhardtii identifies 1850 genes up-regulated during ciliogenesis, 4392 genes down-regulated, and 4548 genes with no change in expression during ciliogenesis. We examined four genes up-regulated and not previously known to be involved with cilia (ZMYND10, NXN, GLOD4, SPATA4) by knockdown of the human orthologs in human retinal pigment epithelial cells (hTERT-RPE1) cells to ask whether they are involved in cilia-related processes that include cilia assembly, cilia length control, basal body/centriole numbers, and the distance between basal bodies/centrioles. All of the genes have cilia-related phenotypes and, surprisingly, our data show that knockdown of GLOD4 and SPATA4 also affects the cell cycle. These results demonstrate that whole-genome transcriptome analysis during ciliogenesis is a powerful tool to gain insight into the molecular mechanism by which centrosomes and cilia are assembled.
flagella; deflagellation; ZMYND10; NXN; SPATA4; GLOD4
enoLOGOS is a web-based tool that generates sequence logos from various input sources. Sequence logos have become a popular way to graphically represent DNA and amino acid sequence patterns from a set of aligned sequences. Each position of the alignment is represented by a column of stacked symbols with its total height reflecting the information content in this position. Currently, the available web servers are able to create logo images from a set of aligned sequences, but none of them generates weighted sequence logos directly from energy measurements or other sources. With the advent of high-throughput technologies for estimating the contact energy of different DNA sequences, tools that can create logos directly from binding affinity data are useful to researchers. enoLOGOS generates sequence logos from a variety of input data, including energy measurements, probability matrices, alignment matrices, count matrices and aligned sequences. Furthermore, enoLOGOS can represent the mutual information of different positions of the consensus sequence, a unique feature of this tool. Another web interface for our software, C2H2-enoLOGOS, generates logos for the DNA-binding preferences of the C2H2 zinc-finger transcription factor family members. enoLOGOS and C2H2-enoLOGOS are accessible over the web at .
Zinc-finger nucleases (ZFNs) have been used for genome engineering in a wide variety of organisms; however, it remains challenging to design effective ZFNs for many genomic sequences using publicly available zinc-finger modules. This limitation is in part because of potential finger–finger incompatibility generated on assembly of modules into zinc-finger arrays (ZFAs). Herein, we describe the validation of a new set of two-finger modules that can be used for building ZFAs via conventional assembly methods or a new strategy—finger stitching—that increases the diversity of genomic sequences targetable by ZFNs. Instead of assembling ZFAs based on units of the zinc-finger structural domain, our finger stitching method uses units that span the finger–finger interface to ensure compatibility of neighbouring recognition helices. We tested this approach by generating and characterizing eight ZFAs, and we found their DNA-binding specificities reflected the specificities of the component modules used in their construction. Four pairs of ZFNs incorporating these ZFAs generated targeted lesions in vivo, demonstrating that stitching yields ZFAs with robust recognition properties.
MicroRNAs (miRNAs) are post-transcriptional regulators that bind to their target mRNAs through base complementarity. Predicting miRNA targets is a challenging task and various studies showed that existing algorithms suffer from high number of false predictions and low to moderate overlap in their predictions. Until recently, very few algorithms considered the dynamic nature of the interactions, including the effect of less specific interactions, the miRNA expression level, and the effect of combinatorial miRNA binding. Addressing these issues can result in a more accurate miRNA:mRNA modeling with many applications, including efficient miRNA-related SNP evaluation. We present a novel thermodynamic model based on the Fermi-Dirac equation that incorporates miRNA expression in the prediction of target occupancy and we show that it improves the performance of two popular single miRNA target finders. Modeling combinatorial miRNA targeting is a natural extension of this model. Two other algorithms show improved prediction efficiency when combinatorial binding models were considered. ComiR (Combinatorial miRNA targeting), a novel algorithm we developed, incorporates the improved predictions of the four target finders into a single probabilistic score using ensemble learning. Combining target scores of multiple miRNAs using ComiR improves predictions over the naïve method for target combination. ComiR scoring scheme can be used for identification of SNPs affecting miRNA binding. As proof of principle, ComiR identified rs17737058 as disruptive to the miR-488-5p:NCOA1 interaction, which we confirmed in vitro. We also found rs17737058 to be significantly associated with decreased bone mineral density (BMD) in two independent cohorts indicating that the miR-488-5p/NCOA1 regulatory axis is likely critical in maintaining BMD in women. With increasing availability of comprehensive high-throughput datasets from patients ComiR is expected to become an essential tool for miRNA-related studies.
MicroRNA genes (miRNAs) are small non-coding RNAs that regulate the expression levels of mRNAs post-transcriptionally. miRNAs are critical in many important biological processes, like development, and are important markers for many diseases. Identifying the targets of miRNAs is not an easy task. Recent developments of high-throughput data collection methods for identification of all miRNA targets in a cell are promising, but they still depend on computational algorithms to identify the exact miRNA:mRNA interactions. In this paper we present a novel algorithm, ComiR, which addresses a more general question, that is, whether a given mRNA is targeted by a set of miRNAs. ComiR uses miRNA expression to improve the targeting models of four target prediction algorithms. Then it combines their predicted targets using a support vector machine. By applying ComiR to single nucleotide polymorphism (SNP) data, we identified a SNP that is likely to be causally associated to osteoporosis in women.
The widespread use of zinc finger nucleases (ZFNs) for genome engineering is hampered by the fact that only a subset of sequences can be efficiently recognized using published finger archives. We describe a set of validated two-finger modules that complement existing finger archives and expand the range of ZFN-accessible sequences by three-fold. Using this archive, we successfully introduce lesions at 9 of 11 target sites in the zebrafish genome.
Motivation: Recognition models for protein-DNA interactions, which allow the prediction of specificity for a DNA-binding domain based only on its sequence or the alteration of specificity through rational design, have long been a goal of computational biology. There has been some progress in constructing useful models, especially for C2H2 zinc finger proteins, but it remains a challenging problem with ample room for improvement. For most families of transcription factors the best available methods utilize k-nearest neighbor (KNN) algorithms to make specificity predictions based on the average of the specificities of the k most similar proteins with defined specificities. Homeodomain (HD) proteins are the second most abundant family of transcription factors, after zinc fingers, in most metazoan genomes, and as a consequence an effective recognition model for this family would facilitate predictive models of many transcriptional regulatory networks within these genomes.
Results: Using extensive experimental data, we have tested several machine learning approaches and find that both support vector machines and random forests (RFs) can produce recognition models for HD proteins that are significant improvements over KNN-based methods. Cross-validation analyses show that the resulting models are capable of predicting specificities with high accuracy. We have produced a web-based prediction tool, PreMoTF (Predicted Motifs for Transcription Factors) (http://stormo.wustl.edu/PreMoTF), for predicting position frequency matrices from protein sequence using a RF-based model.
Silent information regulator 2 (Sir2) orthologs are an evolutionarily conserved family of NAD-dependent protein deacetylases that regulate aging and longevity in model organisms. The mammalian Sir2 ortholog Sirt1 regulates metabolic and stress responses through the deacetylation of many transcriptional regulatory factors. To elucidate the mechanism by which Sirt1 controls gene expression in response to nutrient availability, we devised a bioinformatic screen combining gene expression analysis with phylogenetic footprinting to identify transcription factors as new candidate partners of Sirt1. One candidate target was HNF-1α, a homeodomain transcription factor that regulates pancreatic β cell and hepatocyte functions and is commonly mutated in patients with maturity onset diabetes of the young (MODY). Interestingly, Sirt1 physically interacts with HNF-1α in vitro but does so in vivo only in nutrient-restricting conditions. This interaction requires 12–24 hr of nutrient restriction and is dependent on protein synthesis. Both nutrient restriction and Sirt1 suppress HNF-1α transcriptional activity and the expression of one of its target genes, C-reactive protein (Crp), in mouse primary hepatocytes. Pharmacological inhibition of Sirt1 blocks the suppression of Crp by nutrient restriction. Similarly, Crp expression is also suppressed in fasted and diet-restricted liver. Furthermore, Sirt1 and HNF-1α co-localize on two HNF-1α binding sites on the Crp promoter, leading to decreased acetylation of lysine 16 of histone H4 at these sites only in response to nutrient restriction. These findings reveal a novel nutrient-dependent interaction between Sirt1 and HNF-1α and provide important insight into the molecular mechanism by which Sirt1 mediates the anti-aging effects of diet restriction.
Sirt1; bioinformatics; HNF-1α; primary hepatocytes; nutrient restriction; Crp
Transcriptional regulation, a primary mechanism for controlling the development of multicellular organisms, is carried out by transcription factors (TFs) that recognize and bind to their cognate binding sites. In Caenorhabditis elegans, our knowledge of which genes are regulated by which TFs, through binding to specific sites, is still very limited. To expand our knowledge about the C. elegans regulatory network, we performed a comprehensive analysis of the C. elegans, Caenorhabditis briggsae, and Caenorhabditis remanei genomes to identify regulatory elements that are conserved in all genomes. Our analysis identified 4959 elements that are significantly conserved across the genomes and that each occur multiple times within each genome, both hallmarks of functional regulatory sites. Our motifs show significant matches to known core promoter elements, TF binding sites, splice sites, and poly-A signals as well as many putative regulatory sites. Many of the motifs are significantly correlated with various types of experimental data, including gene expression patterns, tissue-specific expression patterns, and binding site location analysis as well as enrichment in specific functional classes of genes. Many can also be significantly associated with specific TFs. Combinations of motif occurrences allow us to predict the location of cis-regulatory modules and we show that many of them significantly overlap experimentally determined enhancers. We provide access to the predicted binding sites, their associated motifs, and the predicted cis-regulatory modules across the whole genome through a web-accessible database and as tracks for genome browsers.
cis-regulatory element; cis-regulatory module; transcription factor; transcriptional regulation; Caenorhabditis elegans
Saccharomyces cerevisiae is a primary model for studies of transcriptional control, and the specificities of most yeast transcription factors (TFs) have been determined by multiple methods. However, it is unclear which position weight matrices (PWMs) are most useful; for the roughly 200 TFs in yeast, there are over 1200 PWMs in the literature. To address this issue, we created ScerTF, a comprehensive database of 1226 motifs from 11 different sources. We identified a single matrix for each TF that best predicts in vivo data by benchmarking matrices against chromatin immunoprecipitation and TF deletion experiments. We also used in vivo data to optimize thresholds for identifying regulatory sites with each matrix. To correct for biases from different methods, we developed a strategy to combine matrices. These aligned matrices outperform the best available matrix for several TFs. We used the matrices to predict co-occurring regulatory elements in the genome and identified many known TF combinations. In addition, we predict new combinations and provide evidence of combinatorial regulation from gene expression data. The database is available through a web interface at http://ural.wustl.edu/ScerTF. The site allows users to search the database with a regulatory site or matrix to identify the TFs most likely to bind the input sequence.
Motivation: Computational techniques for microbial genomic sequence analysis are becoming increasingly important. With next-generation sequencing technology and the human microbiome project underway, current sequencing capacity is significantly greater than the speed at which organisms of interest can be studied experimentally. Most related computational work has been focused on sequence assembly, gene annotation and metabolic network reconstruction. We have developed a method that will primarily use available sequence data in order to determine prokaryotic transcription factor (TF) binding specificities.
Results: Specificity determining residues (critical residues) were identified from crystal structures of DNA–protein complexes and TFs with the same critical residues were grouped into specificity classes. The putative binding regions for each class were defined as the set of promoters for each TF itself (autoregulatory) and the immediately upstream and downstream operons. MEME was used to find putative motifs within each separate class. Tests on the LacI and TetR TF families, using RegulonDB annotated sites, showed the sensitivity of prediction 86% and 80%, respectively.
Supplementary information: Supplementary data are available at Bioinformatics online.
Identifying the DNA binding sites for transcription factors is a key task in modeling the gene regulatory network of a cell. Predicting DNA binding sites computationally suffers from high false positives and false negatives due to various contributing factors, including the inaccurate models for transcription factor specificity. One source of inaccuracy in the specificity models is the assumption of asymmetry for symmetric models.
Using simulation studies, so that the correct binding site model is known and various parameters of the process can be systematically controlled, we test different motif finding algorithms on both symmetric and asymmetric binding site data. We show that if the true binding site is asymmetric the results are unambiguous and the asymmetric model is clearly superior to the symmetric model. But if the true binding specificity is symmetric commonly used methods can infer, incorrectly, that the motif is asymmetric. The resulting inaccurate motifs lead to lower sensitivity and specificity than would the correct, symmetric models. We also show how the correct model can be obtained by the use of appropriate measures of statistical significance.
This study demonstrates that the most commonly used motif-finding approaches usually model symmetric motifs incorrectly, which leads to higher than necessary false prediction errors. It also demonstrates how alternative motif-finding methods can correct the problem, providing more accurate motif models and reducing the errors. Furthermore, it provides criteria for determining whether a symmetric or asymmetric model is the most appropriate for any experimental dataset.
Sequence similarity based protein clustering methods organize proteins into families of similar sequences, a task that continues to be critical for automated protein characterization. However, many protein families cannot be automatically characterized further because little is known about the function of any protein in a family of similar sequences. We present a novel phylogenetic profile comparison (PPC) method called Automated Protein Annotation by Coordinate Evolution (APACE) that facilitates the automated characterization of proteins beyond their homology to other similar sequences. Our method implements a new approach for the normalization of similarity scores among multiple species and automates the characterization of proteins by their patterns of co-evolution with other proteins that do not necessarily share a similar sequence. We demonstrate that our method is able to recapitulate the topology of the latest, unresolved, composite deep eukaryotic phylogeny and is able to quantify the as yet unresolved branch lengths. We further demonstrate that our method is able to detect more functionally related proteins, given the same starting data, than existing methods. Finally, we demonstrate that our method can be successfully applied to much larger comparative genomic problem instances where existing methods often fail.
We examine the use of high-throughput sequencing on binding sites recovered using a bacterial one-hybrid (B1H) system and find that improved models of transcription factor (TF) binding specificity can be obtained compared to standard methods of sequencing a small subset of the selected clones. We can obtain even more accurate binding models using a modified version of B1H selection method with constrained variation (CV-B1H). However, achieving these improved models using CV-B1H data required the development of a new method of analysis—GRaMS (Growth Rate Modeling of Specificity)—that estimates bacterial growth rates as a function of the quality of the recognition sequence. We benchmark these different methods of motif discovery using Zif268, a well-characterized C2H2 zinc-finger TF on both a 28 bp randomized library for the standard B1H method and on 6 bp randomized library for the CV-B1H method for which 45 different experimental conditions were tested: five time points and three different IPTG and 3-AT concentrations. We find that GRaMS analysis is robust to the different experimental parameters whereas other analysis methods give widely varying results depending on the conditions of the experiment. Finally, we demonstrate that the CV-B1H assay can be performed in liquid media, which produces recognition models that are similar in quality to sequences recovered from selection on solid media.
Activator protein 1 (AP-1) transcription factors are dimers of Jun, Fos, MAF and activating transcription factor (ATF) family proteins characterized by basic region and leucine zipper domains1. Many AP-1 proteins contain defined transcriptional activation domains (TADs), but Batf and the closely related Batf3 (refs 2, 3) contain only a basic region and leucine zipper and have been considered inhibitors of AP-1 activity3–8. Here we show that Batf is required for the differentiation of IL-17-producing T helper (TH17) cells9. TH17 cells comprise a CD4+ T cell subset that coordinates inflammatory responses in host defense but is pathogenic in autoimmunity10–13.Batf
−/−mice have normal TH1 and TH2 differentiation, but show a defect in TH17 differentiation, and are resistant to experimental autoimmune encephalomyelitis (EAE).Batf
−/−T cells fail to induce known factors required for TH17 differentiation, such as RORγt11 and the cytokine IL-21 (refs 14–17). Neither addition of IL-21 nor overexpression of RORγt fully restores IL-17 production in Batf−/− T cells. The IL-17 promoter is Batf-responsive, and upon TH17 differentiation, Batf binds conserved intergenic elements in the IL-17A/F locus and to the IL-17, IL-21 and IL-22 (ref 18) promoters. These results demonstrate that the AP-1 protein Batf plays a critical role in TH17 differentiation.
We employ a biophysical model that accounts for the non-linear relationship between binding energy and the statistics of selected binding sites. The model includes the chemical potential of the transcription factor, non-specific binding affinity of the protein for DNA, as well as sequence-specific parameters that may include non-independent contributions of bases to the interaction. We obtain maximum likelihood estimates for all of the parameters and compare the results to standard probabilistic methods of parameter estimation. On simulated data, where the true energy model is known and samples are generated with a variety of parameter values, we show that our method returns much more accurate estimates of the true parameters and much better predictions of the selected binding site distributions. We also introduce a new high-throughput SELEX (HT-SELEX) procedure to determine the binding specificity of a transcription factor in which the initial randomized library and the selected sites are sequenced with next generation methods that return hundreds of thousands of sites. We show that after a single round of selection our method can estimate binding parameters that give very good fits to the selected site distributions, much better than standard motif identification algorithms.
The DNA binding sites of transcription factors that control gene expression are often predicted based on a collection of known or selected binding sites. The most commonly used methods for inferring the binding site pattern, or sequence motif, assume that the sites are selected in proportion to their affinity for the transcription factor, ignoring the effect of the transcription factor concentration. We have developed a new maximum likelihood approach, in a program called BEEML, that directly takes into account the transcription factor concentration as well as non-specific contributions to the binding affinity, and we show in simulation studies that it gives a much more accurate model of the transcription factor binding sites than previous methods. We also develop a new method for extracting binding sites for a transcription factor from a random pool of DNA sequences, called high-throughput SELEX (HT-SELEX), and we show that after a single round of selection BEEML can obtain an accurate model of the transcription factor binding sites.
Motivation: Modeling and identifying the DNA-protein recognition code is one of the most challenging problems in computational biology. Several quantitative methods have been developed to model DNA-protein interactions with specific focus on the C2H2 zinc-finger proteins, the largest transcription factor family in eukaryotic genomes. In many cases, they performed well. But the overall the predictive accuracy of these methods is still limited. One of the major reasons is all these methods used weight matrix models to represent DNA-protein interactions, assuming all base-amino acid contacts contribute independently to the total free energy of binding.
Results: We present a context-dependent model for DNA–zinc-finger protein interactions that allows us to identify inter-positional dependencies in the DNA recognition code for C2H2 zinc-finger proteins. The degree of non-independence was detected by comparing the linear perceptron model with the non-linear neural net (NN) model for their predictions of DNA–zinc-finger protein interactions. This dependency is supported by the complex base-amino acid contacts observed in DNA–zinc-finger interactions from structural analyses. Using extensive published qualitative and quantitative experimental data, we demonstrated that the context-dependent model developed in this study can significantly improves predictions of DNA binding profiles and free energies of binding for both individual zinc fingers and proteins with multiple zinc fingers when comparing to previous positional-independent models. This approach can be extended to other protein families with complex base-amino acid residue interactions that would help to further understand the transcriptional regulation in eukaryotic genomes.
Availability:The software implemented as c programs and are available by request. http://ural.wustl.edu/softwares.html
The binding of transcription factors to their respective DNA sites is a key component of every regulatory network. Predictions of transcription factor binding sites are usually based on models for transcription factor specificity. These models, in turn, are often based on examples of known binding sites.
Collections of binding sites are obtained in simulation experiments where the true model for the transcription factor is known and various sampling procedures are employed. We compare the accuracies of three different and commonly used methods for predicting the specificity of the transcription factor based on example binding sites. Different methods for constructing the models can lead to significant differences in the accuracy of the predictions and we show that commonly used methods can be positively misleading, even at large sample sizes and using noise-free data. Methods that minimize the number of predicted binding sequences are often significantly more accurate than the other methods tested.
Different methods for generating motifs from example binding sites can have significantly different numbers of false positive and false negative predictions. For many different sampling procedures models based on quadratic programming are the most accurate.
We describe the comprehensive characterization of homeodomain DNA-binding specificities from a metazoan genome. The analysis of all 84 independent homeodomains from D. melanogaster reveals the breadth of DNA sequences that can be specified by this recognition motif. The majority of these factors can be organized into 11 different specificity groups, where the preferred recognition sequence between these groups can differ at up to 4 of the 6 core recognition positions. Analysis of the recognition motifs within these groups led to a catalog of common specificity determinants that may cooperate or compete to define the binding site preference. Using these recognition principles, a homeodomain can be reengineered to create factors where its specificity is altered at the majority of recognition positions. This resource also allows prediction of homeodomain specificities from other organisms, which is demonstrated by the prediction and analysis of human homeodomain specificities.
The availability of whole-genome sequences allows for the identification of the entire set of protein coding genes as well as their regulatory regions. This can be accomplished using multiple complementary methods that include ESTs, homology searches and ab initio gene predictions. Previously, the Genie gene-finding algorithm was trained on a small set of Chlamydomonas genes and shown to improve the accuracy of gene prediction in this species compared to other available programs. To improve ab initio gene finding in Chlamydomonas, we assemble a new training set consisting of over 2,300 cDNAs by assembling over 167,000 Chlamydomonas EST entries in GenBank using the EST assembly tool PASA.
The prediction accuracy of our cDNA-trained gene-finder, GreenGenie2, attains 83% sensitivity and 83% specificity for exons on short-sequence predictions. We predict about 12,000 genes in the version v3 Chlamydomonas genome assembly, most of which (78%) are either identical to or significantly overlap the published catalog of Chlamydomonas genes . 22% of the published catalog is absent from the GreenGenie2 predictions; there is also a fraction (23%) of GreenGenie2 predictions that are absent from the published gene catalog. Randomly chosen gene models were tested by RT-PCR and most support the GreenGenie2 predictions.
These data suggest that training with EST assemblies is highly effective and that GreenGenie2 is a valuable, complementary tool for predicting genes in Chlamydomonas reinhardtii.
Gene expression is regulated at each step from chromatin remodeling through translation and degradation. Several known RNA-binding regulatory proteins interact with specific RNA secondary structures in addition to specific nucleotides. To provide a more comprehensive understanding of the regulation of gene expression, we developed an integrative computational approach that leverages functional genomics data and nucleotide sequences to discover RNA secondary structure-defined cis-regulatory elements (SCREs). We applied our structural cis-regulatory element detector (StructRED) to microarray and mRNA sequence data from Saccharomyces cerevisiae, Drosophila melanogaster, and Homo sapiens. We recovered the known specificities of Vts1p in yeast and Smaug in flies. In addition, we discovered six putative SCREs in flies and three in humans. We characterized the SCREs based on their condition-specific regulatory influences, the annotation of the transcripts that contain them, and their locations within transcripts. Overall, we show that modeling functional genomics data in terms of combined RNA structure and sequence motifs is an effective method for discovering the specificities and regulatory roles of RNA-binding proteins.
modeling; mRNA stability; polysome association; post-transcriptional regulation; secondary structure
An increasing number of cis-regulatory RNA elements have been found to regulate gene expression post-transcriptionally in various biological processes in bacterial systems. Effective computational tools for large-scale identification of novel regulatory RNAs are strongly desired to facilitate our exploration of gene regulation mechanisms and regulatory networks. We present a new computational program named RSSVM (RNA Sampler+Support Vector Machine), which employs Support Vector Machines (SVMs) for efficient identification of functional RNA motifs from random RNA secondary structures. RSSVM uses a set of distinctive features to represent the common RNA secondary structure and structural alignment predicted by RNA Sampler, a tool for accurate common RNA secondary structure prediction, and is trained with functional RNAs from a variety of bacterial RNA motif/gene families covering a wide range of sequence identities. When tested on a large number of known and random RNA motifs, RSSVM shows a significantly higher sensitivity than other leading RNA identification programs while maintaining the same false positive rate. RSSVM performs particularly well on sets with low sequence identities. The combination of RNA Sampler and RSSVM provides a new, fast, and efficient pipeline for large-scale discovery of regulatory RNA motifs. We applied RSSVM to multiple Shewanella genomes and identified putative regulatory RNA motifs in the 5′ untranslated regions (UTRs) in S. oneidensis, an important bacterial organism with extraordinary respiratory and metal reducing abilities and great potential for bioremediation and alternative energy generation. From 1002 sets of 5′-UTRs of orthologous operons, we identified 166 putative regulatory RNA motifs, including 17 of the 19 known RNA motifs from Rfam, an additional 21 RNA motifs that are supported by literature evidence, 72 RNA motifs overlapping predicted transcription terminators or attenuators, and other candidate regulatory RNA motifs. Our study provides a list of promising novel regulatory RNA motifs potentially involved in post-transcriptional gene regulation. Combined with the previous cis-regulatory DNA motif study in S. oneidensis, this genome-wide discovery of cis-regulatory RNA motifs may offer more comprehensive views of gene regulation at a different level in this organism. The RSSVM software, predictions, and analysis results on Shewanella genomes are available at http://ural.wustl.edu/resources.html#RSSVM.
RNA is remarkably versatile, acting not only as messengers to transfer genetic information from DNA to protein but also as critical structural components and catalytic enzymes in the cell. More intriguingly, RNA elements in messenger RNAs have been widely found in bacteria to control the expression of their downstream genes. The functions of these RNA elements are intrinsically linked to their secondary structures, which are usually conserved across multiple closely related species during evolution and often shared by genes in the same metabolic pathways. We developed a new computational approach to find putative functional RNA elements by looking for conserved RNA secondary structures that are distinguished from random RNA secondary structures in the orthologous RNA sequences from related species. We applied this approach to multiple Shewanella genomes and predicted putative regulatory RNA elements in Shewanella oneidensis, a bacterium that has extraordinary respiratory and metal reducing abilities and great potential for bioremediation and alternative energy generation. Our findings not only recovered many RNA elements that are known or supported by literature evidence but also included exciting novel RNA elements for further exploration.