The pairwise comparison of RNA secondary structures is a fundamental problem, with direct application in mining databases for annotating putative noncoding RNA candidates in newly sequenced genomes. An increasing number of software tools are available for comparing RNA secondary structures, based on different models (such as ordered trees or forests, arc-annotated sequences, and multilevel trees) and computational principles (edit distance, alignment). We describe here the website BRASERO, which offers tools for evaluating such software on real and synthetic datasets.
Nucleic acid phylogenetic profiling (NAPP) classifies coding and non-coding sequences in a genome according to their pattern of conservation across other genomes. This procedure efficiently distinguishes clusters of functional non-coding elements in bacteria, particularly small RNAs and cis-regulatory RNAs, from other conserved sequences. In contrast to other non-coding RNA detection pipelines, NAPP does not require the presence of conserved RNA secondary structure and therefore is likely to identify previously undetected RNA genes or elements. Furthermore, as NAPP clusters contain both coding and non-coding sequences with similar occurrence profiles, they can be analyzed under a functional perspective. We recently improved the NAPP pipeline and applied it to a collection of 949 bacterial and 68 archaeal species. The database and web interface available at http://napp.u-psud.fr/ enable detailed analysis of NAPP clusters enriched in non-coding RNAs, graphical display of phylogenetic profiles, visualization of predicted RNAs in their genome context and extraction of predicted RNAs for use with genome browsers or other software.
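The idea of grouping sequences by their pattern of conservation across genomes can be illustrated with a minimal sketch. It assumes that each sequence is summarized as a presence/absence vector over a fixed list of genomes and that profiles are compared with a Jaccard distance; the abstract does not state which metric or clustering method NAPP uses, so both choices, along with the sequence names, are hypothetical.

```python
def jaccard_distance(profile_a, profile_b):
    """Distance between two presence/absence profiles (sequences of 0/1).

    0.0 means the two sequences occur in exactly the same genomes,
    1.0 means they never co-occur. The metric is an illustrative
    assumption, not necessarily the one used by NAPP.
    """
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either


# Hypothetical profiles over five genomes (1 = conserved, 0 = absent).
profiles = {
    "coding_gene_A": [1, 1, 0, 1, 0],
    "noncoding_element_X": [1, 1, 0, 1, 0],
    "coding_gene_B": [0, 0, 1, 0, 1],
}

same = jaccard_distance(profiles["coding_gene_A"], profiles["noncoding_element_X"])
different = jaccard_distance(profiles["coding_gene_A"], profiles["coding_gene_B"])
```

In this toy example the non-coding element shares its profile with coding gene A, so the two would fall in the same cluster and could be examined together from a functional perspective.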
Endospore formation is a characteristic shared by some Bacilli and Clostridia that involves the creation of two cell types, the forespore and the mother cell. Hundreds of protein-encoding genes have been shown to be transcribed in a cell-specific fashion during this developmental process in Bacillus subtilis. We have used a phylogenetic profiling procedure to identify clusters of B. subtilis coding and non-coding sequences that co-occur in other endospore formers. One such cluster shows a strong bias for sporulation-related genes (42% among 156 genes) and is enriched in potential non-coding RNAs. We have studied one RNA candidate, encoded in the ylbG-ylbH interval. In vivo analysis using a transcriptional fusion to the Escherichia coli lacZ gene demonstrates that this region of the chromosome contains a gene, csfG, encoding a 147-nucleotide RNA that is transcribed only during sporulation, specifically in the forespore. csfG is present in many endospore formers, mostly Bacilli and some Clostridia, whereas it is absent from bacteria that do not produce endospores. All CsfG RNAs contain a strongly conserved, pyrimidine-rich, central motif that overlaps a potential stem-loop structure. The remarkable conservation of this sequence in widely divergent bacteria suggests that it plays a conserved physiological role, presumably by interacting with an unidentified target in the forespore, where it contributes to the acquisition of the spore properties.
small RNA; sporulation; germination; forespore; Bacilli; Clostridia
Bacterial transcription attenuation occurs through a variety of cis-regulatory elements that control gene expression in response to a wide range of signals. The signal-sensing structures in attenuators are so diverse and rapidly evolving that only a small fraction have been properly annotated and characterized to date. Here we apply a broad-spectrum detection tool in order to achieve a more complete view of the transcriptional attenuation complement of key bacterial species.
Our protocol seeks gene families with an unusual frequency of 5' terminators found across multiple species. Many of the detected attenuators are part of annotated elements, such as riboswitches or T-boxes, which often operate through transcriptional attenuation. However, a significant fraction of candidates were not previously characterized in spite of their unmistakable footprint. We further characterized some of these new elements using sequence and secondary structure analysis. We also present elements that may control the expression of several non-homologous genes, suggesting co-transcription and response to common signals. An important class of such elements, which we called mobile attenuators, is provided by 3' terminators of insertion sequences or prophages that may be exapted as 5' regulators when inserted directly upstream of a cellular gene.
We show here that attenuators involve a complex landscape of signal-detection structures spanning the entire bacterial domain. We discuss possible scenarios through which these diverse 5' regulatory structures may arise or evolve.
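One way to make "an unusual frequency of 5' terminators" concrete is an enrichment test. The sketch below assumes a binomial model in which each gene of a family carries a 5' terminator independently with a background probability; the abstract does not specify the statistic used, so the model, the function name and the numbers are illustrative assumptions only.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    k: genes in the family with a predicted 5' terminator
    n: genes in the family
    p: assumed background frequency of 5' terminators
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical family: 8 of 10 members carry a 5' terminator,
# against an assumed background frequency of 10%.
p_value = binomial_tail(8, 10, 0.1)
```

A small tail probability flags the family as a candidate; in practice a correction for the number of families tested would also be needed.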
Using an experimental approach, we investigated the RNome of the pathogen Staphylococcus aureus to identify 30 small RNAs (sRNAs) including 14 that are newly confirmed. Among the latter, 10 are encoded in intergenic regions, three are generated by premature transcription termination associated with riboswitch activities, and one is expressed from the complementary strand of a transposase gene. The expression of four sRNAs increases during the transition from exponential to stationary phase. We focused our study on RsaE, an sRNA that is highly conserved in the order Bacillales and is deleterious when over-expressed. We show that RsaE interacts in vitro with the 5′ region of opp3A mRNA, encoding an ABC transporter component, to prevent formation of the ribosomal initiation complex. A previous report showed that RsaE targets opp3B, which is co-transcribed with opp3A. Thus, our results identify an unusual case of riboregulation where the same sRNA controls an operon mRNA by targeting two of its cistrons. A combination of biocomputational and transcriptional analyses revealed a remarkably coordinated RsaE-dependent downregulation of numerous metabolic enzymes involved in the citrate cycle and the folate-dependent one-carbon metabolism. As we observed that RsaE accumulates transiently in late exponential growth, we propose that RsaE functions to ensure a coordinated downregulation of the central metabolism when carbon sources become scarce.
The Annotathon is a novel bioinformatics teaching environment, where undergraduate students join in a community annotation effort. Besides being a rewarding educational tool, it holds the added promise of potentially useful scientific findings.
A 10X draft sequence of Podospora anserina genome shows highly dynamic evolution since its divergence from Neurospora crassa.
The dung-inhabiting ascomycete fungus Podospora anserina is a model used to study various aspects of eukaryotic and fungal biology, such as ageing, prions and sexual development.
We present a 10X draft sequence of the P. anserina genome, linked to the sequences of a large expressed sequence tag collection. Similar to higher eukaryotes, the P. anserina transcription/splicing machinery generates numerous non-conventional transcripts. Comparison of the P. anserina genome and orthologous gene set with those of its close relative, Neurospora crassa, shows that synteny is poorly conserved, the main result of evolution being gene shuffling within the same chromosome. The P. anserina genome contains fewer repeated sequences and has evolved new genes by duplication since its separation from N. crassa, despite the presence of the repeat induced point mutation mechanism that mutates duplicated sequences. We also provide evidence that frequent gene loss took place in the lineages leading to P. anserina and N. crassa. P. anserina contains a large and highly specialized set of genes involved in utilization of natural carbon sources commonly found in its natural biotope. It includes genes potentially involved in lignin degradation and efficient cellulose breakdown.
The features of the P. anserina genome indicate a highly dynamic evolution since the divergence of P. anserina and N. crassa, leading to the ability of the former to use specific complex carbon sources that match its needs in its natural biotope.
Most mammalian genes are able to express several splice variants in a phenomenon known as alternative splicing. Serious alterations of alternative splicing occur in cancer tissues, leading to expression of multiple aberrant splice forms. Most studies of alternative splicing defects have focused on the identification of cancer-specific splice variants as potential therapeutic targets. Here, we examine instead the bulk of non-specific transcript isoforms and analyze their level of disorder using a measure of uncertainty called Shannon's entropy. We compare isoform expression entropy in normal and cancer tissues from the same anatomical site for different classes of transcript variations: alternative splicing, polyadenylation, and transcription initiation. Whereas alternative initiation and polyadenylation show no significant gain or loss of entropy between normal and cancer tissues, alternative splicing shows highly significant entropy gains for 13 of the 27 cancers studied. This entropy gain is characterized by a flattening in the expression profile of normal isoforms and is correlated to the level of estimated cellular proliferation in the cancer tissue. Interestingly, the genes that present the highest entropy gain are enriched in splicing factors. We provide here the first quantitative estimate of splicing disruption in cancer. The expression of normal splice variants is widely and significantly disrupted in at least half of the cancers studied. We postulate that such splicing disorders may develop in part from splicing alteration in key splice factors, which in turn significantly impact multiple target genes.
RNA splicing is the process by which gene products are pieced together to form a mature messenger RNA (mRNA). In normal cells, RNA splicing is a tightly controlled process that leads to production of a well-defined set of mRNAs. Cancer cells, however, often produce aberrant, mis-spliced mRNAs. Such disorders have not been quantified to date. To this end, we use a well-known measure of disorder called Shannon's entropy. We show that overall splicing disorders are highly significant in many cancers, and that the extent of disorder may be correlated to the level of cell proliferation in each tumor. Surprisingly, genes that control the splicing mechanism are unusually frequent among genes affected by splicing disorders. This suggests that cancer cells may withstand harmful chain reactions in which splicing defects in key regulatory genes would in turn cause extensive splicing damage. As mis-spliced mRNAs are widely studied for cancer diagnosis, awareness of these global disorders is important to distinguish reliable cancer markers from background noise.
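The entropy measure referred to above can be written down in a few lines. This is a minimal sketch, assuming that each gene is described by a list of non-negative expression values, one per isoform, and that entropy is measured in bits; how the study estimates isoform expression and normalizes the values is not stated here, and the numbers are made up.

```python
import math


def isoform_entropy(expression):
    """Shannon entropy (in bits) of the isoform expression profile of one gene.

    expression: non-negative expression values, one per isoform.
    Returns 0.0 when a single isoform carries all the expression and
    log2(n) when n isoforms are equally expressed.
    """
    total = sum(expression)
    if total <= 0:
        return 0.0
    entropy = 0.0
    for value in expression:
        if value > 0:
            p = value / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical gene with three isoforms: one dominant isoform in normal
# tissue, a flatter profile in the matched cancer tissue.
normal = isoform_entropy([8, 1, 1])
cancer = isoform_entropy([4, 3, 3])
gain = cancer - normal
```

A positive `gain` corresponds to the flattening of the expression profile described in the abstract: expression is spread more evenly over the same set of isoforms.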
High-throughput EST and full-length cDNA sequencing have revealed extensive variations at the 3′ ends of mammalian transcripts. Whether all of these changes are biologically meaningful has been the subject of controversy, as such results may in part reflect transcription or polyadenylation leakage. Here we selected a set of tandem poly(A) sites predicted from EST/cDNA sequence analysis that (i) are conserved between human and mouse, (ii) produce alternative 3′ isoforms with unusual size features and (iii) are not documented in current genome databases, and we submitted these sites to experimental validation in mouse tissues. Out of 86 tested poly(A) sites from 44 genes, 84 were individually confirmed using a specially devised RT-PCR strategy. We then focused on validating the exon structure between distant tandem poly(A) sites separated by over 3 kb, and between stop codons and alternative poly(A) sites located at 4.5 kb or more, using a long-distance RT-PCR strategy. In most cases, long transcripts spanning the whole poly(A)–poly(A) or stop-poly(A) distance were detected, confirming that tandem sites were part of the same transcription unit. Given the apparent conservation of these long alternative 3′ ends, different regulatory functions can be foreseen, depending on the location where transcription starts.
Alternative polyadenylation is a widespread mechanism contributing to transcript diversity in eukaryotes. Over half of mammalian genes are alternatively polyadenylated. Our understanding of poly(A) site evolution is limited by the lack of a reliable identification of conserved, equivalent poly(A) sites among species. We introduce here a working definition of conserved poly(A) sites as sites that are both (i) properly aligned in human and mouse orthologous 3' untranslated regions (UTRs) and (ii) supported by EST or cDNA data in both species.
We identified about 4800 such conserved poly(A) sites covering one third of the orthologous gene set studied. Characteristics of conserved poly(A) sites such as processing efficiency and tissue-specificity were analyzed. Conserved sites show a higher processing efficiency but no difference in tissue distribution when compared to non-conserved sites. In general, alternative poly(A) sites are species-specific and involve minor, non-conserved sites that are unlikely to play essential roles. However, there are about 500 genes with conserved tandem poly(A) sites. A significant fraction of these conserved tandems display a conserved arrangement of major/minor sites in their 3' UTR, suggesting that these alternative 3' ends may be under selection.
This analysis allows us to identify potential functional alternative poly(A) sites and provides clues on the selective mechanisms at play in the appearance of multiple poly(A) sites and their maintenance in the 3' UTRs of genes.
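The working definition of a conserved poly(A) site given above (aligned in orthologous human and mouse 3' UTRs, and supported by EST or cDNA data in both species) can be sketched as a simple filter. The data layout, the `max_offset` tolerance on alignment columns and the example values are assumptions made for illustration; the abstract does not say how "properly aligned" is decided.

```python
def conserved_sites(human_sites, mouse_sites, max_offset=0):
    """Pair human and mouse poly(A) sites that satisfy both criteria.

    Each site is a tuple (alignment_column, est_count), where
    alignment_column is the site position in the human/mouse 3' UTR
    alignment and est_count the number of supporting ESTs/cDNAs.
    A human site is conserved if some mouse site lies within max_offset
    alignment columns and both sites have transcript support.
    """
    conserved = []
    for h_col, h_est in human_sites:
        for m_col, m_est in mouse_sites:
            aligned = abs(h_col - m_col) <= max_offset
            supported = h_est > 0 and m_est > 0
            if aligned and supported:
                conserved.append((h_col, m_col))
                break  # report each human site at most once
    return conserved


# Hypothetical 3' UTR pair: three human and three mouse candidate sites.
human = [(120, 5), (400, 2), (900, 0)]
mouse = [(122, 3), (640, 1), (900, 4)]
pairs = conserved_sites(human, mouse, max_offset=5)
```

Here only the first human site qualifies: the second has no aligned mouse counterpart, and the third is aligned but lacks EST support in human.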
Alternative polyadenylation sites produce transcript isoforms with 3′ untranslated regions (UTRs) of different lengths. If a microRNA (miRNA) target is present in the UTR, then only those target-containing isoforms should be sensitive to control by a cognate miRNA. We carried out a systematic examination of 3′ UTRs containing multiple poly(A) sites and putative miRNA targets. Based on expressed sequence tag (EST) counts and EST library information, we observed that levels of isoforms containing targets for miR-1 or miR-124, two miRNAs causing downregulation of transcript levels, were reduced in tissues expressing the corresponding miRNA. This analysis was repeated for all conserved 7-mers in 3′ UTRs, resulting in a selection of 312 motifs. We show that this set is significantly enriched in known miRNA targets and mRNA-destabilizing elements, which validates our initial hypothesis. We scanned the human genome for possible cognate miRNAs and identified phylogenetically conserved precursors matching our motifs. This analysis can help identify target-miRNA couples that went undetected in previous screens, but it may also reveal targets for other types of regulatory factors.
MicroRNAs (miRNAs) are short RNA molecules that recognize specific target sequences in the 3′ region of mRNAs. These miRNAs can then specifically keep the mRNAs from being expressed, or translated into proteins. In this article, the authors ask what happens when a targeted mRNA has several forms differing by their 3′ regions. Such 3′ variations are very common. If two or more variations are present in a single mRNA, the result is two or more mRNAs with 3′ ends of different lengths. If an miRNA target is located between the two sites of variability, the shorter transcript should be target free and should escape miRNA-mediated inhibition, while longer transcripts should be inhibited. To test this hypothesis, the authors looked at mRNAs that had these variable 3′ ends. Variants containing targets for certain miRNAs appeared to be specifically underrepresented in tissues where these particular miRNAs are found. This principle was used to find other sequence patterns in 3′ regions that had a similar effect, and a list of 312 significant patterns was obtained. The authors then scanned genome sequences and identified possible cognate miRNAs for these patterns. This new knowledge will help further an understanding of how genes are controlled.
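The hypothesis described above reduces to a positional test: an isoform can only be regulated by an miRNA if its 3′ UTR extends past the target. The following sketch assumes that positions are counted in nucleotides from the stop codon and that a target is a 7-mer that must lie entirely upstream of the poly(A) site; the coordinates are invented.

```python
def isoforms_with_target(polya_sites, target_start, target_length=7):
    """Tell which 3' UTR isoforms contain a given miRNA target.

    polya_sites: positions of the alternative poly(A) sites, each defining
                 one isoform whose 3' UTR ends at that position.
    target_start: position of the first nucleotide of the target.
    Returns a dict mapping each poly(A) site to True if the isoform
    ending there contains the whole target.
    """
    target_end = target_start + target_length
    return {site: target_end <= site for site in sorted(polya_sites)}


# Hypothetical gene with a proximal and a distal poly(A) site and a
# target located between them.
sensitive = isoforms_with_target([300, 1200], target_start=700)
```

The short isoform ending at position 300 is target free, so under the hypothesis only the long isoform should be depleted in tissues expressing the cognate miRNA.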
The three major mechanisms that regulate transcript formation involve the selection of alternative sites for transcription start (TS), splicing, and polyadenylation. Currently, separate efforts collect data and annotation for each of these variant types. It is important to take an integrated view of these data sets and to derive a data set of alternate transcripts along with consolidated annotation. We have previously developed computational pipelines that generate value-added, genome-scale data on individual variant types; these include AltSplice for splicing and AltPAS for polyadenylation. We now extend these pipelines and integrate the resulting data sets to provide a unified view of the contributions of splicing and polyadenylation to the formation of transcript variants.
The AltSplice pipeline examines gene-transcript alignments and delineates alternative splice events and splice patterns. This pipeline is extended as AltTrans to delineate isoform transcript patterns, for each of which both the introns/exons and the 'terminating' polyA site are delineated; EST/mRNA sequences that qualify the transcript pattern confirm both the underlying splicing and the polyadenylation. The AltPAS pipeline examines gene-transcript alignments and delineates all potential polyA sites irrespective of underlying splicing patterns. The resulting polyA sites from both AltTrans and AltPAS are merged. The generated database reports data on alternative splicing, alternative polyadenylation and the resulting alternate transcript patterns; the basal data are annotated for various biological features. The data generated for human and mouse (named integrated AltTrans data) are made available through the Alternate Transcript Diversity web site.
The reported data set presents alternate transcript patterns that are annotated for both alternative splicing and alternative polyadenylation. Results based on current transcriptome data indicate that the contribution of alternative splicing is larger than that of alternative polyadenylation.
Computational biologists use Expectation values (E-values) to estimate the number of solutions that can be expected by chance during a database scan. Here we focus on computing Expectation values for RNA motifs defined by single-strand and helix lod-score profiles with variable helix spans. Such E-values cannot be computed assuming a normal score distribution and their estimation previously required lengthy simulations.
We introduce discrete convolutions as an accurate and fast means of estimating score distributions of lod-score profiles. This method provides excellent score estimations for all single-strand or helical elements tested and also applies to the combination of elements into larger, complex motifs. Further, the estimated distributions remain accurate even when pseudocounts are introduced into the lod-score profiles. Estimated score distributions are then easily converted into E-values.
Good agreement was observed between computed E-values and simulations for a number of complete RNA motifs. This method is now implemented in the ERPIN software, but it can equally be applied to any search procedure based on ungapped profiles with statistically independent columns.
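The convolution principle can be illustrated for the simplest case of an ungapped single-strand profile with independent columns. This is a sketch under stated assumptions, not the ERPIN implementation: scores are binned to integers with an arbitrary `scale` factor, the background is a fixed base composition, helix profiles and variable spans are left out, and all function names are hypothetical.

```python
from collections import defaultdict


def column_distribution(column_scores, background, scale=10):
    """Score distribution of one profile column under the background model.

    column_scores: dict base -> lod score; background: dict base -> probability.
    Scores are binned to integers (score * scale, rounded).
    """
    dist = defaultdict(float)
    for base, score in column_scores.items():
        dist[round(score * scale)] += background[base]
    return dict(dist)


def convolve(dist_a, dist_b):
    """Distribution of the sum of two independent binned scores."""
    out = defaultdict(float)
    for score_a, prob_a in dist_a.items():
        for score_b, prob_b in dist_b.items():
            out[score_a + score_b] += prob_a * prob_b
    return dict(out)


def profile_distribution(profile, background, scale=10):
    """Convolve the column distributions of a whole ungapped profile."""
    total = {0: 1.0}
    for column in profile:
        total = convolve(total, column_distribution(column, background, scale))
    return total


def e_value(dist, cutoff, n_positions, scale=10):
    """Expected number of chance hits scoring at least cutoff."""
    tail = sum(p for s, p in dist.items() if s >= round(cutoff * scale))
    return tail * n_positions


# Toy two-column profile favouring A, uniform background, 1600 positions scanned.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}
column = {"A": 1.0, "C": -1.0, "G": -1.0, "U": -1.0}
dist = profile_distribution([column, column], background)
expected_hits = e_value(dist, cutoff=2.0, n_positions=1600)
```

Only the sequence AA reaches the cutoff of 2.0, with probability 1/16 under the uniform background, so 100 chance hits are expected over 1600 positions. Because the number of score bins stays small, the exact distribution is obtained without simulation, which is the point made in the abstract.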
ERPIN is an RNA motif identification program that takes an RNA sequence alignment as an input and identifies related sequences using a profile-based dynamic programming algorithm. ERPIN differs from other RNA motif search programs in its ability to capture subtle biases in the training set and produce highly specific and sensitive searches, while keeping CPU requirements at a practical level. In its latest version, ERPIN also computes E-values, which tell biologists how likely they are to encounter a specific sequence match by chance—a useful indication of biological significance. We present here the ERPIN online search interface (http://tagc.univ-mrs.fr/erpin/). This web server automatically performs ERPIN searches for different RNA genes or motifs, using predefined training sets and search parameters. With a couple of clicks, users can analyze an entire bacterial genome or a genomic segment of up to 5Mb for the presence of tRNAs, 5S rRNAs, SRP RNA, C/D box snoRNAs, hammerhead motifs, miRNAs and other motifs. Search results are displayed with sequence, score, position, E-value and secondary structure graphics. An example of a complete genome scan is provided, as well as an evaluation of run times and specificity/sensitivity information for all available motifs.
Differential polyadenylation is a widespread mechanism in higher eukaryotes producing mRNAs with different 3' ends in different contexts. This involves several alternative polyadenylation sites in the 3' UTR, each with its specific strength. Here, we analyze the vicinity of human polyadenylation signals in search of patterns that would help discriminate strong and weak polyadenylation sites, or true sites from randomly occurring signals.
We used human genomic sequences to retrieve the region downstream of polyadenylation signals, usually absent from cDNA or mRNA databases. Analyzing 4956 EST-validated polyadenylation sites and their -300/+300 nt flanking regions, we clearly visualized the upstream (USE) and downstream (DSE) sequence elements, both characterized by U-rich (not GU-rich) segments. The presence of a USE and a DSE is the main feature distinguishing true polyadenylation sites from randomly occurring A(A/U)UAAA hexamers. While USEs are indifferently associated with strong and weak poly(A) sites, DSEs are more conspicuous near strong poly(A) sites. We then used the region encompassing the hexamer and DSE as a training set for poly(A) site identification by the ERPIN program and achieved a prediction specificity of 69 to 85% for a sensitivity of 56%.
The availability of complete genomes and large EST sequence databases now permits large-scale observation of polyadenylation sites. U-rich sequences flanking both sides of poly(A) signals contribute to the definition of "true" sites. However, the downstream U-rich sequences may also play an enhancing role. Based on this information, poly(A) site prediction accuracy was moderately but consistently improved compared to the best previously available algorithm.
RNA molecules fold into characteristic secondary and tertiary structures that account for their diverse functional activities. Many of these RNA structures are assembled from a collection of RNA structural motifs. These basic building blocks are used repeatedly, and in various combinations, to form different RNA types and define their unique structural and functional properties. Identification of recurring RNA structural motifs will therefore enhance our understanding of RNA structure and help associate elements of RNA structure with functional and regulatory elements. Our goal was to develop a computer program that can describe an RNA structural element of any complexity and then search any nucleotide sequence database, including the complete prokaryotic and eukaryotic genomes, for these structural elements. Here we describe in detail a new computational motif search algorithm, RNAMotif, and demonstrate its utility with some motif search examples. RNAMotif differs from other motif search tools in two important aspects: first, the structure definition language is more flexible and can specify any type of base–base interaction; second, RNAMotif provides a user controlled scoring section that can be used to add capabilities that patterns alone cannot provide.
The positive selection of CD4+ T cells requires the expression of major histocompatibility complex (MHC) class II molecules in the thymus, but the role of self-peptides complexed to class II molecules is still a matter of debate. Recently, it was observed that transgenic mice expressing a single peptide–MHC class II complex positively select significant numbers of diverse CD4+ T cells in the thymus. However, the number of selected T cell specificities has not been evaluated so far. Here, we have sequenced 700 junctional complementarity determining regions 3 (CDR3) from T cell receptors (TCRs) carrying Vβ11-Jβ1.1 or Vβ12-Jβ1.1 rearrangements. We found that a single peptide–MHC class II complex positively selects at least 10^5 different Vβ rearrangements. Our data yield a first evaluation of the size of the T cell repertoire. In addition, they provide evidence that the single Eα52-68–I-Ab complex skews the amino acid frequency in the TCR CDR3 loop of positively selected T cells. A detailed analysis of CDR3 sequences indicates that a fraction of the β chain repertoire bears the imprint of the selecting self-peptide.
thymus; major histocompatibility complex; T cell receptors; repertoire development; transgenic/knockout