The rapid determination of nucleic acid sequences is increasing the number of sequences that are available. Inherent in a template or seed alignment is the culmination of structural and functional constraints that are selecting those mutations that are viable during the evolution of the RNA. While we might not understand these structural and functional constraints, template-based alignment programs utilize the patterns of sequence conservation to encapsulate the characteristics of viable RNA sequences that are aligned properly. We have developed a program, CRWAlign, that utilizes the different dimensions of information in rCAD, a large RNA informatics resource, to establish a profile for each position in an alignment. The most significant dimensions include sequence identity and column composition in different phylogenetic taxa. We have compared our methods with up to eight alternative alignment methods on different sets of 16S and 23S rRNA sequences with sequence percent identities ranging from 50% to 100%. The results showed that CRWAlign outperformed the other alignment methods in both speed and accuracy. A web-based alignment server is available at http://www.rna.ccbb.utexas.edu/SAE/2F/CRWAlign.
RNA sequence alignment; template-based alignment; comparative analysis; phylogenetic-based alignment
We present a fast pairwise RNA sequence alignment method using structural information, named R-PASS (RNA Pairwise Alignment of Structure and Sequence), which shows good accuracy on sequences with low sequence identity and is significantly faster than alternative methods. The method begins by representing RNA secondary structure as a set of structure motifs. The motifs from two RNAs are then used as input into a bipartite graph-matching algorithm, which determines the structure matches. The matches are then used as constraints in a constrained dynamic programming sequence alignment procedure. The R-PASS method has an O(nm) complexity. We compare our method with two other structure-based alignment methods, LARA and ExpaLoc, and with a sequence-based alignment method, MAFFT, across three benchmarks, and obtain favorable accuracy while running orders of magnitude faster.
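The constrained dynamic programming step can be illustrated with a minimal sketch. This is not the R-PASS implementation: the function names, the scoring values, and the `anchors` list (standing in for the structure matches that a bipartite matching step would supply) are all assumptions made for illustration. Each anchor fixes one pair of positions in the same column, and ordinary global alignment fills the segments in between, so the total work stays bounded by O(nm).

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Plain global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def constrained_align(a, b, anchors, **scoring):
    """Align a and b so that each (i, j) in anchors shares a column.

    anchors: hypothetical list of 0-based position pairs, strictly
    increasing in both coordinates.
    """
    out_a, out_b = [], []
    prev_i = prev_j = 0
    for i, j in anchors:
        seg_a, seg_b = needleman_wunsch(a[prev_i:i], b[prev_j:j], **scoring)
        out_a += [seg_a, a[i]]
        out_b += [seg_b, b[j]]
        prev_i, prev_j = i + 1, j + 1
    seg_a, seg_b = needleman_wunsch(a[prev_i:], b[prev_j:], **scoring)
    out_a.append(seg_a)
    out_b.append(seg_b)
    return "".join(out_a), "".join(out_b)
```

Splitting at anchors means that a wrong anchor cannot be repaired by the sequence alignment, which is why the quality of the structure matches matters in a scheme of this kind.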
RNA pairwise structural alignment; structure motif; bipartite graph matching; constraint sequence alignment
Beyond its direct involvement in protein synthesis with mRNA, tRNA, and rRNA, RNA is now being appreciated for its significance in the overall metabolism and regulation of the cell. Comparative analysis has been very effective in the identification and characterization of RNA molecules, including the accurate prediction of their secondary structure. We are developing an integrative scalable data management and analysis system, the RNA Comparative Analysis Database (rCAD), implemented with SQL Server to support RNA comparative analysis. The platform-agnostic database schema of rCAD captures the essential relationships between the different dimensions of information for RNA comparative analysis datasets. The rCAD implementation enables a variety of comparative analysis manipulations with multiple integrated data dimensions for advanced RNA comparative analysis workflows. In this paper, we describe details of the rCAD schema design and illustrate its usefulness with two usage scenarios.
Biological Database; RNA Sequence Analysis; Bioinformatics; Database Schema
Helices are an essential element in defining the three-dimensional architecture of structured RNAs. While internal basepairs in a canonical helix stack on both sides, the ends of the helix stack on only one side and are exposed to the loop side, and are thus susceptible to fraying unless they are protected. Coaxial stacking has long been known to stabilize helix ends by directly stacking two canonical helices coaxially. Based on analysis of helix-loop junctions in RNA crystal structures, we describe herein helix capping, the topological stacking of a helix end with a basepair or an unpaired nucleotide from the loop side, which in turn protects helix ends. Beyond the topological protection of helix ends against fraying, helix capping should confer greater stability onto the resulting composite helices. Our analysis also reveals that this general motif is associated with the formation of tertiary structure interactions. Greater knowledge about the dynamics at the helix-junctions in the secondary structure should enhance the prediction of RNA secondary structure with a richer set of energetic rules and help better understand the folding of a secondary structure into its three-dimensional structure. Together, these findings suggest that helix capping likely plays a fundamental role in driving RNA folding.
The analysis of RNA sequences, once a small niche field for a small collection of scientists whose primary emphasis was the structure and function of a few RNA molecules, has grown most significantly with the realizations that 1) RNA is implicated in many more functions within the cell, and 2) the analysis of ribosomal RNA sequences is revealing more about the microbial ecology within all biological and environmental systems. The accurate and rapid alignment of these RNA sequences is essential to decipher the maximum amount of information from this data.
Two computer systems that utilize the Gutell lab's RNA Comparative Analysis Database (rCAD) were developed to align sequences to an existing template alignment available at the Gutell lab's Comparative RNA Web (CRW) Site. Multiple dimensions of cross-indexed information are contained within the relational database rCAD, including sequence alignments, the NCBI phylogenetic tree, and comparative secondary structure information for each aligned sequence. The first program, CRWAlign-1, creates a phylogenetic-based sequence profile for each column in the alignment. The second program, CRWAlign-2, creates a profile based on phylogenetic, secondary structure, and sequence information. Both programs utilize their profiles to align new sequences into the template alignment.
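A phylogenetic-based column profile of the kind described here can be sketched as follows. This is an illustrative assumption about the data layout, not the CRWAlign code: the function name `column_profiles`, the flat list of taxon labels, and the use of simple relative frequencies are all hypothetical, and rCAD itself stores this information in a relational database.

```python
from collections import Counter, defaultdict


def column_profiles(alignment, taxa):
    """Per-taxon nucleotide frequencies for every alignment column.

    alignment: list of equal-length gapped sequences (template alignment).
    taxa: parallel list of taxon labels, one per sequence (hypothetical
          stand-in for a lookup in the NCBI phylogenetic tree).
    Returns profiles[taxon][column] -> {character: relative frequency}.
    """
    length = len(alignment[0])
    counts = defaultdict(lambda: [Counter() for _ in range(length)])
    for seq, taxon in zip(alignment, taxa):
        for col, ch in enumerate(seq):
            counts[taxon][col][ch] += 1
    profiles = {}
    for taxon, columns in counts.items():
        profiles[taxon] = [
            {ch: c / sum(col.values()) for ch, c in col.items()}
            for col in columns
        ]
    return profiles


template = ["ACG-", "ACGU", "GCGU"]
labels = ["Bacteria", "Bacteria", "Archaea"]
profiles = column_profiles(template, labels)
```

A new sequence could then be scored against the profile of its own taxon at each candidate column, so that a nucleotide that is rare overall but typical for that taxon is not penalized.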
The accuracies of the two CRWAlign programs were compared with the best template-based rRNA alignment programs and the best de-novo alignment programs. We have compared our programs with a total of eight alternative alignment methods on different sets of 16S rRNA alignments with sequence percent identities ranging from 50% to 100%. Both CRWAlign programs were superior to these other programs in accuracy and speed.
Both CRWAlign programs can be used to align the very large number of RNA sequences generated by rapid next-generation sequencing technology. This technology is augmenting the new paradigm that RNA is intimately implicated in a significant number of functions within the cell. In addition, the use of bacterial 16S rRNA sequencing in the identification of the microbiome in many different environmental systems creates a need for rapid and highly accurate alignment of bacterial 16S rRNA sequences.
Lactobacilli (Lactobacillales: Lactobacillaceae) are well known for their roles in food fermentation, as probiotics, and in human health, but they can also be dominant members of the microbiota of some species of Hymenoptera (ants, bees, and wasps). Honey bees and bumble bees associate with host-specific lactobacilli, and some evidence suggests that these lactobacilli are important for bee health. Social transmission helps maintain associations between these bees and their respective microbiota. To determine whether lactobacilli associated with social hymenopteran hosts are generally host specific, we gathered publicly available Lactobacillus 16S rRNA gene sequences, along with Lactobacillus sequences from 454 pyrosequencing surveys of six other hymenopteran species (three sweat bees and three ants). We determined the comparative secondary structural models of 16S rRNA, which allowed us to accurately align the entire 16S rRNA gene, including fast-evolving regions. BLAST searches and maximum-likelihood phylogenetic reconstructions confirmed that honey and bumble bees have host-specific Lactobacillus associates. Regardless of colony size or within-colony oral sharing of food (trophallaxis), sweat bees and ants associate with lactobacilli that are closely related to those found in vertebrate hosts or in diverse environments. Why honey and bumble bees associate with host-specific lactobacilli while other social Hymenoptera do not remains an open question. Lactobacilli are known to inhibit the growth of other microbes and can be beneficial whether they are coevolved with their host or are recruited by the host from environmental sources through mechanisms of partner choice.
Protein translation is essential for all forms of life and is conducted by a macromolecular complex, the ribosome. Evolutionary changes in protein and RNA sequences can affect the three-dimensional organization of structural features in ribosomes in different species. The most dramatic changes occur in animal mitochondria, whose genomes have been significantly reduced and altered. The RNA component of the mitochondrial ribosome (mitoribosome) is reduced in size, with a compensatory increase in protein content. Until recently, it was unclear how these changes affect the three-dimensional structure of the mitoribosome. Here we present a structural model of the large subunit (LSU) of the mammalian mitoribosome developed by combining molecular modeling techniques with cryo-electron microscopic (cryo-EM) studies. The model contains 93% of the mitochondrial rRNA (mito-rRNA) sequence and 16 mitochondrial ribosomal proteins (MRPs) in the large subunit of the mitoribosome. Despite the smaller mitochondrial rRNA, the spatial positions of RNA domains known to be directly involved in protein synthesis are essentially the same as in bacterial and archaeal ribosomes. However, the dramatic reduction in rRNA content necessitates evolution of unique structural features to maintain connectivity between RNA domains. The smaller rRNA sequence also limits the likelihood of tRNA binding at the E-site of the mitoribosome, and correlates with the reduced size of D- and T-loops in some animal mitochondrial tRNAs, suggesting co-evolution of mitochondrial rRNA and tRNA structures.
RNA is directly associated with a growing number of functions within the cell. The accurate prediction of the higher-order structure of different RNAs from their nucleic acid sequences will provide insight into their functions and molecular mechanics. We have been determining statistical potentials for a collection of structural elements that is larger than the set of structural elements with experimentally determined energy values. The experimentally derived free energies and the statistical potentials for canonical base pair stacks are analogous, demonstrating that statistical potentials derived from comparative data can be used as an alternative energetic parameter. A new computational infrastructure, the RNA Comparative Analysis Database (rCAD), which utilizes a relational database, was developed to manipulate and analyze very large sequence alignments and secondary structure datasets. Using rCAD, a richer set of energetic parameters for RNA fundamental structural elements including hairpin and internal loops was determined. A new version of RNAfold was developed to utilize these statistical potentials. Overall, these new statistical potentials for hairpin and internal loops integrated into the new version of RNAfold demonstrated significant improvements in the prediction accuracy of RNA secondary structure.
statistical potentials; RNA folding; comparative analysis; RNA structure; accuracy of the predicted RNA structure
A new and emerging paradigm in molecular biology is revealing that RNA is implicated in nearly every aspect of the metabolism in the cell. To enhance our understanding of the function of these RNA molecules in the cell, it is essential that we have a complete understanding of their higher-order structures. While many computational tools have been developed to predict and analyse these higher-order RNA structures, few are able to visualize them for analytical purposes. In this paper, we present an interactive visualization tool of the secondary structure of RNA, named RNA2DMap. This program enables multiple-dimensions of information about RNA structure to be selected, customized and displayed to visually identify patterns and relationships. RNA2DMap facilitates the comparative analysis and understanding of RNAs that cannot be readily obtained with other graphical or text output from computer programs. Three use cases are presented to illustrate how RNA2DMap aids structural analysis.
Biological Data Visualization; RNA Structural Analysis; Interactive Application
The mitochondrial genome in the human malaria parasite Plasmodium falciparum is most unusual. Over half the genome is composed of the genes for three classic mitochondrial proteins: cytochrome oxidase subunits I and III and apocytochrome b. The remainder encodes numerous small RNAs, ranging in size from 23 to 190 nt. Previous analysis revealed that some of these transcripts have significant sequence identity with highly conserved regions of large and small subunit rRNAs, and can form the expected secondary structures. However, these rRNA fragments are not encoded in linear order; instead, they are intermixed with one another and the protein coding genes, and are coded on both strands of the genome. This unorthodox arrangement hindered the identification of transcripts corresponding to other regions of rRNA that are highly conserved and/or are known to participate directly in protein synthesis.
The identification of 14 additional small mitochondrial transcripts from P. falciparum and the assignment of 27 small RNAs (12 SSU RNAs totaling 804 nt, 15 LSU RNAs totaling 1233 nt) to specific regions of rRNA are supported by multiple lines of evidence. The regions now represented are highly similar to those of the small but contiguous mitochondrial rRNAs of Caenorhabditis elegans. The P. falciparum rRNA fragments cluster on the interfaces of the two ribosomal subunits in the three-dimensional structure of the ribosome.
All of the rRNA fragments are now presumed to have been identified with experimental methods, and nearly all of these have been mapped onto the SSU and LSU rRNAs. Conversely, all regions of the rRNAs that are known to be directly associated with protein synthesis have been identified in the P. falciparum mitochondrial genome and RNA transcripts. The fragmentation of the rRNA in the P. falciparum mitochondrion is the most extreme example of any rRNA fragmentation discovered.
Covariation analysis is used to identify those positions with similar patterns of sequence variation in an alignment of RNA sequences. These constraints on the evolution of two positions are usually associated with a base pair in a helix. While mutual information (MI) has been used to accurately predict an RNA secondary structure and a few of its tertiary interactions, early studies revealed that phylogenetic event counting methods are more sensitive and provide extra confidence in the prediction of base pairs. We developed a novel and powerful phylogenetic events counting method (PEC) for quantifying positional covariation with the Gutell lab’s new RNA Comparative Analysis Database (rCAD). The PEC and MI-based methods each identify unique base pairs, and jointly identify many other base pairs. In total, both methods in combination with an N-best and helix-extension strategy identify the maximal number of base pairs. While covariation methods have effectively and accurately predicted RNA secondary structure, only a few tertiary structure base pairs have been identified. Analysis presented herein and at the Gutell lab’s Comparative RNA Web (CRW) Site reveals that the majority of these latter base pairs do not covary with one another. However, covariation analysis does reveal a weaker although significant covariation between sets of nucleotides that are in proximity in the three-dimensional RNA structure. This reveals that covariation analysis identifies other types of structural constraints beyond the two nucleotides that form a base pair.
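For reference, the mutual information between two alignment columns can be computed as in the minimal sketch below. It covers only the standard MI measure, not the PEC method, which additionally requires the phylogenetic tree; the function name and the toy alignment are assumptions for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j.

    alignment: list of equal-length aligned sequences.
    MI = sum over observed pairs (x, y) of
         p(x, y) * log2(p(x, y) / (p(x) * p(y))).
    """
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    joint = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((col_i[x] / n) * (col_j[y] / n)))
    return mi


# Columns 0 and 2 change together (G-C, C-G, A-U, U-A); column 1 is invariant.
toy = ["GAC", "CAG", "AAU", "UAA"]
covarying = mutual_information(toy, 0, 2)
invariant = mutual_information(toy, 0, 1)
```

MI treats every sequence as an independent observation, so many closely related sequences sharing one ancestral change inflate the score. Event counting on a phylogenetic tree addresses this by counting the changes themselves.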
While the majority of the ribosomal RNA structure is conserved in the three major domains of life – archaea, bacteria, and eukaryotes, specific regions of the rRNA structure are unique to at least one of these three primary forms of life. In particular, the comparative secondary structure for the eukaryotic SSU rRNA contains several regions that are different from the analogous regions in the bacteria. Our detailed analysis of two recently determined eukaryotic 40S ribosomal crystal structures, Tetrahymena thermophila and Saccharomyces cerevisiae, and the comparison of these results with the bacterial Thermus thermophilus 30S ribosomal crystal structure: (1) revealed that the vast majority of the comparative structure model for the eukaryotic SSU rRNA is substantiated, including the secondary structure that is similar to both bacteria and archaea as well as specific for the eukaryotes, (2) resolved the secondary structure for regions of the eukaryotic SSU rRNA that were not determined with comparative methods, (3) identified eukaryotic helices that are equivalent to the bacterial helices in several of the hypervariable regions, (4) revealed that, while the coaxially stacked compound helix in the 540 region in the central domain maintains the constant length of 10 base pairs, its two constituent helices contain 5+5 bp rather than the 6+4 bp predicted with comparative analysis of archaeal and eukaryotic SSU rRNAs.
The accurate prediction of an RNA's three-dimensional structure from its “primary structure” will have a tremendous influence on experimental design and its interpretation, and ultimately on our understanding of the many functions of RNA. This paper presents a general coarse-grained (CG) potential for modeling RNA 3-D structures. Each nucleotide is represented by five pseudo atoms, two for the backbone (one for the phosphate and another for the sugar), and three for the base to represent base-stacking interactions. The CG potential has been parameterized from statistical analysis of 688 RNA experimental structures. Molecular dynamics simulations of 15 RNA molecules with lengths of 12 to 27 nucleotides have been performed using the CG potential, with performance comparable to that from all-atom simulations. For ~75% of systems tested, simulated annealing led to native-like structures at least once out of multiple repeated runs. Furthermore, with weak distance restraints based on the knowledge of three to five canonical Watson-Crick pairs, all 15 RNAs tested are successfully folded to within 6.5 Å of native structures using the CG potential and simulated annealing. The results reveal that with a limited secondary structure model, the current CG potential can reliably predict the 3-D structures for small RNA molecules. We also explored an all-atom force field to construct atomic structures from the CG simulations.
Coarse-Grained Model; RNA structure; 3-D structure prediction; Molecular Dynamics
Evolutionary relationships among organisms are commonly described by using a hierarchy derived from comparisons of ribosomal RNA (rRNA) sequences. We propose that even on the level of a single rRNA molecule, an organism's evolution is composed of multiple pathways due to concurrent forces that act independently upon different rRNA degrees of freedom. Relationships among organisms are then compositions of coexisting pathway-dependent similarities and dissimilarities, which cannot be described by a single hierarchy. We computationally test this hypothesis in comparative analyses of 16S and 23S rRNA sequence alignments by using a tensor decomposition, i.e., a framework for modeling composite data. Each alignment is encoded in a cuboid, i.e., a third-order tensor, where nucleotides, positions and organisms each represent a degree of freedom. A tensor mode-1 higher-order singular value decomposition (HOSVD) is formulated such that it separates each cuboid into combinations of patterns of nucleotide frequency variation across organisms and positions, i.e., “eigenpositions” and corresponding nucleotide-specific segments of “eigenorganisms,” respectively, independent of a priori knowledge of the taxonomic groups or rRNA structures. We find, in support of our hypothesis, that, first, the significant eigenpositions reveal multiple similarities and dissimilarities among the taxonomic groups. Second, the corresponding eigenorganisms identify insertions or deletions of nucleotides exclusively conserved within the corresponding groups, that map out entire substructures and are enriched in adenosines, unpaired in the rRNA secondary structure, that participate in tertiary structure interactions. This demonstrates that structural motifs involved in rRNA folding and function are evolutionary degrees of freedom. Third, two previously unknown coexisting subgenic relationships between Microsporidia and Archaea are revealed in both the 16S and 23S rRNA alignments, a convergence and a divergence, conferred by insertions and deletions of these motifs, which cannot be described by a single hierarchy. This shows that mode-1 HOSVD modeling of rRNA alignments might be used to computationally predict evolutionary mechanisms.
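The encoding of an alignment as a third-order tensor can be sketched as below. This covers only the encoding and a flattening along one axis, not the HOSVD itself, which needs a numerical SVD routine. The alphabet (including the gap symbol), the binary indicator encoding, the axis order, and the choice of the nucleotide axis as the one to unfold along are all assumptions for illustration rather than details taken from the study.

```python
ALPHABET = "ACGU-"  # assumed symbol set; the gap is treated as a fifth symbol


def encode_alignment(alignment):
    """Encode an alignment as a nucleotides x positions x organisms cuboid.

    tensor[k][p][o] is 1 if organism o has symbol ALPHABET[k] at
    position p, and 0 otherwise.
    """
    n_pos = len(alignment[0])
    n_org = len(alignment)
    tensor = [[[0] * n_org for _ in range(n_pos)] for _ in ALPHABET]
    for org, seq in enumerate(alignment):
        for pos, ch in enumerate(seq):
            tensor[ALPHABET.index(ch)][pos][org] = 1
    return tensor


def unfold_along_nucleotides(tensor):
    """Flatten the cuboid into a matrix with one row per symbol and
    positions * organisms columns."""
    return [[value for row in slab for value in row] for slab in tensor]


cuboid = encode_alignment(["AC-", "GCU"])
matrix = unfold_along_nucleotides(cuboid)
```

A singular value decomposition of an unfolded matrix like this one is the basic building block of an HOSVD, and a numerical library would normally supply that step.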
We reconstruct the phylogenetic relationships within the bacterial genus Pseudonocardia to evaluate two models explaining how and why Pseudonocardia bacteria colonize the microbial communities on the integument of fungus-gardening ant species (Attini, Formicidae). The traditional Coevolution-Codivergence model views the integument-colonizing Pseudonocardia as mutualistic microbes that are largely vertically transmitted between ant generations and that supply antibiotics that specifically suppress the garden pathogen Escovopsis. The more recent Acquisition model views Pseudonocardia as part of a larger integumental microbe community that frequently colonizes the ant integument from environmental sources (e.g., soil, plant material). Under this latter model, ant-associated Pseudonocardia may have diverse ecological roles on the ant integument (possibly ranging from pathogenic, to commensal, to mutualistic) and are not necessarily related to Escovopsis suppression. We test distinct predictions of these two models regarding the phylogenetic proximity of ant-associated and environmental Pseudonocardia. We amassed 16S-rRNA gene sequence information for 87 attine-associated and 238 environmental Pseudonocardia, aligned the sequences with the help of RNA secondary structure modeling, and reconstructed phylogenetic relationships using a maximum-likelihood approach. We present 16S-rRNA secondary structure models of representative Pseudonocardia species to improve sequence alignments and identify sequencing errors. Our phylogenetic analyses reveal close affinities and even identical sequence matches between environmental Pseudonocardia and ant-associated Pseudonocardia, as well as nesting of environmental Pseudonocardia in subgroups that were previously thought to be specialized to associate only with attine ants. 
The great majority of ant-associated Pseudonocardia are closely related to autotrophic Pseudonocardia and are placed in a large subgroup of Pseudonocardia that is known essentially only from cultured isolates (rather than cloned 16S sequences). The preponderance of the known ant-associated Pseudonocardia in this latter clade of culturable lineages may not necessarily reflect abundance of these Pseudonocardia types on the ants, but isolation biases when screening for Pseudonocardia (e.g., preferential isolation of autotrophic Pseudonocardia with minimum-nutrient media). The accumulated phylogenetic patterns and the possibility of isolation biases in previous work further erode support for the traditional Coevolution-Codivergence model and call for continued revision of our understanding of how and why Pseudonocardia colonize the microbial communities on the integument of fungus-gardening ant species.
Attine ant-microbe symbiosis; Mutualism; Antibiotic; Secondary rRNA structure
Discontinuous genes have been observed in bacteria, archaea, and eukaryotic nuclei, mitochondria and chloroplasts. Gene discontinuity occurs in multiple forms: the two most frequent forms result from introns that are spliced out of the RNA and the resulting exons are spliced together to form a single transcript, and fragmented gene transcripts that are not covalently attached post-transcriptionally. Within the past few years, fragmented ribosomal RNA (rRNA) genes have been discovered in bilateral metazoan mitochondria, all within a group of related oysters.
In this study, we have characterized this fragmentation with comparative analysis and experimentation. We present secondary structures, modeled using comparative sequence analysis of the discontinuous mitochondrial large subunit rRNA genes of the cupped oysters C. virginica, C. gigas, and C. hongkongensis. Comparative structure models for the large subunit rRNA in each of the three oyster species are generally similar to those for other bilateral metazoans. We also used RT-PCR and analyzed ESTs to determine if the two fragmented LSU rRNAs are spliced together. The two segments are transcribed separately, and not spliced together although they still form functional rRNAs and ribosomes.
Although many examples of discontinuous ribosomal genes have been documented in bacteria and archaea, as well as the nuclei, chloroplasts, and mitochondria of eukaryotes, oysters are some of the first characterized examples of fragmented bilateral animal mitochondrial rRNA genes. The secondary structures of the oyster LSU rRNA fragments have been predicted on the basis of previous comparative metazoan mitochondrial LSU rRNA structure models.
The accurate prediction of the secondary and tertiary structure of an RNA with different folding algorithms is dependent on several factors, including the energy functions. However, an RNA higher-order structure cannot be accurately predicted from its sequence based on a limited set of energy parameters. The inter- and intra-molecular forces between this RNA and other small molecules and macromolecules, in addition to other factors in the cell such as pH, ionic strength, and temperature, influence the complex dynamics associated with a single-stranded RNA's transition to its secondary and tertiary structure. Since all of the factors that affect the formation of an RNA's three-dimensional structure cannot be determined experimentally, statistically derived potential energy has been used in the prediction of protein structure. In the current work, we evaluate the statistical free energy of various secondary structure motifs, including base-pair stacks, hairpin loops, and internal loops, using their statistical frequencies obtained from the comparative analysis of more than 50 000 RNA sequences stored in the RNA Comparative Analysis Database (rCAD) at the Comparative RNA Web (CRW) Site. Statistical energies were computed from the structural statistics for several datasets. While the statistical energies for base-pair stacks correlate with experimentally derived free energy values, suggesting a Boltzmann-like distribution, variation is observed between different molecules and their location on the phylogenetic tree of life. Our statistical energies for several structural elements were utilized in the Mfold RNA folding algorithm. The combined statistical energies for base-pair stacks, hairpins and internal loop flanks result in a significant improvement in the accuracy of secondary structure prediction; however, the hairpin flanks contribute the most.
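A Boltzmann-like relation between observed frequency and energy is commonly written as ΔG = -RT ln(f_observed / f_expected). The sketch below applies that generic inverse-Boltzmann form to motif counts. The reference state (`expected_freq`), the temperature, and the example counts are assumptions for illustration. The abstract does not specify the reference state or normalization used in the study.

```python
from math import log

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def statistical_energy(count, total, expected_freq, temperature=310.15):
    """Inverse-Boltzmann estimate of a motif's statistical free energy.

    count: observations of the motif (e.g. one type of base-pair stack).
    total: observations of all motifs in the same class.
    expected_freq: assumed reference-state frequency of the motif.
    Returns an energy in kcal/mol; motifs seen more often than expected
    get negative (favorable) values.
    """
    observed_freq = count / total
    return -R_KCAL * temperature * log(observed_freq / expected_freq)


# Hypothetical counts: a stack seen 80 times out of 100 against an
# expected frequency of 0.5 is scored as stabilizing.
favorable = statistical_energy(80, 100, 0.5)
unfavorable = statistical_energy(10, 100, 0.5)
```

Because only the ratio of observed to expected frequency enters, the resulting energies depend strongly on how the reference state is defined and on which sequences are included in the counts.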
statistical potentials; RNA folding; thermodynamic stability; comparative analysis
With an increasingly large number of sequences properly aligned, comparative sequence analysis can accurately identify not only common structures formed by standard base pairing but also new types of structural elements and constraints. However, traditional methods are too computationally expensive to perform well on large-scale alignments and are less effective with sequences from diverse phylogenetic classifications. We propose a new approach that utilizes coevolution rates among pairs of nucleotide positions using phylogenetic and evolutionary relationships of the organisms of aligned sequences. With a novel data schema to manage relevant information within a relational database, our method, implemented with Microsoft SQL Server 2005, showed 90% sensitivity in identifying base pair interactions among 16S ribosomal RNA sequences from Bacteria, at a scale 40 times bigger and with 50% better sensitivity than a previous study. The results also indicated covariation signals for a few sets of cross-strand base stacking pairs in secondary structure helices, and other subtle constraints in the RNA structure.
Biological database; Bioinformatics; Sequence Analysis; RNA
A recent reclassification of diatoms based on phylogenies recovered using the nuclear-encoded SSU rRNA gene contains three major classes, Coscinodiscophyceae, Mediophyceae and the Bacillariophyceae (the CMB hypothesis). We evaluated this with a sequence alignment of 1336 protist and heterokont algae SSU rRNAs, which includes 673 diatoms. Sequences were aligned to maintain structural elements conserved within this dataset. Parsimony analysis rejected the CMB hypothesis, albeit weakly. Morphological data are also incongruent with this recent CMB hypothesis of three diatom clades. We also reanalyzed a recently published dataset which purports to support the CMB hypothesis. Our reanalysis found that the original analysis had not converged on the true bipartition posterior probability distribution, and rejected the CMB hypothesis. Thus we conclude that a reclassification of the evolutionary relationships of the diatoms according to the CMB hypothesis is premature.
SSU; diatom phylogeny; diatom classification; Coscinodiscophyceae; Mediophyceae; Bacillariophyceae
We present a structure of the mammalian ribosome determined at ∼8.7Å resolution by cryo-electron microscopy. A molecular model was created by docking homology models of the subunit rRNAs and conserved ribosomal proteins into the density map. We also modeled expansion segments in the subunit rRNAs and identified the positions of 20 novel proteins. In general, we find that many ribosomal proteins interact with the expansion segments to form an integrated framework that may stabilize the mature ribosome. Importantly, our structure gives a snapshot of the mammalian ribosome before the binding of an A-site tRNA. The structure also provides additional support for the idea that movements of the small subunit and L1 stalk occur during the translocation of tRNAs. Finally, new details are presented about novel inter-subunit bridges in the eukaryotic ribosome. These bridges may help reset the conformation of the ribosomal subunits in preparation for the next cycle of chain elongation.
eukaryotic ribosome; protein translation; tRNA translocation; expansion segments
The origin and early evolution of the active site of the ribosome can be elucidated through an analysis of the ribosomal proteins' taxonomic block structures and their RNA interactions. Comparison between the two subunits, exploiting the detailed three-dimensional structures of the bacterial and archaeal ribosomes, is especially informative.
The analysis of the differences between these two sites can be summarized as follows: 1) There is no self-folding RNA segment that defines the decoding site of the small subunit; 2) there is one self-folding RNA segment encompassing the entire peptidyl transfer center of the large subunit; 3) the protein contacts with the decoding site are made by a set of universal alignable sequence blocks of the ribosomal proteins; 4) the majority of the peptide contacts with the peptidyl transfer center are made by bacterial- or archaeal-specific sequence blocks.
These clear distinctions between the two subunit active sites support an earlier origin for the large subunit's peptidyl transferase center (PTC) with the decoding site of the small subunit being a later addition to the ribosome. The main implications are that a single self-folding RNA, in conjunction with a few short stabilizing peptides, formed the precursor of the modern ribosomal large subunit in association with a membrane.
This article was reviewed by Jerzy Jurka, W. Ford Doolittle, Eugene Shaknovich, and George E. Fox (nominated by Jerzy Jurka).
Biclustering is a clustering method that simultaneously clusters both the domain and range of a relation. A challenge in multiple sequence alignment (MSA) is that the alignment of sequences is often intended to reveal groups of conserved functional subsequences. At the same time, the grouping of the sequences can affect the alignment, which is precisely the kind of dual situation biclustering is intended to address.
We define a representation of the MSA problem enabling the application of biclustering algorithms. We develop a computer program for local MSA, BlockMSA, that combines biclustering with divide-and-conquer. BlockMSA simultaneously finds groups of similar sequences and locally aligns subsequences within them. Further alignment is accomplished by dividing both the set of sequences and their contents. The net result is both a multiple sequence alignment and a hierarchical clustering of the sequences. BlockMSA was tested on the subsets of the BRAliBase 2.1 benchmark suite that display high variability and on an extension of that suite to larger problem sizes. Alignments of two large datasets of current biological interest, T box sequences and Group IC1 Introns, were also evaluated. The results were compared with alignments computed by the ClustalW, MAFFT, MUSCLE and PROBCONS alignment programs using the Sum of Pairs Score (SPS) and Consensus Count.
Results for the benchmark suite are sensitive to problem size. On problems of 15 or more sequences, BlockMSA is consistently the best. On none of the problems in the test suite are there appreciable differences in scores among BlockMSA, MAFFT and PROBCONS. On the T box sequences, BlockMSA does the most faithful job of reproducing known annotations. MAFFT and PROBCONS do not. On the Intron sequences, BlockMSA, MAFFT and MUSCLE are comparable at identifying conserved regions.
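The Sum of Pairs Score used in the comparison above can be illustrated with a minimal sketch. It assumes the common benchmark definition of SPS, the fraction of residue pairs aligned in a reference alignment that are also aligned in the test alignment; the abstract does not state the exact variant used, and the function names and toy sequences here are hypothetical.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings ('-' marks a gap).
    Each pair is ((seq_i, residue_index_i), (seq_j, residue_index_j)), i < j.
    """
    lengths = {len(row) for row in alignment}
    if len(lengths) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    counters = [0] * len(alignment)  # next ungapped residue index per sequence
    pairs = set()
    for col in range(lengths.pop()):
        present = []
        for i, row in enumerate(alignment):
            if row[col] != "-":
                present.append((i, counters[i]))
                counters[i] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference-aligned residue pairs recovered by `test`."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


# Toy example (made-up sequences): the test alignment misplaces the final U.
reference = ["ACGU", "AC-U"]
test = ["ACGU", "ACU-"]
score = sum_of_pairs_score(test, reference)
```

In this toy case the reference contains three aligned pairs and the test alignment recovers two of them, so the score is 2/3.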
A detailed understanding of an RNA's correct secondary and tertiary structure is crucial to understanding its function and mechanism in the cell. Free energy minimization with energy parameters based on the nearest-neighbor model and comparative analysis are the primary methods for predicting an RNA's secondary structure from its sequence. Version 3.1 of Mfold has been available since 1999. This version contains an expanded sequence dependence of energy parameters and the ability to incorporate coaxial stacking into free energy calculations. We test Mfold 3.1 by performing the largest and most phylogenetically diverse comparison of rRNA and tRNA structures predicted by comparative analysis and Mfold, and we use the results of our tests on 16S and 23S rRNA sequences to assess the improvement between Mfold 2.3 and Mfold 3.1.
The average prediction accuracy for a 16S or 23S rRNA sequence with Mfold 3.1 is 41%, while the prediction accuracies for the majority of 16S and 23S rRNA structures tested are between 20% and 60%, with some having less than 20% prediction accuracy. The average prediction accuracy was 71% for 5S rRNA and 69% for tRNA. The majority of the 5S rRNA and tRNA sequences have prediction accuracies greater than 60%. The prediction accuracy of 16S rRNA base-pairs decreases exponentially as the number of nucleotides intervening between the 5' and 3' halves of the base-pair increases.
Our analysis indicates that the current set of nearest-neighbor energy parameters, in conjunction with the Mfold folding algorithm, is unable to consistently and reliably predict an RNA's correct secondary structure. For 16S or 23S rRNA structure prediction, Mfold 3.1 offers little improvement over Mfold 2.3. However, the nearest-neighbor energy parameters do work well for shorter RNA sequences such as tRNA or 5S rRNA, or for larger rRNAs when the contact distance between the base-pairs is less than 100 nucleotides.
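As a rough illustration of the two quantities discussed above, the sketch below computes prediction accuracy as the percentage of comparative base-pairs that also appear in a predicted structure, and splits that accuracy by contact distance (the number of nucleotides between the 5' and 3' halves of a base-pair). The exact scoring rules of the study are not given in the abstract, so this definition, the function names, and the example pairs are assumptions.

```python
def contact_distance(pair):
    """Number of nucleotides between the two halves of a base-pair (i, j)."""
    i, j = pair
    return abs(j - i) - 1


def prediction_accuracy(comparative_pairs, predicted_pairs):
    """Percentage of comparative base-pairs present in the prediction."""
    comparative = set(comparative_pairs)
    if not comparative:
        return 0.0
    return 100.0 * len(comparative & set(predicted_pairs)) / len(comparative)


def accuracy_by_range(comparative_pairs, predicted_pairs, cutoff=100):
    """Accuracy for base-pairs below and at/above a contact-distance cutoff."""
    short = [p for p in comparative_pairs if contact_distance(p) < cutoff]
    long_range = [p for p in comparative_pairs if contact_distance(p) >= cutoff]
    return (prediction_accuracy(short, predicted_pairs),
            prediction_accuracy(long_range, predicted_pairs))


# Hypothetical base-pairs given as (5' position, 3' position).
comparative = [(1, 10), (2, 9), (20, 200), (21, 199)]
predicted = [(1, 10), (2, 9), (20, 200), (30, 40)]

overall = prediction_accuracy(comparative, predicted)
short_acc, long_acc = accuracy_by_range(comparative, predicted)
```

Here three of the four comparative base-pairs are predicted (75% overall); both short-range pairs are recovered, but only one of the two long-range pairs is.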
We have studied spliceosomal introns in the ribosomal (r)RNA of fungi to discover the forces that guide their insertion and fixation.
Comparative analyses of flanking sequences at 49 different spliceosomal intron sites showed that the G – intron – G motif is the conserved flanking sequence at sites of intron insertion. Information analysis showed that these rRNA introns contain significant information in the flanking exons. Analysis of all rDNA introns in the three phylogenetic domains and two organelles showed that group I introns are usually located after the most conserved sites in rRNA, whereas spliceosomal introns occur at less conserved positions. The distribution of spliceosomal and group I introns in the primary structure of small and large subunit rRNAs was tested with simulations using the broken-stick model as the null hypothesis. This analysis suggested that the spliceosomal and group I intron distributions were not produced by a random process. Sequence upstream of rRNA spliceosomal introns was significantly enriched in G nucleotides. We speculate that these G-rich regions may function as exonic splicing enhancers that guide the spliceosome and facilitate splicing.
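The information analysis of flanking exons can be sketched with the standard Shannon measure, in which a position in a set of aligned nucleotide sequences carries 2 − H bits, where H is the entropy of its base frequencies. This is a minimal sketch without small-sample correction; whether the study applied a correction is not stated, and the flanking sequences below are made up.

```python
import math
from collections import Counter


def information_content(column, alphabet="ACGU"):
    """Shannon information (bits) of one alignment column, uncorrected."""
    counts = Counter(base for base in column.upper() if base in alphabet)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(len(alphabet)) - entropy


def information_profile(flanks, alphabet="ACGU"):
    """Per-position information for equal-length flanking sequences."""
    return [information_content("".join(col), alphabet) for col in zip(*flanks)]


# Hypothetical 4-nt upstream flanks; the last position is always G.
flanks = ["AAAG", "CAGG", "GCAG", "UCGG"]
profile = information_profile(flanks)
```

In this example the first position is evenly split among four bases (0 bits), the two middle positions are split between two bases (1 bit each), and the invariant G at the last position carries the maximum of 2 bits.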
Our results begin to define some of the rules that guide the distribution of rRNA spliceosomal introns and suggest that the exon context is of fundamental importance in intron fixation.
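A broken-stick null model of the kind mentioned above can be sketched as a Monte Carlo test: intron sites are treated as random breaks along the rRNA, and an observed summary statistic is compared with its distribution under uniformly random placement. The statistic used here (variance of the segment lengths between sites), the function names, and the example positions are assumptions for illustration only and are not taken from the study.

```python
import random
from statistics import pvariance


def segment_lengths(sites, length):
    """Lengths of the segments produced by breaking [0, length] at `sites`."""
    cuts = [0] + sorted(sites) + [length]
    return [b - a for a, b in zip(cuts, cuts[1:])]


def broken_stick_p_value(observed_sites, length, n_simulations=2000, seed=0):
    """Estimate how often random breaks give a segment-length variance
    at least as large as the observed one (a high variance suggests clustering)."""
    rng = random.Random(seed)
    n_sites = len(observed_sites)
    observed = pvariance(segment_lengths(observed_sites, length))
    hits = 0
    for _ in range(n_simulations):
        # Null hypothesis: distinct sites placed uniformly at random.
        sites = rng.sample(range(1, length), n_sites)
        if pvariance(segment_lengths(sites, length)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_simulations + 1)


# Hypothetical example: five tightly clustered sites in a 1000-nt sequence.
clustered = [500, 501, 502, 503, 504]
lengths = segment_lengths(clustered, 1000)
p_value = broken_stick_p_value(clustered, 1000)
```

A small p-value under this sketch would indicate that the observed sites are more clustered than expected from random placement; other statistics, such as the full ranked distribution of segment lengths, could be substituted in the same framework.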