Sequence conservation and co-variation of base pairs are hallmarks of structured RNAs. For certain RNAs (e.g. riboswitches), a single sequence must adopt at least two alternative secondary structures to effectively regulate the message. If alternative secondary structures are important to the function of an RNA, we expect to observe evolutionary co-variation supporting multiple conformations. We set out to characterize the evolutionary co-variation supporting alternative conformations in riboswitches to determine the extent to which alternative secondary structures are conserved. We found strong co-variation support for the terminator, P1, and anti-terminator stems in the purine riboswitch by extending alignments to include terminator sequences. When we performed Boltzmann suboptimal sampling on purine riboswitch sequences with terminators we found that these sequences appear to have evolved to favor specific alternative conformations. We extended our analysis of co-variation to classic alignments of group I/II introns, tRNA, and other classes of riboswitches. In a majority of these RNAs, we found evolutionary evidence for alternative conformations that are compatible with the Boltzmann suboptimal ensemble. Our analyses suggest that alternative conformations are selected for and thus likely play functional roles in even the most structured of RNAs.
RNA (Ribonucleic Acid) is a messenger of genetic information, master regulator, and catalyst in the cell. To carry out its function, RNA can fold into complex three-dimensional structures. Certain classes of RNAs, called riboswitches, adopt at least two alternative structures to act as a switch. We set out to detect the evolutionary signal for alternative structures in riboswitches, as we hypothesize that these RNA sequences must have evolved to allow both conformations. We find that such signals do exist when we compare the sequences of riboswitches from multiple species. When we extend this analysis to other RNA regulators in the cell that are not thought of as switches, we detect equivalent evolutionary support for alternative structures. Viewed through the lens of evolutionary structure conservation, RNA sequences appear to have adapted to adopt multiple conformations.
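As a rough illustration of what co-variation means computationally, the sketch below scores two alignment columns by their mutual information: columns that form a conserved base pair tend to change together, so knowing the base in one column predicts the base in the other. This is a minimal sketch under stated assumptions. Mutual information is one common co-variation statistic and is not necessarily the measure used in the study; the function name and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `alignment` is a list of equal-length aligned sequences; rows with a gap
    ('-') in either column are ignored.
    """
    pairs = [(seq[i], seq[j]) for seq in alignment
             if seq[i] != "-" and seq[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                # counts of (base_i, base_j)
    left = Counter(a for a, _ in pairs)   # counts of base_i alone
    right = Counter(b for _, b in pairs)  # counts of base_j alone
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * math.log2(p_ab * n * n / (left[a] * right[b]))
    return mi


# Hypothetical toy alignment: columns 0 and 4 always form a Watson-Crick pair,
# while column 1 is invariant and therefore carries no co-variation signal.
toy_alignment = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
paired = column_mutual_information(toy_alignment, 0, 4)
unpaired = column_mutual_information(toy_alignment, 0, 1)
print(paired, unpaired)
```

A high score for a pair of columns is only suggestive: phylogenetic relatedness among the sequences can inflate mutual information, so real analyses typically correct for it.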
The detection of single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) with precision from high-throughput data remains a significant bioinformatics challenge. Accurate detection is necessary before next-generation sequencing can routinely be used in the clinic. In research, scientific advances are inhibited by gaps in data, exemplified by the underrepresented discovery of rare variants, variants in non-coding regions, and indels. The continued presence of false positives and false negatives prevents full automation and requires additional manual verification steps. Our methodology applies both pattern recognition and sensitivity analysis to eliminate false positives and aid in the detection of SNP/indel loci and genotypes from high-throughput data. We chose FK506-binding protein 51 (FKBP5) (6p21.31) as our clinical target because of its role in modulating pharmacological responses to physiological and synthetic glucocorticoids and because of the complexity of the genomic region. We detected genetic variation across a 160 kb region encompassing FKBP5. We discovered 613 SNPs and 57 indels, including a 3.3 kb deletion. We validated our method using three independent data sets and, with Sanger sequencing and Affymetrix and Illumina microarrays, achieved 99% concordance. Furthermore, we were able to detect 267 novel rare variants and assess linkage disequilibrium. Our results showed both a sensitivity and a specificity of 98%, indicating near-perfect classification between true and false variants. The process is scalable and amenable to automation, with the downstream filters taking only 1.5 hours to analyze 96 individuals simultaneously. We provide examples of how our level of precision uncovered the interactions of multiple loci, their predicted influences on mRNA stability, perturbations of the hsp90 binding site, and individual variation in FKBP5 expression.
Finally, we show how our discovery of rare variants may change current conceptions of evolution at this locus.
pattern recognition; next-generation sequencing analysis; indels; rare variants; FKBP5; HLA
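For readers less familiar with the two figures of merit quoted above, the following minimal sketch shows how sensitivity and specificity are computed from a confusion matrix of variant calls. The counts are invented for illustration and are not taken from the study; the function name is hypothetical.

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    tp: true variants that were called      fn: true variants that were missed
    tn: non-variants correctly rejected     fp: non-variants wrongly called
    """
    sensitivity = tp / (tp + fn)  # fraction of real variants recovered
    specificity = tn / (tn + fp)  # fraction of false candidates rejected
    return sensitivity, specificity


# Hypothetical counts chosen only to make the arithmetic easy to follow.
sens, spec = sensitivity_specificity(tp=98, fp=4, tn=196, fn=2)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Reporting both values matters because a caller can trivially maximize one at the expense of the other, for example by calling every candidate site a variant.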
To make full use of research data, the bioscience community needs to adopt technologies and reward mechanisms that support interoperability and promote the growth of an open ‘data commoning’ culture. Here we describe the prerequisites for data commoning and present an established and growing ecosystem of solutions using the shared ‘Investigation-Study-Assay’ framework to support that vision.
The structure of RiboNucleic Acid (RNA) has the potential to be altered by a Single Nucleotide Polymorphism (SNP). Disease-associated SNPs mapping to non-coding regions of the genome that are transcribed into RNA can potentially affect cellular regulation (and cause disease) by altering the structure of the transcript. We performed a large-scale meta-analysis of Selective 2'-Hydroxyl Acylation analyzed by Primer Extension (SHAPE) data, which probes the structure of RNA. We found that several single point mutations exist that significantly disrupt RNA secondary structure in the five transcripts we analyzed. Thus, every RNA that is transcribed has the potential to be a “RiboSNitch,” in which a SNP causes a large conformational change that alters regulatory function. Predicting the SNPs that will have the largest effect on RNA structure remains a contemporary computational challenge. We therefore benchmarked the most popular RNA structure prediction algorithms for their ability to identify mutations that maximally affect structure. We also evaluated metrics for rank ordering the extent of the structural change. Although no single algorithm/metric combination dramatically outperformed the others, small differences in AUC (Area Under the Curve) values reveal that certain approaches do provide better agreement with experiment. The experimental data we analyzed nonetheless show that multiple single point mutations exist in all RNA transcripts that significantly disrupt structure, in agreement with the predictions.
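The AUC comparison described above can be made concrete with a short sketch. It computes the area under the ROC curve as the probability that a randomly chosen structure-disrupting mutation receives a higher predicted score than a randomly chosen non-disrupting one (the Mann-Whitney formulation). The scores and labels below are hypothetical, and nothing here is meant to reproduce the benchmark's actual pipeline.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted magnitude of structural change for each mutation.
    labels: True where experiment shows a significant structural disruption.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions for five mutations; two are experimentally disruptive.
predicted_change = [0.9, 0.8, 0.4, 0.3, 0.1]
disruptive = [True, False, True, False, False]
print(auc(predicted_change, disruptive))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why small differences between algorithm/metric combinations can still be informative.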
This report summarizes the proceedings of the structure mapping working group meeting of the RNA Ontology Consortium (ROC), held in Kona, Hawaii on January 8-9, 2011. The ROC hosted this workshop to facilitate collaborations among those researchers formalizing concepts in RNA, those developing RNA-related software, and those performing genome annotation and standardization. The workshop included three software presentations, extended round-table discussions, and the constitution of two new working groups, the first to address the need for better software integration and the second to discuss standardization and benchmarking of existing RNA annotation pipelines. These working groups have subsequently pursued concrete implementation of actions suggested during the discussion. Further information about the ROC and its activities can be found at http://roc.bgsu.edu/.
Parkinson disease (PD) is a common disorder that leads to motor and cognitive disability. We performed a genome-wide association study (GWAS) with 2000 PD and 1986 control Caucasian subjects from the NeuroGenetics Research Consortium (NGRC) [1–5]. We confirmed SNCA [2,6–8] and MAPT [3,7–9]; replicated GAK [9] (P_{Pankratz+NGRC} = 3.2×10⁻⁹); and detected a novel association with HLA (P_{NGRC} = 2.9×10⁻⁸), which replicated in two datasets (P_{Meta-analysis} = 1.9×10⁻¹⁰). We designate the new PD genes PARK17 (GAK) and PARK18 (HLA). PD-HLA association was uniform across genetic and environmental risk strata, and strong in sporadic (P = 5.5×10⁻¹⁰) and late-onset (P = 2.4×10⁻⁸) PD. The association peak was at rs3129882, a non-coding variant in HLA-DRA. Two studies suggested rs3129882 influences expression of HLA-DR and HLA-DQ [10,11]. PD brains exhibit up-regulation of DR antigens and presence of DR-positive reactive microglia [12]. Moreover, non-steroidal anti-inflammatory drugs (NSAID) reduce PD risk [4,13]. The genetic association with HLA coalesces the evidence for involvement of the immune system and offers new targets for drug development and pharmacogenetics.
Efficient modeling approaches are necessary to accurately predict large-scale structural behavior of biomolecular systems like RNA (Ribonucleic Acid). Coarse-grained approximations of such complex systems can significantly reduce the computational costs of the simulation while maintaining sufficient fidelity to capture the biologically significant motions. However, given the coupling and nonlinearity of RNA systems (and effectively all biopolymers), it is expected that different parameters such as geometric and dynamic boundary conditions, states, and applied forces will affect the system’s dynamic behavior. Consequently, static coarse-grained models (i.e., models for which the coarse graining is time invariant) are not always able to adequately sample the conformational space of the molecule. We introduce here the concept of adaptive coarse-grained molecular dynamics of RNA, which adapts the coarseness of the model automatically during the simulation in an effort to increase simulation speed while maintaining accuracy. Adaptivity requires two basic algorithmic developments. First, it requires a set of integrators that seamlessly allow transitions between higher- and lower-fidelity models while preserving the laws of motion. Second, it requires metrics for determining when and where more or less fidelity needs to be integrated into the model to allow sufficiently accurate dynamics simulation; we propose and validate such metrics. Given the central role that multibody dynamics plays in the proposed framework, and the nominally large number of dynamic degrees of freedom being considered in these applications, a computationally efficient multibody method that lends itself well to adaptivity is essential to the success of this effort. A suite of Divide-And-Conquer Algorithm (DCA)-based approaches is employed to this end, because these methods offer a good combination of computational efficiency and adaptive structure.
adaptive coarse graining; articulated multibody dynamics; divide-and-conquer algorithm; RNA; transition metric
The use of highly reactive chemical species to probe the structure and dynamics of nucleic acids is greatly simplified by software that enables rapid quantification of the gel images that result from these experiments. SAFA (Semi-Automated Footprinting Analysis) allows a user to quickly and reproducibly quantify a chemical footprinting gel image through a series of steps that rectify, assign, and integrate the relative band intensities. The output of this procedure is raw band intensities that report on the relative reactivity of each nucleotide with the chemical probe. We describe here how to obtain these raw band intensities using SAFA and the subsequent normalization and analysis procedures required to process this data. In particular, we focus on analyzing time-resolved hydroxyl radical (•OH) data, which we use to monitor the kinetics of folding of a large RNA (the L-21 T. thermophila group I intron). Exposing the RNA to bursts of •OH radicals at specific time-points during the folding process monitors the time-progress of the reaction. Specifically, we identify protected (nucleotides that become inaccessible to the •OH radical probe when folded) and invariant (nucleotides with constant accessibility to the •OH probe) residues that we use for monitoring and normalization of the data. With this analysis, we obtain time-progress curves from which we determine kinetic rates of folding. We also report on a data visualization tool implemented in SAFA that allows users to map data onto a secondary structure diagram.
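The normalization idea described above can be sketched in a few lines. The example scales each lane by the mean intensity of its invariant residues (to correct for loading differences between time points) and then converts a protected residue's scaled intensity into a fraction folded using unfolded and folded reference values. This is a simplified sketch under stated assumptions, not SAFA's actual procedure; the function names, nucleotide positions, and intensities are all hypothetical.

```python
def normalize_lane(intensities, invariant_positions):
    """Scale a lane so that its invariant residues average to 1.0.

    intensities: dict mapping nucleotide position -> raw band intensity.
    invariant_positions: positions whose accessibility does not change on folding.
    """
    reference = sum(intensities[p] for p in invariant_positions) / len(invariant_positions)
    return {pos: value / reference for pos, value in intensities.items()}


def fraction_folded(value, unfolded, folded):
    """Map a normalized intensity onto 0 (unfolded) .. 1 (fully protected)."""
    return (unfolded - value) / (unfolded - folded)


# Hypothetical lane: position 10 becomes protected, positions 11 and 12 are invariant.
lane = {10: 125.0, 11: 200.0, 12: 200.0}
scaled = normalize_lane(lane, invariant_positions=[11, 12])

# Hypothetical reference values for the fully unfolded and fully folded states.
progress = fraction_folded(scaled[10], unfolded=1.0, folded=0.25)
print(scaled[10], progress)
```

Repeating this for every time point yields a time-progress curve for each protected residue, which is the input to the kinetic fitting described in the text.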
Genome-wide association studies (GWAS) often identify disease-associated mutations in intergenic and non-coding regions of the genome. Given the high percentage of the human genome that is transcribed, we postulate that for some observed associations the disease phenotype is caused by a structural rearrangement in a regulatory region of the RNA transcript. To identify such mutations, we have performed a genome-wide analysis of all known disease-associated Single Nucleotide Polymorphisms (SNPs) from the Human Gene Mutation Database (HGMD) that map to the untranslated regions (UTRs) of a gene. Rather than using minimum free energy approaches (e.g. mFold), we use a partition function calculation that takes into consideration the ensemble of possible RNA conformations for a given sequence. We identified in the human genome disease-associated SNPs that significantly alter the global conformation of the UTR to which they map. For six disease states (Hyperferritinemia Cataract Syndrome, β-Thalassemia, Cartilage-Hair Hypoplasia, Retinoblastoma, Chronic Obstructive Pulmonary Disease (COPD), and Hypertension), we identified multiple SNPs in UTRs that alter the mRNA structural ensemble of the associated genes. Using a Boltzmann sampling procedure for sub-optimal RNA structures, we are able to characterize and visualize the nature of the conformational changes induced by the disease-associated mutations in the structural ensemble. We observe in several cases (specifically the 5′ UTRs of FTL and RB1) SNP-induced conformational changes analogous to those observed in bacterial regulatory riboswitches when specific ligands bind. We propose that the UTR and SNP combinations we identify constitute a “RiboSNitch,” that is, a regulatory RNA in which a specific SNP has a structural consequence that results in a disease phenotype. Our SNPfold algorithm can help identify RiboSNitches by leveraging GWAS data and an analysis of the mRNA structural ensemble.
Genome-wide association studies identify mutations in the human genome that correlate with a particular disease. It is common to find mutations associated with disease in the non-coding region of the genome. These non-coding mutations are more difficult to interpret at a molecular level, because they do not affect the protein sequence. In this study, we analyze disease-associated mutations in non-coding regions of our genome in the context of their structural effect on the message of genetic information in our cells, Ribonucleic Acid (RNA). We focus in particular on the regulatory parts of our genes known as untranslated regions. We find that certain disease-associated mutations in these regulatory untranslated regions have a significant effect on the structure of the RNA message. We call these elements “RiboSNitches,” because they act like switches turning on and off genes, but are caused by Single Nucleotide Polymorphisms (SNPs), which are single point mutations in our genome. The RiboSNitches we identify are potentially a new class of pharmaceutical targets, as it is possible to change the structure of RNA with small drug-like molecules.
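One plausible way to quantify how strongly a SNP reshapes the structural ensemble is to compare per-nucleotide pairing probabilities between the wild-type and mutant sequences. The sketch below assumes that base-pair probability matrices have already been computed by some partition function program; it sums each row to get a pairing profile and reports the Pearson correlation between the two profiles, where a low correlation suggests a large ensemble change. The text above does not state the exact scoring used, so this metric, the function names, and the toy matrices are assumptions for illustration.

```python
import math


def pairing_profile(bpp):
    """Per-nucleotide probability of being paired, from a symmetric
    base-pair probability matrix (list of lists)."""
    n = len(bpp)
    return [sum(bpp[i][j] for j in range(n)) for i in range(n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles.
    Raises ZeroDivisionError if either profile is constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical 4-nucleotide matrices: the mutation weakens the outer 0-3 pair.
wild_type = [[0.0, 0.0, 0.0, 0.9],
             [0.0, 0.0, 0.8, 0.0],
             [0.0, 0.8, 0.0, 0.0],
             [0.9, 0.0, 0.0, 0.0]]
mutant = [[0.0, 0.0, 0.0, 0.1],
          [0.0, 0.0, 0.8, 0.0],
          [0.0, 0.8, 0.0, 0.0],
          [0.1, 0.0, 0.0, 0.0]]

score = pearson(pairing_profile(wild_type), pairing_profile(mutant))
print(score)
```

Because the comparison uses the whole ensemble rather than a single minimum free energy structure, it can register shifts in the population of alternative conformations that a single predicted structure would miss.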
Large, multi-domain RNA molecules are generally thought to fold following multiple pathways down rugged landscapes populated with intermediates and traps. A challenge to understanding RNA folding reactions is the complex relationship that exists between the structure of the RNA and its folding landscape. The identification of intermediate species that populate folding landscapes and the characterization of elements of their structures are key components to solving the RNA folding problem. This review explores recent studies that characterize the dominant pathways by which RNA folds, the structural and dynamic features of intermediates that populate the folding landscape, and the energy barriers that separate the distinct steps of the folding process.
Unlike protein folding, the process by which a large RNA molecule adopts a functionally active conformation remains poorly understood. Chemical mapping techniques, such as hydroxyl radical (·OH) footprinting, report with single-nucleotide resolution on local structural changes in an RNA as it folds. The analysis and interpretation of these kinetic data require the identification and subsequent optimization of a kinetic model and its parameters. We detail our approach to this problem, specifically focusing on a novel strategy to overcome a factorial explosion in the number of possible models that need to be tested to identify the best-fitting model. Previously, smaller systems (fewer than three intermediates) were computationally tractable using a distributed computing approach. However, for larger systems with three or more intermediates, the problem became computationally intractable. With our new enumeration strategy, we are able to significantly reduce the number of models that need to be tested using non-linear least squares optimization, allowing us to study systems with up to five intermediates. Furthermore, two-intermediate systems can now be analyzed on a desktop computer, which eliminates the need for a distributed computing solution for most medium-sized data sets. Our new approach also allows us to study potential degeneracy in kinetic model selection, elucidating the limits of the method when working with large systems. This work establishes clear criteria for determining whether experimental ·OH data are sufficient to determine the underlying kinetic model, or whether other experimental modalities are required to resolve any degeneracy.
RNA folding; kinetic modeling; Tetrahymena thermophila group I intron; distributed computing; ·OH radical footprinting
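To give a sense of what fitting a single candidate kinetic model involves, the sketch below fits the simplest possible case: one irreversible step from unfolded to folded, with a fraction folded of 1 - exp(-k t). It minimizes the sum of squared residuals over the rate constant k with a golden-section search. This is only an illustration under stated assumptions: the models in the study contain multiple intermediates and are fit with more general non-linear least squares, and the function names, search bounds, and synthetic data here are hypothetical.

```python
import math


def sum_squared_error(k, times, observed):
    """Squared misfit between a single-exponential model and the data."""
    return sum((1.0 - math.exp(-k * t) - y) ** 2 for t, y in zip(times, observed))


def fit_rate(times, observed, low=1e-3, high=10.0, iterations=100):
    """Golden-section search for the rate constant that minimizes the misfit.

    Assumes the misfit has a single minimum between `low` and `high`.
    """
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = low, high
    for _ in range(iterations):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if sum_squared_error(c, times, observed) < sum_squared_error(d, times, observed):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2.0


# Synthetic, noise-free progress curve generated with a known rate of 0.5 per unit time.
times = [0.5, 1.0, 2.0, 4.0, 8.0]
observed = [1.0 - math.exp(-0.5 * t) for t in times]
fitted_k = fit_rate(times, observed)
print(round(fitted_k, 4))
```

Every additional intermediate adds rate constants and possible connections between states, so one such optimization must be run for each candidate topology; that is the source of the combinatorial cost the enumeration strategy is designed to reduce.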
We have developed protocols for rapidly quantifying the band intensities from nucleic acid chemical mapping gels at single nucleotide resolution. These protocols are implemented in the software SAFA (Semi-Automated Footprinting Analysis) that can be downloaded without charge from http://safa.stanford.edu. The protocols implemented in SAFA have five steps: (1) lane identification, (2) gel rectification, (3) band assignment, (4) model fitting, and (5) band intensity normalization. SAFA enables the rapid quantitation of gel images containing thousands of discrete bands, thereby eliminating a bottleneck to the analysis of chemical mapping experiments. An experienced user of the software can quantify a gel image in approximately 15 minutes. Although SAFA was developed to analyze hydroxyl radical (·OH) footprints, it effectively quantifies the gel images obtained with other types of chemical mapping probes. We also present a series of tutorial movies that illustrate the best practices and different steps in the SAFA analysis as a supplement to this protocol.
Gel Electrophoresis; Quantification; Chemical Mapping; Nucleic Acid; Phosphorimaging; SAFA; Footprint
The world of regulatory RNAs is fast expanding into mainstream molecular biology, both as a subject of intense mechanistic study and as a tool for functional characterization. The RNA world is one of complex structures that carry out catalysis, sense metabolites, and synthesize proteins. The dynamic and structural nature of RNAs presents a whole new set of informatics challenges to the computational community. The ability to relate structure and dynamics to function will be key to understanding this complex world. I review several important classes of structured RNAs that present our community with a series of biologically novel informatics challenges. I also review available informatics tools that have been recently developed in the field.
RNA; Folding; Informatics; Riboswitch; Ribosome; RNAi
At the heart of the RNA folding problem are the number, structures, and relationships among the intermediates that populate the folding pathways of most large RNA molecules. Unique insight into the structural dynamics of these intermediates can be gleaned from the time-dependent changes in local probes of macromolecular conformation (e.g. reports on individual nucleotide solvent accessibility offered by hydroxyl radical (•OH) footprinting). Local measures distributed around a macromolecule individually illuminate the ensemble of separate changes that constitute a folding reaction. Folding pathway reconstruction from a multitude of these individual measures is daunting because the number of possible kinetic models grows combinatorially as the number of independent local measures increases. Fortunately, clustering of time-progress curves sufficiently reduces the dimensionality of the data to make reconstruction computationally tractable. The most likely folding topology and intermediates can then be identified by exhaustively enumerating all possible kinetic models on a supercomputer grid. The folding pathways and measures of the relative flux through them were determined for Mg2+- and Na+-mediated folding of the Tetrahymena thermophila group I intron using this combined experimental and computational approach. The flux during Mg2+-mediated folding is divided among numerous parallel pathways. In contrast, the flux during the Na+-mediated reaction is predominantly restricted to three pathways, one of which is without detectable passage through intermediates. Under both conditions, the folding reaction is highly parallel, with no single pathway accounting for more than 50% of the molecular flux. This suggests that RNA folding is non-sequential under a variety of different experimental conditions, even at the earliest stages of folding.
This study provides a template for the systematic analysis of the time-evolution of RNA structure from ensembles of local measures that will illuminate the chemical and physical characteristics of each step in the process. The applicability of this analysis approach to other macromolecules is discussed.
RNA; Folding; Ribozyme; Pathway; Salt
The use of capillary electrophoresis with fluorescently labeled nucleic acids revolutionized DNA sequencing, effectively fueling the genomic revolution. We present an application of this technology for the high-throughput structural analysis of nucleic acids by chemical and enzymatic mapping (‘footprinting’). We achieve the throughput and data quality necessary for genomic-scale structural analysis by combining fluorophore labeling of nucleic acids with novel quantitation algorithms. We implemented these algorithms in the CAFA (capillary automated footprinting analysis) open-source software that is downloadable gratis from https://simtk.org/home/cafa. The accuracy, throughput and reproducibility of CAFA analysis are demonstrated using hydroxyl radical footprinting of RNA. The versatility of CAFA is illustrated by dimethyl sulfate mapping of RNA secondary structure and DNase I mapping of a protein binding to a specific sequence of DNA. Our experimental and computational approach facilitates the acquisition of high-throughput chemical probing data for solution structural analysis of nucleic acids.