1.  BlockClust: efficient clustering and classification of non-coding RNAs from short read RNA-seq profiles 
Bioinformatics  2014;30(12):i274-i282.
Summary: Non-coding RNAs (ncRNAs) play a vital role in many cellular processes such as RNA splicing, translation and gene regulation. However, the vast majority of ncRNAs still have no functional annotation. One prominent approach for putative function assignment is clustering of transcripts according to sequence and secondary structure. However, sequence information is changed by post-transcriptional modifications, and secondary structure is only a proxy for the true 3D conformation of the RNA polymer. A different type of information that does not suffer from these issues and that can be used for the detection of RNA classes is the pattern of processing and its traces in small RNA-seq read data. Here we introduce BlockClust, an efficient approach to detect transcripts with similar processing patterns. We propose a novel way to encode expression profiles in compact discrete structures, which can then be processed using fast graph-kernel techniques. We both perform unsupervised clustering and develop family-specific discriminative models; finally, we show that the proposed approach is scalable, accurate and robust across different organisms, tissues and cell lines.
Availability: The whole BlockClust galaxy workflow including all tool dependencies is available at http://toolshed.g2.bx.psu.edu/view/rnateam/blockclust_workflow.
Contact: backofen@informatik.uni-freiburg.de; costa@informatik.uni-freiburg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btu270
PMCID: PMC4058930  PMID: 24931994
2.  Lineage-specific splicing of a brain-enriched alternative exon promotes glioblastoma progression 
The Journal of Clinical Investigation  2014;124(7):2861-2876.
Tissue-specific alternative splicing is critical for the emergence of tissue identity during development, yet the role of this process in malignant transformation is undefined. Tissue-specific splicing involves evolutionarily conserved, alternative exons that represent only a minority of the total alternative exons identified. Many of these conserved exons have functional features that influence signaling pathways to profound biological effect. Here, we determined that lineage-specific splicing of a brain-enriched cassette exon in the membrane-binding tumor suppressor annexin A7 (ANXA7) diminishes endosomal targeting of the EGFR oncoprotein, consequently enhancing EGFR signaling during brain tumor progression. ANXA7 exon splicing was mediated by the ribonucleoprotein PTBP1, which is normally repressed during neuronal development. PTBP1 was highly expressed in glioblastomas due to loss of a brain-enriched microRNA (miR-124) and to PTBP1 amplification. The alternative ANXA7 splicing trait was present in precursor cells, suggesting that glioblastoma cells inherit the trait from a potential tumor-initiating ancestor and that these cells exploit this trait through accumulation of mutations that enhance EGFR signaling. Our data illustrate that lineage-specific splicing of a tissue-regulated alternative exon in a constituent of an oncogenic pathway eliminates tumor suppressor functions and promotes glioblastoma progression. This paradigm may offer a general model as to how tissue-specific regulatory mechanisms can reprogram normal developmental processes into oncogenic ones.
doi:10.1172/JCI68836
PMCID: PMC4071411  PMID: 24865424
3.  MOF-associated complexes ensure stem cell identity and Xist repression 
eLife  2014;3:e02024.
Histone acetyl transferases (HATs) play distinct roles in many cellular processes and are frequently misregulated in cancers. Here, we study the regulatory potential of MYST1-(MOF)-containing MSL and NSL complexes in mouse embryonic stem cells (ESCs) and neuronal progenitors. We find that both complexes influence transcription by targeting promoters and TSS-distal enhancers. In contrast to flies, the MSL complex is not exclusively enriched on the X chromosome, yet it is crucial for mammalian X chromosome regulation as it specifically regulates Tsix, the major repressor of Xist lncRNA. MSL depletion leads to decreased Tsix expression, reduced REX1 recruitment, and consequently, enhanced accumulation of Xist and variable numbers of inactivated X chromosomes during early differentiation. The NSL complex provides additional, Tsix-independent repression of Xist by maintaining pluripotency. MSL and NSL complexes therefore act synergistically by using distinct pathways to ensure a fail-safe mechanism for the repression of X inactivation in ESCs.
DOI: http://dx.doi.org/10.7554/eLife.02024.001
eLife digest
Gene expression is controlled by a complicated network of mechanisms involving a wide range of enzymes and protein complexes. Many of these mechanisms are identical in males and females, but some are not. Female mammals, for example, carry two X chromosomes, whereas males have one X and one Y chromosome. Since the two X chromosomes in females contain essentially the same set of genes, one of them undergoes silencing to prevent the overproduction of certain proteins. This process, which is called X-inactivation, occurs during different stages of development and it must be tightly controlled.
An enzyme called MOF was originally found in flies in two distinct complexes: the male-specific lethal (MSL) complex, which forms only in males, and the non-specific lethal (NSL) complex, which is ubiquitous in both males and females. These complexes are evolutionarily conserved and are also found in mammals. While mammalian MOF is reasonably well understood, the MSL and NSL complexes are not, so Chelmicki, Dündar et al. have used various sequencing techniques, in combination with biochemical experiments, to investigate their roles in embryonic stem cells and neuronal progenitor cells in mice.
These experiments show that MSL and NSL complexes engage in the regulation of thousands of genes. Although the two complexes often show different gene preferences, they often regulate the same cellular processes. The MSL/NSL-dependent regulation of X chromosome inactivation is a prime example of this phenomenon.
The MSL complex reduces the production of an RNA molecule called Xist, which is responsible for the inactivation of one of the two X chromosomes in females. The NSL complex, meanwhile, ensures the production of multiple proteins that are crucial for the development of embryonic stem cells, and are also involved in the repression of X inactivation.
This analysis sheds light on how different complexes can cooperate and complement each other in order to reach the same goal in the cell. The knowledge gained from this study will pave the way towards better understanding of complex processes such as embryonic development, organogenesis and the pathogenesis of disorders like cancer.
DOI: http://dx.doi.org/10.7554/eLife.02024.002
doi:10.7554/eLife.02024
PMCID: PMC4059889  PMID: 24842875
D. melanogaster; epigenetics; chromatin; transcription; acetylation; X inactivation; mouse
4.  Comparative analysis of Cas6b processing and CRISPR RNA stability 
RNA Biology  2013;10(5):700-707.
The prokaryotic antiviral defense system CRISPR (clustered regularly interspaced short palindromic repeats)/Cas (CRISPR-associated) employs short crRNAs (CRISPR RNAs) to target invading viral nucleic acids. A short spacer sequence of these crRNAs can be derived from a viral genome and recognizes a recurring attack of a virus via base complementarity. We analyzed the effect of spacer sequences on the maturation of crRNAs of the subtype I-B Methanococcus maripaludis C5 CRISPR cluster. The responsible endonuclease, termed Cas6b, bound non-hydrolyzable repeat RNA as a dimer and mature crRNA as a monomer. Comparative analysis of Cas6b processing of individual spacer-repeat-spacer RNA substrates and crRNA stability revealed the potential influence of spacer sequence and length on these parameters. Correlation of these observations with the variable abundance of crRNAs visualized by deep-sequencing analyses is discussed. Finally, insertion of spacer and repeat sequences with archaeal poly-T termination signals is suggested to be prevented in archaeal CRISPR/Cas systems.
doi:10.4161/rna.23715
PMCID: PMC3737328  PMID: 23392318
CRISPR; Cas6; endonuclease; crRNA; in-line probing; RNA binding; transcription termination
5.  Two CRISPR-Cas systems in Methanosarcina mazei strain Gö1 display common processing features despite belonging to different types I and III
RNA Biology  2013;10(5):779-791.
The clustered regularly interspaced short palindromic repeats (CRISPR) system represents a highly adaptive and heritable defense system against foreign nucleic acids in bacteria and archaea. We analyzed the two CRISPR-Cas systems in Methanosarcina mazei strain Gö1. Although belonging to different subtypes (I-B and III-B), the leaders and repeats of both loci are nearly identical. Also, despite many point mutations in each array, a common hairpin motif was identified in the repeats by a bioinformatics analysis and in vitro structural probing. The expression and maturation of CRISPR-derived RNAs (crRNAs) were studied in vitro and in vivo. Both respective potential Cas6b-type endonucleases were purified and their activity tested in vitro. Each protein showed significant activity and could cleave both repeats at the same processing site. Cas6b of subtype III-B, however, was significantly more efficient in its cleavage activity compared with Cas6b of subtype I-B. Northern blot and differential RNAseq analyses were performed to investigate in vivo transcription and maturation of crRNAs, revealing generally very low expression of both systems, whereas significant induction at high NaCl concentrations was observed. crRNAs derived proximal to the leader were generally more abundant than distal ones and in vivo processing sites were clarified for both loci, confirming the previously well-established 8 nt 5′ repeat tags. The 3′-ends were more diverse, but generally ended in a prefix of the following repeat sequence (3′-tag). The analysis further revealed a 5′-hydroxy and 3′-phosphate termini architecture of small crRNAs specific for cleavage products of Cas6 endonucleases from type I-E and I-F and type III-B.
doi:10.4161/rna.23928
PMCID: PMC3737336  PMID: 23619576
methanoarchaea; CRISPR-Cas system; immunity of prokaryotes; regulatory RNA; phages; Methanosarcina mazei
6.  Cluster based prediction of PDZ-peptide interactions 
BMC Genomics  2014;15(Suppl 1):S5.
Background
PDZ domains are one of the most promiscuous protein recognition modules that bind to short linear peptides and play an important role in cellular signaling. Recently, a few high-throughput techniques (e.g. protein microarray screen, phage display) have been applied to determine the in vitro binding specificity of PDZ domains. Currently, many computational methods are available to predict PDZ-peptide interactions, but they often provide domain-specific models and/or have limited domain coverage.
Results
Here, we composed the largest set of PDZ domains derived from human, mouse, fly and worm proteomes and defined binding models for PDZ domain families to improve the domain coverage and prediction specificity. For that purpose, we first identified a novel set of 138 PDZ families, comprising 548 PDZ domains from the aforementioned organisms, based on efficient clustering according to their sequence identity. For 43 PDZ families, covering 226 PDZ domains with available interaction data, we built specialized models using a support vector machine approach. The advantage of family-wise models is that they can also be used to determine the binding specificity of a newly characterized PDZ domain with sufficient sequence identity to the known families. Since most current experimental approaches provide only positive data, we have to cope with the class imbalance problem. Thus, to enrich the negative class, we introduced a powerful semi-supervised technique to generate high confidence non-interaction data. We report competitive predictive performance with respect to state-of-the-art approaches.
Conclusions
Our approach has several contributions. First, we show that domain coverage can be increased by applying an accurate clustering technique. Second, we developed an approach based on a semi-supervised strategy to get high confidence negative data. Third, we allowed high order correlations between the amino acid positions in the binding peptides. Fourth, our method is general enough to be easily applicable to other peptide recognition modules such as SH2 domains. Finally, we performed a genome-wide prediction for 101 human and 102 mouse PDZ domains and uncovered novel interactions with biological relevance. We make all the predictive models and genome-wide predictions freely available to the scientific community.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-S1-S5) contains supplementary material, which is available to authorized users.
doi:10.1186/1471-2164-15-S1-S5
PMCID: PMC4046824  PMID: 24564547
PDZ domain-peptide interactions; protein recognition modules; protein domain clustering; semi-supervised learning; support vector machines
7.  A Complex of Cas Proteins 5, 6, and 7 Is Required for the Biogenesis and Stability of Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)-derived RNAs (crRNAs) in Haloferax volcanii* 
The Journal of Biological Chemistry  2014;289(10):7164-7177.
Background: The Cas6 protein is required for generating crRNAs in CRISPR-Cas I and III systems.
Results: The Cas6 protein is necessary for crRNA production but not sufficient for crRNA maintenance in Haloferax.
Conclusion: A Cascade-like complex is required in the type I-B system for a stable crRNA population.
Significance: The CRISPR-Cas system I-B has a Cascade complex similar to those of types I-A and I-E.
The clustered regularly interspaced short palindromic repeats/CRISPR-associated (CRISPR-Cas) system is a prokaryotic defense mechanism against foreign genetic elements. A plethora of CRISPR-Cas versions exist, with more than 40 different Cas protein families and several different molecular approaches to fight the invading DNA. One of the key players in the system is the CRISPR-derived RNA (crRNA), which directs the invader-degrading Cas protein complex to the invader. The CRISPR-Cas types I and III use the Cas6 protein to generate mature crRNAs. Here, we show that the Cas6 protein is necessary for crRNA production but that additional Cas proteins that form a CRISPR-associated complex for antiviral defense (Cascade)-like complex are needed for crRNA stability in the CRISPR-Cas type I-B system in Haloferax volcanii in vivo. Deletion of the cas6 gene results in the loss of mature crRNAs and interference. However, cells that have the complete cas gene cluster (cas1–8b) removed and are transformed with the cas6 gene are not able to produce and stably maintain mature crRNAs. crRNA production and stability are rescued only if cas5, -6, and -7 are present. Mutational analysis of the cas6 gene reveals three amino acids (His-41, Gly-256, and Gly-258) that are essential for pre-crRNA cleavage, whereas the mutation of two amino acids (Ser-115 and Ser-224) leads to an increase of crRNA amounts. This is the first systematic in vivo analysis of Cas6 protein variants. In addition, we show that the H. volcanii I-B system contains a Cascade-like complex with a Cas7, Cas5, and Cas6 core that protects the crRNA.
doi:10.1074/jbc.M113.508184
PMCID: PMC3945376  PMID: 24459147
Archaea; Microbiology; Molecular Biology; Molecular Genetics; Protein Complexes; CRISPR/Cas; Cas6; Haloferax volcanii; crRNA; Type I-B
8.  GraphProt: modeling binding preferences of RNA-binding proteins 
Genome Biology  2014;15(1):R17.
We present GraphProt, a computational framework for learning sequence- and structure-binding preferences of RNA-binding proteins (RBPs) from high-throughput experimental data. We benchmark GraphProt, demonstrating that the modeled binding preferences conform to the literature, and showcase the biological relevance and two applications of GraphProt models. First, estimated binding affinities correlate with experimental measurements. Second, predicted Ago2 targets display higher levels of expression upon Ago2 knockdown, whereas control targets do not. Computational binding models, such as those provided by GraphProt, are essential for predicting RBP binding sites and affinities in all tissues. GraphProt is freely available at http://www.bioinf.uni-freiburg.de/Software/GraphProt.
doi:10.1186/gb-2014-15-1-r17
PMCID: PMC4053806  PMID: 24451197
9.  Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources 
Motivation
The identification of protein and gene names (PGNs) from the scientific literature requires semantic resources: terminological and lexical resources deliver the term candidates into PGN tagging solutions, and the gold standard corpora (GSC) train them to identify term parameters and contextual features. Ideally all three resources, i.e. corpora, lexica and taggers, cover the same domain knowledge, and thus support identification of the same types of PGNs and cover all of them. Unfortunately, none of the three serves as a predominant standard, and for this reason it is worth exploring how these three resources comply with each other. We systematically compare different PGN taggers against publicly available corpora and analyze the impact of the included lexical resource on their performance. In particular, we determine the performance gains through false positive filtering, which contributes to the disambiguation of identified PGNs.
Results
In general, machine learning approaches (ML-Tag) for PGN tagging show higher F1-measure performance against the BioCreative-II and Jnlpba GSCs (exact matching), whereas the lexicon based approaches (LexTag) in combination with disambiguation methods show better results on FsuPrge and PennBio. The ML-Tag solutions balance precision and recall, whereas the LexTag solutions have different precision and recall profiles at the same F1-measure across all corpora. Higher recall is achieved with larger lexical resources, which also introduce more noise (false positive results). The ML-Tag solutions perform best if the test corpus is from the same GSC as the training corpus. As expected, the false negative errors characterize the test corpora, whereas the profiles of the false positive mistakes characterize the tagging solutions. LexTag solutions that are based on a large terminological resource in combination with false positive filtering produce better results and, in contrast to ML-Tag solutions, also provide concept identifiers from a knowledge source.
Conclusion
The standard ML-Tag solutions achieve high performance, but not across all corpora, and thus should be trained using several different corpora to reduce possible biases. The LexTag solutions have different profiles for their precision and recall performance, but with similar F1-measure. This result is surprising and suggests that they cover a portion of the most common naming standards, but cope differently with the term variability across the corpora. The false positive filtering applied to LexTag solutions does improve the results by increasing their precision without significantly compromising their recall. The harmonisation of the annotation schemes in combination with standardized lexical resources in the tagging solutions will enable their comparability and will pave the way for a shared standard.
doi:10.1186/2041-1480-4-28
PMCID: PMC4021975  PMID: 24112383
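The relationship between precision, recall and F1-measure that entry 9 relies on can be illustrated with a minimal Python sketch. All counts below are hypothetical and are not taken from the paper; they only show how removing false positives can raise precision while recall changes little.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive,
    false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for a lexicon-based tagger before filtering.
before = precision_recall_f1(tp=80, fp=40, fn=20)

# Hypothetical counts after a filter removes 30 false positives
# at the cost of 2 true positives (which become false negatives).
after = precision_recall_f1(tp=78, fp=10, fn=22)

print("before filtering: P=%.3f R=%.3f F1=%.3f" % before)
print("after filtering:  P=%.3f R=%.3f F1=%.3f" % after)
```

In this toy case precision rises markedly while recall drops by only two percentage points, which is the kind of trade-off the abstract describes for false positive filtering.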
10.  Evaluation and Cross-Comparison of Lexical Entities of Biological Interest (LexEBI) 
PLoS ONE  2013;8(10):e75185.
Motivation
Biomedical entities, their identifiers and names, are essential in the representation of biomedical facts and knowledge. In the same way, the complete set of biomedical and chemical terms, i.e. the biomedical “term space” (the “Lexeome”), forms a key resource to achieve the full integration of the scientific literature with biomedical data resources: any identified named entity can immediately be normalized to the correct database entry. This goal does not only require that we are aware of all existing terms, but would also profit from knowing all their senses and their semantic interpretation (ambiguities, nestedness).
Result
This study compiles a resource for lexical terms of biomedical interest in a standard format (called “LexEBI”), determines the overall number of terms, their reuse in different resources and the nestedness of terms. LexEBI comprises references for protein and gene entries and their term variants and chemical entities amongst other terms. In addition, disease terms have been identified from Medline and PubmedCentral and added to LexEBI. Our analysis demonstrates that the baseforms of terms from the different semantic types show little polysemous use. Nonetheless, the term variants of protein and gene names (PGNs) frequently contain species mentions, which should have been avoided according to protein annotation guidelines. Furthermore, the protein and gene entities as well as the chemical entities both comprise enzymes, leading to hierarchical polysemy, and a large portion of PGNs make reference to a chemical entity. Altogether, according to our analysis based on the Medline distribution, 401,869 unique PGNs in the documents contain a reference to 25,022 chemical entities, 3,125 disease terms or 1,576 species mentions.
Conclusion
LexEBI delivers the complete biomedical and chemical Lexeome in a standardized representation (http://www.ebi.ac.uk/Rebholz-srv/LexEBI/). The resource provides the disease terms as open source content, and fully interlinks terms across resources.
doi:10.1371/journal.pone.0075185
PMCID: PMC3790750  PMID: 24124474
11.  CRISPRmap: an automated classification of repeat conservation in prokaryotic adaptive immune systems 
Nucleic Acids Research  2013;41(17):8034-8044.
Central to Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR)-Cas systems are repeated RNA sequences that serve as Cas-protein–binding templates. Classification is based on the architectural composition of associated Cas proteins; considering repeat evolution is essential to complete the picture. We compiled the largest data set of CRISPRs to date, performed comprehensive, independent clustering analyses and identified a novel set of 40 conserved sequence families and 33 potential structure motifs for Cas-endoribonucleases with some distinct conservation patterns. Evolutionary relationships are presented as a hierarchical map of sequence and structure similarities for both a quick and detailed insight into the diversity of CRISPR-Cas systems. In a comparison with Cas-subtypes, I-C, I-E, I-F and type II were strongly coupled and the remaining type I and type III subtypes were loosely coupled to repeat and Cas1 evolution, respectively. Subtypes with a strong link to CRISPR evolution were almost exclusive to bacteria; nevertheless, we identified rare examples of potential horizontal transfer of I-C and I-E systems into archaeal organisms. Our easy-to-use web server provides an automated assignment of newly sequenced CRISPRs to our classification system and enables more informed choices on future hypotheses in CRISPR-Cas research: http://rna.informatik.uni-freiburg.de/CRISPRmap.
doi:10.1093/nar/gkt606
PMCID: PMC3783184  PMID: 23863837
12.  A graph kernel approach for alignment-free domain–peptide interaction prediction with an application to human SH3 domains 
Bioinformatics  2013;29(13):i335-i343.
Motivation: State-of-the-art experimental data for determining binding specificities of peptide recognition modules (PRMs) is obtained by high-throughput approaches like peptide arrays. Most prediction tools applicable to this kind of data are based on an initial multiple alignment of the peptide ligands. Building an initial alignment can be error-prone, especially in the case of the proline-rich peptides bound by the SH3 domains.
Results: Here, we present a machine-learning approach based on an efficient graph-kernel technique to predict the specificity of a large set of 70 human SH3 domains, which are an important class of PRMs. The graph-kernel strategy allows us to (i) integrate several types of physico-chemical information for each amino acid, (ii) consider high-order correlations between these features and (iii) eliminate the need for an initial peptide alignment. We build specialized models for each human SH3 domain and achieve competitive predictive performance of 0.73 area under precision-recall curve, compared with 0.27 area under precision-recall curve for state-of-the-art methods based on position weight matrices.
We show that better models can be obtained when we use information on the noninteracting peptides (negative examples), which is currently not used by the state-of-the-art approaches based on position weight matrices. To this end, we analyze two strategies to identify subsets of high confidence negative data.
The techniques introduced here are more general and hence can also be used for any other protein domains, which interact with short peptides (i.e. other PRMs).
Availability: The program with the predictive models can be found at http://www.bioinf.uni-freiburg.de/Software/SH3PepInt/SH3PepInt.tar.gz. We also provide a genome-wide prediction for all 70 human SH3 domains, which can be found under http://www.bioinf.uni-freiburg.de/Software/SH3PepInt/Genome-Wide-Predictions.tar.gz.
Contact: backofen@informatik.uni-freiburg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt220
PMCID: PMC3694653  PMID: 23813002
13.  Semi-Supervised Prediction of SH2-Peptide Interactions from Imbalanced High-Throughput Data 
PLoS ONE  2013;8(5):e62732.
Src homology 2 (SH2) domains are the largest family of the peptide-recognition modules (PRMs) that bind to phosphotyrosine containing peptides. Knowledge about binding partners of SH2-domains is key for a deeper understanding of different cellular processes. Given the high binding specificity of SH2, in-silico ligand peptide prediction is of great interest. Currently, however, only a few approaches have been published for the prediction of SH2-peptide interactions. Their main shortcomings range from limited coverage, to restrictive modeling assumptions (they are mainly based on position specific scoring matrices and do not take into consideration complex amino acid inter-dependencies) and high computational complexity. We propose a simple yet effective machine learning approach for a large set of known human SH2 domains. We used comprehensive data from micro-array and peptide-array experiments on 51 human SH2 domains. In order to deal with the high data imbalance problem and the high signal-to-noise ratio, we cast the problem in a semi-supervised setting. We report competitive predictive performance w.r.t. state-of-the-art. Specifically we obtain 0.83 AUC ROC and 0.93 AUC PR in comparison to 0.71 AUC ROC and 0.87 AUC PR previously achieved by the position specific scoring matrices (PSSMs) based SMALI approach. Our work provides three main contributions. First, we showed that better models can be obtained when the information on the non-interacting peptides (negative examples) is also used. Second, we improve performance when considering high order correlations between the ligand positions employing regularization techniques to effectively avoid overfitting issues. Third, we developed an approach to tackle the data imbalance problem using a semi-supervised strategy. Finally, we performed a genome-wide prediction of human SH2-peptide binding, uncovering several findings of biological relevance.
We make our models and genome-wide predictions, for all the 51 SH2-domains, freely available to the scientific community under the following URLs: http://www.bioinf.uni-freiburg.de/Software/SH2PepInt/SH2PepInt.tar.gz and http://www.bioinf.uni-freiburg.de/Software/SH2PepInt/Genome-wide-predictions.tar.gz, respectively.
doi:10.1371/journal.pone.0062732
PMCID: PMC3656881  PMID: 23690949
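For readers unfamiliar with the AUC ROC figure quoted in entry 13, the following is a minimal Python sketch of how such a value can be computed from classifier scores. It uses the pairwise-ranking definition (the probability that a randomly chosen positive is scored above a randomly chosen negative). The scores and labels are invented for illustration and have no connection to the SH2 data; AUC PR is a separate measure and is not computed here.

```python
def auc_roc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    Counts the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for six peptides (1 = binder, 0 = non-binder).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc_roc(scores, labels)
print("toy AUC ROC = %.3f" % toy_auc)
```

The quadratic loop is chosen for clarity; for large data sets a rank-based formulation or an established library routine would be preferable.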
14.  LocARNAscan: Incorporating thermodynamic stability in sequence and structure-based RNA homology search 
Background
The search for distant homologs has become an important issue in genome annotation. A particular difficulty is posed by divergent homologs that have lost recognizable sequence similarity. This same problem also arises in the recognition of novel members of large classes of RNAs such as snoRNAs or microRNAs that consist of families unrelated by common descent. Current homology search tools for structured RNAs are either based entirely on sequence similarity (such as blast or hmmer) or combine sequence and secondary structure. The most prominent example of the latter class of tools is Infernal. Alternatives are descriptor-based methods. In most practical applications published to date, however, the information contained in covariance models or manually prescribed search patterns is dominated by sequence information. Here we ask two related questions: (1) Is secondary structure alone informative for homology search and the detection of novel members of RNA classes? (2) To what extent is the thermodynamic propensity of the target sequence to fold into the correct secondary structure helpful for this task?
Results
Sequence-structure alignment can be used as an alternative search strategy. In this scenario, the query consists of a base pairing probability matrix, which can be derived either from a single sequence or from a multiple alignment representing a set of known representatives. Sequence information can be optionally added to the query. The target sequence is pre-processed to obtain local base pairing probabilities. As a search engine we devised a semi-global scanning variant of LocARNA’s algorithm for sequence-structure alignment. The LocARNAscan tool is optimized for speed and low memory consumption. In benchmarking experiments on artificial data we observe that the inclusion of thermodynamic stability is helpful, albeit only in a regime of extremely low sequence information in the query. We observe, furthermore, that the sensitivity is bounded in particular by the limited accuracy of the predicted local structures of the target sequence.
Conclusions
Although we demonstrate that a purely structure-based homology search is feasible in principle, it is unlikely to outperform tools such as Infernal in most application scenarios, where a substantial amount of sequence information is typically available. The LocARNAscan approach will profit, however, from high throughput methods to determine RNA secondary structure. In transcriptome-wide applications, such methods will provide accurate structure annotations on the target side.
Availability
Source code of the free software LocARNAscan 1.0 and supplementary data are available at http://www.bioinf.uni-leipzig.de/Software/LocARNAscan.
doi:10.1186/1748-7188-8-14
PMCID: PMC3716875  PMID: 23601347
15.  Essential requirements for the detection and degradation of invaders by the Haloferax volcanii CRISPR/Cas system I-B 
RNA Biology  2013;10(5):865-874.
To fend off foreign genetic elements, prokaryotes have developed several defense systems. The most recently discovered defense system, CRISPR/Cas, is sequence-specific, adaptive and heritable. The two central components of this system are the Cas proteins and the CRISPR RNA. The latter consists of repeat sequences that are interspersed with spacer sequences. The CRISPR locus is transcribed into a precursor RNA that is subsequently processed into short crRNAs. CRISPR/Cas systems have been identified in bacteria and archaea, and data show that many variations of this system exist. We analyzed the requirements for a successful defense reaction in the halophilic archaeon Haloferax volcanii. Haloferax encodes a CRISPR/Cas system of the I-B subtype, about which very little is known. Analysis of the mature crRNAs revealed that they contain a spacer as their central element, which is preceded by an eight-nucleotide-long 5′ handle that originates from the upstream repeat. The repeat sequences have the potential to fold into a minimal stem loop. Sequencing of the crRNA population indicated that not all of the spacers that are encoded by the three CRISPR loci are present in the same abundance. By challenging Haloferax with an invader plasmid, we demonstrated that the interaction of the crRNA with the invader DNA requires a 10-nucleotide-long seed sequence. In addition, we found that not all of the crRNAs from the three CRISPR loci are effective at triggering the degradation of invader plasmids. The interference does not seem to be influenced by the copy number of the invader plasmid.
doi:10.4161/rna.24282
PMCID: PMC3737343  PMID: 23594992
archaea; Haloferax volcanii; CRISPR/Cas; crRNA; PAM; seed sequence
16.  CRISPR-Cas Systems in the Cyanobacterium Synechocystis sp. PCC6803 Exhibit Distinct Processing Pathways Involving at Least Two Cas6 and a Cmr2 Protein 
PLoS ONE  2013;8(2):e56470.
The CRISPR-Cas (Clustered Regularly Interspaced Short Palindrome Repeats – CRISPR associated proteins) system provides adaptive immunity in archaea and bacteria. A hallmark of CRISPR-Cas is the involvement of short crRNAs that guide associated proteins in the destruction of invading DNA or RNA. We present three fundamentally distinct processing pathways in the cyanobacterium Synechocystis sp. PCC6803 for a subtype I-D (CRISPR1), and two type III systems (CRISPR2 and CRISPR3), which are located together on the plasmid pSYSA. Using high-throughput transcriptome analyses and assays of transcript accumulation we found all CRISPR loci to be highly expressed, but the individual crRNAs had profoundly varying abundances despite single transcription start sites for each array. In a computational analysis, CRISPR3 spacers with stable secondary structures displayed a greater ratio of degradation products. These structures might interfere with the loading of the crRNAs into RNP complexes, explaining the varying abundances. The maturation of CRISPR1 and CRISPR2 transcripts depends on at least two different Cas6 proteins. Mutation of gene sll7090, encoding a Cmr2 protein, led to the disappearance of all CRISPR3-derived crRNAs, providing in vivo evidence for a function of Cmr2 in the maturation, regulation of expression, Cmr complex formation or stabilization of CRISPR3 transcripts. Finally, we optimized CRISPR repeat structure prediction and the results indicate that the spacer context can influence individual repeat structures.
doi:10.1371/journal.pone.0056470
PMCID: PMC3575380  PMID: 23441196
17.  Navigating the unexplored seascape of pre-miRNA candidates in single-genome approaches 
Bioinformatics  2012;28(23):3034-3041.
Motivation: The computational search for novel microRNA (miRNA) precursors often involves some sort of structural analysis with the aim of identifying which types of structures are prone to being recognized and processed by the cellular miRNA-maturation machinery. A natural way to tackle this problem is to perform clustering over the candidate structures along with known miRNA precursor structures. Mixed clusters then allow the identification of candidates that are similar to known precursors. Given the large number of pre-miRNA candidates that can be identified in single-genome approaches, even after applying several filters for precursor robustness and stability, a conventional structural clustering approach is unfeasible.
Results: We propose a method to represent candidate structures in a feature space, which summarizes key sequence/structure characteristics of each candidate. We demonstrate that proximity in this feature space is related to sequence/structure similarity, and we select candidates that have a high similarity to known precursors. Additional filtering steps are then applied to further reduce the number of candidates to those with greater transcriptional potential. Our method is compared with another single-genome method (TripletSVM) in two datasets, showing better performance in one and comparable performance in the other, for larger training sets. Additionally, we show that our approach allows for a better interpretation of the results.
Availability and Implementation: The MinDist method is implemented using Perl scripts and is freely available at http://www.cravela.org/?mindist=1.
Contact: backofen@informatik.uni-freiburg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts574
PMCID: PMC3516144  PMID: 23052038
18.  An archaeal sRNA targeting cis- and trans-encoded mRNAs via two distinct domains 
Nucleic Acids Research  2012;40(21):10964-10979.
We report on the characterization and target analysis of the small (s)RNA162 in the methanoarchaeon Methanosarcina mazei. Using a combination of genetic approaches, transcriptome analysis and computational predictions, the bicistronic MM2441-MM2440 mRNA encoding the transcription factor MM2441 and a protein of unknown function was identified as a potential target of this sRNA, which due to processing accumulates as three stable 5′ fragments in late exponential growth. Mobility shift assays using various mutants verified that the non-structured single-stranded linker region of sRNA162 (SLR) base-pairs with the MM2440-MM2441 mRNA internally, thereby masking the predicted ribosome binding site of MM2441. This most likely leads to translational repression of the second cistron resulting in dis-coordinated operon expression. Analysis of mutant RNAs in vivo confirmed that the SLR of sRNA162 is crucial for target interactions. Furthermore, our results indicate that sRNA162-controlled MM2441 is involved in regulating the metabolic switch between the carbon sources methanol and methylamine. Moreover, biochemical studies demonstrated that the 5′ end of sRNA162 targets the 5′-untranslated region of the cis-encoded MM2442 mRNA. Overall, this first study of archaeal sRNA/mRNA-target interactions revealed that sRNA162 acts as an antisense (as)RNA on cis- and trans-encoded mRNAs via two distinct domains, indicating that cis-encoded asRNAs can have larger target regulons than previously anticipated.
doi:10.1093/nar/gks847
PMCID: PMC3510493  PMID: 22965121
19.  Producing High-Accuracy Lattice Models from Protein Atomic Coordinates Including Side Chains 
Advances in Bioinformatics  2012;2012:148045.
Lattice models are a common abstraction used in the study of protein structure, folding, and refinement. They are advantageous because the discretisation of space can make extensive protein evaluations computationally feasible. Various approaches to the protein chain lattice fitting problem have been suggested but only a single backbone-only tool is available currently. We introduce LatFit, a new tool to produce high-accuracy lattice protein models. It generates both backbone-only and backbone-side-chain models in any user defined lattice. LatFit implements a new distance RMSD-optimisation fitting procedure in addition to the known coordinate RMSD method. We tested LatFit's accuracy and speed using a large nonredundant set of high resolution proteins (SCOP database) on three commonly used lattices: 3D cubic, face-centred cubic, and knight's walk. Fitting speed compared favourably to other methods and both backbone-only and backbone-side-chain models show low deviation from the original data (~1.5 Å RMSD in the FCC lattice). To our knowledge this represents the first comprehensive study of lattice quality for on-lattice protein models including side chains while LatFit is the only available tool for such models.
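The two fitting criteria named in the LatFit abstract, coordinate RMSD and distance RMSD, can be illustrated with a minimal Python sketch. The abstract does not give LatFit's formulas, so the definitions below are the common textbook ones and the function names `crmsd` and `drmsd` are hypothetical. Coordinate RMSD is usually computed after an optimal superposition; this sketch omits that step and assumes both point sets already share a frame.

```python
import math
from itertools import combinations


def crmsd(a, b):
    """Coordinate RMSD of two equally long 3D point lists (no superposition)."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))


def drmsd(a, b):
    """Distance RMSD: compares all intra-structure pairwise distances."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two point lists of equal length >= 2")
    pairs = list(combinations(range(len(a)), 2))
    total = sum(
        (math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2 for i, j in pairs
    )
    return math.sqrt(total / len(pairs))


# Toy data (made up for illustration): a chain and a translated copy of it.
chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
shifted = [(x + 5.0, y, z) for x, y, z in chain]

print(crmsd(chain, shifted))  # large: the frames differ
print(drmsd(chain, shifted))  # zero up to rounding: internal distances agree
```

The translated copy shows the practical difference: distance RMSD depends only on internal distances and needs no superposition, whereas coordinate RMSD is sensitive to how the two structures are placed relative to each other.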
doi:10.1155/2012/148045
PMCID: PMC3426164  PMID: 22934109
20.  Characterization of CRISPR RNA processing in Clostridium thermocellum and Methanococcus maripaludis  
Nucleic Acids Research  2012;40(19):9887-9896.
The CRISPR arrays found in many bacteria and most archaea are transcribed into a long precursor RNA that is processed into small clustered regularly interspaced short palindromic repeats (CRISPR) RNAs (crRNAs). These RNA molecules can contain fragments of viral genomes and mediate, together with a set of CRISPR-associated (Cas) proteins, the prokaryotic immunity against viral attacks. CRISPR/Cas systems are diverse and the Cas6 enzymes that process crRNAs vary between different subtypes. We analysed CRISPR/Cas subtype I-B and present the identification of novel Cas6 enzymes from the bacterial and archaeal model organisms Clostridium thermocellum and Methanococcus maripaludis C5. The in vitro activity and specificity of Methanococcus maripaludis Cas6b were determined. Two complementary catalytic histidine residues were identified. RNA-Seq analyses revealed in vivo crRNA processing sites, crRNA abundance and orientation of CRISPR transcription within these two organisms. Individual spacer sequences were identified with strong effects on transcription and processing patterns of a CRISPR cluster. These effects will need to be considered for the application of CRISPR clusters that are designed to produce synthetic crRNAs.
doi:10.1093/nar/gks737
PMCID: PMC3479195  PMID: 22879377
21.  Accessibility and conservation 
RNA Biology  2012;9(7):954-965.
Bacterial small RNAs (sRNAs) are a class of structural RNAs that often regulate mRNA targets via post-transcriptional base pair interactions. We determined features that discriminate functional from non-functional interactions and assessed the influence of these features on genome-wide target predictions. For this purpose, we compiled a set of 71 experimentally verified sRNA–target pairs from Escherichia coli and Salmonella enterica. Furthermore, we collected full-length 5′ untranslated regions by using genome-wide experimentally verified transcription start sites.
 
Only interaction sites in sRNAs, but not in targets, show significant sequence conservation. In addition to this observation, we found that the base pairing between sRNAs and their targets is not conserved in general across more distantly related species. A closer inspection of RybB and RyhB sRNAs and their targets revealed that the base pairing complementarity is only conserved in a small subset of the targets. In contrast to conservation, accessibility of functional interaction sites is significantly higher in both sRNAs and targets in comparison to non-functional sites. Based on the above observations, we successfully used the following constraints to improve the specificity of genome-wide target predictions: the region of interaction initiation must be located in (1) highly accessible regions in both interaction partners or (2) unstructured conserved sRNA regions derived from reliability profiles of multiple sRNA alignments.
Aligned sequences of homologous sRNAs, functional and non-functional targets, and a supplementary document with supplementary tables, figures and references are available at www.bioinf.uni-freiburg.de/Supplements/srna-interact-feat/.
doi:10.4161/rna.20294
PMCID: PMC3495738  PMID: 22767260
IntaRNA; RNA–RNA interaction; accessibility; bacteria; conservation; sRNA; target prediction
22.  GraphClust: alignment-free structural clustering of local RNA secondary structures 
Bioinformatics  2012;28(12):i224-i232.
Motivation: Clustering according to sequence–structure similarity has now become a generally accepted scheme for ncRNA annotation. Its application to complete genomic sequences as well as whole transcriptomes is therefore desirable but hindered by extremely high computational costs.
Results: We present a novel linear-time, alignment-free method for comparing and clustering RNAs according to sequence and structure. The approach scales to datasets of hundreds of thousands of sequences. The quality of the retrieved clusters has been benchmarked against known ncRNA datasets and is comparable to state-of-the-art sequence–structure methods while achieving speedups of several orders of magnitude. A selection of applications aiming at the detection of novel structural ncRNAs is presented. As an example, we predicted local structural elements specific to lincRNAs that likely link the involved transcripts functionally to vital processes of the human nervous system. In total, we predicted 349 local structural RNA elements.
Availability: The GraphClust pipeline is available on request.
Contact: backofen@informatik.uni-freiburg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts224
PMCID: PMC3371856  PMID: 22689765
23.  CARNA—alignment of RNA structure ensembles 
Nucleic Acids Research  2012;40(Web Server issue):W49-W53.
Due to recent algorithmic progress, tools for the gold standard of comparative RNA analysis, namely Sankoff-style simultaneous alignment and folding, are now readily applicable. Such approaches, however, compare RNAs with respect to a simultaneously predicted, single, nested consensus structure. To make multiple alignment of RNAs available in cases where this limitation of the standard approach is critical, we introduce a web server that provides a complete and convenient interface to the RNA structure alignment tool ‘CARNA’. This tool uniquely supports RNAs with multiple conserved structures per RNA and aligns pseudoknots intrinsically; these features are highly desirable for aligning riboswitches, RNAs with conserved folding pathways, or pseudoknots. We represent structural input and output information as base pair probability dot plots; this provides great flexibility in the input, ranging from fixed structures to structure ensembles, and enables immediate visual analysis of the results. In contrast to conventional Sankoff-style approaches, ‘CARNA’ optimizes all structural similarities in the input simultaneously, for example across an entire RNA structure ensemble. Even compared with already costly Sankoff-style alignment, ‘CARNA’ solves an intrinsically much harder problem by applying advanced, constraint-based, algorithmic techniques. Although ‘CARNA’ is specialized to the alignment of RNAs with several conserved structures, its performance on RNAs in general is on par with state-of-the-art general-purpose RNA alignment tools, as we show in a Bralibase 2.1 benchmark. The web server is freely available at http://rna.informatik.uni-freiburg.de/CARNA.
doi:10.1093/nar/gks491
PMCID: PMC3394245  PMID: 22689637
24.  Global or local? Predicting secondary structure and accessibility in mRNAs 
Nucleic Acids Research  2012;40(12):5215-5226.
Determining the structural properties of mRNA is key to understanding vital post-transcriptional processes. As experimental data on mRNA structure are scarce, accurate structure prediction is required to characterize RNA regulatory mechanisms. Although various structure prediction approaches are available, it is often unclear which to choose and how to set their parameters. Furthermore, no standard measure to compare predictions of local structure exists. We assessed the performance of different methods using two types of data: transcriptome-wide enzymatic probing information and a large, curated set of cis-regulatory elements. To compare the approaches, we introduced structure accuracy, a measure that is applicable to both global and local methods. Our results showed that local folding was more accurate than the classic global approach. We investigated how the locality parameters, maximum base pair span and window size, influenced the prediction performance. A span of 150 provided a reasonable balance between maximizing the number of accurately predicted base pairs, while minimizing effects of incorrect long-range predictions. We characterized the error at artificial sequence ends, which we reduced by setting the window size sufficiently greater than the maximum span. Our method, LocalFold, diminished all border effects and produced the most robust performance.
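The effect of a maximum base pair span, one of the two locality parameters discussed in this abstract, can be sketched in Python. This is not LocalFold and not a thermodynamic folding method: it swaps in a simple Nussinov-style base pair maximization, which only counts nested pairs, to show how a span limit excludes long-range pairs. The function name `max_pairs`, the `min_loop` parameter and the toy sequence are assumptions made for illustration.

```python
# Canonical pairs plus G-U wobble; an assumption of this sketch.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairs(seq, max_span, min_loop=3):
    """Maximum number of nested pairs (k, j) with min_loop < j - k <= max_span."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for length in range(min_loop + 1, n):
        for i in range(n - length):
            j = i + length
            score = best[i][j - 1]  # case: position j stays unpaired
            for k in range(i, j - min_loop):  # case: j pairs with k
                if j - k <= max_span and (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


hairpin = "GGGAAAACCC"
for span in (9, 5, 3):
    print(span, max_pairs(hairpin, span))
```

On real mRNA-length sequences the span value reported in the abstract would correspond to `max_span=150`; the cubic running time of this toy recursion is one reason practical tools combine the span limit with a sliding window.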
doi:10.1093/nar/gks181
PMCID: PMC3384308  PMID: 22373926
25.  Structator: fast index-based search for RNA sequence-structure patterns 
BMC Bioinformatics  2011;12:214.
Background
The secondary structure of RNA molecules is intimately related to their function and often more conserved than the sequence. Hence, the important task of searching databases for RNAs requires matching sequence-structure patterns. Unfortunately, current tools for this task have, in the best case, a running time that is only linear in the size of sequence databases. Furthermore, established index data structures for fast sequence matching, like suffix trees or arrays, cannot benefit from the complementarity constraints introduced by the secondary structure of RNAs.
Results
We present a novel method and readily applicable software for time-efficient matching of RNA sequence-structure patterns in sequence databases. Our approach is based on affix arrays, a recently introduced index data structure, preprocessed from the target database. Affix arrays support bidirectional pattern search, which is required for efficiently handling the structural constraints of the pattern. Structural patterns like stem-loops can be matched inside out, such that the loop region is matched first and then the pairing bases on the boundaries are matched consecutively. This allows base pairing information to be exploited for search space reduction and leads to an expected running time that is sublinear in the size of the sequence database. The incorporation of a new chaining approach in the search of RNA sequence-structure patterns enables the description of molecules folding into complex secondary structures with multiple ordered patterns. The chaining approach removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our method runs up to two orders of magnitude faster than previous methods.
Conclusions
The presented method's sublinear expected running time makes it well suited for RNA sequence-structure pattern matching in large sequence databases. RNA molecules containing several stem-loop substructures can be described by multiple sequence-structure patterns and their matches are efficiently handled by a novel chaining method. Beyond our algorithmic contributions, we provide with Structator a complete and robust open-source software solution for index-based search of RNA sequence-structure patterns. The Structator software is available at http://www.zbh.uni-hamburg.de/Structator.
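The inside-out matching order described in the Results above (loop first, then the pairing bases outward) can be sketched in Python. This is only an illustration of the matching order: it uses a plain linear scan with `str.find` instead of an affix array, so it does not have the sublinear expected running time reported for Structator. The function name `find_stem_loops`, the pattern encoding (an exact loop motif plus a required stem length) and the inclusion of G-U wobble pairs are assumptions.

```python
# Allowed base pairs; including G-U wobble is an assumption of this sketch.
COMPLEMENT = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def find_stem_loops(seq, loop, stem_len):
    """Return half-open (start, end) intervals of hairpins in seq.

    A hit is an exact occurrence of `loop` enclosed by `stem_len`
    consecutive base pairs, verified from the loop outward.
    """
    hits = []
    start = seq.find(loop)
    while start != -1:
        left = start - 1          # next base to check on the 5' side
        right = start + len(loop)  # next base to check on the 3' side
        paired = 0
        while (
            paired < stem_len
            and left >= 0
            and right < len(seq)
            and (seq[left], seq[right]) in COMPLEMENT
        ):
            paired += 1
            left -= 1
            right += 1
        if paired == stem_len:
            hits.append((left + 1, right))
        start = seq.find(loop, start + 1)
    return hits


target = "AAGGCGAAAGCCUU"  # toy sequence, made up for illustration
print(find_stem_loops(target, "GAAA", 3))
```

Extending outward from the loop means a candidate is discarded at the first non-complementary pair, which is the source of the search space reduction the abstract attributes to base pairing information.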
doi:10.1186/1471-2105-12-214
PMCID: PMC3154205  PMID: 21619640
