1.  ExpaRNA-P: simultaneous exact pattern matching and folding of RNAs 
BMC Bioinformatics  2014;15(1):404.
Background
Identifying sequence-structure motifs common to two RNAs can speed up the comparison of structural RNAs substantially. The core algorithm of the existent approach ExpaRNA solves this problem for a priori known input structures. However, such structures are rarely known; moreover, predicting them computationally is no rescue, since single sequence structure prediction is highly unreliable.
Results
The novel algorithm ExpaRNA-P computes exactly matching sequence-structure motifs in entire Boltzmann-distributed structure ensembles of two RNAs; thereby we match and fold RNAs simultaneously, analogous to the well-known “simultaneous alignment and folding” of RNAs. While this implies much higher flexibility compared to ExpaRNA, ExpaRNA-P has the same very low complexity (quadratic in time and space), which is enabled by its novel structure ensemble-based sparsification. Furthermore, we devise a generalized chaining algorithm to compute compatible subsets of ExpaRNA-P’s sequence-structure motifs. We then use the best chain as anchor constraints for the sequence-structure alignment tool LocARNA, which yields the very fast RNA alignment approach ExpLoc-P. ExpLoc-P is benchmarked in several variants and versus state-of-the-art approaches. In particular, we formally introduce and evaluate strict and relaxed variants of the problem; the latter makes the approach sensitive to compensatory mutations. Across a benchmark set of typical non-coding RNAs, ExpLoc-P has similar accuracy to LocARNA but is four times faster (in both variants), while it achieves a speed-up of over 30-fold for the longest benchmark sequences (≈400 nt). Finally, different ExpLoc-P variants enable tailoring of the method to specific application scenarios. ExpaRNA-P and ExpLoc-P are distributed as part of the LocARNA package. The source code is freely available at http://www.bioinf.uni-freiburg.de/Software/ExpaRNA-P.
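The chaining step mentioned above can be illustrated with a classical colinear chaining sketch. This is not the generalized algorithm of the paper, which also has to respect structural compatibility; it only assumes that each motif match is a hypothetical tuple (start_a, end_a, start_b, end_b, score) and that two matches are compatible when one ends before the other starts in both sequences.

```python
from typing import List, Tuple

# Hypothetical match representation (an assumption for illustration):
# (start_a, end_a, start_b, end_b, score), inclusive coordinates, start <= end.
Match = Tuple[int, int, int, int, float]


def best_chain(matches: List[Match]) -> Tuple[float, List[Match]]:
    """Return the highest-scoring chain of pairwise colinear matches."""
    if not matches:
        return 0.0, []
    # Any valid predecessor starts earlier in sequence A, so sorting by the
    # start positions guarantees predecessors are processed first.
    ordered = sorted(matches, key=lambda m: (m[0], m[2]))
    best = [m[4] for m in ordered]   # best chain score ending at match i
    prev = [-1] * len(ordered)       # predecessor index for traceback
    for j, mj in enumerate(ordered):
        for i in range(j):
            mi = ordered[i]
            # mi may precede mj only if it ends before mj starts in both sequences
            if mi[1] < mj[0] and mi[3] < mj[2] and best[i] + mj[4] > best[j]:
                best[j] = best[i] + mj[4]
                prev[j] = i
    end = max(range(len(ordered)), key=best.__getitem__)
    chain = []
    k = end
    while k != -1:
        chain.append(ordered[k])
        k = prev[k]
    chain.reverse()
    return best[end], chain
```

The quadratic loop over match pairs keeps the sketch short; in a pipeline like the one described, the selected chain would then be handed to the aligner as anchor constraints.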
Conclusions
ExpaRNA-P’s novel ensemble-based sparsification reduces its complexity to quadratic time and space. Thereby, ExpaRNA-P significantly speeds up sequence-structure alignment while maintaining the alignment quality. Different ExpaRNA-P variants support a wide range of applications.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0404-0) contains supplementary material, which is available to authorized users.
doi:10.1186/s12859-014-0404-0
PMCID: PMC4302096  PMID: 25551362
RNA bioinformatics; Structure-based comparison of RNA; Sparsification
2.  Atom mapping with constraint programming 
Chemical reactions are rearrangements of chemical bonds. Each atom in an educt molecule thus appears again in a specific position of one of the reaction products. This bijection between educt and product atoms is not reported by chemical reaction databases, however, so that the “Atom Mapping Problem” of finding this bijection is left as an important computational task for many practical applications in computational chemistry and systems biology. Elementary chemical reactions feature a cyclic imaginary transition state (ITS) that imposes additional restrictions on the bijection between educt and product atoms that are not taken into account by previous approaches. We demonstrate that Constraint Programming is well-suited to solving the Atom Mapping Problem in this setting. The performance of our approach is evaluated for a manually curated subset of chemical reactions from the KEGG database featuring various ITS cycle layouts and reaction mechanisms.
Electronic supplementary material
The online version of this article (doi:10.1186/s13015-014-0023-3) contains supplementary material, which is available to authorized users.
doi:10.1186/s13015-014-0023-3
PMCID: PMC4256833  PMID: 25484913
Atom-atom mapping; Constraint programming; Chemical reaction; Imaginary transition state
3.  Dynamic DNA methylation orchestrates cardiomyocyte development, maturation and disease 
Nature Communications  2014;5:5288.
The heart is a highly specialized organ with essential function for the organism throughout life. The significance of DNA methylation in shaping the phenotype of the heart remains only partially known. Here we generate and analyse DNA methylomes from highly purified cardiomyocytes of neonatal, adult healthy and adult failing hearts. We identify large genomic regions that are differentially methylated during cardiomyocyte development and maturation. Demethylation of cardiomyocyte gene bodies correlates strongly with increased gene expression. Silencing of demethylated genes is characterized by the polycomb mark H3K27me3 or by DNA methylation. De novo methylation by DNA methyltransferases 3A/B causes repression of fetal cardiac genes, including essential components of the cardiac sarcomere. Failing cardiomyocytes partially resemble neonatal methylation patterns. This study establishes DNA methylation as a highly dynamic process during postnatal growth of cardiomyocytes and their adaptation to pathological stress in a process tightly linked to gene regulation and activity.
DNA methylation is essential for proper gene expression, development and genome stability. Here the authors present whole-genome DNA methylation analyses of purified mouse cardiomyocytes from newborn, adult and failing hearts and find highly dynamic patterns between the three phenotypes of cardiomyocytes.
doi:10.1038/ncomms6288
PMCID: PMC4220495  PMID: 25335909
4.  Graph-distance distribution of the Boltzmann ensemble of RNA secondary structures 
Background
Large RNA molecules are often composed of multiple functional domains whose spatial arrangement strongly influences their function. Pre-mRNA splicing, for instance, relies on the spatial proximity of the splice junctions that can be separated by very long introns. Similar effects appear in the processing of RNA virus genomes. Albeit a crude measure, the distribution of spatial distances in thermodynamic equilibrium harbors useful information on the shape of the molecule that in turn can give insights into the interplay of its functional domains.
Result
Spatial distance can be approximated by the graph-distance in RNA secondary structure. We show here that the equilibrium distribution of graph-distances between a fixed pair of nucleotides can be computed in polynomial time by means of dynamic programming. While a naïve implementation would yield recursions with a very high time complexity of O(n^6 D^5) for sequence length n and D distinct distance values, it is possible to reduce this to O(n^4) for practical applications in which predominantly small distances are of interest. Further reductions, however, seem to be difficult. Therefore, we introduced sampling approaches that are much easier to implement. They are also theoretically favorable for several real-life applications, in particular since these primarily concern long-range interactions in very large RNA molecules.
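To make the notion of graph-distance concrete, the following minimal sketch computes it for a single, fixed secondary structure. It assumes dot-bracket notation without pseudoknots and treats both backbone bonds and base pairs as edges of length 1; the function names are illustrative and not taken from the paper's software. It does not compute the equilibrium distribution itself; in a sampling approach, one could apply such a function to each sampled structure and collect the resulting distances.

```python
from collections import deque
from typing import Dict


def parse_dot_bracket(structure: str) -> Dict[int, int]:
    """Map each paired position to its partner (0-based indices)."""
    pairs: Dict[int, int] = {}
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            pairs[i] = j
            pairs[j] = i
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def graph_distance(structure: str, i: int, j: int) -> int:
    """Shortest path between nucleotides i and j via breadth-first search."""
    n = len(structure)
    if not (0 <= i < n and 0 <= j < n):
        raise IndexError("nucleotide index out of range")
    pairs = parse_dot_bracket(structure)
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if u == j:
            return dist[u]
        neighbours = [u - 1, u + 1]        # backbone edges
        if u in pairs:
            neighbours.append(pairs[u])    # base-pair edge
        for v in neighbours:
            if 0 <= v < n and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    raise RuntimeError("unreachable: the backbone connects all nucleotides")
```

For a hairpin such as "((((....))))", the two ends are at graph-distance 1 because they form a base pair, whereas in the unfolded chain of the same length they are 11 backbone steps apart.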
Conclusions
The graph-distance distribution can be computed using a dynamic programming approach. Although a crude approximation of reality, our initial results indicate that the graph-distance can be related to the smFRET data. The additional file and the software of our paper are available from http://www.rna.uni-jena.de/RNAgraphdist.html.
doi:10.1186/1748-7188-9-19
PMCID: PMC4181469  PMID: 25285153
Graph-distance; Boltzmann distribution; Partition function; Pre-mRNA splicing; smFRET
5.  CRISPRstrand: predicting repeat orientations to determine the crRNA-encoding strand at CRISPR loci 
Bioinformatics  2014;30(17):i489-i496.
Motivation: The discovery of CRISPR-Cas systems almost 20 years ago rapidly changed our perception of the bacterial and archaeal immune systems. CRISPR loci consist of several repetitive DNA sequences called repeats, inter-spaced by stretches of variable length sequences called spacers. This CRISPR array is transcribed and processed into multiple mature RNA species (crRNAs). A single crRNA is integrated into an interference complex, together with CRISPR-associated (Cas) proteins, to bind and degrade invading nucleic acids. Although existing bioinformatics tools can recognize CRISPR loci by their characteristic repeat-spacer architecture, they generally output CRISPR arrays of ambiguous orientation and thus do not determine the strand from which crRNAs are processed. Knowledge of the correct orientation is crucial for many tasks, including the classification of CRISPR conservation, the detection of leader regions, the identification of target sites (protospacers) on invading genetic elements and the characterization of protospacer-adjacent motifs.
Results: We present a fast and accurate tool to determine the crRNA-encoding strand at CRISPR loci by predicting the correct orientation of repeats based on an advanced machine learning approach. Both the repeat sequence and mutation information were encoded and processed by an efficient graph kernel to learn higher-order correlations. The model was trained and tested on curated data comprising >4500 CRISPRs and yielded a remarkable performance of 0.95 AUC ROC (area under the curve of the receiver operator characteristic). In addition, we show that accurate orientation information greatly improved detection of conserved repeat sequence families and structure motifs. We integrated CRISPRstrand predictions into our CRISPRmap web server of CRISPR conservation and updated the latter to version 2.0.
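For readers unfamiliar with the reported metric, the AUC ROC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the labels and scores in the usage note are made-up toy values and are unrelated to the CRISPRstrand data.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(pos) * len(neg))
```

With labels [1, 1, 0, 0] and scores [0.9, 0.4, 0.5, 0.1], three of the four positive-negative pairs are ordered correctly, so the function returns 0.75. A value of 0.95 therefore means that about 95% of such pairs are ranked correctly.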
Availability: CRISPRmap and CRISPRstrand are available at http://rna.informatik.uni-freiburg.de/CRISPRmap.
Contact: backofen@informatik.uni-freiburg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btu459
PMCID: PMC4147912  PMID: 25161238
6.  Tandem Stem Loops in roX RNAs Act Together to Mediate X Chromosome Dosage Compensation in Drosophila 
Molecular cell  2013;51(2):156-173.
Summary
Dosage compensation in Drosophila is an epigenetic phenomenon utilizing proteins and long noncoding RNAs (lncRNAs) for transcriptional upregulation of the male X chromosome. Here, by using UV crosslinking followed by deep sequencing, we show that two enzymes in the Male-Specific Lethal complex, MLE RNA helicase and MSL2 ubiquitin ligase, bind evolutionarily conserved domains containing tandem stem loops in roX1 and roX2 RNAs in vivo. These domains constitute the minimal RNA unit present in multiple copies in diverse arrangements for nucleation of the MSL complex. MLE binds to these domains with distinct ATP-independent and ATP-dependent behavior. Importantly, we show that different roX RNA domains have overlapping function, since only combinatorial mutations in the tandem stem loops result in severe loss of dosage compensation and consequently male-specific lethality. We propose that repetitive structural motifs in lncRNAs could provide plasticity during multiprotein complex assemblies to ensure efficient targeting in cis or in trans along chromosomes.
doi:10.1016/j.molcel.2013.07.001
PMCID: PMC3804161  PMID: 23870142
7.  BlockClust: efficient clustering and classification of non-coding RNAs from short read RNA-seq profiles 
Bioinformatics  2014;30(12):i274-i282.
Summary: Non-coding RNAs (ncRNAs) play a vital role in many cellular processes such as RNA splicing, translation and gene regulation. However, the vast majority of ncRNAs still have no functional annotation. One prominent approach for putative function assignment is clustering of transcripts according to sequence and secondary structure. However, sequence information is changed by post-transcriptional modifications, and secondary structure is only a proxy for the true 3D conformation of the RNA polymer. A different type of information that does not suffer from these issues and that can be used for the detection of RNA classes is the pattern of processing and its traces in small RNA-seq read data. Here we introduce BlockClust, an efficient approach to detect transcripts with similar processing patterns. We propose a novel way to encode expression profiles in compact discrete structures, which can then be processed using fast graph-kernel techniques. We perform unsupervised clustering and develop family-specific discriminative models; finally, we show that the proposed approach is scalable, accurate and robust across different organisms, tissues and cell lines.
Availability: The whole BlockClust galaxy workflow including all tool dependencies is available at http://toolshed.g2.bx.psu.edu/view/rnateam/blockclust_workflow.
Contact: backofen@informatik.uni-freiburg.de; costa@informatik.uni-freiburg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btu270
PMCID: PMC4058930  PMID: 24931994
8.  MoDPepInt: an interactive web server for prediction of modular domain–peptide interactions 
Bioinformatics  2014;30(18):2668-2669.
Summary: MoDPepInt (Modular Domain Peptide Interaction) is a new easy-to-use web server for the prediction of binding partners for modular protein domains. Currently, we offer models for SH2, SH3 and PDZ domains via the tools SH2PepInt, SH3PepInt and PDZPepInt, respectively. More specifically, our server offers predictions for 51 SH2 human domains and 69 SH3 human domains via single domain models, and predictions for 226 PDZ domains across several species, via 43 multidomain models. All models are based on support vector machines with different kernel functions ranging from polynomial, to Gaussian, to advanced graph kernels. In this way, we model non-linear interactions between amino acid residues. Results were validated on manually curated datasets achieving competitive performance against various state-of-the-art approaches.
Availability and implementation: The MoDPepInt server is available under the URL http://modpepint.informatik.uni-freiburg.de/
Contact: backofen@informatik.uni-freiburg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btu350
PMCID: PMC4155253  PMID: 24872426
9.  Lineage-specific splicing of a brain-enriched alternative exon promotes glioblastoma progression 
The Journal of Clinical Investigation  2014;124(7):2861-2876.
Tissue-specific alternative splicing is critical for the emergence of tissue identity during development, yet the role of this process in malignant transformation is undefined. Tissue-specific splicing involves evolutionarily conserved, alternative exons that represent only a minority of the total alternative exons identified. Many of these conserved exons have functional features that influence signaling pathways to profound biological effect. Here, we determined that lineage-specific splicing of a brain-enriched cassette exon in the membrane-binding tumor suppressor annexin A7 (ANXA7) diminishes endosomal targeting of the EGFR oncoprotein, consequently enhancing EGFR signaling during brain tumor progression. ANXA7 exon splicing was mediated by the ribonucleoprotein PTBP1, which is normally repressed during neuronal development. PTBP1 was highly expressed in glioblastomas due to loss of a brain-enriched microRNA (miR-124) and to PTBP1 amplification. The alternative ANXA7 splicing trait was present in precursor cells, suggesting that glioblastoma cells inherit the trait from a potential tumor-initiating ancestor and that these cells exploit this trait through accumulation of mutations that enhance EGFR signaling. Our data illustrate that lineage-specific splicing of a tissue-regulated alternative exon in a constituent of an oncogenic pathway eliminates tumor suppressor functions and promotes glioblastoma progression. This paradigm may offer a general model as to how tissue-specific regulatory mechanisms can reprogram normal developmental processes into oncogenic ones.
doi:10.1172/JCI68836
PMCID: PMC4071411  PMID: 24865424
10.  MOF-associated complexes ensure stem cell identity and Xist repression 
eLife  2014;3:e02024.
Histone acetyl transferases (HATs) play distinct roles in many cellular processes and are frequently misregulated in cancers. Here, we study the regulatory potential of MYST1-(MOF)-containing MSL and NSL complexes in mouse embryonic stem cells (ESCs) and neuronal progenitors. We find that both complexes influence transcription by targeting promoters and TSS-distal enhancers. In contrast to flies, the MSL complex is not exclusively enriched on the X chromosome, yet it is crucial for mammalian X chromosome regulation as it specifically regulates Tsix, the major repressor of Xist lncRNA. MSL depletion leads to decreased Tsix expression, reduced REX1 recruitment, and consequently, enhanced accumulation of Xist and variable numbers of inactivated X chromosomes during early differentiation. The NSL complex provides additional, Tsix-independent repression of Xist by maintaining pluripotency. MSL and NSL complexes therefore act synergistically by using distinct pathways to ensure a fail-safe mechanism for the repression of X inactivation in ESCs.
DOI: http://dx.doi.org/10.7554/eLife.02024.001
eLife digest
Gene expression is controlled by a complicated network of mechanisms involving a wide range of enzymes and protein complexes. Many of these mechanisms are identical in males and females, but some are not. Female mammals, for example, carry two X chromosomes, whereas males have one X and one Y chromosome. Since the two X chromosomes in females contain essentially the same set of genes, one of them undergoes silencing to prevent the overproduction of certain proteins. This process, which is called X-inactivation, occurs during different stages of development and it must be tightly controlled.
An enzyme called MOF was originally found in flies in two distinct complexes: the male-specific lethal (MSL) complex, which forms only in males, and the non-specific lethal (NSL) complex, which is ubiquitous in both males and females. These complexes are evolutionarily conserved and are also found in mammals. While mammalian MOF is reasonably well understood, the MSL and NSL complexes are not, so Chelmicki, Dündar et al. have used various sequencing techniques, in combination with biochemical experiments, to investigate their roles in embryonic stem cells and neuronal progenitor cells in mice.
These experiments show that MSL and NSL complexes engage in the regulation of thousands of genes. Although the two complexes often show different gene preferences, they often regulate the same cellular processes. The MSL/NSL-dependent regulation of X chromosome inactivation is a prime example of this phenomenon.
The MSL complex reduces the production of an RNA molecule called Xist, which is responsible for the inactivation of one of the two X chromosomes in females. The NSL complex, meanwhile, ensures the production of multiple proteins that are crucial for the development of embryonic stem cells, and are also involved in the repression of X inactivation.
This analysis sheds light on how different complexes can cooperate and complement each other in order to reach the same goal in the cell. The knowledge gained from this study will pave the way towards better understanding of complex processes such as embryonic development, organogenesis and the pathogenesis of disorders like cancer.
DOI: http://dx.doi.org/10.7554/eLife.02024.002
doi:10.7554/eLife.02024
PMCID: PMC4059889  PMID: 24842875
D. melanogaster; epigenetics; chromatin; transcription; acetylation; X inactivation; mouse
11.  CopraRNA and IntaRNA: predicting small RNA targets, networks and interaction domains 
Nucleic Acids Research  2014;42(Web Server issue):W119-W123.
CopraRNA (Comparative prediction algorithm for small RNA targets) is the most recent asset to the Freiburg RNA Tools webserver. It incorporates and extends the functionality of the existing tool IntaRNA (Interacting RNAs) in order to predict targets, interaction domains and consequently the regulatory networks of bacterial small RNA molecules. The CopraRNA prediction results are accompanied by extensive postprocessing methods such as functional enrichment analysis and visualization of interacting regions. Here, we introduce the functionality of the CopraRNA and IntaRNA webservers and give detailed explanations on their postprocessing functionalities. Both tools are freely accessible at http://rna.informatik.uni-freiburg.de.
doi:10.1093/nar/gku359
PMCID: PMC4086077  PMID: 24838564
12.  Comparative analysis of Cas6b processing and CRISPR RNA stability 
RNA Biology  2013;10(5):700-707.
The prokaryotic antiviral defense system CRISPR (clustered regularly interspaced short palindromic repeats)/Cas (CRISPR-associated) employs short crRNAs (CRISPR RNAs) to target invading viral nucleic acids. A short spacer sequence of these crRNAs can be derived from a viral genome and recognizes a recurring attack of a virus via base complementarity. We analyzed the effect of spacer sequences on the maturation of crRNAs of the subtype I-B Methanococcus maripaludis C5 CRISPR cluster. The responsible endonuclease, termed Cas6b, bound non-hydrolyzable repeat RNA as a dimer and mature crRNA as a monomer. Comparative analysis of Cas6b processing of individual spacer-repeat-spacer RNA substrates and crRNA stability revealed the potential influence of spacer sequence and length on these parameters. Correlation of these observations with the variable abundance of crRNAs visualized by deep-sequencing analyses is discussed. Finally, insertion of spacer and repeat sequences with archaeal poly-T termination signals is suggested to be prevented in archaeal CRISPR/Cas systems.
doi:10.4161/rna.23715
PMCID: PMC3737328  PMID: 23392318
CRISPR; Cas6; endonuclease; crRNA; in-line probing; RNA binding; transcription termination
13.  Two CRISPR-Cas systems in Methanosarcina mazei strain Gö1 display common processing features despite belonging to different types I and III 
RNA Biology  2013;10(5):779-791.
The clustered regularly interspaced short palindromic repeats (CRISPR) system represents a highly adaptive and heritable defense system against foreign nucleic acids in bacteria and archaea. We analyzed the two CRISPR-Cas systems in Methanosarcina mazei strain Gö1. Although belonging to different subtypes (I-B and III-B), the leaders and repeats of both loci are nearly identical. Also, despite many point mutations in each array, a common hairpin motif was identified in the repeats by a bioinformatics analysis and in vitro structural probing. The expression and maturation of CRISPR-derived RNAs (crRNAs) were studied in vitro and in vivo. Both respective potential Cas6b-type endonucleases were purified and their activity tested in vitro. Each protein showed significant activity and could cleave both repeats at the same processing site. Cas6b of subtype III-B, however, was significantly more efficient in its cleavage activity compared with Cas6b of subtype I-B. Northern blot and differential RNAseq analyses were performed to investigate in vivo transcription and maturation of crRNAs, revealing generally very low expression of both systems, whereas significant induction at high NaCl concentrations was observed. crRNAs derived proximal to the leader were generally more abundant than distal ones and in vivo processing sites were clarified for both loci, confirming the previously well-established 8 nt 5′ repeat tags. The 3′-ends were more diverse, but generally ended in a prefix of the following repeat sequence (3′-tag). The analysis further revealed a 5′-hydroxy and 3′-phosphate termini architecture of small crRNAs specific for cleavage products of Cas6 endonucleases from type I-E and I-F and type III-B.
doi:10.4161/rna.23928
PMCID: PMC3737336  PMID: 23619576
methanoarchaea; CRISPR-Cas system; immunity of prokaryotes; regulatory RNA; phages; Methanosarcina mazei
14.  Cluster based prediction of PDZ-peptide interactions 
BMC Genomics  2014;15(Suppl 1):S5.
Background
PDZ domains are among the most promiscuous protein recognition modules; they bind short linear peptides and play an important role in cellular signaling. Recently, a few high-throughput techniques (e.g. protein microarray screen, phage display) have been applied to determine the in vitro binding specificity of PDZ domains. Currently, many computational methods are available to predict PDZ-peptide interactions, but they often provide domain-specific models and/or have limited domain coverage.
Results
Here, we composed the largest set of PDZ domains derived from human, mouse, fly and worm proteomes and defined binding models for PDZ domain families to improve the domain coverage and prediction specificity. For that purpose, we first identified a novel set of 138 PDZ families, comprising 548 PDZ domains from the aforementioned organisms, based on efficient clustering according to their sequence identity. For 43 PDZ families, covering 226 PDZ domains with available interaction data, we built specialized models using a support vector machine approach. The advantage of family-wise models is that they can also be used to determine the binding specificity of a newly characterized PDZ domain with sufficient sequence identity to the known families. Since most current experimental approaches provide only positive data, we have to cope with the class imbalance problem. Thus, to enrich the negative class, we introduced a powerful semi-supervised technique to generate high-confidence non-interaction data. We report competitive predictive performance with respect to state-of-the-art approaches.
Conclusions
Our approach makes several contributions. First, we show that domain coverage can be increased by applying an accurate clustering technique. Second, we developed an approach based on a semi-supervised strategy to obtain high-confidence negative data. Third, we allowed high-order correlations between the amino acid positions in the binding peptides. Fourth, our method is general enough to be easily applicable to other peptide recognition modules such as SH2 domains. Finally, we performed a genome-wide prediction for 101 human and 102 mouse PDZ domains and uncovered novel interactions with biological relevance. We make all the predictive models and genome-wide predictions freely available to the scientific community.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-S1-S5) contains supplementary material, which is available to authorized users.
doi:10.1186/1471-2164-15-S1-S5
PMCID: PMC4046824  PMID: 24564547
PDZ domain-peptide interactions; protein recognition modules; protein domain clustering; semi-supervised learning; support vector machines
15.  A Complex of Cas Proteins 5, 6, and 7 Is Required for the Biogenesis and Stability of Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)-derived RNAs (crRNAs) in Haloferax volcanii* 
The Journal of Biological Chemistry  2014;289(10):7164-7177.
Background: The Cas6 protein is required for generating crRNAs in CRISPR-Cas I and III systems.
Results: The Cas6 protein is necessary for crRNA production but not sufficient for crRNA maintenance in Haloferax.
Conclusion: A Cascade-like complex is required in the type I-B system for a stable crRNA population.
Significance: The CRISPR-Cas system I-B has a Cascade complex similar to those of types I-A and I-E.
The clustered regularly interspaced short palindromic repeats/CRISPR-associated (CRISPR-Cas) system is a prokaryotic defense mechanism against foreign genetic elements. A plethora of CRISPR-Cas versions exist, with more than 40 different Cas protein families and several different molecular approaches to fight the invading DNA. One of the key players in the system is the CRISPR-derived RNA (crRNA), which directs the invader-degrading Cas protein complex to the invader. The CRISPR-Cas types I and III use the Cas6 protein to generate mature crRNAs. Here, we show that the Cas6 protein is necessary for crRNA production but that additional Cas proteins that form a CRISPR-associated complex for antiviral defense (Cascade)-like complex are needed for crRNA stability in the CRISPR-Cas type I-B system in Haloferax volcanii in vivo. Deletion of the cas6 gene results in the loss of mature crRNAs and interference. However, cells that have the complete cas gene cluster (cas1–8b) removed and are transformed with the cas6 gene are not able to produce and stably maintain mature crRNAs. crRNA production and stability are rescued only if cas5, -6, and -7 are present. Mutational analysis of the cas6 gene reveals three amino acids (His-41, Gly-256, and Gly-258) that are essential for pre-crRNA cleavage, whereas the mutation of two amino acids (Ser-115 and Ser-224) leads to an increase of crRNA amounts. This is the first systematic in vivo analysis of Cas6 protein variants. In addition, we show that the H. volcanii I-B system contains a Cascade-like complex with a Cas7, Cas5, and Cas6 core that protects the crRNA.
doi:10.1074/jbc.M113.508184
PMCID: PMC3945376  PMID: 24459147
Archaea; Microbiology; Molecular Biology; Molecular Genetics; Protein Complexes; CRISPR/Cas; Cas6; Haloferax volcanii; crRNA; Type I-B
16.  GraphProt: modeling binding preferences of RNA-binding proteins 
Genome Biology  2014;15(1):R17.
We present GraphProt, a computational framework for learning sequence- and structure-binding preferences of RNA-binding proteins (RBPs) from high-throughput experimental data. We benchmark GraphProt, demonstrating that the modeled binding preferences conform to the literature, and showcase the biological relevance and two applications of GraphProt models. First, estimated binding affinities correlate with experimental measurements. Second, predicted Ago2 targets display higher levels of expression upon Ago2 knockdown, whereas control targets do not. Computational binding models, such as those provided by GraphProt, are essential for predicting RBP binding sites and affinities in all tissues. GraphProt is freely available at http://www.bioinf.uni-freiburg.de/Software/GraphProt.
doi:10.1186/gb-2014-15-1-r17
PMCID: PMC4053806  PMID: 24451197
17.  Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources 
Motivation
The identification of protein and gene names (PGNs) from the scientific literature requires semantic resources: terminological and lexical resources deliver the term candidates into PGN tagging solutions and the gold standard corpora (GSC) train them to identify term parameters and contextual features. Ideally all three resources, i.e. corpora, lexica and taggers, cover the same domain knowledge, and thus support identification of the same types of PGNs and cover all of them. Unfortunately, none of the three serves as a predominant standard, and for this reason it is worth exploring how these three resources comply with each other. We systematically compare different PGN taggers against publicly available corpora and analyze the impact of the included lexical resource on their performance. In particular, we determine the performance gains through false positive filtering, which contributes to the disambiguation of identified PGNs.
Results
In general, machine learning approaches (ML-Tag) for PGN tagging show higher F1-measure performance against the BioCreative-II and Jnlpba GSCs (exact matching), whereas the lexicon-based approaches (LexTag) in combination with disambiguation methods show better results on FsuPrge and PennBio. The ML-Tag solutions balance precision and recall, whereas the LexTag solutions have different precision and recall profiles at the same F1-measure across all corpora. Higher recall is achieved with larger lexical resources, which also introduce more noise (false positive results). The ML-Tag solutions perform best if the test corpus is from the same GSC as the training corpus. As expected, the false negative errors characterize the test corpora, whereas the profiles of the false positive mistakes characterize the tagging solutions. LexTag solutions that are based on a large terminological resource in combination with false positive filtering produce better results and, in contrast to ML-Tag solutions, also provide concept identifiers from a knowledge source.
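The observation that taggers can have different precision and recall profiles at the same F1-measure can be illustrated with a small sketch. It assumes exact matching of annotated spans, with each annotation represented as a hypothetical (start, end) tuple; the spans in the usage note are toy values and not data from the study.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) character offsets


def precision_recall_f1(gold: Iterable[Span],
                        predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 for two sets of annotated spans."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)    # spans found by both
    fp = len(pred_set - gold_set)    # predicted but not in the gold standard
    fn = len(gold_set - pred_set)    # in the gold standard but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Given four gold spans, a cautious tagger that returns only two of them (precision 1.0, recall 0.5) and a greedy tagger that returns all four plus four spurious spans (precision 0.5, recall 1.0) both reach an F1-measure of 2/3, although their error profiles differ.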
Conclusion
The standard ML-Tag solutions achieve high performance, but not across all corpora, and thus should be trained using several different corpora to reduce possible biases. The LexTag solutions have different profiles for their precision and recall performance, but with similar F1-measure. This result is surprising and suggests that they cover a portion of the most common naming standards, but cope differently with the term variability across the corpora. The false positive filtering applied to LexTag solutions does improve the results by increasing their precision without significantly compromising their recall. The harmonisation of the annotation schemes in combination with standardized lexical resources in the tagging solutions will enable their comparability and will pave the way for a shared standard.
doi:10.1186/2041-1480-4-28
PMCID: PMC4021975  PMID: 24112383
18.  Evaluation and Cross-Comparison of Lexical Entities of Biological Interest (LexEBI) 
PLoS ONE  2013;8(10):e75185.
Motivation
Biomedical entities, their identifiers and names, are essential in the representation of biomedical facts and knowledge. In the same way, the complete set of biomedical and chemical terms, i.e. the biomedical “term space” (the “Lexeome”), forms a key resource to achieve the full integration of the scientific literature with biomedical data resources: any identified named entity can immediately be normalized to the correct database entry. This goal does not only require that we are aware of all existing terms, but would also profit from knowing all their senses and their semantic interpretation (ambiguities, nestedness).
Result
This study compiles a resource for lexical terms of biomedical interest in a standard format (called “LexEBI”), and determines the overall number of terms, their reuse in different resources and the nestedness of terms. LexEBI comprises references for protein and gene entries and their term variants, and chemical entities, amongst other terms. In addition, disease terms have been identified from Medline and PubmedCentral and added to LexEBI. Our analysis demonstrates that the baseforms of terms from the different semantic types show little polysemous use. Nonetheless, the term variants of protein and gene names (PGNs) frequently contain species mentions, which should have been avoided according to protein annotation guidelines. Furthermore, both the protein and gene entities and the chemical entities comprise enzymes, leading to hierarchical polysemy, and a large portion of PGNs make reference to a chemical entity. Altogether, according to our analysis based on the Medline distribution, 401,869 unique PGNs in the documents contain a reference to 25,022 chemical entities, 3,125 disease terms or 1,576 species mentions.
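One way to picture the nestedness analysis is a check for whether a PGN contains a term of another semantic type as a contiguous run of tokens. The sketch below is an assumption about how such a check could look, not the procedure used in the study; the toy lexicon, the example terms and the function names are all hypothetical.

```python
def tokens(term: str) -> list[str]:
    """Lower-case whitespace tokenization (a deliberate simplification)."""
    return term.lower().split()


def contains_term(outer: str, inner: str) -> bool:
    """True if the tokens of `inner` occur contiguously inside `outer`."""
    o, i = tokens(outer), tokens(inner)
    n = len(i)
    return n > 0 and any(o[k:k + n] == i for k in range(len(o) - n + 1))


def nested_mentions(pgn_terms, lexicon):
    """Map each PGN to the (semantic type, term) pairs nested inside it."""
    result = {}
    for pgn in pgn_terms:
        hits = [(sem_type, term)
                for sem_type, terms in lexicon.items()
                for term in terms
                if contains_term(pgn, term)]
        if hits:
            result[pgn] = hits
    return result


# Toy data, for illustration only.
lexicon = {"species": ["human", "mouse"], "chemical": ["glucose", "heme"]}
pgns = ["human insulin receptor", "glucose transporter 1", "tumor protein p53"]

nested = nested_mentions(pgns, lexicon)
print(nested)
```

Matching on whole tokens rather than raw substrings avoids counting accidental hits inside longer words, at the price of missing variants that differ in hyphenation or spelling.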
Conclusion
LexEBI delivers the complete biomedical and chemical Lexeome in a standardized representation (http://www.ebi.ac.uk/Rebholz-srv/LexEBI/). The resource provides the disease terms as open source content, and fully interlinks terms across resources.
doi:10.1371/journal.pone.0075185
PMCID: PMC3790750  PMID: 24124474
19.  CRISPRmap: an automated classification of repeat conservation in prokaryotic adaptive immune systems 
Nucleic Acids Research  2013;41(17):8034-8044.
Central to Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR)-Cas systems are repeated RNA sequences that serve as Cas-protein–binding templates. Classification is based on the architectural composition of associated Cas proteins, but considering repeat evolution is essential to complete the picture. We compiled the largest data set of CRISPRs to date, performed comprehensive, independent clustering analyses and identified a novel set of 40 conserved sequence families and 33 potential structure motifs for Cas-endoribonucleases with some distinct conservation patterns. Evolutionary relationships are presented as a hierarchical map of sequence and structure similarities for both a quick and detailed insight into the diversity of CRISPR-Cas systems. In a comparison with Cas-subtypes, I-C, I-E, I-F and type II were strongly coupled and the remaining type I and type III subtypes were loosely coupled to repeat and Cas1 evolution, respectively. Subtypes with a strong link to CRISPR evolution were almost exclusive to bacteria; nevertheless, we identified rare examples of potential horizontal transfer of I-C and I-E systems into archaeal organisms. Our easy-to-use web server provides an automated assignment of newly sequenced CRISPRs to our classification system and enables more informed choices on future hypotheses in CRISPR-Cas research: http://rna.informatik.uni-freiburg.de/CRISPRmap.
doi:10.1093/nar/gkt606
PMCID: PMC3783184  PMID: 23863837
20.  A graph kernel approach for alignment-free domain–peptide interaction prediction with an application to human SH3 domains 
Bioinformatics  2013;29(13):i335-i343.
Motivation: State-of-the-art experimental data for determining binding specificities of peptide recognition modules (PRMs) are obtained by high-throughput approaches like peptide arrays. Most prediction tools applicable to this kind of data are based on an initial multiple alignment of the peptide ligands. Building an initial alignment can be error-prone, especially in the case of the proline-rich peptides bound by the SH3 domains.
Results: Here, we present a machine-learning approach based on an efficient graph-kernel technique to predict the specificity of a large set of 70 human SH3 domains, which are an important class of PRMs. The graph-kernel strategy allows us to (i) integrate several types of physico-chemical information for each amino acid, (ii) consider high-order correlations between these features and (iii) eliminate the need for an initial peptide alignment. We build specialized models for each human SH3 domain and achieve competitive predictive performance of 0.73 area under precision-recall curve, compared with 0.27 area under precision-recall curve for state-of-the-art methods based on position weight matrices.
We show that better models can be obtained when we use information on the noninteracting peptides (negative examples), which is currently not used by the state-of-the-art approaches based on position weight matrices. To this end, we analyze two strategies to identify subsets of high confidence negative data.
The techniques introduced here are more general and hence can also be used for any other protein domains, which interact with short peptides (i.e. other PRMs).
Availability: The program with the predictive models can be found at http://www.bioinf.uni-freiburg.de/Software/SH3PepInt/SH3PepInt.tar.gz. We also provide a genome-wide prediction for all 70 human SH3 domains, which can be found under http://www.bioinf.uni-freiburg.de/Software/SH3PepInt/Genome-Wide-Predictions.tar.gz.
Contact: backofen@informatik.uni-freiburg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt220
PMCID: PMC3694653  PMID: 23813002
21.  Semi-Supervised Prediction of SH2-Peptide Interactions from Imbalanced High-Throughput Data 
PLoS ONE  2013;8(5):e62732.
Src homology 2 (SH2) domains are the largest family of the peptide-recognition modules (PRMs) that bind to phosphotyrosine-containing peptides. Knowledge about binding partners of SH2 domains is key for a deeper understanding of different cellular processes. Given the high binding specificity of SH2, in-silico ligand peptide prediction is of great interest. Currently, however, only a few approaches have been published for the prediction of SH2-peptide interactions. Their main shortcomings range from limited coverage, to restrictive modeling assumptions (they are mainly based on position specific scoring matrices and do not take into consideration complex amino acid inter-dependencies) and high computational complexity. We propose a simple yet effective machine learning approach for a large set of known human SH2 domains. We used comprehensive data from micro-array and peptide-array experiments on 51 human SH2 domains. In order to deal with the high data imbalance problem and the high signal-to-noise ratio, we cast the problem in a semi-supervised setting. We report competitive predictive performance with respect to the state of the art. Specifically, we obtain 0.83 AUC ROC and 0.93 AUC PR in comparison to 0.71 AUC ROC and 0.87 AUC PR previously achieved by the SMALI approach, which is based on position specific scoring matrices (PSSMs). Our work provides three main contributions. First, we showed that better models can be obtained when the information on the non-interacting peptides (negative examples) is also used. Second, we improve performance when considering high order correlations between the ligand positions, employing regularization techniques to effectively avoid overfitting issues. Third, we developed an approach to tackle the data imbalance problem using a semi-supervised strategy. Finally, we performed a genome-wide prediction of human SH2-peptide binding, uncovering several findings of biological relevance.
We make our models and genome-wide predictions, for all the 51 SH2-domains, freely available to the scientific community under the following URLs: http://www.bioinf.uni-freiburg.de/Software/SH2PepInt/SH2PepInt.tar.gz and http://www.bioinf.uni-freiburg.de/Software/SH2PepInt/Genome-wide-predictions.tar.gz, respectively.
doi:10.1371/journal.pone.0062732
PMCID: PMC3656881  PMID: 23690949
22.  LocARNAscan: Incorporating thermodynamic stability in sequence and structure-based RNA homology search 
Background
The search for distant homologs has become an important issue in genome annotation. A particular difficulty is posed by divergent homologs that have lost recognizable sequence similarity. This same problem also arises in the recognition of novel members of large classes of RNAs such as snoRNAs or microRNAs that consist of families unrelated by common descent. Current homology search tools for structured RNAs are either based entirely on sequence similarity (such as blast or hmmer) or combine sequence and secondary structure. The most prominent example of the latter class of tools is Infernal. Alternatives are descriptor-based methods. In most practical applications published to date, however, the information contained in covariance models or manually prescribed search patterns is dominated by sequence information. Here we ask two related questions: (1) Is secondary structure alone informative for homology search and the detection of novel members of RNA classes? (2) To what extent is the thermodynamic propensity of the target sequence to fold into the correct secondary structure helpful for this task?
Results
Sequence-structure alignment can be used as an alternative search strategy. In this scenario, the query consists of a base pairing probability matrix, which can be derived either from a single sequence or from a multiple alignment representing a set of known representatives. Sequence information can be optionally added to the query. The target sequence is pre-processed to obtain local base pairing probabilities. As a search engine we devised a semi-global scanning variant of LocARNA’s algorithm for sequence-structure alignment. The LocARNAscan tool is optimized for speed and low memory consumption. In benchmarking experiments on artificial data we observe that the inclusion of thermodynamic stability is helpful, albeit only in a regime of extremely low sequence information in the query. We observe, furthermore, that the sensitivity is bounded in particular by the limited accuracy of the predicted local structures of the target sequence.
Conclusions
Although we demonstrate that a purely structure-based homology search is feasible in principle, it is unlikely to outperform tools such as Infernal in most application scenarios, where a substantial amount of sequence information is typically available. The LocARNAscan approach will profit, however, from high throughput methods to determine RNA secondary structure. In transcriptome-wide applications, such methods will provide accurate structure annotations on the target side.
Availability
Source code of the free software LocARNAscan 1.0 and supplementary data are available at http://www.bioinf.uni-leipzig.de/Software/LocARNAscan.
doi:10.1186/1748-7188-8-14
PMCID: PMC3716875  PMID: 23601347
23.  Essential requirements for the detection and degradation of invaders by the Haloferax volcanii CRISPR/Cas system I-B 
RNA Biology  2013;10(5):865-874.
To fend off foreign genetic elements, prokaryotes have developed several defense systems. The most recently discovered defense system, CRISPR/Cas, is sequence-specific, adaptive and heritable. The two central components of this system are the Cas proteins and the CRISPR RNA. The latter consists of repeat sequences that are interspersed with spacer sequences. The CRISPR locus is transcribed into a precursor RNA that is subsequently processed into short crRNAs. CRISPR/Cas systems have been identified in bacteria and archaea, and data show that many variations of this system exist. We analyzed the requirements for a successful defense reaction in the halophilic archaeon Haloferax volcanii. Haloferax encodes a CRISPR/Cas system of the I-B subtype, about which very little is known. Analysis of the mature crRNAs revealed that they contain a spacer as their central element, which is preceded by an eight-nucleotide-long 5′ handle that originates from the upstream repeat. The repeat sequences have the potential to fold into a minimal stem loop. Sequencing of the crRNA population indicated that not all of the spacers that are encoded by the three CRISPR loci are present in the same abundance. By challenging Haloferax with an invader plasmid, we demonstrated that the interaction of the crRNA with the invader DNA requires a 10-nucleotide-long seed sequence. In addition, we found that not all of the crRNAs from the three CRISPR loci are effective at triggering the degradation of invader plasmids. The interference does not seem to be influenced by the copy number of the invader plasmid.
doi:10.4161/rna.24282
PMCID: PMC3737343  PMID: 23594992
archaea; Haloferax volcanii; CRISPR/Cas; crRNA; PAM; seed sequence
24.  CRISPR-Cas Systems in the Cyanobacterium Synechocystis sp. PCC6803 Exhibit Distinct Processing Pathways Involving at Least Two Cas6 and a Cmr2 Protein 
PLoS ONE  2013;8(2):e56470.
The CRISPR-Cas (Clustered Regularly Interspaced Short Palindromic Repeats – CRISPR-associated proteins) system provides adaptive immunity in archaea and bacteria. A hallmark of CRISPR-Cas is the involvement of short crRNAs that guide associated proteins in the destruction of invading DNA or RNA. We present three fundamentally distinct processing pathways in the cyanobacterium Synechocystis sp. PCC6803 for a subtype I-D (CRISPR1), and two type III systems (CRISPR2 and CRISPR3), which are located together on the plasmid pSYSA. Using high-throughput transcriptome analyses and assays of transcript accumulation we found all CRISPR loci to be highly expressed, but the individual crRNAs had profoundly varying abundances despite single transcription start sites for each array. In a computational analysis, CRISPR3 spacers with stable secondary structures displayed a greater ratio of degradation products. These structures might interfere with the loading of the crRNAs into RNP complexes, explaining the varying abundances. The maturation of CRISPR1 and CRISPR2 transcripts depends on at least two different Cas6 proteins. Mutation of gene sll7090, encoding a Cmr2 protein, led to the disappearance of all CRISPR3-derived crRNAs, providing in vivo evidence for a function of Cmr2 in the maturation, regulation of expression, Cmr complex formation or stabilization of CRISPR3 transcripts. Finally, we optimized CRISPR repeat structure prediction and the results indicate that the spacer context can influence individual repeat structures.
doi:10.1371/journal.pone.0056470
PMCID: PMC3575380  PMID: 23441196
25.  Navigating the unexplored seascape of pre-miRNA candidates in single-genome approaches 
Bioinformatics  2012;28(23):3034-3041.
Motivation: The computational search for novel microRNA (miRNA) precursors often involves some sort of structural analysis with the aim of identifying which types of structures are prone to being recognized and processed by the cellular miRNA-maturation machinery. A natural way to tackle this problem is to perform clustering over the candidate structures along with known miRNA precursor structures. Mixed clusters then allow the identification of candidates that are similar to known precursors. Given the large number of pre-miRNA candidates that can be identified in single-genome approaches, even after applying several filters for precursor robustness and stability, a conventional structural clustering approach is infeasible.
Results: We propose a method to represent candidate structures in a feature space, which summarizes key sequence/structure characteristics of each candidate. We demonstrate that proximity in this feature space is related to sequence/structure similarity, and we select candidates that have a high similarity to known precursors. Additional filtering steps are then applied to further reduce the number of candidates to those with greater transcriptional potential. Our method is compared with another single-genome method (TripletSVM) in two datasets, showing better performance in one and comparable performance in the other, for larger training sets. Additionally, we show that our approach allows for a better interpretation of the results.
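The idea of selecting candidates by proximity to known precursors in a feature space can be sketched as a nearest-neighbour distance filter. This is only an illustration under stated assumptions: the abstract does not specify the features, the distance measure or the threshold, so the two-dimensional feature vectors, the Euclidean distance and the cutoff `max_dist` below are hypothetical, and the actual MinDist tool is implemented in Perl.

```python
import math


def min_distance(candidate, known_precursors):
    """Euclidean distance from a candidate to its nearest known precursor."""
    return min(math.dist(candidate, known) for known in known_precursors)


def select_candidates(candidates, known_precursors, max_dist):
    """Keep candidates whose nearest known precursor lies within max_dist."""
    selected = {}
    for name, features in candidates.items():
        d = min_distance(features, known_precursors)
        if d <= max_dist:
            selected[name] = d
    return selected


# Hypothetical feature vectors (e.g. two normalized sequence/structure features).
known = [(0.30, 0.50), (0.35, 0.45)]
candidates = {"cand_a": (0.32, 0.50), "cand_b": (0.90, 0.10)}

selected = select_candidates(candidates, known, max_dist=0.1)
print(selected)
```

Comparing each candidate only against the set of known precursors scales with the product of the two set sizes, whereas clustering all candidates against each other scales with the square of the candidate count; that difference is presumably why a feature-space filter stays tractable where conventional structural clustering does not.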
Availability and Implementation: The MinDist method is implemented using Perl scripts and is freely available at http://www.cravela.org/?mindist=1.
Contact: backofen@informatik.uni-freiburg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts574
PMCID: PMC3516144  PMID: 23052038
