1.  RC3H1 post-transcriptionally regulates A20 mRNA and modulates the activity of the IKK/NF-κB pathway 
Nature Communications  2015;6:7367.
The RNA-binding protein RC3H1 (also known as ROQUIN) promotes TNFα mRNA decay via a 3′UTR constitutive decay element (CDE). Here we applied PAR-CLIP to human RC3H1 to identify ∼3,800 mRNA targets with >16,000 binding sites. A large number of sites are distinct from the consensus CDE and revealed a structure-sequence motif with U-rich sequences embedded in hairpins. RC3H1 binds preferentially short-lived and DNA damage-induced mRNAs, indicating a role of this RNA-binding protein in the post-transcriptional regulation of the DNA damage response. Intriguingly, RC3H1 affects expression of the NF-κB pathway regulators such as IκBα and A20. RC3H1 uses ROQ and Zn-finger domains to contact a binding site in the A20 3′UTR, demonstrating a not yet recognized mode of RC3H1 binding. Knockdown of RC3H1 resulted in increased A20 protein expression, thereby interfering with IκB kinase and NF-κB activities, demonstrating that RC3H1 can modulate the activity of the IKK/NF-κB pathway.
The RNA-binding protein RC3H1/ROQUIN1 promotes the degradation of mRNA by binding to a consensus CDE present in the 3′UTR. Here the authors expand the set of consensus sequences through which RC3H1 binds and regulates mRNA encoding members of the DNA damage response and IKK/NF-κB pathway.
PMCID: PMC4510711  PMID: 26170170
2.  Simultaneous Alignment and Folding of Protein Sequences 
Journal of Computational Biology  2014;21(7):477-491.
Accurate comparative analysis tools for low-homology proteins remain a difficult challenge in computational biology, especially for sequence alignment and consensus folding problems. We present partiFold-Align, the first algorithm for simultaneous alignment and consensus folding of unaligned protein sequences; the algorithm's complexity is polynomial in time and space. Algorithmically, partiFold-Align exploits sparsity in the set of super-secondary structure pairings and alignment candidates to achieve an effectively cubic running time for simultaneous pairwise alignment and folding. We demonstrate the efficacy of these techniques on transmembrane β-barrel proteins, an important yet difficult class of proteins with few known three-dimensional structures. Testing against structurally derived sequence alignments, partiFold-Align significantly outperforms state-of-the-art pairwise and multiple sequence alignment tools in the most difficult low-sequence homology case. It also improves secondary structure prediction where current approaches fail. Importantly, partiFold-Align requires no prior training. These general techniques are widely applicable to many more protein families (partiFold-Align is available at
PMCID: PMC4082353  PMID: 24766258
Neuro-Oncology  2014;16(Suppl 3):iii19-iii20.
BACKGROUND: Tissue-specific alternative splicing is known to be critical to emergence of tissue identity during development, yet its role in malignant transformation is undefined. Tissue-specific splicing involves evolutionarily conserved, alternative exons, which represent only a minority of total alternative exons. Many, however, have functional features that influence activity in signaling pathways to profound biological effect. Given that tissue-specific splicing has a determinative role in brain development and the enrichment of genes containing tissue-specific exons for proteins with roles in signaling and development, it is thus plausible that changes in such exons could rewire normal neurogenesis towards malignant transformation. METHODS: We used integrated molecular genetic and cell biology analyses, computational biology, animal modeling, and clinical patient profiles to characterize the effect of aberrant splicing of a brain-enriched alternative exon in the membrane-binding tumor suppressor Annexin A7 (ANXA7) on oncogene regulation and brain tumorigenesis. RESULTS: We show that aberrant splicing of a tissue-specific cassette exon in ANXA7 diminishes endosomal targeting and consequent termination of the signal of the EGFR oncoprotein during brain tumorigenesis. Splicing of this exon is mediated by the ribonucleoprotein Polypyrimidine Tract-Binding Protein 1 (PTBP1), which is normally repressed during brain development but, we find, is excessively expressed in glioblastomas through either gene amplification or loss of a neuron-specific microRNA, miR-124. Silencing of PTBP1 attenuates both malignancy and angiogenesis in a stem cell-derived glioblastoma animal model characterized by a high native propensity to generate tumor endothelium or vascular pericytes to support tumor growth. We show that EGFR amplification and PTBP1 overexpression portend a similarly poor clinical outcome, further highlighting the importance of PTBP1-mediated activation of EGFR.
CONCLUSIONS: Our data illustrate how anomalous splicing of a tissue-regulated exon in a constituent of an oncogenic signaling pathway eliminates its tumor suppressor function and promotes tumorigenesis. This paradigm of malignant glial transformation as a consequence of tissue-specific alternative exon splicing in a tumor suppressor may have widespread applicability in explaining how changes in critical tissue-specific regulatory mechanisms reprogram normal development to oncogenesis.
PMCID: PMC4144541
4.  Bioinformatics of prokaryotic RNAs 
RNA Biology  2014;11(5):470-483.
The genome of most prokaryotes gives rise to surprisingly complex transcriptomes, comprising not only protein-coding mRNAs, often organized as operons, but also dozens or even hundreds of highly structured small regulatory RNAs and unexpectedly large levels of anti-sense transcripts. Comprehensive surveys of prokaryotic transcriptomes and the need to characterize also their non-coding components are heavily dependent on computational methods and workflows, many of which have been developed or at least adapted specifically for use with bacterial and archaeal data. This review provides an overview of the state of the art of RNA bioinformatics, focusing on applications to prokaryotes.
PMCID: PMC4152356  PMID: 24755880
RNA bioinformatics; RNA–RNA interaction; TSS annotation; gene finding; secondary structure prediction; target prediction
5.  SPARSE: quadratic time simultaneous alignment and folding of RNAs without sequence-based heuristics 
Bioinformatics  2015;31(15):2489-2496.
Motivation: RNA-Seq experiments have revealed a multitude of novel ncRNAs. The gold standard for their analysis based on simultaneous alignment and folding suffers from extreme time complexity of O(n^6). Subsequently, numerous faster ‘Sankoff-style’ approaches have been suggested. Commonly, the performance of such methods relies on sequence-based heuristics that restrict the search space to optimal or near-optimal sequence alignments; however, the accuracy of sequence-based methods breaks down for RNAs with sequence identities below 60%. Alignment approaches like LocARNA that do not require sequence-based heuristics have been limited to high complexity (≥ quartic time).
Results: Breaking this barrier, we introduce the novel Sankoff-style algorithm ‘sparsified prediction and alignment of RNAs based on their structure ensembles (SPARSE)’, which runs in quadratic time without sequence-based heuristics. To achieve this low complexity, on par with sequence alignment algorithms, SPARSE features strong sparsification based on structural properties of the RNA ensembles. Following PMcomp, SPARSE gains further speed-up from lightweight energy computation. Although all existing lightweight Sankoff-style methods restrict Sankoff’s original model by disallowing loop deletions and insertions, SPARSE transfers the Sankoff algorithm to the lightweight energy model completely for the first time. Compared with LocARNA, SPARSE achieves similar alignment and better folding quality in significantly less time (speedup: 3.7). At similar run-time, it aligns low sequence identity instances substantially more accurately than RAF, which uses sequence-based heuristics.
Availability and implementation: SPARSE is freely available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4514930  PMID: 25838465
6.  Cell type specific gene expression analysis of prostate needle biopsies resolves tumor tissue heterogeneity 
Oncotarget  2014;6(2):1302-1314.
A lack of cell surface markers for the specific identification, isolation and subsequent analysis of living prostate tumor cells hampers progress in the field. Specific characterization of tumor cells and their microenvironment in a multi-parameter molecular assay could significantly improve prognostic accuracy for the heterogeneous prostate tumor tissue. Novel functionalized gold nanoparticles allow fluorescence-based detection of absolute mRNA expression levels in living cells by fluorescent activated flow cytometry (FACS). We use this technique to separate prostate tumor and benign cells in human prostate needle biopsies based on the expression levels of the tumor marker alpha-methylacyl-CoA racemase (AMACR). We combined RNA and protein detection of living cells by FACS to gate for epithelial cell adhesion molecule (EPCAM) positive tumor and benign cells, EPCAM/CD45 double negative mesenchymal cells and CD45 positive infiltrating lymphocytes. EPCAM positive epithelial cells were further sub-gated into AMACR high and low expressing cells. Two hundred cells from each population and several biopsies from the same patient were analyzed using a multiplexed gene expression profile to generate a cell type resolved profile of the specimen. This technique provides the basis for the clinical evaluation of cell type resolved gene expression profiles as pre-therapeutic prognostic markers for prostate cancer.
PMCID: PMC4359234  PMID: 25514598
prostate cancer; RNA detection; living cells; needle biopsy; gene expression; tumor heterogeneity
7.  ExpaRNA-P: simultaneous exact pattern matching and folding of RNAs 
BMC Bioinformatics  2014;15(1):404.
Identifying sequence-structure motifs common to two RNAs can speed up the comparison of structural RNAs substantially. The core algorithm of the existent approach ExpaRNA solves this problem for a priori known input structures. However, such structures are rarely known; moreover, predicting them computationally is no rescue, since single sequence structure prediction is highly unreliable.
The novel algorithm ExpaRNA-P computes exactly matching sequence-structure motifs in entire Boltzmann-distributed structure ensembles of two RNAs; thereby we match and fold RNAs simultaneously, analogous to the well-known “simultaneous alignment and folding” of RNAs. While this implies much higher flexibility compared to ExpaRNA, ExpaRNA-P has the same very low complexity (quadratic in time and space), which is enabled by its novel structure ensemble-based sparsification. Furthermore, we devise a generalized chaining algorithm to compute compatible subsets of ExpaRNA-P’s sequence-structure motifs. Resulting in the very fast RNA alignment approach ExpLoc-P, we utilize the best chain as anchor constraints for the sequence-structure alignment tool LocARNA. ExpLoc-P is benchmarked in several variants and versus state-of-the-art approaches. In particular, we formally introduce and evaluate strict and relaxed variants of the problem; the latter makes the approach sensitive to compensatory mutations. Across a benchmark set of typical non-coding RNAs, ExpLoc-P has similar accuracy to LocARNA but is four times faster (in both variants), while it achieves a speed-up over 30-fold for the longest benchmark sequences (≈400nt). Finally, different ExpLoc-P variants enable tailoring of the method to specific application scenarios. ExpaRNA-P and ExpLoc-P are distributed as part of the LocARNA package. The source code is freely available at
ExpaRNA-P’s novel ensemble-based sparsification reduces its complexity to quadratic time and space. Thereby, ExpaRNA-P significantly speeds up sequence-structure alignment while maintaining the alignment quality. Different ExpaRNA-P variants support a wide range of applications.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0404-0) contains supplementary material, which is available to authorized users.
PMCID: PMC4302096  PMID: 25551362
RNA bioinformatics; Structure-based comparison of RNA; Sparsification
8.  An Active Immune Defense with a Minimal CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) RNA and without the Cas6 Protein
The Journal of Biological Chemistry  2014;290(7):4192-4201.
Background: CRISPR RNAs (crRNAs) are generated by Cas6b in type I-B systems. They are essential for the interference reaction.
Results: An icrRNA is generated independently from Cas6b and functions like a crRNA.
Conclusion: In the presence of an icrRNA, Cas6b is not required for the interference reaction.
Significance: This setup allows the Cas6b-independent generation of icrRNAs and thereby interference without Cas6b.
The prokaryotic immune system CRISPR-Cas (clustered regularly interspaced short palindromic repeats-CRISPR-associated) is a defense system that protects prokaryotes against foreign DNA. The short CRISPR RNAs (crRNAs) are central components of this immune system. In CRISPR-Cas systems type I and III, crRNAs are generated by the endonuclease Cas6. We developed a Cas6b-independent crRNA maturation pathway for the Haloferax type I-B system in vivo that expresses a functional crRNA, which we termed independently generated crRNA (icrRNA). The icrRNA is effective in triggering degradation of an invader plasmid carrying the matching protospacer sequence. The Cas6b-independent maturation of the icrRNA allowed mutation of the repeat sequence without interfering with signals important for Cas6b processing. We generated 23 variants of the icrRNA and analyzed them for activity in the interference reaction. icrRNAs with deletions or mutations of the 3′ handle are still active in triggering an interference reaction. The complete 3′ handle could be removed without loss of activity. However, manipulations of the 5′ handle mostly led to loss of interference activity. Furthermore, we could show that in the presence of an icrRNA a strain without Cas6b (Δcas6b) is still active in interference.
PMCID: PMC4326828  PMID: 25512373
Archaea; Cas6; CRISPR/Cas; crRNA; Haloferax volcanii; Type I-B
9.  Atom mapping with constraint programming 
Chemical reactions are rearrangements of chemical bonds. Each atom in an educt molecule thus appears again in a specific position of one of the reaction products. This bijection between educt and product atoms is not reported by chemical reaction databases, however, so that the “Atom Mapping Problem” of finding this bijection is left as an important computational task for many practical applications in computational chemistry and systems biology. Elementary chemical reactions feature a cyclic imaginary transition state (ITS) that imposes additional restrictions on the bijection between educt and product atoms that are not taken into account by previous approaches. We demonstrate that Constraint Programming is well-suited to solving the Atom Mapping Problem in this setting. The performance of our approach is evaluated for a manually curated subset of chemical reactions from the KEGG database featuring various ITS cycle layouts and reaction mechanisms.
Electronic supplementary material
The online version of this article (doi:10.1186/s13015-014-0023-3) contains supplementary material, which is available to authorized users.
PMCID: PMC4256833  PMID: 25484913
Atom-atom mapping; Constraint programming; Chemical reaction; Imaginary transition state
10.  Dynamic DNA methylation orchestrates cardiomyocyte development, maturation and disease 
Nature Communications  2014;5:5288.
The heart is a highly specialized organ with essential function for the organism throughout life. The significance of DNA methylation in shaping the phenotype of the heart remains only partially known. Here we generate and analyse DNA methylomes from highly purified cardiomyocytes of neonatal, adult healthy and adult failing hearts. We identify large genomic regions that are differentially methylated during cardiomyocyte development and maturation. Demethylation of cardiomyocyte gene bodies correlates strongly with increased gene expression. Silencing of demethylated genes is characterized by the polycomb mark H3K27me3 or by DNA methylation. De novo methylation by DNA methyltransferases 3A/B causes repression of fetal cardiac genes, including essential components of the cardiac sarcomere. Failing cardiomyocytes partially resemble neonatal methylation patterns. This study establishes DNA methylation as a highly dynamic process during postnatal growth of cardiomyocytes and their adaptation to pathological stress in a process tightly linked to gene regulation and activity.
DNA methylation is essential for proper gene expression, development and genome stability. Here the authors present whole-genome DNA methylation analyses of purified mouse cardiomyocytes from newborn, adult and failing hearts and find highly dynamic patterns between the three phenotypes of cardiomyocytes.
PMCID: PMC4220495  PMID: 25335909
11.  Graph-distance distribution of the Boltzmann ensemble of RNA secondary structures 
Large RNA molecules are often composed of multiple functional domains whose spatial arrangement strongly influences their function. Pre-mRNA splicing, for instance, relies on the spatial proximity of the splice junctions that can be separated by very long introns. Similar effects appear in the processing of RNA virus genomes. Albeit a crude measure, the distribution of spatial distances in thermodynamic equilibrium harbors useful information on the shape of the molecule that in turn can give insights into the interplay of its functional domains.
Spatial distance can be approximated by the graph-distance in RNA secondary structure. We show here that the equilibrium distribution of graph-distances between a fixed pair of nucleotides can be computed in polynomial time by means of dynamic programming. While a naïve implementation would yield recursions with a very high time complexity of O(n^6 D^5) for sequence length n and D distinct distance values, it is possible to reduce this to O(n^4) for practical applications in which predominantly small distances are of interest. Further reductions, however, seem to be difficult. Therefore, we introduced sampling approaches that are much easier to implement. They are also theoretically favorable for several real-life applications, in particular since these primarily concern long-range interactions in very large RNA molecules.
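As an illustration of the underlying quantity only, the graph-distance between two nucleotides in a single, fixed secondary structure can be sketched as a breadth-first search. This is not the dynamic programming or sampling method of the paper, which works on the whole Boltzmann ensemble. The function names, the dot-bracket input, and the choice to count every backbone bond and every base pair as one edge of length 1 are assumptions made for this sketch.

```python
from collections import deque


def pair_table(dot_bracket):
    """Map each paired position to its partner (0-based) in a dot-bracket string."""
    stack, partner = [], {}
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            partner[i], partner[j] = j, i
    if stack:
        raise ValueError("unbalanced structure")
    return partner


def graph_distance(dot_bracket, start, end):
    """Shortest path between two nucleotides in one structure.

    Assumption: backbone bonds (i, i+1) and base pairs are edges of length 1.
    """
    partner = pair_table(dot_bracket)
    n = len(dot_bracket)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        if i == end:
            return dist[i]
        neighbours = [i - 1, i + 1]
        if i in partner:
            neighbours.append(partner[i])
        for j in neighbours:
            if 0 <= j < n and j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    raise ValueError("end position not reachable or out of range")
```

For the hairpin `((((...))))`, the two ends of the stem are one step apart because they form a base pair, although they are ten backbone bonds apart in sequence. Averaging such distances over structures sampled from the ensemble would be one way to approximate the distribution the abstract refers to.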
The graph-distance distribution can be computed using a dynamic programming approach. Although a crude approximation of reality, our initial results indicate that the graph-distance can be related to the smFRET data. The additional file and the software of our paper are available from
PMCID: PMC4181469  PMID: 25285153
Graph-distance; Boltzmann distribution; Partition function; Pre-mRNA splicing; smFRET
12.  CRISPRstrand: predicting repeat orientations to determine the crRNA-encoding strand at CRISPR loci 
Bioinformatics  2014;30(17):i489-i496.
Motivation: The discovery of CRISPR-Cas systems almost 20 years ago rapidly changed our perception of the bacterial and archaeal immune systems. CRISPR loci consist of several repetitive DNA sequences called repeats, inter-spaced by stretches of variable length sequences called spacers. This CRISPR array is transcribed and processed into multiple mature RNA species (crRNAs). A single crRNA is integrated into an interference complex, together with CRISPR-associated (Cas) proteins, to bind and degrade invading nucleic acids. Although existing bioinformatics tools can recognize CRISPR loci by their characteristic repeat-spacer architecture, they generally output CRISPR arrays of ambiguous orientation and thus do not determine the strand from which crRNAs are processed. Knowledge of the correct orientation is crucial for many tasks, including the classification of CRISPR conservation, the detection of leader regions, the identification of target sites (protospacers) on invading genetic elements and the characterization of protospacer-adjacent motifs.
Results: We present a fast and accurate tool to determine the crRNA-encoding strand at CRISPR loci by predicting the correct orientation of repeats based on an advanced machine learning approach. Both the repeat sequence and mutation information were encoded and processed by an efficient graph kernel to learn higher-order correlations. The model was trained and tested on curated data comprising >4500 CRISPRs and yielded a remarkable performance of 0.95 AUC ROC (area under the curve of the receiver operating characteristic). In addition, we show that accurate orientation information greatly improved detection of conserved repeat sequence families and structure motifs. We integrated CRISPRstrand predictions into our CRISPRmap web server of CRISPR conservation and updated the latter to version 2.0.
Availability: CRISPRmap and CRISPRstrand are available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4147912  PMID: 25161238
13.  Tandem Stem Loops in roX RNAs Act Together to Mediate X Chromosome Dosage Compensation in Drosophila 
Molecular cell  2013;51(2):156-173.
Dosage compensation in Drosophila is an epigenetic phenomenon utilizing proteins and long noncoding RNAs (lncRNAs) for transcriptional upregulation of the male X chromosome. Here, by using UV crosslinking followed by deep sequencing, we show that two enzymes in the Male-Specific Lethal complex, MLE RNA helicase and MSL2 ubiquitin ligase, bind evolutionarily conserved domains containing tandem stem loops in roX1 and roX2 RNAs in vivo. These domains constitute the minimal RNA unit present in multiple copies in diverse arrangements for nucleation of the MSL complex. MLE binds to these domains with distinct ATP-independent and ATP-dependent behavior. Importantly, we show that different roX RNA domains have overlapping function, since only combinatorial mutations in the tandem stem loops result in severe loss of dosage compensation and consequently male-specific lethality. We propose that repetitive structural motifs in lncRNAs could provide plasticity during multiprotein complex assemblies to ensure efficient targeting in cis or in trans along chromosomes.
PMCID: PMC3804161  PMID: 23870142
14.  BlockClust: efficient clustering and classification of non-coding RNAs from short read RNA-seq profiles 
Bioinformatics  2014;30(12):i274-i282.
Summary: Non-coding RNAs (ncRNAs) play a vital role in many cellular processes such as RNA splicing, translation and gene regulation. However, the vast majority of ncRNAs still have no functional annotation. One prominent approach for putative function assignment is clustering of transcripts according to sequence and secondary structure. However, sequence information is changed by post-transcriptional modifications, and secondary structure is only a proxy for the true 3D conformation of the RNA polymer. A different type of information that does not suffer from these issues, and that can be used for the detection of RNA classes, is the pattern of processing and its traces in small RNA-seq read data. Here we introduce BlockClust, an efficient approach to detect transcripts with similar processing patterns. We propose a novel way to encode expression profiles in compact discrete structures, which can then be processed using fast graph-kernel techniques. We perform unsupervised clustering and develop family-specific discriminative models; finally, we show that the proposed approach is scalable, accurate and robust across different organisms, tissues and cell lines.
Availability: The whole BlockClust galaxy workflow including all tool dependencies is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4058930  PMID: 24931994
15.  MoDPepInt: an interactive web server for prediction of modular domain–peptide interactions 
Bioinformatics  2014;30(18):2668-2669.
Summary: MoDPepInt (Modular Domain Peptide Interaction) is a new easy-to-use web server for the prediction of binding partners for modular protein domains. Currently, we offer models for SH2, SH3 and PDZ domains via the tools SH2PepInt, SH3PepInt and PDZPepInt, respectively. More specifically, our server offers predictions for 51 SH2 human domains and 69 SH3 human domains via single domain models, and predictions for 226 PDZ domains across several species, via 43 multidomain models. All models are based on support vector machines with different kernel functions ranging from polynomial, to Gaussian, to advanced graph kernels. In this way, we model non-linear interactions between amino acid residues. Results were validated on manually curated datasets achieving competitive performance against various state-of-the-art approaches.
Availability and implementation: The MoDPepInt server is available under the URL
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4155253  PMID: 24872426
16.  Lineage-specific splicing of a brain-enriched alternative exon promotes glioblastoma progression 
The Journal of Clinical Investigation  2014;124(7):2861-2876.
Tissue-specific alternative splicing is critical for the emergence of tissue identity during development, yet the role of this process in malignant transformation is undefined. Tissue-specific splicing involves evolutionarily conserved, alternative exons that represent only a minority of the total alternative exons identified. Many of these conserved exons have functional features that influence signaling pathways to profound biological effect. Here, we determined that lineage-specific splicing of a brain-enriched cassette exon in the membrane-binding tumor suppressor annexin A7 (ANXA7) diminishes endosomal targeting of the EGFR oncoprotein, consequently enhancing EGFR signaling during brain tumor progression. ANXA7 exon splicing was mediated by the ribonucleoprotein PTBP1, which is normally repressed during neuronal development. PTBP1 was highly expressed in glioblastomas due to loss of a brain-enriched microRNA (miR-124) and to PTBP1 amplification. The alternative ANXA7 splicing trait was present in precursor cells, suggesting that glioblastoma cells inherit the trait from a potential tumor-initiating ancestor and that these cells exploit this trait through accumulation of mutations that enhance EGFR signaling. Our data illustrate that lineage-specific splicing of a tissue-regulated alternative exon in a constituent of an oncogenic pathway eliminates tumor suppressor functions and promotes glioblastoma progression. This paradigm may offer a general model as to how tissue-specific regulatory mechanisms can reprogram normal developmental processes into oncogenic ones.
PMCID: PMC4071411  PMID: 24865424
17.  MOF-associated complexes ensure stem cell identity and Xist repression 
eLife  2014;3:e02024.
Histone acetyl transferases (HATs) play distinct roles in many cellular processes and are frequently misregulated in cancers. Here, we study the regulatory potential of MYST1-(MOF)-containing MSL and NSL complexes in mouse embryonic stem cells (ESCs) and neuronal progenitors. We find that both complexes influence transcription by targeting promoters and TSS-distal enhancers. In contrast to flies, the MSL complex is not exclusively enriched on the X chromosome, yet it is crucial for mammalian X chromosome regulation as it specifically regulates Tsix, the major repressor of Xist lncRNA. MSL depletion leads to decreased Tsix expression, reduced REX1 recruitment, and consequently, enhanced accumulation of Xist and variable numbers of inactivated X chromosomes during early differentiation. The NSL complex provides additional, Tsix-independent repression of Xist by maintaining pluripotency. MSL and NSL complexes therefore act synergistically by using distinct pathways to ensure a fail-safe mechanism for the repression of X inactivation in ESCs.
eLife digest
Gene expression is controlled by a complicated network of mechanisms involving a wide range of enzymes and protein complexes. Many of these mechanisms are identical in males and females, but some are not. Female mammals, for example, carry two X chromosomes, whereas males have one X and one Y chromosome. Since the two X chromosomes in females contain essentially the same set of genes, one of them undergoes silencing to prevent the overproduction of certain proteins. This process, which is called X-inactivation, occurs during different stages of development and it must be tightly controlled.
An enzyme called MOF was originally found in flies in two distinct complexes: the male-specific lethal (MSL) complex, which forms only in males, and the non-specific lethal (NSL) complex, which is ubiquitous in both males and females. These complexes are evolutionarily conserved and are also found in mammals. While mammalian MOF is reasonably well understood, the MSL and NSL complexes are not, so Chelmicki, Dündar et al. have used various sequencing techniques, in combination with biochemical experiments, to investigate their roles in embryonic stem cells and neuronal progenitor cells in mice.
These experiments show that MSL and NSL complexes engage in the regulation of thousands of genes. Although the two complexes often show different gene preferences, they often regulate the same cellular processes. The MSL/NSL-dependent regulation of X chromosome inactivation is a prime example of this phenomenon.
The MSL complex reduces the production of an RNA molecule called Xist, which is responsible for the inactivation of one of the two X chromosomes in females. The NSL complex, meanwhile, ensures the production of multiple proteins that are crucial for the development of embryonic stem cells, and are also involved in the repression of X inactivation.
This analysis sheds light on how different complexes can cooperate and complement each other in order to reach the same goal in the cell. The knowledge gained from this study will pave the way towards better understanding of complex processes such as embryonic development, organogenesis and the pathogenesis of disorders like cancer.
PMCID: PMC4059889  PMID: 24842875
D. melanogaster; epigenetics; chromatin; transcription; acetylation; X inactivation; mouse
18.  CopraRNA and IntaRNA: predicting small RNA targets, networks and interaction domains 
Nucleic Acids Research  2014;42(Web Server issue):W119-W123.
CopraRNA (Comparative prediction algorithm for small RNA targets) is the most recent asset to the Freiburg RNA Tools webserver. It incorporates and extends the functionality of the existing tool IntaRNA (Interacting RNAs) in order to predict targets, interaction domains and consequently the regulatory networks of bacterial small RNA molecules. The CopraRNA prediction results are accompanied by extensive postprocessing methods such as functional enrichment analysis and visualization of interacting regions. Here, we introduce the functionality of the CopraRNA and IntaRNA webservers and give detailed explanations on their postprocessing functionalities. Both tools are freely accessible at
PMCID: PMC4086077  PMID: 24838564
19.  Comparative analysis of Cas6b processing and CRISPR RNA stability 
RNA Biology  2013;10(5):700-707.
The prokaryotic antiviral defense system CRISPR (clustered regularly interspaced short palindromic repeats)/Cas (CRISPR-associated) employs short crRNAs (CRISPR RNAs) to target invading viral nucleic acids. A short spacer sequence of these crRNAs can be derived from a viral genome and recognizes a reoccurring attack of a virus via base complementarity. We analyzed the effect of spacer sequences on the maturation of crRNAs of the subtype I-B Methanococcus maripaludis C5 CRISPR cluster. The responsible endonuclease, termed Cas6b, bound non-hydrolyzable repeat RNA as a dimer and mature crRNA as a monomer. Comparative analysis of Cas6b processing of individual spacer-repeat-spacer RNA substrates and crRNA stability revealed the potential influence of spacer sequence and length on these parameters. Correlation of these observations with the variable abundance of crRNAs visualized by deep-sequencing analyses is discussed. Finally, insertion of spacer and repeat sequences with archaeal poly-T termination signals is suggested to be prevented in archaeal CRISPR/Cas systems.
PMCID: PMC3737328  PMID: 23392318
CRISPR; Cas6; endonuclease; crRNA; in-line probing; RNA binding; transcription termination
20.  Two CRISPR-Cas systems in Methanosarcina mazei strain Gö1 display common processing features despite belonging to different types I and III 
RNA Biology  2013;10(5):779-791.
The clustered regularly interspaced short palindromic repeats (CRISPR) system represents a highly adaptive and heritable defense system against foreign nucleic acids in bacteria and archaea. We analyzed the two CRISPR-Cas systems in Methanosarcina mazei strain Gö1. Although belonging to different subtypes (I-B and III-B), the leaders and repeats of both loci are nearly identical. Also, despite many point mutations in each array, a common hairpin motif was identified in the repeats by a bioinformatics analysis and in vitro structural probing. The expression and maturation of CRISPR-derived RNAs (crRNAs) were studied in vitro and in vivo. Both respective potential Cas6b-type endonucleases were purified and their activity tested in vitro. Each protein showed significant activity and could cleave both repeats at the same processing site. Cas6b of subtype III-B, however, was significantly more efficient in its cleavage activity compared with Cas6b of subtype I-B. Northern blot and differential RNAseq analyses were performed to investigate in vivo transcription and maturation of crRNAs, revealing generally very low expression of both systems, whereas significant induction at high NaCl concentrations was observed. crRNAs derived proximal to the leader were generally more abundant than distal ones and in vivo processing sites were clarified for both loci, confirming the previously well-established 8 nt 5′ repeat tags. The 3′-ends were more diverse, but generally ended in a prefix of the following repeat sequence (3′-tag). The analysis further revealed a 5′-hydroxy and 3′-phosphate termini architecture of small crRNAs specific for cleavage products of Cas6 endonucleases from type I-E and I-F and type III-B.
PMCID: PMC3737336  PMID: 23619576
methanoarchaea; CRISPR-Cas system; immunity of prokaryotes; regulatory RNA; phages; Methanosarcina mazei
21.  Cluster based prediction of PDZ-peptide interactions 
BMC Genomics  2014;15(Suppl 1):S5.
PDZ domains are one of the most promiscuous protein recognition modules that bind with short linear peptides and play an important role in cellular signaling. Recently, a few high-throughput techniques (e.g. protein microarray screen, phage display) have been applied to determine the in vitro binding specificity of PDZ domains. Currently, many computational methods are available to predict PDZ-peptide interactions, but they often provide domain-specific models and/or have limited domain coverage.
Here, we composed the largest set of PDZ domains derived from human, mouse, fly and worm proteomes and defined binding models for PDZ domain families to improve the domain coverage and prediction specificity. For that purpose, we first identified a novel set of 138 PDZ families, comprising 548 PDZ domains from the aforementioned organisms, based on efficient clustering according to their sequence identity. For 43 PDZ families, covering 226 PDZ domains with available interaction data, we built specialized models using a support vector machine approach. The advantage of family-wise models is that they can also be used to determine the binding specificity of a newly characterized PDZ domain with sufficient sequence identity to the known families. Since most current experimental approaches provide only positive data, we have to cope with the class imbalance problem. Thus, to enrich the negative class, we introduced a powerful semi-supervised technique to generate high confidence non-interaction data. We report competitive predictive performance with respect to state-of-the-art approaches.
Our approach has several contributions. First, we show that domain coverage can be increased by applying an accurate clustering technique. Second, we developed an approach based on a semi-supervised strategy to get high confidence negative data. Third, we allowed high order correlations between the amino acid positions in the binding peptides. Fourth, our method is general enough to be easily applicable to other peptide recognition modules such as SH2 domains. Finally, we performed a genome-wide prediction for 101 human and 102 mouse PDZ domains and uncovered novel interactions with biological relevance. We make all the predictive models and genome-wide predictions freely available to the scientific community.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-S1-S5) contains supplementary material, which is available to authorized users.
PMCID: PMC4046824  PMID: 24564547
PDZ domain-peptide interactions; protein recognition modules; protein domain clustering; semi-supervised learning; support vector machines
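The abstract of entry 21 mentions grouping PDZ domains into families according to sequence identity but does not say which clustering algorithm was used. As a rough illustration only, the sketch below assumes a simple greedy threshold clustering over pre-aligned, equal-length sequences. The function names, the 0.8 threshold and the toy sequences are all hypothetical and are not taken from the paper.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: dict[str, str], threshold: float) -> list[list[str]]:
    """Put each sequence into the first cluster whose representative it matches.

    The first member of every cluster acts as its representative. This is an
    assumed, simplified scheme, not the clustering used by the authors.
    """
    clusters: list[list[str]] = []
    for name, seq in seqs.items():
        for members in clusters:
            if identity(seq, seqs[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            # No existing representative is similar enough: start a new family.
            clusters.append([name])
    return clusters


# Toy, made-up sequences (not real PDZ domains).
toy = {
    "d1": "GLGFSIAGG",
    "d2": "GLGFSIVGG",
    "d3": "KPEWTLRDD",
}
families = greedy_cluster(toy, threshold=0.8)
print(families)
```

Under this scheme a newly characterized domain could be assigned to an existing family by comparing it with the family representatives, which mirrors the family-wise modelling idea described in the abstract.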
22.  A Complex of Cas Proteins 5, 6, and 7 Is Required for the Biogenesis and Stability of Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)-derived RNAs (crRNAs) in Haloferax volcanii* 
The Journal of Biological Chemistry  2014;289(10):7164-7177.
Background: The Cas6 protein is required for generating crRNAs in CRISPR-Cas I and III systems.
Results: The Cas6 protein is necessary for crRNA production but not sufficient for crRNA maintenance in Haloferax.
Conclusion: A Cascade-like complex is required in the type I-B system for a stable crRNA population.
Significance: The CRISPR-Cas system I-B has a Cascade complex similar to those of types I-A and I-E.
The clustered regularly interspaced short palindromic repeats/CRISPR-associated (CRISPR-Cas) system is a prokaryotic defense mechanism against foreign genetic elements. A plethora of CRISPR-Cas versions exist, with more than 40 different Cas protein families and several different molecular approaches to fight the invading DNA. One of the key players in the system is the CRISPR-derived RNA (crRNA), which directs the invader-degrading Cas protein complex to the invader. The CRISPR-Cas types I and III use the Cas6 protein to generate mature crRNAs. Here, we show that the Cas6 protein is necessary for crRNA production but that additional Cas proteins that form a CRISPR-associated complex for antiviral defense (Cascade)-like complex are needed for crRNA stability in the CRISPR-Cas type I-B system in Haloferax volcanii in vivo. Deletion of the cas6 gene results in the loss of mature crRNAs and interference. However, cells that have the complete cas gene cluster (cas1–8b) removed and are transformed with the cas6 gene are not able to produce and stably maintain mature crRNAs. crRNA production and stability are rescued only if cas5, -6, and -7 are present. Mutational analysis of the cas6 gene reveals three amino acids (His-41, Gly-256, and Gly-258) that are essential for pre-crRNA cleavage, whereas the mutation of two amino acids (Ser-115 and Ser-224) leads to an increase of crRNA amounts. This is the first systematic in vivo analysis of Cas6 protein variants. In addition, we show that the H. volcanii I-B system contains a Cascade-like complex with a Cas7, Cas5, and Cas6 core that protects the crRNA.
PMCID: PMC3945376  PMID: 24459147
Archaea; Microbiology; Molecular Biology; Molecular Genetics; Protein Complexes; CRISPR/Cas; Cas6; Haloferax volcanii; crRNA; Type I-B
23.  GraphProt: modeling binding preferences of RNA-binding proteins 
Genome Biology  2014;15(1):R17.
We present GraphProt, a computational framework for learning sequence- and structure-binding preferences of RNA-binding proteins (RBPs) from high-throughput experimental data. We benchmark GraphProt, demonstrating that the modeled binding preferences conform to the literature, and showcase the biological relevance and two applications of GraphProt models. First, estimated binding affinities correlate with experimental measurements. Second, predicted Ago2 targets display higher levels of expression upon Ago2 knockdown, whereas control targets do not. Computational binding models, such as those provided by GraphProt, are essential for predicting RBP binding sites and affinities in all tissues. GraphProt is freely available at
PMCID: PMC4053806  PMID: 24451197
24.  Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources 
The identification of protein and gene names (PGNs) from the scientific literature requires semantic resources: terminological and lexical resources deliver the term candidates into PGN tagging solutions and the gold standard corpora (GSC) train them to identify term parameters and contextual features. Ideally all three resources, i.e. corpora, lexica and taggers, cover the same domain knowledge, and thus support identification of the same types of PGNs and cover all of them. Unfortunately, none of the three serves as a predominant standard and for this reason it is worth exploring how these three resources comply with each other. We systematically compare different PGN taggers against publicly available corpora and analyze the impact of the included lexical resource on their performance. In particular, we determine the performance gains through false positive filtering, which contributes to the disambiguation of identified PGNs.
In general, machine learning approaches (ML-Tag) for PGN tagging show higher F1-measure performance against the BioCreative-II and Jnlpba GSCs (exact matching), whereas the lexicon based approaches (LexTag) in combination with disambiguation methods show better results on FsuPrge and PennBio. The ML-Tag solutions balance precision and recall, whereas the LexTag solutions have different precision and recall profiles at the same F1-measure across all corpora. Higher recall is achieved with larger lexical resources, which also introduce more noise (false positive results). The ML-Tag solutions certainly perform best if the test corpus is from the same GSC as the training corpus. As expected, the false negative errors characterize the test corpora, whereas the profiles of the false positive mistakes characterize the tagging solutions. LexTag solutions that are based on a large terminological resource in combination with false positive filtering produce better results and, in addition, provide concept identifiers from a knowledge source, in contrast to ML-Tag solutions.
The standard ML-Tag solutions achieve high performance, but not across all corpora, and thus should be trained using several different corpora to reduce possible biases. The LexTag solutions have different profiles for their precision and recall performance, but with similar F1-measure. This result is surprising and suggests that they cover a portion of the most common naming standards, but cope differently with the term variability across the corpora. The false positive filtering applied to LexTag solutions does improve the results by increasing their precision without significantly compromising their recall. The harmonisation of the annotation schemes in combination with standardized lexical resources in the tagging solutions will enable their comparability and will pave the way for a shared standard.
PMCID: PMC4021975  PMID: 24112383
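The abstract of entry 24 notes that taggers can have different precision and recall profiles at the same F1-measure, and that false positive filtering raises precision without lowering recall much. The sketch below illustrates that arithmetic with the standard definitions of precision, recall and F1. All counts and the tagger labels are made up for illustration and are not results from the paper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts on a corpus with 120 gold-standard mentions (tp + fn = 120).
tagger_a = precision_recall_f1(tp=72, fp=18, fn=48)  # precise but misses mentions
tagger_b = precision_recall_f1(tp=96, fp=64, fn=24)  # high recall but noisy

# Hypothetical effect of false positive filtering on tagger B:
# 40 false positives removed, true positives untouched.
tagger_b_filtered = precision_recall_f1(tp=96, fp=24, fn=24)

for label, scores in [("A", tagger_a), ("B", tagger_b), ("B filtered", tagger_b_filtered)]:
    p, r, f = scores
    print(f"{label}: precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

In this toy setup taggers A and B reach the same F1 from mirrored precision and recall values, while filtering raises B's precision and F1 and leaves its recall unchanged.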
25.  Evaluation and Cross-Comparison of Lexical Entities of Biological Interest (LexEBI) 
PLoS ONE  2013;8(10):e75185.
Biomedical entities, their identifiers and names, are essential in the representation of biomedical facts and knowledge. In the same way, the complete set of biomedical and chemical terms, i.e. the biomedical “term space” (the “Lexeome”), forms a key resource to achieve the full integration of the scientific literature with biomedical data resources: any identified named entity can immediately be normalized to the correct database entry. This goal not only requires that we are aware of all existing terms, but would also profit from knowing all their senses and their semantic interpretation (ambiguities, nestedness).
This study compiles a resource for lexical terms of biomedical interest in a standard format (called “LexEBI”), determines the overall number of terms, their reuse in different resources and the nestedness of terms. LexEBI comprises references for protein and gene entries and their term variants and chemical entities amongst other terms. In addition, disease terms have been identified from Medline and PubmedCentral and added to LexEBI. Our analysis demonstrates that the baseforms of terms from the different semantic types show little polysemous use. Nonetheless, the term variants of protein and gene names (PGNs) frequently contain species mentions, which should have been avoided according to protein annotation guidelines. Furthermore, both the protein and gene entities and the chemical entities comprise enzymes, leading to hierarchical polysemy, and a large portion of PGNs make reference to a chemical entity. Altogether, according to our analysis based on the Medline distribution, 401,869 unique PGNs in the documents contain a reference to 25,022 chemical entities, 3,125 disease terms or 1,576 species mentions.
LexEBI delivers the complete biomedical and chemical Lexeome in a standardized representation ( The resource provides the disease terms as open source content, and fully interlinks terms across resources.
PMCID: PMC3790750  PMID: 24124474
