PMCC PMCC

Search tips
Search criteria

Advanced
Results 1-19 (19)
 

Clipboard (0)
None

Select a Filter Below

Journals
Year of Publication
1.  Cohesin Rings Devoid of Scc3 and Pds5 Maintain Their Stable Association with the DNA 
PLoS Genetics  2012;8(8):e1002856.
Cohesin is a protein complex that forms a ring around sister chromatids thus holding them together. The ring is composed of three proteins: Smc1, Smc3 and Scc1. The roles of three additional proteins that associate with the ring, Scc3, Pds5 and Wpl1, are not well understood. It has been proposed that these three factors form a complex that stabilizes the ring and prevents it from opening. This activity promotes sister chromatid cohesion but at the same time poses an obstacle for the initial entrapment of sister DNAs. This hindrance to cohesion establishment is overcome during DNA replication via acetylation of the Smc3 subunit by the Eco1 acetyltransferase. However, the full mechanistic consequences of Smc3 acetylation remain unknown. In the current work, we test the requirement of Scc3 and Pds5 for the stable association of cohesin with DNA. We investigated the consequences of Scc3 and Pds5 depletion in vivo using degron tagging in budding yeast. The previously described DHFR–based N-terminal degron as well as a novel Eco1-derived C-terminal degron were employed in our study. Scc3 and Pds5 associate with cohesin complexes independently of each other and require the Scc1 “core” subunit for their association with chromosomes. Contrary to previous data for Scc1 downregulation, depletion of either Scc3 or Pds5 had a strong effect on sister chromatid cohesion but not on cohesin binding to DNA. Quantity, stability and genome-wide distribution of cohesin complexes remained mostly unchanged after the depletion of Scc3 and Pds5. Our findings are inconsistent with a previously proposed model that Scc3 and Pds5 are cohesin maintenance factors required for cohesin ring stability or for maintaining its association with DNA. We propose that Scc3 and Pds5 specifically function during cohesion establishment in S phase.
Author Summary
When a cell divides, each daughter cell receives one, and only one, of each sister DNA molecule from the mother. These identical DNA molecules, called chromatids, result from the replication of a single DNA molecule and are held together by a ring-shaped protein complex termed cohesin. As a cell’s genetic information is divided into several distinct chromosomes, this arrangement, termed sister chromatid cohesion, makes it possible to distinguish sister and non-sister chromatids and is a prerequisite for the faithful division of genetic information. Cohesin rings, consisting of three subunits, trap two sister DNA molecules inside them. Additional proteins are required to load the rings onto DNA and to ensure that they capture both sister DNA molecules. We have investigated the roles of Scc3 and Pds5, two proteins that associate with cohesin rings, and were previously proposed to keep them stably locked once loaded onto DNA. Surprisingly, when we depleted Scc3 and Pds5 from yeast, the rings remained stably associated with the DNA; however, cohesion between the sisters was severely compromised. We conclude that Scc3 and Pds5 function to capture the two sister DNA molecules together inside the cohesin ring.
doi:10.1371/journal.pgen.1002856
PMCID: PMC3415457  PMID: 22912589
2.  HnRNP L and L-like cooperate in multiple-exon regulation of CD45 alternative splicing 
Nucleic Acids Research  2012;40(12):5666-5678.
CD45 encodes a trans-membrane protein-tyrosine phosphatase expressed in diverse cells of the immune system. By combinatorial use of three variable exons 4–6, isoforms are generated that differ in their extracellular domain, thereby modulating phosphatase activity and immune response. Alternative splicing of these CD45 exons involves two heterogeneous ribonucleoproteins, hnRNP L and its cell-type specific paralog hnRNP L-like (LL). To address the complex combinatorial splicing of exons 4–6, we investigated hnRNP L/LL protein expression in human B-cells in relation to CD45 splicing patterns, applying RNA-Seq. In addition, mutational and RNA-binding analyses were carried out in HeLa cells. We conclude that hnRNP LL functions as the major CD45 splicing repressor, with two CA elements in exon 6 as its primary target. In exon 4, one element is targeted by both hnRNP L and LL. In contrast, exon 5 was never repressed on its own and only co-regulated with exons 4 and 6. Stable L/LL interaction requires CD45 RNA, specifically exons 4 and 6. We propose a novel model of combinatorial alternative splicing: HnRNP L and LL cooperate on the CD45 pre-mRNA, bridging exons 4 and 6 and looping out exon 5, thereby achieving full repression of the three variable exons.
doi:10.1093/nar/gks221
PMCID: PMC3384337  PMID: 22402488
4.  Persistence and Availability of Web Services in Computational Biology 
PLoS ONE  2011;6(9):e24914.
We have conducted a study on the long-term availability of bioinformatics Web services: an observation of 927 Web services published in the annual Nucleic Acids Research Web Server Issues between 2003 and 2009.
We found that 72% of Web sites are still available at the published addresses, only 9% of services are completely unavailable. Older addresses often redirect to new pages. We checked the functionality of all available services: for 33%, we could not test functionality because there was no example data or a related problem; 13% were truly no longer working as expected; we could positively confirm functionality only for 45% of all services.
Additionally, we conducted a survey among 872 Web Server Issue corresponding authors; 274 replied. 78% of all respondents indicate their services have been developed solely by students and researchers without a permanent position. Consequently, these services are in danger of falling into disrepair after the original developers move to another institution, and indeed, for 24% of services, there is no plan for maintenance, according to the respondents.
We introduce a Web service quality scoring system that correlates with the number of citations: services with a high score are cited 1.8 times more often than low-scoring services. We have identified key characteristics that are predictive of a service's survival, providing reviewers, editors, and Web service developers with the means to assess or improve Web services. A Web service conforming to these criteria receives more citations and provides more reliable service for its users.
The most effective way of ensuring continued access to a service is a persistent Web address, offered either by the publishing journal, or created on the authors' own initiative, for example at http://bioweb.me. The community would benefit the most from a policy requiring any source code needed to reproduce results to be deposited in a public repository.
doi:10.1371/journal.pone.0024914
PMCID: PMC3178567  PMID: 21966383
5.  Support vector machines-based identification of alternative splicing in Arabidopsis thaliana from whole-genome tiling arrays 
BMC Bioinformatics  2011;12:55.
Background
Alternative splicing (AS) is a process which generates several distinct mRNA isoforms from the same gene by splicing different portions out of the precursor transcript. Due to the (patho-)physiological importance of AS, a complete inventory of AS is of great interest. While this is in reach for human and mammalian model organisms, our knowledge of AS in plants has remained more incomplete. Experimental approaches for monitoring AS are either based on transcript sequencing or rely on hybridization to DNA microarrays. Among the microarray platforms facilitating the discovery of AS events, tiling arrays are well-suited for identifying intron retention, the most prevalent type of AS in plants. However, analyzing tiling array data is challenging, because of high noise levels and limited probe coverage.
Results
In this work, we present a novel method to detect intron retentions (IR) and exon skips (ES) from tiling arrays. While statistical tests have typically been proposed for this purpose, our method instead utilizes support vector machines (SVMs) which are appreciated for their accuracy and robustness to noise. Existing EST and cDNA sequences served for supervised training and evaluation. Analyzing a large collection of publicly available microarray and sequence data for the model plant A. thaliana, we demonstrated that our method is more accurate than existing approaches. The method was applied in a genome-wide screen which resulted in the discovery of 1,355 IR events. A comparison of these IR events to the TAIR annotation and a large set of short-read RNA-seq data showed that 830 of the predicted IR events are novel and that 525 events (39%) overlap with either the TAIR annotation or the IR events inferred from the RNA-seq data.
Conclusions
The method developed in this work expands the scarce repertoire of analysis tools for the identification of alternative mRNA splicing from whole-genome tiling arrays. Our predictions are highly enriched with known AS events and complement the A. thaliana genome annotation with respect to AS. Since all predicted AS events can be precisely attributed to experimental conditions, our work provides a basis for follow-up studies focused on the elucidation of the regulatory mechanisms underlying tissue-specific and stress-dependent AS in plants.
doi:10.1186/1471-2105-12-55
PMCID: PMC3051901  PMID: 21324185
6.  Inferring latent task structure for Multitask Learning by Multiple Kernel Learning 
BMC Bioinformatics  2010;11(Suppl 8):S5.
Background
The lack of sufficient training data is the limiting factor for many Machine Learning applications in Computational Biology. If data is available for several different but related problem domains, Multitask Learning algorithms can be used to learn a model based on all available information. In Bioinformatics, many problems can be cast into the Multitask Learning scenario by incorporating data from several organisms. However, combining information from several tasks requires careful consideration of the degree of similarity between tasks. Our proposed method simultaneously learns or refines the similarity between tasks along with the Multitask Learning classifier. This is done by formulating the Multitask Learning problem as Multiple Kernel Learning, using the recently published q-Norm MKL algorithm.
Results
We demonstrate the performance of our method on two problems from Computational Biology. First, we show that our method is able to improve performance on a splice site dataset with given hierarchical task structure by refining the task relationships. Second, we consider an MHC-I dataset, for which we assume no knowledge about the degree of task relatedness. Here, we are able to learn the task similarities ab initio along with the Multitask classifiers. In both cases, we outperform baseline methods that we compare against.
Conclusions
We present a novel approach to Multitask Learning that is capable of learning task similarity along with the classifiers. The framework is very general as it allows to incorporate prior knowledge about tasks relationships if available, but is also able to identify task similarities in absence of such prior information. Both variants show promising results in applications from Computational Biology.
doi:10.1186/1471-2105-11-S8-S5
PMCID: PMC2966292  PMID: 21034430
7.  Exploiting physico-chemical properties in string kernels 
BMC Bioinformatics  2010;11(Suppl 8):S7.
Background
String kernels are commonly used for the classification of biological sequences, nucleotide as well as amino acid sequences. Although string kernels are already very powerful, when it comes to amino acids they have a major short coming. They ignore an important piece of information when comparing amino acids: the physico-chemical properties such as size, hydrophobicity, or charge. This information is very valuable, especially when training data is less abundant. There have been only very few approaches so far that aim at combining these two ideas.
Results
We propose new string kernels that combine the benefits of physico-chemical descriptors for amino acids with the ones of string kernels. The benefits of the proposed kernels are assessed on two problems: MHC-peptide binding classification using position specific kernels and protein classification based on the substring spectrum of the sequences. Our experiments demonstrate that the incorporation of amino acid properties in string kernels yields improved performances compared to standard string kernels and to previously proposed non-substring kernels.
Conclusions
In summary, the proposed modifications, in particular the combination with the RBF substring kernel, consistently yield improvements without affecting the computational complexity. The proposed kernels therefore appear to be the kernels of choice for any protein sequence-based inference.
Availability
Data sets, code and additional information are available from http://www.fml.tuebingen.mpg.de/raetsch/suppl/aask. Implementations of the developed kernels are available as part of the Shogun toolbox.
doi:10.1186/1471-2105-11-S8-S7
PMCID: PMC2966294  PMID: 21034432
8.  rQuant.web: a tool for RNA-Seq-based transcript quantitation 
Nucleic Acids Research  2010;38(Web Server issue):W348-W351.
We provide a novel web service, called rQuant.web, allowing convenient access to tools for quantitative analysis of RNA sequencing data. The underlying quantitation technique rQuant is based on quadratic programming and estimates different biases induced by library preparation, sequencing and read mapping. It can tackle multiple transcripts per gene locus and is therefore particularly well suited to quantify alternative transcripts. rQuant.web is available as a tool in a Galaxy installation at http://galaxy.fml.mpg.de. Using rQuant.web is free of charge, it is open to all users, and there is no login requirement.
doi:10.1093/nar/gkq448
PMCID: PMC2896134  PMID: 20551130
9.  Transcript quantification with RNA-Seq data 
BMC Bioinformatics  2009;10(Suppl 13):P5.
doi:10.1186/1471-2105-10-S13-P5
PMCID: PMC2764136
11.  mGene.web: a web service for accurate computational gene finding 
Nucleic Acids Research  2009;37(Web Server issue):W312-W316.
We describe mGene.web, a web service for the genome-wide prediction of protein coding genes from eukaryotic DNA sequences. It offers pre-trained models for the recognition of gene structures including untranslated regions in an increasing number of organisms. With mGene.web, users have the additional possibility to train the system with their own data for other organisms on the push of a button, a functionality that will greatly accelerate the annotation of newly sequenced genomes. The system is built in a highly modular way, such that individual components of the framework, like the promoter prediction tool or the splice site predictor, can be used autonomously. The underlying gene finding system mGene is based on discriminative machine learning techniques and its high accuracy has been demonstrated in an international competition on nematode genomes. mGene.web is available at http://www.mgene.org/web, it is free of charge and can be used for eukaryotic genomes of small to moderate size (several hundred Mbp).
doi:10.1093/nar/gkp479
PMCID: PMC2703990  PMID: 19494180
12.  KIRMES: kernel-based identification of regulatory modules in euchromatic sequences 
Bioinformatics  2009;25(16):2126-2133.
Motivation: Understanding transcriptional regulation is one of the main challenges in computational biology. An important problem is the identification of transcription factor (TF) binding sites in promoter regions of potential TF target genes. It is typically approached by position weight matrix-based motif identification algorithms using Gibbs sampling, or heuristics to extend seed oligos. Such algorithms succeed in identifying single, relatively well-conserved binding sites, but tend to fail when it comes to the identification of combinations of several degenerate binding sites, as those often found in cis-regulatory modules.
Results: We propose a new algorithm that combines the benefits of existing motif finding with the ones of support vector machines (SVMs) to find degenerate motifs in order to improve the modeling of regulatory modules. In experiments on microarray data from Arabidopsis thaliana, we were able to show that the newly developed strategy significantly improves the recognition of TF targets.
Availability: The python source code (open source-licensed under GPL), the data for the experiments and a Galaxy-based web service are available at http://www.fml.mpg.de/raetsch/suppl/kirmes/
Contact: sebi@tuebingen.mpg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp278
PMCID: PMC2722996  PMID: 19389732
14.  At-TAX: a whole genome tiling array resource for developmental expression analysis and transcript identification in Arabidopsis thaliana 
Genome Biology  2008;9(7):R112.
A developmental expression atlas, At-TAX, based on whole-genome tiling arrays, is presented along with associated analysis methods.
Gene expression maps for model organisms, including Arabidopsis thaliana, have typically been created using gene-centric expression arrays. Here, we describe a comprehensive expression atlas, Arabidopsis thaliana Tiling Array Express (At-TAX), which is based on whole-genome tiling arrays. We demonstrate that tiling arrays are accurate tools for gene expression analysis and identified more than 1,000 unannotated transcribed regions. Visualizations of gene expression estimates, transcribed regions, and tiling probe measurements are accessible online at the At-TAX homepage.
doi:10.1186/gb-2008-9-7-r112
PMCID: PMC2530869  PMID: 18613972
15.  POIMs: positional oligomer importance matrices—understanding support vector machine-based signal detectors 
Bioinformatics  2008;24(13):i6-i14.
Motivation: At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a cumbersome shortcoming of SVMs is that their learned decision rules are very hard to understand for humans and cannot easily be related to biological facts.
Results: To make SVM-based sequence classifiers more accessible and profitable, we introduce the concept of positional oligomer importance matrices (POIMs) and propose an efficient algorithm for their computation. In contrast to the raw SVM feature weighting, POIMs take the underlying correlation structure of k-mer features induced by overlaps of related k-mers into account. POIMs can be seen as a powerful generalization of sequence logos: they allow to capture and visualize sequence patterns that are relevant for the investigated biological phenomena.
Availability: All source code, datasets, tables and figures are available at http://www.fml.tuebingen.mpg.de/raetsch/projects/POIM.
Contact: Soeren.Sonnenburg@first.fraunhofer.de
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btn170
PMCID: PMC2718648  PMID: 18586746
17.  Accurate splice site prediction using support vector machines 
BMC Bioinformatics  2007;8(Suppl 10):S7.
Background
For splice site recognition, one has to solve two classification problems: discriminating true from decoy splice sites for both acceptor and donor sites. Gene finding systems typically rely on Markov Chains to solve these tasks.
Results
In this work we consider Support Vector Machines for splice site recognition. We employ the so-called weighted degree kernel which turns out well suited for this task, as we will illustrate in several experiments where we compare its prediction accuracy with that of recently proposed systems. We apply our method to the genome-wide recognition of splice sites in Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, and Homo sapiens. Our performance estimates indicate that splice sites can be recognized very accurately in these genomes and that our method outperforms many other methods including Markov Chains, GeneSplicer and SpliceMachine. We provide genome-wide predictions of splice sites and a stand-alone prediction tool ready to be used for incorporation in a gene finder.
Availability
Data, splits, additional information on the model selection, the whole genome predictions, as well as the stand-alone prediction tool are available for download at .
doi:10.1186/1471-2105-8-S10-S7
PMCID: PMC2230508  PMID: 18269701
18.  Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning 
PLoS Computational Biology  2007;3(2):e20.
For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation [1] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%–13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.
Author Summary
Eukaryotic genes contain introns, which are intervening sequences that are excised from a gene transcript with the concomitant ligation of flanking segments called exons. The process of removing introns is called splicing. It involves biochemical mechanisms that to date are too complex to be modeled comprehensively and accurately. However, abundant sequencing results can serve as a blueprint database exemplifying what this process accomplishes. Using this database, we employ discriminative machine learning techniques to predict the mature mRNA given the unspliced pre-mRNA. Our method utilizes support vector machines and recent advances in label sequence learning, originally developed for natural language processing. The system, called mSplicer, was trained and evaluated on the genome of the nematode C. elegans, a well-studied model organism. We were able to show that mSplicer correctly predicts the splice form in most cases. Surprisingly, our predictions on currently unconfirmed genes deviate considerably from the public genome annotation. It is hypothesized that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation and additional sequencing results show the superiority of mSplicer's predictions. It is concluded that the annotation of nematode and other genomes can be greatly enhanced using modern machine learning.
doi:10.1371/journal.pcbi.0030020
PMCID: PMC1808025  PMID: 17319737
19.  Learning Interpretable SVMs for Biological Sequence Classification 
BMC Bioinformatics  2006;7(Suppl 1):S9.
Background
Support Vector Machines (SVMs) – using a variety of string kernels – have been successfully applied to biological sequence classification problems. While SVMs achieve high classification accuracy they lack interpretability. In many applications, it does not suffice that an algorithm just detects a biological signal in the sequence, but it should also provide means to interpret its solution in order to gain biological insight.
Results
We propose novel and efficient algorithms for solving the so-called Support Vector Multiple Kernel Learning problem. The developed techniques can be used to understand the obtained support vector decision function in order to extract biologically relevant knowledge about the sequence analysis problem at hand. We apply the proposed methods to the task of acceptor splice site prediction and to the problem of recognizing alternatively spliced exons. Our algorithms compute sparse weightings of substring locations, highlighting which parts of the sequence are important for discrimination.
Conclusion
The proposed method is able to deal with thousands of examples while combining hundreds of kernels within reasonable time, and reliably identifies a few statistically significant positions.
doi:10.1186/1471-2105-7-S1-S9
PMCID: PMC1810320  PMID: 16723012

Results 1-19 (19)