Results 1-12 (12)

1.  Deep learning of the tissue-regulated splicing code 
Bioinformatics  2014;30(12):i121-i129.
Motivation: Alternative splicing (AS) is a regulated process that directs the generation of different transcripts from single genes. A computational model that can accurately predict splicing patterns based on genomic features and cellular context is highly desirable, both in understanding this widespread phenomenon, and in exploring the effects of genetic variations on AS.
Methods: Using a deep neural network, we developed a model inferred from mouse RNA-Seq data that can predict splicing patterns in individual tissues and differences in splicing patterns across tissues. Our architecture uses hidden variables that jointly represent features in genomic sequences and tissue types when making predictions. A graphics processing unit was used to greatly reduce the training time of our models with millions of parameters.
Results: We show that the deep architecture surpasses the performance of the previous Bayesian method for predicting AS patterns. With the proper optimization procedure and selection of hyperparameters, we demonstrate that deep architectures can be beneficial, even with a moderately sparse dataset. An analysis of what the model has learned in terms of the genomic features is presented.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4058935  PMID: 24931975
2.  Mutations in STAMBP, encoding a deubiquitinating enzyme, cause Microcephaly-Capillary Malformation syndrome 
Nature genetics  2013;45(5):556-562.
Microcephaly-capillary malformation (MIC-CAP) syndrome exhibits severe microcephaly with progressive cortical atrophy, intractable epilepsy, profound developmental delay and multiple small capillary malformations on the skin. We employed whole-exome sequencing of five patients with MIC-CAP syndrome and identified novel recessive mutations in STAMBP, a gene encoding the deubiquitinating (DUB) isopeptidase STAMBP (STAM-binding protein)/AMSH (Associated Molecule with the SH3 domain of STAM), that plays a key role in cell surface receptor-mediated endocytosis and sorting. Patient cell lines showed reduced STAMBP expression associated with accumulation of ubiquitin-conjugated protein aggregates, elevated apoptosis and insensitive activation of the RAS-MAPK and PI3K-AKT-mTOR pathways. The latter cellular phenotype is significant considering the established connection between these pathways and their association with vascular and capillary malformations. Furthermore, our findings of a congenital human disorder caused by a defective DUB protein that functions in endocytosis implicate ubiquitin-conjugate aggregation and elevated apoptosis as factors potentially influencing the progressive neuronal loss underlying MIC-CAP.
PMCID: PMC4000253  PMID: 23542699 CAMSID: cams4064
3.  MBNL proteins repress embryonic stem cell-specific alternative splicing and reprogramming 
Nature  2013;498(7453):241-245.
Previous investigations of the core gene regulatory circuitry that controls embryonic stem cell (ESC) pluripotency have largely focused on the roles of transcription, chromatin and non-coding RNA regulators1–3. Alternative splicing (AS) represents a widely acting mode of gene regulation4–8, yet its role in regulating ESC pluripotency and differentiation is poorly understood. Here, we identify the Muscleblind-like RNA binding proteins, MBNL1 and MBNL2, as conserved and direct negative regulators of a large program of cassette exon AS events that are differentially regulated between ESCs and other cell types. Knockdown of MBNL proteins in differentiated cells causes switching to an ESC-like AS pattern for approximately half of these events, whereas over-expression of MBNL proteins in ESCs promotes differentiated cell-like AS patterns. Among the MBNL-regulated events is an ESC-specific AS switch in the forkhead family transcription factor FOXP1 that controls pluripotency9. Consistent with a central and negative regulatory role for MBNL proteins in pluripotency, their knockdown significantly enhances the expression of key pluripotency genes and the formation of induced pluripotent stem cells (iPSCs) during somatic cell reprogramming.
PMCID: PMC3933998  PMID: 23739326
4.  A compendium of RNA-binding motifs for decoding gene regulation 
Nature  2013;499(7457):172-177.
RNA-binding proteins are key regulators of gene expression, yet only a small fraction have been functionally characterized. Here we report a systematic analysis of the RNA motifs recognized by RNA-binding proteins, encompassing 205 distinct genes from 24 diverse eukaryotes. The sequence specificities of RNA-binding proteins display deep evolutionary conservation, and the recognition preferences for a large fraction of metazoan RNA-binding proteins can thus be inferred from their RNA-binding domain sequence. The motifs that we identify in vitro correlate well with in vivo RNA-binding data. Moreover, we can associate them with distinct functional roles in diverse types of post-transcriptional regulation, enabling new insights into the functions of RNA-binding proteins both in normal physiology and in human disease. These data provide an unprecedented overview of RNA-binding proteins and their targets, and constitute an invaluable resource for determining post-transcriptional regulatory mechanisms in eukaryotes.
PMCID: PMC3929597  PMID: 23846655
5.  AVISPA: a web tool for the prediction and analysis of alternative splicing 
Genome Biology  2013;14(10):R114.
Transcriptome complexity and its relation to numerous diseases underpin the need to predict in silico splice variants and the regulatory elements that affect them. Building upon our recently described splicing code, we developed AVISPA, a Galaxy-based web tool for splicing prediction and analysis. Given an exon and its proximal sequence, the tool predicts whether the exon is alternatively spliced, whether it displays tissue-dependent splicing patterns, and whether it has associated regulatory elements. We assess AVISPA's accuracy on an independent dataset of tissue-dependent exons, and illustrate how the tool can be applied to analyze a gene of interest. AVISPA is available at
PMCID: PMC4014802  PMID: 24156756
6.  A generalizable pre-clinical research approach for orphan disease therapy 
With the advent of next-generation DNA sequencing, the pace of inherited orphan disease gene identification has increased dramatically, a situation that will continue for at least the next several years. At present, the number of such identified disease genes significantly outstrips the number of laboratories available to investigate a given disorder, an asymmetry that will only increase over time. The hope for any genetic disorder is, where possible and in addition to accurate diagnostic test formulation, the development of therapeutic approaches. To this end, we propose here the development of a strategic toolbox and preclinical research pathway for inherited orphan disease. Taking much of what has been learned from rare genetic disease research over the past two decades, we propose generalizable methods utilizing transcriptomic, system-wide chemical biology datasets combined with chemical informatics and, where possible, repurposing of FDA approved drugs for pre-clinical orphan disease therapies. It is hoped that this approach may be of utility for the broader orphan disease research community and provide funding organizations and patient advocacy groups with suggestions for the optimal path forward. In addition to enabling academic pre-clinical research, strategies such as this may also aid in seeding startup companies, as well as further engaging the pharmaceutical industry in the treatment of rare genetic disease.
PMCID: PMC3458970  PMID: 22704758
Orphan disease therapy; Preclinical drug development; Generalizable screening methods; Translational toolbox
7.  Challenges in estimating percent inclusion of alternatively spliced junctions from RNA-seq data 
BMC Bioinformatics  2012;13(Suppl 6):S11.
Transcript quantification is a long-standing problem in genomics and estimating the relative abundance of alternatively-spliced isoforms from the same transcript is an important special case. Both problems have recently been illuminated by high-throughput RNA sequencing experiments which are quickly generating large amounts of data. However, much of the signal present in this data is corrupted or obscured by biases resulting in non-uniform and non-proportional representation of sequences from different transcripts. Many existing analyses attempt to deal with these and other biases with various task-specific approaches, which makes direct comparison between them difficult. However, two popular tools for isoform quantification, MISO and Cufflinks, have adopted a general probabilistic framework to model and mitigate these biases in a more general fashion. These advances motivate the need to investigate the effects of RNA-seq biases on the accuracy of different approaches for isoform quantification. We conduct the investigation by building models of increasing sophistication to account for noise introduced by the biases and compare their accuracy to the established approaches.
We focus on methods that estimate the expression of alternatively-spliced isoforms with the percent-spliced-in (PSI) metric for each exon skipping event. To improve their estimates, many methods use evidence from RNA-seq reads that align to exon bodies. However, the methods we propose focus on reads that span only exon-exon junctions. As a result, our approaches are simpler and less sensitive to exon definitions than existing methods, which enables us to distinguish their strengths and weaknesses more easily. We present several probabilistic models of position-specific read counts with increasing complexity and compare them to each other and to the current state-of-the-art methods in isoform quantification, MISO and Cufflinks. On a validation set with RT-PCR measurements for 26 cassette events, some of our methods are more accurate and some are significantly more consistent than these two popular tools. This comparison demonstrates the challenges in estimating the percent inclusion of alternatively spliced junctions and illuminates the tradeoffs between different approaches.
PMCID: PMC3330053  PMID: 22537040
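The percent-spliced-in (PSI) metric described in the abstract above can be illustrated with a minimal sketch. This is not one of the paper's probabilistic models of position-specific read counts; it is only the simplest junction-count ratio, under the assumption (ours, not the abstract's) that reads from the two inclusion junctions are averaged before being compared with reads from the single exclusion junction. The function name and argument names are hypothetical.

```python
def psi_from_junctions(inc_upstream, inc_downstream, exclusion):
    """Naive PSI estimate for one cassette exon from junction-spanning reads.

    inc_upstream / inc_downstream: reads spanning the two junctions that
    include the exon; exclusion: reads spanning the exon-skipping junction.
    Averaging the two inclusion junctions is an illustrative assumption.
    """
    inclusion = (inc_upstream + inc_downstream) / 2.0
    total = inclusion + exclusion
    if total == 0:
        # No junction evidence at all: PSI is undefined for this event.
        return None
    return inclusion / total


if __name__ == "__main__":
    # 30 and 50 inclusion-junction reads, 10 exclusion-junction reads
    print(psi_from_junctions(30, 50, 10))
```

A ratio like this ignores the position-specific biases that the abstract identifies as the main difficulty, which is why it should be read as a baseline definition of the quantity rather than as an estimator to rely on.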
8.  Transcriptional Profiling of Endocrine Cerebro-Osteodysplasia Using Microarray and Next-Generation Sequencing 
PLoS ONE  2011;6(9):e25400.
Transcriptome profiling of patterns of RNA expression is a powerful approach to identify networks of genes that play a role in disease. To date, most mRNA profiling of tissues has been accomplished using microarrays, but next-generation sequencing can offer a richer and more comprehensive picture.
Methodology/Principal Findings
ECO is a rare multi-system developmental disorder caused by a homozygous mutation in ICK encoding intestinal cell kinase. We performed gene expression profiling using both cDNA microarrays and next-generation mRNA sequencing (mRNA-seq) of skin fibroblasts from ECO-affected subjects. We then validated a subset of differentially expressed transcripts identified by each method using quantitative reverse transcription-polymerase chain reaction (qRT-PCR). Finally, we used gene ontology (GO) to identify critical pathways and processes that were abnormal according to each technical platform. Methodologically, mRNA-seq identifies a much larger number of differentially expressed genes with much better correlation to qRT-PCR results than the microarray (r² = 0.794 and 0.137, respectively). Biologically, cDNA microarray identified functional pathways focused on anatomical structure and development, while the mRNA-seq platform identified a higher proportion of genes involved in cell division and DNA replication pathways.
Transcriptome profiling with mRNA-seq had greater sensitivity, range and accuracy than the microarray. The two platforms generated different but complementary hypotheses for further evaluation.
PMCID: PMC3181319  PMID: 21980446
9.  Computational refinement of post-translational modifications predicted from tandem mass spectrometry 
Bioinformatics  2011;27(6):797-806.
Motivation: A post-translational modification (PTM) is a chemical modification of a protein that occurs naturally. Many of these modifications, such as phosphorylation, are known to play pivotal roles in the regulation of protein function. Hence, PTM perturbations have been linked to diverse diseases like Parkinson's, Alzheimer's, diabetes and cancer. To discover PTMs on a genome-wide scale, there is a recent surge of interest in analyzing tandem mass spectrometry data, and several unrestrictive (so-called ‘blind’) PTM search methods have been reported. However, these approaches are subject to noise in mass measurements and in the predicted modification site (amino acid position) within peptides, which can result in false PTM assignments.
Results: To address these issues, we devised a machine learning algorithm, PTMClust, that can be applied to the output of blind PTM search methods to improve prediction quality, by suppressing noise in the data and clustering peptides with the same underlying modification to form PTM groups. We show that our technique outperforms two standard clustering algorithms on a simulated dataset. Additionally, we show that our algorithm significantly improves sensitivity and specificity when applied to the output of three different blind PTM search engines, SIMS, InsPecT and MODmap. Additionally, PTMClust markedly outperforms another PTM refinement algorithm, PTMFinder. We demonstrate that our technique is able to reduce false PTM assignments, improve overall detection coverage and facilitate novel PTM discovery, including terminus modifications. We applied our technique to a large-scale yeast MS/MS proteome profiling dataset and found numerous known and novel PTMs. Accurately identifying modifications in protein sequences is a critical first step for PTM profiling, and thus our approach may benefit routine proteomic analysis.
Availability: Our algorithm is implemented in Matlab and is freely available for academic use. The software is available online from
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3051323  PMID: 21258065
10.  Model-based detection of alternative splicing signals 
Bioinformatics  2010;26(12):i325-i333.
Motivation: Transcripts from ∼95% of human multi-exon genes are subject to alternative splicing (AS). The growing interest in AS is propelled by its prominent contribution to transcriptome and proteome complexity and the role of aberrant AS in numerous diseases. Recent technological advances enable thousands of exons to be simultaneously profiled across diverse cell types and cellular conditions, but require accurate identification of condition-specific splicing changes. It is necessary to accurately identify such splicing changes to elucidate the underlying regulatory programs or link the splicing changes to specific diseases.
Results: We present a probabilistic model tailored for high-throughput AS data, where observed isoform levels are explained as combinations of condition-specific AS signals. According to our formulation, given an AS dataset our tasks are to detect common signals in the data and identify the exons relevant to each signal. Our model can incorporate prior knowledge about underlying AS signals, measurement quality and gene expression level effects. Using a large-scale multi-tissue AS dataset, we demonstrate the advantage of our method over standard alternative approaches. In addition, we describe newly found tissue-specific AS signals which were verified experimentally, and discuss associated regulatory features.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2881385  PMID: 20529924
11.  Functional coordination of alternative splicing in the mammalian central nervous system 
Genome Biology  2007;8(6):R108.
A microarray analysis provides new evidence suggesting that specific cellular processes in the mammalian CNS are coordinated at the level of alternative splicing, and that a complex splicing code underlies CNS-specific alternative splicing regulation.
Alternative splicing (AS) functions to expand proteomic complexity and plays numerous important roles in gene regulation. However, the extent to which AS coordinates functions in a cell and tissue type specific manner is not known. Moreover, the sequence code that underlies cell and tissue type specific regulation of AS is poorly understood.
Using quantitative AS microarray profiling, we have identified a large number of widely expressed mouse genes that contain single or coordinated pairs of alternative exons that are spliced in a tissue regulated fashion. The majority of these AS events display differential regulation in central nervous system (CNS) tissues. Approximately half of the corresponding genes have neural specific functions and operate in common processes and interconnected pathways. Differential regulation of AS in the CNS tissues correlates strongly with a set of mostly new motifs that are predominantly located in the intron and constitutive exon sequences neighboring CNS-regulated alternative exons. Different subsets of these motifs are correlated with either increased inclusion or increased exclusion of alternative exons in CNS tissues, relative to the other profiled tissues.
Our findings provide new evidence that specific cellular processes in the mammalian CNS are coordinated at the level of AS, and that a complex splicing code underlies CNS specific AS regulation. This code appears to comprise many new motifs, some of which are located in the constitutive exons neighboring regulated alternative exons. These data provide a basis for understanding the molecular mechanisms by which the tissue specific functions of widely expressed genes are coordinated at the level of AS.
PMCID: PMC2394768  PMID: 17565696
12.  The functional landscape of mouse gene expression 
Journal of Biology  2004;3(5):21.
Large-scale quantitative analysis of transcriptional co-expression has been used to dissect regulatory networks and to predict the functions of new genes discovered by genome sequencing in model organisms such as yeast. Although the idea that tissue-specific expression is indicative of gene function in mammals is widely accepted, it has not been objectively tested nor compared with the related but distinct strategy of correlating gene co-expression as a means to predict gene function.
We generated microarray expression data for nearly 40,000 known and predicted mRNAs in 55 mouse tissues, using custom-built oligonucleotide arrays. We show that quantitative transcriptional co-expression is a powerful predictor of gene function. Hundreds of functional categories, as defined by Gene Ontology 'Biological Processes', are associated with characteristic expression patterns across all tissues, including categories that bear no overt relationship to the tissue of origin. In contrast, simple tissue-specific restriction of expression is a poor predictor of which genes are in which functional categories. As an example, the highly conserved mouse gene PWP1 is widely expressed across different tissues but is co-expressed with many RNA-processing genes; we show that the uncharacterized yeast homolog of PWP1 is required for rRNA biogenesis.
We conclude that 'functional genomics' strategies based on quantitative transcriptional co-expression will be as fruitful in mammals as they have been in simpler organisms, and that transcriptional control of mammalian physiology is more modular than is generally appreciated. Our data and analyses provide a public resource for mammalian functional genomics.
PMCID: PMC549719  PMID: 15588312
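The co-expression strategy described in the abstract above can be sketched as ranking genes by how closely their expression profiles across tissues track a query gene. This is a minimal illustration assuming Pearson correlation as the similarity measure; the abstract does not state which measure the authors used, and the gene names and expression values below are invented toy data, not results from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A flat profile carries no co-expression signal; treat as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_coexpressed(query, profiles):
    """Return (gene, correlation) pairs sorted from most to least co-expressed."""
    q = profiles[query]
    scores = [(g, pearson(q, p)) for g, p in profiles.items() if g != query]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical profiles: one value per tissue, four tissues.
    toy_profiles = {
        "query_gene": [1.0, 2.0, 3.0, 4.0],
        "gene_a": [2.0, 4.0, 6.0, 8.0],   # tracks the query
        "gene_b": [4.0, 3.0, 2.0, 1.0],   # mirrors the query
        "gene_c": [1.0, 1.0, 1.0, 1.0],   # flat across tissues
    }
    for gene, r in rank_coexpressed("query_gene", toy_profiles):
        print(gene, round(r, 3))
```

Under a guilt-by-association reading, the functional annotations of the top-ranked genes would then suggest a candidate function for the query gene, which is the kind of inference the abstract contrasts with simple tissue-specific restriction of expression.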
