Results 1-14 (14)

1.  Evaluation of de novo transcriptome assemblies from RNA-Seq data 
Genome Biology  2014;15(12):553.
De novo RNA-Seq assembly facilitates the study of transcriptomes for species without sequenced genomes, but it is challenging to select the most accurate assembly in this context. To address this challenge, we developed a model-based score, RSEM-EVAL, for evaluating assemblies when the ground truth is unknown. We show that RSEM-EVAL correctly reflects assembly accuracy, as measured by REF-EVAL, a refined set of ground-truth-based scores that we also developed. Guided by RSEM-EVAL, we assembled the transcriptome of the regenerating axolotl limb; this assembly compares favorably to a previous assembly. A software package implementing our methods, DETONATE, is freely available at
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0553-5) contains supplementary material, which is available to authorized users.
PMCID: PMC4298084  PMID: 25608678
2.  Positional orthology: putting genomic evolutionary relationships into context 
Briefings in Bioinformatics  2011;12(5):401-412.
Orthology is a powerful refinement of homology that allows us to describe more precisely the evolution of genomes and understand the function of the genes they contain. However, because orthology is not concerned with genomic position, it is limited in its ability to describe genes that are likely to have equivalent roles in different genomes. Because of this limitation, the concept of ‘positional orthology’ has emerged, which describes the relation between orthologous genes that retain their ancestral genomic positions. In this review, we formally define this concept, for which we introduce the shorter term ‘toporthology’, with respect to the evolutionary events experienced by a gene’s ancestors. Through a discussion of recent studies on the role of genomic context in gene evolution, we show that the distinction between orthology and toporthology is biologically significant. We then review a number of orthology prediction methods that take genomic context into account and thus that may be used to infer the important relation of toporthology.
PMCID: PMC3178058  PMID: 21705766
positional orthology; toporthology; homology; synteny; genome alignment
3.  Gata2 cis-element is required for hematopoietic stem cell generation in the mammalian embryo 
The Journal of Experimental Medicine  2013;210(13):2833-2842.
Cis-element requirement for the emergence of HSCs in the AGM and for hemogenic endothelium to generate HSC-containing c-Kit+ cell clusters.
The generation of hematopoietic stem cells (HSCs) from hemogenic endothelium within the aorta, gonad, mesonephros (AGM) region of the mammalian embryo is crucial for development of the adult hematopoietic system. We described a deletion of a Gata2 cis-element (+9.5) that depletes fetal liver HSCs, is lethal at E13–14 of embryogenesis, and is mutated in an immunodeficiency that progresses to myelodysplasia/leukemia. Here, we demonstrate that the +9.5 element enhances Gata2 expression and is required to generate long-term repopulating HSCs in the AGM. Deletion of the +9.5 element abrogated the capacity of hemogenic endothelium to generate HSC-containing clusters in the aorta. Genomic analyses indicated that the +9.5 element regulated a rich ensemble of genes that control hemogenic endothelium and HSCs, as well as genes not implicated in hematopoiesis. These results reveal a mechanism that controls stem cell emergence from hemogenic endothelium to establish the adult hematopoietic system.
PMCID: PMC3865483  PMID: 24297994
4.  De novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity 
Nature Protocols  2013;8(8):10.1038/nprot.2013.084.
De novo assembly of RNA-Seq data allows us to study transcriptomes without the need for a genome sequence, such as in non-model organisms of ecological and evolutionary importance, cancer samples, or the microbiome. In this protocol, we describe the use of the Trinity platform for de novo transcriptome assembly from RNA-Seq data in non-model organisms. We also present Trinity’s supported companion utilities for downstream applications, including RSEM for transcript abundance estimation, R/Bioconductor packages for identifying differentially expressed transcripts across samples, and approaches to identify protein coding genes. In an included tutorial we provide a workflow for genome-independent transcriptome analysis leveraging the Trinity platform. The software, documentation and demonstrations are freely available from
PMCID: PMC3875132  PMID: 23845962
5.  Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures 
Nature  2007;450(7167):219-232.
Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the genomes of 12 Drosophila species for the de novo discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or ‘evolutionary signatures’, dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. We provide evidence of miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies.
PMCID: PMC2474711  PMID: 17994088
6.  Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs 
Bioinformatics  2013;29(18):2300-2310.
Motivation: Alternative splicing and other processes that allow for different transcripts to be derived from the same gene are significant forces in the eukaryotic cell. RNA-Seq is a promising technology for analyzing alternative transcripts, as it does not require prior knowledge of transcript structures or genome sequences. However, analysis of RNA-Seq data in the presence of genes with large numbers of alternative transcripts is currently challenging due to efficiency, identifiability and representation issues.
Results: We present RNA-Seq models and associated inference algorithms based on the concept of probabilistic splice graphs, which alleviate these issues. We prove that our models are often identifiable and demonstrate that our inference methods for quantification and differential processing detection are efficient and accurate.
Availability: Software implementing our methods is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3753571  PMID: 23846746
7.  Comparative RNA-seq Analysis in the Unsequenced Axolotl: The Oncogene Burst Highlights Early Gene Expression in the Blastema 
PLoS Computational Biology  2013;9(3):e1002936.
The salamander has the remarkable ability to regenerate its limb after amputation. Cells at the site of amputation form a blastema and then proliferate and differentiate to regrow the limb. To better understand this process, we performed deep RNA sequencing of the blastema over a time course in the axolotl, a species whose genome has not been sequenced. Using a novel comparative approach to analyzing RNA-seq data, we characterized the transcriptional dynamics of the regenerating axolotl limb with respect to the human gene set. This approach involved de novo assembly of axolotl transcripts, RNA-seq transcript quantification without a reference genome, and transformation of abundances from axolotl contigs to human genes. We found a prominent burst in oncogene expression during the first day and blastemal/limb bud genes peaking at 7 to 14 days. In addition, we found that limb patterning genes, SALL genes, and genes involved in angiogenesis, wound healing, defense/immunity, and bone development are enriched during blastema formation and development. Finally, we identified a category of genes with no prior literature support for limb regeneration that are candidates for further evaluation based on their expression pattern during the regenerative process.
Author Summary
Salamanders such as the axolotl can fully regenerate a limb upon amputation, making them the vertebrate champions of regeneration. On the other hand, humans and other mammals possess a very limited ability to regenerate limb structures. Learning about the genes, gene networks, and pathways activated in the salamander during limb regeneration will provide cues to improving the regenerative response in mammals. Elucidating these genes, networks, and pathways is difficult, however, because the axolotl does not yet have its genome sequenced and because it has diverged evolutionarily from species with a sequenced genome. Here, we produce a set of gene transcripts via RNA sequencing (RNA-seq) for the axolotl and provide information on the nature of the genes activated during regeneration. To determine the identity of these axolotl genes, we use comparative transcriptomics techniques to match the axolotl transcript data to that of the well-annotated human gene set. Supporting previous studies, we find upregulation of many genes previously found to be involved in limb development and regeneration. In addition, we find a burst of cancer-related genes during the first phase of regeneration and identify a set of genes previously not associated with the regeneration process.
PMCID: PMC3591270  PMID: 23505351
8.  Rbm20 regulates titin alternative splicing as a splicing repressor 
Nucleic Acids Research  2013;41(4):2659-2672.
Titin, a sarcomeric protein expressed primarily in striated muscles, is responsible for maintaining the structure and biomechanical properties of muscle cells. Cardiac titin undergoes developmental size reduction from 3.7 megadaltons in neonates to primarily 2.97 megadaltons in the adult. This size reduction results from gradually increased exon skipping between exons 50 and 219 of titin mRNA. Our previous study reported that Rbm20 is the splicing factor responsible for this process. In this work, we investigated its molecular mechanism. We demonstrate that Rbm20 mediates exon skipping by binding to titin pre-mRNA to repress the splicing of some regions; the exons/introns in these Rbm20-repressed regions are ultimately skipped. Rbm20 was also found to mediate intron retention and exon shuffling. The two Rbm20 speckles found in nuclei from muscle tissues were identified as aggregates of Rbm20 protein on the partially processed titin pre-mRNAs. Cooperative repression and alternative 3′ splice site selection were found to be used by Rbm20 to skip different subsets of titin exons, and the splicing pathway selected depended on the ratio of Rbm20 to other splicing factors that vary with tissue type and developmental age.
PMCID: PMC3575840  PMID: 23307558
9.  RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome 
BMC Bioinformatics  2011;12:323.
RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments.
We present RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene.
RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.
PMCID: PMC3163565  PMID: 21816040
10.  RNA-Seq gene expression estimation with read mapping uncertainty 
Bioinformatics  2009;26(4):493-500.
Motivation: RNA-Seq is a promising new technology for accurately measuring gene expression levels. Expression estimation with RNA-Seq requires the mapping of relatively short sequencing reads to a reference genome or transcript set. Because reads are generally shorter than transcripts from which they are derived, a single read may map to multiple genes and isoforms, complicating expression analyses. Previous computational methods either discard reads that map to multiple locations or allocate them to genes heuristically.
Results: We present a generative statistical model and associated inference methods that handle read mapping uncertainty in a principled manner. Through simulations parameterized by real RNA-Seq data, we show that our method is more accurate than previous methods. Our improved accuracy is the result of handling read mapping uncertainty with a statistical model and the estimation of gene expression levels as the sum of isoform expression levels. Unlike previous methods, our method is capable of modeling non-uniform read distributions. Simulations with our method indicate that a read length of 20–25 bases is optimal for gene-level expression estimation from mouse and maize RNA-Seq data when sequencing throughput is fixed.
Availability: An initial C++ implementation of our method that was used for the results presented in this article is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2820677  PMID: 20022975
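Illustrative note on entry 10: the abstract describes assigning multi-mapping reads with a statistical model instead of discarding them or allocating them heuristically. The sketch below shows that general idea only, and is not the authors' model. It is a bare expectation-maximization loop that splits each read across the transcripts it aligns to in proportion to the current abundance estimates. The function name, the input layout, and the iteration count are hypothetical, and the sketch ignores transcript length, read position, and sequencing error.

```python
def em_abundance(read_hits, n_transcripts, n_iter=200):
    """Estimate the fraction of reads originating from each transcript.

    read_hits[i] lists the transcript indices that read i aligns to
    (a hypothetical toy input; every read is assumed to have >= 1 hit).
    """
    # Start from a uniform guess over transcripts.
    theta = [1.0 / n_transcripts] * n_transcripts
    n_reads = len(read_hits)
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        # E-step: fractionally assign each read to its compatible
        # transcripts, weighted by the current abundance estimates.
        for hits in read_hits:
            total = sum(theta[t] for t in hits)
            for t in hits:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the expected counts.
        theta = [c / n_reads for c in counts]
    return theta
```

With two reads unique to transcript 0 and one read shared by transcripts 0 and 1, the shared read is progressively attributed to transcript 0, which is the behavior a heuristic equal split would miss.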
11.  Fine-Scale Phylogenetic Discordance across the House Mouse Genome 
PLoS Genetics  2009;5(11):e1000729.
Population genetic theory predicts discordance in the true phylogeny of different genomic regions when studying recently diverged species. Despite this expectation, genome-wide discordance in young species groups has rarely been statistically quantified. The house mouse subspecies group provides a model system for examining phylogenetic discordance. House mouse subspecies are recently derived, suggesting that even if there has been a simple tree-like population history, gene trees could disagree with the population history due to incomplete lineage sorting. Subspecies of house mice also hybridize in nature, raising the possibility that recent introgression might lead to additional phylogenetic discordance. Single-locus approaches have revealed support for conflicting topologies, resulting in a subspecies tree often summarized as a polytomy. To analyze phylogenetic histories on a genomic scale, we applied a recently developed method, Bayesian concordance analysis, to dense SNP data from three closely related subspecies of house mice: Mus musculus musculus, M. m. castaneus, and M. m. domesticus. We documented substantial variation in phylogenetic history across the genome. Although each of the three possible topologies was strongly supported by a large number of loci, there was statistical evidence for a primary phylogenetic history in which M. m. musculus and M. m. castaneus are sister subspecies. These results underscore the importance of measuring phylogenetic discordance in other recently diverged groups using methods such as Bayesian concordance analysis, which are designed for this purpose.
Author Summary
The phylogenetic history of individual genes can differ strongly from the species history if taxa are recently derived, making inferences of a species history from only a handful of genes especially difficult in these cases. Genome-scale data sets now allow phylogenetic histories to be reconstructed from a large number of genes. Although data sets of this size are becoming more common, few studies have characterized variation in phylogenetic history across whole genomes. We summarize fine scale variation in phylogenetic history across the genome of house mice, a recently derived group of subspecies, using a method that combines phylogenetic uncertainty among gene trees. We document substantial variation in phylogenetic history among 14,081 loci and describe a primary history in the face of this variation. These results support the use of genome-scale datasets and methods that accommodate phylogenetic discordance in attempts to reconstruct the history of closely related groups.
PMCID: PMC2770633  PMID: 19936022
12.  Population Genomics: Whole-Genome Analysis of Polymorphism and Divergence in Drosophila simulans 
PLoS Biology  2007;5(11):e310.
The population genetic perspective is that the processes shaping genomic variation can be revealed only through simultaneous investigation of sequence polymorphism and divergence within and between closely related species. Here we present a population genetic analysis of Drosophila simulans based on whole-genome shotgun sequencing of multiple inbred lines and comparison of the resulting data to genome assemblies of the closely related species, D. melanogaster and D. yakuba. We discovered previously unknown, large-scale fluctuations of polymorphism and divergence along chromosome arms, and significantly less polymorphism and faster divergence on the X chromosome. We generated a comprehensive list of functional elements in the D. simulans genome influenced by adaptive evolution. Finally, we characterized genomic patterns of base composition for coding and noncoding sequence. These results suggest several new hypotheses regarding the genetic and biological mechanisms controlling polymorphism and divergence across the Drosophila genome, and provide a rich resource for the investigation of adaptive evolution and functional variation in D. simulans.
Author Summary
Population genomics, the study of genome-wide patterns of sequence variation within and between closely related species, can provide a comprehensive view of the relative importance of mutation, recombination, natural selection, and genetic drift in evolution. It can also provide fundamental insights into the biological attributes of organisms that are specifically shaped by adaptive evolution. One approach for generating population genomic datasets is to align DNA sequences from whole-genome shotgun projects to a standard reference sequence. We used this approach to carry out whole-genome analysis of polymorphism and divergence in Drosophila simulans, a close relative of the model system, D. melanogaster. We find that polymorphism and divergence fluctuate on a large scale across the genome and that these fluctuations are probably explained by natural selection rather than by variation in mutation rates. Our analysis suggests that adaptive protein evolution is common and is often related to biological processes that may be associated with gene expression, chromosome biology, and reproduction. The approaches presented here will have broad applicability to future analysis of population genomic variation in other systems, including humans.
Low-coverage genome sequences from multiple Drosophila simulans strains provide the first comprehensive view of polymorphism and divergence in the fruit fly.
PMCID: PMC2062478  PMID: 17988176
13.  Compensatory relationship between splice sites and exonic splicing signals depending on the length of vertebrate introns 
BMC Genomics  2006;7:311.
The signals that determine the specificity and efficiency of splicing are multiple and complex, and are not fully understood. Among other factors, the relative contributions of different mechanisms appear to depend on intron size inasmuch as long introns might hinder the activity of the spliceosome through interference with the proper positioning of the intron-exon junctions. Indeed, it has been shown that the information content of splice sites positively correlates with intron length in the nematode, Drosophila, and fungi. We explored the connections between the length of vertebrate introns, the strength of splice sites, exonic splicing signals, and evolution of flanking exons.
A compensatory relationship is shown to exist between different types of signals, namely, the splice sites and the exonic splicing enhancers (ESEs). In the range of relatively short introns (approximately, < 1.5 kilobases in length), the enhancement of the splicing signals for longer introns was manifest in the increased concentration of ESEs. In contrast, for longer introns, this effect was not detectable, and instead, an increase in the strength of the donor and acceptor splice sites was observed. Conceivably, accumulation of A-rich ESE motifs beyond a certain limit is incompatible with functional constraints operating at the level of protein sequence evolution, which leads to compensation in the form of evolution of the splice sites themselves toward greater strength. In addition, however, a correlation between sequence conservation in the exon ends and intron length, particularly, in synonymous positions, was observed throughout the entire length range of introns. Thus, splicing signals other than the currently defined ESEs, i.e., potential new classes of ESEs, might exist in exon sequences, particularly, those that flank long introns.
Several weak but statistically significant correlations were observed between vertebrate intron length, splice site strength, and potential exonic splicing signals. Taken together, these findings attest to a compensatory relationship between splice sites and exonic splicing signals, depending on intron length.
PMCID: PMC1713244  PMID: 17156453
14.  Parametric Alignment of Drosophila Genomes 
PLoS Computational Biology  2006;2(6):e73.
The classic algorithms of Needleman–Wunsch and Smith–Waterman find a maximum a posteriori probability alignment for a pair hidden Markov model (PHMM). To process large genomes that have undergone complex genome rearrangements, almost all existing whole genome alignment methods apply fast heuristics to divide genomes into small pieces that are suitable for Needleman–Wunsch alignment. In these alignment methods, it is standard practice to fix the parameters and to produce a single alignment for subsequent analysis by biologists. As the number of alignment programs applied on a whole genome scale continues to increase, so does the disagreement in their results. The alignments produced by different programs vary greatly, especially in non-coding regions of eukaryotic genomes where the biologically correct alignment is hard to find. Parametric alignment is one possible remedy. This methodology resolves the issue of robustness to changes in parameters by finding all optimal alignments for all possible parameters in a PHMM. Our main result is the construction of a whole genome parametric alignment of Drosophila melanogaster and Drosophila pseudoobscura. This alignment draws on existing heuristics for dividing whole genomes into small pieces for alignment, and it relies on advances we have made in computing convex polytopes that allow us to parametrically align non-coding regions using biologically realistic models. We demonstrate the utility of our parametric alignment for biological inference by showing that cis-regulatory elements are more conserved between Drosophila melanogaster and Drosophila pseudoobscura than previously thought. We also show how whole genome parametric alignment can be used to quantitatively assess the dependence of branch length estimates on alignment parameters.
Dewey and colleagues describe a parametric alignment of the genomes of Drosophila melanogaster and Drosophila pseudoobscura. The parametric alignment consists of all optimal alignments of the two Drosophila genomes for all choices of parameters for some widely used scoring schemes. Computation and analysis of the parametric alignment requires the integration of ideas from mathematics, algorithms, and biology. Mathematically, the parametric analysis rests on the geometric principle of convexity. In particular, the alignment polytope, which organizes the alignments according to the optimal alignments, is introduced and described. Algorithmically, efficient procedures are developed for computing alignment polytopes on a large scale and for models with more parameters than had previously been practical. Biologically, the utility of parametric analysis is demonstrated by showing that the degree of conservation between cis-regulatory elements in Drosophila melanogaster and Drosophila pseudoobscura is higher than previously thought, and by assessing the dependence of branch length estimates on alignment parameters.
PMCID: PMC1480539  PMID: 16789815
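Illustrative note on entry 14: the abstract contrasts the usual practice of fixing alignment parameters with parametric alignment, which considers all parameter choices. The sketch below is only the standard Needleman-Wunsch global alignment score with a linear gap penalty, written so that the scoring parameters are explicit arguments. It is not the polytope-based method of the paper, and the function name and default parameter values are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: (mis)match, gap in b, gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Calling the function on the same pair of sequences with different gap penalties gives different optimal scores, which is the parameter dependence that a parametric alignment is meant to characterize for all settings at once.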
