Results 1-23 (23)
 

1.  StringTie enables improved reconstruction of a transcriptome from RNA-seq reads 
Nature biotechnology  2015;33(3):290-295.
Methods used to sequence the transcriptome often produce more than 200 million short sequences. We introduce StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts. When used to analyze both simulated and real data sets, StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript assembly programs including Cufflinks, IsoLasso, Scripture and Traph. For example, on 90 million reads from human blood, StringTie correctly assembled 10,990 transcripts, whereas the next best assembly was of 7,187 transcripts by Cufflinks, which is a 53% increase in transcripts assembled. On a simulated data set, StringTie correctly assembled 7,559 transcripts, which is 20% more than the 6,310 assembled by Cufflinks. As well as producing a more complete transcriptome assembly, StringTie runs faster on all data sets tested to date compared with other assembly software, including Cufflinks.
doi:10.1038/nbt.3122
PMCID: PMC4643835  PMID: 25690850
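The percentage gains quoted in the abstract above can be checked from the transcript counts it reports. The following is a minimal sketch; the helper name `percent_increase` is hypothetical and introduced only for illustration.

```python
def percent_increase(new: int, baseline: int) -> float:
    """Relative increase of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Transcript counts taken from the abstract (StringTie vs. Cufflinks).
real_data_gain = percent_increase(10990, 7187)   # human blood reads
simulated_gain = percent_increase(7559, 6310)    # simulated data set

print(f"real data: {real_data_gain:.1f}%, simulated: {simulated_gain:.1f}%")
```

Rounded to whole percentages, both values agree with the figures stated in the abstract.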
2.  Broad CTL response is required to clear latent HIV-1 due to dominance of escape mutations 
Nature  2015;517(7534):381-385.
Despite antiretroviral therapy (ART), HIV-1 persists in a stable latent reservoir [1, 2], primarily in resting memory CD4+ T cells [3, 4]. This reservoir presents a major barrier to the cure of HIV-1 infection. To purge the reservoir, pharmacological reactivation of latent HIV-1 has been proposed [5] and tested both in vitro and in vivo [6–8]. A key remaining question is whether virus-specific immune mechanisms including cytolytic T lymphocytes (CTL) can clear infected cells in ART-treated patients after latency is reversed. Here we show that there is a striking all or none pattern for CTL escape mutations in HIV-1 Gag epitopes. Unless ART is started early, the vast majority (>98%) of latent viruses carry CTL escape mutations that render infected cells insensitive to CTLs directed at common epitopes. To solve this problem, we identified CTLs that could recognize epitopes from latent HIV-1 that were unmutated in every chronically infected patient tested. Upon stimulation, these CTLs eliminated target cells infected with autologous virus derived from the latent reservoir, both in vitro and in patient-derived humanized mice. The predominance of CTL-resistant viruses in the latent reservoir poses a major challenge to viral eradication. Our results demonstrate that chronically infected patients retain a broad spectrum viral-specific CTL response and that appropriate boosting of this response may be required for the elimination of the latent reservoir.
doi:10.1038/nature14053
PMCID: PMC4406054  PMID: 25561180
3.  Diamund: Direct comparison of genomes to detect mutations 
Human mutation  2014;35(3):283-288.
DNA sequencing has become a powerful method to discover the genetic basis of disease. Standard, widely-used protocols for analysis usually begin by comparing each individual to the human reference genome. When applied to a set of related individuals, this approach reveals millions of differences, most of which are shared among the individuals and unrelated to the disease being investigated. We have developed a novel algorithm for variant detection, one that compares DNA sequences directly to one another, without aligning them to the reference genome. When used to find de novo mutations in exome sequences from family trios, or to compare normal and diseased samples from the same individual, the new method, Diamund, produces a dramatically smaller list of candidate mutations than previous methods, without losing sensitivity to detect the true cause of a genetic disease. We demonstrate our results on several example cases, including two family trios in which it correctly found the disease-causing variant while excluding thousands of harmless variants that standard methods had identified.
doi:10.1002/humu.22503
PMCID: PMC4031744  PMID: 24375697
variant detection; computational biology; bioinformatics; exome sequencing; sequence alignment
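The idea of comparing sequences directly to one another, rather than to a reference genome, can be illustrated with a toy example for a family trio. The sketch below assumes a k-mer set comparison, which is one simple way to do this and is not claimed to be the tool's actual implementation. The function names, the value of `k`, and the sequences are all hypothetical, and real data would also require handling of reverse complements and sequencing errors.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def child_specific_kmers(child_reads, mother_reads, father_reads, k=5):
    """k-mers seen in the child but in neither parent (toy de novo candidates)."""
    parental = set()
    for read in list(mother_reads) + list(father_reads):
        parental |= kmers(read, k)
    child = set()
    for read in child_reads:
        child |= kmers(read, k)
    return child - parental


# Hypothetical reads: the child differs from both parents at one base (A -> T).
mother = ["ACGTACGGA"]
father = ["ACGTACGGA"]
child = ["ACGTTCGGA"]

candidates = child_specific_kmers(child, mother, father, k=5)
print(sorted(candidates))
```

Only k-mers overlapping the changed base survive the subtraction, so sequence shared by all three individuals drops out without any reference alignment.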
4.  DIAMUND: Direct Comparison of Genomes to Detect Mutations 
Human Mutation  2013;35(3):283-288.
DNA sequencing has become a powerful method to discover the genetic basis of disease. Standard, widely used protocols for analysis usually begin by comparing each individual to the human reference genome. When applied to a set of related individuals, this approach reveals millions of differences, most of which are shared among the individuals and unrelated to the disease being investigated. We have developed a novel algorithm for variant detection, one that compares DNA sequences directly to one another, without aligning them to the reference genome. When used to find de novo mutations in exome sequences from family trios, or to compare normal and diseased samples from the same individual, the new method, direct alignment for mutation discovery (DIAMUND), produces a dramatically smaller list of candidate mutations than previous methods, without losing sensitivity to detect the true cause of a genetic disease. We demonstrate our results on several example cases, including two family trios in which it correctly found the disease-causing variant while excluding thousands of harmless variants that standard methods had identified.
doi:10.1002/humu.22503
PMCID: PMC4031744  PMID: 24375697
variant detection; computational biology; bioinformatics; exome sequencing; sequence alignment
5.  Genome sequence of the human malaria parasite Plasmodium falciparum 
Nature  2002;419(6906):498-511.
The parasite Plasmodium falciparum is responsible for hundreds of millions of cases of malaria, and kills more than one million African children annually. Here we report an analysis of the genome sequence of P. falciparum clone 3D7. The 23-megabase nuclear genome consists of 14 chromosomes, encodes about 5,300 genes, and is the most (A + T)-rich genome sequenced to date. Genes involved in antigenic variation are concentrated in the subtelomeric regions of the chromosomes. Compared to the genomes of free-living eukaryotic microbes, the genome of this intracellular parasite encodes fewer enzymes and transporters, but a large proportion of genes are devoted to immune evasion and host–parasite interactions. Many nuclear-encoded proteins are targeted to the apicoplast, an organelle involved in fatty-acid and isoprenoid metabolism. The genome sequence provides the foundation for future studies of this organism, and is being exploited in the search for new drugs and vaccines to fight malaria.
doi:10.1038/nature01097
PMCID: PMC3836256  PMID: 12368864
6.  The Genome of the Basidiomycetous Yeast and Human Pathogen Cryptococcus neoformans 
Science (New York, N.Y.)  2005;307(5713):1321-1324.
Cryptococcus neoformans is a basidiomycetous yeast ubiquitous in the environment, a model for fungal pathogenesis, and an opportunistic human pathogen of global importance. We have sequenced its ~20-megabase genome, which contains ~6500 intron-rich gene structures and encodes a transcriptome abundant in alternatively spliced and antisense messages. The genome is rich in transposons, many of which cluster at candidate centromeric regions. The presence of these transposons may drive karyotype instability and phenotypic variation. C. neoformans encodes unique genes that may contribute to its unusual virulence properties, and comparison of two phenotypically distinct strains reveals variation in gene content in addition to sequence polymorphisms between the genomes.
doi:10.1126/science.1103773
PMCID: PMC3520129  PMID: 15653466
7.  The Human Transcriptome: An Unfinished Story 
Genes  2012;3(3):344-360.
Despite recent technological advances, the study of the human transcriptome is still in its early stages. Here we provide an overview of the complex human transcriptomic landscape, present the bioinformatics challenges posed by the vast quantities of transcriptomic data, and discuss some of the studies that have tried to determine how much of the human genome is transcribed. Recent evidence has suggested that more than 90% of the human genome is transcribed into RNA. However, this view has been strongly contested by groups of scientists who argued that many of the observed transcripts are simply the result of transcriptional noise. In this review, we conclude that the full extent of transcription remains an open question that will not be fully addressed until we decipher the complete range and biological diversity of the transcribed genomic sequences.
doi:10.3390/genes3030344
PMCID: PMC3422666  PMID: 22916334
transcriptome; pervasive transcription; RNA-seq; mRNA; ncRNA
8.  The Human Transcriptome: An Unfinished Story  
Genes  2012;3(3):344-360.
Despite recent technological advances, the study of the human transcriptome is still in its early stages. Here we provide an overview of the complex human transcriptomic landscape, present the bioinformatics challenges posed by the vast quantities of transcriptomic data, and discuss some of the studies that have tried to determine how much of the human genome is transcribed. Recent evidence has suggested that more than 90% of the human genome is transcribed into RNA. However, this view has been strongly contested by groups of scientists who argued that many of the observed transcripts are simply the result of transcriptional noise. In this review, we conclude that the full extent of transcription remains an open question that will not be fully addressed until we decipher the complete range and biological diversity of the transcribed genomic sequences.
doi:10.3390/genes3030344
PMCID: PMC3422666  PMID: 22916334
transcriptome; pervasive transcription; RNA-seq; mRNA; ncRNA
9.  Detection of lineage-specific evolutionary changes among primate species 
BMC Bioinformatics  2011;12:274.
Background
Comparison of the human genome with other primates offers the opportunity to detect evolutionary events that created the diverse phenotypes among the primate species. Because the primate genomes are highly similar to one another, methods developed for analysis of more divergent species do not always detect signs of evolutionary selection.
Results
We have developed a new method, called DivE, specifically designed to find regions that have evolved either more or less rapidly than expected, for any clade within a set of very closely related species. Unlike some previous methods, DivE does not rely on rates of synonymous and nonsynonymous substitution, which enables it to detect evolutionary events in noncoding regions. We demonstrate using simulated data that DivE compares favorably to alternative methods, and we then apply DivE to the ENCODE regions in 14 primate species. We identify thousands of regions in these primates, ranging from 50 to >10000 bp in length, that appear to have experienced either constrained or accelerated rates of evolution. In particular, we detected 4942 regions that have potentially undergone positive selection in one or more primate species. Most of these regions occur outside of protein-coding genes, although we identified 20 proteins that have experienced positive selection.
Conclusions
DivE provides an easy-to-use method to predict both positive and negative selection in noncoding DNA, that is particularly well-suited to detecting lineage-specific selection in large genomes.
doi:10.1186/1471-2105-12-274
PMCID: PMC3143108  PMID: 21726447
10.  Between a chicken and a grape: estimating the number of human genes 
Genome Biology  2010;11(5):206.
The number of genes in the human genome is still an estimate.
Many people expected the question 'How many genes in the human genome?' to be resolved with the publication of the genome sequence in 2001, but estimates continue to fluctuate.
doi:10.1186/gb-2010-11-5-206
PMCID: PMC2898077  PMID: 20441615
11.  Between a chicken and a grape: estimating the number of human genes 
Genome Biology  2010;11(Suppl 1):I1.
doi:10.1186/gb-2010-11-s1-i1
PMCID: PMC3026201  PMID: 21134298
12.  Do-it-yourself genetic testing 
Genome Biology  2010;11(10):404.
We developed a computational screen that tests an individual's genome for mutations in the BRCA genes, despite the fact that both are currently protected by patents.
doi:10.1186/gb-2010-11-10-404
PMCID: PMC3218655  PMID: 20932271
13.  The genome of the blood fluke Schistosoma mansoni 
Nature  2009;460(7253):352-358.
Schistosoma mansoni is responsible for the neglected tropical disease schistosomiasis that affects 210 million people in 76 countries. We report here analysis of the 363 megabase nuclear genome of the blood fluke. It encodes at least 11,809 genes, with an unusual intron size distribution, and novel families of micro-exon genes that undergo frequent alternate splicing. As the first sequenced flatworm, and a representative of the lophotrochozoa, it offers insights into early events in the evolution of the animals, including the development of a body pattern with bilateral symmetry, and the development of tissues into organs. Our analysis has been informed by the need to find new drug targets. The deficits in lipid metabolism that make schistosomes dependent on the host are revealed, while the identification of membrane receptors, ion channels and more than 300 proteases provides new insights into the biology of the life cycle and novel targets. Bioinformatics approaches have identified metabolic chokepoints while a chemogenomic screen has pinpointed schistosome proteins for which existing drugs may be active. The information generated provides an invaluable resource for the research community to develop much needed new control tools for the treatment and eradication of this important and neglected disease.
doi:10.1038/nature08160
PMCID: PMC2756445  PMID: 19606141
14.  Sim4cc: a cross-species spliced alignment program 
Nucleic Acids Research  2009;37(11):e80.
Advances in sequencing technologies have accelerated the sequencing of new genomes, far outpacing the generation of gene and protein resources needed to annotate them. Direct comparison and alignment of existing cDNA sequences from a related species is an effective and readily available means to determine genes in the new genomes. Current spliced alignment programs are inadequate for comparing sequences between different species, owing to their low sensitivity and splice junction accuracy. A new spliced alignment tool, sim4cc, overcomes problems in the earlier tools by incorporating three new features: universal spaced seeds, to increase sensitivity and allow comparisons between species at various evolutionary distances, and powerful splice signal models and evolutionarily-aware alignment techniques, to improve the accuracy of gene models. When tested on vertebrate comparisons at diverse evolutionary distances, sim4cc had significantly higher sensitivity compared to existing alignment programs, more than 10% higher than the closest competitor for some comparisons, while being comparable in speed to its predecessor, sim4. Sim4cc can be used in one-to-one or one-to-many comparisons of genomic and cDNA sequences, and can also be effectively incorporated into a high-throughput annotation engine, as demonstrated by the mapping of 64 000 Fagus grandifolia 454 ESTs and unigenes to the poplar genome.
doi:10.1093/nar/gkp319
PMCID: PMC2699533  PMID: 19429899
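The spaced seeds mentioned in the abstract above can be sketched as follows. A spaced seed is a pattern in which some positions must match ("1") and others are ignored ("0"), so a seed hit can survive a substitution at an ignored position. This is a minimal sketch of the general technique only: the pattern "1101", the function names, and the sequences are hypothetical and are not the seeds or code used by sim4cc.

```python
from collections import defaultdict


def masked_key(seq: str, start: int, pattern: str) -> str:
    """Characters of seq under the '1' positions of pattern, starting at start."""
    return "".join(seq[start + i] for i, bit in enumerate(pattern) if bit == "1")


def seed_hits(query: str, target: str, pattern: str = "1101"):
    """Return (query_pos, target_pos) pairs whose masked keys are identical."""
    span = len(pattern)
    # Index every window of the target by its masked key.
    index = defaultdict(list)
    for t in range(len(target) - span + 1):
        index[masked_key(target, t, pattern)].append(t)
    # Look up every window of the query in that index.
    hits = []
    for q in range(len(query) - span + 1):
        for t in index.get(masked_key(query, q, pattern), []):
            hits.append((q, t))
    return hits


# The two sequences differ at the third base, which the spaced pattern ignores.
spaced = seed_hits("ACGT", "ACTT", pattern="1101")
contiguous = seed_hits("ACGT", "ACTT", pattern="1111")
print(spaced, contiguous)
```

The spaced pattern reports a hit where the contiguous pattern of the same span reports none, which is the sensitivity gain that makes spaced seeds attractive for comparing sequences from diverged species.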
15.  Draft Genome of the Filarial Nematode Parasite Brugia malayi 
Science (New York, N.Y.)  2007;317(5845):1756-1760.
Parasitic nematodes that cause elephantiasis and river blindness threaten hundreds of millions of people in the developing world. We have sequenced the ~90 megabase (Mb) genome of the human filarial parasite Brugia malayi and predict ~11,500 protein coding genes in 71 Mb of robustly assembled sequence. Comparative analysis with the free-living, model nematode Caenorhabditis elegans revealed that, despite these genes having maintained little conservation of local synteny during ~350 million years of evolution, they largely remain in linkage on chromosomal units. More than 100 conserved operons were identified. Analysis of the predicted proteome provides evidence for adaptations of B. malayi to niches in its human and vector hosts and insights into the molecular basis of a mutualistic relationship with its Wolbachia endosymbiont. These findings offer a foundation for rational drug design.
doi:10.1126/science.1145406
PMCID: PMC2613796  PMID: 17885136
16.  OperonDB: a comprehensive database of predicted operons in microbial genomes 
Nucleic Acids Research  2008;37(Database issue):D479-D482.
The fast pace of bacterial genome sequencing and the resulting dependence on highly automated annotation methods has driven the development of many genome-wide analysis tools. OperonDB, first released in 2001, is a database containing the results of a computational algorithm for locating operon structures in microbial genomes. OperonDB has grown from 34 genomes in its initial release to more than 500 genomes today. In addition to increasing the size of the database, we have re-designed our operon finding algorithm and improved its accuracy. The new database is updated regularly as additional genomes become available in public archives. OperonDB can be accessed at: http://operondb.cbcb.umd.edu
doi:10.1093/nar/gkn784
PMCID: PMC2686487  PMID: 18948284
17.  Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments 
Genome Biology  2008;9(1):R7.
EVidenceModeler (EVM) is an automated annotation tool that predicts protein-coding regions, alternatively spliced transcripts and untranslated regions of eukaryotic genes.
EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation.
doi:10.1186/gb-2008-9-1-r7
PMCID: PMC2395244  PMID: 18190707
18.  Draft Genome Sequence of the Sexually Transmitted Pathogen Trichomonas vaginalis 
Science (New York, N.Y.)  2007;315(5809):207-212.
We describe the genome sequence of the protist Trichomonas vaginalis, a sexually transmitted human pathogen. Repeats and transposable elements comprise about two-thirds of the ~160-megabase genome, reflecting a recent massive expansion of genetic material. This expansion, in conjunction with the shaping of metabolic pathways that likely transpired through lateral gene transfer from bacteria, and amplification of specific gene families implicated in pathogenesis and phagocytosis of host proteins may exemplify adaptations of the parasite during its transition to a urogenital environment. The genome sequence predicts previously unknown functions for the hydrogenosome, which support a common evolutionary origin of this unusual organelle with mitochondria.
doi:10.1126/science.1132894
PMCID: PMC2080659  PMID: 17218520
19.  A computational survey of candidate exonic splicing enhancer motifs in the model plant Arabidopsis thaliana 
BMC Bioinformatics  2007;8:159.
Background
Algorithmic approaches to splice site prediction have relied mainly on the consensus patterns found at the boundaries between protein coding and non-coding regions. However, exonic splicing enhancers have been shown to enhance the utilization of nearby splice sites.
Results
We have developed a new computational technique to identify significantly conserved motifs involved in splice site regulation. First, 84 putative exonic splicing enhancer hexamers are identified in Arabidopsis thaliana. Then a Gibbs sampling program called ELPH was used to locate conserved motifs represented by these hexamers in exonic regions near splice sites in confirmed genes. Oligomers containing 35 of these motifs have been shown experimentally to induce significant inclusion of A. thaliana exons. Second, integration of our regulatory motifs into two different splice site recognition programs significantly improved the ability of the software to correctly predict splice sites in a large database of confirmed genes. We have released GeneSplicerESE, the improved splice site recognition code, as open source software.
Conclusion
Our results show that the use of the ESE motifs consistently improves splice site prediction accuracy.
doi:10.1186/1471-2105-8-159
PMCID: PMC1892810  PMID: 17517127
20.  JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions 
Genome Biology  2006;7(Suppl 1):S9.
Background
Predicting complete protein-coding genes in human DNA remains a significant challenge. Though a number of promising approaches have been investigated, an ideal suite of tools has yet to emerge that can provide near perfect levels of sensitivity and specificity at the level of whole genes. As an incremental step in this direction, it is hoped that controlled gene finding experiments in the ENCODE regions will provide a more accurate view of the relative benefits of different strategies for modeling and predicting gene structures.
Results
Here we describe our general-purpose eukaryotic gene finding pipeline and its major components, as well as the methodological adaptations that we found necessary in accommodating human DNA in our pipeline, noting that a similar level of effort may be necessary by ourselves and others with similar pipelines whenever a new class of genomes is presented to the community for analysis. We also describe a number of controlled experiments involving the differential inclusion of various types of evidence and feature states into our models and the resulting impact these variations have had on predictive accuracy.
Conclusion
While in the case of the non-comparative gene finders we found that adding model states to represent specific biological features did little to enhance predictive accuracy, for our evidence-based 'combiner' program the incorporation of additional evidence tracks tended to produce significant gains in accuracy for most evidence types, suggesting that improved modeling efforts at the hidden Markov model level are of relatively little value. We relate these findings to our current plans for future research.
doi:10.1186/gb-2006-7-s1-s9
PMCID: PMC1810558  PMID: 16925843
21.  Efficient decoding algorithms for generalized hidden Markov model gene finders 
BMC Bioinformatics  2005;6:16.
Background
The Generalized Hidden Markov Model (GHMM) has proven a useful framework for the task of computational gene prediction in eukaryotic genomes, due to its flexibility and probabilistic underpinnings. As the focus of the gene finding community shifts toward the use of homology information to improve prediction accuracy, extensions to the basic GHMM model are being explored as possible ways to integrate this homology information into the prediction process. Particularly prominent among these extensions are those techniques which call for the simultaneous prediction of genes in two or more genomes at once, thereby increasing significantly the computational cost of prediction and highlighting the importance of speed and memory efficiency in the implementation of the underlying GHMM algorithms. Unfortunately, the task of implementing an efficient GHMM-based gene finder is already a nontrivial one, and it can be expected that this task will only grow more onerous as our models increase in complexity.
Results
As a first step toward addressing the implementation challenges of these next-generation systems, we describe in detail two software architectures for GHMM-based gene finders, one comprising the common array-based approach, and the other a highly optimized algorithm which requires significantly less memory while achieving virtually identical speed. We then show how both of these architectures can be accelerated by a factor of two by optimizing their content sensors. We finish with a brief illustration of the impact these optimizations have had on the feasibility of our new homology-based gene finder, TWAIN.
Conclusions
In describing a number of optimizations for GHMM-based gene finders and making available two complete open-source software systems embodying these methods, it is our hope that others will be more enabled to explore promising extensions to the GHMM framework, thereby improving the state-of-the-art in gene prediction techniques.
doi:10.1186/1471-2105-6-16
PMCID: PMC552317  PMID: 15667658
22.  GlimmerM, Exonomy and Unveil: three ab initio eukaryotic genefinders 
Nucleic Acids Research  2003;31(13):3601-3604.
We present three programs for ab initio gene prediction in eukaryotes: Exonomy, Unveil and GlimmerM. Exonomy is a 23-state Generalized Hidden Markov Model (GHMM), Unveil is a 283-state standard Hidden Markov Model (HMM) and GlimmerM is a previously-described genefinder which utilizes decision trees and Interpolated Markov Models (IMMs). All three are readily re-trainable for new organisms and have been found to perform well compared to other genefinders. Results are presented for Arabidopsis thaliana. Cases have been found where each of the genefinders outperforms each of the others, demonstrating the collective value of this ensemble of genefinders. These programs are all accessible through webservers at http://www.tigr.org/software.
PMCID: PMC168934  PMID: 12824375
23.  GeneSplicer: a new computational method for splice site prediction 
Nucleic Acids Research  2001;29(5):1185-1190.
GeneSplicer is a new, flexible system for detecting splice sites in the genomic DNA of various eukaryotes. The system has been tested successfully using DNA from two reference organisms: the model plant Arabidopsis thaliana and human. It was compared to six programs representing the leading splice site detectors for each of these species: NetPlantGene, NetGene2, HSPL, NNSplice, GENIO and SpliceView. In each case GeneSplicer performed comparably to the best alternative, in terms of both accuracy and computational efficiency.
PMCID: PMC29713  PMID: 11222768
