Cryptococcus neoformans is a basidiomycetous yeast ubiquitous in the environment, a model for fungal pathogenesis, and an opportunistic human pathogen of global importance. We have sequenced its ~20-megabase genome, which contains ~6500 intron-rich gene structures and encodes a transcriptome abundant in alternatively spliced and antisense messages. The genome is rich in transposons, many of which cluster at candidate centromeric regions. The presence of these transposons may drive karyotype instability and phenotypic variation. C. neoformans encodes unique genes that may contribute to its unusual virulence properties, and comparison of two phenotypically distinct strains reveals variation in gene content in addition to sequence polymorphisms between the genomes.
Despite recent technological advances, the study of the human transcriptome is still in its early stages. Here we provide an overview of the complex human transcriptomic landscape, present the bioinformatics challenges posed by the vast quantities of transcriptomic data, and discuss some of the studies that have tried to determine how much of the human genome is transcribed. Recent evidence has suggested that more than 90% of the human genome is transcribed into RNA. However, this view has been strongly contested by groups of scientists who argued that many of the observed transcripts are simply the result of transcriptional noise. In this review, we conclude that the full extent of transcription remains an open question that will not be fully addressed until we decipher the complete range and biological diversity of the transcribed genomic sequences.
transcriptome; pervasive transcription; RNA-seq; mRNA; ncRNA
Comparison of the human genome with other primates offers the opportunity to detect evolutionary events that created the diverse phenotypes among the primate species. Because the primate genomes are highly similar to one another, methods developed for analysis of more divergent species do not always detect signs of evolutionary selection.
We have developed a new method, called DivE, specifically designed to find regions that have evolved either more or less rapidly than expected, for any clade within a set of very closely related species. Unlike some previous methods, DivE does not rely on rates of synonymous and nonsynonymous substitution, which enables it to detect evolutionary events in noncoding regions. We demonstrate using simulated data that DivE compares favorably to alternative methods, and we then apply DivE to the ENCODE regions in 14 primate species. We identify thousands of regions in these primates, ranging from 50 to >10000 bp in length, that appear to have experienced either constrained or accelerated rates of evolution. In particular, we detected 4942 regions that have potentially undergone positive selection in one or more primate species. Most of these regions occur outside of protein-coding genes, although we identified 20 proteins that have experienced positive selection.
DivE provides an easy-to-use method to predict both positive and negative selection in noncoding DNA, that is particularly well-suited to detecting lineage-specific selection in large genomes.
The number of genes in the human genome is still an estimate.
Many people expected the question 'How many genes in the human genome?' to be resolved with the publication of the genome sequence in 2001, but estimates continue to fluctuate.
We developed a computational screen that tests an individual's genome for mutations in the BRCA genes, despite the fact that both are currently protected by patents.
Schistosoma mansoni is responsible for the neglected tropical disease schistosomiasis that affects 210 million people in 76 countries. We report here analysis of the 363 megabase nuclear genome of the blood fluke. It encodes at least 11,809 genes, with an unusual intron size distribution, and novel families of micro-exon genes that undergo frequent alternate splicing. As the first sequenced flatworm, and a representative of the lophotrochozoa, it offers insights into early events in the evolution of the animals, including the development of a body pattern with bilateral symmetry, and the development of tissues into organs. Our analysis has been informed by the need to find new drug targets. The deficits in lipid metabolism that make schistosomes dependent on the host are revealed, while the identification of membrane receptors, ion channels and more than 300 proteases, provide new insights into the biology of the life cycle and novel targets. Bioinformatics approaches have identified metabolic chokepoints while a chemogenomic screen has pinpointed schistosome proteins for which existing drugs may be active. The information generated provides an invaluable resource for the research community to develop much needed new control tools for the treatment and eradication of this important and neglected disease.
Advances in sequencing technologies have accelerated the sequencing of new genomes, far outpacing the generation of gene and protein resources needed to annotate them. Direct comparison and alignment of existing cDNA sequences from a related species is an effective and readily available means to determine genes in the new genomes. Current spliced alignment programs are inadequate for comparing sequences between different species, owing to their low sensitivity and splice junction accuracy. A new spliced alignment tool, sim4cc, overcomes problems in the earlier tools by incorporating three new features: universal spaced seeds, to increase sensitivity and allow comparisons between species at various evolutionary distances, and powerful splice signal models and evolutionarily-aware alignment techniques, to improve the accuracy of gene models. When tested on vertebrate comparisons at diverse evolutionary distances, sim4cc had significantly higher sensitivity compared to existing alignment programs, more than 10% higher than the closest competitor for some comparisons, while being comparable in speed to its predecessor, sim4. Sim4cc can be used in one-to-one or one-to-many comparisons of genomic and cDNA sequences, and can also be effectively incorporated into a high-throughput annotation engine, as demonstrated by the mapping of 64 000 Fagus grandifolia 454 ESTs and unigenes to the poplar genome.
Parasitic nematodes that cause elephantiasis and river blindness threaten hundreds of millions of people in the developing world. We have sequenced the ~90 megabase (Mb) genome of the human filarial parasite Brugia malayi and predict ~11,500 protein coding genes in 71 Mb of robustly assembled sequence. Comparative analysis with the free-living, model nematode Caenorhabditis elegans revealed that, despite these genes having maintained little conservation of local synteny during ~350 million years of evolution, they largely remain in linkage on chromosomal units. More than 100 conserved operons were identified. Analysis of the predicted proteome provides evidence for adaptations of B. malayi to niches in its human and vector hosts and insights into the molecular basis of a mutualistic relationship with its Wolbachia endosymbiont. These findings offer a foundation for rational drug design.
The fast pace of bacterial genome sequencing and the resulting dependence on highly automated annotation methods has driven the development of many genome-wide analysis tools. OperonDB, first released in 2001, is a database containing the results of a computational algorithm for locating operon structures in microbial genomes. OperonDB has grown from 34 genomes in its initial release to more than 500 genomes today. In addition to increasing the size of the database, we have re-designed our operon finding algorithm and improved its accuracy. The new database is updated regularly as additional genomes become available in public archives. OperonDB can be accessed at: http://operondb.cbcb.umd.edu
EVidenceModeler (EVM) is an automated annotation tool that predicts protein-coding regions, alternatively spliced transcripts and untranslated regions of eukaryotic genes.
EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation.
We describe the genome sequence of the protist Trichomonas vaginalis, a sexually transmitted human pathogen. Repeats and transposable elements comprise about two-thirds of the ~160-megabase genome, reflecting a recent massive expansion of genetic material. This expansion, in conjunction with the shaping of metabolic pathways that likely transpired through lateral gene transfer from bacteria, and amplification of specific gene families implicated in pathogenesis and phagocytosis of host proteins may exemplify adaptations of the parasite during its transition to a urogenital environment. The genome sequence predicts previously unknown functions for the hydrogenosome, which support a common evolutionary origin of this unusual organelle with mitochondria.
Algorithmic approaches to splice site prediction have relied mainly on the consensus patterns found at the boundaries between protein coding and non-coding regions. However exonic splicing enhancers have been shown to enhance the utilization of nearby splice sites.
We have developed a new computational technique to identify significantly conserved motifs involved in splice site regulation. First, 84 putative exonic splicing enhancer hexamers are identified in Arabidopsis thaliana. Then a Gibbs sampling program called ELPH was used to locate conserved motifs represented by these hexamers in exonic regions near splice sites in confirmed genes. Oligomers containing 35 of these motifs have been shown experimentally to induce significant inclusion of A. thaliana exons. Second, integration of our regulatory motifs into two different splice site recognition programs significantly improved the ability of the software to correctly predict splice sites in a large database of confirmed genes. We have released GeneSplicerESE, the improved splice site recognition code, as open source software.
Our results show that the use of the ESE motifs consistently improves splice site prediction accuracy.
Predicting complete protein-coding genes in human DNA remains a significant challenge. Though a number of promising approaches have been investigated, an ideal suite of tools has yet to emerge that can provide near perfect levels of sensitivity and specificity at the level of whole genes. As an incremental step in this direction, it is hoped that controlled gene finding experiments in the ENCODE regions will provide a more accurate view of the relative benefits of different strategies for modeling and predicting gene structures.
Here we describe our general-purpose eukaryotic gene finding pipeline and its major components, as well as the methodological adaptations that we found necessary in accommodating human DNA in our pipeline, noting that a similar level of effort may be necessary by ourselves and others with similar pipelines whenever a new class of genomes is presented to the community for analysis. We also describe a number of controlled experiments involving the differential inclusion of various types of evidence and feature states into our models and the resulting impact these variations have had on predictive accuracy.
While in the case of the non-comparative gene finders we found that adding model states to represent specific biological features did little to enhance predictive accuracy, for our evidence-based 'combiner' program the incorporation of additional evidence tracks tended to produce significant gains in accuracy for most evidence types, suggesting that improved modeling efforts at the hidden Markov model level are of relatively little value. We relate these findings to our current plans for future research.
The Generalized Hidden Markov Model (GHMM) has proven a useful framework for the task of computational gene prediction in eukaryotic genomes, due to its flexibility and probabilistic underpinnings. As the focus of the gene finding community shifts toward the use of homology information to improve prediction accuracy, extensions to the basic GHMM model are being explored as possible ways to integrate this homology information into the prediction process. Particularly prominent among these extensions are those techniques which call for the simultaneous prediction of genes in two or more genomes at once, thereby increasing significantly the computational cost of prediction and highlighting the importance of speed and memory efficiency in the implementation of the underlying GHMM algorithms. Unfortunately, the task of implementing an efficient GHMM-based gene finder is already a nontrivial one, and it can be expected that this task will only grow more onerous as our models increase in complexity.
As a first step toward addressing the implementation challenges of these next-generation systems, we describe in detail two software architectures for GHMM-based gene finders, one comprising the common array-based approach, and the other a highly optimized algorithm which requires significantly less memory while achieving virtually identical speed. We then show how both of these architectures can be accelerated by a factor of two by optimizing their content sensors. We finish with a brief illustration of the impact these optimizations have had on the feasibility of our new homology-based gene finder, TWAIN.
In describing a number of optimizations for GHMM-based gene finders and making available two complete open-source software systems embodying these methods, it is our hope that others will be more enabled to explore promising extensions to the GHMM framework, thereby improving the state-of-the-art in gene prediction techniques.
We present three programs for ab initio gene prediction in eukaryotes: Exonomy, Unveil and GlimmerM. Exonomy is a 23-state Generalized Hidden Markov Model (GHMM), Unveil is a 283-state standard Hidden Markov Model (HMM) and GlimmerM is a previously-described genefinder which utilizes decision trees and Interpolated Markov Models (IMMs). All three are readily re-trainable for new organisms and have been found to perform well compared to other genefinders. Results are presented for Arabidopsis thaliana. Cases have been found where each of the genefinders outperforms each of the others, demonstrating the collective value of this ensemble of genefinders. These programs are all accessible through webservers at http://www.tigr.org/software.
GeneSplicer is a new, flexible system for detecting splice sites
in the genomic DNA of various eukaryotes. The system has been tested
successfully using DNA from two reference organisms: the model plant Arabidopsis thaliana and human. It was compared
to six programs representing the leading splice site detectors for
each of these species: NetPlantGene, NetGene2, HSPL, NNSplice, GENIO
and SpliceView. In each case GeneSplicer performed comparably to the
best alternative, in terms of both accuracy and computational efficiency.