Motivation: Kinases of the eukaryotic protein kinase superfamily are key regulators of most aspects of eukaryotic cellular behavior and have provided several drug targets, including kinases dysregulated in cancers. The rapid increase in the number of genomic sequences has created an acute need to identify and classify members of this important class of enzymes efficiently and accurately.
Results: Kinannote produces a draft kinome and comparative analyses for a predicted proteome using a single-line command, and it is currently the only tool that automatically classifies protein kinases using the controlled vocabulary of Hanks and Hunter [Hanks and Hunter (1995)]. Kinannote identifies kinases using a hidden Markov model in combination with a position-specific scoring matrix, and then classifies them using a BLAST comparison with a local version of KinBase, the curated protein kinase dataset from www.kinase.com. Kinannote was tested on the predicted proteomes from four divergent species. The average sensitivity and precision for kinome retrieval from the test species are 94.4 and 96.8%, respectively. The ability of Kinannote to classify identified kinases was also evaluated, and the average sensitivity and precision for full classification of conserved kinases are 71.5 and 82.5%, respectively. Kinannote has had a significant impact on eukaryotic genome annotation, providing protein kinase annotations for 36 genomes made public by the Broad Institute in the period spanning 2009 to the present.
Availability: Kinannote is freely available at http://sourceforge.net/projects/kinannote.
Supplementary data are available at Bioinformatics online.
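The sensitivity and precision figures reported above can be illustrated with a minimal sketch. The function name and the toy identifiers are hypothetical and are not part of Kinannote; the sketch assumes only the standard definitions, sensitivity = TP / (TP + FN) and precision = TP / (TP + FP), applied to sets of protein identifiers.

```python
def sensitivity_precision(predicted, reference):
    """Compare a predicted kinome with a curated reference kinome.

    Both arguments are iterables of protein identifiers (hypothetical IDs here).
    Returns (sensitivity, precision) as fractions between 0 and 1.
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)    # kinases correctly retrieved
    false_neg = len(reference - predicted)   # reference kinases that were missed
    false_pos = len(predicted - reference)   # predictions absent from the reference

    sensitivity = true_pos / (true_pos + false_neg) if reference else 0.0
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    return sensitivity, precision
```

For example, a prediction that recovers three of four reference kinases and also reports one non-kinase has a sensitivity of 0.75 and a precision of 0.75.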
Oomycetes in the class Saprolegniomycetidae of the eukaryotic kingdom Stramenopila have evolved as severe pathogens of amphibians, crustaceans, fish and insects, resulting in major losses in aquaculture and damage to aquatic ecosystems. We have sequenced the 63 Mb genome of the freshwater fish pathogen, Saprolegnia parasitica. Approximately 1/3 of the assembled genome exhibits loss of heterozygosity, indicating an efficient mechanism for revealing new variation. Comparison of S. parasitica with plant pathogenic oomycetes suggests that during evolution the host cellular environment has driven distinct patterns of gene expansion and loss in the genomes of plant and animal pathogens. S. parasitica possesses one of the largest repertoires of proteases (270) among eukaryotes, and RNA-Seq data show that these proteases are deployed in waves at different points during infection. In contrast, although S. parasitica is capable of living saprotrophically, parasitism has led to loss of inorganic nitrogen and sulfur assimilation pathways, strikingly similar to losses in obligate plant pathogenic oomycetes and fungi. The large gene families that are hallmarks of plant pathogenic oomycetes such as Phytophthora appear to be lacking in S. parasitica, including those encoding RXLR effectors, Crinklers, and Necrosis Inducing-Like Proteins (NLP). S. parasitica also has a very large kinome of 543 kinases, 10% of which are induced upon infection. Moreover, S. parasitica encodes several genes typical of animals or animal pathogens and lacking from other oomycetes, including disintegrins and galactose-binding lectins, whose expression and evolutionary origins implicate horizontal gene transfer in the evolution of animal pathogenesis in S. parasitica.
Fish are an increasingly important source of animal protein globally, with aquaculture production rising dramatically over the past decade. Saprolegnia is a fungal-like oomycete and one of the most destructive fish pathogens, causing millions of dollars in losses to the aquaculture industry annually. Saprolegnia has also been linked to a worldwide decline in wild fish and amphibian populations. Here we describe the genome sequence of the first animal pathogenic oomycete and compare the genome content with the available plant pathogenic oomycetes. We found that Saprolegnia lacks the large effector families that are hallmarks of plant pathogenic oomycetes, showing evolutionary adaptation to the host. Moreover, Saprolegnia harbors pathogenesis-related genes that were derived by lateral gene transfer from the host and other animal pathogens. The retrotransposon LINE family also appears to have been acquired from animal lineages. By transcriptome analysis we show a high rate of allelic variation, which reveals rapidly evolving genes and potentially adaptive evolutionary mechanisms coupled to selective pressures exerted by the animal host. The genome and transcriptome data, as well as subsequent biochemical analyses, provided us with insight into the disease process of Saprolegnia at a molecular and cellular level and with targets for sustainable control of Saprolegnia.
Massively-parallel cDNA sequencing has opened the way to deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here, we present the Trinity methodology for de novo full-length transcriptome reconstruction, and evaluate it on samples from fission yeast, mouse, and whitefly – an insect whose genome has not yet been sequenced. Trinity fully reconstructs a large fraction of the transcripts present in the data, also reporting alternative splice isoforms and transcripts from recently duplicated genes. In all cases, Trinity performs better than other available de novo transcriptome assembly programs, and its sensitivity is comparable to methods relying on genome alignments. Our approach provides a unified and general solution for transcriptome reconstruction in any sample, especially in the complete absence of a reference genome.
The large outbreak of diarrhea and hemolytic uremic syndrome (HUS) caused by Shiga toxin-producing Escherichia coli O104:H4 in Europe from May to July 2011 highlighted the potential of a rarely identified E. coli serogroup to cause severe disease. Prior to the outbreak, there were very few reports of disease caused by this pathogen and thus little was known of its diversity and evolution. The identification of cases of HUS caused by E. coli O104:H4 in France and Turkey after the outbreak and with no clear epidemiological links raises questions about whether these sporadic cases are derived from the outbreak. Here, we report genome sequences of five independent isolates from these cases and results of a comparative analysis with historical and 2011 outbreak isolates. These analyses revealed that the five isolates are not derived from the outbreak strain; however, they are more closely related to the outbreak strain and each other than to isolates identified prior to the 2011 outbreak. Over the short time scale represented by these closely related organisms, the majority of genome variation is found within their mobile genetic elements: none of the nine O104:H4 isolates compared here contain the same set of plasmids, and their prophages and genomic islands also differ. Moreover, the presence of closely related HUS-associated E. coli O104:H4 isolates supports the contention that fully virulent O104:H4 isolates are widespread and emphasizes the possibility of future food-borne E. coli O104:H4 outbreaks.
In the summer of 2011, a large outbreak of bloody diarrhea with a high rate of severe complications took place in Europe, caused by a previously rarely seen Escherichia coli strain of serogroup O104:H4. Identification of subsequent infections caused by E. coli O104:H4 raised questions about whether these new cases represented ongoing transmission of the outbreak strain. In this study, we sequenced the genomes of isolates from five recent cases and compared them with historical isolates. The analyses reveal that, in the very short term, evolution of the bacterial genome takes place in parts of the genome that are exchanged among bacteria, and these regions contain genes involved in adaptation to local environments. We show that these recent isolates are not derived from the outbreak strain but are very closely related and share many of the same disease-causing genes, emphasizing the concern that these bacteria may cause future severe outbreaks.
High-throughput sequencing of cDNA libraries (RNA-Seq) has proven to be a highly effective approach for studying bacterial transcriptomes. A central challenge in designing RNA-Seq-based experiments is estimating a priori the number of reads per sample needed to detect and quantify thousands of individual transcripts with a large dynamic range of abundance.
We have conducted a systematic examination of how changes in the number of RNA-Seq reads per sample influence both profiling of a single bacterial transcriptome and the comparison of gene expression among samples. Our findings suggest that the number of reads typically produced in a single lane of the Illumina HiSeq sequencer far exceeds the number needed to saturate the annotated transcriptomes of diverse bacteria growing in monoculture. Moreover, as sequencing depth increases, so too does the detection of cDNAs that likely correspond to spurious transcripts or genomic DNA contamination. Finally, even when dozens of barcoded individual cDNA libraries are sequenced in a single lane, the vast majority of transcripts in each sample can be detected and numerous genes differentially expressed between samples can be identified.
Our analysis provides a guide for the many researchers seeking to determine the appropriate sequencing depth for RNA-Seq-based studies of diverse bacterial species.
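The effect of sequencing depth on transcript detection can be explored with a simple subsampling sketch. This is an illustration under stated assumptions, not the procedure used in the study: it assumes each read has already been assigned to one gene, and the function names, the `min_reads` detection threshold and the fixed random seed are all hypothetical choices.

```python
import random
from collections import Counter


def genes_detected(read_genes, depth, min_reads=1, seed=0):
    """Count genes supported by at least `min_reads` reads in a random subsample.

    read_genes: list with one gene identifier per mapped read.
    depth: number of reads to draw without replacement (capped at the total).
    """
    depth = min(depth, len(read_genes))
    rng = random.Random(seed)                 # fixed seed keeps the sketch reproducible
    subsample = rng.sample(read_genes, depth)
    counts = Counter(subsample)
    return sum(1 for n in counts.values() if n >= min_reads)


def saturation_curve(read_genes, depths, min_reads=1, seed=0):
    """Return (depth, genes detected) pairs for each requested depth."""
    return [(d, genes_detected(read_genes, d, min_reads, seed)) for d in depths]
```

Plotting the pairs returned by `saturation_curve` against depth gives a rarefaction-style curve; the depth at which it flattens is a rough indication of when additional reads stop revealing new annotated genes.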
Fungal genome annotation is the starting point for analysis of genome content. This generally involves the application of diverse methods to identify features on a genome assembly such as protein-coding and non-coding genes, repeats and transposable elements, and pseudogenes. Here we describe tools and methods leveraged for eukaryotic genome annotation with a focus on the annotation of fungal nuclear and mitochondrial genomes. We highlight the application of the latest technologies and tools to improve the quality of predicted gene sets. The Broad Institute eukaryotic genome annotation pipeline is described as one example of how such methods and tools are integrated into a sequencing center’s production genome annotation environment.
Splicing of mRNA is an ancient and evolutionarily conserved process in eukaryotic organisms, but intron-exon structures vary. Plasmodium falciparum has an extreme AT nucleotide bias (>80%), providing a unique opportunity to investigate how evolutionary forces have acted on intron structures. In this study, we developed an in vivo luciferase reporter splicing assay and employed it in combination with lariat isolation and sequencing to characterize 5′ and 3′ splicing requirements and experimentally determine the intron branch point in P. falciparum. This analysis indicates that P. falciparum mRNAs have canonical 5′ and 3′ splice sites. However, the 5′ consensus motif is weakly conserved and tolerates nucleotide substitution, including the fifth nucleotide in the intron, which is more typically a G nucleotide in most eukaryotes. In comparison, the 3′ splice site has a strong eukaryotic consensus sequence and adjacent polypyrimidine tract. In four different P. falciparum pre-mRNAs, multiple branch points per intron were detected, with some at U instead of the typical A residue. A weak branch point consensus was detected among 18 identified branch points. This analysis indicates that P. falciparum retains many consensus eukaryotic splice site features, despite having an extreme codon bias, and possesses flexibility in branch point nucleophilic attack.
We have developed a process for transcriptome analysis of bacterial communities that accommodates both intact and fragmented starting RNA and combines efficient rRNA removal with strand-specific RNA-seq. We applied this approach to an RNA mixture derived from three diverse cultured bacterial species and to RNA isolated from clinical stool samples. The resulting expression profiles were highly reproducible, enriched up to 40-fold for non-rRNA transcripts, and correlated well with profiles representing undepleted total RNA.
A report on the Advances in Genome Biology & Technology conference, Marco Island, USA, 2-5 February 2011.
The fission yeast clade, comprising Schizosaccharomyces pombe, S. octosporus, S. cryophilus and S. japonicus, occupies the basal branch of Ascomycete fungi and is an important model of eukaryote biology. A comparative annotation of these genomes identified a near extinction of transposons and the associated innovation of transposon-free centromeres. Expression analysis established that meiotic genes are subject to antisense transcription during vegetative growth, suggesting a mechanism for their tight regulation. In addition, trans-acting regulators control new genes within the context of expanded functional modules for meiosis and stress response. Differences in gene content and regulation also explain why, unlike the Saccharomycotina, fission yeasts cannot use ethanol as a primary carbon source. These analyses elucidate the genome structure and gene regulation of fission yeast and provide tools for investigation across the Schizosaccharomyces clade.
Motivation: Chimeric DNA sequences often form during polymerase chain reaction amplification, especially when sequencing single regions (e.g. 16S rRNA or fungal Internal Transcribed Spacer) to assess diversity or compare populations. Undetected chimeras may be misinterpreted as novel species, causing inflated estimates of diversity and spurious inferences of differences between populations. Detection and removal of chimeras is therefore of critical importance in such experiments.
Results: We describe UCHIME, a new program that detects chimeric sequences with two or more segments. UCHIME either uses a database of chimera-free sequences or detects chimeras de novo by exploiting abundance data. UCHIME has better sensitivity than ChimeraSlayer (previously the most sensitive database method), especially with short, noisy sequences. In testing on artificial bacterial communities with known composition, UCHIME de novo sensitivity is shown to be comparable to Perseus. UCHIME is >100× faster than Perseus and >1000× faster than ChimeraSlayer.
Availability: Source, binaries and data: http://drive5.com/uchime.
Supplementary information: Supplementary data are available at Bioinformatics online.
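To make the idea of a chimera with two segments concrete, the following toy sketch searches for the breakpoint at which a left segment from one parent and a right segment from another best explain a query. It is not UCHIME's scoring scheme, which is not described here; it assumes the three sequences are already aligned to equal length, and all names are hypothetical.

```python
def two_parent_model(query, parent_a, parent_b):
    """Find the breakpoint where a left/right mix of two parents best matches `query`.

    All three sequences must be pre-aligned to the same length (an assumption of
    this sketch). Returns the best breakpoint, the parent order ("AB" or "BA"),
    the matches achieved by that two-segment model, and the matches achieved by
    the better single parent for comparison.
    """
    n = len(query)
    if len(parent_a) != n or len(parent_b) != n:
        raise ValueError("sequences must be pre-aligned to equal length")
    if n < 2:
        raise ValueError("need at least two positions to place a breakpoint")

    # Prefix sums of matching positions against each parent.
    pre_a, pre_b = [0], [0]
    for q, a, b in zip(query, parent_a, parent_b):
        pre_a.append(pre_a[-1] + (q == a))
        pre_b.append(pre_b[-1] + (q == b))

    best = {"breakpoint": None, "order": None, "chimera_matches": -1}
    for bp in range(1, n):
        left_a_right_b = pre_a[bp] + (pre_b[n] - pre_b[bp])
        left_b_right_a = pre_b[bp] + (pre_a[n] - pre_a[bp])
        for order, score in (("AB", left_a_right_b), ("BA", left_b_right_a)):
            if score > best["chimera_matches"]:
                best = {"breakpoint": bp, "order": order, "chimera_matches": score}

    best["best_parent_matches"] = max(pre_a[n], pre_b[n])
    return best
```

A query is a chimera candidate in this toy model when `chimera_matches` clearly exceeds `best_parent_matches`; a real detector must also cope with sequencing noise, indels and the choice of candidate parents.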
Castor bean (Ricinus communis) is an oil crop that belongs to the spurge (Euphorbiaceae) family. Its seeds are the source of castor oil, used for the production of high-quality lubricants due to its high proportion of the unusual fatty acid ricinoleic acid. Castor bean seeds also produce ricin, a highly toxic ribosome inactivating protein, making castor bean relevant for biosafety. We report here the 4.6X draft genome sequence of castor bean, representing the first reported Euphorbiaceae genome sequence. Our analysis shows that most key castor oil metabolism genes are single-copy while the ricin gene family is larger than previously thought. Comparative genomics analysis suggests the presence of an ancient hexaploidization event that is conserved across the dicotyledonous lineage.
Schistosoma mansoni is responsible for the neglected tropical disease schistosomiasis that affects 210 million people in 76 countries. We report here analysis of the 363 megabase nuclear genome of the blood fluke. It encodes at least 11,809 genes, with an unusual intron size distribution, and novel families of micro-exon genes that undergo frequent alternative splicing. As the first sequenced flatworm, and a representative of the lophotrochozoa, it offers insights into early events in the evolution of the animals, including the development of a body pattern with bilateral symmetry, and the development of tissues into organs. Our analysis has been informed by the need to find new drug targets. The deficits in lipid metabolism that make schistosomes dependent on the host are revealed, while the identification of membrane receptors, ion channels and more than 300 proteases provides new insights into the biology of the life cycle and novel targets. Bioinformatics approaches have identified metabolic chokepoints while a chemogenomic screen has pinpointed schistosome proteins for which existing drugs may be active. The information generated provides an invaluable resource for the research community to develop much-needed new control tools for the treatment and eradication of this important and neglected disease.
Tetrahymena thermophila, a widely studied model for cellular and molecular biology, is a binucleated single-celled organism with a germline micronucleus (MIC) and somatic macronucleus (MAC). The recent draft MAC genome assembly revealed low sequence repetitiveness, a result of the epigenetic removal of invasive DNA elements found only in the MIC genome. Such low repetitiveness makes complete closure of the MAC genome a feasible goal, although achieving it would require standard closure methods as well as removal of minor MIC contamination of the MAC genome assembly. Highly accurate preliminary annotation of Tetrahymena's coding potential was hindered by the lack of both comparative genomic sequence information from close relatives and significant amounts of cDNA evidence, thus limiting the value of the genomic information and also leaving unanswered certain questions, such as the frequency of alternative splicing.
We addressed the problem of MIC contamination using comparative genomic hybridization with purified MIC and MAC DNA probes against a whole genome oligonucleotide microarray, allowing the identification of 763 genome scaffolds likely to contain MIC-limited DNA sequences. We also employed standard genome closure methods to essentially finish over 60% of the MAC genome. For the improvement of annotation, we have sequenced and analyzed over 60,000 verified EST reads from a variety of cellular growth and development conditions. Using this EST evidence, a combination of automated and manual reannotation efforts led to updates that affect 16% of the current protein-coding gene models. By comparing EST abundance, many genes showing apparent differential expression between these conditions were identified. Rare instances of alternative splicing and uses of the non-standard amino acid selenocysteine were also identified.
We report here significant progress in genome closure and reannotation of Tetrahymena thermophila. Our experience to date suggests that complete closure of the MAC genome is attainable. Using the new EST evidence, automated and manual curation has resulted in substantial improvements to the over 24,000 gene models, which will be valuable to researchers studying this model organism as well as for comparative genomics purposes.
The position of a poly(A) site of eukaryotic mRNA is determined by sequence signals in pre-mRNA and a group of polyadenylation factors. To reveal rice poly(A) signals at a genome level, we constructed a dataset of 55 742 authenticated poly(A) sites and characterized the poly(A) signals. This resulted in identifying the typical tripartite cis-elements, including FUE, NUE and CE, as previously observed in Arabidopsis. The average size of the 3′-UTR was 289 nucleotides. When mapped to the genome, however, 15% of these poly(A) sites were found to be located in the currently annotated intergenic regions. Moreover, an extensive alternative polyadenylation profile was evident where 50% of the genes analyzed had more than one unique poly(A) site (excluding microheterogeneity sites), and 13% had four or more poly(A) sites. About 4% of the analyzed genes possessed alternative poly(A) sites at their introns, 5′-UTRs, or protein coding regions. The authenticity of these alternative poly(A) sites was partially confirmed using MPSS data. Analysis of nucleotide profile and signal patterns indicated that there may be a different set of poly(A) signals for those poly(A) sites found in the coding regions. Based on the features of rice poly(A) signals, an updated algorithm termed PASS-Rice was designed to predict poly(A) sites.
High gene numbers in plant genomes reflect polyploidy and major gene duplication events. Oryza sativa, cultivated rice, is a diploid monocotyledonous species with a ~390 Mb genome that has undergone segmental duplication of a substantial portion of its genome. This, coupled with other genetic events such as tandem duplications, has resulted in a substantial number of its genes, and resulting proteins, occurring in paralogous families.
Using a computational pipeline that utilizes Pfam and novel protein domains, we characterized paralogous families in rice and compared these with paralogous families in the model dicotyledonous diploid species, Arabidopsis thaliana. Arabidopsis, which has undergone genome duplication as well, has a substantially smaller genome (~120 Mb) and gene complement compared to rice. Overall, 53% and 68% of the non-transposable element-related rice and Arabidopsis proteins could be classified into paralogous protein families, respectively. Singleton and paralogous family genes differed substantially in their likelihood of encoding a protein of known or putative function; 26% and 66% of singleton genes compared to 73% and 96% of the paralogous family genes encode a known or putative protein in rice and Arabidopsis, respectively. Furthermore, a major skew in the distribution of specific gene function was observed; a total of 17 Gene Ontology categories in both rice and Arabidopsis were statistically significant in their differential distribution between paralogous family and singleton proteins. In contrast to mammalian organisms, we found that duplicated genes in rice and Arabidopsis tend to have more alternative splice forms. Using data from Massively Parallel Signature Sequencing, we show that a significant portion of the duplicated genes in rice show divergent expression although a correlation between sequence divergence and correlation of expression could be seen in very young genes.
Collectively, these data suggest that while co-regulation and conserved function are present in some paralogous protein family members, evolutionary pressures have resulted in functional divergence with differential expression patterns.
EVidenceModeler (EVM) is an automated annotation tool that predicts protein-coding regions, alternatively spliced transcripts and untranslated regions of eukaryotic genes.
EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation.
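The notion of a weighted consensus of evidence can be illustrated with a deliberately simplified sketch. It is not EVM's algorithm, whose scoring is more elaborate and is not reproduced here: the sketch merely sums a per-source weight for each complete candidate gene structure and picks the highest total. The source names, weights and function name are hypothetical.

```python
from collections import defaultdict


def weighted_consensus(predictions, weights, default_weight=1.0):
    """Pick the candidate gene structure with the highest summed evidence weight.

    predictions: iterable of (source, structure) pairs, where a structure is a
        tuple of (start, end) exon coordinates.
    weights: dict mapping an evidence source to its numeric weight.
    Returns (best_structure, scores), where scores maps each structure to its total.
    """
    predictions = list(predictions)
    if not predictions:
        raise ValueError("no evidence supplied")

    scores = defaultdict(float)
    for source, structure in predictions:
        scores[tuple(structure)] += weights.get(source, default_weight)

    best = max(scores, key=scores.get)
    return best, dict(scores)
```

Giving transcript alignments a larger weight than ab initio predictions, for instance, lets a single well-supported transcript outvote several predictors that agree with each other but not with the expressed sequence.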
Recently, genomic sequencing efforts were finished for Oryza sativa (cultivated rice) and Arabidopsis thaliana (Arabidopsis). Additionally, these two plant species have extensive cDNA and expressed sequence tag (EST) libraries. We employed the Program to Assemble Spliced Alignments (PASA) to identify and analyze alternatively spliced isoforms in both species.
A comprehensive analysis of alternative splicing was performed in rice that started with >1.1 million publicly available spliced ESTs and over 30,000 full length cDNAs in conjunction with the newly enhanced PASA software. A parallel analysis was performed with Arabidopsis to compare and ascertain potential differences between monocots and dicots. Alternative splicing is a widespread phenomenon (observed in greater than 30% of the loci with transcript support) and we have described nine alternative splicing variations. While alternative splicing has the potential to create many RNA isoforms from a single locus, the majority of loci generate only two or three isoforms and transcript support indicates that these isoforms are generally not rare events. For the alternate donor (AD) and acceptor (AA) classes, the distance between the splice sites for the majority of events was found to be less than 50 basepairs (bp). In both species, the most frequent distance between AA is 3 bp, consistent with reports in mammalian systems. Conversely, the most frequent distance between AD is 4 bp in both plant species, as previously observed in mouse. Most alternative splicing variations are localized to the protein coding sequence and are predicted to significantly alter the coding sequence.
Alternative splicing is widespread in both rice and Arabidopsis and these species share many common features. Interestingly, alternative splicing may play a role beyond creating novel combinations of transcripts that expand the proteome. Many isoforms will presumably have negative consequences for protein structure and function, suggesting that their biological role involves post-transcriptional regulation of gene expression.
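The distances between alternate acceptor (AA) or alternate donor (AD) sites discussed above can be tabulated with a short sketch. It assumes introns from one locus and strand are given as (donor, acceptor) coordinate pairs; the function name and input layout are hypothetical and do not reflect the PASA data model.

```python
from collections import defaultdict
from itertools import combinations


def alternate_site_distances(introns, kind="acceptor"):
    """Distances between alternate splice sites that share the opposite site.

    introns: iterable of (donor, acceptor) coordinates from one locus and strand.
    kind: "acceptor" groups introns by shared donor and compares acceptors (AA);
          "donor" groups introns by shared acceptor and compares donors (AD).
    Returns a sorted list of pairwise distances in base pairs.
    """
    if kind not in ("acceptor", "donor"):
        raise ValueError("kind must be 'acceptor' or 'donor'")

    groups = defaultdict(set)
    for donor, acceptor in introns:
        if kind == "acceptor":
            groups[donor].add(acceptor)
        else:
            groups[acceptor].add(donor)

    distances = []
    for sites in groups.values():
        for first, second in combinations(sorted(sites), 2):
            distances.append(second - first)
    return sorted(distances)
```

Collecting these distances over all loci and counting them gives the kind of distribution in which a 3 bp spacing for AA events would stand out.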
In this study, we addressed whether a single 454 Life Sciences GS20 sequencing run provides new gene discovery from a normalized cDNA library, and whether the short reads produced via this technology are of value in gene structure annotation.
A single 454 GS20 sequencing run on adapter-ligated cDNA, from a normalized cDNA library, generated 292,465 reads that were reduced to 252,384 reads with an average read length of 92 nucleotides after cleaning. After clustering and assembly, a total of 184,599 unique sequences were generated containing over 400 SSRs. The 454 sequences generated hits to more genes than a comparable amount of sequence from MtGI. Although short, the 454 reads are of sufficient length to map to a unique genome location as effectively as longer ESTs produced by conventional sequencing. Functional interpretation of the sequences was carried out by Gene Ontology assignments from matches to Arabidopsis and was shown to cover a broad range of GO categories. 53,796 assemblies and singletons (29%) had no match in the existing MtGI. Within the previously unobserved Medicago transcripts, thousands had matches in a comprehensive protein database and one or more of the TIGR Plant Gene Indices. Approximately 20% of these novel sequences could be found in the Medicago genome sequence. A total of 70,026 reads generated by the 454 technology were mapped to 785 Medicago finished BACs using PASA and over 1,000 gene models required modification. In parallel to 454 sequencing, 4,445 5′ reads were generated by conventional sequencing using the same library, and the assembled sequences showed that the library contained about 52% full length cDNAs encoding proteins from 50 to over 500 amino acids in length.
Due to the large number of reads afforded by the 454 DNA sequencing technology, it is effective in revealing the expression of transcripts from a broad range of GO categories and contains many rare transcripts in normalized cDNA libraries, although only a limited portion of their sequence is uncovered. As with longer ESTs, 454 reads can be mapped uniquely onto genomic sequence to provide support for, and modifications of, gene predictions.
Listeria monocytogenes, a foodborne bacterial pathogen, comprises four phylogenetic lineages that vary with regard to their serotypes and distribution among sources. In order to characterize lineage-specific genomic diversity within L. monocytogenes, we sequenced the genomes of eight strains from several lineages and serotypes, and characterized the accessory genome, which was hypothesized to contribute to phenotypic differences across lineages. The eight L. monocytogenes genomes sequenced range in size from 2.85 to 3.14 Mb, encode 2,822 to 3,187 genes, and include the first publicly available sequenced representatives of serotypes 1/2c, 3a and 4c. Mapping of the distribution of accessory genes revealed two distinct regions of the L. monocytogenes chromosome: an accessory-rich region in the first 65° adjacent to the origin of replication and a more stable region in the remaining 295°. This pattern of genome organization is distinct from that of related bacteria Staphylococcus aureus and Bacillus cereus. The accessory genome of all lineages is enriched for cell surface-related genes, phosphotransferase systems, and transcriptional regulators, highlighting the selective pressures faced by contemporary strains from their hosts, other microbes, and their environment. Phylogenetic analysis of O-antigen genes and gene clusters predicts that serotype 4 was ancestral in L. monocytogenes and serotype 1/2 associated gene clusters were putatively introduced through horizontal gene transfer in the ancestral population of L. monocytogenes lineages I and II.
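Positions on a circular chromosome can be expressed in degrees from the origin of replication with a small conversion. The sketch below assumes that the angle is measured in one direction from an origin at coordinate 0 and that the 65° boundary applies in that direction; the abstract does not state these conventions, and the function names are hypothetical.

```python
def angle_from_origin(position, genome_length, origin=0):
    """Convert a chromosome coordinate to degrees (0 to <360) from the origin."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return ((position - origin) % genome_length) * 360.0 / genome_length


def chromosome_region(position, genome_length, origin=0, boundary_deg=65.0):
    """Label a position as accessory-rich (within the boundary) or stable.

    The direction in which the boundary is measured is an assumption of this sketch.
    """
    angle = angle_from_origin(position, genome_length, origin)
    return "accessory-rich" if angle < boundary_deg else "stable"
```

On a hypothetical 3.0 Mb chromosome, a gene at 250 kb lies 30° from the origin and falls in the accessory-rich region under these assumptions, whereas a gene at 1.5 Mb lies at 180° in the stable region.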
Since the initial publication of its complete genome sequence, Arabidopsis thaliana has become more important than ever as a model for plant research. However, the initial genome annotation was submitted by multiple centers using inconsistent methods, making the data difficult to use for many applications.
Over the course of three years, TIGR has completed its effort to standardize the structural and functional annotation of the Arabidopsis genome. Using both manual and automated methods, Arabidopsis gene structures were refined and gene products were renamed and assigned to Gene Ontology categories. We present an overview of the methods employed, tools developed, and protocols followed, summarizing the contents of each data release with special emphasis on our final annotation release (version 5).
Over the entire period, several thousand new genes and pseudogenes were added to the annotation. Approximately one third of the originally annotated gene models were significantly refined yielding improved gene structure annotations, and every protein-coding gene was manually inspected and classified using Gene Ontology terms.
The spliced alignment of expressed sequence data to genomic sequence has proven a key tool in the comprehensive annotation of genes in eukaryotic genomes. A novel algorithm was developed to assemble clusters of overlapping transcript alignments (ESTs and full-length cDNAs) into maximal alignment assemblies, thereby comprehensively incorporating all available transcript data and capturing subtle splicing variations. Complete and partial gene structures identified by this method were used to improve The Institute for Genomic Research Arabidopsis genome annotation (TIGR release v.4.0). The alignment assemblies permitted the automated modeling of several novel genes and >1000 alternative splicing variations as well as updates (including UTR annotations) to nearly half of the ∼27 000 annotated protein coding genes. The algorithm of the Program to Assemble Spliced Alignments (PASA) tool is described, as well as the results of automated updates to Arabidopsis gene annotations.
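One ingredient any such assembly needs is a test of whether two overlapping spliced alignments could belong to the same assembly. The sketch below is a simplified assumption rather than the PASA implementation: it treats two alignments on the same strand as compatible when their spans overlap and they imply identical introns within the overlapping region. Exon coordinates are inclusive, and the function names are hypothetical.

```python
def introns_of(exons):
    """Return intron intervals implied by a list of inclusive (start, end) exons."""
    exons = sorted(exons)
    return [
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(exons, exons[1:])
    ]


def compatible(exons_a, exons_b):
    """True if two same-strand alignments overlap and agree on introns in the overlap."""
    a, b = sorted(exons_a), sorted(exons_b)
    overlap_start = max(a[0][0], b[0][0])
    overlap_end = min(a[-1][1], b[-1][1])
    if overlap_start > overlap_end:
        return False  # the alignment spans do not overlap at all

    def introns_in_overlap(exons):
        # Keep every intron that touches the overlapping region.
        return [
            intron for intron in introns_of(exons)
            if intron[0] <= overlap_end and intron[1] >= overlap_start
        ]

    return introns_in_overlap(a) == introns_in_overlap(b)
```

Under this rule an EST that extends a cDNA at either end is compatible with it, whereas an alignment using a different donor or acceptor inside the shared region is not and would have to seed a separate assembly, which is how subtle splicing variations would be kept apart.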
Annotation of eukaryotic genomes is a complex endeavor that requires the integration of evidence from multiple, often contradictory, sources. With the ever-increasing amount of genome sequence data now available, methods for accurate identification of large numbers of genes have become urgently needed. In an effort to create a set of very high-quality gene models, we used the sequence of 5,000 full-length gene transcripts from Arabidopsis to re-annotate its genome. We have mapped these transcripts to their exact chromosomal locations and, using alignment programs, have created gene models that provide a reference set for this organism.
Approximately 35% of the transcripts indicated that previously annotated genes needed modification, and 5% of the transcripts represented newly discovered genes. We also discovered that multiple transcription initiation sites appear to be much more common than previously known, and we report numerous cases of alternative mRNA splicing. We include a comparison of different alignment software and an analysis of how the transcript data improved the previously published annotation.
Our results demonstrate that sequencing of large numbers of full-length transcripts followed by computational mapping greatly improves identification of the complete exon structures of eukaryotic genes. In addition, we are able to find numerous introns in the untranslated regions of the genes.
A computational system for analysis of the repetitive structure of genomic sequences is described. The method uses suffix trees to organize and search the input sequences; this data structure has been used previously for efficient computation of exact and degenerate repeats.
The resulting software tool collects all repeat classes and outputs summary statistics as well as a file containing multiple sequences (multi-FASTA) that can be used as the target of searches. Its use is demonstrated here on several complete microbial genomes, the entire Arabidopsis thaliana genome, and a large collection of rice bacterial artificial chromosome end sequences.
We propose a new clustering method for analysis of the repeat data captured in suffix trees. This method has been incorporated into a system that can find repeats in individual genome sequences or sets of sequences, and that can organize those repeats into classes. It quickly and accurately creates repeat databases from small and large genomes. The associated software (RepeatFinder) should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.
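Exact repeats of the kind a suffix tree exposes can be demonstrated on small inputs with a sorted list of suffixes, a simpler stand-in for a suffix tree. The sketch below is not the RepeatFinder implementation and does not include its clustering step; it is a naive illustration that is practical only for short sequences, and the function name is hypothetical.

```python
def exact_repeats(seq, min_len):
    """Report exact repeats of at least `min_len` bases using sorted suffixes.

    Suffixes that are adjacent in sorted order share the longest prefixes, so
    comparing neighbours finds repeated substrings. Returns a dict mapping each
    repeated substring to the sorted start positions (0-based) where it was seen.
    """
    def common_prefix_length(i, j):
        n = 0
        while i + n < len(seq) and j + n < len(seq) and seq[i + n] == seq[j + n]:
            n += 1
        return n

    # Naive suffix ordering: fine for a demonstration, too slow for whole genomes.
    order = sorted(range(len(seq)), key=lambda i: seq[i:])

    repeats = {}
    for i, j in zip(order, order[1:]):
        length = common_prefix_length(i, j)
        if length >= min_len:
            repeats.setdefault(seq[i:i + length], set()).update((i, j))
    return {unit: sorted(starts) for unit, starts in repeats.items()}
```

Each reported substring and its positions could then be written out as one record of a multi-sequence file for use as a search target.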
An endonuclease IV homolog was identified as the product of a conceptual open reading frame in the genome of the hyperthermophilic bacterium Thermotoga maritima. The T. maritima endonuclease IV gene encodes a 287-amino-acid protein with 32% sequence identity to Escherichia coli endonuclease IV. The gene was cloned, and the expressed protein was purified and shown to have enzymatic activities that are characteristic of the endonuclease IV family of DNA repair enzymes, including apurinic/apyrimidinic endonuclease activity and repair activities on 3′-phosphates, 3′-phosphoglycolates, and 3′-trans-4-hydroxy-2-pentenal-5-phosphates. The T. maritima enzyme exhibits enzyme activity at both low and high temperatures. Circular dichroism spectroscopy indicates that T. maritima endonuclease IV has secondary structure similar to that of E. coli endonuclease IV and that the T. maritima endonuclease IV structure is more stable than E. coli endonuclease IV by almost 20°C, beginning to rapidly denature only at temperatures approaching 90°C. The presence of this enzyme, which is part of the DNA base excision repair pathway, suggests that thermophiles use a mechanism similar to that used by mesophiles to deal with the large number of abasic sites that arise in their chromosomes due to the increased rates of DNA damage at elevated temperatures.