Transposable elements (TEs) are mobile, repetitive sequences that make up significant fractions of metazoan genomes. Despite their near ubiquity and importance in genome and chromosome biology, most efforts to annotate TEs in genome sequences rely on the results of a single computational program, RepeatMasker. In contrast, recent advances in gene annotation indicate that high-quality gene models can be produced from combining multiple independent sources of computational evidence. To elevate the quality of TE annotations to a level comparable to that of gene models, we have developed a combined evidence-model TE annotation pipeline, analogous to systems used for gene annotation, by integrating results from multiple homology-based and de novo TE identification methods. As proof of principle, we have annotated “TE models” in Drosophila melanogaster Release 4 genomic sequences using the combined computational evidence derived from RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, TE-HMM and the previous Release 3.1 annotation. Our system is designed for use with the Apollo genome annotation tool, allowing automatic results to be curated manually to produce reliable annotations. The euchromatic TE fraction of D. melanogaster is now estimated at 5.3% (cf. 3.86% in Release 3.1), and we found a substantially higher number of TEs (n = 6,013) than previously identified (n = 1,572). Most of the new TEs derive from small fragments of a few hundred nucleotides long and highly abundant families not previously annotated (e.g., INE-1). We also estimated that 518 TE copies (8.6%) are inserted into at least one other TE, forming a nest of elements. The pipeline allows rapid and thorough annotation of even the most complex TE models, including highly deleted and/or nested elements such as those often found in heterochromatic sequences. Our pipeline can be easily adapted to other genome sequences, such as those of the D. melanogaster heterochromatin or other species in the genus Drosophila.
A first step in adding value to the large-scale DNA sequences generated by genome projects is the process of annotation—marking biological features on the raw string of adenines, cytosines, guanines, and thymines. The predominant goal in genome annotation thus far has been to identify gene sequences that encode proteins; however, many functional sequences exist in non-protein-coding regions and their annotation remains incomplete. Mobile, repetitive DNA segments known as transposable elements (TEs) are one class of functional sequence in non-protein-coding regions, which can make up large fractions of genome sequences (e.g., about 45% in the human) and can play important roles in gene and chromosome structure and regulation. As a consequence, there has been increasing interest in the computational identification of TEs in genome sequences. Borrowing current ideas from the field of gene annotation, the authors have developed a pipeline to predict TEs in genome sequences that combines multiple sources of evidence from different computational methods. The authors' combined-evidence pipeline represents an important step towards raising the standards of TE annotation to the same quality as that of genes, and should help catalyze their understanding of the biological role of these fascinating sequences.
Annotation of an improved whole-genome shotgun assembly of the Drosophila melanogaster genome predicted 297 protein-coding genes and six non-protein-coding genes, including known heterochromatic genes, and regions of similarity to known transposable elements. Fluorescence in situ hybridization was used to correlate the genomic sequence with the cytogenetic map; the annotated euchromatic sequence extends into the centric heterochromatin on each chromosome arm.
Most eukaryotic genomes include a substantial repeat-rich fraction termed heterochromatin, which is concentrated in centric and telomeric regions. The repetitive nature of heterochromatic sequence makes it difficult to assemble and analyze. To better understand the heterochromatic component of the Drosophila melanogaster genome, we characterized and annotated portions of a whole-genome shotgun sequence assembly.
WGS3, an improved whole-genome shotgun assembly, includes 20.7 Mb of draft-quality sequence not represented in the Release 3 sequence spanning the euchromatin. We annotated this sequence using the methods employed in the re-annotation of the Release 3 euchromatic sequence. This analysis predicted 297 protein-coding genes and six non-protein-coding genes, including known heterochromatic genes, and regions of similarity to known transposable elements. Bacterial artificial chromosome (BAC)-based fluorescence in situ hybridization analysis was used to correlate the genomic sequence with the cytogenetic map in order to refine the genomic definition of the centric heterochromatin; on the basis of our cytological definition, the annotated Release 3 euchromatic sequence extends into the centric heterochromatin on each chromosome arm.
Whole-genome shotgun assembly produced a reliable draft-quality sequence of a significant part of the Drosophila heterochromatin. Annotation of this sequence defined the intron-exon structures of 30 known protein-coding genes and 267 protein-coding gene models. The cytogenetic mapping suggests that an additional 150 predicted genes are located in heterochromatin at the base of the Release 3 euchromatic sequence. Our analysis suggests strategies for improving the sequence and annotation of the heterochromatic portions of the Drosophila and other complex genomes.
With the advancement of new high throughput sequencing technologies, there has been an increase in the number of genome sequencing projects worldwide, which has yielded complete genome sequences of human, animals and plants. Subsequently, several labs have focused on genome annotation, consisting of assigning functions to gene products, mostly using Gene Ontology (GO) terms. As a consequence, there is an increased heterogeneity in annotations across genomes due to different approaches used by different pipelines to infer these annotations and also due to the nature of the GO structure itself. This makes a curator's task difficult, even if they adhere to the established guidelines for assessing these protein annotations. Here we develop a genome-scale approach for integrating GO annotations from different pipelines using semantic similarity measures. We used this approach to identify inconsistencies and similarities in functional annotations between orthologs of human and Drosophila melanogaster, to assess the quality of GO annotations derived from InterPro2GO mappings compared to manually annotated GO annotations for the Drosophila melanogaster proteome from a FlyBase dataset and human, and to filter GO annotation data for these proteomes. Results obtained indicate that an efficient integration of GO annotations eliminates redundancy up to 27.08 and 22.32% in the Drosophila melanogaster and human GO annotation datasets, respectively. Furthermore, we identified lack of and missing annotations for some orthologs, and annotation mismatches between InterPro2GO and manual pipelines in these two proteomes, thus requiring further curation. This simplifies and facilitates tasks of curators in assessing protein annotations, reduces redundancy and eliminates inconsistencies in large annotation datasets for ease of comparative functional genomics.
functional annotation; Gene Ontology annotation; annotation pipeline; manual annotation; electronic annotation
High-quality full-insert sequence for 8,921 putative full-length cDNA clones in the Drosophila Gene Collection has been generated and compared to the annotated Release 3 genomic sequence. More than 5,300 cDNAs have been identifieed that contain a complete and accurate protein-coding sequence, corresponding to at least one splice form for 40% of the predicted D. melanogaster genes.
A collection of sequenced full-length cDNAs is an important resource both for functional genomics studies and for the determination of the intron-exon structure of genes. Providing this resource to the Drosophila melanogaster research community has been a long-term goal of the Berkeley Drosophila Genome Project. We have previously described the Drosophila Gene Collection (DGC), a set of putative full-length cDNAs that was produced by generating and analyzing over 250,000 expressed sequence tags (ESTs) derived from a variety of tissues and developmental stages.
We have generated high-quality full-insert sequence for 8,921 clones in the DGC. We compared the sequence of these clones to the annotated Release 3 genomic sequence, and identified more than 5,300 cDNAs that contain a complete and accurate protein-coding sequence. This corresponds to at least one splice form for 40% of the predicted D. melanogaster genes. We also identified potential new cases of RNA editing.
We show that comparison of cDNA sequences to a high-quality annotated genomic sequence is an effective approach to identifying and eliminating defective clones from a cDNA collection and ensure its utility for experimentation. Clones were eliminated either because they carry single nucleotide discrepancies, which most probably result from reverse transcriptase errors, or because they are truncated and contain only part of the protein-coding sequence.
We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment.
The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified.
This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.
FlyBase (http://flybase.bio.indiana.edu/) provides an integrated view of the fundamental genomic and genetic data on the major genetic model Drosophila melanogaster and related species. FlyBase has primary responsibility for the continual reannotation of the D. melanogaster genome. The ultimate goal of the reannotation effort is to decorate the euchromatic sequence of the genome with as much biological information as is available from the community and from the major genome project centers. A complete revision of the annotations of the now-finished euchromatic genomic sequence has been completed. There are many points of entry to the genome within FlyBase, most notably through maps, gene products and ontologies, structured phenotypic and gene expression data, and anatomy.
Apollo was developed to enable curators to inspect genome annotations closely and edit them. FlyBase biologists successfully used Apollo to annotate the Drosophila melanogaster genome.
The well-established inaccuracy of purely computational methods for annotating genome sequences necessitates an interactive tool to allow biological experts to refine these approximations by viewing and independently evaluating the data supporting each annotation. Apollo was developed to meet this need, enabling curators to inspect genome annotations closely and edit them. FlyBase biologists successfully used Apollo to annotate the Drosophila melanogaster genome and it is increasingly being used as a starting point for the development of customized annotation editing tools for other genome projects.
FlyBase (http://flybase.bio.indiana.edu/) provides an integrated view of the fundamental genomic and genetic data on the major genetic model Drosophila melanogaster and related species. Following on the success of the Drosophila genome project, FlyBase has primary responsibility for the continual reannotation of the D.melanogaster genome. The ultimate goal of the reannotation effort is to decorate the euchromatic sequence of the genome with as much biological information as is available from the community and from the major genome project centers. The current cycle of reannotation focuses on establishing a comprehensive data set of gene models (i.e. transcription units and CDSs). There are many points of entry to the genome within FlyBase, most notably through maps, gene ontologies, structured phenotypic and gene expression data, and anatomy.
The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the ‘tagtog’ system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation.
Understanding how sets of genes are coordinately regulated in space and time to generate the diversity of cell types that characterise complex metazoans is a major challenge in modern biology. The use of high-throughput approaches, such as large-scale in situ hybridisation and genome-wide expression profiling via DNA microarrays, is beginning to provide insights into the complexities of development. However, in many organisms the collection and annotation of comprehensive in situ localisation data is a difficult and time-consuming task. Here, we present a widely applicable computational approach, integrating developmental time-course microarray data with annotated in situ hybridisation studies, that facilitates the de novo prediction of tissue-specific expression for genes that have no in vivo gene expression localisation data available. Using a classification approach, trained with data from microarray and in situ hybridisation studies of gene expression during Drosophila embryonic development, we made a set of predictions on the tissue-specific expression of Drosophila genes that have not been systematically characterised by in situ hybridisation experiments. The reliability of our predictions is confirmed by literature-derived annotations in FlyBase, by overrepresentation of Gene Ontology biological process annotations, and, in a selected set, by detailed gene-specific studies from the literature. Our novel organism-independent method will be of considerable utility in enriching the annotation of gene function and expression in complex multicellular organisms.
The task of deciphering the complex transcriptional regulatory networks controlling development is one of the major current challenges for molecular biology. The problem is difficult, if not impossible, to solve without a detailed knowledge of the spatiotemporal dynamics of gene expression. Thus, to understand development, we need to identify and functionally characterize all players in regulatory networks. Data on gene expression dynamics obtained from whole transcriptome microarray experiments, combined with in situ hybridization mRNA localisation patterns for a subset of genes, may provide a route for predicting the localisation of gene expression for those genes for which in situ data has not been generated, as well as suggesting functional information for uncharacterised genes. Here, we report the development of one of the first methods for predicting the localisation of gene expression during Drosophila embryogenesis from microarray data. Pooling the subset of genes in the fly genome with in situ data to form functional units, localised in space and time for relevant developmental processes, facilitates the statement of a classification problem, which we address with machine-learning methods. Our approach promotes a richer annotation of biological function for genes in the absence of costly and time-consuming experimental analysis.
We performed benchmarks of phylogenetic grammar-based ncRNA gene prediction, experimenting with eight different models of structural evolution and two different programs for genome alignment. We evaluated our models using alignments of twelve Drosophila genomes. We find that ncRNA prediction performance can vary greatly between different gene predictors and subfamilies of ncRNA gene. Our estimates for false positive rates are based on simulations which preserve local islands of conservation; using these simulations, we predict a higher rate of false positives than previous computational ncRNA screens have reported. Using one of the tested prediction grammars, we provide an updated set of ncRNA predictions for D. melanogaster and compare them to previously-published predictions and experimental data. Many of our predictions show correlations with protein-coding genes. We found significant depletion of intergenic predictions near the 3′ end of coding regions and furthermore depletion of predictions in the first intron of protein-coding genes. Some of our predictions are colocated with larger putative unannotated genes: for example, 17 of our predictions showing homology to the RFAM family snoR28 appear in a tandem array on the X chromosome; the 4.5 Kbp spanned by the predicted tandem array is contained within a FlyBase-annotated cDNA.
FlyTF (http://www.flytf.org) is a database of computationally predicted and/or experimentally verified site-specific transcription factors (TFs) in the fruit fly Drosophila melanogaster. The manual classification of TFs in the initial version of FlyTF that concentrated primarily on the DNA-binding characteristics of the proteins has now been extended to a more fine-grained annotation of both DNA binding and regulatory properties in the new release. Furthermore, experimental evidence from the literature was classified into a defined vocabulary, and in collaboration with FlyBase, translated into Gene Ontology (GO) annotation. While our GO annotations will also be available through FlyBase as they will be incorporated into the genes’ official GO annotation in the future, the entire evidence used for classification including computational predictions and quotes from the literature can be accessed through FlyTF. The FlyTF website now builds upon the InterMine framework, which provides experimental and computational biologists with powerful search and filter functionality, list management tools and access to genomic information associated with the TFs.
Genome assemblies rely on the existence of transcript sequence to stitch together contigs, verify assembly of whole genome shotgun reads, and annotate genes. Functional genomics studies also rely on transcript sequence to create expression microarrays or interpret digital tag data produced by methods such as Serial Analysis of Gene Expression (SAGE). Transcript sequence can be predicted based on reconstruction from overlapping expressed sequence tags (EST) that are obtained by single-pass sequencing of random cDNA clones, but these reconstructions are prone to errors caused by alternative splice forms, transcripts from gene families with related sequences, and expressed pseudogenes. These errors confound genome assembly and annotation. The most useful transcript sequences are derived by complete insert sequencing of clones containing the entire length, or at least the full protein coding sequence (CDS) portion, of the source mRNA. While the bovine genome sequencing initiative is nearing completion, there is currently a paucity of bovine full-CDS mRNA and protein sequence data to support bovine genome assembly and functional genomics studies. Consequently, the production of high-quality bovine full-CDS cDNA sequences will enhance the bovine genome assembly and functional studies of bovine genes and gene products. The goal of this investigation was to identify and characterize the full-CDS sequences of bovine transcripts from clones identified in non-full-length enriched cDNA libraries. In contrast to several recent full-length cDNA investigations, these full-CDS cDNAs were selected, sequenced, and annotated without the benefit of the target organism's genomic sequence, by using comparison of bovine EST sequence to existing human mRNA to identify likely full-CDS clones for full-length insert cDNA (FLIC) sequencing.
The predicted bovine protein lengths, 5' UTR lengths, and Kozak consensus sequences from 954 bovine FLIC sequences (bFLICs; average length 1713 nt, representing 762 distinct loci) are all consistent with previously sequenced mammalian full-length transcripts.
In most cases, the bFLICs span the entire CDS of the genes, providing the basis for creating predicted bovine protein sequences to support proteomics and comparative evolutionary research as well as functional genomics and genome annotation. The results demonstrate the utility of the comparative approach in obtaining predicted protein sequences in other species.
A novel Drosophila microarray constructed on the basis of an integrated in silico/wet biology approach provides evidence for the transcription of approximately 2,600 additional genes. Validation indicates a lower limit of 2,000 novel annotations, thus raising the number of genes that make a fly.
While the genome sequences for a variety of organisms are now available, the precise number of the genes encoded is still a matter of debate. For the human genome several stringent annotation approaches have resulted in the same number of potential genes, but a careful comparison revealed only limited overlap. This indicates that only the combination of different computational prediction methods and experimental evaluation of such in silico data will provide more complete genome annotations. In order to get a more complete gene content of the Drosophila melanogaster genome, we based our new D. melanogaster whole-transcriptome microarray, the Heidelberg FlyArray, on the combination of the Berkeley Drosophila Genome Project (BDGP) annotation and a novel ab initio gene prediction of lower stringency using the Fgenesh software.
Here we provide evidence for the transcription of approximately 2,600 additional genes predicted by Fgenesh. Validation of the developmental profiling data by RT-PCR and in situ hybridization indicates a lower limit of 2,000 novel annotations, thus substantially raising the number of genes that make a fly.
The successful design and application of this novel Drosophila microarray on the basis of our integrated in silico/wet biology approach confirms our expectation that in silico approaches alone will always tend to be incomplete. The identification of at least 2,000 novel genes highlights the importance of gathering experimental evidence to discover all genes within a genome. Moreover, as such an approach is independent of homology criteria, it will allow the discovery of novel genes unrelated to known protein families or those that have not been strictly conserved between species.
In order to provide a set of manually curated and annotated sequences from the Aedes aegypti genome, mapped BAC clones encompassing 1.57 Mb were sequenced, assembled and manually annotated using computational gene-finding, EST matches as well as comparative protein homology.
Aedes aegypti is the principal vector of yellow fever and dengue viruses throughout the tropical world. To provide a set of manually curated and annotated sequences from the Ae. aegypti genome, 14 mapped bacterial artificial chromosome (BAC) clones encompassing 1.57 Mb were sequenced, assembled and manually annotated using a combination of computational gene-finding, expressed sequence tag (EST) matches and comparative protein homology. PCR and sequencing were used to experimentally confirm expression and sequence of a subset of these transcripts.
Of the 51 manual annotations, 50 and 43 demonstrated a high level of similarity to Anopheles gambiae and Drosophila melanogaster genes, respectively. Ten of the 12 BAC sequences with more than one annotated gene exhibited synteny with the A. gambiae genome. Putative transcripts from eight BAC clones were found in multiple copies (two copies in most cases) in the Aedes genome assembly, which point to the probable presence of haplotype polymorphisms and/or misassemblies.
This study not only provides a benchmark set of manually annotated transcripts for this genome that can be used to assess the quality of the auto-annotation pipeline and the assembly, but it also looks at the effect of a high repeat content on the genome assembly and annotation pipeline.
MicroRNAs (miRNAs) are short RNA molecules that regulate gene expression by binding to target messenger RNAs and by controlling protein production or causing RNA cleavage. To date, functions have been assigned to only a few of the hundreds of identified miRNAs, in part because of the difficulty in identifying their targets. The short length of miRNAs and the fact that their complementarity to target sequences is imperfect mean that target identification in animal genomes is not possible by standard sequence comparison methods. Here we screen conserved 3′ UTR sequences from the Drosophila melanogaster genome for potential miRNA targets. The screening procedure combines a sequence search with an evaluation of the predicted miRNA–target heteroduplex structures and energies. We show that this approach successfully identifies the five previously validated let-7, lin-4, and bantam targets from a large database and predict new targets for Drosophila miRNAs. Our target predictions reveal striking clusters of functionally related targets among the top predictions for specific miRNAs. These include Notch target genes for miR-7, proapoptotic genes for the miR-2 family, and enzymes from a metabolic pathway for miR-277. We experimentally verified three predicted targets each for miR-7 and the miR-2 family, doubling the number of validated targets for animal miRNAs. Statistical analysis indicates that the best single predicted target sites are at the border of significance; thus, target predictions should be considered as tentative until experimentally validated. We identify features shared by all validated targets that can be used to evaluate target predictions for animal miRNAs. Our initial evaluation and experimental validation of target predictions suggest functions for two miRNAs. For others, the screen suggests plausible functions, such as a role for miR-277 as a metabolic switch controlling amino acid catabolism. Cross-genome comparison proved essential, as it allows reduction of the sequence search space. Improvements in genome annotation and increased availability of cDNA sequences from other genomes will allow more sensitive screens. An increase in the number of confirmed targets is expected to reveal general structural features that can be used to improve their detection. While the screen is likely to miss some targets, our study shows that valid targets can be identified from sequence alone.
A bioinformatic approach suggests many new target genes for Drosophila microRNAs. A number of them are validated experimentally
Mitochondria are organelles present in the cytoplasm of most eukaryotic cells; although they have their own DNA, the majority of the proteins necessary for a functional mitochondrion are coded by the nuclear DNA and only after transcription and translation they are imported in the mitochondrion as proteins. The primary role of the mitochondrion is electron transport and oxidative phosphorylation. Although it has been studied for a long time, the interest of researchers in mitochondria is still alive thanks to the discovery of mitochondrial role in apoptosis, aging and cancer. Aim of the MitoDrome database is to annotate the Drosophila melanogaster nuclear genes coding for mitochondrial proteins in order to contribute to the functional characterization of nuclear genes coding for mitochondrial proteins and to knowledge of gene diseases related to mitochondrial dysfunctions. Indeed D. melanogaster is one of the most studied organisms and a model for the Human genome. Data are derived from the comparison of Human mitochondrial proteins versus the Drosophila genome, ESTs and cDNA sequence data available in the FlyBase database. Links from the MitoDrome entries to the related homologous entries available in MitoNuC will be soon imple-mented. The MitoDrome database is available at http://bighost.area.ba.cnr.it/BIG/MitoDrome. Data are organised in a flat-file format and can be retrieved using the SRS system.
The structure annotation of a genome is based either on ab initio methodologies or on similaritiy searches versus molecules that have been already annotated. Ab initio gene predictions in a genome are based on a priori knowledge of species-specific features of genes. The training of ab initio gene finders is based on the definition of a data-set of gene models. To accomplish this task the common approach is to align species-specific full length cDNA and EST sequences along the genomic sequences in order to define exon/intron structure of mRNA coding genes.
GeneModelEST is the software here proposed for defining a data-set of candidate gene models using exclusively evidence derived from cDNA/EST sequences.
GeneModelEST requires the genome coordinates of the spliced-alignments of ESTs and of contigs (tentative consensus sequences) generated by an EST clustering/assembling procedure to be formatted in a General Feature Format (GFF) standard file. Moreover, the alignments of the contigs versus a protein database are required as an NCBI BLAST formatted report file.
The GeneModelEST analysis aims to i) evaluate each exon as defined from contig spliced alignments onto the genome sequence; ii) classify the contigs according to quality levels in order to select candidate gene models; iii) assign to the candidate gene models preliminary functional annotations.
We discuss the application of the proposed methodology to build a data-set of gene models of Solanum lycopersicum, whose genome sequencing is an ongoing effort by the International Tomato Genome Sequencing Consortium.
The contig classification procedure used by GeneModelEST supports the detection of candidate gene models, the identification of potential alternative transcripts and it is useful to filter out ambiguous information. An automated procedure, such as the one proposed here, is fundamental to support large scale analysis in order to provide species-specific gene models, that could be useful as a training data-set for ab initio gene finders and/or as a reference gene list for a human curated annotation.
The genus Populus includes poplars, aspens and cottonwoods, which will be collectively referred to as poplars hereafter unless otherwise specified. Poplars are the dominant tree species in many forest ecosystems in the Northern Hemisphere and are of substantial economic value in plantation forestry. Poplar has been established as a model system for genomics studies of growth, development, and adaptation of woody perennial plants including secondary xylem formation, dormancy, adaptation to local environments, and biotic interactions.
As part of the poplar genome sequencing project and the development of genomic resources for poplar, we have generated a full-length (FL)-cDNA collection using the biotinylated CAP trapper method. We constructed four FLcDNA libraries using RNA from xylem, phloem and cambium, and green shoot tips and leaves from the P. trichocarpa Nisqually-1 genotype, as well as insect-attacked leaves of the P. trichocarpa × P. deltoides hybrid. Following careful selection of candidate cDNA clones, we used a combined strategy of paired end reads and primer walking to generate a set of 4,664 high-accuracy, sequence-verified FLcDNAs, which clustered into 3,990 putative unique genes. Mapping FLcDNAs to the poplar genome sequence combined with BLAST comparisons to previously predicted protein coding sequences in the poplar genome identified 39 FLcDNAs that likely localize to gaps in the current genome sequence assembly. Another 173 FLcDNAs mapped to the genome sequence but were not included among the previously predicted genes in the poplar genome. Comparative sequence analysis against Arabidopsis thaliana and other species in the non-redundant database of GenBank revealed that 11.5% of the poplar FLcDNAs display no significant sequence similarity to other plant proteins. By mapping the poplar FLcDNAs against transcriptome data previously obtained with a 15.5 K cDNA microarray, we identified 153 FLcDNA clones for genes that were differentially expressed in poplar leaves attacked by forest tent caterpillars.
This study has generated a high-quality FLcDNA resource for poplar and the third largest FLcDNA collection published to date for any plant species. We successfully used the FLcDNA sequences to reassess gene prediction in the poplar genome sequence, perform comparative sequence annotation, and identify differentially expressed transcripts associated with defense against insects. The FLcDNA sequences will be essential to the ongoing curation and annotation of the poplar genome, in particular for targeting gaps in the current genome assembly and further improvement of gene predictions. The physical FLcDNA clones will serve as useful reagents for functional genomics research in areas such as analysis of gene functions in defense against insects and perennial growth. Sequences from this study have been deposited in NCBI GenBank under the accession numbers EF144175 to EF148838.
Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify from all published literature the papers that contain results for a specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest and thus, is usually time consuming. We developed an automatic method for identifying papers containing these curation data types among a large pool of published scientific papers based on the machine learning method Support Vector Machine (SVM). This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in use in production for automatic categorization of 10 different experimental datatypes in the biocuration process at WormBase for the past two years and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community and thereby greatly reducing time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature such as C. elegans and D. melanogaster to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.
We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genomics Informatics (MGI). It is being used in the curation work flow at WormBase for automatic association of newly published papers with ten data types including RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.
Our methods are applicable to a variety of data types with training set containing several hundreds to a few thousand documents. It is completely automatic and, thus can be readily incorporated to different workflow at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.
The German cDNA Consortium has been cloning full length cDNAs and continued with their exploitation in protein localization experiments and cellular assays. However, the efficient use of large cDNA resources requires the development of strategies that are capable of a speedy selection of truly useful cDNAs from biological and experimental noise. To this end we have developed a new high-throughput analysis tool, CAFTAN, which simplifies these efforts and thus fills the gap between large-scale cDNA collections and their systematic annotation and application in functional genomics.
CAFTAN is built around the mapping of cDNAs to the genome assembly, and the subsequent analysis of their genomic context. It uses sequence features like the presence and type of PolyA signals, inner and flanking repeats, the GC-content, splice site types, etc. All these features are evaluated in individual tests and classify cDNAs according to their sequence quality and likelihood to have been generated from fully processed mRNAs. Additionally, CAFTAN compares the coordinates of mapped cDNAs with the genomic coordinates of reference sets from public available resources (e.g., VEGA, ENSEMBL). This provides detailed information about overlapping exons and the structural classification of cDNAs with respect to the reference set of splice variants.
The evaluation of CAFTAN showed that is able to correctly classify more than 85% of 5950 selected "known protein-coding" VEGA cDNAs as high quality multi- or single-exon. It identified as good 80.6 % of the single exon cDNAs and 85 % of the multiple exon cDNAs.
The program is written in Perl and in a modular way, allowing the adoption of this strategy to other tasks like EST-annotation, or to extend it by adding new classification rules and new organism databases as they become available. We think that it is a very useful program for the annotation and research of unfinished genomes.
CAFTAN is a high-throughput sequence analysis tool, which performs a fast and reliable quality prediction of cDNAs. Several thousands of cDNAs can be analyzed in a short time, giving the curator/scientist a first quick overview about the quality and the already existing annotation of a set of cDNAs. It supports the rejection of low quality cDNAs and helps in the selection of likely novel splice variants, and/or completely novel transcripts for new experiments.
Cowpea [Vigna unguiculata (L.) Walp.] is one of the most important food and forage legumes in the semi-arid tropics because of its ability to tolerate drought and grow on poor soils. It is cultivated mostly by poor farmers in developing countries, with 80% of production taking place in the dry savannah of tropical West and Central Africa. Cowpea is largely an underexploited crop with relatively little genomic information available for use in applied plant breeding. The goal of the Cowpea Genomics Initiative (CGI), funded by the Kirkhouse Trust, a UK-based charitable organization, is to leverage modern molecular genetic tools for gene discovery and cowpea improvement. One aspect of the initiative is the sequencing of the gene-rich region of the cowpea genome (termed the genespace) recovered using methylation filtration technology and providing annotation and analysis of the sequence data.
CGKB is an integrated and annotated resource for cowpea GSS with features of homology-based and HMM-based annotations, enzyme and pathway annotations, GO term annotation, toolkits, and a large number of other facilities to perform complex queries. The cowpea GSS, chloroplast sequences, mitochondrial sequences, retroelements, and SSR sequences are available as FASTA formatted files and downloadable at CGKB. This database and web interface are publicly accessible at .
Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach that may provide an efficient means for the quality control of databases. The current version of the MisPred approach uses five distinct routines for identifying abnormal, incomplete or mispredicted entries based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i) conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii) presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii) co-occurrence of extracellular and nuclear domains; (iv) violation of domain integrity; (v) chimeras encoded by two or more genes located on different chromosomes.
Analyses of predicted EnsEMBL protein sequences of nine deuterostome (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis) and two protostome species (Caenorhabditis elegans and Drosophila melanogaster) have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of mispredictions are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON-predicted entries.
MisPred works efficiently in identifying errors in predictions generated by the most reliable gene prediction tools such as the EnsEMBL and NCBI's GNOMON pipelines and also guides the correction of errors. We suggest that application of the MisPred approach will significantly improve the quality of gene predictions and the associated databases.
Motivation: The sequencing of the Plasmodium yoelii genome, a model rodent malaria parasite, has greatly facilitated research for the development of new drug and vaccine candidates against malaria. Unfortunately, only preliminary gene models were annotated on the partially sequenced genome, mostly by in silico gene prediction, and there has been no major improvement of the annotation since 2002.
Results: Here we report on a systematic assessment of the accuracy of the genome annotation based on a detailed analysis of a comprehensive set of cDNA sequences and proteomics data. We found that the coverage of the current annotation tends to be biased toward genes expressed in the blood stages of the parasite life cycle. Based on our proteomic analysis, we estimate that about 15% of the liver stage proteome data we have generated is absent from the current annotation. Through comparative analysis we identified and manually curated a further 510 P. yoelii genes which have clear orthologs in the P. falciparum genome, but were not present or incorrectly annotated in the current annotation. This study suggests that improvements of the current P. yoelii genome annotation should focus on genes expressed in stages other than blood stages. Comparative analysis will be critically helpful for this re-annotation. The addition of newly annotated genes will facilitate the use of P. yoelii as a model system for studying human malaria.
Contact: firstname.lastname@example.org; email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
The oriental fruit fly, Bactrocera dorsalis, is an important pest of fruit and vegetable crops throughout Asia, and is considered a high risk pest for establishment in the mainland United States. It is a member of the family Tephritidae, which are the most agriculturally important family of flies, and can be considered an out-group to well-studied members of the family Drosophilidae. Despite their importance as pests and their relatedness to Drosophila, little information is present on B. dorsalis transcripts and proteins. The objective of this paper is to comprehensively characterize the transcripts present throughout the life history of B. dorsalis and functionally annotate and analyse these transcripts relative to the presence, expression, and function of orthologous sequences present in Drosophila melanogaster.
We present a detailed transcriptome assembly of B. dorsalis from egg through adult stages containing 20,666 transcripts across 10,799 unigene components. Utilizing data available through Flybase and the modENCODE project, we compared expression patterns of these transcripts to putative orthologs in D. melanogaster in terms of timing, abundance, and function. In addition, temporal expression patterns in B. dorsalis were characterized between stages, to establish the constitutive or stage-specific expression patterns of particular transcripts. A fully annotated transcriptome assembly is made available through NCBI, in addition to corresponding expression data.
Through characterizing the transcriptome of B. dorsalis through its life history and comparing the transcriptome of B. dorsalis to the model organism D. melanogaster, a database has been developed that can be used as the foundation to functional genomic research in Bactrocera flies and help identify orthologous genes between B. dorsalis and D. melanogaster. This data provides the foundation for future functional genomic research that will focus on improving our understanding of the physiology and biology of this species at the molecular level. This knowledge can also be applied towards developing improved methods for control, survey, and eradication of this important pest.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-942) contains supplementary material, which is available to authorized users.