Related Articles

1.  Dictionary-driven protein annotation 
Nucleic Acids Research  2002;30(17):3901-3916.
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. 
Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/.
PMCID: PMC137405  PMID: 12202776
2.  Synergistic use of plant-prokaryote comparative genomics for functional annotations 
BMC Genomics  2011;12(Suppl 1):S2.
Background
Identifying functions for all gene products in all sequenced organisms is a central challenge of the post-genomic era. However, at least 30-50% of the proteins encoded by any given genome are of unknown or vaguely known function, and a large number are wrongly annotated. Many of these ‘unknown’ proteins are common to prokaryotes and plants. We set out to predict and experimentally test the functions of such proteins. Our approach to functional prediction integrates comparative genomics based mainly on microbial genomes with functional genomic data from model microorganisms and post-genomic data from plants. This approach bridges the gap between automated homology-based annotations and the classical gene discovery efforts of experimentalists, and is more powerful than purely computational approaches to identifying gene-function associations.
Results
Among Arabidopsis genes, we focused on those (2,325 in total) that (i) are unique or belong to families with no more than three members, (ii) occur in prokaryotes, and (iii) have unknown or poorly known functions. Computer-assisted selection of promising targets for deeper analysis was based on homology-independent characteristics associated in the SEED database with the prokaryotic members of each family. In-depth comparative genomic analysis was performed for 360 top candidate families. From this pool, 78 families were connected to general areas of metabolism and, of these families, specific functional predictions were made for 41. Twenty-one predicted functions have been experimentally tested or are currently under investigation by our group in at least one prokaryotic organism (nine of them have been validated, four invalidated, and eight are in progress). Ten additional predictions have been independently validated by other groups. Discovering the function of very widespread but hitherto enigmatic proteins such as the YrdC or YgfZ families illustrates the power of our approach.
Conclusions
Our approach correctly predicted functions for 19 uncharacterized protein families from plants and prokaryotes; none of these functions had previously been correctly predicted by computational methods. The resulting annotations could be propagated with confidence to over six thousand homologous proteins encoded in over 900 bacterial, archaeal, and eukaryotic genomes currently available in public databases.
doi:10.1186/1471-2164-12-S1-S2
PMCID: PMC3223725  PMID: 21810204
3.  High precision multi-genome scale reannotation of enzyme function by EFICAz 
BMC Genomics  2006;7:315.
Background
The functional annotation of most genes in newly sequenced genomes is inferred from similarity to previously characterized sequences, an annotation strategy that often leads to erroneous assignments. We have performed a reannotation of 245 genomes using an updated version of EFICAz, a highly precise method for enzyme function prediction.
Results
Based on our three-field EC number predictions, we have obtained lower-bound estimates for the average enzyme content in Archaea (29%), Bacteria (30%) and Eukarya (18%). Most annotations added in KEGG from 2005 to 2006 agree with EFICAz predictions made in 2005. The coverage of EFICAz predictions is significantly higher than that of KEGG, especially for eukaryotes. Thousands of our novel predictions correspond to hypothetical proteins. We have identified a subset of 64 hypothetical proteins with low sequence identity to EFICAz training enzymes, whose biochemical functions have been recently characterized and find that in 96% (84%) of the cases we correctly identified their three-field (four-field) EC numbers. For two of the 64 hypothetical proteins: PA1167 from Pseudomonas aeruginosa, an alginate lyase (EC 4.2.2.3) and Rv1700 of Mycobacterium tuberculosis H37Rv, an ADP-ribose diphosphatase (EC 3.6.1.13), we have detected annotation lag of more than two years in databases. Two examples are presented where EFICAz predictions act as hypothesis generators for understanding the functional roles of hypothetical proteins: FLJ11151, a human protein overexpressed in cancer that EFICAz identifies as an endopolyphosphatase (EC 3.6.1.10), and MW0119, a protein of Staphylococcus aureus strain MW2 that we propose as candidate virulence factor based on its EFICAz predicted activity, sphingomyelin phosphodiesterase (EC 3.1.4.12).
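The distinction between three-field and four-field EC agreement reported above can be illustrated with a minimal Python sketch. It is not EFICAz code; the function names and the example prediction/reference pairs are hypothetical, and only the EC numbers themselves are taken from the abstract.

```python
def ec_fields(ec: str, n: int) -> tuple:
    """Return the first n dot-separated fields of a four-field EC number."""
    parts = ec.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four fields in EC number, got {ec!r}")
    return tuple(parts[:n])


def ec_match(predicted: str, reference: str, n: int) -> bool:
    """True if predicted and reference EC numbers agree on the first n fields."""
    return ec_fields(predicted, n) == ec_fields(reference, n)


def agreement(pairs, n: int) -> float:
    """Fraction of (predicted, reference) pairs that agree at n-field resolution."""
    hits = sum(ec_match(pred, ref, n) for pred, ref in pairs)
    return hits / len(pairs)


# Hypothetical pairs for illustration only (not results from the paper).
pairs = [
    ("4.2.2.3", "4.2.2.3"),    # identical at all four fields
    ("3.6.1.13", "3.6.1.10"),  # same three-field class, different fourth field
]
three_field = agreement(pairs, 3)
four_field = agreement(pairs, 4)
```

A prediction can therefore count as correct at three fields while still missing the exact substrate-level assignment encoded in the fourth field, which is why the two percentages in the abstract differ.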
Conclusion
Our results suggest that we have generated enzyme function annotations of high precision and recall. These predictions can be mined and correlated with other information sources to generate biologically significant hypotheses and can be useful for comparative genome analysis and automated metabolic pathway reconstruction.
doi:10.1186/1471-2164-7-315
PMCID: PMC1764738  PMID: 17166279
4.  BG7: A New Approach for Bacterial Genome Annotation Designed for Next Generation Sequencing Data 
PLoS ONE  2012;7(11):e49239.
BG7 is a new system for de novo bacterial, archaeal and viral genome annotation based on a new approach specifically designed for annotating genomes sequenced with next generation sequencing technologies. The system is versatile and able to annotate genes even at the preliminary assembly stage of the genome. It is especially efficient at detecting unexpected genes horizontally acquired from distant bacterial or archaeal genomes, phages, plasmids, and mobile elements. From the initial phases of the gene annotation process, BG7 exploits the massive availability of annotated protein sequences in databases. BG7 predicts ORFs and infers their function based on protein similarity with a wide set of reference proteins, integrating the ORF prediction and functional annotation phases in a single step. BG7 is especially tolerant to sequencing errors in start and stop codons, to frameshifts, and to assembly or scaffolding errors. The system is also tolerant to the high level of gene fragmentation that is frequently found in incompletely assembled genomes. The current version of BG7, which is developed in Java, takes advantage of Amazon Web Services (AWS) cloud computing features, but it can also be run locally on any operating system. BG7 is a fast, automated and scalable system that can cope with the challenge of analyzing the huge number of genomes that are being sequenced with NGS technologies. Its capabilities and efficiency were demonstrated in the 2011 EHEC Germany outbreak, in which BG7 was used to obtain the first annotations the day after the first entero-hemorrhagic E. coli genome sequences were made publicly available. The suitability of BG7 for genome annotation has been demonstrated for Illumina, 454, Ion Torrent, and PacBio sequencing technologies. Moreover, thanks to its plasticity, our system could be very easily adapted to work with new technologies in the future.
doi:10.1371/journal.pone.0049239
PMCID: PMC3504008  PMID: 23185310
5.  Experimental-confirmation and functional-annotation of predicted proteins in the chicken genome 
BMC Genomics  2007;8:425.
Background
The chicken genome was sequenced because of its phylogenetic position as a non-mammalian vertebrate, its use as a biomedical model especially to study embryology and development, its role as a source of human disease organisms and its importance as the major source of animal derived food protein. However, genomic sequence data is, in itself, of limited value; generally it is not equivalent to understanding biological function. The benefit of having a genome sequence is that it provides a basis for functional genomics. However, the sequence data currently available is poorly structurally and functionally annotated and many genes do not have standard nomenclature assigned.
Results
We analysed eight chicken tissues and improved the chicken genome structural annotation by providing experimental support for the in vivo expression of 7,809 computationally predicted proteins, including 30 chicken proteins that were only electronically predicted or hypothetical translations in human. To improve functional annotation (based on Gene Ontology), we mapped these identified proteins to their human and mouse orthologs and used this orthology to transfer Gene Ontology (GO) functional annotations to the chicken proteins. The 8,213 orthology-based GO annotations that we produced represent an 8% increase in currently available chicken GO annotations. Orthologous chicken products were also assigned standardized nomenclature based on current chicken nomenclature guidelines.
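The orthology-based transfer of GO annotations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the identifiers `chk1`, `chk2`, `hs1`, `mm1` and `mm2` are hypothetical placeholders, and the sketch simply takes the union of GO terms over all listed orthologs without any evidence-code filtering.

```python
from collections import defaultdict


def transfer_go(orthologs, go_annotations):
    """Copy GO terms from annotated orthologs to each query protein.

    orthologs: dict mapping a query protein ID to a list of ortholog IDs
    go_annotations: dict mapping an ortholog ID to a set of GO term IDs
    Returns a dict mapping each query ID to the union of its orthologs' terms.
    """
    transferred = defaultdict(set)
    for query_id, partners in orthologs.items():
        for partner in partners:
            # Orthologs without annotations contribute nothing.
            transferred[query_id] |= go_annotations.get(partner, set())
    return dict(transferred)


# Hypothetical protein identifiers, for illustration only.
orthologs = {"chk1": ["hs1", "mm1"], "chk2": ["mm2"]}
go_annotations = {
    "hs1": {"GO:0005524"},
    "mm1": {"GO:0005524", "GO:0004672"},
}
result = transfer_go(orthologs, go_annotations)
```

Here `chk2` ends up with an empty set because its only ortholog carries no annotation, which mirrors the practical limit of this approach: the transfer can only be as complete as the annotations of the source organisms.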
Conclusion
We demonstrate the utility of high-throughput expression proteomics for rapid experimental structural annotation of a newly sequenced eukaryote genome. These experimentally-supported predicted proteins were further annotated by assigning the proteins with standardized nomenclature and functional annotation. This method is widely applicable to a diverse range of species. Moreover, information from one genome can be used to improve the annotation of other genomes and inform gene prediction algorithms.
doi:10.1186/1471-2164-8-425
PMCID: PMC2204016  PMID: 18021451
6.  The RAST Server: Rapid Annotations using Subsystems Technology 
BMC Genomics  2008;9:75.
Background
The number of prokaryotic genome sequences becoming available is growing steadily and is growing faster than our ability to accurately annotate them.
Description
We describe a fully automated service for annotating bacterial and archaeal genomes. The service identifies protein-encoding, rRNA and tRNA genes, assigns functions to the genes, predicts which subsystems are represented in the genome, uses this information to reconstruct the metabolic network and makes the output easily downloadable for the user. In addition, the annotated genome can be browsed in an environment that supports comparative analysis with the annotated genomes maintained in the SEED environment.
The service normally makes the annotated genome available within 12–24 hours of submission, but ultimately the quality of such a service will be judged in terms of accuracy, consistency, and completeness of the produced annotations. We summarize our attempts to address these issues and discuss plans for incrementally enhancing the service.
Conclusion
By providing accurate, rapid annotation freely to the community we have created an important community resource. The service has now been utilized by over 120 external users annotating over 350 distinct genomes.
doi:10.1186/1471-2164-9-75
PMCID: PMC2265698  PMID: 18261238
7.  AGeS: A Software System for Microbial Genome Sequence Annotation 
PLoS ONE  2011;6(3):e17469.
Background
The annotation of genomes from next-generation sequencing platforms needs to be rapid, high-throughput, and fully integrated and automated. Although a few Web-based annotation services have recently become available, they may not be the best solution for researchers that need to annotate a large number of genomes, possibly including proprietary data, and store them locally for further analysis. To address this need, we developed a standalone software application, the Annotation of microbial Genome Sequences (AGeS) system, which incorporates publicly available and in-house-developed bioinformatics tools and databases, many of which are parallelized for high-throughput performance.
Methodology
The AGeS system supports three main capabilities. The first is the storage of input contig sequences and the resulting annotation data in a central, customized database. The second is the annotation of microbial genomes using an integrated software pipeline, which first analyzes contigs from high-throughput sequencing by locating genomic regions that code for proteins, RNA, and other genomic elements through the Do-It-Yourself Annotation (DIYA) framework. The identified protein-coding regions are then functionally annotated using the in-house-developed Pipeline for Protein Annotation (PIPA). The third capability is the visualization of annotated sequences using GBrowse. To date, we have implemented these capabilities for bacterial genomes. AGeS was evaluated by comparing its genome annotations with those provided by three other methods. Our results indicate that the software tools integrated into AGeS provide annotations that are in general agreement with those provided by the compared methods. This is demonstrated by a >94% overlap in the number of identified genes, a significant number of identical annotated features, and a >90% agreement in enzyme function predictions.
doi:10.1371/journal.pone.0017469
PMCID: PMC3049762  PMID: 21408217
8.  Re-annotation of the Saccharopolyspora erythraea genome using a systems biology approach 
BMC Genomics  2013;14:699.
Background
Accurate bacterial genome annotations provide a framework for understanding cellular functions, behavior and pathogenicity and are essential for metabolic engineering. Annotations based only on in silico predictions are inaccurate, particularly for large, high G + C content genomes due to the lack of similarities in gene length and gene organization to model organisms.
Results
Here we describe a 2D systems biology driven re-annotation of the Saccharopolyspora erythraea genome using proteogenomics, a genome-scale metabolic reconstruction, RNA-sequencing and small-RNA-sequencing. We observed transcription of more than 300 intergenic regions, detected 59 peptides in intergenic regions, confirmed 164 open reading frames previously annotated as hypothetical proteins and reassigned function to open reading frames using the genome-scale metabolic reconstruction. Finally, we present a novel way of mapping ribosomal binding sites across the genome by sequencing small RNAs.
Conclusions
The work presented here describes a novel framework for annotation of the Saccharopolyspora erythraea genome. Based on experimental observations, the 2D annotation framework greatly reduces errors that are commonly made when annotating large, high G + C content genomes using computational prediction algorithms.
doi:10.1186/1471-2164-14-699
PMCID: PMC4008361  PMID: 24118942
Proteogenomics; Saccharopolyspora erythraea; Systems biology; Genome annotation; High G + C content genomes
9.  The essential genome of a bacterium 
This study reports the essential Caulobacter genome at 8 bp resolution determined by saturated transposon mutagenesis and high-throughput sequencing. This strategy is applicable to full genome essentiality studies in a broad class of bacterial species.
The essential Caulobacter genome was determined at 8 bp resolution using hyper-saturated transposon mutagenesis coupled with high-throughput sequencing.
Essential protein-coding sequences comprise 90% of the essential genome; the remaining 10% comprising essential non-coding RNA sequences, gene regulatory elements and essential genome replication features.
Of the 3876 annotated open reading frames (ORFs), 480 (12.4%) were essential ORFs, 3240 (83.6%) were non-essential ORFs and 156 (4.0%) were ORFs that severely impacted fitness when mutated.
The essential elements are preferentially positioned near the origin and terminus of the Caulobacter chromosome.
This high-resolution strategy is applicable to high-throughput, full genome essentiality studies and large-scale genetic perturbation experiments in a broad class of bacterial species.
The regulatory events that control polar differentiation and cell-cycle progression in the bacterium Caulobacter crescentus are highly integrated, and they have to occur in the proper order (McAdams and Shapiro, 2011). Components of the core regulatory circuit are largely known. Full discovery of its essential genome, including non-coding, regulatory and coding elements, is a prerequisite for understanding the complete regulatory network of this bacterial cell. We have identified all the essential coding and non-coding elements of the Caulobacter chromosome using a hyper-saturated transposon mutagenesis strategy that is scalable and can be readily extended to obtain rapid and accurate identification of the essential genome elements of any sequenced bacterial species at a resolution of a few base pairs.
We engineered a Tn5 derivative transposon (Tn5Pxyl) that carries at one end an inducible outward pointing Pxyl promoter (Christen et al, 2010). We showed that this transposon construct inserts into the genome randomly where it can activate or disrupt transcription at the site of integration, depending on the insertion orientation. DNA from hundreds of thousands of transposon insertion sites reading outward into flanking genomic regions was parallel PCR amplified and sequenced by Illumina paired-end sequencing to locate the insertion site in each mutant strain (Figure 1). A single sequencing run on DNA from a mutagenized cell population yielded 118 million raw sequencing reads. Of these, >90 million (>80%) read outward from the transposon element into adjacent genomic DNA regions and the insertion site could be mapped with single nucleotide resolution. This yielded the location and orientation of 428 735 independent transposon insertions in the 4-Mbp Caulobacter genome.
Within non-coding sequences of the Caulobacter genome, we detected 130 non-disruptable DNA segments between 90 and 393 bp long in addition to all essential promoter elements. Among 27 previously identified and validated sRNAs (Landt et al, 2008), three were contained within non-disruptable DNA segments and another three were partially disruptable, that is, insertions caused a notable growth defect. Two additional small RNAs found to be essential are the transfer-messenger RNA (tmRNA) and the ribozyme RNAseP (Landt et al, 2008). In addition to the 8 non-disruptable sRNAs, 29 out of the 130 intergenic essential non-coding sequences contained non-redundant tRNA genes; duplicated tRNA genes were non-essential. We also identified two non-disruptable DNA segments within the chromosomal origin of replication. Thus, we resolved essential non-coding RNAs, tRNAs and essential replication elements within the origin region of the chromosome. An additional 90 non-disruptable small genome elements of currently unknown function were identified. Eighteen of these are conserved in at least one closely related species. Only 2 could encode a protein of over 50 amino acids.
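The idea of a non-disruptable segment can be illustrated with a small Python sketch that scans mapped insertion positions for long insertion-free stretches. With 428 735 insertions in a genome of about 4 Mbp, insertions fall on average roughly every 9 bp, so a stretch of 90 bp or more without any insertion stands out. The function name, the fixed length threshold, and the linear treatment of the (circular) chromosome are all assumptions made for illustration; the summary above does not describe the criteria the authors actually used.

```python
def insertion_free_gaps(positions, genome_length, min_gap):
    """Return 1-based inclusive (start, end) intervals without any insertion.

    positions: iterable of 1-based insertion coordinates
    genome_length: length of the sequence, treated here as linear
    min_gap: minimum length of an insertion-free stretch to report
    """
    gaps = []
    previous = 0  # sentinel position just before the first base
    # A sentinel just past the last base closes the final gap.
    for position in sorted(set(positions)) + [genome_length + 1]:
        gap_length = position - previous - 1
        if gap_length >= min_gap:
            gaps.append((previous + 1, position - 1))
        previous = position
    return gaps


# Toy coordinates, for illustration only.
toy_gaps = insertion_free_gaps([5, 10, 200, 205], genome_length=300, min_gap=90)
```

In a real analysis the threshold would have to account for local insertion density and for insertions that are tolerated only in one orientation, neither of which this sketch models.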
For each of the 3876 annotated open reading frames (ORFs), we analyzed the distribution, orientation, and genetic context of transposon insertions. There are 480 essential ORFs and 3240 non-essential ORFs. In addition, there were 156 ORFs that severely impacted fitness when mutated. The 8-bp resolution allowed a dissection of the essential and non-essential regions of the coding sequences. Sixty ORFs had transposon insertions within a significant portion of their 3′ region but lacked insertions in the essential 5′ coding region, allowing the identification of non-essential protein segments. For example, transposon insertions in the essential cell-cycle regulatory gene divL, a tyrosine kinase, showed that the last 204 C-terminal amino acids did not impact viability, confirming previous reports that the C-terminal ATPase domain of DivL is dispensable for viability (Reisinger et al, 2007; Iniesta et al, 2010). In addition, we found that 30 out of 480 (6.3%) of the essential ORFs appear to be shorter than the annotated ORF, suggesting that these are probably mis-annotated.
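The ORF class counts quoted above can be checked with a few lines of Python. The counts are taken from the text; the variable and function names are only illustrative.

```python
TOTAL_ORFS = 3876
ORF_CLASSES = {"essential": 480, "non_essential": 3240, "fitness_impact": 156}


def class_shares(classes, total):
    """Return the percentage of the total for each class, checking the sum."""
    if sum(classes.values()) != total:
        raise ValueError("class counts do not add up to the total")
    return {name: 100.0 * count / total for name, count in classes.items()}


shares = class_shares(ORF_CLASSES, TOTAL_ORFS)
for name, percent in shares.items():
    print(f"{name}: {percent:.1f}%")
# essential: 12.4%
# non_essential: 83.6%
# fitness_impact: 4.0%

# Share of essential ORFs that appear shorter than annotated:
# exactly 6.25%, reported as 6.3% in the text.
short_share = 100.0 * 30 / ORF_CLASSES["essential"]
```

The three classes sum to the 3876 annotated ORFs, so every annotated ORF is assigned to exactly one class.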
Among the 480 ORFs essential for growth on rich media, there were 10 essential transcriptional regulatory proteins, including 5 previously identified cell-cycle regulators (McAdams and Shapiro, 2003; Holtzendorff et al, 2004; Collier and Shapiro, 2007; Gora et al, 2010; Tan et al, 2010) and 5 uncharacterized predicted transcription factors. In addition, two RNA polymerase sigma factors RpoH and RpoD, as well as the anti-sigma factor ChrR, which mitigates rpoE-dependent stress response under physiological growth conditions (Lourenco and Gomes, 2009), were also found to be essential. Thus, a set of 10 transcription factors, 2 RNA polymerase sigma factors and 1 anti-sigma factor are the core essential transcriptional regulators for growth on rich media. To further characterize the core components of the Caulobacter cell-cycle control network, we identified all essential regulatory sequences and operon transcripts. Altogether, the 480 essential protein-coding and 37 essential RNA-coding Caulobacter genes are organized into operons such that 402 individual promoter regions are sufficient to regulate their expression. Of these 402 essential promoters, the transcription start sites (TSSs) of 105 were previously identified (McGrath et al, 2007).
The essential genome features are non-uniformly distributed on the Caulobacter genome and enriched near the origin and the terminus regions. In contrast, the chromosomal positions of the published E. coli essential coding sequences (Rocha, 2004) are preferentially located at either side of the origin (Figure 4A). This indicates that there are selective pressures on chromosomal positioning of some essential elements (Figure 4A).
The strategy described in this report could be readily extended to quickly determine the essential genome for a large class of bacterial species.
Caulobacter crescentus is a model organism for the integrated circuitry that runs a bacterial cell cycle. Full discovery of its essential genome, including non-coding, regulatory and coding elements, is a prerequisite for understanding the complete regulatory network of a bacterial cell. Using hyper-saturated transposon mutagenesis coupled with high-throughput sequencing, we determined the essential Caulobacter genome at 8 bp resolution, including 1012 essential genome features: 480 ORFs, 402 regulatory sequences and 130 non-coding elements, including 90 intergenic segments of unknown function. The essential transcriptional circuitry for growth on rich media includes 10 transcription factors, 2 RNA polymerase sigma factors and 1 anti-sigma factor. We identified all essential promoter elements for the cell cycle-regulated genes. The essential elements are preferentially positioned near the origin and terminus of the chromosome. The high-resolution strategy used here is applicable to high-throughput, full genome essentiality studies and large-scale genetic perturbation experiments in a broad class of bacterial species.
doi:10.1038/msb.2011.58
PMCID: PMC3202797  PMID: 21878915
functional genomics; next-generation sequencing; systems biology; transposon mutagenesis
10.  Bioinformatics in microbial biotechnology – a mini review 
The revolutionary growth in computation speed and memory storage capability has fueled a new era in the analysis of biological data. Hundreds of microbial genomes and many eukaryotic genomes, including a cleaner draft of the human genome, have been sequenced, raising the expectation of better control of microorganisms. The goals are lofty: the development of rational drugs and antimicrobial agents, new enhanced bacterial strains for bioremediation and pollution control, better and easier-to-administer vaccines, protein biomarkers for various bacterial diseases, and a better understanding of host-bacteria interaction to prevent bacterial infections. In the last decade the development of many new bioinformatics techniques and integrated databases has facilitated the realization of these goals. Current research in bioinformatics can be classified into: (i) genomics – sequencing and comparative study of genomes to identify gene and genome functionality, (ii) proteomics – identification and characterization of protein related properties and reconstruction of metabolic and regulatory pathways, (iii) cell visualization and simulation to study and model cell behavior, and (iv) application to the development of drugs and anti-microbial agents. In this article, we will focus on the techniques and their limitations in genomics and proteomics. Bioinformatics research can be classified under three major approaches: (1) analysis based upon the available experimental wet-lab data, (2) the use of mathematical modeling to derive new information, and (3) an integrated approach that integrates search techniques with mathematical modeling.
The major impact of bioinformatics research has been the automation of genome sequencing, the automated development of integrated genomics and proteomics databases, automated genome comparisons to identify genome function, automated derivation of metabolic pathways, gene expression analysis to derive regulatory pathways, the development of statistical, clustering and data mining techniques to derive protein-protein and protein-DNA interactions, modeling of the 3D structure of proteins and 3D docking between proteins and biochemicals for rational drug design, difference analysis between pathogenic and non-pathogenic strains to identify candidate genes for vaccines and anti-microbial agents, and whole genome comparison to understand microbial evolution. The development of bioinformatics techniques has enhanced the pace of biological discovery through automated analysis of a large number of microbial genomes. We are on the verge of using all this knowledge to understand cellular mechanisms at the systemic level. The developed bioinformatics techniques have the potential to facilitate (i) the discovery of causes of diseases, (ii) vaccine and rational drug design, and (iii) improved cost effective agents for bioremediation by pruning out the dead ends. Despite the fast paced global effort, the current analysis is limited by the lack of available gene-functionality from the wet-lab data, the lack of computer algorithms to explore vast amounts of data with unknown functionality, limited availability of protein-protein and protein-DNA interactions, and the lack of knowledge of temporal and transient behavior of genes and pathways.
doi:10.1186/1475-2859-4-19
PMCID: PMC1182391  PMID: 15985162
11.  A Systematic Survey of Mini-Proteins in Bacteria and Archaea 
PLoS ONE  2008;3(12):e4027.
Background
Mini-proteins, defined as polypeptides containing no more than 100 amino acids, are ubiquitous in prokaryotes and eukaryotes. They play significant roles in various biological processes, and their regulatory functions are gradually attracting the attention of scientists. However, the functions of the majority of mini-proteins are still largely unknown due to the constraints of experimental methods and bioinformatic analysis.
Methodology/Principal Findings
In this article, we extracted a total of 180,879 mini-proteins from the annotations of 532 sequenced genomes, including 491 strains of Bacteria and 41 strains of Archaea. The average proportion of mini-proteins among all genomic proteins is approximately 10.99%, but different strains exhibit remarkable fluctuations. These mini-proteins display two notable characteristics. First, the majority are species-specific proteins with an average proportion of 58.79% among six representative phyla. Second, an even larger proportion (70.03% among all strains) consists of hypothetical proteins. However, a fraction of highly conserved hypothetical proteins potentially play crucial roles in organisms. Among mini-proteins with known functions, it seems that regulatory and metabolic proteins are more abundant than essential structural proteins. Furthermore, domains in mini-proteins seem to have greater distributions in Bacteria than Eukarya. Analysis of the evolutionary progression of these domains reveals that they have diverged to new patterns from a single ancestor.
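The extraction step follows directly from the definition given in the Background (no more than 100 amino acids). A minimal Python sketch, assuming the annotated proteins of one genome are already available as a mapping from identifier to amino acid sequence; the function name and the toy sequences are hypothetical.

```python
def mini_protein_fraction(proteins, max_len=100):
    """Select proteins of at most max_len residues and report their share.

    proteins: dict mapping a protein ID to its amino acid sequence
    Returns (dict of mini-proteins, fraction of all proteins that are mini).
    """
    minis = {pid: seq for pid, seq in proteins.items() if len(seq) <= max_len}
    fraction = len(minis) / len(proteins) if proteins else 0.0
    return minis, fraction


# Toy sequences, for illustration only.
toy_genome = {
    "p1": "M" * 100,  # exactly 100 residues: counted as a mini-protein
    "p2": "M" * 101,  # 101 residues: excluded
    "p3": "MKT",
    "p4": "M" * 350,
}
minis, fraction = mini_protein_fraction(toy_genome)
```

Applied per genome, this fraction corresponds to the per-strain proportion whose average is reported above as approximately 10.99%.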
Conclusions/Significance
Mini-proteins are ubiquitous in bacterial and archaeal species and play significant roles in various functions. The number of mini-proteins in each genome displays remarkable fluctuation, likely resulting from the differential selective pressures that reflect the respective life-styles of the organisms. The answers to many questions surrounding mini-proteins remain elusive and need to be resolved experimentally.
doi:10.1371/journal.pone.0004027
PMCID: PMC2602986  PMID: 19107199
12.  Argot2: a large scale function prediction tool relying on semantic similarity of weighted Gene Ontology terms 
BMC Bioinformatics  2012;13(Suppl 4):S14.
Background
Predicting protein function has become increasingly demanding in the era of next generation sequencing technology. The task to assign a curator-reviewed function to every single sequence is impracticable. Bioinformatics tools, easy to use and able to provide automatic and reliable annotations at a genomic scale, are necessary and urgent. In this scenario, the Gene Ontology has provided the means to standardize the annotation classification with a structured vocabulary which can be easily exploited by computational methods.
Results
Argot2 is a web-based function prediction tool able to annotate nucleic or protein sequences from small datasets up to entire genomes. It accepts as input a list of sequences in FASTA format, which are processed using BLAST and HMMER searches vs UniProtKB and Pfam databases respectively; these sequences are then annotated with GO terms retrieved from the UniProtKB-GOA database and the terms are weighted using the e-values from BLAST and HMMER. The weighted GO terms are processed according to both their semantic similarity relations described by the Gene Ontology and their associated score. The algorithm is based on the original idea developed in a previous tool called Argot. The entire engine has been completely rewritten to improve both accuracy and computational efficiency, thus allowing for the annotation of complete genomes.
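The weighting of GO terms by e-value can be sketched in Python as follows. The abstract does not give the weighting formula, so this sketch assumes, purely for illustration, that each hit contributes -log10(e-value) to every GO term it carries, with a cap for e-values reported as zero. The function name, the cap, and the identifiers are hypothetical, and the subsequent semantic similarity step is not modelled.

```python
import math
from collections import defaultdict


def weight_go_terms(hits, go_map, max_weight=300.0):
    """Accumulate an e-value-derived weight for each GO term.

    hits: iterable of (hit_id, evalue) pairs from a similarity search
    go_map: dict mapping a hit ID to an iterable of GO term IDs
    max_weight: assumed cap, also used when the e-value is reported as 0
    """
    scores = defaultdict(float)
    for hit_id, evalue in hits:
        if evalue <= 0:
            weight = max_weight
        else:
            # Clamp to [0, max_weight]; e-values above 1 contribute nothing.
            weight = min(max(-math.log10(evalue), 0.0), max_weight)
        for term in go_map.get(hit_id, ()):
            scores[term] += weight
    return dict(scores)


# Hypothetical hits and term labels, for illustration only.
hits = [("hit1", 1e-10), ("hit2", 1e-5)]
go_map = {"hit1": {"GO:A", "GO:B"}, "hit2": {"GO:A"}}
scores = weight_go_terms(hits, go_map)
```

Under this assumption a term supported by several strong hits accumulates a higher score than a term seen once, which is the intuition behind ranking candidate annotations by weight.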
Conclusions
The revised algorithm has already been employed and successfully tested during in-house genome projects of grape and apple, and has proven to have high precision and recall in all our benchmark conditions. It has also been successfully compared with Blast2GO, one of the methods most commonly employed for sequence annotation. The server is freely accessible at http://www.medcomp.medicina.unipd.it/Argot2.
doi:10.1186/1471-2105-13-S4-S14
PMCID: PMC3314586  PMID: 22536960
13.  Thousands of Rab GTPases for the Cell Biologist 
PLoS Computational Biology  2011;7(10):e1002217.
Rab proteins are small GTPases that act as essential regulators of vesicular trafficking. 44 subfamilies are known in humans, performing specific sets of functions at distinct subcellular localisations and tissues. Rab function is conserved even amongst distant orthologs. Hence, the annotation of Rabs yields functional predictions about the cell biology of trafficking. So far, annotating Rabs has been a laborious manual task not feasible for current and future genomic output of deep sequencing technologies. We developed, validated and benchmarked the Rabifier, an automated bioinformatic pipeline for the identification and classification of Rabs, which achieves up to 90% classification accuracy. We cataloged roughly 8,000 Rabs from 247 genomes covering the entire eukaryotic tree. The full Rab database and a web tool implementing the pipeline are publicly available at www.RabDB.org. For the first time, we describe and analyse the evolution of Rabs in a dataset covering the whole eukaryotic phylogeny. We found a highly dynamic family undergoing frequent taxon-specific expansions and losses. We dated the origin of human subfamilies using phylogenetic profiling, which enlarged the Rab repertoire of the Last Eukaryotic Common Ancestor with Rab14, 32 and RabL4. Furthermore, a detailed analysis of the Choanoflagellate Monosiga brevicollis Rab family pinpointed the changes that accompanied the emergence of Metazoan multicellularity, mainly an important expansion and specialisation of the secretory pathway. Lastly, we experimentally establish tissue specificity in expression of mouse Rabs and show that neo-functionalisation best explains the emergence of new human Rab subfamilies. With the Rabifier and RabDB, we provide tools that easily allow non-bioinformaticians to integrate thousands of Rabs in their analyses. 
RabDB is designed to enable the cell biology community to keep pace with the increasing number of fully-sequenced genomes and change the scale at which we perform comparative analysis in cell biology.
Author Summary
Intracellular compartmentalisation via membrane-delimited organelles is a fundamental feature of the eukaryotic cell. Understanding its origins and specialisation into functionally distinct compartments is a major challenge in evolutionary cell biology. We focus on the Rab enzymes, critical organisers of the trafficking pathways that link the endomembrane system. Rabs form a large family of evolutionarily related proteins, regulating distinct steps in vesicle transport. They mark pathways and organelles due to their specific subcellular and tissue localisation. We propose a solution to the problem of identifying and annotating Rabs in hundreds of sequenced genomes. We developed an accurate bioinformatics pipeline that is able to take into account pre-existing and often inconsistent, manual annotations. We made it available to the community in form of a web tool, as well as a database containing thousands of Rabs assigned to sub-families, which yields clear functional predictions. Thousands of Rabs allow for a new level of analysis. We illustrate this by characterising for the first time the global evolutionary dynamics of the Rab family. We dated the emergence of subfamilies and suggest that the Rab family expands by duplicates acquiring new functions.
doi:10.1371/journal.pcbi.1002217
PMCID: PMC3192815  PMID: 22022256
14.  Insertion Sequence–Driven Diversification Creates a Globally Dispersed Emerging Multiresistant Subspecies of E. faecium 
PLoS Pathogens  2007;3(1):e7.
Enterococcus faecium, a ubiquitous colonizer of humans and animals, has evolved in the last 15 years from an avirulent commensal to the third most frequently isolated nosocomial pathogen among intensive care unit patients in the United States. E. faecium combines multidrug resistance with the potential of horizontal resistance gene transfer to even more pathogenic bacteria. Little is known about the evolution and virulence of E. faecium, and genomic studies are hampered by the absence of a completely annotated genome sequence. To further unravel its evolution, we used a mixed whole-genome microarray and hybridized 97 E. faecium isolates from different backgrounds (hospital outbreaks (n = 18), documented infections (n = 34) and asymptomatic carriage of hospitalized patients (n = 15), and healthy persons (n = 15) and animals (n = 21)). Supported by Bayesian posterior probabilities (PP = 1.0), a specific clade containing all outbreak-associated strains and 63% of clinical isolates was identified. Sequencing of 146 of 437 clade-specific inserts revealed mobile elements (n = 74), including insertion sequence (IS) elements (n = 42), phage genes (n = 6) and plasmid sequences (n = 26), hypothetical (n = 58) and membrane proteins (n = 10), and antibiotic resistance (n = 9) and regulatory genes (n = 11), mainly located on two contigs of the unfinished E. faecium DO genome. Split decomposition analysis, varying guanine cytosine content, and aberrant codon adaptation indices all supported acquisition of these genes through horizontal gene transfer with IS16 as the predicted most prominent insert (98% sensitive, 100% specific). These findings suggest that acquisition of IS elements has facilitated niche adaptation of a distinct E. faecium subpopulation by increasing its genome plasticity. 
Increased genome plasticity was supported by higher diversity indices (ratio of average genetic similarities of pulsed-field gel electrophoresis and multi locus sequence typing) for clade-specific isolates. Interestingly, the previously described multi locus sequence typing–based clonal complex 17 largely overlapped with this clade. The present data imply that the global emergence of E. faecium, as observed since 1990, represents the evolution of a subspecies with a presumably better adaptation than other E. faecium isolates to the constraints of a hospital environment.
Author Summary
Whole-genome sequencing has become instrumental in investigating the genome contents of bacteria. However, there is enormous diversity within bacterial populations, and annotation of multiple genomes is costly and elaborate. For investigating diversity and phylogeny within bacterial species, comparative genomic hybridization is an attractive alternative that may provide fundamental insights into the factors (genes) distinguishing bacterial subpopulations. Enterococcus faecium, a worldwide emerging nosocomial pathogen usually resistant to multiple antibiotics, causes infections in immunocompromised patients. Using comparative genomic hybridization of 97 E. faecium strains isolated from different epidemiological niches worldwide, a subpopulation of E. faecium strains was identified that was associated with invasive infections and hospital outbreaks. Approximately 13% of the E. faecium pangenome was highly specific for this subpopulation, and, based on phylogenetic clustering, it should be considered a subspecies. We hypothesize that extensive variation within specific functional genes and high prevalence of mobile elements, mostly insertion sequence elements, contributed to the success of this genetic subset in its competition with other enterococci in hospital settings, creating a novel globally dispersed nosocomial subspecies. These findings fully confirmed previous phylogenetic studies based on multi locus sequence typing that had also revealed a genetic subset of E. faecium, clonal complex 17. Identification of genes specific for clonal complex 17 is a first step in elucidating how global spread and adaptation to the hospital environment of this emerging nosocomial pathogen has occurred.
doi:10.1371/journal.ppat.0030007
PMCID: PMC1781477  PMID: 17257059
15.  Re-Annotation of Protein-Coding Genes in 10 Complete Genomes of Neisseriaceae Family by Combining Similarity-Based and Composition-Based Methods 
In this paper, we performed a comprehensive re-annotation of protein-coding genes by a systematic method combining composition- and similarity-based approaches in 10 complete bacterial genomes of the family Neisseriaceae. First, 418 hypothetical genes were predicted as non-coding using the composition-based method and 413 were eliminated from the gene list. Both the scatter plot and cluster of orthologous groups (COG) fraction analyses supported the result. Second, from 20 to 400 hypothetical proteins were assigned functions in each of the 10 strains based on the homology search. Among the newly assigned functions, 397 are detailed enough to have definite gene names. Third, 106 genes missed by the original annotations were picked up by an ab initio gene finder combined with similarity alignment. Transcriptional experiments validated the effectiveness of this method in Laribacter hongkongensis and Chromobacterium violaceum. Among the 106 newly found genes, some deserve particular interest. For example, 27 transposases were newly found in Neisseria meningitidis alpha14. In Neisseria gonorrhoeae NCCP11945, four new genes with putative functions and definite names (nusG, rpsN, rpmD and infA) were found, and their homologues are usually essential for survival in bacteria. The updated annotations for the 10 Neisseriaceae genomes provide a more accurate prediction of protein-coding genes and more detailed functional information for hypothetical proteins. They will benefit research into the lifestyle, metabolism, environmental adaptation and pathogenicity of the Neisseriaceae species. The re-annotation procedure could be used directly, or after adaptation of the detailed methods, for checking annotations of any other bacterial or archaeal genomes.
doi:10.1093/dnares/dst009
PMCID: PMC3686433  PMID: 23571676
the Neisseriaceae family; re-annotation; newly found genes; eliminated non-coding ORFs; newly assigned functions
16.  A Semi-Quantitative, Synteny-Based Method to Improve Functional Predictions for Hypothetical and Poorly Annotated Bacterial and Archaeal Genes 
PLoS Computational Biology  2011;7(10):e1002230.
During microbial evolution, genome rearrangement increases with increasing sequence divergence. If the relationship between synteny and sequence divergence can be modeled, gene clusters in genomes of distantly related organisms exhibiting anomalous synteny can be identified and used to infer functional conservation. We applied the phylogenetic pairwise comparison method to establish and model a strong correlation between synteny and sequence divergence in all 634 available Archaeal and Bacterial genomes from the NCBI database and four newly assembled genomes of uncultivated Archaea from an acid mine drainage (AMD) community. In parallel, we established and modeled the trend between synteny and functional relatedness in the 118 genomes available in the STRING database. By combining these models, we developed a gene functional annotation method that weights evolutionary distance to estimate the probability of functional associations of syntenous proteins between genome pairs. The method was applied to the hypothetical proteins and poorly annotated genes in newly assembled acid mine drainage Archaeal genomes to add or improve gene annotations. This is the first method to assign possible functions to poorly annotated genes through quantification of the probability of gene functional relationships based on synteny at a significant evolutionary distance, and has the potential for broad application.
Author Summary
Based on trends between gene sequence divergence and gene order divergence over time, we developed a new synteny-based method to refine functional annotation. This method uses these trends to determine the probability that any two syntenous genes (genes that are sequential in two organisms) are functionally related. Organisms that are distant relatives have few syntenous genes, but these syntenous genes have a very high probability of functional relatedness. We applied this method to newly assembled genomes of co-occurring, uncultivated acid mine drainage Archaea in order to improve their gene annotations. This application revealed important physiological differences between the co-occurring organisms in this clade, including the ability of some but not all of the Archaea to manufacture vitamin B12 and to carry out anaerobic energy metabolism. We also used this method to identify new genes possibly involved in vitamin B12 synthesis, ether lipid synthesis, molybdopterin synthesis and utilization, and microbial immunity through the CRISPR system.
doi:10.1371/journal.pcbi.1002230
PMCID: PMC3197636  PMID: 22028637
17.  Functional Annotation of Conserved Hypothetical Proteins from Haemophilus influenzae Rd KW20 
PLoS ONE  2013;8(12):e84263.
Haemophilus influenzae is a Gram-negative bacterium of the family Pasteurellaceae that causes bacteremia, pneumonia and acute bacterial meningitis in infants. The emergence of multidrug-resistant H. influenzae strains in clinical isolates demands the development of better or new drugs against this pathogen. Our study combines a number of bioinformatics tools to predict the function of previously unassigned proteins in the genome of H. influenzae. Extensive analysis of this genome found 1,657 functional proteins, of which 429 have unknown function and are termed hypothetical proteins (HPs). Amino acid sequences of all 429 HPs were extensively annotated and we successfully assigned a function to 296 HPs with high confidence. We also characterized the function of 124 HPs precisely, but with less confidence. We believe that the sequence of a protein can be used as a framework to explain known functional properties. Here we have combined the latest versions of protein family databases, protein motifs, intrinsic features from the amino acid sequence, pathway and genome context methods to assign a precise function to hypothetical proteins for which no experimental information is available. We found that these HPs belong to various classes of proteins such as enzymes, transporters, carriers, receptors, signal transducers, binding proteins, virulence and other proteins. The outcome of this work will be helpful for a better understanding of the mechanism of pathogenesis and in finding novel therapeutic targets for H. influenzae.
doi:10.1371/journal.pone.0084263
PMCID: PMC3877243  PMID: 24391926
18.  BLANNOTATOR: enhanced homology-based function prediction of bacterial proteins 
BMC Bioinformatics  2012;13:33.
Background
Automated function prediction has played a central role in determining the biological functions of bacterial proteins. Typically, protein function annotation relies on homology, and function is inferred from other proteins with similar sequences. This approach has become popular in bacterial genomics because it is one of the few methods that is practical for large datasets and because it does not require additional functional genomics experiments. However, the existing solutions produce erroneous predictions in many cases, especially when query sequences have low levels of identity with the annotated source protein. This problem has created a pressing need for improvements in homology-based annotation.
Results
We present an automated method for the functional annotation of bacterial protein sequences. Based on sequence similarity searches, BLANNOTATOR accurately annotates query sequences with one-line summary descriptions of protein function. It groups sequences identified by BLAST into subsets according to their annotation and bases its prediction on a set of sequences with consistent functional information. We show the results of BLANNOTATOR's performance in sets of bacterial proteins with known functions. We simulated the annotation process for 3090 SWISS-PROT proteins using a database in its state preceding the functional characterisation of the query protein. For this dataset, our method outperformed the five others that we tested, and the improved performance was maintained even in the absence of highly related sequence hits. We further demonstrate the value of our tool by analysing the putative proteome of Lactobacillus crispatus strain ST1.
Conclusions
BLANNOTATOR is an accurate method for bacterial protein function prediction. It is practical for genome-scale data and does not require pre-existing sequence clustering; thus, this method suits the needs of bacterial genome and metagenome researchers. The method and a web-server are available at http://ekhidna.biocenter.helsinki.fi/poxo/blannotator/.
doi:10.1186/1471-2105-13-33
PMCID: PMC3386020  PMID: 22335941
19.  A domain-centric solution to functional genomics via dcGO Predictor 
BMC Bioinformatics  2013;14(Suppl 3):S9.
Background
Computational/manual annotations of protein functions are one of the first routes to making sense of a newly sequenced genome. Protein domain predictions form an essential part of this annotation process. This is due to the natural modularity of proteins with domains as structural, evolutionary and functional units. Sometimes two, three, or more adjacent domains (called supra-domains) are the operational unit responsible for a function, e.g. via a binding site at the interface. These supra-domains have contributed to functional diversification in higher organisms. Traditionally functional ontologies have been applied to individual proteins, rather than families of related domains and supra-domains. We expect, however, to some extent functional signals can be carried by protein domains and supra-domains, and consequently used in function prediction and functional genomics.
Results
Here we present a domain-centric Gene Ontology (dcGO) perspective. We generalize a framework for automatically inferring ontological terms associated with domains and supra-domains from full-length sequence annotations. This general framework has been applied specifically to primary protein-level annotations from UniProtKB-GOA, generating GO term associations with SCOP domains and supra-domains. The resulting 'dcGO Predictor' can be used to provide functional annotation to protein sequences. The functional annotation of sequences in the Critical Assessment of Function Annotation (CAFA) has been used as a valuable opportunity to validate our method and to be assessed by the community. The functional annotation of all completely sequenced genomes has demonstrated the potential for domain-centric GO enrichment analysis to yield functional insights into newly sequenced or yet-to-be-annotated genomes. The generalized framework we have presented has also been applied to other domain classifications such as InterPro and Pfam, and other ontologies such as mammalian phenotype and disease ontology. The dcGO and its predictor are available at http://supfam.org/SUPERFAMILY/dcGO including an enrichment analysis tool.
Conclusions
As functional units, domains offer a unique perspective on function prediction regardless of whether proteins are multi-domain or single-domain. The 'dcGO Predictor' holds great promise for contributing to a domain-centric functional understanding of genomes in the next generation sequencing era.
doi:10.1186/1471-2105-14-S3-S9
PMCID: PMC3584936  PMID: 23514627
20.  An improved hypergeometric probability method for identification of functionally linked proteins using phylogenetic profiles 
Bioinformation  2013;9(7):368-374.
Predicting functions of proteins and alternatively spliced isoforms encoded in a genome is one of the important applications of bioinformatics in the post-genome era. Due to the practical limitation of experimental characterization of all proteins encoded in a genome using biochemical studies, bioinformatics methods provide powerful tools for function annotation and prediction. These methods also help minimize the growing sequence-to-function gap. Phylogenetic profiling is a bioinformatics approach to identify the influence of a trait across species and can be employed to infer the evolutionary history of proteins encoded in genomes. Here we propose an improved phylogenetic profile-based method which considers the co-evolution of the reference genome to derive the basic similarity measure, the background phylogeny of target genomes for profile generation and assigning weights to target genomes. The ordering of genomes and the runs of consecutive matches between the proteins were used to define phylogenetic relationships in the approach. We used the Escherichia coli K12 genome as the reference genome and its 4195 proteins were used in the current analysis. We compared our approach with two existing methods and our initial results show that its predictions outperform both of them. In addition, we have validated our method using a targeted protein-protein interaction network derived from the protein-protein interaction database STRING. Our preliminary results indicate that improvement in function prediction can be attained by using coevolution-based similarity measures and the runs on to the same scale instead of computing them in different scales. Our method can be applied at the whole-genome level for annotating hypothetical proteins from prokaryotic genomes.
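As background for the approach above, the following is a minimal sketch of the plain hypergeometric test on two presence/absence profiles. It is not the authors' improved method: the genome weighting, genome ordering and run-based terms described in the abstract are omitted, and the function names and the encoding of profiles as strings of "0" and "1" are assumptions made for illustration.

```python
from math import comb


def hypergeom_tail(N, n, m, k):
    """P(X >= k) for X ~ Hypergeometric(N, n, m).

    N: number of target genomes, n: genomes containing protein A,
    m: genomes containing protein B, k: genomes containing both.
    """
    total = comb(N, m)
    upper = min(n, m)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    favourable = sum(comb(n, i) * comb(N - n, m - i) for i in range(k, upper + 1))
    return favourable / total


def profile_pvalue(profile_a, profile_b):
    """Chance probability of at least the observed co-occurrence of two profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    N = len(profile_a)
    n = profile_a.count("1")
    m = profile_b.count("1")
    k = sum(1 for a, b in zip(profile_a, profile_b) if a == b == "1")
    return hypergeom_tail(N, n, m, k)


# Two hypothetical proteins present in exactly the same 5 of 10 genomes.
p_identical = profile_pvalue("1111100000", "1111100000")
```

A small probability suggests that the two proteins co-occur across genomes more often than expected by chance, which is the usual basis for inferring a functional link from phylogenetic profiles.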
doi:10.6026/97320630009368
PMCID: PMC3669790  PMID: 23750082
Protein function prediction; phylogenetic profiles; functional annotation; functional similarity
21.  Phylogenomics of Prokaryotic Ribosomal Proteins 
PLoS ONE  2012;7(5):e36972.
Archaeal and bacterial ribosomes contain more than 50 proteins, including 34 that are universally conserved in the three domains of cellular life (bacteria, archaea, and eukaryotes). Despite the high sequence conservation, annotation of ribosomal (r-) protein genes is often difficult because of their short lengths and biased sequence composition. We developed an automated computational pipeline for identification of r-protein genes and applied it to 995 completely sequenced bacterial and 87 archaeal genomes available in the RefSeq database. The pipeline employs curated seed alignments of r-proteins to run position-specific scoring matrix (PSSM)-based BLAST searches against six-frame genome translations, mitigating possible gene annotation errors. As a result of this analysis, we performed a census of prokaryotic r-protein complements, enumerated missing and paralogous r-proteins, and analyzed the distributions of ribosomal protein genes among chromosomal partitions. Phyletic patterns of bacterial and archaeal r-protein genes were mapped to phylogenetic trees reconstructed from concatenated alignments of r-proteins to reveal the history of likely multiple independent gains and losses. These alignments, available for download, can be used as search profiles to improve genome annotation of r-proteins and for further comparative genomics studies.
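The six-frame translation that the pipeline searches against can be sketched as follows. This is an illustrative, self-contained version using the standard genetic code; the function names are hypothetical, the PSSM-based BLAST search itself is not shown, and the real pipeline may use a different translation table or tool.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return translations keyed '+1'..'+3' (forward) and '-1'..'-3' (reverse)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCC")
```

Searching all six frames means a protein-level match can be found even where the gene was never called in the existing annotation, which is how such a search sidesteps annotation errors.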
doi:10.1371/journal.pone.0036972
PMCID: PMC3353972  PMID: 22615861
22.  The COG database: an updated version includes eukaryotes 
BMC Bioinformatics  2003;4:41.
Background
The availability of multiple, essentially complete genome sequences of prokaryotes and eukaryotes spurred both the demand and the opportunity for the construction of an evolutionary classification of genes from these genomes. Such a classification system based on orthologous relationships between genes appears to be a natural framework for comparative genomics and should facilitate both functional annotation of genomes and large-scale evolutionary studies.
Results
We describe here a major update of the previously developed system for delineation of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes and the construction of clusters of predicted orthologs for 7 eukaryotic genomes, which we named KOGs after eukaryotic orthologous groups. The COG collection currently consists of 138,458 proteins, which form 4873 COGs and comprise 75% of the 185,505 (predicted) proteins encoded in 66 genomes of unicellular organisms. The eukaryotic orthologous groups (KOGs) include proteins from 7 eukaryotic genomes: three animals (the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster and Homo sapiens), one plant, Arabidopsis thaliana, two fungi (Saccharomyces cerevisiae and Schizosaccharomyces pombe), and the intracellular microsporidian parasite Encephalitozoon cuniculi. The current KOG set consists of 4852 clusters of orthologs, which include 59,838 proteins, or ~54% of the analyzed eukaryotic 110,655 gene products. Compared to the coverage of the prokaryotic genomes with COGs, a considerably smaller fraction of eukaryotic genes could be included into the KOGs; addition of new eukaryotic genomes is expected to result in substantial increase in the coverage of eukaryotic genomes with KOGs. Examination of the phyletic patterns of KOGs reveals a conserved core represented in all analyzed species and consisting of ~20% of the KOG set. This conserved portion of the KOG set is much greater than the ubiquitous portion of the COG set (~1% of the COGs). In part, this difference is probably due to the small number of included eukaryotic genomes, but it could also reflect the relative compactness of eukaryotes as a clade and the greater evolutionary stability of eukaryotic genomes.
Conclusion
The updated collection of orthologous protein sets for prokaryotes and eukaryotes is expected to be a useful platform for functional annotation of newly sequenced genomes, including those of complex eukaryotes, and genome-wide evolutionary studies.
doi:10.1186/1471-2105-4-41
PMCID: PMC222959  PMID: 12969510
23.  Homology modeling, comparative genomics and functional annotation of Mycoplasma genitalium hypothetical protein MG_237 
Bioinformation  2011;7(6):299-303.
Mycoplasma genitalium is a human pathogen associated with several sexually transmitted diseases. The complete genome of M. genitalium G37 has been sequenced and provides an opportunity to understand the pathogenesis and identification of therapeutic targets. However, complete understanding of bacterial function requires proper annotation of its proteins. The genome of M. genitalium consists of 475 proteins. Among these, 94 are without any known function and are described as ‘hypothetical proteins’. We selected MG_237 for sequence and structural analysis using a bioinformatics approach. Primary and secondary structure analysis suggested that MG_237 is a hydrophilic protein containing a significant proportion of alpha helices, and subcellular localization predictions suggested it is a cytoplasmic protein. Homology modeling was used to define the three-dimensional (3D) structure of MG_237. A search for templates revealed that MG_237 shares 63% homology to a hypothetical protein of Mycoplasma pneumoniae, indicating this protein is evolutionarily conserved. The refined 3D model was generated using the (PS)2-v2 server, which incorporates MODELLER. Several quality assessment and validation parameters were computed and indicated that the homology model is reliable. Furthermore, comparative genomics analysis suggested that MG_237 is a non-homologous protein involved in four different metabolic pathways. Experimental validation will provide more insight into the actual function of this protein in microbial pathways.
PMCID: PMC3280499  PMID: 22355225
Mycoplasma genitalium; homology modelling; hypothetical proteins; comparative genomics; metabolic pathways
24.  Gene Ontology annotation of the rice blast fungus, Magnaporthe oryzae 
BMC Microbiology  2009;9(Suppl 1):S8.
Background
Magnaporthe oryzae is the causal agent of rice blast, the most destructive disease of rice worldwide. The genome of this fungal pathogen has been sequenced and an automated annotation has recently been updated to Version 6. However, a comprehensive manual curation remains to be performed. Gene Ontology (GO) annotation is a valuable means of assigning functional information using standardized vocabulary. We report an overview of the GO annotation for Version 5 of M. oryzae genome assembly.
Methods
A similarity-based (i.e., computational) GO annotation with manual review was conducted, which was then integrated with a literature-based GO annotation with computational assistance. For similarity-based GO annotation a stringent reciprocal best hits method was used to identify similarity between predicted proteins of M. oryzae and GO proteins from multiple organisms with published associations to GO terms. Significant alignment pairs were manually reviewed. Functional assignments were further cross-validated with manually reviewed data, conserved domains, or data determined by wet lab experiments. Additionally, biological appropriateness of the functional assignments was manually checked.
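The reciprocal best hits step mentioned above can be sketched as follows. This is a minimal illustration only: the input format (query, subject, score triples taken from two hypothetical similarity searches) and the function names are assumptions, and the stringency thresholds and manual review described in the abstract are not modelled.

```python
def best_hit(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, score) tuples; ties keep the first seen.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each is the other's best hit."""
    forward = best_hit(a_vs_b)
    reverse = best_hit(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores: M. oryzae proteins (m*) vs GO-annotated proteins (g*).
mo_vs_go = [("m1", "g1", 200), ("m1", "g2", 150), ("m2", "g2", 90)]
go_vs_mo = [("g1", "m1", 195), ("g2", "m1", 140), ("g2", "m2", 80)]
rbh_pairs = reciprocal_best_hits(mo_vs_go, go_vs_mo)
```

Here `m2` is not paired with `g2`, because the best hit of `g2` in the reverse search is `m1`; requiring agreement in both directions is what makes the criterion stringent.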
Results
In total, 6,286 proteins received GO term assignment via the homology-based annotation, including 2,870 hypothetical proteins. Literature-based experimental evidence, such as microarray, MPSS, T-DNA insertion mutation, or gene knockout mutation, resulted in 2,810 proteins being annotated with GO terms. Of these, 1,673 proteins were annotated with new terms developed for Plant-Associated Microbe Gene Ontology (PAMGO). In addition, 67 experiment-determined secreted proteins were annotated with PAMGO terms. Integration of the two data sets resulted in 7,412 proteins (57%) being annotated with 1,957 distinct and specific GO terms. Unannotated proteins were assigned to the 3 root terms. The Version 5 GO annotation is publicly queryable via the GO site. Additionally, the genome of M. oryzae is constantly being refined and updated as new information is incorporated. For the latest GO annotation of Version 6 genome, please visit our website. The preliminary GO annotation of Version 6 genome is placed at a local MySQL database that is publicly queryable via a user-friendly interface, the Adhoc Query System.
Conclusion
Our analysis provides comprehensive and robust GO annotations of the M. oryzae genome assemblies that will be solid foundations for further functional interrogation of M. oryzae.
doi:10.1186/1471-2180-9-S1-S8
PMCID: PMC2654668  PMID: 19278556
25.  New Assembly, Reannotation and Analysis of the Entamoeba histolytica Genome Reveal New Genomic Features and Protein Content Information 
Background
In order to maintain genome information accurately and relevantly, original genome annotations need to be updated and evaluated regularly. Manual reannotation of genomes is important as it can significantly reduce the propagation of errors and consequently diminish the time spent on mistaken research. For this reason, five years after the initial submission of the Entamoeba histolytica draft genome publication, we have re-examined the original 23 Mb assembly and the annotation of the predicted genes.
Principal Findings
The evaluation of the genomic sequence led to the identification of more than one hundred artifactual tandem duplications that were eliminated by re-assembling the genome. The reannotation was done using a combination of manual and automated genome analysis. The new 20 Mb assembly contains 1,496 scaffolds and 8,201 predicted genes, of which 60% are identical to the initial annotation and the remaining 40% underwent structural changes. Functional classification of 60% of the genes was modified based on recent sequence comparisons and new experimental data. We have assigned putative function to 3,788 proteins (46% of the predicted proteome) based on the annotation of predicted gene families, and have identified 58 protein families of five or more members that share no homology with known proteins and thus could be entamoeba specific. Genome analysis also revealed new features such as the presence of segmental duplications of up to 16 kb flanked by inverted repeats, and the tight association of some gene families with transposable elements.
Significance
This new genome annotation and analysis represents a more refined and accurate blueprint of the pathogen genome, and provides an upgraded tool as reference for the study of many important aspects of E. histolytica biology, such as genome evolution and pathogenesis.
Author Summary
Entamoeba histolytica is an anaerobic parasitic protozoan that causes amoebic dysentery. The parasites colonize the large intestine, but under some circumstances may invade the intestinal mucosa, enter the bloodstream and lead to the formation of abscesses such as amoebic liver abscesses. The draft genome of E. histolytica, published in 2005, provided the scientific community with the first comprehensive view of the gene set for this parasite and important tools for elucidating the genetic basis of Entamoeba pathogenicity. Because complete genetic knowledge is critical for drug discovery and potential vaccine development for amoebiases, we have re-examined the original draft genome for E. histolytica. We have corrected the sequence assembly, improved the gene predictions and refreshed the functional gene assignments. As a result, this effort has led to a more accurate gene annotation, and the discovery of novel features, such as the presence of genome segmental duplications and the close association of some gene families with transposable elements. We believe that continuing efforts to improve genomic data will undoubtedly help to identify and characterize potential targets for amoebiasis control, as well as to contribute to a better understanding of genome evolution and pathogenesis for this parasite.
doi:10.1371/journal.pntd.0000716
PMCID: PMC2886108  PMID: 20559563
