The integrated microbial genomes (IMG) system serves as a community resource for comparative analysis of publicly available genomes in a comprehensive integrated context. IMG contains both draft and complete microbial genomes integrated with other publicly available genomes from all three domains of life, together with a large number of plasmids and viruses. IMG provides tools and viewers for analyzing and reviewing the annotations of genes and genomes in a comparative context. Since its first release in 2005, IMG’s data content and analytical capabilities have been constantly expanded through regular releases. Several companion IMG systems have been set up in order to serve domain specific needs, such as expert review of genome annotations. IMG is available at http://img.jgi.doe.gov.
The integrated microbial genomes (IMG) system serves as a community resource for comparative analysis of publicly available genomes in a comprehensive integrated context. IMG employs NCBI’s RefSeq resource (1) as its main source of public genome sequence data, and ‘primary’ annotations consisting of predicted genes and protein products. IMG genomes are classified using NCBI’s (domain, phylum, class, order, family, genus, species, strain) taxonomy. For every genome, IMG records its primary genome sequence information from RefSeq including its organization into chromosomal replicons (for finished genomes) and scaffolds and/or contigs (for draft genomes), together with predicted protein-coding sequences (CDSs), some RNA-coding genes and protein product names that are provided by the genome sequence centers. Every genome included in IMG is associated with metadata attributes, available from GOLD (2).
IMG’s data integration pipeline computes CRISPR repeats (3), signal peptides using SignalP (4) and transmembrane helices using TMHMM (5), and associates genes with ‘secondary’ functional annotations and lists of related (e.g. homolog, paralog) genes. IMG-generated annotations consist of protein family and domain characterizations based on COG clusters and functional categories (6), Pfam (7), TIGRfam and TIGR role categories (8), InterPro domains (9), Gene Ontology terms (10) and KEGG Ortholog (KO) terms and pathways (11) (for more details, see the Data processing section of about IMG at: http://img.jgi.doe.gov/w/doc/dataprep.html). Genes are further characterized using an IMG native collection of generic (protein cluster-independent) functional roles called IMG terms that are defined by their association with generic (organism-independent) functional hierarchies, called IMG pathways (12). IMG terms and pathways are specified by domain experts at DOE-JGI as part of the process of annotating specific genomes of interest, and are subsequently propagated to all the genomes in IMG using a rule-based methodology (13).
Gene relationships in IMG are based on sequence similarities computed using NCBI BLAST (BLASTp for protein-coding genes and BLASTn for RNA genes). For each gene, IMG provides lists of related (e.g. candidate homolog, paralog, ortholog) genes that can be filtered using percent identity, bit score and more stringent E-values, or using metadata attributes such as phenotype and habitat.
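The filtering of related genes by percent identity, bit score and E-value can be illustrated with a minimal sketch. It assumes that similarity hits are available as simple records; the `Hit` class, its field names and the threshold parameters are hypothetical and do not reflect IMG's internal data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise similarity hit (hypothetical field names)."""
    query_gene: str
    subject_gene: str
    percent_identity: float
    bit_score: float
    e_value: float


def filter_hits(hits, min_identity=0.0, min_bit_score=0.0, max_e_value=float("inf")):
    """Keep only the hits that pass all three thresholds."""
    return [
        h for h in hits
        if h.percent_identity >= min_identity
        and h.bit_score >= min_bit_score
        and h.e_value <= max_e_value
    ]


# Made-up example hits for a single query gene.
hits = [
    Hit("geneA", "geneB", 92.5, 410.0, 1e-120),
    Hit("geneA", "geneC", 41.0, 95.0, 1e-20),
    Hit("geneA", "geneD", 28.0, 40.0, 1e-3),
]
related = filter_hits(hits, min_identity=30.0, max_e_value=1e-5)
print([h.subject_gene for h in related])
```

With these thresholds only the first two hits are retained; tightening `max_e_value` corresponds to the 'more stringent E-values' option mentioned above.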
IMG has regularly expanded its collection of genomes and aims to gradually improve the coverage and consistency of its functional annotations. IMG’s analytical tools have been continuously enhanced in terms of their usability, analysis flow and performance. Several companion IMG systems have been set up in order to serve domain specific needs, including expert review of genome annotations prior to their publication (IMG/ER: http://img.jgi.doe.gov/er), teaching courses and training in microbial genome analysis (IMG/EDU: http://img.jgi.doe.gov/edu), and analysis of genomes related to the Human Microbiome Project (IMG/HMP: http://www.hmpdacc-resources.org/img_hmp) (The Human Microbiome Project is part of NIH’s Roadmap for Medical Research: http://nihroadmap.nih.gov/hmp/). We review below IMG’s data content and analysis tool extensions since the last published report on IMG (14).
IMG’s initial collection of 296 bacterial, archaeal and eukaryotic genomes in its first version (March 2005) grew to 825 genomes in IMG 2.3 (September 2007) and then more than doubled to 1655 genomes in IMG 2.9 (August 2009). In addition, IMG 2.9 includes 2490 virus genomes and 970 plasmids that did not come from a specific microbial genome sequencing project, bringing its total genome content to 5115 genomes with over 6.5 million genes (a Content History link on IMG’s home page provides an overview of its content growth).
Prior to their inclusion into IMG, RefSeq genomes undergo a review process. First, the taxonomic classification for genomes and the names and host information for plasmids are reviewed. In particular, plasmid names are curated by adding strain names to the organism name when available from publications or other sources, and plasmid sequences are added to host genome sequences when appropriate. Next, missing RNAs are identified using tRNAscan-SE-1.23 (15) for tRNAs, RNAmmer (16) for rRNAs and Rfam (17) and INFERNAL (18) for small RNAs. Finally, for genomes without any functional annotation in RefSeq, protein product names are assigned to genes using the procedure described in ref. (13): such annotations are performed only by request, for example from a centre such as HMP-DACC (http://www.hmpdacc.org/).
The functional annotations generated by IMG’s data integration pipeline are regularly reviewed by scientists in JGI’s Genome Biology Program with the goal of improving their coverage. Following such a review, the KEGG collection of pathways in IMG has been reorganized and updated using the enhanced collection of KEGG resources, including KO terms and KEGG pathway modules (9). The association of KEGG pathways with IMG genomes is based on the assignment of KO terms to IMG genes via a mapping of IMG genes to KEGG genes. The MetaCyc collection of pathways (19) has also been included in IMG, whereby the association of MetaCyc pathways with IMG genomes is based on correlating enzyme EC numbers in MetaCyc reactions with EC numbers associated with IMG genes via KO terms.
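The EC-number-based association of pathways with a genome can be sketched as a simple set intersection. This is an illustration under stated assumptions rather than IMG's actual procedure: the three input mappings, the function name and all identifiers in the example are hypothetical placeholders, and a pathway is reported as soon as one of its EC numbers is found in the genome.

```python
def pathways_for_genome(gene_kos, ko_to_ecs, pathway_ecs):
    """Return {pathway: sorted EC numbers supported by the genome}.

    gene_kos:    gene id -> set of KO terms assigned to the gene
    ko_to_ecs:   KO term -> set of EC numbers linked to the term
    pathway_ecs: pathway id -> set of EC numbers of its reactions
    """
    # Collect every EC number reachable from the genome's genes via KO terms.
    genome_ecs = set()
    for kos in gene_kos.values():
        for ko in kos:
            genome_ecs |= ko_to_ecs.get(ko, set())

    # A pathway is associated with the genome if they share an EC number.
    return {
        pathway: sorted(ecs & genome_ecs)
        for pathway, ecs in pathway_ecs.items()
        if ecs & genome_ecs
    }


# Placeholder identifiers, not real KO terms, EC numbers or pathways.
gene_kos = {"gene1": {"KO_demo1"}, "gene2": {"KO_demo2"}, "gene3": set()}
ko_to_ecs = {"KO_demo1": {"EC:x.1"}, "KO_demo2": {"EC:x.2"}}
pathway_ecs = {
    "pathway_P": {"EC:x.1", "EC:x.2", "EC:x.3"},
    "pathway_Q": {"EC:x.9"},
}
associated = pathways_for_genome(gene_kos, ko_to_ecs, pathway_ecs)
print(associated)
```

In this example `pathway_P` is associated with the genome through two of its three EC numbers, while `pathway_Q` is not reported.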
Two interactive reports regarding the KO term distribution in IMG across protein families, genomes and paralog clusters are provided for assessing the consistency of protein family annotations in IMG. For a specific (query) KO term, the first report lists: (i) the number of genes associated with the query KO term and the number of genomes that have genes associated with this KO term; (ii) the ‘average number of genes’ associated with the query KO term per genome, whereby this metric helps identify KO terms that were assigned to multiple genes in the same genome either by mistake or because these terms correspond to sequence similarity-based families rather than function-based groups; (iii) the number of genes associated with the query KO term that belong to paralog clusters, whereby this metric indicates the likelihood of incorrect annotations due to the presence of paralogs; and (iv) the number of genes associated with the query KO term that have a paralog annotated with the same KO term, whereby this number helps identify incorrectly annotated paralogous genes.
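The four metrics of the first report can be sketched as follows. The gene records, their keys and the function name are hypothetical; the sketch also assumes that the average in (ii) is taken over the genomes that carry the query KO term and that paralog cluster identifiers are unique within a genome.

```python
from collections import Counter


def ko_report(genes, query_ko):
    """Compute metrics (i) to (iv) for one query KO term.

    genes: list of dicts with keys 'gene', 'genome', 'kos' (set of KO
    terms) and 'paralog_cluster' (cluster id, or None if unclustered).
    """
    hits = [g for g in genes if query_ko in g["kos"]]
    genomes = {g["genome"] for g in hits}
    clustered = [g for g in hits if g["paralog_cluster"] is not None]
    # Number of query-KO genes per (genome, paralog cluster).
    sizes = Counter((g["genome"], g["paralog_cluster"]) for g in clustered)
    same_ko_paralog = sum(
        1 for g in clustered if sizes[(g["genome"], g["paralog_cluster"])] > 1
    )
    return {
        "genes": len(hits),
        "genomes": len(genomes),
        "avg_genes_per_genome": len(hits) / len(genomes) if genomes else 0.0,
        "genes_in_paralog_clusters": len(clustered),
        "genes_with_same_ko_paralog": same_ko_paralog,
    }


# Made-up annotations for two genomes.
genes = [
    {"gene": "g1", "genome": "A", "kos": {"K1"}, "paralog_cluster": "p1"},
    {"gene": "g2", "genome": "A", "kos": {"K1"}, "paralog_cluster": "p1"},
    {"gene": "g3", "genome": "A", "kos": {"K2"}, "paralog_cluster": "p1"},
    {"gene": "g4", "genome": "B", "kos": {"K1"}, "paralog_cluster": None},
    {"gene": "g5", "genome": "B", "kos": {"K1"}, "paralog_cluster": "p2"},
    {"gene": "g6", "genome": "B", "kos": {"K3"}, "paralog_cluster": "p2"},
]
report = ko_report(genes, "K1")
print(report)
```

Here genes g1 and g2 are paralogs carrying the same KO term, which is the situation that metric (iv) is meant to flag for review.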
The second report lists for each unique (COG, Pfam, TIGRfam) combination: (i) the number of genes associated with the query KO term and this combination; (ii) the number of genes associated with this combination and a KO term different from the query KO term, including genes associated with multiple KO terms and a query KO term as one of them; (iii) the number of genes associated with this combination and a KO term different from the query KO term, and not associated with the query KO term; and (iv) the number of genes associated with this combination and not associated with any KO term.
The gene correlations computed by IMG’s data integration pipeline have been extended from pair-wise relationships to include gene fusions and cassettes. A fused gene (fusion) is defined as a gene that is formed from the composition (fusion) of two or more previously separate genes (component genes). The identification of fusions employs well-established methods based on pair-wise similarities between genes (20) (fusion computation is described at: http://img.jgi.doe.gov/w/doc/fusions.html). Genes, such as transposases and integrases, pseudogenes and genes from draft genomes are not considered as putative fusion components in order to avoid false positives caused by gene fragmentation.
A chromosomal neighbourhood, also known as chromosomal cassette, is defined as a stretch of genes with intergenic distance smaller than or equal to 300 bp (21), whereby the genes can be on the same or different strands. Chromosomal cassettes with a minimum size of two genes common to at least two separate genomes are defined as conserved chromosomal cassettes. The identification of common genes across organisms is based on three gene clustering methods, namely participation in COG, Pfam and IMG ortholog clusters. The computation of gene cassettes and their support for context analysis in IMG is described in detail in ref. (22).
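The cassette definition above lends itself to a short sketch: genes on one scaffold are sorted by start coordinate and grouped while the gap to the preceding genes does not exceed 300 bp, regardless of strand. The tuple layout, the function name and the choice of measuring the gap as the next start minus the largest preceding end are assumptions made for illustration; the actual computation is the one described in ref. (22).

```python
def find_cassettes(genes, max_gap=300, min_size=2):
    """Group genes into cassettes by intergenic distance.

    genes: list of (gene_id, start, end) tuples on a single scaffold.
    Strand is ignored, since cassette genes may lie on either strand.
    """
    cassettes = []
    current = []        # gene ids of the cassette being built
    current_end = None  # largest end coordinate seen in that cassette
    for gene_id, start, end in sorted(genes, key=lambda g: g[1]):
        if current and start - current_end > max_gap:
            # Gap too large: close the current cassette and start a new one.
            if len(current) >= min_size:
                cassettes.append(current)
            current = []
        if not current:
            current_end = end
        else:
            current_end = max(current_end, end)
        current.append(gene_id)
    if len(current) >= min_size:
        cassettes.append(current)
    return cassettes


# Made-up coordinates: gaps of 100, 250, 600, 200 and 1000 bp.
genes = [
    ("g1", 100, 900),
    ("g2", 1000, 1800),
    ("g3", 2050, 2900),
    ("g4", 3500, 4200),
    ("g5", 4400, 5000),
    ("g6", 6000, 6500),
]
cassettes = find_cassettes(genes)
print(cassettes)
```

The isolated gene g6 forms no cassette because a cassette needs at least two genes. Deciding whether a cassette is conserved would additionally require comparing the protein clusters (COG, Pfam or IMG ortholog clusters) of its genes across genomes, which is not shown here.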
Genome data analysis in IMG consists of operations involving genomes, genes and functions which can be selected, explored individually and compared. The composition of analysis operations is facilitated by gene and function ‘carts’ that handle lists of genes and functions, respectively.
Genomes, genes and functions can be selected using browsers and search tools. Browsers allow users to select genomes and functions organized as alphabetical lists or using domain specific hierarchical classifications. Keyword search tools allow identifying genomes, genes and functions of interest using a variety of selection filters. Genomes can be also selected using a search tool which allows specifying conditions involving metadata attributes, while genes can be also selected using BLAST search tools against various datasets.
IMG’s data selection tools have been extended in order to improve their efficiency and usability. In particular, genomes can be selected using a new phylogenetic tree based ‘Genome Browser’, a geographical location based project map, and a metadata based classification, as illustrated in Figure 1. The phylogenetic tree based ‘Genome Browser’ starts with a display of the three genome domains, as illustrated in Figure 1(i), which can be expanded using open/close icons available at each level of the tree, as illustrated in Figure 1(ii). Genomes can be selected either individually or in groups using the green dot ‘select all’ icons available at each level of the tree. For example, clicking the ‘select all’ (green dot) icon associated with Crenarchaeota, as illustrated in Figure 1(ii), will both expand the sub-tree under this phylum down to individual genomes and select all these genomes, as illustrated in Figure 1(iii). Genomes can be unselected (cleared) either individually or in groups using the red dot ‘clear all’ icons available at each level of the tree.
The ‘Genome by Metadata’ link on IMG’s home page provides access to a classification of the archaeal, bacterial and eukaryotic genomes by several metadata attributes, as illustrated in Figure 1(iv). The metadata attributes and values are taken from GOLD (2) and reflect the continuously increasing level of information collection and curation in this resource.
Individual genomes can be explored using the ‘Organism Details’ page, which includes information on the organism together with various genome statistics of interest, such as the number of genes that are associated with KEGG, COG, Pfam, InterPro or enzyme information. Individual genes can be analyzed using the ‘Gene Details’ page, which includes Gene Information, Protein Information, and Pathway Information tables, evidence for functional prediction, COG, Pfam and precomputed homologs. New graphical viewers, such as graphical displays of the distribution of genes associated with COG, Pfam, TIGRfam and KEGG for each genome, have been added to ‘Organism Details’ and ‘Gene Details’ in order to facilitate genome and gene exploration. Individual functional categories, such as KEGG Orthology terms and pathways, MetaCyc pathways, can be explored using functional category specific browsers.
Several new IMG tools allow users to search and explore gene cassette information. A chromosomal cassette involving a specific (query) gene can be examined using a ‘Chromosomal Cassette Details’ page available via the ‘Gene Information’ section of ‘Gene Details’ for that gene. This page provides information on the protein clusters (e.g. COGs) of all the genes in the cassette, as well as information on other cassettes that share at least two protein clusters with the cassette that includes the query gene. Gene cassettes can be searched using ‘Cassette Search’ and ‘Phylogenetic Profiler for Gene Cassettes’. ‘Cassette Search’ allows users to find genes that are part of chromosomal cassettes involving specific protein clusters, as illustrated in Figure 2(i), where the search involves COG clusters. By default, the search is carried out across all the genomes in IMG, with various filters provided for limiting the search to specific genomes. The result of ‘Cassette Search’ consists of genes that satisfy the search condition, together with the identifiers of the cassettes they are part of, their associated protein cluster identifiers and names, and their genomes, as illustrated in Figure 2(ii). Cassette identifiers provide links to the ‘Chromosomal Cassette’ details page, as illustrated in Figure 2(iii).
The genomes that result from browsing and search operations are displayed as a list from which they can be selected and saved for further analysis. The genes and functions that result from search operations are displayed as lists from which genes and functions can be selected for inclusion into the ‘Gene Cart’ and ‘Function Cart’, respectively.
IMG comparative analysis tools allow comparing genomes in terms of gene content, functional and metabolic capabilities, and sequence conservation.
Genomes can be compared in terms of gene content using the ‘Phylogenetic Profiler’ tool, which allows users to identify genes in a query genome in terms of presence or absence of homologs in other genomes. This tool can be used, for example, for finding unique genes in the query genome with respect to other genomes of interest. The ‘Phylogenetic Profiler for Gene Cassettes’ extends its counterpart for single genes by allowing users to find genes that are part of a gene cassette in a query genome as well as part of related (conserved part of) gene cassettes in other genomes, as illustrated in Figure 2(iv). The result of such a search includes a summary, as shown in the left side pane of Figure 2(v), and a details part that displays groups of collocated genes in each chromosomal cassette in the query genome that satisfy the search condition, as illustrated in Figure 2(v). The conserved part of a chromosomal cassette involving an individual gene in the query genome can be examined using the links provided in the ‘Conserved Neighbourhood Viewer Centred on this Gene’ column of the results table, as shown in Figure 2(vi). More details on context analysis based on IMG’s gene cassettes can be found in (22).
The gene content of a genome can be examined from an evolutionary point of view using tools available as part of a genome’s ‘Organism Details’. The ‘Phylogenetic Distribution of Genes’ provides a glimpse into the evolutionary history of the genes in a genome based on the distribution of best BLAST hits of its protein-coding genes. The genes that were likely vertically inherited are expected to have higher sequence similarity to the genes in the genomes within the same taxonomic group, while those horizontally transferred may have their best BLAST hits to the genes in distantly related organisms. Since this tool considers best BLAST hits and does not perform phylogenetic tree reconstruction and analysis, the results can be used as a first approximation of the evolutionary history of the genes and require manual analysis to establish whether the genes of interest were indeed horizontally transferred. The phylogenetic distribution of best BLAST hits of protein-coding genes in a selected genome is displayed as a histogram, as shown in Figure 3(i); counts correspond to the number of genes that have best BLASTp hits to proteins of other genomes in a specific phylum or class with >90% identity (right column), 60–90% identity (middle column) and 30–60% identity (left column). The phylogenetic distribution of best BLAST hits can be further projected onto the families in a phylum/class. Gene counts in the histogram are linked to the lists of genes in the selected genome that have best BLAST hit in a certain phylum/class with specified percent identity. The genes in the table can be selected and added to ‘Gene Cart’ or analyzed through the corresponding ‘Gene Details’.
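The histogram counts described above amount to binning best BLASTp hits by taxonomic group and percent identity. The following sketch assumes that the best hit of each gene is already known; the input layout, the bin labels and the treatment of boundary values (30 and 60 included in the upper bin, 90 included in the top bin) are assumptions, since the text does not specify them.

```python
from collections import defaultdict


def identity_bin(percent_identity):
    """Map a percent identity to a histogram bin label, or None if < 30%."""
    if percent_identity >= 90:
        return "90+"
    if percent_identity >= 60:
        return "60-90"
    if percent_identity >= 30:
        return "30-60"
    return None


def phylo_distribution(best_hits):
    """Count genes per (phylum or class of best hit, identity bin).

    best_hits: iterable of (gene_id, taxon_of_best_hit, percent_identity).
    """
    counts = defaultdict(lambda: {"30-60": 0, "60-90": 0, "90+": 0})
    for _gene_id, taxon, percent_identity in best_hits:
        label = identity_bin(percent_identity)
        if label is not None:
            counts[taxon][label] += 1
    return dict(counts)


# Made-up best hits for five genes of a hypothetical archaeal genome.
best_hits = [
    ("g1", "Euryarchaeota", 95.0),
    ("g2", "Euryarchaeota", 72.0),
    ("g3", "Firmicutes", 45.0),
    ("g4", "Firmicutes", 25.0),
    ("g5", "Euryarchaeota", 61.0),
]
distribution = phylo_distribution(best_hits)
print(distribution)
```

Gene g4 falls below the 30% identity floor and is therefore not counted. As noted above, such counts give only a first approximation of evolutionary history and do not replace phylogenetic tree analysis.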
‘Putative Horizontally Transferred Genes’, also available as part of a genome’s ‘Organism Details’, allows users to explore genes in a query genome that are likely horizontally transferred from genomes in phylogenetic groups that are different than the group the query genome belongs to. Putative horizontally transferred genes are defined as genes that have best hits (best bit scores) to genes that do not belong to the phylogenetic group of the query genome. In this calculation, we use not only the best hit (i.e. the hit with the best bit score) but also all the hits that have a bit score equal to or greater than 90% of that of the best hit. For a query genome, such as Methanosaeta thermophila PT, two lists of genes are provided, as illustrated in Figure 3(ii). The first list consists of genes with best hits (best bit score) to genes of genomes within a phylogenetic group (domain, phylum, class, etc.) that is different than the analogous group the query genome belongs to. For example, as an archaeal genome, M. thermophila PT has 228 genes with best hits to bacterial genomes, 17 genes with best hits to eukaryotic genomes, and 1 gene with best hits to viral genomes. These genes may be horizontally transferred genes from bacterial, eukaryotic or viral genomes, respectively. The second list consists of genes with best hits to genomes within a phylogenetic group (domain, phylum, class, etc.) that is different than the analogous group the query genome belongs to, and no hits to genes of genomes within the same phylogenetic group (domain, phylum, class, etc.) as the group the query genome belongs to. For example, M. thermophila PT has two genes with best hits to bacterial genomes and no hits to other archaeal genomes, as illustrated in Figure 3(iii), with a higher likelihood of being horizontally transferred from bacterial genomes.
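The two-list classification can be sketched for a single gene as follows. The sketch makes one interpretive assumption that the text leaves open: a gene is treated as putatively transferred only if every hit scoring at least 90% of the best bit score lies outside the query genome's group. The function name, the input layout and the returned labels are hypothetical.

```python
def classify_gene(hits, query_group, fraction=0.9):
    """Classify one gene of the query genome.

    hits: list of (phylogenetic_group_of_hit, bit_score) for the gene.
    Returns 'not_putative', 'putative' (first list only) or
    'putative_no_ingroup_hits' (also a member of the second list).
    """
    if not hits:
        return "not_putative"
    best = max(score for _group, score in hits)
    # Best hit plus all hits within 90% of its bit score.
    top_groups = [group for group, score in hits if score >= fraction * best]
    if any(group == query_group for group in top_groups):
        return "not_putative"
    if any(group == query_group for group, _score in hits):
        return "putative"
    return "putative_no_ingroup_hits"


# Made-up hits for three genes of a hypothetical archaeal genome.
gene1 = [("Bacteria", 500.0), ("Archaea", 480.0)]   # in-group hit is near-best
gene2 = [("Bacteria", 500.0), ("Archaea", 300.0)]   # in-group hit is much weaker
gene3 = [("Bacteria", 500.0), ("Bacteria", 460.0)]  # no in-group hit at all
labels = [classify_gene(h, "Archaea") for h in (gene1, gene2, gene3)]
print(labels)
```

Genes labelled `putative_no_ingroup_hits` correspond to the second list described above; because that list is a subset of the first, such genes would be reported in both.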
Genomes can be compared in terms of functional capabilities using a number of functional profile tools. The ‘Abundance Profile Overview’ allows users to compare the relative abundance of protein families (COGs, Pfams, TIGRfams) and functional families (enzymes) across selected genomes, as illustrated in Figure 4(i) where the T. volcanium and T. acidophilum genomes are compared in terms of enzymes assigned to their genes. The abundance of protein/functional families is displayed either as a heat map or a matrix, as illustrated in Figure 4(ii), where each column corresponds to a genome, and each row corresponds to a family. In the heat map, red corresponds to the most abundant families; in the matrix (tabular) format, each cell contains the number of genes associated with a family for a specific genome. Cells in the heat map and matrix are linked to the list of genes assigned to a particular family in a genome. Families of interest can be selected for inclusion into the ‘Function Cart’. The results in matrix format can be exported to a tab-delimited Excel file. The functional capabilities of genomes can be also compared using the ‘Function Profile’, which is a selective version of the ‘Abundance Profile Overview’, with functions of interest first selected with the ‘Function Cart’. The ‘Function Profile’ result is displayed in a matrix format, as illustrated in Figure 4(iii), similar to the matrix display for ‘Abundance Profile Overview’ results.
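The matrix form of an abundance profile, with one row per family and one column per genome, can be sketched as a simple count followed by a tab-delimited export. The function names, the input layout and the identifiers in the example are hypothetical, and the counts are made up for illustration.

```python
from collections import Counter


def abundance_matrix(assignments, genomes):
    """Build {family: [gene count per genome]} in the order of `genomes`.

    assignments: list of (genome, gene_id, family) tuples.
    """
    counts = Counter((family, genome) for genome, _gene_id, family in assignments)
    families = sorted({family for _genome, _gene_id, family in assignments})
    return {
        family: [counts[(family, genome)] for genome in genomes]
        for family in families
    }


def to_tsv(matrix, genomes):
    """Render the matrix as tab-delimited text with a header row."""
    lines = ["\t".join(["family"] + list(genomes))]
    for family, row in matrix.items():
        lines.append("\t".join([family] + [str(n) for n in row]))
    return "\n".join(lines)


# Made-up enzyme assignments for two placeholder genomes.
genomes = ["genome_A", "genome_B"]
assignments = [
    ("genome_A", "a1", "EC:1.1.1.1"),
    ("genome_A", "a2", "EC:1.1.1.1"),
    ("genome_B", "b1", "EC:1.1.1.1"),
    ("genome_A", "a3", "EC:2.7.1.1"),
]
matrix = abundance_matrix(assignments, genomes)
print(to_tsv(matrix, genomes))
```

A ‘Function Profile’ would correspond to the same matrix restricted to the families selected beforehand, for example by filtering `assignments` on a set of functions of interest.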
The metabolic capabilities of genomes can be analyzed using functional profile tools applied on enzymes (e.g. the enzymes involved in a pathway of interest) together with a tool for finding ‘missing’ enzymes that are marked by a null abundance in the function profile result. Such a null abundance for a specific ‘missing’ enzyme leads to the ‘Find Candidate Genes for Missing Function’ tool, as illustrated in Figure 4(iv), which allows users to search for candidate genes that could be associated with this missing enzyme either via KO terms or homolog/ortholog genes associated with it. The result of the search for candidate genes, illustrated in Figure 4(v), consists of a list of genes that can be selected and included into the ‘Gene Cart’ and further examined using various tools, such as gene neighbourhood analysis and multiple sequence alignment tools.
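The search for candidate genes for a missing enzyme can be sketched for the homolog-based route only. The sketch assumes a function profile given as a mapping from enzyme to per-genome gene counts, a set of genes in other genomes already annotated with each enzyme, and precomputed homolog lists for the query genome's genes; all names and identifiers are hypothetical and the KO-based route is not shown.

```python
def missing_functions(profile, genome_index):
    """Return functions with a null abundance in the chosen genome column.

    profile: {function: [gene count per genome]}.
    """
    return [f for f, row in profile.items() if row[genome_index] == 0]


def candidate_genes(missing_enzyme, annotated_genes, homologs):
    """Return query-genome genes with a homolog annotated with the enzyme.

    annotated_genes: {enzyme: set of genes (in other genomes) carrying it}
    homologs:        {query gene: set of its homologs in other genomes}
    """
    carriers = annotated_genes.get(missing_enzyme, set())
    return sorted(gene for gene, related in homologs.items() if related & carriers)


# Made-up profile for two placeholder genomes (columns: genome_A, genome_B).
profile = {"EC:1.1.1.1": [2, 1], "EC:2.7.1.1": [1, 0]}
missing = missing_functions(profile, genome_index=1)

annotated_genes = {"EC:2.7.1.1": {"a3"}}
homologs = {"b7": {"a3", "a9"}, "b8": {"a1"}}
candidates = candidate_genes(missing[0], annotated_genes, homologs)
print(missing, candidates)
```

The resulting candidates are only starting points; as described above, they would still need to be examined with gene neighbourhood and multiple sequence alignment tools.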
Sequences of genomes can be compared using VISTA tools (23) and a ‘Dotplot’ tool. Users can select an organism from a predefined list in order to invoke the VISTA browser that can be then employed for examining the sequence conservation of closely related organisms in IMG. ‘Dotplot’, a recent addition to IMG’s comparative analysis toolkit, employs the program MUMmer to generate dotplot diagrams between two genomes, whereby nucleotide sequences are used for genomes with fairly similar sequences and protein sequences are used for genomes with less similar nucleotide sequences.
The initial IMG system has expanded into a family of four related systems covering two application domains: microbial genome analysis (IMG, IMG ER) and metagenome analysis (IMG/M, IMG/M ER).
The ‘Expert Review’ version of IMG (IMG/ER) allows individual scientists or groups of scientists to review and curate the functional annotation of microbial genomes in the context of IMG’s public genomes. Scientists include their genome datasets into IMG ER prior to their public release either with their original annotations or with annotations generated by IMG’s annotation pipeline (13). IMG ER provides tools for identifying and correcting annotation anomalies, such as dubious protein product names, and for filling annotation gaps detected using IMG’s comparative analysis tools, such as genes that may have been missed by gene prediction tools or genes without predicted functions (24). The development of the IMG ER tools was driven by and applied to the genome analysis and curation needs of over 150 microbial genomes, such as Halothermothrix orenii (25). In addition to individual genome reviews, the annotations of a group of 56 Genomic Encyclopedia for Bacteria and Archaea (GEBA) genomes (http://www.jgi.doe.gov/programs/GEBA/pilot.html) were revised by JGI scientists using IMG ER (26). Gene annotations that result from expert review and curation are captured in IMG ER as so called ‘MyIMG’ annotations associated with individual scientist or group accounts. Genomes curated with IMG ER are included into Genbank either as new submissions or as revisions of previously submitted datasets, thus contributing to a coordinated improvement of the public genome data resources.
The ‘Integrated Microbial Genomes with Microbiome Samples’ (IMG/M) system provides support for the comparative analysis of metagenomic sequences generated with various sequencing technology platforms and data processing methods in the context of the reference isolate genomes from IMG. IMG/M’s analysis tools extend IMG’s comparative analysis tools with metagenome-specific analysis tools (27). Similar to IMG ER, an ‘Expert Review’ version of IMG/M (IMG/M ER) provides support for annotation review and curation of metagenome datasets prior to their public release.
IMG/HMP is an auxiliary resource based on IMG focusing on analysis of genomes related to the Human Microbiome Project (HMP) in the context of all publicly available genomes in IMG. IMG/HMP is part of the HMP Data Analysis and Coordination Center (DACC) funded by the National Institutes of Health (http://www.hmpdacc.org/).
IMG’s genome sequence data content is maintained through regular updates from RefSeq and other public sequence data resources. IMG’s functional annotations are gradually extended by including annotations from systems, such as SEED (http://www.theseed.org/wiki/Home_of_the_SEED), or by providing links to systems such as CMR (http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi), thus providing extensive corroboration of annotations from multiple microbial genome data resources.
IMG has recently been extended to include protein expression data from a recent Arthrobacter chlorophenolicus study (28). Protein expression studies for a genome of interest are provided via the genome’s ‘Organism Details’, whereby each study is associated with the number of expressed genes, observed peptides, and a list of experiments/samples. The description for each sample consists of the experimental conditions and provides a link to the protein expression data for the sample organized per expressed gene. For each expressed gene, the number of observed peptides leads to the peptide details page, where the peptide sequences are displayed aligned on the gene’s protein sequence. For an expressed gene, the ‘Protein Information’ section of its ‘Gene Details’ provides a link to a ‘Proteomic Data’ page which displays the list of experiments/samples involving the expressed gene and the peptides observed for the expressed gene as part of each experiment. We plan to follow a similar strategy for including into IMG results from microarray experiments, as well as information on transcriptional regulatory binding sites.
In order to facilitate the exploration of a rapidly increasing number of genomes, genes and annotations, IMG will be extended with pangenomes, where a pangenome represents the sum of all the genes present in the genomes of different strains belonging to a given species (29). Pangenome analysis tools and viewers will allow users to explore individual pangenomes and compare pangenomes and genomes.
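The pangenome definition above, the sum of all genes present in the strains of a species, corresponds to a set union. The sketch below assumes that genes from different strains have already been grouped into shared gene families; the function name, the input layout and the identifiers are hypothetical and do not describe the planned IMG tools.

```python
def pangenome(strain_genes):
    """Return (all gene families, {family: sorted strains containing it}).

    strain_genes: {strain name: set of gene family ids in that strain}.
    """
    # The pangenome is the union of the gene families of all strains.
    all_families = set().union(*strain_genes.values())
    presence = {
        family: sorted(
            strain for strain, families in strain_genes.items() if family in families
        )
        for family in all_families
    }
    return all_families, presence


# Made-up gene family content for two strains of one species.
strain_genes = {
    "strain_1": {"famA", "famB"},
    "strain_2": {"famA", "famC"},
}
families, presence = pangenome(strain_genes)
print(sorted(families), presence["famA"])
```

The per-family presence list is one simple way to compare an individual genome with the pangenome, since families absent from a strain are those whose list does not include it.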
Director, Office of Science, Office of Biological and Environmental Research, Life Sciences Division, U.S. Department of Energy (Contract No. DE-AC02-05CH11231). Funding for open access charge: Lawrence Berkeley National Laboratory.
Conflict of interest statement. None declared.
We thank Philip Hugenholtz, Alla Lapidus, Amrita Pati, Sean Hooper and Inna Dubchak for their contribution to the development and maintenance of IMG. The work of JGI’s production, cloning, sequencing, assembly, finishing and annotation teams is an essential prerequisite for IMG. Eddy Rubin and James Bristow provided support, advice and encouragement throughout this project.