Genome data analysis in IMG consists of operations involving genomes, genes and functions, which can be selected, explored individually, and compared. The composition of analysis operations is facilitated by genome, scaffold, gene and function ‘carts’ that handle lists of genomes, scaffolds, genes and functions, respectively.
Data selection tools
Genomes, genes and functions can be selected using browsers and search tools. Browsers allow users to select genomes and functions organized as alphabetical lists or using domain-specific hierarchical classifications. Keyword search tools allow users to identify genomes, genes and functions of interest using a variety of selection filters. Genomes can also be selected using a search tool that allows specifying conditions involving metadata attributes, such as temperature range, oxygen requirement or ecosystem, while genes can also be selected using BLAST search tools against various data sets.
IMG’s data selection tools have been extended in order to improve their efficiency and usability. For example, genomes can be selected using ‘Genome Browser’ or ‘Genome Search’, as illustrated in Figure 2.
Figure 2. Genome browser and search tools. The ‘Genome Browser’ displays the genomes organized in a phylogenetic tree or (i) in a tabular list that can be configured by (ii) adding or removing genome, metadata or annotation specific columns.
The ‘Genome Browser’ displays the genomes organized in a phylogenetic tree or in a tabular format, as illustrated in Figure 2(i). The tabular display of genomes has a dynamic layout, with columns that can be resized, reordered and sorted on content, a configurable page display size, and an export capability for saving tables as Excel spreadsheets or tab-delimited files. A ‘Column Selector’ allows users to hide columns. The genome table can also be reconfigured by adding or removing genome, metadata or annotation specific columns, as illustrated in Figure 2(ii). Note that the number of metadata attributes associated with genomes has increased substantially in the past few years; the data for these attributes are collected from GOLD (2). ‘Genome Search’ allows searching genomes on genome or metadata specific fields, as illustrated in Figure 2(iii).
Individual genomes can be explored using the ‘Organism Details’ page, which provides a variety of tools for browsing, searching for the presence of specific genes, or downloading genome data sets, as illustrated in Figure 2(iv). This page also provides information (metadata) on the genome together with various genome statistics of interest, such as the number of genes that are associated with KEGG, COG, Pfam, InterPro or enzyme information. Individual genes can be analyzed using the ‘Gene Details’ page, which includes Gene Information, Protein Information and Pathway Information tables, evidence for functional prediction, COG, Pfam and pre-computed homologues.
Tabular and graphical displays, such as graphical viewers for the distribution of genes associated with COG, Pfam, TIGRfam and KEGG for each genome, have been extended in order to facilitate genome and gene exploration. Individual functional categories, such as COG, Pfam, TIGRfam, KEGG Orthology terms and pathways, can be explored using functional category specific browsers.
New IMG tools provide support for examining protein expression data, as illustrated in Figure 3. Protein expression studies are listed in the ‘Experiments Statistics’ section of the ‘IMG Statistics’ page and are available on the ‘Organism Details’ page of the genome they are associated with. A protein expression study, such as the ‘Impact of Phenolic Substrate and Growth Temperature on the Arthrobacter chlorophenolicus’ study shown in Figure 3(i), is associated with a list of samples (experiments). Summaries for samples include a description, the number of associated genes, the peptide count and the total and average coverage for the sample, as illustrated in Figure 3(ii). The coverage for a gene is the count of its associated peptides divided by the size of the gene, and the total coverage is the sum of the coverages for the genes in a sample. Samples can be selected for further analysis. Expressed genes of a single sample can be examined in the context of pathways, as illustrated in Figure 3(iv), whereby enzymes are displayed with colours representing the level of expression for the associated genes. Expressed genes of multiple samples can also be examined in the context of pathways, whereby enzymes are displayed with colours representing the percentage of samples with expressed genes associated with the enzymes. Samples (experiments) can be clustered based on coverage values for the genes expressed in each sample, with a choice of clustering methods, such as pairwise complete linkage and centroid linkage, and distance measures, such as Pearson correlation, Spearman’s rank correlation and Euclidean distance. The result of clustering is displayed as a hierarchical tree of samples and a normalized heat map of coverage values for each gene for each sample.
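The coverage definitions above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not IMG's implementation: the `GeneObservation` record and its field names are hypothetical, and the average coverage is assumed here to be the total coverage divided by the number of genes in the sample, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class GeneObservation:
    """One expressed gene in one sample (hypothetical record layout)."""
    gene_id: str
    peptide_count: int  # peptides associated with the gene in this sample
    gene_size: int      # size of the gene, in whatever unit the study uses


def gene_coverage(obs):
    """Coverage of a gene: peptide count divided by gene size."""
    if obs.gene_size <= 0:
        raise ValueError("gene size must be positive")
    return obs.peptide_count / obs.gene_size


def sample_summary(observations):
    """Summarize a sample: gene count, peptide count, total and average coverage."""
    coverages = [gene_coverage(o) for o in observations]
    total = sum(coverages)
    # Assumption: average coverage = total coverage / number of genes.
    average = total / len(coverages) if coverages else 0.0
    return {
        "gene_count": len(coverages),
        "peptide_count": sum(o.peptide_count for o in observations),
        "total_coverage": total,
        "average_coverage": average,
    }


if __name__ == "__main__":
    sample = [
        GeneObservation("gene_1", peptide_count=10, gene_size=100),
        GeneObservation("gene_2", peptide_count=5, gene_size=50),
    ]
    print(sample_summary(sample))
```

Per-gene coverage vectors of this kind are what a clustering step would take as input, one vector per sample.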
Figure 3. Protein expression exploration tools. (i) ‘Protein Expression Studies’ are listed on the IMG Statistics page, with each study associated with (ii) a list of ‘Protein Expression Experiments’ (samples).
Sample pairs can be compared in terms of up- or down-regulation of genes, with a threshold specified for the difference in gene expression. The difference in expression is computed using either the logR = log2(query/reference) or the RelDiff = 2(query − reference)/(query + reference) metric. The result of the comparison can be displayed as a histogram, as illustrated in Figure 3(v), or in a tabular format. This histogram can be used to identify and set thresholds for the search for over-expressed or under-expressed genes between any pair of selected conditions.
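The two metrics can be written directly from their definitions. The sketch below is a minimal Python rendering; the function names, the symmetric use of a single threshold, and the handling of zero values are assumptions made for illustration, since the text does not say how IMG treats these cases.

```python
import math


def log_ratio(query, reference):
    """logR = log2(query / reference); undefined for non-positive values."""
    if query <= 0 or reference <= 0:
        raise ValueError("logR requires positive expression values")
    return math.log2(query / reference)


def relative_difference(query, reference):
    """RelDiff = 2 * (query - reference) / (query + reference)."""
    total = query + reference
    if total == 0:
        raise ValueError("RelDiff is undefined when both values are zero")
    return 2 * (query - reference) / total


def classify(difference, threshold):
    """Label a gene as up- or down-regulated (assumed symmetric threshold)."""
    if difference >= threshold:
        return "up"
    if difference <= -threshold:
        return "down"
    return "unchanged"


if __name__ == "__main__":
    query, reference = 4.0, 1.0
    print(log_ratio(query, reference), relative_difference(query, reference))
    print(classify(log_ratio(query, reference), threshold=1.0))
```

Note that RelDiff is bounded between −2 and 2 for non-negative inputs and remains defined when one of the two values is zero, whereas logR does not.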
The genomes, genes and functions that result from search operations are displayed as lists from which genomes, genes and functions can be selected for inclusion into the ‘Genome Cart’, ‘Gene Cart’ and ‘Function Cart’, respectively. These carts have been extended in order to facilitate the composition of analysis tools in IMG. Thus, genes selected in ‘Gene Cart’ can be added directly to ‘Function Cart’ via their associated functions, such as COG, Pfam and TIGRfam. In a similar manner, functions selected in ‘Function Cart’ can be added directly to ‘Gene Cart’ via the genes associated with the selected functions, where the genes included into the ‘Gene Cart’ can be restricted to specific genomes.
Comparative analysis tools
Genomes can be compared in terms of gene content using the ‘Phylogenetic Profiler’ and ‘Phylogenetic Profiler for Gene Cassettes’ tools. The ‘Phylogenetic Profiler’ allows users to identify genes in a query genome in terms of presence or absence of homologues in other genomes. The ‘Phylogenetic Profiler for Gene Cassettes’ allows users to find genes that are part of a gene cassette in a query genome as well as part of related (conserved part of) gene cassettes in other genomes, whereby the result of such a search includes groups of collocated genes in each chromosomal cassette in the query genome that satisfy the search condition. More details on context analysis based on IMG’s gene cassettes can be found in (22).
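The presence/absence selection performed by a phylogenetic profiler can be illustrated with set operations. The following Python sketch assumes that homologue relationships have already been computed and are available as a mapping from each query gene to the set of genomes containing a homologue; the function name, the data layout and the identifiers are hypothetical and do not describe IMG's internals.

```python
def phylogenetic_profile(homologue_genomes, present_in, absent_from):
    """Return query genes with homologues in every genome of `present_in`
    and in no genome of `absent_from`.

    homologue_genomes: dict mapping a query gene id to the collection of
    genome ids in which that gene has a homologue (assumed precomputed).
    """
    required = set(present_in)
    excluded = set(absent_from)
    selected = []
    for gene, genomes in homologue_genomes.items():
        genomes = set(genomes)
        # Keep the gene if all required genomes are hit and no excluded one is.
        if required <= genomes and not (excluded & genomes):
            selected.append(gene)
    return sorted(selected)


if __name__ == "__main__":
    hits = {
        "geneA": {"genome1", "genome2"},
        "geneB": {"genome1"},
        "geneC": {"genome1", "genome3"},
    }
    print(phylogenetic_profile(hits, present_in={"genome1"}, absent_from={"genome3"}))
```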
Genomes can be compared in terms of functional capabilities using the ‘Abundance Profile Overview’ and ‘Function Profile’ tools. The ‘Abundance Profile Overview’ allows users to compare the relative abundance of protein families (COGs, Pfams, TIGRfams) and functional families (enzymes) across selected genomes, whereby the results are displayed either as a heat map or a matrix, with the cells in the heat map and matrix linked to the list of genes assigned to a particular family in a genome. The ‘Function Profile’ is a selective version of the ‘Abundance Profile Overview’, with functions of interest first selected with the ‘Function Cart’.
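The matrix view of such a profile can be sketched as follows. This Python example assumes gene-to-family assignments are given as (genome, gene, family) tuples and reports raw gene counts per cell; any normalization IMG applies to obtain relative abundance is not specified in the text and is not modelled here. All names and identifiers are illustrative.

```python
from collections import defaultdict


def abundance_profile(assignments):
    """Group genes by (family, genome) cell.

    assignments: iterable of (genome_id, gene_id, family_id) tuples.
    Each cell keeps its gene list, mirroring how a cell links to the
    genes assigned to a family in a genome.
    """
    cells = defaultdict(list)
    for genome, gene, family in assignments:
        cells[(family, genome)].append(gene)
    return dict(cells)


def as_matrix(cells, families, genomes):
    """Rows are families, columns are genomes, values are gene counts."""
    return [[len(cells.get((family, genome), [])) for genome in genomes]
            for family in families]


if __name__ == "__main__":
    assignments = [
        ("genome1", "gene_a", "COG0001"),
        ("genome1", "gene_b", "COG0001"),
        ("genome2", "gene_c", "COG0001"),
        ("genome2", "gene_d", "pfam00001"),
    ]
    cells = abundance_profile(assignments)
    for row in as_matrix(cells, ["COG0001", "pfam00001"], ["genome1", "genome2"]):
        print(row)
```

Restricting `families` to a list taken from a function cart gives the selective behaviour described for the ‘Function Profile’.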
The metabolic capabilities of genomes can be compared using the ‘Abundance Profile Overview’ and ‘Function Profile’ tools applied to enzymes involved in a pathway of interest. Alternatively, the metabolic capabilities of genomes can be compared in the context of KEGG pathways, as illustrated in Figure 4. Once a pathway is selected from the list of KEGG pathways via the KEGG option of the ‘Find Functions’ menu, as shown in Figure 4(i), the ‘KEGG Pathway Details’ lists the associated enzymes of KO terms, as illustrated in Figure 4(ii). Genomes for comparison are selected from a phylogenetically organized list, with the comparison result displayed on the KEGG pathway map, as illustrated in Figure 4(iii). Each enzyme number on the map is coloured depending on the percentage of genomes with a gene associated with that enzyme, whereby the tooltip for a coloured enzyme displays the number of these genomes.
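The value behind each coloured enzyme can be computed as a simple proportion. The Python sketch below assumes that the enzymes annotated in each selected genome are available as sets of EC numbers; the function name, the data layout and the example genome contents are illustrative, and the mapping from percentage to colour is not specified in the text and is left out.

```python
def enzyme_genome_percentages(enzymes_by_genome, pathway_enzymes):
    """For each pathway enzyme, return (genome count, percentage of genomes)
    having at least one gene associated with that enzyme.

    enzymes_by_genome: dict mapping a genome id to the set of EC numbers
    annotated in that genome (assumed precomputed).
    """
    genome_total = len(enzymes_by_genome)
    result = {}
    for enzyme in pathway_enzymes:
        count = sum(1 for enzymes in enzymes_by_genome.values() if enzyme in enzymes)
        percentage = 100.0 * count / genome_total if genome_total else 0.0
        result[enzyme] = (count, percentage)
    return result


if __name__ == "__main__":
    selected = {
        "genome1": {"EC:2.7.1.1", "EC:5.3.1.9"},
        "genome2": {"EC:2.7.1.1"},
    }
    pathway = ["EC:2.7.1.1", "EC:5.3.1.9", "EC:2.7.1.11"]
    for enzyme, (count, pct) in enzyme_genome_percentages(selected, pathway).items():
        print(enzyme, count, pct)
```

The count in each pair corresponds to the number shown in the tooltip, and the percentage to the quantity that drives the colouring.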
Figure 4. Comparative analysis tools. (i) A pathway is selected from the list of KEGG pathways via the KEGG option of the ‘Find Functions’ menu, and subsequently (ii) the ‘KEGG Pathway Details’ lists its associated enzymes.
Genomes can be compared using two open source graphical viewers, ‘Phylogenetic Distance Tree’ and ‘Radial Phylogenetic Tree’, available under the ‘Compare Genomes’ main menu, as illustrated in Figure 4(iv). For both tools, genomes are selected for comparison from a list of genomes similar to that shown in Figure 4(ii). The ‘Phylogenetic Distance Tree’ computes the phylogenetic distance between genomes selected for comparison based on the 16S alignment derived from the SILVA database (29). For genes whose sequence is not included in the alignment, the closest match is used, provided that its identity to the 16S gene of the IMG taxon is >97%. The distance tree is displayed using the Archaeopteryx tool (http://www.phylosoft.org/archaeopteryx/), which uses phyloXML for data exchange (30). Each node in the tree is hyperlinked to the IMG genome page for that node.
The ‘Radial Phylogenetic Tree’ tool, originally developed for MG-RAST (31), allows users to compare the BLAST hits of the genes of up to five user-selected genomes to the genes of all the genomes in the database using a colour-coded hierarchical circular tree viewer. This viewer displays the BLAST hits at different taxonomic levels, with more statistics for the hits for each genome provided by hovering the mouse over the nodes of the tree.
Genomes can be compared in terms of sequence conservation using VISTA tools (32), the Artemis comparison tool (33) and a ‘Dotplot’ tool which employs the program ‘MUMmer’ to generate dotplot diagrams between two genomes.
In addition to the analysis tools available in IMG, IMG/ER provides tools for identifying and correcting annotation anomalies, such as dubious protein product names, and for filling annotation gaps detected using IMG’s comparative analysis tools, such as genes that may have been missed by gene prediction tools or genes without predicted functions (24). Gene annotations that result from expert review and curation are captured in IMG/ER as so-called ‘MyIMG’ annotations associated with individual scientist or group accounts, with curated genomes included into GenBank either as new submissions or as revisions of previously submitted data sets.