The MaGe web interface consists of numerous dynamic web pages containing textual and graphical representations for accessing and querying data (Supplementary Figure 3). A specific effort has been made in terms of graphical representations of available analysis results, to make the manual expert annotation easier and more efficient.
Genome browser and synteny maps
MaGe's main innovative functionality is a cartographic gene context exploration of the studied genome compared against all the available microbial genomes. This comparative genomics environment provides quality checks for both the automatic annotations and manual analysis. In , the first graphic map (genome browser) contains the complete Acinetobacter baylyi ADP1 chromosome, over which the user can navigate with complete freedom (moving and zooming functionalities). The predicted coding genes are drawn, on the six reading frames, in red rectangles together with the coding prediction curves which are computed with the selected gene model.
The two following maps are representations of the synteny results (): each line shows the similarity results between the genome being annotated (here, Acinetobacter ADP1) and a given genome (e.g. the first three lines of the second synteny map compare it with three Pseudomonas species). The first synteny map is a selection of the hundred curated genomes in PkGDB to date (see ‘The relational database’), and the second one is a selection of the 280 complete prokaryotic proteomes available in public databanks. On these maps, a rectangle flags the existence of a gene in a compared organism which is similar to the opposite gene in the annotated genome. If, for several co-localized CDSs on the annotated genome, there are several co-localized homologs on the compared genome, the rectangles will all be of the same color; otherwise, the rectangle is white. A group of rectangles of the same color thus indicates a synteny group. This graphical representation allows the user to quickly see whether the part of the genome being annotated shares similarities and locally conserved organization with the selected bacterial sequences (‘Options’ functionality). As shown in , this is the case with Acinetobacter baumannii, and the two selected Pseudomonas species, with the P.aeruginosa genome sharing the largest number of synteny groups in this part of the ADP1 genome.
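The coloring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by MaGe: genes are represented by their rank along each chromosome, two genes count as co-localized when their ranks differ by at most `max_gap` (a hypothetical parameter), and hits are grouped by simple single linkage.

```python
from collections import defaultdict


def synteny_groups(hits, max_gap=5):
    """Group homology hits into synteny groups.

    hits    : iterable of (annotated_index, compared_index) pairs, each index
              being the rank of a gene along its own chromosome.
    max_gap : hypothetical tolerance (in genes) for calling two genes
              co-localized; this value is not taken from the text.

    Returns (groups, isolated). Each group is a list of hits that would share
    one color; isolated hits would be drawn as white rectangles.
    """
    hits = sorted(set(hits))
    parent = list(range(len(hits)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(hits)):
        for j in range(i + 1, len(hits)):
            (qi, ti), (qj, tj) = hits[i], hits[j]
            # Link two hits when they are close on BOTH genomes.
            if abs(qi - qj) <= max_gap and abs(ti - tj) <= max_gap:
                parent[find(i)] = find(j)

    components = defaultdict(list)
    for i, hit in enumerate(hits):
        components[find(i)].append(hit)

    groups = [c for c in components.values() if len(c) > 1]
    isolated = [c[0] for c in components.values() if len(c) == 1]
    return groups, isolated
```

Called on the hits `(10, 100)`, `(11, 101)`, `(12, 103)` and `(50, 7)`, the sketch places the first three in one group and reports the last one as isolated.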
In contrast with the genome browser, there is no notion of scale on the synteny maps: to see how homologous genes are organized in a synteny group, the user can simply click on one gene in a given synteny group. For example, by clicking on one rectangle of the green synteny group between P.aeruginosa and Acinetobacter ADP1, both corresponding genome regions of the compared organisms are shown and orthologs are linked, allowing the user to explore fusion/fission, duplication, inversion and insertion/deletion of genes (). In our example, one interesting rearrangement appears clearly: the two P.aeruginosa homologs of the ADP1 CDS named ACIAD1137 are co-localized and transcribed on the same strand, showing that the corresponding biological functions (i.e. ribonuclease H and epsilon subunit of the DNA polymerase III) have been fused in the genome of Acinetobacter ADP1. The graphical representation of the synteny maps is itself also useful for detecting this kind of feature: on each line, a rectangle has the same size as the corresponding annotated CDS in the studied genome. In addition, rectangles are colored depending on the part of the protein which aligns with the corresponding ADP1 protein (). It then becomes easy to see that ACIAD1137 always has two homologous genes in the selected compared genomes (except in A.baumannii). However, the corresponding ADP1 protein aligns only on its N-terminal part with the first corresponding gene (rnhA), and on its C-terminal part with the second corresponding gene (dnaQ). Finally, these two homologous genes are involved in a synteny group containing eight genes in Pseudomonas species, six genes in Ralstonia solanacearum, three genes in E.coli, Pseudoalteromonas haloplanktis and Xanthomonas axonopodis, and only two genes in Shewanella oneidensis. In these last four bacteria, the dnaQ and rnhA genes are transcribed anti-clockwise, and in R.solanacearum the dnaQ gene is not co-localized with the rnhA gene (white rectangle). This raises interesting evolutionary questions concerning the fusion of these two biological functions involved in DNA replication.
Just below the three maps, several functionalities are available, such as the exploration of synteny results or annotated data using keywords (‘Explore’), the search for similarities using blast functionalities (26), or for patterns in DNA or protein sequences (‘Search’). At any time the user can download data in different common file formats (FASTA, EMBL, GenBank, etc.) or extract part of the DNA sequence (‘Export Data’). He/she can work with Artemis software (8), which is very useful for modifying erroneous start codon positions, for example, or explore KEGG (45), BioCyc (47) or PHT (50) metabolic pathways with MaGe annotations as input (see ‘Metabolic pathway reconstruction’ and ‘Metabolic pathway visualization’).
Automatic versus manual annotation
In spite of the continuous improvement in the overall quality of bioinformatic methods, some difficulties in gene functional assignment can hardly be addressed in a completely automatic way. Most notably, the problem of error propagation in databases (60), which is today very strong in the context of common ‘industrial’ production of genome data, can only be solved with human intervention. Thus, the set of automatic annotations produced by any system should be considered only as a useful first approximation.
In MaGe, automatic annotation is always available in the gene editor (‘Automatic annotation’, Supplementary Figure 4). This information is updated each time a new version of the complete genome sequence becomes available. The quality of the annotation data can be improved in the ‘Gene Validation’ section of the gene editor, which allows the user to modify, delete and add information. Annotation homogenization is achieved via a procedure which is automatically launched when gene annotations are saved in the database. This allows for a minimal check of annotation coherence. For instance, the ‘ProductType’ field must be equal to enzyme if an EC number is given (Supplementary Figure 4). A further advantage of MaGe's manual annotation system is that it enables a group of users, possibly at different locations, to easily co-operate on specific annotations: email addresses of either the last annotator (in the gene editor) or all the different annotators for a specific gene (in the ‘History’ functionality, data not shown) are available. To help the user in the manual annotation of a gene, a summary of the available method results is displayed in a completely customizable list (Supplementary Figure 4). This part of the gene editor is essentially a workbench for curation and analysis of a single gene or its protein family. It contains information on gene prediction (AMIGene) and duplication results, similarity results against annotation data from reference genomes, Swiss-Prot curated annotations and the TrEMBL databank, and synteny results using PkGDB curated proteomes and complete prokaryotic genomes stored in the NCBI RefSeq section (about 280 to date). These comprehensive synteny results are useful to update, if necessary, the list of currently selected genomes which are visualized in the synteny maps. Other tables include enzymatic function predictions (PRIAM results), similarity results against COG (COGnitor), protein domain databanks (InterProScan) and HAMAP families. Finally, clues on the probable protein localization are given by the SignalP and TMHMM results (Supplementary Figure 4). For each set of results, external links, if any, are provided (NiceProt, NiceEnzyme, InterPro and COG databases, HAMAP families). In addition, direct interaction with PubMed (only if the field ‘PubMedID’ is filled), and with KEGG (external link) or BioCyc (internal link) metabolic pathway(s), is available. This integrative strategy allows annotators to quickly browse functional evidence, tracking the history of a function and checking the gene context conservation with an orthologous gene having an experimentally demonstrated biological function.
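The coherence rule quoted above (‘ProductType’ must be enzyme when an EC number is given) lends itself to a very small check run at save time. The sketch below is hypothetical: only the ‘ProductType’ field name comes from the text, while the `ECnumber` key, the dictionary layout and the function name are assumptions made for illustration.

```python
def check_annotation(annotation):
    """Return a list of coherence problems for one gene annotation.

    annotation : dict of field name -> value. 'ProductType' is the field
                 named in the text; 'ECnumber' is an assumed key name.
    """
    problems = []
    ec_number = (annotation.get("ECnumber") or "").strip()
    product_type = (annotation.get("ProductType") or "").strip().lower()

    # Rule from the text: an EC number implies that the product is an enzyme.
    if ec_number and product_type != "enzyme":
        problems.append(
            "ProductType must be 'enzyme' when an EC number is given "
            f"(found {product_type or 'nothing'!r})"
        )
    return problems
```

An empty list means the annotation passes this single rule; a real homogenization procedure would presumably chain several such checks.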
Metabolic pathway visualization
Using MaGe, metabolic pathway exploration is accessible through three different tools: KEGG, BioCyc and PHT. Starting from the set of predicted and/or validated EC numbers, metabolic maps are dynamically drawn via a request to the KEGG web server. A color-based code enables comparison of the studied organism enzyme content with a selected related organism, with enzymes encoded by genes localized on the current MaGe genome region highlighted in yellow (). The useful representation of KEGG interconnected metabolic pathways is supplemented by the organism-specific PGDB built with the BioCyc system and an access to a PHT web form (see ‘Metabolic pathway reconstruction’).
Exploration of metabolic pathways can be enhanced through gene context analysis. For example, in the case of lysine biosynthesis, three alternative routes are described in the literature: the succinylase, dehydrogenase and acetylase branches (61). During the study of the Frankia alni genome, MaGe annotations combined with the FrankiaCyc PGDB revealed only one possible pathway involving the succinylase branch (). All of the genes coding for the enzymes of this pathway (ask, asd, dapA, dapB, dapD, dapE, dapF and lysA) have been found, except for the dapC gene which encodes a succinyldiaminopimelate aminotransferase activity. In E.coli, the dapC gene does not exist, but the ArgD protein possesses both an acetylornithine and a succinyldiaminopimelate aminotransferase activity for arginine and lysine biosynthesis, respectively (62). In F.alni, the argD gene has been identified and its presence could explain the absence of dapC. However, studying the F.alni genomic context of the genes involved in lysine biosynthesis, we found a gene (FRAAL6125) described as a putative aminotransferase. This gene is co-localized with the characterized dapE and dapD genes which encode two of the three steps of the succinylase branch (). In addition, the corresponding KEGG map reveals the apparent lack of DapC activity and a co-localization of the dapE and dapD genes (). Furthermore, the synteny results among thirty organisms show a chromosomal conservation of this three-gene organization. All this evidence leads us to assume that FRAAL6125 is a good candidate for dapC. This assumption was confirmed by sequence comparison with experimentally demonstrated dapC genes in Corynebacterium glutamicum (63) and Bordetella pertussis (64) (52 and 32% amino acid identity, respectively). In contrast to the dapC homolog in other organisms, in F.alni the protein encoded by FRAAL6125 possesses an additional C-terminal domain of unknown function which is characterized by a glutamine- and glycine-rich content. This is shown, in , by the uncolored part of the rectangles in the synteny maps corresponding to the dapC homologs in the selected organisms. Two other strains of the Frankia genus (Cci3 and EAN1pec), sequenced by the United States Department of Energy, show a similar genomic organization of the dapCDE gene cluster, but only strain EAN1pec possesses this C-terminal domain (first synteny map in ). This Frankia-specific C-terminal domain of DapC calls for more experimental investigation. This example shows that MaGe integration of gene context methods is a powerful tool for experts in metabolic analysis.
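The first step of the reasoning above, spotting a pathway step with no annotated gene, amounts to a set comparison. The following sketch is illustrative only: the function name is hypothetical, and the ordering of the gene list (dapC placed between dapD and dapE) follows the usual description of the succinylase branch rather than anything stated in the text.

```python
# Genes of the succinylase branch of lysine biosynthesis, as named in the text.
SUCCINYLASE_BRANCH = [
    "ask", "asd", "dapA", "dapB", "dapD", "dapC", "dapE", "dapF", "lysA",
]


def missing_steps(pathway_genes, annotated_genes):
    """Return pathway genes that have no annotated counterpart, in pathway order."""
    annotated = set(annotated_genes)
    return [gene for gene in pathway_genes if gene not in annotated]


# Genes reported as found in F.alni in the text (everything except dapC).
found_in_f_alni = ["ask", "asd", "dapA", "dapB", "dapD", "dapE", "dapF", "lysA"]
gaps = missing_steps(SUCCINYLASE_BRANCH, found_in_f_alni)
```

Here `gaps` contains only `dapC`, which is the pathway hole that the gene context analysis then helps to fill with a candidate such as FRAAL6125.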
Data exploration
Although the notion of multigenome comparisons is omnipresent in the graphical interface of our system, the exploration functionality developed in MaGe is linked to the genome being selected for expert annotation only (‘Display organism’ in the ‘Options’ functionality). A simple keyword search enables the user to quickly retrieve genes of the annotated genome having a particular function. Several sets of data can be queried, such as automatic and validated annotations (expert work), or a specific set of annotated CDSs corresponding, for example, to conserved hypothetical proteins which are in synteny with other organisms. In addition, each kind of computed result (PRIAM, InterPro, blast similarities in reference genome annotation data, and in Swiss-Prot or TrEMBL databanks) can be retrieved. The result output is a list of candidate genes, the genomic contexts of which can be easily visualized (automatic displacement of the genome browser centered on a gene of interest).
In a second section, called ‘PhyloProfile and Synteny’ (Supplementary Figure 5), the user can search for genes of the studied organism which are homologs of genes in certain organisms and exclude those that are homologs of genes in other organisms. The phylogenetic profile method is designed to infer functional relationships between genes: proteins involved in the same biological process are likely to evolve in a correlated fashion (15). This method, combined with the integration of synteny results, allows one to detect a coevolution of gene groups which have a similar chromosomal organization. Integration of chromosomal proximity and gene content information has been reported to be more accurate than the single-gene phylogenetic profiles (65).
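The include/exclude query described above can be expressed as a simple set filter. The sketch below is a minimal illustration, assuming that homology results have already been reduced to a mapping from each gene to the set of organisms in which it has a homolog; the function name, data layout and example labels are hypothetical.

```python
def phylo_profile_search(homologs, include, exclude):
    """Select genes by phylogenetic profile.

    homologs : dict mapping gene id -> set of organisms with a homolog
               (assumed input format).
    include  : organisms in which a homolog must be present.
    exclude  : organisms in which no homolog may be present.
    """
    include, exclude = set(include), set(exclude)
    return sorted(
        gene
        for gene, organisms in homologs.items()
        if include <= organisms and not (exclude & organisms)
    )


# Made-up profile table, for illustration only.
example_profiles = {
    "geneA": {"org1", "org2"},
    "geneB": {"org1", "org2", "org3"},
    "geneC": {"org2"},
}
selected = phylo_profile_search(example_profiles, include=["org1", "org2"], exclude=["org3"])
```

With this toy table, only `geneA` has homologs in both `org1` and `org2` and none in `org3`. Restricting the result further to genes that also belong to a synteny group would combine the profile with the chromosomal proximity information mentioned in the text.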
Using the synteny results stored in our database (see ‘The relational database’), the fusion/fission events can easily be computed. Our procedure detects synteny groups having two genes from a compared genome corresponding to a single annotated CDS in the target genome (). BlastP correspondences are evaluated to exclude the detection of tandem duplications by keeping only non-overlapping side-by-side alignments. These events are listed in the ‘Fusion/Fission’ item of the ‘Explore’ functionality (Supplementary Figure 5) and split into two tables: one containing the list of putative fused genes, and the other for fission events. Annotators can then browse results by checking for possible pseudogenes or for true functional evidence leading to the annotation of a multifunctional protein (see above, the case of rnhA and dnaQ gene fusion in Acinetobacter ADP1).
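The fusion detection described above can be approximated as follows. This is a sketch under stated assumptions, not the procedure implemented in MaGe: each BlastP correspondence is reduced to the alignment coordinates on the annotated protein plus the rank of the compared gene on its chromosome, and the `Hit` layout, the thresholds and the example labels are all hypothetical.

```python
from collections import defaultdict, namedtuple
from itertools import combinations

# query: annotated CDS; target: gene in the compared genome;
# target_index: rank of the target gene on its chromosome;
# q_start/q_end: alignment coordinates on the annotated protein.
Hit = namedtuple("Hit", "query target target_index q_start q_end")


def putative_fusions(hits, max_overlap=0, max_distance=1):
    """List (query, target_1, target_2) triples suggesting a gene fusion.

    Two compared genes are reported when they are neighbours on the compared
    genome and align side by side, without overlap, on one annotated CDS.
    Overlapping alignments (e.g. tandem duplicates) are discarded.
    """
    by_query = defaultdict(list)
    for hit in hits:
        by_query[hit.query].append(hit)

    fusions = []
    for query, query_hits in by_query.items():
        ordered = sorted(query_hits, key=lambda h: h.q_start)
        for a, b in combinations(ordered, 2):
            if a.target == b.target:
                continue
            neighbours = abs(a.target_index - b.target_index) <= max_distance
            # Number of query residues covered by both alignments (<= 0 if none).
            overlap = min(a.q_end, b.q_end) - max(a.q_start, b.q_start) + 1
            if neighbours and overlap <= max_overlap:
                fusions.append((query, a.target, b.target))
    return fusions


# Made-up hits: cdsX looks fused, cdsY matches a tandem duplication,
# cdsZ matches two genes that are far apart on the compared genome.
example_hits = [
    Hit("cdsX", "geneA", 40, 1, 150),
    Hit("cdsX", "geneB", 41, 160, 400),
    Hit("cdsY", "geneC", 70, 1, 300),
    Hit("cdsY", "geneD", 71, 5, 310),
    Hit("cdsZ", "geneE", 10, 1, 100),
    Hit("cdsZ", "geneF", 90, 120, 250),
]
```

Fission events could be searched with the same function by swapping the roles of the annotated and compared genomes.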
In a fourth section of the ‘Explore’ functionality, regions that are specific to the genome under analysis, relative to a set of genomes selected for their phylogenetic proximity, can be browsed (Supplementary Figure 5). Data are represented in a table listing gene clusters that have no correspondences in one or more compared organisms. One application of this comparative genomic analysis is the detection of genomic islands. A comparative study between two A.baumannii strains, AYE (a multi-drug resistant strain) and SDF (a fully susceptible one), led us to decipher an 86 kb AYE-specific region where more than 40 resistance genes are clustered (P.-E. Fournier et al., manuscript in preparation).
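The table of gene clusters without correspondences can be pictured as runs of consecutive genes lacking a homolog in the compared genomes. The sketch below illustrates that idea only: the function name, the input format and the `min_genes` threshold are assumptions, and the actual criteria used in MaGe are not given in the text.

```python
def specific_regions(ordered_genes, genes_with_correspondence, min_genes=2):
    """Return runs of consecutive genes with no correspondence.

    ordered_genes             : gene ids in chromosomal order.
    genes_with_correspondence : ids of genes that have a homolog in the
                                compared genome(s).
    min_genes                 : hypothetical minimum run length to report.
    """
    conserved = set(genes_with_correspondence)
    regions, current = [], []
    for gene in ordered_genes:
        if gene in conserved:
            # A conserved gene closes the current run of specific genes.
            if len(current) >= min_genes:
                regions.append(current)
            current = []
        else:
            current.append(gene)
    if len(current) >= min_genes:
        regions.append(current)
    return regions


# Made-up gene order, for illustration only.
example_genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
example_regions = specific_regions(example_genes, {"g1", "g2", "g6"})
```

On this toy input the function reports two regions, `g3` to `g5` and `g7` to `g8`; a long run of this kind is what would flag a candidate genomic island.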