The analysis of DNA composition and codon usage reveals many factors that influence the evolution of genes and genomes. In this chapter, we show how to use CodonExplorer, a web tool and interactive database that contains millions of genes, to better understand the principles governing evolution at the single gene and whole-genome level. We present principles and practical procedures for using analyses of GC content and codon usage frequency to identify highly expressed or horizontally transferred genes and to study the relative contribution of different types of mutation to gene and genome composition. CodonExplorer’s combination of a user-friendly web interface and a comprehensive genomic database makes these diverse analyses fast and straightforward to perform. CodonExplorer is thus a powerful tool that facilitates and automates a wide range of compositional analyses.
Different organisms use codons for the same amino acid at different frequencies and this non-randomness can be exploited for a variety of analyses of both theoretical and practical importance. Soon after the genetic code was first elucidated (1, 2) it was clear that it was “degenerate” – not in the sense of decaying from a more elaborate form, but in the physicist’s sense of having multiple equivalent states. In this case, because there are 64 codons but only 20 amino acids that are universal to all life, some amino acids must be encoded by more than one codon (exceptions to this rule, such as selenocysteine and pyrrolysine, are important but lineage-specific). We might expect these equivalent codons to be used at random, but early studies of nucleotide frequencies (3) suggested that different organisms have extreme biases toward certain synonyms. The availability of the first DNA sequences for protein-coding genes (4–6) confirmed that codon usage is in fact highly non-random and differs dramatically among organisms.
Why should different organisms use codons that have the same meaning with such different frequencies? An innovative analysis of highly-expressed genes by Ikemura et al. showed that genes that are known to be required at high levels, such as ribosomal proteins, have substantially different codon usage from the average gene in the same genome (7). This suggested a strong effect of translation bias on codon usage because the tRNA pool in a given organism would recognize different codons with different efficiencies. For example, a G at the first (wobble) position of the anticodon is often used to recognize either C or U at the third position of the codon in so-called “wobble pairing” (8, 9), but the efficiency of these two pairings can differ substantially. Consequently, highly-expressed genes would be expected to be composed of codons that match the tRNA anticodons that are expressed at the highest levels and/or have the highest copy numbers; this effect has been confirmed in bacteria (10, 11), yeast (12), Drosophila (13), and many other species. Thus, selection for translational efficiency has been shown to be one important influence on codon usage bias.
Many different measures of the codon usage bias in a given gene have been developed. One of the most popular is Sharp and Li’s codon adaptation index (CAI) (14), which indicates how closely the codon usage of each gene matches the codon usage of a set of known highly expressed genes. The relationship between codon usage and expression has proven useful for optimizing the expression of synthetic genes transferred into a host (reviewed in (15); see also Note 1 on the “Synthetic Gene Designer” website).
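As a rough illustration of how a CAI-style score is computed, the sketch below follows the general definition of Sharp and Li: each codon receives a relative adaptiveness weight (its count in the reference set divided by the count of the most used synonymous codon), and the CAI of a gene is the geometric mean of the weights of its codons. The function names, the use of the standard genetic code, and the 0.5 pseudocount for unobserved codons are assumptions made for this example; it is not CodonExplorer's own implementation and does not reproduce any of the CAI variants offered there.

```python
import math
from collections import Counter

# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def codons(seq):
    """Split an in-frame coding sequence into complete, recognized codons."""
    seq = seq.upper().replace("U", "T")
    triplets = (seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    return [c for c in triplets if c in CODON_TABLE]


def relative_adaptiveness(reference_genes):
    """Weight each codon by its count relative to the best synonymous codon."""
    counts = Counter(c for gene in reference_genes for c in codons(gene))
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    weights = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # stop codons, Met, and Trp carry no synonymous choice
        best = max(counts[c] for c in family)
        if best == 0:
            continue  # amino acid absent from the reference set
        for c in family:
            # Assumption: unobserved codons get a pseudocount of 0.5.
            weights[c] = max(counts[c], 0.5) / best
    return weights


def cai(gene, weights):
    """Geometric mean of the weights of the scorable codons in a gene."""
    logs = [math.log(weights[c]) for c in codons(gene) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")
```

A gene built only from the most frequent codons of the reference set scores 1.0, and the score falls as rarer synonyms are used.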
Selection for translational efficiency is not, however, the only influence on codon usage; the overall composition of the genome also has a strong effect on which codons are chosen. For example, Muto and Osawa were able to show surprisingly strong, linear trends relating the GC content of the whole genome to the GC content at the first, second, and third codon positions in a diverse assortment of 11 organisms (16). Knight et al. confirmed that these trends held true in hundreds of different organisms across all three domains of life and demonstrated that the codon usage of a given organism could be predicted with considerable accuracy simply from a knowledge of these overall trends and the GC content of the individual organisms (17). Consequently, both selection and mutation play a large role in structuring the codon usage of genes and genomes. In fact, multivariate analyses of codon usages typically identify expression and GC content of individual genes as the principal factors structuring composition within a genome (18).
Because codon usage is strongly influenced by mutational processes, codon usage can provide insight into the different mutational processes operating in different genomes (19–21), or in different parts of the same genome where regions of compositional heterogeneity such as isochores exist (22, 23). The ratio of different kinds of codons can provide insights into whether deamination or oxidation contributes more to the pattern of codon usage in a specific organism through techniques such as the fingerprint plot and the PR2 bias plot (24–26).
Finally, codon usage and other compositional information can be used to detect horizontal gene transfer (HGT), the movement of genes between different genomes (27). Because different organisms differ in the composition of their genes, HGT can be detected, if sufficiently recent (28), by looking for genes of unusual composition (29, 30). However, some caution must be exercised in this approach (see also Note 2), since highly expressed genes can also show codon bias, and gradients in composition can appear along a genome due to replication-coupled biases (31, 32).
CodonExplorer (available at http://bmf.colorado.edu/codonexplorer/) provides a platform for conducting many diverse analyses of codon usage and nucleotide composition in sequenced genes or genomes. One feature of CodonExplorer is the ease with which millions of sequenced genes and hundreds of genomes can be searched, grouped, and analyzed. By pre-computing statistics from whole-genome databases, using hundreds of hours of CPU time, CodonExplorer is able to provide graphical summaries of vast, complex datasets very rapidly.
The first step in performing sequence composition analysis using CodonExplorer is to gather the sequences on which the analysis will be performed. CodonExplorer incorporates multiple convenient methods for selecting nucleotide sequence data.
Genes can be selected using several different search methods, including by KEGG gene IDs (see Note 3), KEGG orthology (KO) groups, or enzyme commission (EC) ID (see Note 4). All genes in one or more genomes (specified by KEGG genome ID or using the NCBI taxonomy) can be conveniently gathered, and a gene screen allows selection of highly expressed genes or putatively horizontally transferred genes within a genome. Finally, many genomic and pathogenicity islands (regions of putatively transferred genes) from the literature have been included and can be easily selected.
Once sets of genes have been selected, they can be sorted, grouped, and edited manually to remove undesired sequences or further refine the set. The KEGG gene IDs for the final set of sequences can be downloaded to allow easy reproduction of the analysis.
To gather all genes in one or more genomes, a list of KEGG genome IDs (see Note 6) may be entered and the genes from each genome will be grouped and sorted according to the selected option (see grouping and sorting data, below).
To gather a list of genes by KEGG ID or gene name, enter the IDs or gene names in the appropriate fields. If search terms are entered in multiple fields, only genes in the intersection of terms are returned (i.e., the terms are combined with an “AND” operator).
Searching for KEGG genome “STM” and KEGG gene name “rpsu” returns only the rpsU gene from Salmonella enterica serovar Typhimurium LT2; searching for KEGG genome “STM” will return every gene from Salmonella enterica serovar Typhimurium LT2.
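The “AND” behavior of multiple search fields can be pictured with a small filter over gene records. This is only a sketch: the record layout, the field names, and the case-insensitive matching are assumptions made for illustration, and the three toy records stand in for the real database.

```python
# Toy records; the field names "genome" and "name" are hypothetical.
GENES = [
    {"genome": "stm", "name": "rpsU"},
    {"genome": "stm", "name": "recA"},
    {"genome": "eco", "name": "rpsU"},
]


def search(records, **terms):
    """Return the records that match every supplied term (logical AND)."""
    return [
        record for record in records
        if all(record.get(field, "").lower() == value.lower()
               for field, value in terms.items())
    ]


# Two terms narrow the result to their intersection.
both_terms = search(GENES, genome="STM", name="rpsu")
# A single term returns every gene in that genome.
genome_only = search(GENES, genome="STM")
```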
When a precise set of gene IDs is not known in advance, CodonExplorer provides several convenient ways to select groups of genes sharing particular properties (Fig. 10.2).
To select homologs of recA, enter “K03553” into the KEGG ortholog or KEGG EC ID box, select “KEGG Ortholog” from the “Select type of ID” pull-down menu, and press the “Search for Genes” button at the bottom of the page.
CodonExplorer allows users to search for genes encoding enzymes that share a common activity, as defined by their KEGG EC ID. To select a group by KEGG EC ID, enter the ID or list of IDs into the “KEGG ortholog or KEGG EC ID” field and select “KEGG EC” from the “Select type of ID(s)” menu (see Note 10).
To search for α-amylases (EC3.2.1.1), enter EC3.2.1.1 into the “KEGG ortholog or KEGG EC ID” field and select “KEGG EC” from the “Select type of ID(s)” pull-down menu. Then press the “Search for Genes” button at the bottom of the page.
CodonExplorer allows users to select a range of genes within a genome. To do so, enter the KEGG gene ID for the first and last genes within the region in the “Enter KEGG gene range” field and select “Forward Only,” “Reverse Only,” or “Both Directions” from the “Gene region direction” pull-down menu (see Note 11).
To select genes in either orientation between STM1379 and STM1422 in the Salmonella enterica serovar Typhimurium genome (i.e., the Salmonella Pathogenicity Island-2 region), enter “STM1379” in the “Gene Start” field, “STM1422” in the “Gene End” field, “STM” in the KEGG genome ID field, select “Either direction” in the “Gene region direction” pull-down menu, and press the “Search for Genes” button at the bottom of the page.
Many regions of genes present in a particular genome but absent in closely related lineages have been described in the literature (33, 34). Frequently these regions, often called “genomic islands,” have unusual dinucleotide compositions or contain mobility genes such as integrases or transposases (34). Thus, it is suggested that genomic islands are the product of HGT. A subset of these genomic islands called “pathogenicity islands” also contains pathogenicity determinants, making them particularly interesting.
CodonExplorer allows users to select genomic islands that have been described in the literature by selecting a genomic island from the “Select genomic island” pull-down menu (Fig. 10.3). The genomic islands used in Hsiao et al.’s 2005 analysis (34) are included, as are many other islands collected manually from the literature (see also Note 12).
For codon usage or other compositional property analyses, it is often useful to explore variation in highly expressed genes (e.g., those encoding ribosomal proteins) or genes that may have entered the genome by HGT.
To select such a set of genes in CodonExplorer, enter the KEGG genome name in the “KEGG genome” field in the gene screen section, then select “Ribosomal + highly expressed KO groups,” “Putative Aliens,” or “Not putative aliens” from the pull-down menu (Fig. 10.3). The putative alien and non-alien sets are defined using a Markov model method described by Nakamura et al. (30). However, like any compositional method, the Nakamura method should not be considered definitive evidence of transfer. See also the discussion below about the benefits and limitations of compositional methods for detecting gene transfer.
By default, genes selected for analysis in CodonExplorer are grouped together by genome and sorted in descending order of the number of selected genes in each genome. However, it is sometimes useful to group and sort genes in different ways.
CodonExplorer allows genes to be grouped by KEGG genome name, KEGG ortholog ID, KEGG EC ID, or (when NCBI taxon names were entered in the search) NCBI Taxonomy name (Fig. 10.4). To change how genes are grouped, enter your search terms as described above, then select the appropriate option from the “Group matching genes by” pull-down menu before pressing the “Search for Genes” button.
CodonExplorer also allows groups of genes to be sorted by the number of genes in the group, the group ID, or the group description (Fig. 10.4). Select the desired option from the “Sort gene groups by” pull-down menu under the “Output options” heading.
After searching for genes on which to perform an analysis, determining how those genes should be sorted and grouped, and pressing the “Search for Genes” button at the bottom of the page, the genes will be displayed at the top of the page (Fig. 10.5) and sorted and grouped according to the selected options (see Note 13). From this screen it is possible to view, edit, or save gene groups.
Each gene group can be viewed by pressing the “View” link to the right of the group description. This will bring up a window listing the genome, KEGG gene ID, KEGG gene name, and a description for each gene (Fig. 10.6). Clicking the “description” link will bring up the KEGG description page for that gene.
To include or exclude one or more gene groups from the analyses, check or uncheck the check box to the left of each gene group’s ‘Group ID’.
The KEGG primary gene IDs for genes in each group can be saved locally as a text file by clicking the “Save Genes” link next to that group.
Once one or more groups of genes have been selected, CodonExplorer allows many compositional analyses to be performed easily. In general, one or more analyses are selected, after which pressing the “Generate selected graph(s)” will cause a new window to appear displaying the desired plots. There are dozens of different analyses available, including comparisons of GC content at different codon positions, measures of codon usage bias, amino acid usage, and a Markov model method for HGT detection (Fig. 10.7). Custom scatter plots can also be constructed plotting many of these properties against one another, or against other properties such as length or hydrophobicity. Finally, a Monte Carlo method allows determination of the statistical significance of apparent differences between the properties of selected groups of genes and the genomes that contain them. For ease of use, we have classified these diverse approaches into the broad categories of “Codon Usage,” “GC Content,” and “Custom Graphs.”
Fingerprint plots were introduced by Sueoka (20) to measure mutational biases within and between genomes. The y-axis represents the frequency of A at position 3 relative to the frequency of A and T at that position (A3/(A3+T3)), while the x-axis represents G3/(G3 + C3). Each circle represents a different amino acid, with the radius of the circle proportional to the relative frequency of that amino acid (19). There are several options when generating fingerprint plots. The Quartets option generates a codon fingerprint plot using the four codon blocks (e.g., the CCN block coding for Proline, where N stands for any of the four nucleotides A, C, G, or T). These blocks are Leucine4, Valine, Alanine, Threonine, Proline, Serine4, Glycine, and Arginine4 (the 4 refers to the four-codon block of an amino acid that has six codons in total). The Split Quartets option generates a codon fingerprint plot using each of the six split codon blocks (e.g., the GAN block coding for Aspartic or Glutamic acid). These blocks include Phenylalanine/Leucine2, Isoleucine/Methionine, Histidine/Glutamine, Asparagine/Lysine, Aspartic Acid/Glutamic Acid, and Serine/Arginine2. The Tyrosine/Stop and Cysteine/Tryptophan/Stop partial blocks are not included on this plot.
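The coordinates of a Quartets fingerprint plot can be sketched directly from the definitions above. For each four-codon block, the example below counts the third-position bases and returns G3/(G3 + C3) as the x value and A3/(A3 + T3) as the y value, together with the number of codons in the block, which could be used to scale the circle. The function name and the return format are assumptions for illustration, and blocks for which either ratio is undefined are simply skipped.

```python
from collections import Counter

# First two bases of each four-codon block named in the text.
QUARTETS = {"CT": "Leu4", "GT": "Val", "GC": "Ala", "AC": "Thr",
            "CC": "Pro", "TC": "Ser4", "GG": "Gly", "CG": "Arg4"}


def fingerprint(seq):
    """Map each quartet block to (G3/(G3+C3), A3/(A3+T3), codon count)."""
    seq = seq.upper().replace("U", "T")
    third = {name: Counter() for name in QUARTETS.values()}
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        name = QUARTETS.get(codon[:2])
        if name and codon[2] in "ACGT":
            third[name][codon[2]] += 1
    points = {}
    for name, n in third.items():
        at, gc = n["A"] + n["T"], n["G"] + n["C"]
        if at and gc:  # skip blocks where a ratio would divide by zero
            points[name] = (n["G"] / gc, n["A"] / at, sum(n.values()))
    return points
```

Concatenating the coding sequences of a gene group before calling the function gives one point per block for the whole group, in the spirit of a combined plot.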
Three graph types are available for codon fingerprint blocks in CodonExplorer. First, the Combined option plots all genes in the selected gene groups into a single fingerprint plot. This is useful for summarizing large data sets, such as whole genomes, DNA strands within a genome, or genomic islands. Second, the Each Gene option allows for a finer grained analysis by plotting each gene within the set of genes selected for analysis separately (but see Note 14). Finally, the Gradient option displays a combined fingerprint plot for genes within each 10% interval of GC content. Thus, depending on the number of genes selected, and the range of GC content represented by those genes, only some of the ten possible ranges of GC content will be displayed. An example of constructing and interpreting a gradient codon fingerprint plot is given in Section 3.
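The binning behind a gradient plot can be sketched as follows. Genes are assigned to 10% intervals of third-position GC content, and only intervals that contain at least one gene appear in the result, which mirrors why fewer than ten panels may be displayed. The function names are hypothetical, and the choice to place a gene with exactly 100% GC3 in the 90–100% interval is an assumption of this example.

```python
from collections import defaultdict


def gc3_counts(seq):
    """Return (number of G or C bases, total bases) at third codon positions."""
    third = seq.upper()[2::3]  # indices 2, 5, 8, ... of complete codons
    return sum(base in "GC" for base in third), len(third)


def gradient_bins(genes):
    """Group gene names by 10% interval of third-position GC content."""
    bins = defaultdict(list)
    for name, seq in genes.items():
        gc, total = gc3_counts(seq)
        if total == 0:
            continue  # no complete codon, so nothing to bin
        # Integer arithmetic avoids floating-point trouble at bin edges.
        decile = min(10 * gc // total, 9)
        bins[(10 * decile, 10 * decile + 10)].append(name)
    return dict(bins)
```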
As discussed above, GC content has an important influence on codon usage bias (Guy and Sueoka, manuscript in preparation). This influence can be illustrated by plotting CAI values against P3 GC content (see the examples section for some plots of CAI vs. P3 in organisms with very different genomic GC contents).
Chargaff’s second parity rule (PR2) states that within each DNA strand, the frequency of A ≈ T and the frequency of G ≈ C (35). Parity Rule bias plots were developed to examine deviations from PR2 (26). Selecting the PR2 bias plot option generates a set of PR2 bias plots for the selected gene set.
GC content at different codon positions is used to study mutation and selection pressures. Because GC content varies dramatically between but not within bacterial lineages (36), unusual GC content may indicate transfer. Thus unusual GC content has often been used as an indicator of gene transfer, although there are important limitations to this approach (see the example on HGT detection below). CodonExplorer allows the generation of several graphs that allow users to explore the GC content of genes and genomes.
This plot is useful for contrasting the effects of mutation and selection for amino acid usage (P1 and P2) versus codon usage (P3). The x-axis represents the GC content at the third position within each codon in the selected group of genes, and the y-axis represents the GC content at either the first or second position in each codon in the selected group of genes. (P1 and P2 GC contents are plotted as separate series.)
Plots the combined GC for nucleotides at the first or second position in each gene against the GC content for nucleotides at the third position in that gene. The dotted diagonal line represents the expectation if all positions were under identical mutation and selection pressure.
Plots the first, second, and third position GC content (as separate series) against the overall GC content of the gene. Thus each point represents the P1, P2, or P3 GC content of a gene, versus the total GC content of that gene.
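The quantities plotted in these graphs are simple to compute for a single gene. The sketch below returns the GC content at the first, second, and third codon positions along with the overall GC content; the function name and the dictionary keys are assumptions for illustration, and the input is assumed to be an in-frame coding sequence containing at least one complete codon.

```python
def gc_by_position(seq):
    """Return GC content at codon positions 1-3 and for the whole gene."""
    seq = seq.upper()
    seq = seq[:len(seq) - len(seq) % 3]  # drop any trailing partial codon
    result = {}
    for pos in range(3):
        bases = seq[pos::3]
        result[f"P{pos + 1}"] = sum(b in "GC" for b in bases) / len(bases)
    result["total"] = sum(b in "GC" for b in seq) / len(seq)
    return result
```

Plotting the P1 and P2 values of each gene against its P3 value, or all three against the total, gives the kinds of scatter plots described above.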
This plot allows the user to view a plot of the amino acid content of each gene within the group of selected genes against the P3 GC content of that gene. It should be noted that since this approach divides the data contained in each gene into 20 parts (by recording P3 information for usage of each amino acid for each gene), these plots may be subject to large amounts of noise when examining small datasets. (See Note 14 for a similar caution regarding gradient fingerprint plots).
In addition to allowing the generation of many predefined plots, CodonExplorer also allows users to construct custom scatter plots using a variety of compositional statistics.
Values that can be plotted against one another on either axis include the CAI (as well as several proposed variations on it), the third position GC content (P3), gene length, predicted peptide hydrophobicity, and the horizontal transfer index of Nakamura et al. (30) (Table 10.1).
Histograms can be constructed of CAI values, GC content at the third codon position (P3), gene length, or peptide hydrophobicity.
Often it is useful to report a table of codon usages for a single gene or a collection of genes, e.g., to identify frequently used codons (37). This can be done in CodonExplorer by selecting the “Codon Usage Table” option. The output is a table (Fig. 10.8) that lists the frequency of each codon per thousand codons in the dataset, followed by the absolute number of occurrences of the codon in the selected set of genes.
The total number of genes and codons in the data set are listed at the top of the table, and the overall GC content and the average GC content at positions 1, 2, and 3 are listed at the bottom of the table. This format is similar to that of the CUTG database (see also Note 1).
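A table of this kind can be approximated in a few lines. The sketch below pools the codons of all supplied genes and reports, for each codon observed, its frequency per thousand codons followed by its absolute count. The function name and the return format are assumptions for illustration, and the GC summary rows are omitted.

```python
from collections import Counter


def codon_usage_table(genes):
    """Map each observed codon to (frequency per thousand, absolute count)."""
    counts = Counter()
    for seq in genes:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: (1000 * n / total, n)
            for codon, n in sorted(counts.items())}
```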
It is often desirable to compare the properties of some subset of the genes in a genome to the properties of the genome as a whole. For example, one may wish to determine whether the CAI values for genes in a proposed genomic island are significantly lower than those for the genome as a whole. However, visually comparing histograms of CAIs for the island and for the genome can lead to a false inference of difference simply because a small set of genes will tend to have more variation in compositional properties than the genome as a whole. Thus it is useful to be able to test whether observed differences are significant or if they could be explained simply by the reduced size of the examined sub-sample of the genome.
CodonExplorer provides a Monte Carlo method for evaluating the significance of differences between a set of genes and the genome as a whole. The procedure involves generating many (10, 100, or 1,000, at the user’s option) random sets of a number of genes equal to those in the selected subset (chosen individually or as a block of contiguous genes). The mean value of the property is then recorded for each randomly chosen set of genes. Finally a figure is generated showing a histogram of the average values for each random set (Fig. 10.9). The blue circle (dark grey here) represents the mean value of the chosen property for these random sets. The red circle represents the average of the chosen property for the query set. Finally, the light red bars represent a histogram of the values of the chosen property in the query set. Comparing the location of the average of the query set (red circle) against the histogram of average values for equally sized random sets (blue bars) allows the user to determine if the set of genes that they selected varies from other random equally sized gene sets more than would be expected from chance. The caption at the top of the Monte Carlo histogram figure lists the number of random gene sets constructed, how many of those random gene sets had a higher value for the chosen property than the query set, and the probability (as assessed using the one-tailed t-test) that the query set average could be as different as it was from a normal distribution fit to the distribution of random set averages by chance.
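The resampling step of this procedure can be sketched as follows. Given one value per gene for the whole genome (for example CAI, in genome order) and the positions of the query genes, the example draws random sets of the same size, either as individual genes or as a contiguous block, and records the mean of each. The function name, the arguments, and the treatment of the genome as linear (blocks do not wrap around the end) are assumptions of this example; the normal fit and t-test reported in the figure caption are not reproduced.

```python
import random
from statistics import mean


def monte_carlo(genome_values, query_indices, replicates=100,
                contiguous=False, seed=0):
    """Compare a query set's mean with means of equally sized random sets."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    size = len(query_indices)
    query_mean = mean(genome_values[i] for i in query_indices)
    random_means = []
    for _ in range(replicates):
        if contiguous:
            # Assumption: a linear genome, so blocks never wrap around.
            start = rng.randrange(len(genome_values) - size + 1)
            sample = genome_values[start:start + size]
        else:
            sample = rng.sample(genome_values, size)
        random_means.append(mean(sample))
    n_higher = sum(m > query_mean for m in random_means)
    return query_mean, random_means, n_higher
```

The fraction `n_higher / replicates` gives a simple empirical indication of how unusual the query set is; a histogram of `random_means` corresponds to the distribution of random set averages shown in the figure.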
To start over and select a new set of genes for analysis, press the “new search” button at the top right of the analysis screen (your old set of genes will not be saved).
Many classic analyses of codon usage and DNA sequence composition can quickly and easily be recapitulated in CodonExplorer. This section provides a step by step guide to using CodonExplorer to examine the effects of selection, mutation, and horizontal transfer on codon usage.
Histograms of CAI values, and plots of CAI against GC content at the third codon position (P3), can provide insight into selection for translational efficiency. Monte Carlo techniques are useful in assessing the significance of differences in CAI or P3/CAI between a subset of genes and the genome as a whole. Generating the plots of CAI versus P3 shown in Figs. 10.10 and 10.11 requires analysis of all genes in the Salmonella enterica serovar Typhimurium LT2 genome (to generate panels A and C in Fig. 10.10), followed by an analysis of highly expressed genes in Salmonella (to generate panels B and D in Fig. 10.10 and panels A and B in Fig. 10.11).
To select all genes in the Salmonella enterica serovar Typhimurium LT2 genome for analysis, enter “stm” (see Note 6) in the “Enter KEGG genome abbreviation(s)” field. Then press the “Search for Genes” button at the bottom of the page. A new screen will appear showing the group of all genes in Salmonella, and several available analyses. Check the “scatter plot” check box by the “CAI vs. P3” header (to select the analysis in Fig. 10.10 A), then check the “CAI” check box to the right of the “Histograms” header (to select the analysis in Fig. 10.10 C). Finally, press the “Generate selected graph(s)” button at the bottom of the screen. After a brief wait, the images in Fig. 10.10 A and C will appear in a new window. (Right-click on the figures and select “Save image as…” to save them to disk as .png images.)
When you finish examining and/or saving the figures, close the figure window, and press the “new search” button at the top of the analysis screen to start a new search. On the search screen, locate the “Gene screen” header (Fig. 10.3), enter “stm” into the field to the left of the “KEGG genome abbreviation” header, and select “Ribosomal proteins + highly expressed KOs” from the pull-down menu. Press the “Search for Genes” button at the bottom of the screen.
When the analysis screen appears, check the boxes to select a “CAI vs. P3” scatter plot and a CAI histogram as before (to generate Fig. 10.10 B and D). Finally, to generate Fig. 10.11 A, check the “generate Monte Carlo Histograms for the selected gene group(s)” checkbox, then select “CAI (two g)” from the “Compositional measure” pull-down menu, and the desired number of replicates (in this case 100) from the “Number of replicates(n)” pull-down menu. Set the “Sample reference Set” pull-down menu to “individual genes” (see Note 15). Then click the “Generate Selected Graph(s)” button. After a brief pause, the images in Figs. 10.10 B, D, and Fig. 10.11 A should appear in a new window. After examining and (if desired) saving the images to disk, return to the select analysis screen and uncheck all boxes other than the Monte Carlo histogram box. Then set the “Compositional measure” pull-down menu to “P3/CAI.” Once again press “Generate selected graph(s)” at the bottom of the screen. The image from Fig. 10.11 B should appear in a new window.
Along with mutation pressure, selection for translational efficiency is believed to be an important force driving codon usage. Thus, those genes that are more highly expressed are expected to have unusually high CAI values for their GC content. Ribosomal proteins are highly expressed (although not all have high CAIs), as are elongation factors and some membrane factors (27). Highly expressed genes can often be detected by plotting CAI versus P3 (Fig. 10.10).
The trends in CAI versus P3 plots or histograms of CAI can be confirmed using the Monte Carlo histogram feature (Fig. 10.11). This is useful because small sets of genes frequently display apparently unusual codon usage or sequence compositional properties simply due to the effects of a small sample size. Thus, comparing against randomly chosen gene sets of the same size can be used to demonstrate or refute the statistical significance of an apparent difference in the plots.
CodonExplorer can be used to generate various P3 versus CAI plots useful for studying the effect of mutation pressure on codon usage. To generate the P3 versus CAI plots shown in Fig. 10.12, enter the KEGG abbreviations for the genomes desired (see Note 6) in the “Enter KEGG genome abbreviation(s)” field, entering one genome abbreviation per line. (When generating Fig. 10.12, “mge,” “eco,” and “sco” were entered on separate lines to search for all genes in Mycoplasma genitalium, Escherichia coli K-12 MG1655, and Streptomyces coelicolor, respectively.) Then press the “Search for Genes” button. A new screen will appear listing the genes found by the search, and available analyses. To generate the plots in Fig. 10.12 click the “scatter plot,” “heat-map,” and “contour plot” checkboxes next to the “CAI vs. P3” header. Make sure the “Combine Data Sets” option is set to “No,” then click the “Generate Selected Graph(s)” button. After a brief wait, figures corresponding to each of the nine panels in Fig. 10.12 will be generated.
Another type of plot useful for analyzing the effects of mutation pressure is the fingerprint plot (19). To generate the fingerprint plot shown in Fig. 10.13, first ensure that you are on the gene search page (if you are on the select analysis page press the “new search” button at the top of the page to start a new search). Search for all genes in the human genome by entering the KEGG genome id for Homo sapiens (hsa) in the “Enter KEGG genome abbreviation(s)” field (see Note 6). Then press “Search for Genes.” A new screen will appear displaying the selected genes and various options for analysis. Select the “combined” and “gradient” checkboxes next to the “Fingerprint plots” header in the codon usage section (see Fig. 10.7). Finally, click the “Generate Selected Graph(s)” button at the bottom of the page. A combined fingerprint plot as well as fingerprint plots for genes in each 10% bin of third codon position GC content (P3) will be generated. The combined fingerprint plot for Homo sapiens (Fig. 10.13 A) as well as the fingerprint plots for the 70–80% (Fig. 10.13 B), 80–90% (Fig. 10.13 C), and 90–100% GC content (Fig. 10.13 D) bins for this genome are shown in Fig. 10.13.
The scatter plot of CAI versus P3 illustrates the variable effect of GC content on CAI in different lineages (Fig. 10.12 A–C). In the Mycoplasma genitalium genome, which possesses a low GC content, there exists a strong negative correlation (r² = 0.86) between the position three GC content and CAI, whereas in the high-GC Streptomyces coelicolor CAI and GC content are positively correlated (r² = 0.73).
One difficulty when examining scatter plots of CAI versus P3 is that it can be difficult to determine the relative density of points in different regions of a scatter plot when the number of data points is very large. To help solve this problem, CodonExplorer allows generation of both contour plots (Fig. 10.12 D–F), and heat maps (Fig. 10.12 G–I). These plots more clearly illustrate the relative density of different regions.
The fingerprint plot in Fig. 10.13 illustrates that groups of codons containing different nucleotides at the second codon position tend to change in amino/keto bias together (20). The partial gradient fingerprint plot (Fig. 10.13 B–D) illustrates that although codons for different amino acids respond differently to changes in GC content, groups of codons with the same second position nucleotide tend to respond similarly.
Monte Carlo techniques are useful for testing the statistical significance of differences in codon usage or nucleotide sequence composition between putatively transferred sets of genes and the genome as a whole. To generate the Monte Carlo histograms in Fig. 10.14, first, select the genomic island of interest from the “select genomic island” pull-down menu (Fig. 10.3), or enter the first and last gene in the island of interest, using the “Search by Gene Region” option (Fig. 10.2; see Program Usage, above). Then press the “Search for Genes” button at the bottom of the page. Next, on the search results/analysis page, select the “generate Monte Carlo Histograms for the selected gene group(s)” checkbox, the desired compositional property from the “compositional measure” pull-down menu (for Fig. 10.14, “CAI (two g)” was selected), and the desired number of replicates (in this case 100) from the “Number of replicates(n)” pull-down menu. Make sure the “Contiguous blocks” option is selected from the “Sample reference Set” pull-down menu (see Note 15) and then click the “Generate Selected Graph(s)” button.
Because optimal codon usage varies in divergent microbial lineages, it has been proposed that genes that have been transferred from one genome to another will have patterns of codon usage more similar to the genome from which they were transferred than to the genome in which they currently reside (27) (Guy and Sueoka, submitted). Thus, unusually low CAI values (e.g., as shown for the SPI-2 pathogenicity island and the Fels-1 prophage in the genome of Salmonella enterica serovar Typhimurium LT2 displayed in the Monte Carlo histograms in Fig. 10.14) are consistent with the hypothesis that these regions were acquired by HGT. However, these values on their own do not prove that such a transfer occurred.
Although an unusual sequence composition for a gene (or group of genes) is often taken as an indication of horizontal transfer, other factors may also explain unusual codon usage. Thus, while unusual codon usage may suggest the possibility of lateral transfer, the hypothesis of transfer should be confirmed using additional methods for HGT detection, such as discordance between gene trees and organismal trees, lack of orthologs in related lineages, unusual dinucleotide usage, or the presence of nearby mobility genes.
Conversely, horizontally transferred genes may, over time, alter their composition and codon usage to match that of the genome in which they reside (28). Thus a lack of unusual codon usage or nucleotide composition does not rule out the possibility of recent horizontal transfer from a lineage with similar properties, nor does it exclude the possibility of an ancient transfer that is no longer detectable using these methods.
This work was supported in part by the Biophysics and SCR training grants T32GM08759 and T32GM065103 from NIH. CodonExplorer is hosted on the W.M. Keck Foundation Bioinformatics Facility at the University of Colorado, Boulder.
1. Other tools for codon usage analysis are available online. Examples include the following:
2. One disadvantage to compositional methods for HGT detection is that transfer between related or unrelated strains with similar compositional characteristics will not be detected. Conversely, if a sequence characteristic varies within a genome, then classes of genes, such as ribosomal proteins, that have unusual compositions may appear to have been horizontally transferred. Even when the composition of a transferred gene is initially distinct from the composition of the recipient genome, the composition of the gene will drift toward that of the recipient genome over time. Thus, the traces of ancient transfers may be obliterated in sequences that have had time to equilibrate fully (28).
3. The Kyoto Encyclopedia of Genes and Genomes (KEGG) is an online database that seeks to organize and integrate information on biological systems from the genomic to the environmental level. As part of this effort, KEGG contains re-annotations of genome sequences taken from RefSeq. Each gene within the genomes re-annotated by KEGG is given a unique KEGG gene id. Similarly, each genome is assigned a KEGG genome id. The KEGG website is http://www.genome.jp/kegg/.
4. For details on EC groups, see the website of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (http://www.chem.qmul.ac.uk/iubmb/enzyme/).
5. Only genes that match all properties in the fields under search by name or ID will be reported. Thus, entering “rpsU” in the gene name field and “stm” (the KEGG genome abbreviation for Salmonella enterica serovar Typhimurium LT2) in the genome field will find only the rpsU gene from Salmonella, whereas the same search with the genome field left empty will return records for all genes named “rpsU” across KEGG genomes.
6. A list of KEGG genome IDs, grouped by phylogeny, can be found at the KEGG website (http://www.genome.jp/kegg/catalog/org_list.html). Similarly, a list of NCBI taxon names can be found at http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Root.
7. Due to computational constraints, users are currently allowed to select a maximum of 100 genomes at a time for analysis, so selecting large or heavily sampled lineages (e.g., all Gamma-proteobacteria) for simultaneous analysis using the NCBI taxonomy search is not possible at this time.
8. Because KO groups may contain multiple members in a single genome (e.g., there are currently two members of KO3553 in the Bacillus cereus e331 genome), members of a KO group are not always strict orthologs.
9. If the KO number for a particular group of genes is unknown, but an example of the group is known, one can search for the known gene by name or ID (see above); click “View” (in the view genes column) for a genome containing the gene name or ID, then follow the link under the “description” column to the KEGG gene description page. This description includes the KO group to which the gene belongs.
10. Currently, only individual EC IDs are allowed, not IDs for entire EC groups. However, if all EC IDs within a group are desired, each EC ID in the group can be entered on a separate line.
11. In some analyses it is desirable to separate genes on each strand of the DNA. One can do so in CodonExplorer by using the region selector for the entire genome, choosing only genes in either “forward” or “reverse” orientation, then repeating the analysis using genes in the other orientation.
12. The boundaries of many genomic islands are not unambiguously defined, and the gene ranges given in the pull-down menu should not be considered definitive. Where a better gene range for an island has been determined (e.g., by comparative or compositional analysis), the literature-defined island can be expanded or reduced: search for the predefined island, note the first and last genes, and then use the Search by Gene Region feature to define a new region.
13. If searches for genes or genomes repeatedly fail, it is sometimes useful to click the “Clear Search Results” button at the bottom of the page to ensure that all previous entries have been cleared.
14. When plotting codon fingerprints, it is important to remember that the smaller the dataset, the larger the expected visual difference in the plots due to chance. While this is not likely to be a major problem at the whole genome level, it can be an important factor when analyzing codon fingerprints for each gene individually, or when using the P3 gradient option. It is a good idea to test the significance of an observed visual difference in codon fingerprint plots using Monte Carlo analysis (see the section on Monte Carlo Histograms, above).
15. Comparing a contiguous block of putative HGT genes to individual random genes using the Monte Carlo tool may bias the analysis, since genes are not distributed randomly on the genome. Some classes of genes, such as those encoding ribosomal proteins, are both unusual in composition and clustered together in the genome. Thus it is usually better to compare a contiguous set of genes (e.g., a putative genomic island) to other similarly sized contiguous blocks of genes.