DNA composition in general, and codon usage in particular, is crucial for understanding gene function and evolution. CodonExplorer, available at http://bmf.colorado.edu/codonexplorer/, is an online tool and interactive database containing millions of genes. It allows rapid exploration of the factors governing gene and genome compositional evolution, and exploits GC content and codon usage frequency to identify genes whose composition suggests high levels of expression or horizontal transfer.
Nucleotide, codon and amino acid preferences vary greatly among genes and organisms. Codon usage preferences can occur because there are 64 codons but only 20 amino acids (with exceptions), so some amino acids are encoded by multiple codons. The earliest studies of nucleotide frequencies (Sueoka, 1961) suggested that organisms have different biases towards certain synonyms. Even the first gene sequences confirmed that codon usage varies greatly in different organisms. These systematic biases can be exploited to perform many analyses of great theoretical and practical interest.
Many factors affect codon usage. For example, the Codon Adaptation Index (CAI) (Sharp and Li, 1987) measures similarity to codon usage of known highly expressed genes and correlates with overall expression. However, despite the effects of selection for amino acid and codon usage, strong linear trends relate the GC content of the whole genome to the GC content at each codon position across many organisms, suggesting that mutational biases also play a crucial role in shaping codon usage (Muto and Osawa, 1987). Indeed, multivariate analyses typically identify expression levels and genomic GC content as the principal factors structuring composition of individual genes (Gupta and Ghosh, 2001). Thus selection and mutation are key codon usage determinants.
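To make the CAI concrete, the following is a minimal sketch based on the definition of Sharp and Li (1987): each codon receives a relative adaptiveness w, its count in a reference set of highly expressed genes divided by the count of the most used synonymous codon for the same amino acid, and the CAI of a gene is the geometric mean of w over its codons. The function names, the standard genetic code table and the pseudocount of 0.5 for unobserved codons are assumptions made for illustration; this is not CodonExplorer's own code.

```python
import math
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in TCAG order (assumed; other codes would need another table).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(seq):
    """Split an in-frame coding sequence into codons, dropping a trailing partial codon."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes, pseudocount=0.5):
    """Return w for every codon that has at least one synonym.

    The pseudocount for unobserved codons is an assumption of this sketch.
    """
    counts = Counter(c for gene in reference_genes for c in split_codons(gene))
    synonyms_by_aa = defaultdict(list)
    for codon, aa in CODON_TO_AA.items():
        synonyms_by_aa[aa].append(codon)
    w = {}
    for aa, synonyms in synonyms_by_aa.items():
        if aa == "*" or len(synonyms) == 1:
            continue  # stop codons, Met and Trp carry no synonymous choice
        adjusted = {c: max(counts[c], pseudocount) for c in synonyms}
        best = max(adjusted.values())
        for codon, n in adjusted.items():
            w[codon] = n / best
    return w


def cai(gene, w):
    """Geometric mean of w over the codons of `gene` that have synonyms."""
    logs = [math.log(w[c]) for c in split_codons(gene) if c in w]
    if not logs:
        raise ValueError("no informative codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Toy reference set: CTG x3, CTT x1, AAA x2, AAG x1.
w = relative_adaptiveness(["CTGCTGCTGCTTAAAAAAAAG"])
print(cai("CTGAAA", w), cai("CTTAAG", w))
```

In this toy example a gene built only from the preferred codons scores 1, while a gene built from the less used synonyms scores lower.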
Codon usage also allows insight into mutational processes operating in different genomes (e.g. Sueoka, 2002), or in different parts of the same genome where regions of compositional heterogeneity such as isochores exist (Bernardi, 1993). Codon usage can also suggest horizontal gene transfer (HGT), the movement of genes between different genomes, because different genomes have different characteristic compositions (Karlin et al., 1998).
CodonExplorer, built using PyCogent (Knight et al., 2007), provides a platform for rapid testing of hypotheses about codon usage in sequenced genes and genomes. By precomputing statistics from whole-genome databases, using thousands of CPU-hours, CodonExplorer provides graphical summaries of vast datasets consisting of millions of genes and hundreds of genomes in seconds.
CodonExplorer is especially effective for revealing patterns associated with gene expression changes, mutational biases and HGT. It allows users to conveniently retrieve genes by genome, function or orthology, and then to visualize the composition of these genes. Analyses include:
P1 and P2 versus P3 GC: the effects of selection for amino acid usage and mutational bias can be contrasted by plotting GC content at the first or second codon positions (P1 and P2) against that at the third position (P3) (Sueoka, 1995).
Codon fingerprint and PR2 bias plots: the ratios of different kinds of codons can indicate whether deamination or oxidation contributes more to the pattern of codon usage in a specific organism, using techniques such as the fingerprint plot and the PR2 bias plot (see Sueoka, 2002, for examples of both and for references to prior work). Chargaff's second parity rule (PR2) states that within each DNA strand, the frequency of A ≈ T and the frequency of G ≈ C (Rudner et al., 1969). In the fingerprint plot, each amino acid is drawn as a circle whose position on the y-axis is the frequency of A at the third position of that amino acid's codons relative to the combined frequencies of A and T at that position, A3/(A3+T3), and whose position on the x-axis is the corresponding ratio G3/(G3+C3). The radius of each circle is proportional to the relative frequency of that amino acid. PR2 bias plots show the extent of bias relative to P3.
Histograms: histograms can be constructed of CAI values, GC content at the third codon position (P3), gene length or peptide hydrophobicity, and can include a customizable Monte Carlo analysis to test whether observed differences between the properties of a selected subset of genes and the rest of the genome are statistically significant.
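The quantities behind the first two analyses listed above can be sketched in a few lines. The example below computes GC content at each codon position (P1, P2, P3) for one coding sequence, and the fingerprint coordinates G3/(G3+C3) and A3/(A3+T3) for each amino acid together with its relative frequency. The function names and the standard genetic code table are assumptions for illustration, and amino acids for which either ratio is undefined are simply skipped here; CodonExplorer's own choices may differ.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in TCAG order (assumed for this sketch).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(seq):
    """Split an in-frame coding sequence into codons, dropping a trailing partial codon."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def positional_gc(seq):
    """Return the GC fraction at codon positions 1, 2 and 3 as a tuple (P1, P2, P3)."""
    codons = split_codons(seq)
    if not codons:
        raise ValueError("sequence contains no complete codon")
    return tuple(sum(c[i] in "GC" for c in codons) / len(codons) for i in range(3))


def fingerprint(seq):
    """Map amino acid -> (x, y, weight) for a fingerprint plot.

    x = G3/(G3+C3), y = A3/(A3+T3), weight = relative frequency of the amino acid.
    Amino acids with an undefined ratio are omitted (a choice of this sketch).
    """
    third = defaultdict(lambda: defaultdict(int))
    total = 0
    for codon in split_codons(seq):
        aa = CODON_TO_AA.get(codon)
        if aa is None or aa == "*":
            continue  # ignore ambiguous codons and stops
        third[aa][codon[2]] += 1
        total += 1
    points = {}
    for aa, n in third.items():
        gc, at = n["G"] + n["C"], n["A"] + n["T"]
        if gc == 0 or at == 0:
            continue
        points[aa] = (n["G"] / gc, n["A"] / at, (gc + at) / total)
    return points


example = "GCTGCAGCGGCCAAA"  # four Ala codons and one Lys codon
print(positional_gc(example))
print(fingerprint(example))
```

Plotting P1 or P2 against P3 across many genes, or drawing each fingerprint point as a circle scaled by its weight, then only requires a plotting library of the reader's choice.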
Histograms of CAI values, and plots of CAI against GC content at the third codon position (P3), can provide insight into selection for translational efficiency or mutational effects on codon usage. Figure 1a and b shows how two different genomes, Mycoplasma genitalium (low-GC) and Streptomyces coelicolor (high-GC), differ in compositional evolution: genes that fit each genome's overall GC preference are more likely to be highly expressed with high CAI values. Figure 1c shows fingerprint plots from different GC ranges of the S.coelicolor genome: at lower (non-preferred) GC, codon usage is relatively unbiased, whereas at higher GC distinct preferences for specific codons are apparent.
CodonExplorer can also employ Monte Carlo techniques for testing the statistical significance of differences in codon usage or nucleotide sequence composition between putatively transferred sets of genes and the genome as a whole. Figure 1d shows unusually low CAI for the SPI-2 pathogenicity island in the genome of Salmonella enterica serovar Typhimurium LT2, consistent with the hypothesis that this region underwent HGT.
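One way such a Monte Carlo test could work is sketched below: the mean of a per-gene statistic (for example CAI) in the selected subset is compared with the means of many random gene sets of the same size drawn from the genome. The exact procedure used by CodonExplorer is not described here, so the one-sided test on the mean, sampling from the whole genome, the function name and the parameters are all assumptions of this sketch.

```python
import random
from statistics import mean


def monte_carlo_low_tail_p(subset_values, genome_values, n_trials=9999, seed=0):
    """Estimate how often a random gene set has a mean as low as the observed subset.

    subset_values: statistic (e.g. CAI) for the genes of interest.
    genome_values: the same statistic for all genes in the genome (a list).
    Returns a one-sided p-value with the usual +1 correction, so it is never zero.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(subset_values)
    size = len(subset_values)
    hits = sum(
        mean(rng.sample(genome_values, size)) <= observed
        for _ in range(n_trials)
    )
    return (hits + 1) / (n_trials + 1)


# Toy data: five low-CAI genes in a genome of otherwise high-CAI genes.
genome = [0.1] * 5 + [0.8] * 95
island = [0.1] * 5
print(monte_carlo_low_tail_p(island, genome, n_trials=999))
```

A small p-value indicates that random gene sets rarely show a mean as low as the selected set, which is the kind of evidence the SPI-2 example relies on; a two-sided test or a different summary statistic would follow the same pattern.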
By allowing users to rapidly perform a wide array of compositional analyses on customizable gene collections, CodonExplorer provides a powerful platform for investigating many phenomena.
We thank Justin Kuczynski and other members of the Knight lab for comments on drafts of the manuscript. James Michaelis and Brice Pelle contributed to interface design in the course CSCI4838. CodonExplorer is hosted on the W.M. Keck Foundation Bioinformatics Facility at CU Boulder.
Funding: NIH/CU Molecular Biophysics and Signaling & Cell Regulation Training Programs (T32GM065103 and T32GM08759, in part); NSF EAPSI fellowship (OISE0812861, in part); National Institutes of Health (P01DK078669, in part); CU URAP and HHMI/UROP scholarships (in part).
Conflict of Interest: none declared.