Viral codon usage is shaped by the conflicting forces of mutational pressure and selection to match host patterns for optimal expression. We examined whether genomic architecture (single- or double-stranded DNA) influences the degree to which bacteriophage codon usage differs from that of their primary bacterial hosts and from each other. While phages of both architectures correlated equally well with their hosts’ genomic nucleotide content, the coat genes of ssDNA phages were less well adapted than those of dsDNA phages to their hosts’ codon usage profiles due to their preference for codons ending in thymine. No specific biases were detected in dsDNA phage genomes. In nine of the ten cases of codon redundancy in which a specific codon was overrepresented, ssDNA phages favored the NNT codon. A cytosine to thymine biased mutational pressure working in conjunction with strong selection against non-synonymous mutations appears to be shaping codon usage bias in ssDNA viral genomes.
Viruses usually exhibit genomic signatures that closely mimic those of their primary hosts,1,2 in part to better evade innate and acquired immune responses.3,4 However, the majority of the close adherence to host nucleotide usage is attributed to selection for improved translational speed and efficiency, which are correlates of viral fitness. Synonymous codons are used at different frequencies in virtually all organisms,5,6 and the most frequently used codons correlate with the most abundant tRNAs within a cell.7,8 These favored synonymous codons are therefore recognized9 and translated10-12 more rapidly. The most frequently expressed cellular genes within a given organism exhibit similar patterns of this codon usage bias (CUB) and are more biased than less frequently expressed genes.11,13-16 For viruses, these factors should contribute to increased rate of replication when strictly adhering to host CUB. Therefore many viruses have been under selective pressure to match the CUB of their preferred hosts.17 Despite increased attention to the genomic match between viruses and their hosts, there have been few studies examining how different viral genomic architectures facilitate or hinder adaptation to their hosts’ genomes.
Phages are the optimal system in which to explore how genomic architecture affects viral molecular evolution. The codon bias expressed in prokaryotic hosts is constant for each host cell, unlike in multicellular organisms, in which codon usage profiles are affected by tissue-specific gene expression.18 Perhaps due to this, phages are more strongly adapted to their primary hosts’ CUB than eukaryotic viruses,19 allowing the greatest potential to identify factors that diminish the match between virus and host genomes. Bacterial hosts also offer a wider range of genomic nucleotide content to examine compared with plant or mammalian hosts, and their CUB has been well documented. Additionally, while phage host ranges are far from perfectly annotated, they are usually quite narrow20 and many have been better delineated than those of eukaryotic viruses, such as phytopathogens.19
Two distinct phage genomic architectures (single-stranded DNA, ssDNA, and double-stranded DNA, dsDNA) have been amply sequenced; unfortunately, the small number of sequenced RNA phages precludes their close examination at this time. The two DNA-based architectures are subject to specific constraints: dsDNA phages can house the largest genomes, up to ~300 kb,21,22 whereas even the largest ssDNA phages are smaller than 10 kb.23 Many dsDNA phages encode their own tRNAs (e.g., T4 encodes eight24), decreasing selection for adherence to host CUB, whereas no tRNA genes have been found in ssDNA phages. dsDNA phages have the lowest mutation rates among viruses, while ssDNA phage mutation rates are faster, approaching those of a dsRNA phage.25,26 Eukaryotic viruses with the same ssDNA genomic architecture exhibit evolutionary rates orders of magnitude above those seen in eukaryotic dsDNA viruses.27 Consequently, faster-evolving ssDNA phages might be better able to adapt to host-imposed genomic conditions. Conversely, the mutation frequency in ssDNA phages may diminish their ability to conform to their host codon preferences.
Genomic GC content is a rough predictor of CUB, and many viruses match the GC content of their hosts.28-32 Bacteriophage GC content, in particular, correlates strongly to that of their primary bacterial hosts.33 We measured the similarity in GC content between each ssDNA and dsDNA GenBank phage reference genome and that of its primary host. We used the most numerous group of phages with a common host, Escherichia coli, to compare codon adaptation indices (CAI) and relative synonymous codon usage (RSCU) for a subset of highly expressed genes from dsDNA and ssDNA coliphages.1 Our results show that genomic architecture correlates to statistically significant differences in nucleotide content and codon usage between ssDNA and dsDNA phages, and point to an enrichment of thymine as a cause.
GC content in ssDNA and dsDNA phages was highly correlated with host GC content (r² = 0.82 for ssDNA phages, 0.84 for dsDNA phages; the two correlations did not differ significantly, p = 0.72) across a very wide range of host GC content (~0.25 to ~0.72) (Fig. 1). A previous study found significant differences between ssDNA and dsDNA phage nucleotide correlation with their hosts,33 but the additional 333 dsDNA and 13 ssDNA reference sequences added to GenBank since that analysis suggest there is no difference (Table S1 and Fig. S1). ssDNA phages exhibited a pronounced genomic thymine bias (average 0.30 T), but nonetheless infected hosts with a range of GC contents (0.25 to 0.70) as wide as that of dsDNA phages (0.26 to 0.72).
Correlated GC content was a poor predictor of strong CAI match between E. coli and the coat genes of its phages. The mean CAI of ssDNA coliphages was 0.706, while the dsDNA phages were significantly better matched to E. coli (0.744, p < 0.001, Fig. 2). This number includes eight dsDNA coliphage genomes for which tail protein-encoding genes were used, rather than coat protein-encoding genes, due to the absence of properly annotated coat genes. The inclusion of tail genes did not change the results of this analysis (p < 0.001 with and without the eight tail genes). The evidence of selection for translational efficiency is stronger for dsDNA phages.
Comparison of the GC content of the first two positions of each codon (GC1,2) and the third position (GC3) of these genes revealed an interesting pattern: for both ssDNA and dsDNA coliphages, the GC1,2 was restricted to a tight range between about 0.45 and 0.55. dsDNA GC3 varied along a wide range, from 0.26 to 0.69, but ssDNA GC3 occupied a narrower range, from 0.30 to 0.54 (Fig. 3). Furthermore, when plotted with a line representing a perfect correlation between GC1,2 and GC3, all but one of the ssDNA phages fell to the left of that line (Fig. 3), indicating a paucity of GC in the third codon position of their coat genes. Conversely, the dsDNA coat genes were GC3-rich or GC3-poor in approximately equal numbers. Past studies have indicated that strong mutational biases often occur with low levels of CUB,34-36 possibly because a strong, non-specific mutational pressure would prevent any persistent, directional changes in the genome. The consistently lower GC3 content of the ssDNA genes suggests that a specific mutational pressure might be reducing GC3 content in a directional manner, which is disrupting the effects of selection for translational efficiency.
We further investigated the GC3-poor nature of ssDNA coliphage coat genes with RSCU analysis. It revealed statistically significant variation in use for 15 of 59 codons between ssDNA and dsDNA phages (p < 0.03 for TTG, p < 0.002 for CTT and TCC, p < 0.001 for all other codons, Fig. 4). Notably, for four of the five codons more frequently used by ssDNA than by dsDNA coliphages, thymine was in the third position. No codon enriched in dsDNA phages relative to ssDNA phages contained thymine in the third position.
Calculation of RSCUs of coat genes in 28 ssDNA phages with a diverse host range confirmed this pattern: codons with thymine in the third position were extremely overrepresented (p < 0.001) for six amino acids (A, D, G, I, T, V), and were significantly favored (p < 0.012) in three more (H, P, S) (Fig. 5). Only one of the remaining nine degenerate amino acids had a statistically preferred codon in ssDNA phages (GAA for E, p < 0.01).
We subdivided our data set to separately examine the two morphologically distinct families of ssDNA phages, the Inoviridae and the Microviridae. Because inoviruses are frequently vertically transmitted and can productively infect their hosts without causing lysis, they might be under increased selective pressure to match the genomes of their more permanently associated hosts. RSCU comparisons revealed no consistent patterns associated with phage lifestyle. No difference in RSCU was evident for 11 of the 16 NNT codons in these groups (Fig. S2).
Cytosines are comparatively unstable and readily undergo spontaneous deamination to uracil, resulting in C to T transitions after unrepaired replication.37 This spontaneous deamination occurs 100 times more frequently in ssDNA than dsDNA, resulting in a higher mutation rate at cytosines38 than at other bases in ssDNA phage.39 ssDNA phage genomes appear to spend more time truly single-stranded, as they do not experience consistent intra-strand base pairing or regular secondary structure formation while encapsidated.40-45 This causes ssDNA phages to more frequently have unpaired bases than ssRNA genomes, which are constrained by extensive stem-loop formation both in the cytosol and when encapsidated.46
Any thymine-increasing bias does not appear to have a discernible effect on genomic nucleotide content relative to the phages’ primary hosts. Rather, it is likely that cytosine transitions in the first or second positions are subject to strong purifying selection relative to the wobble position,47-49 and the signature of this mutational bias is only observed in the overabundance of thymine in the third position of synonymous ssDNA phage codons. The significant overrepresentation of NNT codons is strongly indicative of a biased mutational pressure acting in concert with strong selection against non-synonymous substitutions.
Genomic architecture (nucleic acid, segmentation, strandedness), while acknowledged as an important characteristic of virus taxonomy, is not typically included in broad-scale analyses of viral evolution. Instead, most comparisons focus within a single kind of virus,50 and while many of these studies have provided insight into the codon usage biases of individual viruses, this is the first observation of a specific bias with a possible mechanistic explanation. Examining across two architectures, we saw strandedness play a critical role in the composition of phage genomes, and in determining the limits of ssDNA viral adaptation to their hosts.
All available ssDNA and dsDNA bacteriophage genome reference sequences were collected from GenBank on March 16, 2011. Reference sequences were used to avoid biasing our data sets toward any particular phage species, or highly studied phage, such as the model organisms PhiX174 or T7. These genomes were separated according to genomic architecture for further analysis. Initially collected were 41 ssDNA phages and 447 dsDNA phages (Table S2). For each phage having a known host with a sequenced genome (GenBank reference sequence), the relationship between the GC content of the phage and the host bacterium was examined. Because not every sequenced phage has an identified and sequenced host, not all phages were included in this analysis. Four ssDNA phages were excluded, as were 44 dsDNA phages (Table S2).
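The phage-to-host GC comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scripts used in the study: the function names `gc_content` and `r_squared` are hypothetical, ambiguous bases are ignored when computing GC content, and r² is taken to be the squared Pearson correlation between paired phage and host values.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return float("nan")
    return (seq.count("G") + seq.count("C")) / acgt


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)
```

In use, one would compute `gc_content` for each phage genome and for its primary host genome, then pass the two lists of paired values to `r_squared` separately for the ssDNA and dsDNA sets.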
The codon usage biases of representative ssDNA and dsDNA phages were examined to gain a more complete picture of the CUB patterns in both architectures. Codon usage profiles were determined using major coat/capsid genes, or, in the eight cases for which coat genes were not available, tail gene sequences retrieved from GenBank reference genomes (Table S3). These structural proteins are highly expressed and exhibit the highest degrees of codon usage bias found in phage.51,52 We compared codon usage between the two genomic architectures for phages infecting a single host: Escherichia coli. Coat or tail genes from 11 ssDNA and 34 dsDNA phages were used (Table S3). The online CAIcal tool53 was used to calculate each phage’s codon adaptation index (CAI), a measure of the degree to which one gene or set of genes adheres to the CUB of another gene or set of genes,1 as implemented by Xia.54 CAI ranges from zero to one; values closer to one indicate a strong correlation. The average CAI was calculated for both architectures.
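To make the CAI definition concrete, the sketch below follows the classic formulation in which each codon receives a relative adaptiveness w (its count in a reference set divided by the count of the most used synonymous codon), and the CAI of a gene is the geometric mean of w over its codons. It does not attempt to reproduce CAIcal or the implementation cited above. The function names are hypothetical, the 0.5 pseudocount for codons absent from the reference set is an assumption, and the reference counts would in practice come from a host codon usage table rather than from a single sequence.

```python
import math
from collections import Counter

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def codon_counts(cds):
    """Count in-frame codons of a coding sequence."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))


def relative_adaptiveness(ref_counts, pseudocount=0.5):
    """w for each degenerate codon; Met, Trp and stop codons are skipped."""
    by_aa = {}
    for codon, aa in CODON_TABLE.items():
        if aa not in "*MW":
            by_aa.setdefault(aa, []).append(codon)
    w = {}
    for codons in by_aa.values():
        # Assumption: unseen codons get a small pseudocount so w is never 0.
        adjusted = {c: ref_counts.get(c, 0) or pseudocount for c in codons}
        top = max(adjusted.values())
        for c in codons:
            w[c] = adjusted[c] / top
    return w


def cai(cds, w):
    """Geometric mean of w over the scored codons of one gene."""
    counts = codon_counts(cds)
    scored = [(c, n) for c, n in counts.items() if c in w]
    total = sum(n for _, n in scored)
    if total == 0:
        return float("nan")
    log_sum = sum(n * math.log(w[c]) for c, n in scored)
    return math.exp(log_sum / total)
```

A gene built only from the reference set's most used codons scores 1.0 under this definition; every less favored codon pulls the geometric mean down.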
The frequency of GC in the first and second codon positions (GC1,2) and the frequency of GC in the third position (GC3) were calculated for these genes using CAIcal, and the relationship between the two was analyzed. A plot of GC1,2 against GC3 is a common measure of the factors affecting CUB in a gene or set of genes; a strong correlation between the two implies that genome-wide mutational pressures are the driving force behind CUB, while a weaker correlation indicates that some force is unequally affecting the first two positions and the third position. Usually, this is interpreted as implying a selective force acting on CUB, as is expected to be the case for viruses under relatively strong selection for translational speed.
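The position-specific GC values can be illustrated with a short Python sketch. It is a simplified stand-in for what CAIcal reports, under stated assumptions: all in-frame codons are included (no special handling of Met, Trp or stop codons), trailing bases that do not complete a codon are dropped, and the helper names `gc12_gc3` and `is_gc3_poor` are hypothetical. A gene counts as GC3-poor here when its GC3 is lower than its GC1,2, which corresponds to falling on the low-GC3 side of the line of perfect correlation.

```python
def gc_fraction(bases):
    """Fraction of G and C in a string of bases."""
    if not bases:
        return float("nan")
    return (bases.count("G") + bases.count("C")) / len(bases)


def gc12_gc3(cds):
    """Return (GC1,2, GC3) for an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop an incomplete trailing codon
    first, second, third = (cds[i:usable:3] for i in range(3))
    return gc_fraction(first + second), gc_fraction(third)


def is_gc3_poor(cds):
    """True when the third position holds less GC than the first two."""
    gc12, gc3 = gc12_gc3(cds)
    return gc3 < gc12
```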
To examine the variation in codon usage that contributes to the differing CAI values and site-specific base compositions, relative synonymous codon usage (RSCU) values were calculated for the same sets of genes using CAIcal. RSCU measures how often each codon of a degenerate amino acid is used relative to the level expected if all of its synonymous codons were used with equal frequency. An RSCU of about one indicates that a codon is used as frequently as expected, while values above or below one indicate over- or underuse of that synonymous codon, respectively. Mean dsDNA coliphage RSCUs were compared with mean ssDNA coliphage RSCUs to determine the proximate cause of the observed variation in CAI. RSCU was also calculated for 17 additional sufficiently well-annotated genomes of ssDNA phages infecting a wide range of hosts (primarily Acholeplasma, Bdellovibrio, Chlamydia, Escherichia, Propionibacterium, Pseudomonas, Ralstonia, Spiroplasma, Vibrio and Xanthomonas, Table S4), and the complete set of 28 ssDNA phage RSCUs was assessed for consistent CUB. For amino acids with 6-fold redundancy (L, R, S), RSCUs were calculated separately for the codon sets with 4-fold and 2-fold redundancy. Significantly biased codon use was measured for each codon with one-tailed t-tests (Microsoft Excel) and Bonferroni correction for multiple comparisons (α = 0.017 for 4-fold, α = 0.025 for 3-fold).
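The RSCU calculation, including the separate treatment of the 4-fold and 2-fold codon sets of L, R and S, can be sketched as follows. This is an illustrative reimplementation rather than CAIcal's code: the function names are hypothetical, codon families with no observations in the gene are left out of the result (an assumption), and the statistical testing step is not shown.

```python
from collections import Counter

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def codon_family(codon):
    """Synonymous family key; L, R and S are split by their first two bases."""
    aa = CODON_TABLE[codon]
    return (aa, codon[:2]) if aa in "LRS" else (aa, "")


def rscu(cds):
    """RSCU per codon: observed count over the family's mean count."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    families = {}
    for codon in CODONS:
        if CODON_TABLE[codon] in "*MW":
            continue  # stop codons, Met and Trp have no synonymous choice
        families.setdefault(codon_family(codon), []).append(codon)
    values = {}
    for members in families.values():
        total = sum(counts[c] for c in members)
        if total == 0:
            continue  # family absent from this gene
        expected = total / len(members)
        for c in members:
            values[c] = counts[c] / expected
    return values
```

With this grouping, a value well above one for an NNT codon within its family is the kind of overrepresentation the analysis tests for.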
We thank Dr. Eric Ho (Rutgers University) for use of his Python scripts. This work was supported by funds from the Rutgers School of Environmental and Biological Sciences, the New Jersey Agricultural Experiment Station and NSF MCB 1034927.
No potential conflicts of interest were disclosed.
Supplemental materials can be found at: www.landesbioscience.com/journals/bacteriophage/article/18496
Previously published online: www.landesbioscience.com/journals/bacteriophage/article/18496