Whether particular amino acids are favored by selection at high temperatures over others has long been an open question in protein evolution. One way to approach this question is to compare homologous sites in proteins from one thermophile and a closely related mesophile; asymmetrical substitution patterns have been taken as evidence for selection favoring certain amino acids over others. However, most pairs of prokaryotic species that differ in optimum temperature also differ in genome-wide GC content, and amino acid content is known to be associated with GC content. Here, I compare homologous sites in nine thermophilic prokaryotes and their mesophilic relatives, all with complete published genome sequences. After adjusting for the effects of differing GC content with logistic regression, 139 of the 190 pairs of amino acids show significant substitutional asymmetry, evidence of widespread adaptive amino acid substitution. The patterns are fairly consistent across the nine pairs of species (after taking the effects of differing GC content into account), suggesting that much of the asymmetry results from adaptation to temperature. Some amino acids in some species pairs deviate from the overall pattern in ways indicating that adaptation to other environmental or physiological differences between the species may also play a role. The property that is best correlated with the patterns of substitutional asymmetry is transfer free energy, a measure of hydrophobicity, with more hydrophobic amino acids favored at higher temperatures. The correlation of asymmetry and hydrophobicity is fairly weak, suggesting that other properties may also be important.
Thermophilic organisms live at 50 °C to over 100 °C, temperatures that would quickly denature most proteins from mesophiles. There is considerable interest in determining what enables proteins from thermophiles to function at high temperatures, both for the practical benefit of engineering proteins for high-temperature industrial processes and as an evolutionary and biochemical puzzle.
One way to investigate whether some amino acids are more favorable than others at higher temperature is to compare the overall proportions of amino acids in protein sequences from prokaryotes living at different temperatures (Cambillau and Claverie 2000; Fukuchi and Nishikawa 2001; Chakravarty and Varadarajan 2002; Singer and Hickey 2003; Berezovsky et al. 2007). An amino acid that is more abundant in species living at higher temperatures is then interpreted to be adaptive to the higher temperatures. However, a major problem with this approach is that prokaryotes vary widely in genome-wide GC content, and amino acids with GC-rich codons are generally more abundant in organisms with GC-rich genomes (Lobry 1997; Singer and Hickey 2000). There is conflicting evidence about whether genome-wide GC content shows any relationship with habitat temperature (Musto et al. 2006; Wang et al. 2006), but the strong association of GC content and amino acid abundance will obscure any relationship between temperature and amino acid abundance if the variation in GC content is ignored.
The effects of temperature and GC content can be separated using multivariate statistical techniques, such as principal component analysis (Kreil and Ouzounis 2001; Saunders et al. 2003), correspondence analysis (Tekaia et al. 2002; Lobry and Chessel 2003; Tekaia and Yeramian 2006; Boussau et al. 2008; Puigbò et al. 2008), and other techniques (Naya et al. 2006; Zeldovich et al. 2007). However, these approaches suffer from “phylogenetic pseudoreplication”; they treat multiple species from the same clade and same habitat as if they were independent samples, and it has long been known that this can cause serious statistical problems (Felsenstein 1985; Harvey and Pagel 1991). To illustrate why this is a problem, imagine biologists who were interested in temperature adaptation of terrestrial vertebrates. If those biologists surveyed vertebrates from a variety of habitats and looked for associations with temperature, they would see a higher proportion of species that shed their skin living in warmer areas. However, it would be erroneous to conclude from this that shedding skin is an adaptation to high temperature; the association would merely result from sampling large numbers of Squamata (lizards and snakes) in warm areas and few squamates in cold areas. Similarly, in studies of temperature and amino acid composition, some clades are found predominantly among thermophiles, and some are predominant among mesophiles; for example, of the 204 species studied by Zeldovich et al. (2007), 63% of the thermophiles and 5% of the mesophiles are archaea, whereas 0% of the thermophiles and 54% of the mesophiles are proteobacteria. A multivariate statistical technique that treated each species as an independent data point could produce an apparent association of particular amino acids with higher temperatures, when in reality that association might result from a difference between clades that may have nothing to do with temperature.
A second form of evidence used to compare amino acid composition in mesophiles and thermophiles is substitutional asymmetry (Argos et al. 1979; Haney et al. 1999; McDonald et al. 1999). Protein sequences from one mesophile and one thermophile are aligned, and the observation of more aligned sites with amino acid A in the mesophile and B in the thermophile than the opposite pattern provides evidence that B is favored over A in the higher temperature organism. Because only aligned sites in homologous proteins are considered, the effect of gain or loss of proteins of different amino acid composition does not obscure the results. In addition, each mesophile–thermophile pair of species can be phylogenetically independent of others that have been compared, an important consideration when using comparative methods to infer adaptation. (To say that mesophile–thermophile pair A and B are “phylogenetically independent” of other pairs means that A and B are more closely related to each other than either is to any of the other species in the data set.) This approach has found extensive evidence for substitutional asymmetry (Haney et al. 1999; McDonald et al. 1999; McDonald 2001; Nishio et al. 2003; Mizuguchi et al. 2007), but the problem remains that for those pairs of amino acids whose codons have different GC content, overall differences in GC content between the mesophile and thermophile could still be the cause of substitutional asymmetry. Here, I use logistic regression of the proportion of substitutions in one direction versus the overall difference in GC content to predict the substitutional asymmetry in a pair of species with identical genomic GC content. This method should help determine whether amino acids that are favored at higher temperatures share biochemical properties.
If substitutional asymmetry between mesophilic and thermophilic proteins results from temperature adaptation based on the fundamental biochemical properties of the amino acids, the same patterns should be found in all mesophile–thermophile comparisons after controlling for differences in GC content. Differences in other aspects of the environment, such as salinity, hydrostatic pressure, pH, oxygen, and nutrient source, could cause patterns of asymmetry that are unrelated to temperature and therefore different in different mesophile–thermophile pairs. In addition, biosynthetic costs of amino acids are high enough to cause selection on amino acid usage (Akashi and Gojobori 2002; Seligmann 2003; Heizer et al. 2006; Swire 2007), so organisms which differ in biosynthetic pathways, or which differ in whether they are autotrophic or heterotrophic for a particular amino acid, may have different patterns of substitutional asymmetry. A second goal of this paper is to see how consistent the patterns of substitutional asymmetry are among different species, which may help determine how much of the asymmetry is due to temperature adaptation and how much is due to other factors.
The NCBI Entrez Genome Project database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj) was searched for thermophilic archaea and bacteria (optimum growth temperature, Topt ≥ 50 °C) with complete, published genome sequences. Species from higher taxa in which all species with published genomes are thermophiles, such as Aquificae and Crenarchaeota, were excluded. The closest mesophile (Topt ≤ 40 °C) with a complete published genome sequence was identified for each thermophile using published phylogenies. Where a thermophile had more than one equally closely related mesophile, or vice versa, the species pair with the most similar habitat, physiology, and genomic GC content was chosen. Where more than one strain of a species had been sequenced, the strain with the earliest published sequence was used. Nine phylogenetically independent mesophile–thermophile pairs were identified (table 1); at the time the database was searched, there were no other mesophile–thermophile species pairs with published genomes that were phylogenetically independent of the nine used here.
For seven of the mesophile–thermophile pairs of species, the Entrez Gene Plot function (http://www.ncbi.nlm.nih.gov/sutils/geneplot.cgi) was used to obtain a list of reciprocal best matches of protein sequences. Each list was sorted, and where a sequence from one species had multiple best matches from the other species (which can happen when there are multiple identical protein sequences), all but one of the matching pairs were deleted. Proteins encoded by small extrachromosomal elements in Methanocaldococcus jannaschii or plasmids in the other species were deleted.
For the Pelotomaculum thermopropionicum versus Desulfitobacterium hafniense and Nitratiruptor versus Sulfurovum comparisons, Gene Plot was not available. I therefore used BLAST to obtain a list of the best match for each protein sequence in the other species and then sorted the two lists in a spreadsheet to identify the reciprocal best matches.
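The reciprocal-best-match logic can be sketched in a few lines of Python. This is only an illustration of the idea, not the spreadsheet procedure actually used; the function name and the protein identifiers are hypothetical, and each dictionary is assumed to hold the single best hit for every query protein.

```python
def reciprocal_best_matches(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b AND b's best hit is a.

    best_a_to_b: dict mapping each protein in species A to its best hit in B.
    best_b_to_a: dict mapping each protein in species B to its best hit in A.
    """
    pairs = []
    for a, b in best_a_to_b.items():
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Hypothetical identifiers, for illustration only.
thermo_to_meso = {"T1": "M1", "T2": "M2", "T3": "M2"}
meso_to_thermo = {"M1": "T1", "M2": "T3", "M3": "T2"}

rbm = reciprocal_best_matches(thermo_to_meso, meso_to_thermo)
print(rbm)
```

In this toy input, T2 is dropped because its best hit, M2, has T3 rather than T2 as its own best hit.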
No attempt was made to eliminate proteins whose genes may have been acquired recently by horizontal gene transfer (HGT). Whether a gene could be identified as acquired through HGT would depend on how divergent the source species was and whether its sequences were available; therefore, painstaking investigation of each gene would only result in eliminating some, but not all, such genes. Leaving genes acquired through HGT in the data set would tend to obscure patterns of consistent substitutional asymmetry by introducing noise into the data rather than creating patterns by statistical artifacts that would not be there otherwise.
The complete set of protein sequences was downloaded from Entrez Genome for each species, and a Pascal program was written to use the list of reciprocal best matches, create a file for each pair of protein sequences, extract the protein sequences, and put them in the appropriate files.
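The original program was written in Pascal and is not reproduced here. As a rough sketch of the same bookkeeping, assuming FASTA-formatted input and hypothetical identifiers and file names, the steps might look like this in Python (the sketch builds the file contents in memory rather than writing to disk):

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {identifier: sequence}.

    The identifier is taken to be the first whitespace-delimited token
    after '>' (an assumption about the header layout).
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def paired_records(matches, meso_seqs, thermo_seqs):
    """Build one two-sequence FASTA string per reciprocal best match."""
    files = {}
    for meso_id, thermo_id in matches:
        files[f"{meso_id}__{thermo_id}.fasta"] = (
            f">{meso_id}\n{meso_seqs[meso_id]}\n"
            f">{thermo_id}\n{thermo_seqs[thermo_id]}\n"
        )
    return files


# Toy input with made-up sequences and identifiers.
meso_seqs = parse_fasta(">M1 hypothetical protein\nMKT\nAYH\n>M2\nMSS\n")
thermo_seqs = parse_fasta(">T1\nMKVAYY\n")
files = paired_records([("M1", "T1")], meso_seqs, thermo_seqs)
print(files["M1__T1.fasta"])
```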
Each pair of protein sequences was aligned using ClustalW (Chenna et al. 2003), with the default parameters. Protein pairs with less than 35% identical sites and proteins less than 20 amino acids long were deleted. Ambiguously aligned sites adjacent to gaps were then omitted, with the omitted sites extending from the gap to the nearest pair of adjacent sites that were both identical in the two sequences, using the program AmbiguityRemover. The number of unambiguously aligned sites exhibiting each of the 190 possible pairwise patterns of difference was then counted using the program AsymmetryCounter. Both programs are available for download from http://udel.edu/~mcdonald/asymmetry.html.
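The counting step can be illustrated with a short Python sketch. It is not the AsymmetryCounter program: the function names and the toy alignment are invented for illustration, and the sketch simply skips gapped sites rather than applying the ambiguity-removal rule described above.

```python
from collections import Counter
from itertools import combinations

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def count_directional_differences(meso_aln, thermo_aln):
    """Count aligned sites by (mesophile residue, thermophile residue).

    Both arguments are rows of a pairwise alignment, so they must have
    equal length. Identical sites and sites with a gap or a nonstandard
    residue in either sequence are ignored.
    """
    if len(meso_aln) != len(thermo_aln):
        raise ValueError("aligned sequences must have the same length")
    counts = Counter()
    for m, t in zip(meso_aln, thermo_aln):
        if m in AMINO_ACIDS and t in AMINO_ACIDS and m != t:
            counts[(m, t)] += 1
    return counts


def asymmetry(counts, a, b):
    """Return (sites with a in mesophile and b in thermophile, the reverse)."""
    return counts[(a, b)], counts[(b, a)]


# The 20 amino acids give 190 unordered pairs.
n_pairs = len(list(combinations(sorted(AMINO_ACIDS), 2)))

# Toy alignment: sites are H/Y, K/K, S/T, H/Y, -/G, A/A, Y/H.
counts = count_directional_differences("HKSH-AY", "YKTYGAH")
print(n_pairs, asymmetry(counts, "H", "Y"))
```

In practice the counts for each amino acid pair would be summed over all retained protein pairs in a species pair before testing for asymmetry.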
For each pair of amino acids in each pair of species, the exact binomial test (for N < 1,000; McDonald 2009, p. 24–32) or G-test of goodness-of-fit (for N > 1,000; McDonald 2009, p. 46–51) was used to test the significance of the deviation from the expected 1:1 ratio.
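A minimal Python sketch of the two tests, using only the standard library, is given here for illustration. The function names are hypothetical. The text does not say which test applies when N is exactly 1,000; the sketch sends that case to the G-test, which is an assumption. The G-test P value uses the chi-square distribution with 1 degree of freedom, without any continuity correction.

```python
import math


def exact_binomial_two_tailed(k, n):
    """Two-tailed exact binomial P value for k of n under a 1:1 expectation.

    With p = 0.5 the distribution is symmetric, so the two-tailed value
    is twice the smaller tail, capped at 1.
    """
    tail = min(k, n - k)
    one_tail = sum(math.comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2.0 * one_tail)


def g_test_one_to_one(k, n):
    """G-test of goodness-of-fit to 1:1; returns (G, P) with 1 df."""
    expected = n / 2
    g = 0.0
    for observed in (k, n - k):
        if observed > 0:
            g += 2.0 * observed * math.log(observed / expected)
    # Survival function of chi-square with 1 df: erfc(sqrt(G / 2)).
    p = math.erfc(math.sqrt(max(g, 0.0) / 2.0))
    return g, p


def asymmetry_p_value(k, n, threshold=1000):
    """Exact test below the threshold, G-test at or above it (assumed)."""
    if n < threshold:
        return exact_binomial_two_tailed(k, n)
    return g_test_one_to_one(k, n)[1]


# Toy counts: 2 sites in one direction out of 10, and 700 out of 1,200.
p_small = asymmetry_p_value(2, 10)
p_large = asymmetry_p_value(700, 1200)
print(p_small, p_large)
```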
To distinguish between asymmetry resulting from genomic GC differences and asymmetry due to other causes, the LOGISTIC procedure of SAS (SAS Institute 2009) was used to perform logistic regression for each pair of amino acids, with the difference in genomic GC content between the thermophile and the mesophile as the independent variable and the proportion of substitutions in one direction as the dependent variable. Logistic regression (McDonald 2009, p. 247–255) finds the best-fitting equation of the form ln[Y/(1 − Y)] = a + bX, where Y is the probability of obtaining a particular value of a nominal variable for a given value of the measurement variable, a is the intercept, b is the slope, and X is the value of the measurement variable. For example, the logistic regression equation for the amino acids histidine and tyrosine (fig. 1) predicts the probability (Y) that a histidine/tyrosine site has histidine in the mesophile and tyrosine in the thermophile for any value of X, the difference in GC content between two species. The significance of the slope was used to test whether there was a significant relationship between the difference in GC content and the pattern of asymmetry. The significance of the intercept was used to test whether the predicted asymmetry for a mesophile–thermophile pair with equal GC contents was significantly different from the 1:1 ratio expected under the neutral model of molecular evolution.
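To make the model concrete, the following Python sketch fits ln[Y/(1 − Y)] = a + bX to grouped counts by Newton-Raphson and reports Wald tests for a and b. It is a stand-in written for illustration, not the SAS LOGISTIC procedure, and its output should not be expected to match SAS in every detail. The function names, the toy counts, and the choice of percentage points as the unit of X are all assumptions.

```python
import math


def _score_and_information(data, a, b):
    """Gradient of the log-likelihood and the information matrix entries."""
    ga = gb = iaa = iab = ibb = 0.0
    for x, successes, trials in data:
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        w = trials * p * (1.0 - p)
        ga += successes - trials * p
        gb += x * (successes - trials * p)
        iaa += w
        iab += w * x
        ibb += w * x * x
    return ga, gb, iaa, iab, ibb


def fit_logistic(data, max_iter=100, tol=1e-12):
    """Fit a one-predictor logistic regression to (x, successes, trials) rows."""
    a = b = 0.0
    for _ in range(max_iter):
        ga, gb, iaa, iab, ibb = _score_and_information(data, a, b)
        det = iaa * ibb - iab * iab
        step_a = (ibb * ga - iab * gb) / det
        step_b = (iaa * gb - iab * ga) / det
        a += step_a
        b += step_b
        if abs(step_a) < tol and abs(step_b) < tol:
            break
    # Standard errors from the inverse information matrix at the estimate.
    _, _, iaa, iab, ibb = _score_and_information(data, a, b)
    det = iaa * ibb - iab * iab
    se_a = math.sqrt(ibb / det)
    se_b = math.sqrt(iaa / det)

    def wald_p(estimate, se):
        # Two-sided P value for a standard normal Wald statistic.
        return math.erfc(abs(estimate / se) / math.sqrt(2.0))

    return {
        "intercept": a,
        "slope": b,
        "p_intercept": wald_p(a, se_a),
        "p_slope": wald_p(b, se_b),
    }


# Invented data: (GC difference in percentage points, sites in one
# direction, total sites) for three hypothetical species pairs.
toy = [(-10.0, 70, 100), (0.0, 50, 100), (10.0, 30, 100)]
fit = fit_logistic(toy)
print(fit)
```

In this toy data set the slope is significantly negative while the intercept is zero, which corresponds to asymmetry that is entirely explained by the GC difference.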
To identify amino acids that deviated from the overall pattern in particular species pairs, the residual (difference between the observed proportion of substitutions in one direction and the proportion predicted by the logistic regression model) was calculated for each amino acid pair in each species pair and then averaged across the 19 pairs involving each amino acid. For this analysis, the proportion of sites with the target amino acid in the thermophile and the other amino acid in the mesophile was used.
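The averaging of residuals can be sketched as follows. The data layout is an assumption made for illustration: each key (a, b) holds the proportion of a/b sites with a in the mesophile and b in the thermophile, so the residual counts as is for b and with the opposite sign for a (because the proportion with a in the thermophile is the complement). The numbers are invented and use only three amino acids; in the real analysis each amino acid would have 19 partners.

```python
from collections import defaultdict


def average_residuals(observed, predicted):
    """Average residual per amino acid, oriented to 'target in thermophile'.

    observed, predicted: dicts keyed by (meso_aa, thermo_aa) giving the
    proportion of sites for that amino acid pair that have meso_aa in the
    mesophile and thermo_aa in the thermophile.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (a, b), obs in observed.items():
        residual = obs - predicted[(a, b)]
        # Target b: this proportion already has b in the thermophile.
        sums[b] += residual
        counts[b] += 1
        # Target a: the complementary proportion, so the sign flips.
        sums[a] -= residual
        counts[a] += 1
    return {aa: sums[aa] / counts[aa] for aa in sums}


# Invented proportions for one hypothetical species pair.
observed = {("H", "Y"): 0.60, ("H", "K"): 0.40, ("K", "Y"): 0.55}
predicted = {("H", "Y"): 0.50, ("H", "K"): 0.50, ("K", "Y"): 0.50}
avg = average_residuals(observed, predicted)
print(avg)
```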
The logistic regression equation for each pair of amino acids was used to predict the expected proportion of substitutions in each direction in a hypothetical species pair that did not differ in GC content. These predicted proportions were multiplied by the total number of substitutions across the nine species pairs for that amino acid pair to yield a synthetic data set. The AAindex list of amino acid indices (Kawashima et al. 2008) was downloaded from http://www.genome.ad.jp/dbget/aaindex.html. Indexes that measure the propensity of amino acids to occur in particular proteins or parts of proteins were deleted, as were those with missing or estimated values. For each index, the difference between the values of the index for each pair of amino acids was used as the independent variable in a simple logistic regression. The dependent variable was taken from the synthetic data set, the expected number of substitutions in each direction in a species pair that does not differ in GC content.
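Setting X = 0 in the logistic equation gives a predicted proportion of 1/(1 + e^(−a)), where a is the intercept. A short Python sketch of how the synthetic counts might be derived from this is given below; the function names and the example intercept and total are hypothetical, and the sketch leaves the expected counts unrounded because the text does not specify any rounding.

```python
import math


def predicted_proportion_equal_gc(intercept):
    """Predicted proportion of substitutions in one direction at X = 0."""
    return 1.0 / (1.0 + math.exp(-intercept))


def synthetic_counts(intercept, total_substitutions):
    """Expected counts in each direction for a species pair with equal GC."""
    p = predicted_proportion_equal_gc(intercept)
    return p * total_substitutions, (1.0 - p) * total_substitutions


# Hypothetical amino acid pair: intercept ln(3), 200 substitutions in total.
forward, reverse = synthetic_counts(math.log(3.0), 200)
print(forward, reverse)
```

An intercept of 0 corresponds to the 1:1 ratio expected in the absence of asymmetry.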
There is extensive substitutional asymmetry; of the 1,710 total comparisons (190 pairs of amino acids in nine species pairs), 1,038 are significantly (P < 0.05) different from the expected 1:1 ratio (supplementary table 1, Supplementary Material online). Each of the 190 pairs of amino acids is significantly asymmetrical in at least one of the nine species pairs, and 125 of the pairs of amino acids are asymmetrical in at least five species pairs.
Some of the asymmetry is associated with differences in GC content. Of the 190 pairs of amino acids, 153 differ in average GC content of their codons (e.g., histidine [H] has an average of 1.5 GC in its codons [CAC, CAT] vs. tyrosine [Y], which has an average of 0.5 GC in its codons [TAC, TAT]). The logistic regression of substitutional asymmetry versus difference in genome-wide GC content has a significant slope for 122 out of these 153 pairs of amino acids (supplementary table 2, Supplementary Material online), indicating that the proportion of substitutions in each direction depends on the difference in genome-wide GC content. Figure 1 shows an example of this; the proportion of H ↔ Y sites with H in the mesophile and Y in the thermophile decreases for species pairs in which the thermophile has greater GC than the mesophile. Of the 37 amino acid pairs with no difference in average GC content of their codons, 15 have a significant slope.
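The codon GC averages quoted above can be reproduced with a few lines of Python. The sketch assumes the standard genetic code and weights all synonymous codons equally; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import combinations, product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

codon_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# Collect the number of G or C bases in each codon, grouped by amino acid.
gc_by_aa = {}
for codon, aa in codon_table.items():
    if aa != "*":
        gc_by_aa.setdefault(aa, []).append(codon.count("G") + codon.count("C"))

# Exact average GC count per codon for each amino acid.
avg_gc = {aa: Fraction(sum(v), len(v)) for aa, v in gc_by_aa.items()}

pairs = list(combinations(sorted(avg_gc), 2))
differing = sum(1 for a, b in pairs if avg_gc[a] != avg_gc[b])
equal = len(pairs) - differing

print(float(avg_gc["H"]), float(avg_gc["Y"]), differing, equal)
```

Under these assumptions the sketch gives 1.5 for histidine and 0.5 for tyrosine, with 153 pairs that differ and 37 that do not, in agreement with the counts in the text.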
Of the 122 pairs of amino acids with differing average GC content and significant slopes, 114 are in the expected direction: sites with the GC-rich amino acid in the mesophile and the GC-poor amino acid in the thermophile become less common in the species pairs where the thermophile has higher genome-wide GC content than the mesophile (supplementary table 2, Supplementary Material online). Seven of the eight pairs of amino acids that show the opposite pattern involve methionine. Sites with aspartic acid, cysteine, glutamic acid, glutamine, leucine, serine, or threonine in the mesophile and methionine in the thermophile become more common as the thermophile–mesophile GC difference increases, even though the methionine codon has a slightly smaller GC content than the codons for the other amino acids.
The logistic regression for 139 out of 190 pairs of amino acids had a significant intercept (supplementary table 2, Supplementary Material online), meaning that a mesophile–thermophile species pair with no difference in genomic GC content would be expected to have significant asymmetry. The intercept of each logistic regression was used to estimate the substitutional asymmetry predicted for a mesophile–thermophile pair with no difference in GC content (table 2). The average of the 19 intercepts for each amino acid gives a measure of how strongly that amino acid is preferred in mesophiles or thermophiles; for example, only 41.6% of the substitutions involving serine would have serine in the thermophile and some other amino acid in the mesophile (table 3).
The residual (the difference between the observed asymmetry and that predicted by the logistic regression) was calculated for each pair of amino acids in each species pair, and the average residual was calculated for each amino acid in each species pair. In some species pairs, the average residual for some amino acids deviates substantially from zero, the value expected if the logistic regression predicted the asymmetry exactly (fig. 2). For example, in the Streptomyces–Thermobifida species pair, there are fewer sites with lysine (K) in the thermophile and other amino acids in the mesophile than predicted by the logistic regression, whereas there are more such sites than predicted in the Deinococcus–Thermus species pair. Out of 180 average residuals (20 amino acids in nine species pairs), 98 have a 95% confidence interval that does not include 0.
After removing indices with missing or estimated values, and indices that represent frequencies in different parts of proteins, the AAindex database (Kawashima et al. 2008) contains 238 measures of biochemical and physical properties of amino acids. Treating the differences in each index between pairs of amino acids as 190 independent values violates the assumption of independence, because each amino acid contributes to 19 of the 190 differences; the results of the logistic regression of substitutional asymmetry versus index differences should therefore be viewed as an exercise in data exploration rather than hypothesis testing. The strongest relationship between the difference in amino acid index and the predicted substitutional asymmetry is with transfer free energy (Simon 1976), a measure of hydrophobicity. In general, amino acids with higher transfer free energy tend to be substituted at high temperatures for amino acids with lower transfer free energy (fig. 3). However, differences in transfer free energy do not explain all the substitutional asymmetry. Of 139 pairs of amino acids with a significant intercept in the logistic regression (meaning that the substitutional asymmetry is predicted to be significant for a mesophile–thermophile pair with no difference in genome-wide GC content), 14 have the opposite pattern: the amino acid with lower transfer free energy is found more often at higher temperatures. The next strongest associations are with several other measures of hydrophobicity (Zimmerman et al. 1968; Jones 1975; Argos et al. 1982; Takano and Yutani 2001), all of which are highly correlated with transfer free energy.
Each of the nine mesophile–thermophile species pairs exhibits a large amount of substitutional asymmetry; for most pairs of amino acids, there are more homologous sites with one amino acid in the mesophile and the other amino acid in the thermophile than the opposite. Substitutional asymmetry has been previously observed in small numbers of proteins from Methanococcus versus Methanocaldococcus (Haney et al. 1999; McDonald et al. 1999), Bacillus versus Geobacillus (McDonald et al. 1999), and Deinococcus versus Thermus (McDonald 2001). Here, I use translated protein sequences from the entire genomes of these species pairs and add six additional mesophile–thermophile pairs from a broad variety of habitats.
Differences in genome-wide GC contents are one cause of substitutional asymmetry; all the species pairs used here differ to some degree in GC content, and it has long been known that amino acids with GC-rich codons are more common in species with GC-rich genomes (Lobry 1997; Singer and Hickey 2000). It is not clear whether differences in genome-wide GC content are caused by selection or mutational bias (Rocha and Danchin 2002; Lind and Andersson 2008), and it is not clear to what extent increased habitat temperatures cause increased GC contents (Musto et al. 2006; Wang et al. 2006). What is clear is that any attempt to identify selection on amino acids as a cause of substitutional asymmetry must remove the effects of GC content.
Here, logistic regression modeling is used to control statistically for the effects of differing GC content, with the difference in GC content as the independent variable and the direction of substitution as the dependent variable. For the majority of amino acid pairs, the logistic regression predicts that a mesophile–thermophile pair of species that did not differ in GC content would exhibit extensive substitutional asymmetry. The significant intercepts in the logistic regression models mean that the preferences for one amino acid over another are fairly consistent across the nine pairs of species.
Substitutional asymmetry in one mesophile–thermophile pair could be caused by any number of habitat differences; for example, the mesophile Methanococcus maripaludis was isolated from a salt marsh (Jones, Paynter, and Gupta 1983), whereas the thermophile M. jannaschii was originally isolated from a deep-sea vent 2,600 m below the ocean surface (Jones, Leigh, et al. 1983). A difference in hydrostatic pressure may favor some amino acids over others (Di Giulio 2005); if hydrostatic pressure were an important selective factor, M. maripaludis and M. jannaschii would have patterns of asymmetry different from the other mesophile–thermophile pairs, which do not differ in the hydrostatic pressure of their habitats. The consistency of the patterns of asymmetry across species pairs suggests that much of the asymmetry results from selection caused by the different habitat temperatures.
Although the patterns of asymmetry are consistent enough across species pairs to produce logistic regression models with significant intercepts, the amounts of asymmetry in each species pair are not exactly as predicted by the logistic regression; many amino acids are favored more or less strongly in some species pairs than would be expected. The optimal temperatures of the species pairs differ by different amounts, from 15 to 55 °C, so it would have been startling if they all exhibited the exact same amount of asymmetry. The species pairs differ in how recently they diverged from a common ancestor, and the species pairs also vary in other aspects that may affect selection on amino acid use: aerobic versus anaerobic; autotrophic versus heterotrophic; marine, freshwater, and terrestrial; and deep sea versus shallow water. Species pairs in which the ancestral species was thermophilic and one lineage then adapted to lower temperatures may show different patterns of temperature adaptation than species pairs in which the ancestor was mesophilic and one lineage adapted to higher temperatures (Berezovsky and Shakhnovich 2005). There is also increasing evidence that biosynthetic costs may affect amino acid use (Akashi and Gojobori 2002; Seligmann 2003; Heizer et al. 2006; Swire 2007), and the costs of particular amino acids will depend on factors that may be unrelated to temperature, such as the biosynthetic pathways used (for autotrophs) and environmental availability and uptake costs (for heterotrophs). Including all the possibly relevant variables when there are only nine species pairs would result in a logistic model that was completely overdetermined, with many spurious correlations; separating the substitutional asymmetry caused by temperature adaptation from the asymmetry resulting from other causes will require examining the genomes of a much larger number of mesophile–thermophile species pairs than currently available.
These results show that amino acids with greater hydrophobicity (higher transfer free energy) tend to be preferred in thermophiles, which is consistent with several earlier studies (Argos et al. 1979; Gromiha, Oobatake, Kono, et al. 1999; Haney et al. 1999; Tekaia et al. 2002; Nakashima et al. 2003; Sadeghi et al. 2006; Berezovsky et al. 2007). There are, however, numerous exceptions to this rule. This is consistent with previous research that has failed to identify a single physicochemical property of the amino acids that would explain all the differences in amino acid abundance between mesophiles and thermophiles (Böhm and Jaenicke 1994; Zhou et al. 2008). One possible explanation is that thermal adaptation of amino acids is based on complicated tradeoffs between different properties (Gromiha, Oobatake, and Sarai 1999). Another possibility is that the cost of synthesizing amino acids plays a major role; the relative synthesis costs of amino acids change as temperatures increase (Amend and Shock 1998), and amino acids with lower synthesis costs tend to be more abundant, even in heterotrophs (Swire 2007). Values for the cost of synthesis of each amino acid in each species at a variety of temperatures are not available; as this information accumulates, it may become possible to understand the role that relative biosynthetic costs of amino acids play in temperature adaptation of proteins.
There are numerous reports of charged amino acids being more common in thermophiles than in mesophiles (Cambillau and Claverie 2000; Das and Gerstein 2000; Szilágyi and Závodszky 2000; Fukuchi and Nishikawa 2001; Vielle and Zeikus 2001; Chakravarty and Varadarajan 2002; Tekaia et al. 2002; Nakashima et al. 2003; Suhre and Claverie 2003; Sadeghi et al. 2006; Berezovsky et al. 2007). That pattern is not apparent here; of 47 significant intercepts in the logistic regression involving one charged amino acid (arginine, aspartic acid, glutamic acid, and lysine) and one noncharged amino acid, 24 have the charged amino acid becoming more common in the thermophiles, but 23 have the charged amino acid becoming less common in the thermophiles (table 2). If histidine, which is weakly charged at physiological pH, is included in the charged amino acids, the result is the same: Of 57 significant intercepts, 28 have the charged amino acid becoming more common in the thermophiles, but 29 have the charged amino acid becoming less common in the thermophiles. Most of the studies reporting increased proportions of charged amino acids in thermophiles have relied heavily on hyperthermophiles, which have optimum growth temperatures of 85 °C to >100 °C, whereas the nine species pairs used here include only one hyperthermophile, M. jannaschii, with an optimum growth temperature of 85 °C. It may be that increasing the overall proportion of charged amino acids is only an important adaptation at very high temperatures.