|Home | About | Journals | Submit | Contact Us | Français|
Colias eurytheme butterflies display extensive allozyme polymorphism in the enzyme phosphoglucose isomerase (PGI). Earlier studies on biochemical and fitness effects of these genotypes found evidence of strong natural selection maintaining this polymorphism in the wild. Here we analyze the molecular features of this polymorphism by sequencing multiple alleles and modeling their structures. PGI is a dimer with rotational symmetry. Each monomer provides a critical residue to the other monomer’s catalytic center. Sequenced alleles differ at multiple amino acid positions, including cryptic charge-neutral variation, but most consistent differences among the electromorph alleles are at the charge-changing amino acid sites. Principal candidate sites of selection, identified by structural and functional analyses and by their variants’ population frequencies, occur in interpenetrating loops across the interface between monomers, where they may alter subunit interactions and catalytic center geometry. Comparison to a second (and basal) species, Colias meadii, also polymorphic for PGI under natural selection, reveals one fixed amino acid difference between their PGIs, which is located in the interpenetrating loop and accompanies functional differences among their variants. We also study nucleotide variability among the PGI alleles, comparing these data to similar data from another glycolytic enzyme gene, glyceraldehyde-3-phosphate dehydrogenase. Despite extensive nonsynonymous and synonymous polymorphism at PGI in each species, the only base changes fixed between species are the two causing the amino acid replacement; this absence of synonymous fixation yields a significant McDonald-Kreitman test. Analyses of these data suggest historical population expansion. Positive peaks of Tajima’s D statistic, representing regions of neutral “hitchhiking,” are found around the principal candidate sites of selection. This study provides novel views of molecular-structural mechanisms, and beginnings of historical evidence, for a long-persistent balanced enzyme polymorphism at PGI in these and perhaps other species.
By using genetic variants of known function to probe organism-environment interactions, biologists can identify evolutionary processes which maintain variation or change it through time (Dean and Golding 1997; Newcomb et al. 1997; Watt and Dean 2000). Variants first identified at functional, organismal, and ecological levels (e.g., Krebs and Feder 1997; Rawson and Burton 2002; Hoekstra and Nachman 2003; Watt 2004) may then be pursued downward to their molecular origins. Conversely, molecular studies of natural genetic variation often find sequence patterns suggestive of natural selection (Aquadro 2000; Aquadro et al. 2001; Kreitman 2000) but seldom have functional impacts and/or ecological interactions of such patterns been studied (e.g., Feder and Mitchell-Olds 2003). Here we explore the integration of these approaches by molecular study of a polymorphism maintained in the wild by functionally based natural selection.
The time scales over which such natural molecular variation persists are of special interest here. Genetic polymorphisms are often thought to be transient phenomena, displaced eventually by fixation due to selection or neutral drift. Some allozyme polymorphisms seem to support this view, being limited to single species and of apparently recent origin (Eanes 1999). Certain scenarios are known to maintain longer term variability within populations or on a species-wide basis, such as frequency-dependent selection in fungal or plant self-incompatibility systems (e.g., Ioerger, Clark, and Kao 1990; May et al. 1999; Uyenoyama and Takebayashi 2004), or coevolutionary host-disease interactions (e.g., Hughes and Nei 1992; Tian et al. 2002; Garrigan and Hedrick 2003). It has seldom been thought that within-population polymorphism affecting metabolism, development, or other aspects of adaptive performance might persist, for example, through heterozygote advantage, over broad spans of time (but see Gillespie 1991; Watt 2004).
In the glycolytic enzyme phosphoglucose isomerase (PGI, EC 126.96.36.199) of Colias butterflies, widespread polymorphism occurs in multiple species complexes (Watt 2004). In lowland Colias, four electrophoretically distinct (electromorph [EM]) PGI alleles, called 2–5 according to their mobilities, are common (Watt 1977). These alleles form 10 EM genotypes, upon which earlier studies found evidence of strong balancing selection at biochemical, organismal, and ecological levels in the wild (Watt 2004). Genotypes’ frequencies have been stable among populations from California to Colorado over 30 years of study (Watt et al. 2003). Until now, though, we have known nothing of the molecular-structural basis for the extensive functional differences among these genotypes, nor have we had evidence of their evolutionary history.
We begin to address these issues by studying the cDNA sequences, including all common EM allele classes, of lowland Colias’ PGI. From these, we build an approximate structural model for Colias PGI based on homology with mammalian PGIs for which high-resolution structures are known. Using it, we study the location of Colias’ PGI amino acid variants, proposing structural hypotheses as to which of these changes may make large contributions to known genotypic functional differences. We report initial results on the frequency and pattern of PGI genetic variation, exon and intron structure, and recombination levels. Finally, we explore PGI’s evolutionary history through its nucleotide variation patterns, comparing them to sequences of another glycolytic enzyme gene, glyceraldehyde-3-phosphate dehydrogenase (G3PD).
Study of the sequence and structure of Colias PGI variants is mandated by earlier findings of large functional differences among the variants’ genotypes and resulting successful predictions of large performance and fitness differences among those genotypes. We summarize these results in context of the recursive stages of natural selection (Feder and Watt 1992):
We now seek, by the study of PGI variants’ molecular sequence and structure differences, to illuminate further this progression from genotypic differences in function to fitness differences in the wild. At the same time, statistical analysis of molecular variation patterns can explore the connection between the observed present-day balancing selection regime and the historical action of selection and/or other demographic processes.
All Colias eurytheme were sampled from a single population just north of Tracy, Calif. Homozygotes of allele copies IBD for each of the four EM allele classes were produced in our laboratory colony of C. eurytheme, thus allowing haplotype determination, comprising haplotypes 2-A, 3-A, 3-B, 4-A, 5-A (table 1). Fourteen other sequences were obtained from individuals, homozygous for the EM 3 or 4 alleles, chosen randomly from one field sample at Tracy on September 12, 2000. These samples, combined, yielded EM allele frequencies representative of a random field sample. For example, the mean frequencies and standard deviation (SD) of EM alleles 2–5 of PGI from eight population samples collected at Tracy between 1980 and 1999 (average n = 141; Watt et al. 2003), were: p2 = 0.06 ± 0.02, p3 = 0.60 ± 0.04, p4 = 0.29 ± 0.03, and p5 = 0.04 ± 0.01. From these frequencies, one would expect among 19 EM allele copies to find 1 two-allele, 11 three-alleles, 6 four-alleles, and 1 five-allele; we found 1, 12, 5, and 1, respectively. Because the sampling was random within each subclass (the actual variant DNA sequences were unknown to us a priori) and because sampling within each allele class reflected natural frequencies, our aggregate sample mimics the properties of a truly random sample of potential genetic variation at the molecular level.
For comparative purposes, alleles from each of the two major PGI EM classes of the closely related species Colias meadii (sampled in the Snowy Range, Wyo.), were polymerase chain reaction (PCR)-amplified and sequenced as per C. eurytheme below.
Total RNA was extracted from 30 mg of abdomen tissue using RNeasy and QIAshredder kits (Qiagen Inc., Valencia, Calif.). GIBCO Moloney Murine Leukemia Virus (GIBCO Inc., Carlsbad, Calif.) was used for reverse transcription (RT) as per the usual protocol. Using degenerate primers based on Drosophila PGI sequence, partial (central) Colias cDNA PGI sequence was obtained by RT-PCR (Pollock 1995). The 3′ end was obtained by standard rapid amplification of cDNA ends (RACE) protocol. The 5′ end was obtained by three rounds of nested PCR following a modified RACE protocol. Terminal deoxynucleotidyl transferase (TdT) tailing was used to add 15–20 G’s to the 5′ end of cDNA following the manufacturer’s protocol; extension time was 10 min at 37°C followed by inactivation via 10 min at 65°C (GIBCO). Five microliters of the TdT reaction was directly used in the first of three rounds of PCRs using a poly C15 primer and a gene-specific primer, PGI-A-389 (5′-ATA ACT TGC GTG GAG AAC TC-3′; primer nomenclature: S = sense direction, A = antisense, and number = nucleotide position of 3′-primer end relative to the start of the PGI gene). First round PCR product was purified (QIAquick kit) and used as template for second round PCR with primer PGI-A-340 (5′ TCA GGT GTG ACA TCT TTA CC-3′); this process was repeated for the third round with primer PGI-A-242 (5′-TCA CCA GCA AAC ATA GCA TC-3′). Each step of the nested PCR increased specificity. These RT-PCRs were done using recombinant Taq polymerase (GIBCO) at 1.5 mM MgCl2 (50 µl volume; cycle regime 94°C 2 min, then 35 cycles 94°C 30 s, 55°C 30 s, 72°C 1 min).
Initial sequencing found a conserved 30-bp region before the start codon, in which a 5′-end primer was designed. A 3′-end primer was designed, after sequencing by primer walking, in a conserved region beyond the stop codon. These were used to amplify PGI alleles from intact cDNA, using HiFi Platinum Taq polymerase (Invitrogen, Carlsbad, Calif.), with final concentration 2 mM Mg++ in 50 µl volume, with a cycle regime of: 94°C 2 min, then 35 cycles 94°C 30 s, 55°C30s, 72°C 3 min, with a final extension at 72°C for 10 min. Both strands of purified PCR products were cycle sequenced using Applied Biosystems BigDye 2.0 chemistry and analyzed with a 377 sequencer. Internal primers were designed for use with the end primers in sequencing; all are listed in table 2. All data were verified by sequencing in both forward and reverse directions. Sites were scored as heterozygous only when both forward and reverse direction sequencing reads agreed clearly in this respect (Gen-Bank accession numbers DQ205063–DQ205076).
Study of Colias PGI’s structure is facilitated by the high conservation of PGI sequence among animals (Jeffery, Hardre, and Salmon 2001). For example, a three-allele amino acid sequence from Colias (3-A of table 1) shows 69% sequence identity and 83% Dayhoff similarity to both rabbit and human PGI sequences. We have thus used crystal structures for rabbit and human PGI (Jeffery, Hardre, and Salmon 2001; Read et al. 2001; Arsenieva and Jeffery 2002) to build a model for Colias PGI (fig. 2) using the web-based Swiss-Model program (Schwede et al. 2003). Swiss-Model takes amino acid sequence as input, searches 3-D structural databases, retrieves candidates, aligns their sequences and structures, and based on these alignments calculates a predicted protein structural model for the input amino acid sequence. Structural homologues identified and used as templates were from the Research Collaboratory for Structural Bioinformatics Protein Data Bank (http://www.pdb.org; Berman et al. 2000): 1HOX.pdb (rabbit PGI bound with fructose-6-phosphate [F6P]), 1G98.pbd(rabbit PGI bound with phosphoarabinonate), and 1IAT.pdb (human PGI bound with sulfate ion) (Jeffery, Hardre, and Salmon 2001; Lee et al. 2001; Read et al. 2001). There is a 1.86-Å root-mean-square deviation between backbone atoms of the 1HOX.pdb structure and of the Swiss-Model results for C. eurytheme.
Solvent-accessible surface (SAS) is a measure of the fraction of each amino acid’s surface exposed to surrounding solvent in the native enzyme structure. SAS was calculated with Molmol (Koradi, Billeter, and Wüthrich 1996) for the dimeric PGI structure, using a solvent radius of 1.4.
Genomic DNA was extracted from a 5th instar larva, IBD for an EM 4 allele (sequence 4-A, table 1) using a Qiagen kit. We amplified a complete genomic copy of PGI from this extract, using HiFi Platinum Taq with extension times of 15 min/cycle and a 25 min final extension. The product was cloned into the XL Topo TA vector (Invitrogen). Purified vector + amplicon was then infected with a transposable element (TE) containing a new antibiotic resistance gene and retransformed (GeneJumper, Invitrogen). One hundred resulting new colonies with the TE randomly inserted were PCR mapped for TE insertion position. Those with insertions suitably spaced across PGI were purified and sequenced, using forward and reverse primers in the TE, for full coverage (GenBank accession number DQ205092).
cDNA of wild C. eurytheme individuals, prepared by RT-PCR as above from the Tracy field sample of September 12, 2000 (above), was amplified by PCR as above for the G3PD gene using HiFi Platinum Taq (Invitrogen) and primers G3P-S-23 and G3P-A-973 (table 1). These and two internal primers, G3P-S-468 and G3P-A-496 (table 1), were used to sequence PCR products as above (GenBank accession numbers DQ205077–DQ205091).
We infer polarity of amino acid or nucleotide changes by the usual criteria for plesiomorphy or apomorphy. Thus a unique variant B, at a site otherwise occupied by A, would be given as A → B. When polarity is ambiguous, we give instead A ↔ B.
Sequence statistics and sequence-based tests of selection used DnaSP 4.0 and Proseq (J. Rozas and R. Rozas 1999; Filatov 2002), unless otherwise noted. Averages of synonymous and nonsynonymous changes were calculated using MEGA2 (Kumar et al. 2001). Incidence of intragenic recombination (Watt 1972) was estimated among the IBD haplotypes (2-A, 3-A, 3-B, 4-A, 5-A) using the conservative “four-gamete” test (Hudson and Kaplan 1985) in DnaSP 4.0. Population expansion was analyzed in two ways. First, whole-gene distribution of nucleotide variation was assessed using Tajima’s D (Tajima 1989) and (Ramos-Onsins and Rozas’ 2002) R2, with significance for both tests determined using 10,000 coalescent simulations with DnaSP, without recombination. Second, we implemented a Bayesian approach for direct estimation of the growth rate for our data, as implemented in LAMARC 2.0 (Kuhner et al. 2004).
Molecular tests of selection using Tajima’s D (Tajima 1989) calculations were run on synonymous site data. Their significance levels were determined by 10,000 coalescent simulations in DNAsp with no recombination. Not using recombination in these simulations makes tests of D conservative. Analysis of the juxtaposition of extreme values in Tajima’s D across exons 7 and 9 was done as follows. For each exon, 10,000 coalescent simulations using Hudson’s MS program (2002) were run using exon size, number of haplotypes sequenced, and number of total changes as inputs (exon 7 = 126 bp, 19 haplotypes, 8 changes; and exon 9 = 162 bp, 19 haplotypes, 13 changes). Sliding window analysis of Tajima’s D (window size = 70 bp, 25 bp steps) was then done on the simulations. Significance was determined by counting how many simulations had windows with more extreme values than those observed.
Most sequences, as documented in table 2 and figure 1, covered the full coding region of 1,668 bp, or 556 codons, of PGI cDNA (eight of 19 sequences lacked nine initial bases, while four lacked the final 249 bases). There were 130 segregating sites, six of which had three variants, for a total of 136 changes; 119 were at synonymous sites (ss) and 17 were at nonsynonymous sites (nss). Theta (θ = 0.0224) and average nucleotide diversity (πtotal = 0.0199) were almost threefold higher than for Drosophila melanogaster PGI and nearly five times higher than the average for D. melanogaster autosomal genes (37). Colias PGI’s synonymous site diversity (πss = 0.0733) was 30 times higher than its nonsynonymous diversity (πnss = 0.0024). The mean (± SD) number of nonsynonymous changes among allhaplotypes was 2.8 ± 0.85, with changes at five out of 17 nonsynonymous sites altering PGI’s charge, always in ways consistent with EM allele mobility (table 2). The 3-EM class has more synonymous variation than the 4-EM class (28.3 ± 3 vs. 22.9 ± 3.1, respectively), but has similar levels of nonsynonymous intra-EM variation (2.2 ± 0.7 vs. 2.2 ± 1.0). There are fewer nonsynonymous differences between the 3-EM and 4-EM classes (2.8 ± 1.0) than in other possible comparisons between EM classes (average = 4.46, table 3). Finally, of the multiply sampled EM classes, 3 EM is more diverse than 4EM (π3 = 0.021 ± 0.002 and π4 = 0.015 ± 0.004).
There were 35 synonymous and 3 nonsynonymous sites among the 38 sites that differed between single haplotypes of C. meadii’s 2-EM and 3-EM alleles (fig. 1). Theta and nucleotide diversity (θ = 0.022, πtotal = 0.022) were roughly similar to C. eurytheme.
Two nonsynonymous changes, but no synonymous ones, were fixed between species.
Extensive polymorphism in Colias PGI contrasts with the overall sequence conservation of this enzyme across diverse taxa, already mentioned. All catalytic center residues identified in vertebrates are conserved between them and Colias. Three-dimensional structure is also conserved: Colias PGI is a dimer with 180° rotational symmetry, like known rabbit and human structures (fig. 2). A notable feature found in other PGIs extends to Colias: each monomer provides a histidine residue (388 in rabbit and 392 in Colias), essential for catalytic function, to the other monomer’s catalytic center. Starting in each monomer’s catalytic center with Glu 361, the peptide chain runs to the surface, making a loop turn near the monomer interface. The chain then turns back, crossing the monomer interface, to place His 392 in the other monomer’s catalytic center before returning to its own monomer. By analogy to mammalian PGI (Jeffery, Hardre, and Salmon 2001; Arsenieva and Jeffery 2002), in each catalytic center His 392 begins isomerization of glucose-6-phosphate (G6P) to F6P (or vice versa) by protonating the hexose ring oxygen. This yields a cis-enediol transition state whose resolution to the aldo (G6P) or the keto (F6P) isomer is facilitated by Glu 361 and by motion of the C-terminal domain (but see Read et al.  for another view of the action of these same residues). Thus, monomers alone are catalytically inactive; the minimum unit of function is the native dimer structure.
This interdependence of monomers, central to PGI’s catalytic mechanism, may predispose PGI to the heterozygote functional advantage often seen in Colias and other taxa (Gillespie 1991; Watt 2003). It may also explain the reluctance of enzyme preparations from different homozygotes of Colias PGI, when mixed, to exchange monomers during thermal cycling or other manipulations which can “reshuffle” subunits of other multimeric proteins (P. M. Schulte and W. B. Watt, unpublished data).
Amino acid changes in an enzyme’s catalytic center sharply alter its substrate specificity or reaction mechanism (Dean and Golding 1997; Newcomb et al. 1997). But often, as in lactate dehydrogenase (LDH) (Gerstein and Chothia 1991; Fields and Somero 1998), quantitative variation of catalytic performance occurs via amino acid changes across the enzyme surface rather than in the catalytic center. Despite their distance from the catalytic center, such changes can alter its flexibility and/or geometry, hence enzyme kinetics (Gerstein and Chothia 1991; Watt and Dean 2000). Further, the integrity of the monomer interface is essential for PGI’s catalysis (Bruch, Schnackerz, and Gracy 1976), so amino acid variation in or near this interface could affect catalysis. We thus expect the amino acid changes which underly Colias PGI’s genotypic functional and fitness differences to occur outside the catalytic centers but around the monomer interface or in other surface regions whose change can affect catalytic center performance.
All variable amino acid residues in C. eurytheme PGI are substantially exposed to solvent (SAS > 24%), with the exception of 538 Ile → Val (SAS < 3%; fig. 3 and table 4). Conservative changes such as Ile → Val have small effect in protein interiors (Bordo and Argos 1991). This concentration of natural variation at the enzyme s surface is similar to findings with other enzymes (e.g., Bustamante, Townsend, and Hartl 2000; Watt and Dean 2000) and is thought to reflect structural constraint against physicochemical changes in interior-core sequences. In contrast, synonymous base changes, which do not change the encoded amino acid residues, show no relation to the solvent accessibility of the amino acids which they encode (table 4).
Amino acid variants which change charge demarcate EM allele classes by definition. What do they contribute to the known functional differences among the EM alleles’ genotypes? Are there charge-neutral differences among the EM classes which also contribute? The structural nature and placement of variants may suggest answers to these questions.
Many charge-neutral amino acid changes might have occurred in strong linkage disequilibrium with the defining charge changes, but this was not found (table 2). Setting aside a unique allele in the 4-EM–allele class, discussed below, five amino acid change sites distinguish the allele classes consistently in C. eurytheme, and four of these are the defining charge changes. Of the four classes, only the 3-EM class has no uniquely diagnostic amino acid state distinguishing it from all three others.
In the 2-EM allele, 375 Asp/Glu → Gln replaces a negatively charged carboxyl group with a polar but uncharged amide group, compared to other sequences (table 2). This site is positioned in the “interpenetrating loop” (cf. above), at the enzyme surface and at one edge of the monomer interface (fig. 4). One charge-neutral variant at site 41 is unique to the 2-EM allele: Asn → Ile, decreasing polarity and increasing side chain bulk.
Common variants of the 4 allele (table 2, 4-B–4-E) show Arg → Cys at site 369. This change removes a positively charged guanidino group and the bulk of five backbone C and N atoms and their bound hydrogens. It too occurs in the interpenetrating loop near the monomer interface (fig. 4). The 5-EM allele shares 369 Arg with the 2-EM and 3-EM alleles, but has Asn → Asp at site 21, adding a negative charge, and Lys → Met at site 317, removing a positive charge.
The largest within–EM-class difference is the unique A allele of EM class 4, which mimics the usual 4-allele 369 Arg → Cys charge change with 131 Ala → Glu while keeping 369 Arg. Comparing the functional effects of these structurally very different, but electrophoretically similar, changes will be of much interest for future study. Otherwise, there are charge-neutral changes as singletons at sites 50, 110, 144, 152, and 458, and others at sites 128, 375, 450, 463, and 538 that segregate more widely.
Between the two C. meadii allelic sequences, we find two charge-neutral amino acid changes, 67 Thr ↔ Ala and 235 Ala ↔ Ser, and one charge-changing variant, 51 Lys ↔ Asn. The two species differ by only two fixed nonsynonymous base changes, at positions 1 and 2 of codon 370, changing Gly in C. meadii to Ser in C. eurytheme. The 370 Gly, codon GGG, is also fixed in a larger sample of C. meadii and may be ancestral among North American Colias because basal Colias palaeno also shows GGG at codon 370 (based on 12 C. meadii and 2 C. palaeno haplotypes; C. W. Wheat, unpublished data). The 370 Gly → Ser lies between eurytheme’s polymorphic sites 369 and 375 in the interpenetrating loop. It accompanies a large overall increase of thermal stability of C. eurytheme genotypes compared to those of C. meadii and a reversal in the order of kinetics versus stability trade-offs between EM 2 and EM 3 homozygotes of these species (Watt, Donohue, and Carter 1996).
The presence of so many EM-class–distinctive variants at sites 369 and 375 in the interpenetrating loop and at the monomer interface suggests that these are the most important foci of the kinetics and stability differences among PGI genotypes of C. eurytheme. (This is also likely to be true of the functional effects of the interspecific fixed difference at site 370.) The more structurally remote changes among the C. eurytheme EM classes, at sites 21, 41, and 317, could have indirect but also important effects on catalysis, as, for example, varying hydration due to the charge changes at 21 or 317 could alter protein conformation (as is seen in human hemoglobin, Perutz 1983). These now comprise working hypotheses for future testing in detailed structure-function studies. They also give ordered expectations for the location and strength—369, 375 > 317, 21, 41 > within-class variants—of possible effects of selection on patterns of variation in PGI DNA sequence, to be studied here below.
The evolutionary dynamics affecting the observed variation in Colias PGI depend partly on the arrangement of the gene on its chromosome. By sequencing genomic DNA of the IBD EM 4-A allele, we found that the 1,668 bp PGI cDNA is divided among 12 exons, averaging 139 bp each, and separated by 11 introns of length 296 to 2,445 bp, forming a total transcribed region of over 10,000 bp (fig. 1). We also found initial evidence for some intron length polymorphism in the wild (C. W. Wheat, unpublished data). Residues 369 and 375, candidate foci of functional differences among EM allele classes, are located in exon 9, which is 8.1 and 2.2 kb from the start and stop codons, respectively. Based on the DNA sequence data from our IBD allele copies, there have been at least 11 intragenic recombination (IGR) events, all upstream of exon 9 (fig. 1).
For comparison to molecular studies of PGI, we sequenced the coding region of G3PD, obtaining 951 bp from 13 C. eurytheme for a total of 26 sequences for analysis, and from 2 C. meadii for 4 sequences for analysis. As another glycolytic enzyme, G3PD offers an important “control” on the study of PGI, allowing independent assessment of selectively neutral, genome-wide effects of past population size changes.
We found 44 segregating sites in C. eurytheme G3PD, only one of which caused an amino acid change (and this occurred in only one individual): Lys → Asn, G → Tat bp 804 (fig. 5). Theta (θ = 0.0121) and average nucleotide diversity (πtotal = 0.0089) and synonymous site diversity (πss = 0.0356) were about half the levels seen at PGI. Our four C. meadii sequences had six segregating sites, all of which were synonymous changes, with theta and nucleotide diversity (θ = 0.0034, πtotal = 0.0031) much lower than C. eurytheme.
Disentangling the effects of neutral demographic change from those of selection is a critical goal in molecular-evolutionary analyses. But while selection is expected to act on specific genes, population size changes should affect entire genomes. Thus, molecular-evolutionary analysis benefits greatly from the study of multiple genes.
Here we test in two ways for the effects of historical population size change in our samples of PGI and G3PD from C. eurytheme. First, we compare the observed distribution of nucleotide variation to coalescent simulations for equilibrium populations via Tajima’s D and the R2 test. The Tajima’s D statistic tests for a departure of synonymous variants from neutral expectations in a population sample: many low-frequency polymorphic sites yield significant negative values, indicating a history of population expansion or directional selection, while intermediate variant frequencies yield significant positive values, indicating historical population contraction or balancing selection (Tajima 1989). For PGI as a whole, Tajima’s D = −0.300 (P = 0.43), and for G3PD as a whole, D = −1.01 (P = 0.15). The R2 test compares the number of singleton mutations to the average number of nucleotide differences within a sample, with lower values indicative of population expansion (Ramos-Onsins and Rozas 2002). For PGI, R2 = 0.11 (P = 0.273) and for G3PD, R2 = 0.08 (P = 0.063).
The R2 test has greater statistical power to detect population change than Tajima’s D, and the near-significance of its results with G3PD are suggestive of population growth. However, the power of both these tests are very sensitive to both recombination and the timing, as well as duration, of demographic changes (Ramos-Onsins and Rozas 2002). We can circumvent these limitations by directly estimating the exponential growth rate using LAMARC. Our Bayesian estimation of g, the exponential growth rate, consisted of three separate analyses on the G3PD data set spanning a range of prior g values (1000, 1, and −100). Each run used the default conditions of an initial 10 chains, 500 sampled trees, three chain temperatures (1, 1.1, and 1.2), and a final run of two chains with 10,000 trees sampled. All runs converged on similar unimodal distributions, with an average best estimate of g = 933 and lower and upper average 99% confidence limits of 348 and 1,023, respectively. Significantly positive g values are evidence of population growth (Kuhner et al. 2004). Maximum likelihood estimations gave nearly identical results.
The neutral theory of population genetics provides null hypotheses against which patterns of molecular variation in the wild can be tested (e.g., Kreitman 1996). A variety of methods have been developed to do so. We apply some of these to the Colias PGI system, asking if there is evidence for historical selective action related to the known present-day selective regime.
For neutral changes, the proportion of nonsynonymous to synonymous variants which are fixed between species should reflect the proportion of nonsynonymous to synonymous variants which are polymorphic within species (McDonald and Kreitman 1991). Interspecies directional selection will cause an excess of fixed nonsynonymous differences, purifying selection within species will lower nonsynonymous variation, and balancing selection will yield high nonsynonymous and synonymous variation within species compared to fixed differences between them (McDonald and Kreitman 1991; Sawyer and Hartl 1992). Among C. eurytheme and C. meadii PGI sequences, we find 126 synonymous and 20 nonsynonymous polymorphic sites. From their ratio, 6.3:1, neutrality predicts ~13 synonymous fixations alongside the two observed interspecies nonsynonymous fixations. But, no fixed synonymous sites were found (above). These data differ significantly by Fisher’s exact test, P = 0.021, following Moriyama and Powell (1996) and by Goldstein’s (1964) exact binomial test, x* = 3.41, P = 0.0006.
We cannot perform this test on G3PD. Although there are 1 nonsynonymous and 45 synonymous polymorphic sites, there are no fixed differences, synonymous or not, between C. meadii and C. eurytheme.
As discussed earlier, the Tajima’s D statistic can test for departure of synonymous variants from neutral expectations in a population sample, thereby testing for hitchhiking of neutral variants around selected sites. Here we use a 70-bp (half the average exon length) sliding window to analyze Tajima’s D for synonymous sites, as a compromise width large enough to capture potential clouds of variation, yet small enough to detect variation in Tajima’s D across an exon. For the latter purpose, use of a small window is critical given observed levels of nucleotide diversity and recombination (Andolfatto and Nordborg 1998; Nordborg and Innan 2003).
As context for these analyses, recall that our structural analyses above identified the EM-class–distinctive amino acid change sites 369 and 375 as the most likely important foci of known functional and thence selectively maintained differences among C. eurytheme’s PGI genotypes, followed by possible but less certain roles for variants at sites 21, 41, and 317. Further, of these latter EM-class–distinctive sites, 317 in exon 7 is within 1 kb, in genomic sequence, of 369 and 375 in exon 9, whereas sites 21 and 41 are separated from exons 7 and 9 by 4 to 5 kb and many observed intragenic recombination, IGR, events (above, fig. 1).
We have tested these EM-class–distinctive sites for historical extent of functionally based balancing selection on PGI. As noted above, averaged over the whole length of this gene, Tajima’s D is slightly though nonsignificantly negative. Because G3PD’s overall D value was also negative, these results together suggest historical population expansion and hence a weak but general genomic trend toward negative Tajima’s D values. However, given sufficient time of persistence of balancing selection, neutral hitchhiking variants at intermediate frequencies should accumulate around the selected sites, creating local “islands” of positive D values in contrast either to simple neutrality or to population expansion favoring negative values of D.
Sliding window Tajima’s D analysis of silent sites across C. eurytheme PGI, exon by exon, finds two positive peaks, with the highest (D = +1.71, P = 0.04) in the 5′ end of exon 9 and the other in the 3′ end of exon 7 (D = +1.59, P = 0.04) (fig. 6; this was not a post hoc search for significance, so identified peaks were not subject to multiple test correction). (Sliding windows of 35 and 140 bp identified the same peaks.) These exons contain important candidate foci of balancing selection: sites 369 and 375 in exon 9, and site 317 in exon 7.
Indeed, patterns of Tajima’s D within exons 7 and 9 further support our identification of these sites as selective foci. In exon 7, where the 3′ end including site 317 has D = +1.59, the 5′ end has D = −0.73; in exon 9, where the 5′ end including sites 369–375 shows D = +1.71, the 3′ end shows D = −1.88. Such juxtapositions of extreme values observed in these exons occur rarely in coalescent simulations (exon 7, P = 0.004 and exon 9, P = 10−4), and in both cases the amino acid candidates for foci of balancing selection are at the ends with highly positive D values.
The other possible EM-class–distinctive changes, at sites 21 and 41, receive no support from Tajima’s D for historical selection on them.
While synonymous nucleotide diversity (πss) shows a generally high level across the broad center of PGI cDNA (fig. 6), the gene as a whole (above) showed a negative average value of Tajima’s D. But in post hoc exon by exon analysis, at the 5′ and 3′ ends, exons 1 and 12, respectively, show D = −1.86 (P = 0.053) and D = −2.04 (P = 0.016), although when the Dunn-Sidak multiple test correction for k = 12 exons is applied (Sokal and Rohlf 1995), these lose significance with the new significance threshold of α′ = 0.0043. These observations do further reinforce the overall gene-wide contrast to the positive D values localized around the most prominent candidate sites for known selection.
To facilitate following discussion, we summarize the diverse findings reported above:
We now consider implications and new working hypotheses arising from these findings.
The nucleotide diversity of Colias’ PGI may have the highest value yet seen for a gene in a well-sampled single population of insects. Compared to C. eurytheme G3PD, and diverse genes in D. melanogaster (unusual among plants and animals in not having PGI allozymes), Colias PGI (π = 0.019) is much more variable, for example, than C. eurytheme G3PD, π = 0.009; in D. melanogaster PGI, π = 0.0008; Est6, π = 0.006; adenosine diphosphate, π = 0.008), and genomic average, π = 0.004 (Moriyama and Powell 1996). There is one other such study in Lepidoptera, on pheromone-binding protein in the moth Ostrinia nubilalis (π = 0.017) (Willett and Harrison 1999). We need more data to know if insects with high chromosome numbers (Lepidoptera often have n ≥ 30 chromosomes) and holocentric chromosomes show elevated levels of intrapopulation genetic variation.
Previous work on Colias PGI used allozyme electrophoresis to identify genetic variation for biochemical and field study. We now find that the EM alleles differ at multiple amino acid positions. These include electrophoretically cryptic charge-neutral variants, most of which are singletons, with the most consistent differences among EM classes being specific charge-changing amino acids. But this need not have been the case. For example, charge-neutral amino acid variants of D. melanogaster phosphoglucomutase, differing significantly in function, show extensive clinal variation across the eastern United States, while no such clines occur among charge-changing variants (Verrelli and Eanes 2001a, 2001b). Indeed, the 370 Gly Ser replacement between C. meadii and C. eurytheme PGI alleles, found here, is charge neutral and indetectable by electrophoresis except at very high resolution, yet it accompanies major changes in PGI genotypic function and resulting fitness components between these taxa (Watt, Donohue, and Carter 1996). To what extent such complex interplays of charge-changing and charge-neutral variants may occur in other allozyme systems remains to be addressed by functional genomic studies.
There are two likely sources of the constraint against amino acid variation in enzyme cores. First, the integrity of a protein’s hydrophobic core affects the stability of its whole tertiary structure. This has led protein chemists to postulate restriction of the flexibility needed for enzyme catalysis to the outer portions of enzymes’ structure (e.g., Gerstein and Chothia 1991). Second, a protein must self-assemble as it is made. This assembly has complex dynamics, and any one mature structure may be reachable by only a few folding pathways. Along these, the formation of secondary and tertiary substructures, which can then undergo mutual “docking,” will depend on repeatable associations of core residues (Fersht et al. 1991).
The relaxation of these constraints, as one passes from protein interiors to protein surfaces (Bustamante, Townsend, and Hartl 2000), allows room for changes showing a wide range of functional and fitness differences among genotypes (e.g., Watt and Dean 2000). Our sequencing has apparently captured a broad subset of such changes, from large down to neutral, or nearly so, in functional effect. While amino acid variation in enzymes or other proteins may sometimes be neutral or under purifying selection (Kimura 1983; Bustamante, Townsend, and Hartl 2000), it is also true that if only a protein’s surface is free to vary, then that surface must be where adaptive, as well as neutral, variation is to be found (Fields and Somero 1998; Watt and Dean 2000).
In both Colias species complexes studied so far (Watt 1983; Watt, Donohue, and Carter 1996), there is widespread heterozygote advantage in kinetics, expressed in high values of Vmax/Km and largely in small values of Km. Further, all homozygotes display consistent trade-offs of kinetic performance versus enzyme stability. How can we begin to account for these in terms of our present structural findings?
The reciprocal connection of PGI monomers by the interpenetrating peptide loops between their catalytic centers ensures strong interaction of monomer structures. The concentration of differences among C. eurytheme’s EM alleles into the 361–392 loop and monomer interface region is also striking. These facts together lead to an hypothesis testable by more detailed molecular study of structure and function among Colias’ PGI genotypes. Following Monod, Wyman, and Changeux (1965) (without implication of allosterism for which there is no evidence here), we propose that the symmetry of the PGI dimer and the close connections of its monomers may impose rigid steric constraints on homodimers whose monomers have, ipso facto, identical amino acid sequences. Such constraints may force homozygotes’ PGI into distinct alternative conformation states, resulting either in greater stability or in greater flexibility which can support tight substrate binding and high Vmax/Km ratios, but not both. This would produce the tradeoffs of stability versus kinetics seen in homozygotes of both species. In contrast, in the heterodimers of heterozygotes, alternation of key amino acids in the loop-interface region, for example, 369 Arg/369 Cys in C. eurytheme’s 3/4 heterozygote, might release the structure from such symmetry-based constraints. Heterodimers could then assume a range of intermediate configurations, combining near-maximal stability with nonadditivity or actual heterosis in kinetics, as seen especially in C. eurytheme’s 3/4 and C. meadii’s 2/3 genotypes.
Colias’ PGI polymorphism now presents a fine opportunity to “dissect” adaptive enzymatic structure-function relationships using genetic variants. This may be feasible either by directed mutagenesis, as done for interspecies differences in fish LDH sequences (Johns and Somero 2004), or by use of natural recombinants found in the wild, many of which differ pairwise by only single amino acids. This would apply equally to study of amino acid variation between and within EM classes. As noted earlier, our original studies of function among EM genotypes (Watt 1983) used IBD representative alleles from each class, to avoid confounding by mixtures of cryptic alleles. Now, we can discover, by comparing functional parameter values, exactly which within-class alleles were used in that work. It is an open question to what extent within-EM class variation accounts for some of the variance found around the mean differences of performance and fitness values, successfully predicted from our functional studies, among EM-class genotypes in the wild (e.g., Watt 1992).
The change at PGI’s codon 370 from Gly in C. meadii (and C. palaeno) to Ser in C. eurytheme, in the interpenetrating loop between monomers, is striking in more than one way.
A likely candidate substitution path for the change is apparent. Fixation of the Ser codon, TCG, in C. eurytheme would have occurred most parsimoniously in two mutational steps (different transversions, G → T and G → C) from ancestral Gly, GGG. Possible intermediates would be Ala, GCG, or Trp, TGG. Ala would be far less disruptive of the monomer interface or the interpenetrating loop than Trp. Continued study of the phylogenetics of Colias (Pollock et al. 1998; C. W. Wheat and W. B. Watt, unpublished data) and search for transitional states of the polymorphism in intermediate relatives as thus identified may eventually allow us to test these possibilities.
Further, the complete “overturn” of the identities of selected polymorphic variants between C. meadii and C. eurytheme, accompanying this fixation, underscores the strength of selection in recruiting novel variants as well as the speed of reassembly of neutral hitchhiker associations with some of them. The interplay between directional selection during taxon separation and ongoing balancing selection within each taxon, implied by this apparent history, is remarkable.
The extensive subdivision of the PGI gene by introns may have impacted its evolutionary dynamics and may have recorded evidence of the history of those dynamics. For example, the negative Tajima’s D values in the 3′ half of exon 9, following an area of hitchhiking of neutral variants around candidate sites for the focus of balancing selection may reflect other kinds of events downstream of this exon. The number and size of PGI’s introns will have facilitated the action of intragenic recombination, IGR, by expanding the coding region across its chromosomal locus by nearly a factor of 10. IGR may thus have contributed to the evolution of PGI exons by generating diverse new allelic combinations among segregating sites, as long ago proposed (Watt 1972, see above). Further study will clarify how recombination, intron length variation, and neighboring gene evolution interact with selection to shape variation in this chromosomal region.
Our comparative analysis of PGI and G3PD nucleotide variation allows us to begin teasing apart the effects of demography from that of selection. The large number of low-frequency polymorphic sites at the G3PD gene and in the majority of the PGI gene in C. eurytheme are striking. The most parsimonious explanation for these observations is historical population expansion, perhaps due to range expansion after the Wisconsin glaciation or possibly an event such as the divergence of the lowland complex including C. eurytheme from C. meadii and other basal taxa. In contrast to such patterns, neutral variants closely linked to sites under balancing selection are held longer in populations at intermediate frequencies than would be so under genetic drift, creating a cloud of neutral, hitchhiking variants around selected sites (e.g., Kaplan, Hudson, and Langley 1989). Thus, demographic and selective effects upon the PGI gene lead to opposite expectation of Tajima’s D (population expansion = negative Tajima’s D, balancing selection = positive Tajima’s D). Furthermore, the high levels of intragenic recombination found at PGI should restrict the physical extent of hitchhiking (Asmussen and Clegg 1981; Andolfatto and Nordborg 1998). Our Tajima’s D analysis shows the interaction of these effects at PGI, in the extensive negative D values across the gene and in the narrowness of the islands of positive D around the sites which are candidate foci of functional and fitness differences.
This work and that of Garrigan and Hedrick (2003) on major histocompatibility complex variants offer important insights into the ability of Tajima’s D to detect the action of independently demonstrated natural selection within a diploid population. Neutral hitchhiking variation, on which this and other such tests depend, requires substantial time to accumulate (Garrigan and Hedrick 2003). Thus, our detection of a selection signature with Tajima’s D suggests that selection has acted as strongly on Colias’ PGI in the past as it does now. While we cannot yet estimate the age of the balanced polymorphisms in Colias PGI, all populations of Colias studied to date exhibit multi-EM–allele PGI polymorphism (>20 species from North America and Eurasia, e.g., Watt 2003).
In contrast, the cases of human hemoglobin and G6PD polymorphism, where the action of balancing selection has also both been measured in the wild and pursued to its molecular origins, show no such hitchhiking signature, probably due to the recent rise of malarial selection pressure maintaining this variation (Harding et al. 1997; Tishkoff et al. 2001; Verrelli et al. 2002). Moderate to high levels of intragenic recombination may also make this “footprint” of balancing selection so narrow as to be undetectable (Andolfatto and Nordborg 1998; Nordborg and Innan 2003). Thus, failure to detect a selection signature with such tests is ambiguous. Even when such indirect tests are significant, they inform us neither about the current level nor the mechanism of selection (Simonsen, Churchill, and Aquadro 1995; Garrigan and Hedrick 2003). In conclusion, rather than supporting the sole use of such indirect tests, our findings argue for greater emphasis on combining evidence from functional and fitness measurements with molecular analysis of the genes in question.
This work connects formal molecular-evolutionary study, emphasizing statistical variation patterns, with the mechanistic study of the processes shaping those patterns. We have focused on the natural connection between these perspectives through the evolutionary recursion (above, Feder and Watt 1992). Each element of this recursion offers in turn new insights and predictive abilities, integrating genomic and protein structural variation with its biochemical and organismal performance, and measurable fitness consequences in well-studied populations.
The pervasiveness of PGIEM polymorphism in a broad range of prokaryote, plant, and animal taxa as noted earlier offers many potential tests of the generality of our results. Using indirect molecular tests of selection, some other studies have also rejected neutral patterns of PGI nucleotide variation in favor of balanced polymorphism (Katz and Harrison 1997; Filatov and Charlesworth 1999). A correlation has recently been found in beetles among PGI EM alleles, their functional properties, and HSP 70 expression in what appears to be selection on the thermal stability of PGI and its impacts on chaperone maintenance of cytosol protein integrity (Dahlhoff and Rank 2000; Rank and Dahlhoff 2002). Each of these studies, in its own way, extends the general question we have begun to address here: why does strong selection maintain persistent variability among such divergent species? Our results here on PGI, as well as those of Powers et al. (1991) on fish LDH, emphasize the hitherto understudied role of heteromultimeric functional interactions in such scenarios of enzyme evolution. Evolutionary insight into the common features of these and other such cases, building on elementary ideas already in view (Powers et al. 1991; Watt 2000; Feder and Mitchell-Olds 2003), will illuminate this question further.
Parts of this work were submitted by C.W.W. and D.D.P. to Stanford University in partial fulfillment of requirements for their Ph.D. degrees. We thank S. Ramos-Onsins for his help in coalescence analysis and C. Boggs, M. Feder, R. Hudson, T. Mitchell-Olds, D. Petrov, D. Rand, R. Simoni, G. Somero, J. Stamberger, C.Yanofsky, and several anonymous reviewers for stimulating discussions and/or for helpful commentary on this manuscript. We have been partly supported by the Ford Motor Co. Fund through the Center for Evolutionary Studies at Stanford, by the U.S. Department of Energy (grant ER-61667 to W.B.W.), by the U.S. National Science Foundation (graduate fellowship to C.W.W. and grant IBN 01-17754 to W.B.W.), and the Max Planck Gesellschaft. Our findings are our own and represent no official policy of any government agency or corporate entity.