|Home | About | Journals | Submit | Contact Us | Français|
A combination of selective and neutral evolutionary forces shape patterns of genetic diversity in nature. Among the insects, most previous analyses of the roles of drift and selection in shaping variation across the genome have focused on the genus Drosophila. A more complete understanding of these forces will come from analyzing other taxa that differ in population demography and other aspects of biology. We have analyzed diversity and signatures of selection in the neotropical Heliconius butterflies using resequenced genomes from 58 wild-caught individuals of Heliconius melpomene and another 21 resequenced genomes representing 11 related species. By comparing intraspecific diversity and interspecific divergence, we estimate that 31% of amino acid substitutions between Heliconius species are adaptive. Diversity at putatively neutral sites is negatively correlated with the local density of coding sites as well as nonsynonymous substitutions and positively correlated with recombination rate, indicating widespread linked selection. This process also manifests in significantly reduced diversity on longer chromosomes, consistent with lower recombination rates. Although hitchhiking around beneficial nonsynonymous mutations has significantly shaped genetic variation in H. melpomene, evidence for strong selective sweeps is limited overall. We did however identify two regions where distinct haplotypes have swept in different populations, leading to increased population differentiation. On the whole, our study suggests that positive selection is less pervasive in these butterflies as compared to fruit flies, a fact that curiously results in very similar levels of neutral diversity in these very different insects.
GENETIC variation within and between populations is shaped by numerous factors. In particular, genetic drift is stronger in smaller populations, such that organisms with larger population sizes should be more diverse under neutral evolution. However, it has long been known that the amount of genetic variation does not always scale as expected with population size, with a deficit of genetic variability in larger populations as compared to the neutral expectation (Lewontin 1974). This has become known as “Lewontin’s paradox.” It is likely that this paradox can be explained by considering the influence of natural selection (Ohta and Gillespie 1996; Leffler et al. 2012; Cutter and Payseur 2013; Corbett-Detig et al. 2015). Since drift can act to retard selection, natural selection tends to be more efficient in organisms with larger population sizes. Consistent with this, estimated rates of adaptive evolution are often greater for smaller organisms with larger population sizes. For example, it has been estimated that >50% of amino acid substitutions between fruit fly species are driven by positive selection (Sella et al. 2009; Messer and Petrov 2013), but in humans, <15% of recent amino acid substitutions appear to have been driven by selection (Eyre-Walker 2006; Messer and Petrov 2013). Considering the relative importance of natural selection and genetic drift in maintaining genetic diversity has important implications for explaining current patterns of biodiversity and predicting future adaptive potential (Gillespie 2001; Leffler et al. 2012).
The solution to Lewontin’s paradox appears to lie in the influence of natural selection on linked sites. Selection acting on one locus can cause the removal of genetic variation at physically linked, neutral loci. This can occur either through fixation of beneficial alleles (“hitchhiking”) (Maynard Smith and Haigh 1974) or by purging of deleterious alleles (“background selection”) (Charlesworth et al. 1993). Both of these processes have more pronounced effects in genomic regions of lower recombination rate. The importance of linked selection is supported by a positive correlation between recombination rate and neutral genetic diversity, first and most thoroughly studied in Drosophila melanogaster (Begun and Aquadro 1992; Langley et al. 2012; Mackay et al. 2012; McGaugh et al. 2012; Campos et al. 2014), and subsequently observed in other taxa, including humans (Nachman et al. 1998; Payseur and Nachman 2002; McVicker et al. 2009; Lohmueller et al. 2011), yeast (Cutter and Moses 2011), mice, rabbits (Nachman and Payseur 2012), and chickens (Mugal et al. 2013). A major recent advance has come from a population genomic analysis of 40 species, which showed not only that this phenomenon is widespread in plants and animals, but importantly, that the effectiveness of selection at removing variation at linked sites is correlated with population size (Corbett-Detig et al. 2015). The increased effectiveness of natural selection in larger populations reduces genetic diversity to a much greater extent than in smaller populations, countering to some degree the reduced influence of genetic drift.
However, this correlative evidence fails to capture the complexities of how natural selection acts in different species. A range of factors will affect the efficiency of natural selection and how strongly it influences linked sites, including the recombinational landscape across the genome, the frequency of adaptive change, and historical population demography (Cutter and Payseur 2013). These factors vary enormously between species, and in-depth analyses of an increasing number of taxa have revealed that not all conform to the same general trends. For example, certain plant species do not show a correlation between recombination and neutral polymorphism (see Cutter and Payseur 2013 for a thorough review). In the insects, most of what we have learned about the action of selection and drift in natural populations comes from studies of the genus Drosophila (Andolfatto 2007; Sella et al. 2009; Sattath et al. 2011; McGaugh et al. 2012; Campos et al. 2014; Comeron 2014; Lee et al. 2014). For example, in Drosophila simulans, genetic diversity is strongly reduced in the vicinity of recent nonsynonymous substitutions, indicative of strong hitchhiking around beneficial mutations (Sattath et al. 2011; Lee et al. 2014). This contrasts with a more subtle pattern in humans, where a reduction in diversity around functional substitutions is only detectable after accounting for background selection (Hernandez et al. 2011; Enard et al. 2014). It remains to be seen whether the rampant selection seen in Drosophila spp. is typical of insects.
Here we investigate the action of selection and other evolutionary forces in Heliconius butterflies, focusing in particular on H. melpomene. This species differs from D. melanogaster in a number of ways that might influence patterns of selection across the genome. Populations of Heliconius live in tropical rainforests and are characterized by long life spans and stable populations (Ehrlich and Gilbert 1973). In addition, H. melpomene has a similar per base recombination rate to D. melanogaster, but more chromosomes (21 compared to 4), potentially allowing higher overall recombination rates. Although the ecology and evolution of this genus has been the subject of much research (reviewed by Merrill et al. 2015), including recent genomic studies of adaptation and speciation (Arias et al. 2012; Nadeau et al. 2012, 2013; Kronforst et al. 2013; Martin et al. 2013; Supple et al. 2013), whole-genome studies of selection and within-species genetic diversity have been lacking. Data from H. melpomene was included in the recent comparative study of Corbett-Detig et al. (2015). However, only four individuals from a single population were considered. Here we examine in detail the action and influence of natural selection within and between populations and species using whole genome resequencing data from 59 H. melpomene individuals and an additional 21 samples from 11 related species. We first identify four large but cohesive populations and then explore genetic variation within and between populations and species, describing the footprints of various selective and neutral processes.
The analyzed genome sequences from 80 butterflies included both published and new data. Sample information and accession numbers are given in Supplemental Material, Table S1. The 58 wild-caught H. melpomene samples cover much of the species range and included 13 wing pattern races. We also reanalyzed sequence data from a single individual from the inbred H. melpomene reference strain (Heliconius Genome Consortium 2012). For sequences generated in this study, methods were as described by Martin et al. (2013). All sequences analyzed here consisted of paired-end reads obtained by shotgun sequencing using either Illumina’s Genome Analyzer IIx system or Illumina’s HiSeq 2000 system, according to the manufacturer’s protocol (Illumina).
Quality-filtered, paired-end sequence reads were mapped to the H. melpomene genome scaffolds (version 1.1) (Heliconius Genome Consortium 2012) using Stampy version 1.0 (Lunter and Goodson 2011). Genotypes were called using the GATK version 2.7 UnifiedGenotyper (DePristo et al. 2011). See File S1 for detailed methods. Only “high quality” genotype calls (Phred-scaled mapping quality and genotype quality ≥30) were used in downstream analyses. We optimized our genotype calling procedure by examining the total numbers of genotype calls and estimated error rates produced by different pipelines. The rate of false positive heterozygous genotype calls was estimated by analyzing a homozygous region in the inbred reference sample. We further examined how error rates change with increasing divergence and different read depths using simulated sequence reads generated using seq-gen (Rambaut and Grass 1997) and ART (Huang et al. 2012). See File S1 for further details.
A maximum-likelihood tree for all 80 samples was generated using only fourfold degenerate (4D) sites that had high-quality genotype calls in at least 60 samples, giving an alignment of 1.7 million bases. RAxML (Stamatakis 2006; Ott et al. 2007; Stamatakis et al. 2008) was used with the GTRGAMMA model, and 100 bootstrap replicates were performed.
We then used two approaches to identify populations that would be considered separately in downstream analyses of diversity: STRUCTURE (Pritchard et al. 2000; Falush et al. 2003), a model-based clustering method that infers the proportion of each individual’s genotype made up by each of a defined number of clusters; and principle components analysis (PCA), performed using Eigenstrat SmartPCA (Price et al. 2006). To minimize the influence of selection, both analyses considered only fourfold degenerate sites. Detailed methods are provided in File S1.
We generated unfolded site frequency spectra for each H. melpomene population by counting the number of derived alleles at biallelic sites. Sites were polarized by comparison with the “silvaniform” clade species: H. hecale, H. ethilla, and H. pardalinus. To allow comparison among populations, and account for missing data, each site was randomly down-sampled to the same number of individuals. See File S1 for details.
To infer changes in ancestral population sizes, we used the pairwise sequentially Markovian coalescent (PSMC) program (Li and Durbin 2011). This method fits a model of fluctuating population size by estimating the distribution of times to most recent common ancestor across a diploid genome. Twelve samples were selected a priori for PSMC analysis. These 12 were chosen because they all had similar sequencing depth, similar numbers of genotyped sites, were all male (homogametic, ZZ), and provided a good representation across the species range. Detailed methods are provided in File S1.
Various population parameters were calculated for nonoverlapping 100-kb windows across the genome. Only windows with a sufficient number of sites genotyped in at least 50% of samples were considered. See File S1 for details. We used 100-kb windows because linkage disequilibrium (LD) tends to break down almost completely within 10 kb and reaches background levels within 100 kb (Figure S1), meaning that measures from adjacent windows would be largely free of linkage effects.
Nucleotide diversity (π) and absolute divergence (dXY) were calculated as the average proportion of differences between all pairs of sequences, either within a sample (π) or between two samples (dXY). Sites with missing data were excluded in a pairwise manner to maximize the amount of data being considered. Tajima’s D (Tajima 1989) and FST (as in equation 9 of Hudson et al. 1992) were calculated using the EggLib Python module (De Mita and Siol 2012).
We estimated the genome-wide rate of adaptive substitution (α) using Messer and Petrov’s asymptotic method (Messer and Petrov 2013), comparing synonymous and nonsynonymous SNPs covering 11,804 polymorphic genes (11,638 autosomal and 166 Z-linked). Polymorphism was measured in the Western population of H. melpomene, and divergence was measured between the Western population and H. erato. We calculated confidence intervals around the estimated α by performing 1000 bootstraps, in each of which 11,804 genes were resampled, with replacement. Detailed methods are described in File S1.
We used multiple linear regression to model nucleotide diversity at 4D sites (π4D) in 100-kb windows (calculated for each population separately and then averaged). The aim was to assess the influence of selection at linked sites on diversity at neutral sites. Since linked selection is largely modulated by the number of selected sites and the extent of linkage, we included as explanatory variables the local gene density (the proportion of coding sequence per window) as proxy for the density of nearby selected sites (Corbett-Detig et al. 2015) and local recombination rate (), calculated from the linkage map. To account for genetic hitchhiking, we also included the number of recent nonsynonymous substitutions (Dn) in the H. melpomene lineage per window as an explanatory variable. As an alternative, and a potentially more direct indicator of adaptive substitutions, we also tested a model using summed gene-by-gene estimates of the number of adaptive nonsynonymous substitutions (a), estimated by maximum likelihood using the McDonald–Kreitman test framework (Welch 2006). To account for mutation rate variation, the rate of synonymous substitutions per synonymous site (dS) was also included as an explanatory variable. Lastly, we also included GC content at third codon positions to account for any effects of DNA composition. Detailed methods for the estimation of various explanatory variables and data processing for this analysis are provided in File S1.
To further investigate the interrelationships between the explanatory variables, we used principal component regression (PCR) (Drummond et al. 2006; Mugal et al. 2013). This approach can help to tease apart the effects of the various explanatory variables by summarizing the explanatory variables into orthogonal components, thereby accounting for multicollinearity. Regression analyses were performed with the R version 3.0.3 (https://www.R-project.org) using the pls package (Mevik and Wehrens 2007).
To assess the robustness of our findings, we investigated the influence of various modifications to the model, such as restricting the analysis to genes showing minimal codon usage bias, the exclusion of chromosome ends, and use of a different outgroup. Details are provided in File S1.
Multiple linear regression was also performed for whole chromosomes, where the response variable was the mean 4D site diversity per chromosome (4D). Here, rather than using recombination rate estimated from the linkage map, we used chromosome length as a proxy for recombination rate (Kaback et al. 1992; Lander et al. 2001). Thus, the five explanatory variables were as follows: chromosome length, average gene density, average synonymous substitution rate (S), average number of nonsynonymous substitutions per 100 kb (n), and average GC content. As above, 4D was square-root transformed, but in this model, none of the explanatory variables required transformation to correct for skewness. As above, all explanatory variables were Z-transformed so that their effects could be compared.
To identify candidate selective sweep locations in the Eastern and Western populations, we used SweeD (Pavlidis et al. 2013). This program is based on Sweepfinder (Nielsen et al. 2005) and uses a composite likelihood ratio (CLR) to identify loci showing a strong deviation in the site frequency spectrum toward rare variants. See File S1 for detailed methods.
All raw sequence reads are available from the Sequence Read Archive from the National Center for Biotechnology Information. Accession numbers are provided in Table S1. Processed genotype data, along with data files underlying all results, figures and tables, and code used for model fitting are available from Data Dryad (http://dx.doi.org/10.5061/dryad.g0874). The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.
The median depth of coverage across all samples was 28×. A median of 78% of sites genome-wide, and 96% of coding sites, had high-quality genotype calls for samples of H. melpomene, its two close relatives H. cydno and H. timareta, and the “silvaniform” clade species H. hecale, H. ethilla, and H. pardalinus, which diverged from H. melpomene ~3.8 million years ago (MYA) (Kozak et al. 2015) (Figure S4). A few samples had considerably fewer sites genotyped, owing to poor sequence coverage (Table S1). More distant species, including H. wallacei (~8.8 MYA), H. doris (~9.7 MYA), and H. erato (~10.5 MYA) all showed strongly reduced numbers of genotype calls (median 33%), suggesting that many reads from these species were too divergent to be mapped reliably to the H. melpomene reference. However, the number of calls obtained in coding regions showed very little drop-off with phylogenetic distance, with a median of 90% of coding sites genotyped in the most divergent outgroup, H. erato (Figure S4). This implies that coding regions are sufficiently conserved to allow read mapping and genotyping across all Heliconius species. Our analyses of the more distant species therefore focused only on coding regions.
We selected a genotyping pipeline that gave a false positive SNP rate of 0.03% per site (three errors in 10,000 calls), when comparing the inbred reference sample to itself (Table S2). Using simulated reads, we found that our pipeline produced higher error rates for more divergent taxa, especially when sequencing depths were low (Figure S5). Nevertheless, for divergences below 6%, which is typical for coding sequences in this genus, and with appreciable sequencing depth, estimated error rates were still well under 0.05% (five in 10,000). As we are concerned primarily with large-scale genomic trends, with all analyses considering large numbers of sites, rare genotyping errors are unlikely to influence our conclusions.
In order to focus our analyses on biologically meaningful populations, we first investigated population structure among our samples. Analysis of population structure based on 4D sites, using both PCA and STRUCTURE (Pritchard et al. 2000; Falush et al. 2003) gave largely congruent results. Both analyses identified three distinct H. melpomene clusters that were largely partitioned geographically (Figure 1, B and C). Consistent with previous studies using smaller datasets, H. melpomene samples from the eastern and western slopes of the Andes formed two strongly differentiated populations, separated by a deep phylogenetic split (Figure 1B). The third population was made up of the samples from French Guiana. These three populations will be referred to as the “Eastern,” “Western,” and “Guianan” populations. The only exception to this geographic clustering was a group of five samples of H. m. melpomene from the eastern slopes of the Andes in Colombia, which formed a monophyletic clade most closely allied with the Western population. However, the STRUCTURE results suggested admixture between these and both the Eastern and Guianan populations (Figure 1B). Although not differentiated by principal components 1 and 2 (Figure 1C), these five Colombian samples were differentiated from the Western population by principal component 3 (Figure S9). Given the distinct geography and genomic composition of these samples, we made the conservative decision to consider this group as a fourth distinct population (“Colombian”).
Tests for selection can be confounded by historical changes in population size (Eyre-Walker and Keightley 2009; Li et al. 2012), so we investigated the population history of the four H. melpomene populations. Three of the four populations showed signatures consistent with fairly stable population sizes, but the Eastern population showed evidence of a recent expansion. Tajima’s D was consistently negative in the Eastern population, but close to zero in the Western, Colombian, and Guianan populations (Figure 2A). Negative Tajima’s D is indicative of an excess of rare variants, consistent with recent population growth. This finding was substantiated by direct examination of the unfolded site frequency spectrum (SFS). The SFS at 4D sites showed a strong excess of rare variants in the Eastern population compared to the other populations (Figure 2B). This skew remained when only geographically proximate samples, with high sequencing coverage, were considered (Figure S10), indicating that it was not an artifact of sampling design. The same trend was also observed at intronic and intergenic sites (Figure S10). Compared to neutral expectations with constant population size, the other three populations displayed a weak excess of rare variants, but to a much lesser degree than the Eastern population. For the Eastern and Western populations, which were more densely sampled, we were able to compare the SFS down sampling to 20 individuals for each SNP position. This deeper sampling reproduced the pattern, further showing that the skew was not limited to singleton variants, but also doubletons (derived alleles present twice in the sample) (Figure S10). While genotyping error could explain some of the excess of singleton SNPs, it is unlikely to cause the observed excess of doubletons, nor the dramatic skew seen in the Eastern population.
We further verified our hypothesis of a recent expansion in the Eastern population using the PSMC method of Li and Durbin (2011). All four populations showed similar population size histories up until ~200,000 years ago, with a gradual increase in population size beginning ~1 MYA and leveling off ~300,000 years ago. However, while the Western, Colombian, and Guianan samples showed a subsequent decrease in the inferred Ne, that of the Eastern samples rose again, roughly doubling between 100,000 and 30,000 years ago (Figure 2C). Closer to the present (<30,000 years ago) the inferred individual histories diverged considerably, as may be expected given the dearth of information about recent demography to be gained from analysis of single genomes (Li and Durbin 2011). Nevertheless, there appears to be a tendency for the more southern samples to show greater expansion (e.g., H. m. amandus from Bolivia and H. m. aglaope from Peru). Methods more sensitive to demographic change in the recent past may be necessary to confirm this pattern. This analysis was repeated several times, varying the PSMC input parameters for block size and recombination rate. While absolute population size estimates tended to be lower at smaller block sizes, the observed trend of a population size expansion in each of the Eastern samples was consistent throughout (data not shown). As it is based on heterozygosity in single genomes, this analysis is independent of the site frequency spectrum and therefore provides an additional line of evidence for a recent expansion of the H. melpomene population to the east of the Andes.
One potential caveat in this conclusion is that hybridization and gene flow may also produce patterns consistent with population growth, and gene flow is known to occur between the eastern population and H. timareta (Martin et al. 2013). However, there is also significant gene flow between the Western population and H. cydno, implying that hybridization alone is unlikely to explain the distinct pattern seen in the Eastern population.
Estimated neutral diversity in H. melpomene was found to be high, and comparable with that in Drosophila spp. Estimates of within-population nucleotide diversity (π) in H. melpomene made use of only those samples with at least 25× depth of coverage, because we found that levels of within-sample heterozygosity tended to be underestimated at sequencing depths below this threshold (Figure S11). Genome-wide π, averaged over all 100-kb windows across the four populations was 1.9%, and similar when only intergenic (2.0%) or intronic (1.9%) sites were considered (Table 1; Table S3). As expected, diversity was strongly reduced at first and second codon positions (0.6%) and higher at third codon positions (1.5%). Diversity was highest at 4D sites (2.5%).
The peripheral Western and Guianan populations had significantly lower diversity than those at the center of the range (Figure 3A) (paired Wilcoxon signed rank test, P < 2e-16). To ensure that this trend was not simply driven by population substructure among sampled individuals, we examined levels of heterozygosity within each sample. As mentioned above, this revealed that sequencing depth affected estimates of heterozygosity, but that above a depth of roughly 25×, heterozygosity was consistent within each population. Considering only samples with depth of at least 25×, we found that average 4D site heterozygosity in the Western samples (2.67%) was only marginally lower than that in the Colombian (2.79%) and Eastern (2.82%) samples, whereas that of the Guianan samples (2.16%) remained considerably lower than the other populations (Figure S11).
To estimate as closely as possible the value of θ = 4Neμ, we recalculated average diversity at 4D sites considering only autosomal genes showing minimal codon usage bias. This gave a slightly higher value of 2.7%, ranging from 2.1 to 2.9% across the four populations (Table 2). These values are in the same range as estimated neutral diversity for D. melanogaster in Southern Africa (~2%) and D. simulans (~3.5%) (Begun et al. 2007; Langley et al. 2012).
Mean divergence at 4D sites between H. melpomene and H. erato, its most distant relative in the genus, was 15.8% [16% when only minimal codon usage bias (CUB) genes were considered]. A calibrated phylogeny places the split between these two species at ~10.5 MYA (Kozak et al. 2015). Assuming four generations per year, this corresponds to a neutral mutation rate of 1.9 × 10−9 per site per generation. This is about two-thirds of the spontaneous mutation rate recently estimated using whole genome sequencing of parents and offspring in H. melpomene (2.9 × 10−9) (Keightley et al. 2014). Using these two rates, we estimated Ne (θ/4μ) for the four populations, which ranged from 1.8–2.8 million for the Guianan population to 2.5–3.8 million for the Colombian population (Table 2). We note that these do not represent instantaneous values, but rather aggregates over the course of the coalescent time scale. They are also based on a small subset of putatively neutral sites, making them difficult to compare with the PSMC results (Figure 2C), as the latter are based on the whole genome data, for which the level of polymorphism is ~30% lower (Table 1).
Levels of diversity at 4D sites varied considerably across individual chromosomes (Figure 3C). This heterogeneity was strongly conserved between the three populations. Interspecific divergence also varied across the chromosomes, but to a lesser extent (Figure 3D). Diversity was also reduced on the Z chromosome relative to autosomes, as expected, given its lower effective population size (Ne). This discrepancy between autosomes and Z was also present in measures of divergence between H. melpomene and its closer relatives, but disappeared at higher levels of divergence, with dXY between H. melpomene and H. erato being nearly identical for autosomes and Z (Table 1). This is consistent with a decreasing contribution of Ne to coalescence time for deeper species splits.
One potential concern is that highly variable regions may have been missed due to poor read mapping, in which case diversity and divergence might be underestimated. To test for this bias, we investigated whether π and dXY were correlated with the proportion of missing data per window (measured as sites genotyped in fewer than 50% of samples). Third codon positions averaged just 2.3% missing data among the H. melpomene samples, and just 2.4% across the entire set of 79 wild samples. Neither π nor dXY were correlated with the proportion of missing data (Figure S12, Figure S13). Hence, there is no evidence for any bias in coding regions. In intergenic regions, which averaged 23% missing data in H. melpomene samples and 25% across the whole sample set, both π and dXY were found to be weakly correlated with the proportion of missing data (Figure S12). However, the amount of variance explained by missing data was very low (linear regression, R2 = 0.037 and 0.009, respectively). The effect of missing data on our estimates of diversity and divergence therefore appears to be minimal. We suggest that the reduced rate of read mapping in noncoding regions in H. melpomene and its closer relatives is probably driven more by an abundance of repeats and structural variation rather than by excessively divergent sequences.
We estimated that 31% of fixed amino acid substitutions between Heliconius species are adaptive. We used Messer and Petrov’s asymptotic method (Messer and Petrov 2013) to estimate a genome wide α, the proportion of nonsynonymous substitutions driven by positive selection. The exponential model showed a good fit to the data (Figure 4), and gave an estimated α of 31.0%. The 5th and 95th quantiles from 1000 bootstrap replicates were 29.0 and 33.0%, respectively. This value is roughly intermediate between estimates for humans (13%) and D. melanogaster (57%), generated using the same approach.
We explored the influence of selection on linked sites using a multiple linear regression approach, following several previous studies (Cutter and Moses 2011; McGaugh et al. 2012; Mugal et al. 2013). Our “main model” had nucleotide diversity at 4D sites (π4D) in 100-kb windows as the response variable and five explanatory variables: (i) local gene density, a proxy for the number of nearby selected sites; (ii) local recombination rate (); (iii) the number of recent nonsynonymous substitutions (Dn), to account for hitchhiking around beneficial mutations; (iv) the synonymous substitution rate (dS), a proxy for local mutation rate; and (v) GC content, to account for effects of local DNA composition (Figure S14, Figure S15). The results for the main model are summarized in Table 3 and Table S4 and described below. There was limited serial correlation among windows. The Durbin–Watson statistic (Durbin and Watson 1950, 1951) for the main model was 1.3, suggesting that autocorrelation is unlikely to influence our conclusions (Field 2009). Several modifications of the model were also tested to investigate the robustness of the result. These are all summarized in Table S5, Table S6, Table S7, Table S8. Overall, results were consistent throughout, but several notable differences are described below.
The main model explained 34% of the variation in π4D (F5,1461 = 151.1, P < 2.2e-16) and all five predictor variables were found to have significant effects (Table 3). Unsurprisingly, synonymous substitution rate (dS) was a strong predictor of intraspecific diversity (F1,1461 = 260.89, P < 2.2e-16), implying that at least some of the observed heterogeneity in diversity across the genome is explained by variation in mutation rate. Local gene density also showed a strong negative relationship with π4D (F1,1461 = 162.03, P < 2.2e-16), consistent with a considerable effect of selection at linked sites. Local recombination rate showed a positive relationship with diversity (F1,1461 = 65.27, P < 1.357e-15), also as expected under linked selection. In addition, recombination rate was not correlated with dS (Pearson’s R = −0.01, P < 0.648; Figure S16), implying that the positive relationship with diversity cannot be explained by a mutagenic effect of recombination. We suggest that the true relationship between recombination rate and neutral diversity may be stronger than our model predicts, as the imperfect placement of scaffolds on the Hmel1.1 linkage map limits our ability to accurately estimate local recombination rates (see below).
There was also a significant negative effect of the number of nonsynonymous substitutions per window (Dn) on π4D (F = 15.08, P < 0.0001), implying a small but detectable effect of genetic hitchhiking around adaptive substitutions. While multicollinearity among the explanatory variables was generally weak, there was a moderate correlation between gene density and Dn (Pearson’s R = 0.53; Figure S16). This is unsurprising, since nonsynonymous substitutions can occur only in coding sequence and are thus likely to be more common in gene-rich regions. This can make it difficult to separate the roles of hitchhiking and background selection. However, the variance inflation factors for gene density and Dn in our model were low and do not suggest a significant impact of collinearity on the precision of our estimates (Table 3). Moreover, PCR analysis, which first accounts for collinearity among explanatory variables by separating them into principle components before performing a multiple regression, demonstrated an effect of gene density on diversity independent of Dn (Figure S17).
One potential concern is that Dn could be underestimated in highly divergent genes, where read mapping for the distant outgroup H. erato may be poor. We therefore tested a modified model in which dS and Dn were estimated using only the more closely related silvaniform species as outgroups. This made little difference to the results (Table S5).
As an alternative to Dn, we also tested a modified model that included maximum-likelihood estimates of a (the number of adaptive nonsynonynmous substitutions) per gene, summed across each window. This revealed a similar significant negative effect of a on π4D (Table S6), while collinearity with gene density was considerably lower than for Dn (Pearson’s R = 0.27). Taken together, these finding all support a role for genetic hitchhiking in addition to background selection in shaping neutral variation in H. melpomene.
One striking difference between humans and fruit flies is that genetic hitchhiking in Drosophila spp. is pervasive enough to produce an average trough in diversity around nonsynonymous substitutions (after scaling for mutation rate variation) (Sattath et al. 2011; McGaugh et al. 2012); whereas this is not directly observable in humans (Hernandez et al. 2011). We performed an equivalent test with our data (Figure S18), and found patterns similar to those in humans, with no significant reduction of scaled diversity in the vicinity of nonsynonymous substitutions compared to synonymous substitutions. Enard et al. (2014) suggest that the effect of hitchhiking around nonsynonymous substitutions can be masked by background selection. This may be because background selection tends to be stronger in more conserved genomic regions, where adaptive substitutions are expected to be less common. This might explain the fact that we observe evidence for reduced diversity only around nonsynonymous substitutions in our multiple-regression model. Although our approach does not model the action of background selection explicitly, by including gene density and recombination rate as explanatory variables, it may account for some of the confounding influence of background selection. The relative roles of hitchhiking and background selection may be further resolved in the future by explicitly modeling these different processes (Corbett-Detig et al. 2015).
In the main model, GC content was found to be negatively correlated with π4D (F1,1461 = 18.98, P = 1.416e-05). This may reflect CUB, with a preference for codons ending in C or G, which would lead to elevated GC content at genes under stronger selection for codon usage (Wright 1990). Indeed, in our analysis of codon usage, genes with a higher GC content at the third codon position tended to display stronger evidence for CUB (Figure S5). In a modified model where π4D was calculated using only our defined set of minimal CUB genes, GC content became a nonsignificant predictor of diversity, whereas effect sizes for all other explanatory variables were similar (Table S7).
We last confirmed that the observed patterns are not predominantly driven by chromosome ends. A model excluding windows in the outer 5% of chromosomes was similar to the main model (Table S8). Generally, effect sizes and P-values were lower, but this may partly reflect the 10% reduction in the number of observations. Interestingly, GC content was no longer a significant predictor of diversity. This might imply that patterns of codon usage change toward the chromosome ends, but this will require further investigation.
Given the large effect of gene density on diversity at 4D sites, we further explored this relationship visually (Figure 5). There was a clear trend of lower π4D in gene-rich regions, but also a conspicuous increase in variance in regions of lower gene density (Figure 5A). This is most likely caused by the smaller number of 4D sites available in such regions, resulting in fewer data and therefore increased noise. We were able to account for this issue in our multiple linear regression model by weighting residuals according to the number of data points available per window, and overall the model showed minimal violation of assumptions (Figure S14, Figure S15). On several chromosomes, the correlation between gene density and π4D was remarkably clear (Figure 5B).
There was a strong negative relationship between 4D site diversity and chromosome length (in bases) (Table 4), further supporting the pervasive role of linked selection in shaping genetic diversity in H. melpomene. Long chromosomes tend to have lower recombination rates per base pair (Kaback et al. 1992; Lander et al. 2001; Kawakami et al. 2014), which should lead to stronger linked selection. Although a considerable number of scaffolds in the H. melpomene v1.1 genome are not properly positioned and oriented on a chromosome, the chromosomal assignment could be inferred for almost all large scaffolds (83% of the genome in terms of bases) (Heliconius Genome Consortium 2012), making for fairly robust estimates of chromosome length. We used a multiple regression model similar to that used for 100-kb windows above, but here averaging all parameters over each of the 20 autosomes, and with chromosome length used as a proxy for recombination rate. This model explained 73.34% of the variation in average chromosomal diversity at 4D sites, 4D (F5,14 = 7.7, P = 0.0011) (Table 4). There was a strong negative relationship between 4D and chromosome length. Comparing models with and without chromosome length as an explanatory variable, we found that the model including chromosome length had a far better fit to the data (F1,15 = 20.15, P < 0.0005). Hence, long chromosomes tend to be less variable at neutral sites than short chromosomes, and this trend was clear upon visual inspection (Figure 6A). As in the window-based model, 4D was positively correlated with average synonymous substitution rate, S (F1,14 = 9.85, P = 0.0073), but there was no significant relationship between S and chromosome length (Pearson’s r = 0.08, P = 0.7289) (Figure 6B), reinforcing our finding that mutation rates are not correlated with recombination rate. The simplest explanation for this pattern is therefore that linked selection drives patterns of diversity not only among small windows, but also among whole chromosomes.
The only other noteworthy correlation with chromosome length was a negative relationship with GC content. We hypothesized that this skew in base composition may be driven by stronger CUB on shorter chromosomes. This would be expected if higher recombination rates on shorter chromosomes allowed for more efficient selection by reducing interference among sites (Hill and Robertson 1966; Felsenstein 1974). Consistent with this hypothesis, the proportion of genes identified as having nonnegligible CUB was negatively correlated with chromosome length (Spearman’s r = −0.555, d.f. = 19, P = 0.009) (Figure S19).
We investigated whether there were signatures of strong, recent selective sweeps in H. melpomene and also whether sweeps tended to be geographically restricted to a particular population. We scanned the genome for putative selective sweep signals using SweeD (Pavlidis et al. 2013), which uses the CLR method of Nielsen et al. (2005) to identify loci displaying a strong skew in the SFS toward rare variants in comparison with the genomic background. A number of regions throughout the genome had outlying CLR values, above a threshold determined using neutral simulations (Figure S20). Most of these were restricted to the Eastern population, with only one being restricted to the Western population, and one region with partially overlapping outliers in both populations. Given the excess of outliers in the Eastern population, most of which are represented by just single windows, it seems likely that most of these signals are spurious, perhaps reflecting increased variance in the SFS caused by a less stable demographic history. Only two loci had strong putative sweeps indicated by clusters of outlying CLR values. On chromosome 11, outliers in the Eastern and Western populations roughly overlapped (Figure 7A), consistent with a beneficial allele spreading across both populations. On chromosome 12, the cluster of outlying CLR values was restricted to the Eastern population (Figure S20). We investigated patterns of diversity within and divergence between populations at these two loci in finer detail. The results were very similar for both candidate sweep loci, so we present the results for chromosome 11 in Figure 7, and those for chromosome 12 in Figure S21.
At the putative sweep locus on chromosome 11, nucleotide diversity (π) in both the Eastern and Western populations was reduced (Figure 7B). Absolute divergence between the two populations (dXY) was similarly reduced (Figure 7B). However, the fixation index, FST, between the Eastern and Western populations was strongly elevated in this region (Figure 7C). This implies that the region carries a low overall amount of genetic variation, but that the variation that is present constitutes fixed or nearly fixed differences between the two populations (Charlesworth 1998). This is consistent with a scenario where the alleles that swept to high frequency in the two populations were not identical, but similar, and potentially carried the same beneficial allele (Bierne 2010). The putative sweep locus on chromosome 12 showed very much the same pattern (Figure S21).
To further explore the genetic make-up of these regions, we visualized the genotypes of individuals from all four populations at a sample of 600 highly polymorphic biallelic SNPs across the scaffold containing the putative sweep locus. Both scaffolds had regions in which heterozygous genotypes were clearly reduced, and fixed differences between the Eastern and Western populations strongly increased. In both cases, there were also several fixed differences between the Eastern and Guianan populations, but no fixed differences between the Western and Colombian populations among the 600 sampled SNPs. We note that by focusing on highly polymorphic SNPs, we highlight differentiation between the populations and fail to show how much of the region is shared, which must be considerable, given the reduced dXY. Nevertheless, this visualization confirms our hypothesis that distinct alleles reached high frequency in the different populations. The presence of some heterozygous sites in the sweep regions indicates that these may be fairly ancient sweeps and/or that no single allele fixed in each population (i.e., a “soft sweep”). One notable observation is the presence of long runs of heterozygous genotypes in certain individuals. This is also consistent with a soft sweep, but may alternatively reflect gene flow subsequent to the sweep, leading to long haplotypes introgressing between the populations.
Both putative sweep locations contained multiple annotated genes. All genes in these two regions that gave significant BLAST hits to D. melanogaster proteins (18 and 13 genes, respectively) are listed in Table S9. Due to the large number of potential targets of selection, we do not speculate here as to the adaptive significance of these events.
The extent to which evolutionary change is a result of neutral or selective forces remains one of the long-standing questions in evolutionary biology. Genomic data permit powerful tests of the various forces that shape genetic variation within and between species, but such studies have only recently been extended beyond a few well-studied taxa. We examined a large number of whole genome sequences to investigate the forces shaping diversity in Heliconius butterflies. Levels of neutral diversity in H. melpomene are similar to those in Southern African populations of D. melanogaster, suggesting comparable effective population sizes. However, actual census population sizes of fruit flies must at times reach numbers far greater than those of Heliconius butterflies, which are characterized by low-density, stable populations (Ehrlich and Gilbert 1973). This paradox of divergent demography but similar diversity is partly explained by the impact of selection on linked sites. Rampant selection across the compact Drosophila genome leads to a dramatic reduction in diversity at linked sites. By contrast, our results suggest that selection is less pervasive in Heliconius, and its influence on linked sites, though significant, is less pronounced.
Our findings suggest that positive selection plays an important, but less prominent role in Heliconius. We estimate that 31% of amino acid substitutions between H. melpomene and H. erato were driven by positive selection. This contrasts with an estimated rate of adaptive substitution in D. melanogaster of 57%, made using the same method (Messer and Petrov 2013). Although our multiple-regression model indicated that genetic hitchhiking has had a significant effect on neutral variation, we did not detect the same strong reduction in diversity around nonsynonymous substitutions seen in Drosophila spp. (Sattath et al. 2011; McGaugh et al. 2012). One possible explanation is that positive selection more often targets noncoding regulatory changes, as appears to be the case in humans (Enard et al. 2014). However, even more general signatures of recent selective sweeps in the form of strong skews in the site frequency spectrum were limited. It is important to note that the amount of adaptive evolution depends not just on the efficacy of selection, but also on how often novel, adaptive phenotypes arise. This depends both on the changeability of the fitness landscape and the availability of adaptive variation. Populations of fruit flies may occasionally reach extremely high densities, increasing the potential for selection to detect advantageous mutations (Barton 2010). The relevant population size for adaptive evolution could therefore be much higher than that estimated from neutral variation, especially in species with highly fluctuating populations.
Certain aspects of Heliconius biology might also obscure the footprints of positive selection when it does occur. Barriers to dispersal, such as the Andes Mountains, can reduce the signature of hitchhiking by slowing the progression of sweeps (Barton 2000; Kim 2013). Indeed, at the two putative sweep loci investigated, similar but distinct alleles appear to have swept in the Eastern and Western populations. This is consistent with a scenario where a globally beneficial allele recombined onto a different genetic background as it spread, which would not only soften the sweep signal, but also enhance population differentiation (Slatkin and Wiehe 1998; Bierne 2010). The source of beneficial variation is another important factor, as adaptation from standing or introgressed variation, can result in soft sweeps (Pennings and Hermisson 2006). One likely example is the repeated evolution of certain wing pattern forms. Despite the strong selection known to act upon wing-patterning loci (Mallet and Barton 1989), signatures of selective sweeps at pattern loci have not been observed (Baxter et al. 2010; Nadeau et al. 2012). While the present study was not designed to address this question, all Western population samples shared a red forewing band, controlled by the B locus on chromosome 18 (Baxter et al. 2010); and yet no significant sweep signal was detected at this locus. It appears that wing patterning frequently evolves by sharing of preexisting alleles between populations, and even between species through rare hybridization (Pardo-Diaz et al. 2012; Heliconius Genome Consortium 2012; Wallbank et al. 2016). The presence of variation among these old alleles might eliminate any signature of genetic hitchhiking (Pennings and Hermisson 2006). It is yet to be established whether adaptation from standing and introgressed variation is generally common in Heliconius, but studies in other systems are increasingly suggesting an important role for preexisting adaptive variation in evolution (Jones et al. 2012; Gosset et al. 2014; Roesti et al. 2014).
Despite the limited influence of hard selective sweeps, the genomic landscape of variation in H. melpomene has been shaped significantly by selection on linked sites. Diversity at neutral sites is positively correlated with recombination rate and negatively correlated with local gene density, which is a good proxy for the density of both coding and noncoding functional elements. These trends are consistent with a pervasive influence of selection on linked sites and are similar to patterns in several other animals (Cutter and Payseur 2013; Corbett-Detig et al. 2015). There is no evidence that recombination rate affects the mutation rate, in agreement with findings in Drosophila (McGaugh et al. 2012) and humans (McVicker et al. 2009). The effects of linked selection are also visible at the whole-chromosome scale. Longer chromosomes are less polymorphic, presumably because they have lower recombination rates per base pair, leading to stronger effects of linked selection on average. A negative relationship between chromosome length and recombination rate has been observed in a range of taxa (Kaback et al. 1992; Lander et al. 2001; Kawakami et al. 2014), and probably stems from a requirement for at least one obligate crossover event per chromosome during meiosis, even on the smallest chromosomes (Kawakami et al. 2014). Increased recombination rates on smaller chromosomes would also be expected to lead to more efficient purifying selection, due to reduced interference among loci (Hill and Robertson 1966; Felsenstein 1974). Indeed, smaller chromosomes display greater evidence of codon usage bias, a phenomenon probably driven by fairly weak selection. This is akin to the observation in D. melanogaster of reduced codon usage bias in regions of minimal recombination (Kliman and Hey 1993).
Further studies of other taxa are necessary to build a complete picture of how selection shapes genetic variation in natural populations. Nevertheless, even a simple comparison between butterflies and fruit flies can be enlightening. The different biology of these two insect groups results in distinct patterns of adaptive evolution. However, the lower effective population sizes in Heliconius combined with less influence of selection on linked sites leads to levels of neutral diversity similar to those in Drosophila spp. Indeed, Corbett-Detig et al. (2015) estimate that the impact of selection on linked sites is around three times greater in fruit flies. Although it has long been recognized that levels of neutral variation are not solely determined by population size, whole genome studies such as this are beginning to reveal in detail how different processes combine to shape genetic diversity.
We thank the handling editor, David Begun, and two anonymous reviewers, whose comprehensive comments greatly improved this manuscript. Nick Barton and Aylwyn Scally provided helpful comments on an earlier draft. We thank John Davey for advice on the recombination rate analyses and for useful discussions of the results. James Walters provided advice for the analysis of the rate of adaptive substitution, and some of the groundwork for this analysis was performed by Gabriel Jamie and Joseph Harvey. We also thank James Mallet and Lawrence Gilbert for thought-provoking discussions of the life history of Heliconius. Finally, we thank Jenny Barna for computing support. This work was funded by European Research Council grants 281668 to F.M.J. and 339873 to C.D.J. Some sequencing was funded by a John Fell Fund of Oxford University grant to Judith Mank and James Walters. S.H.M. was supported by St. John’s College, Cambridge. M.M. was supported by a Swiss National Science Foundation Early PostDoc Mobility fellowship (P2EZP3_148773) and an Austrian Science Fund Erwin Schrödinger fellowship (J3774-B25). C.S. was funded by Convocatoria para Proyectos de Investigación en Ciencias Básicas-2014 (Colciencias), contract no. FP44842-103-2015. W.J.P was supported by a Medical Research Council Centenary Award.
Communicating editor: D. J. Begun
Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.183285/-/DC1.