|Home | About | Journals | Submit | Contact Us | Français|
Marine stickleback fish have colonized and adapted to innumerable streams and lakes formed since the last ice age, providing an exceptional opportunity to characterize genomic mechanisms underlying repeated ecological adaptation in nature. Here we develop a high quality reference genome assembly for threespine sticklebacks. By sequencing the genomes of 20 additional individuals from a global set of marine and freshwater populations, we identify a genome-wide set of loci that are consistently associated with marine-freshwater divergence. Our results suggest that reuse of globally-shared standing genetic variation, including chromosomal inversions, plays an important role in repeated evolution of distinct marine and freshwater sticklebacks, and in the maintenance of divergent ecotypes during early stages of reproductive isolation. Both coding and regulatory changes occur in the set of loci underlying marine-freshwater evolution, with regulatory changes likely predominating in this classic example of repeated adaptive evolution in nature.
The genetic and molecular basis of adaptive evolution is still largely unknown. Some researchers have championed a preeminent role for regulatory changes during evolution of adaptive phenotypes, because such changes may avoid pleiotropic consequences of protein-coding alterations1–3. Others have catalogued known phenotypic differences caused by protein-coding changes and have questioned whether sufficient case histories exist to estimate the relative frequency of regulatory and coding changes during adaptive evolution4. Despite progress on individual traits5, it has been difficult to accumulate enough examples in any particular group to obtain an overall picture of molecular mechanisms underlying evolutionary change, particularly for clearly adaptive phenotypes in wild organisms.
Threespine sticklebacks offer a powerful system for studying the molecular basis of adaptive evolution in vertebrates. Following the retreat of Pleistocene glaciers, marine sticklebacks colonized and adapted to many newly-formed freshwater habitats, evolving repeated changes in body shape, skeletal armour, trophic specializations, pigmentation, salt handling, life history, and mating preferences6,7. Recurrent evolution of similar phenotypes in similar environments suggests these traits evolve by natural selection8. Distinctive marine and freshwater forms can still hybridize, making it possible to map the genetic basis of individual traits, and identify particular genes underlying armour, pelvic, and pigmentation evolution9–12. At two of these key loci, distinctive haplotypes are reused when similar phenotypes evolve in different populations11,12, a pattern later found at additional loci13,14. Ongoing gene-flow between marine and freshwater forms occurs along coastal rivers15,16, making it possible to spread adaptive alleles among populations, while homogenizing neutral genomic regions17. Here we use signatures of allele sharing to identify a genome-wide set of adaptive loci consistently associated with recurrent marine freshwater evolution.
To facilitate studies of stickleback evolution, we first generated a reference genome assembly from a homogametic (female) freshwater stickleback (Gasterosteus aculeatus) from Bear Paw Lake, Alaska. The sequenced individual was partially inbred and retained heterozygosity at approximately 1/700 bp. The assembly, gasAcu1.0, was generated with 9.0x coverage in Sanger sequence data (ABI3730), and has an N50 contig size of 83.2 kb, an N50 scaffold size of 10.8 Mb and a total gapped size of 463 Mb, close to previous 530 Mb estimates18. The 113 largest scaffolds (86.9%, 400.4 Mb) were anchored to stickleback linkage groups in an F2 marine x freshwater intercross, while 60.7 Mb in 1,812 smaller scaffolds (N50=0.3 Mb), remains unanchored. Use of a single partially-inbred individual, construction and assembly of a range of genomic library sizes, and the relatively low repeat and duplication content of the stickleback genome, have produced a highly contiguous anchored genome assembly with contig and scaffold sizes much larger than other published teleosts19–22 (Supplementary Table 1).
The stickleback sequence was annotated using the Ensembl pipeline, which predicted 20,787 protein-coding and 1,617 RNA genes (Supplementary Table 2). Of the protein-coding genes, 7,614 showed one-to-one orthology with mammals and an additional 7,192 showed one-to-one orthology among fishes. The other 5,981 genes showed complex orthology relationships, including some lineage-specific gene expansions that likely contribute to stickleback adaptations (e.g., a duplicated mucin family encoding glue proteins used for male nest building23). A total of 13.4% of the stickleback genome appeared under evolutionary constraint when compared with other fishes using PhastCons24. The conserved portion was roughly equally divided between protein-coding and non-coding sequences, with ~71% of the latter shared with mammals and ~29% representing fish-specific conserved sequences (Supplementary Table 3).
To search for loci underlying repeated evolution in sticklebacks, we first identified populations showing characteristic marine-freshwater morphology (Fig. 1a, Supplementary Fig. 1, and Supplementary Table 4). Repeated adaptation to divergent marine-freshwater environments resulted in dramatic correlated changes in body shape, length, depth, fin position, spine length, eye size, and armour plate number (Fig. 1b). Because quantitative trait loci (QTL) controlling these traits map to many different chromosomes12,25–30, the morphological screen should identify populations differing in a genome-wide range of adaptive loci underlying marine-freshwater differences.
From the distinct morphological clusters of marine and freshwater fish, we selected multiple marine-freshwater pairs, from both Pacific and Atlantic populations, including individuals from opposite ends of rivers with marine-freshwater hybrid zones15,16 (21 fish total, including the reference genome individual). The sampling strategy should minimize geographic bias in the data set, while maximizing the chance for local exchange of neutral regions of the genome.
We generated 2.3x average coverage per individual using Illumina sequencing (Supplementary Table 5, Supplementary Information). To identify single nucleotide polymorphisms (SNPs), we pooled data from all fish and identified positions where at least four reads support a variant allele. This criterion identified 5,897,368 candidate SNPs (Supplementary Table 6), with most being true positives based on experimental validation (N=48 tested, 82.6% confirmed; Supplementary Information).
Previous studies have shown that repeated armour evolution in sticklebacks occurs through ancient variants at the EDA locus, which are reused in multiple freshwater populations11 and subject to strong selection31. To identify loci where alleles have similarly been used repeatedly during adaptive divergence of marine and freshwater fish, we used two methods to look for regions where sequences of most freshwater fish were similar to each other, but differed from sequences typically found in marine populations. Note that this pattern will not identify adaptive variants that are unique to individual freshwater populations, but instead focuses on variants with striking evidence of biological replication across populations.
First, we developed a self-organizing map-based iterative Hidden Markov Model (SOM/HMM) to identify the twenty most common patterns of genetic relationships (“trees”) among the 21 individuals. Genomic regions were assigned to pattern-types based on likelihood, with boundaries defined using HMM transitions. This method iteratively models recurring phylogenetic patterns on a local genomic basis with increasing resolution (Fig. 1c and Supplementary Information). Most of the genome was assigned to trees describing geographic relationships between populations (e.g., distinct Pacific vs. Atlantic clades, each containing marine and freshwater fish; Supplementary Table 7 and Supplementary Fig. 2 & 3). Two hundred fifteen regions comprising 2,096,101 bp (0.46% of the genome; median size: 4,684 bp) were assigned to one of four trees separating most marine from most freshwater fish (Supplementary Fig. 3 trees a-d). After filtering, the most prevalent marine-freshwater divergent tree identified 90 genomic regions with a median size of 4,266 bp covering 848,691 bp (0.18% of the genome).
Second, we used a genetic distance-based approach (Fig. 1c) based on building 21x21 pairwise nucleotide divergence (π) matrices for each of 877,568 overlapping windows across the genome (2,500 bp, step size: 500 bp). Each distance matrix was used to calculate a marine-freshwater cluster separation score (CSS), quantifying the average distance between marine and freshwater clusters after accounting for variance within ecotypes (Supplementary Information). The score is highly correlated with FST distances, but provides increased resolution under high divergence (Supplementary Fig. 4). After permutation testing, we recovered 174 marine-freshwater divergent regions, covering a total of 1,214,500 bp (0.26% of the genome; median size: 3,000 bp) at a 5% false discovery rate (FDR), and 84 divergent regions covering 479,500 bp (0.1% of the genome; median: 4,000 bp) at 2% FDR. To verify cluster membership in highly divergent genomic regions, we also employed an unguided Bayesian model-based data-driven clustering (DDC; Fig. 2c; Supplementary Information). For each window of the genome, we estimated the most likely number of distinct clusters of fish (k=0 to 5) and their cluster memberships.
The independent SOM/HMM and CSS approaches both successfully recover the previously described chromosome IV EDA locus among the top-scoring marine-freshwater regions (Fig. 2). Notably, the cluster membership assigned by DDC successfully recapitulates the breakpoints of the minimal 16 kb shared freshwater EDA haplotype (Fig. 2c) previously defined by a multi-year positional cloning study of the major locus controlling armour plate differences in sticklebacks11. Additional regions were identified on the same chromosome with similar marine-freshwater divergence patterns, including regions surrounding the developmental signaling gene WNT7B (Supplementary Fig. 5), and a locus involved in hormone and neurotransmitter binding and metabolism (sulfotransferase 4a1, SULT4A32). SOM/HMM and CSS defined many other loci that also show globally-shared marine-freshwater divergence, including 242 regions identified by either method (0.5% of the genome), and 147 regions identified by both (0.2% of the genome). The median size of recovered regions (<5 kb) approaches the size of individual genes, and often highlights purely intergenic regions, such as the exclusively non-coding region identified between BANP and RAS on chromosome XIX (Supplementary Fig. 6). The genomic distribution, sizes, and overlaps of recovered regions are described in Fig. 3, Supplementary Fig. 7 and Supplementary Table 8, including a list of specific genes identified in top-scoring regions (Supplementary Data 1). Using genotyping assays for SNPs in 11 regions recovered by both SOM/HMM and CSS analyses, we found 91% of tested regions show significant enrichment of ecotypic alleles in independent marine and freshwater populations (Supplementary Information). These results confirm that our experimental design successfully identifies both known and novel loci consistently associated with parallel evolution of distinct marine and freshwater ecotypes.
Compared to the genome overall, the 242 regions implicated in repeated marine-freshwater evolution show higher gene density (Supplementary Fig. 8, P<4.5×10–13) and higher concentration of conserved non-coding sequences in intergenic regions (Supplementary Fig. 9, P<1.9×10–11), likely reflecting a more complex regulatory architecture33. Gene Ontology analysis shows significant enrichment of genes involved in cellular response to signals, behavioural interaction between organisms, amine and fatty acid metabolism, cell-cell junctions, and WNT signaling (Supplementary Table 9). Changes in these biological processes, and in the individual genes defined by parallel divergence analysis, likely underlie recurrent differences in morphology, physiology, and behaviour previously described in marine and freshwater sticklebacks7. For example, the WNT7B and WNT11 family members identified by the genomic survey have previously been implicated in a paracrine signaling pathway that controls kidney collecting tubule length and diameter34. Fish living in freshwater produce copious hypotonic urine compared to marine fish35, and long-term adaptation to freshwater may select for variants in the same developmental signaling pathways that polarize epithelial cell divisions and regulate kidney tubule formation in other animals.
While our method identifies regions used repeatedly during stickleback evolution, it does not tell us how prevalent such regions are among all differentiated loci in a particular marine-freshwater species pair. To address this, we analysed patterns of genomic differentiation across a marine-freshwater hybrid zone in River Tyne, Scotland (Fig. 4a). Previous studies show ecologically-mediated postzygotic selection maintains distinct ecotypes in this system, despite hybridization and opportunity for extensive gene-flow16. Whole-genome sequencing of a pair of marine and freshwater fish from either end of the Tyne hybrid zone identified a set of genomic windows with high divergence. Within the top 0.1% divergent windows, 35.3% contain elevated globally-shared marine-freshwater divergence (Fig. 4b and Supplementary Information) suggesting an ancient shared origin for many, but not all, loci with highly differentiated alleles in this marine-freshwater species pair. Previous studies have shown that some traits in sticklebacks evolve by independent mutations that vary among populations10. The non-globally-shared divergent alleles in the Tyne may also represent recent, or locally arising adaptive variants, though further studies will be required to link such variants to particular traits, or to distinguish them from neutral but highly variable regions of the stickleback genome.
When adaptive divergence occurs in hybridizing systems, theory predicts that selection can favour molecular mechanisms that suppress recombination between independent adaptive loci17. We observed extended stretches of elevated CSS spanning 442 kb, 412 kb, and 1700 kb on chromosomes I, XI, and XXI (Fig. 3). Based on sharp transitions in CSS score and DDC cluster assignments at the boundaries, we hypothesized that chromosomal inversions explain these extended regions. By analyzing paired-end sequence reads from a marine large-insert (~220 kb) BAC library36, we identified individual clones with size and orientation anomalies relative to the freshwater reference genome assembly. The only locations with five or more anomalous clones mapped to chrI, XI, and XXI, and these anomalies could be resolved by the presence of inverted chromosome segments between the marine fish and the freshwater reference genome (Fig. 5a-b). Sequences flanking the predicted inversion breakpoints contain inverted repeats, consistent with generation of inversions by intra-chromosomal recombination (Supplementary Fig. 10). Interestingly, repeats flanking the chrXI inversion contained alternative 3’ exons for the voltage-gated potassium channel gene KCNH4. Because KCNH4 transcription is initiated within the inversion, alternative inversion orientations could generate marine- and freshwater-specific KCNH4 isoforms (Fig. 5c). While the functional consequences of such ecotype-specific isoforms remain unknown, KCNH4 homologs help maintain resting currents, affect cardiac contractility, and alter performance on cognitive tasks if perturbed in mice37–39. Further, QTL for two distinct marine-freshwater divergent traits have previously been mapped to the broad region of the chrXXI inversion (Fig. 5d)27,30, as expected if inversions help maintain linkage between different adaptive QTLs40.
Importantly, cluster assignment of individual fish by DDC shows that most marine and freshwater populations in the Pacific carry contrasting forms of the inversion regions (Supplementary Table 10). Similar ecotype associations are seen in the Atlantic basin for chrI (no exceptions), chrXI (2 exceptions), and to a lesser extent for chrXXI (3 freshwater exceptions). Genetic markers within the chrI and XXI regions are polymorphic in hybrid zones, and show large frequency differences when genotyped in adjacent upstream and downstream fish, confirming that these regions are subject to divergent selection in marine and freshwater habitats (Supplementary Table 10). Our results help explain the broader patterns of genomic divergence seen in Fig. 3, and add to growing evidence that chromosome inversions are a common genomic mechanism that maintains contrasting ecotypes in hybridizing natural populations41–44.
Identification of a genome-wide set of loci used repeatedly in stickleback adaptation provides a rare opportunity to estimate the relative contribution of coding and regulatory changes underlying adaptive evolution in natural populations. To examine this issue, we analyzed 64 marine-freshwater divergent regions with the strongest evidence of parallel evolution: those identified by both SOM/HMM and CSS analyses using the strictest significance thresholds (Supplemental Information and Supplemental Data 1), and containing SNPs showing perfect allele-ecotype association between marine and freshwater fish. Many of these 64 regions (41%) mapped entirely to non-coding regions of the genome, and presumably contain regulatory changes (Fig. 6a). A smaller fraction contains protein-coding sequences with consistent non-synonymous substitutions between marine and freshwater fish (17%). Finally, a fraction of regions (43%) include both coding and non-coding sequences, but lack ecotype-specific amino acid substitutions (Supplementary Data 1). Since all of these regions contain SNPs with perfect allele-ecotype association that do not cause protein-coding changes, they also likely contribute to adaptive divergence by regulatory alterations. The combined data suggest that both coding and regulatory differences contribute to parallel stickleback evolution, with regulatory changes accounting for a much larger proportion of the overall set of loci repeatedly selected during marine-freshwater divergence.
To assess further the possible role of gene regulatory evolution in stickleback evolution, we constructed whole-genome expression arrays to compare levels of gene expression in tissues from Little Campbell River (LITC) marine and Fish Trap Creek (FTC) freshwater fish. Of 12,594 informative genes across the genome, 2,817 showed significant expression differences between ecotypes. Genes with marine-freshwater expression differences were significantly more likely to occur in or near the adaptive regions recovered by SOM/HMM or CSS analysis (Fig. 6b, P<7.1×10−8). While expression differences can be due to either cis- or trans-acting changes, the expression data are consistent with an important role of regulatory changes during parallel evolution of marine and freshwater sticklebacks.
Progress in genetic mapping and positional cloning approaches has recently made it possible to identify a few individual genes and mutations that contribute to phenotypic differences between stickleback populations10–12,25. Despite this progress, identifying many such examples using genetic linkage mapping alone would require years of additional effort. Fortunately, the highly replicated nature of stickleback evolution provides clear molecular signatures that can be used to recover many loci consistently associated with parallel marine-freshwater adaptation. The signal resolution of repeatedly used adaptive loci approaches ~5 kb, often identifying single genes or intergenic regions, and offering a significant advantage over the several hundred kilobase candidate intervals typically identified in genetic mapping crosses11,12, or the megabase or larger regions identified in previous selection scans of the stickleback genome13. The many marine-freshwater divergent loci and gene expression changes identified in the current study will substantially accelerate ongoing searches for the genetic and molecular basis of fitness-related morphological, physiological and behavioural differences between marine and freshwater fish.
In addition, the genome-wide set of divergent regions already provides new insights into evolutionary processes shaping adaptive evolution and ecological speciation. Our results suggest that parallel evolution of marine and freshwater sticklebacks occurs by dynamic reassembly of many “islands” of divergence distributed across many chromosomes. Reassembly by linkage is likely strengthened by inversions that distinguish marine and freshwater ecotypes. Differences in both globally-shared and locally-restricted genetic variation actively maintained across a hybrid zone provide a snapshot of the genomic architecture and evolutionary processes contributing to the early stages of reproductive isolation. Finally, our data indicate that repeated evolution of marine-freshwater differences depends on both protein-coding and regulatory changes. Regulatory evolution appears to play a particularly prominent role, as indicated by the increased density of conserved non-coding intergenic sequences found near marine-freshwater divergent loci (Supplementary Fig. 9); the substantial fraction of loci mapping entirely to non-coding regions (Fig. 6a); and the significant enrichment of genes with expression differences near key regions used for parallel evolution (Fig. 6b). Mutations causing structural changes in proteins are the most abundant variants recovered in laboratory E. coli and yeast evolution experiments45,46. They make up 90% of forty published examples of adaptive changes between closely-related taxa4, and 63–77% of the known molecular basis of phenotypic traits in domesticated or wild species5. The larger fraction of regulatory changes implicated during repeated stickleback evolution may reflect our use of whole-genome rather than candidate gene approaches, stronger selection against loss-of-function and pleiotropic protein-coding changes in natural populations than in laboratory or domesticated organisms1–3, or an increasing prevalence of regulatory changes at interspecific compared to intraspecific levels5,47, including emerging species such marine and freshwater sticklebacks.
Although our study has focused on marine-freshwater divergence, freshwater sticklebacks also repeatedly evolve characteristic lake-stream differences; open-water and bottom-dwelling lake ecotypes; gigantism in particular lakes; and substantial changes in seasonality and life history6,7,48–50. Given the considerable fraction of parallel stickleback evolution likely occurring by shared variants (Fig. 4b), sequencing of additional populations should make it possible to identify similarly shared loci contributing to other ecological traits, again using the power of replicated evolution to illuminate both specific and general mechanisms underlying evolutionary change in natural populations.
A reference stickleback genome sequence was assembled from a single female freshwater stickleback (Bear Paw Lake, AK), using 9x coverage of paired-end Sanger-sequenced reads from multiple insert size libraries. Scaffolds were assigned to linkage groups in a genetic cross, and annotation was carried out using the Ensembl evidence-based pipeline. Twenty-one fish from independent populations were chosen for short-read sequencing (48x combined coverage) based on morphometric analysis. Patterns of genetic variation were analysed for divergence between marine and freshwater fish, using both a self-organizing map/Hidden Markov Model and a pairwise distance matrix approach (see Supplemental Information). Paired-end reads from a marine BAC library were placed against the reference freshwater genome sequence to identify possible chromosome rearrangements.. Sequenom iPlex genotyping assays were carried out to verify predicted SNPs and divergent marine-freshwater regions. RNA samples were prepared from tissues of marine and freshwater fish born and raised under identical laboratory conditions. Significant expression differences were detected with Agilent microarrays using eBayes (limma R package). GO category enrichments were analysed using GOstats (BioConductor 2.7). Additional methods and analyses are provided in online Supplementary Information.
Stickleback sequencing at Broad Institute was supported by grants from the National Human Genome Research Institute (NHGRI). R.M.M., J.S., J.G. and D.M.K were supported by NHGRI CEGS Grant P50-HG002568; Y.F.C by a Stanford Affymetrix Bio-X Graduate Fellowship; C.T.M. by the Jane Coffins Childs Fund; and B.R.S.,T.R.H., and A.A.P by graduate fellowships from NSF and NDSEG. D.M.K. is an investigator of the Howard Hughes Medical Institute. K.L-T. is a EURYI award recipient funded by ESF. We thank William Cresko for the BEPA individual used for reference genome sequencing; Michael Bell, Jeffrey McKinnon, Bjarni Jónsson, Seiichi Mori, Catherine Peichel, Dolph Schluter, Martin Kalbe, Tom Reimchen, Dánjal-Petur Højgaard, Michael McLaughlin, Bjørki Geyti, and Benjamin Blackman for valuable discussions and assistance with specimens used in population surveys; and Gill Bejerano for useful discussions and assistance with computational analysis.
ContributionsK.L-T., F.D., and D.M.K. planned and oversaw the project and K.L-T. and D.M.K are co-senior authors. E.M. assembled; J.S., J.G., M.C.D., A.K.K. and R.M.M. anchored; and S.W., E.B. and S.S. annotated the reference genome. C.A. constructed the BEPA BAC library. F.C.J., Y.F.C., and D.M.K. designed the whole genome re-sequencing experiment. F.C.J. and Y.F.C. performed morphometric analyses. M.G.G., M.P. and M.C.Z. analysed pilot data and performed simulations to evaluate sequencing strategies. M.G.G., P.R., E.M., F.C.J., and Y.F.C and R.S. analysed polymorphisms. P.R. and M.G.G. developed and carried out the SOM/HMM analysis. F.C.J. and Y.F.C developed and carried out the CSS and DDC analysis. P.R. analysed gene and non-coding element density, and performed phylogenetic analysis. T.H. analysed GO term enrichments. F.C.J. and Y.F.C. carried out hybrid zone analysis. C.T.M., B.R.S., J.G., J.S., Y.F.C., and F.C.J. analysed chromosome inversions. F.C.J. and D.M.K. performed analysis of coding and regulatory changes. H.Z., A.A.P, and T.H. performed the whole genome expression study. D.M.K., F.C.J, Y.F.C., P.R., F.D. and K.L-T. wrote the paper with input from other authors.
The authors declare no competing financial interests.
UCSC Genome browser tracks showing genome-wide analyses are available at http://sticklebrowse.stanford.edu. Microarray expression data are deposited at http://www.ncbi.nlm.nih.gov/geo (accession GSE34783). BAC-end sequences are deposited at http://www.ncbi.nlm.nih.gov/dbGSS (accession JS583469-JS583576).