Previous studies have shown that repeated armour evolution in sticklebacks occurs through ancient variants at the
EDA locus, which are reused in multiple freshwater populations
11 and subject to strong selection
31. To identify loci where alleles have similarly been used repeatedly during adaptive divergence of marine and freshwater fish, we used two methods to look for regions where sequences of most freshwater fish were similar to each other, but differed from sequences typically found in marine populations. Note that this pattern will not identify adaptive variants that are unique to individual freshwater populations, but instead focuses on variants with striking evidence of biological replication across populations.
First, we developed a self-organizing map-based iterative Hidden Markov Model (SOM/HMM) to identify the twenty most common patterns of genetic relationships (“trees”) among the 21 individuals. Genomic regions were assigned to pattern-types based on likelihood, with boundaries defined using HMM transitions. This method iteratively models recurring phylogenetic patterns on a local genomic basis with increasing resolution ( and
Supplementary Information). Most of the genome was assigned to trees describing geographic relationships between populations (e.g., distinct Pacific vs. Atlantic clades, each containing marine and freshwater fish;
Supplementary Table 7 and
Supplementary Fig. 2 & 3). Two hundred fifteen regions comprising 2,096,101 bp (0.46% of the genome; median size: 4,684 bp) were assigned to one of four trees separating most marine from most freshwater fish (
Supplementary Fig. 3 trees a-d). After filtering, the most prevalent marine-freshwater divergent tree identified 90 genomic regions with a median size of 4,266 bp covering 848,691 bp (0.18% of the genome).
Second, we used a genetic distance-based approach () based on building 21x21 pairwise nucleotide divergence (π) matrices for each of 877,568 overlapping windows across the genome (2,500 bp, step size: 500 bp). Each distance matrix was used to calculate a marine-freshwater cluster separation score (CSS), quantifying the average distance between marine and freshwater clusters after accounting for variance within ecotypes (
Supplementary Information). The score is highly correlated with F
ST distances, but provides increased resolution under high divergence (
Supplementary Fig. 4). After permutation testing, we recovered 174 marine-freshwater divergent regions, covering a total of 1,214,500 bp (0.26% of the genome; median size: 3,000 bp) at a 5% false discovery rate (FDR), and 84 divergent regions covering 479,500 bp (0.1% of the genome; median: 4,000 bp) at 2% FDR. To verify cluster membership in highly divergent genomic regions, we also employed an unguided Bayesian model-based data-driven clustering (DDC; ;
Supplementary Information). For each window of the genome, we estimated the most likely number of distinct clusters of fish (k=0 to 5) and their cluster memberships.
The independent SOM/HMM and CSS approaches both successfully recover the previously described chromosome IV
EDA locus among the top-scoring marine-freshwater regions (). Notably, the cluster membership assigned by DDC successfully recapitulates the breakpoints of the minimal 16 kb shared freshwater
EDA haplotype () previously defined by a multi-year positional cloning study of the major locus controlling armour plate differences in sticklebacks
11. Additional regions were identified on the same chromosome with similar marine-freshwater divergence patterns, including regions surrounding the developmental signaling gene
WNT7B (
Supplementary Fig. 5), and a locus involved in hormone and neurotransmitter binding and metabolism (sulfotransferase 4a1,
SULT4A32). SOM/HMM and CSS defined many other loci that also show globally-shared marine-freshwater divergence, including 242 regions identified by either method (0.5% of the genome), and 147 regions identified by both (0.2% of the genome). The median size of recovered regions (<5 kb) approaches the size of individual genes, and often highlights purely intergenic regions, such as the exclusively non-coding region identified between
BANP and
RAS on chromosome XIX (
Supplementary Fig. 6). The genomic distribution, sizes, and overlaps of recovered regions are described in ,
Supplementary Fig. 7 and
Supplementary Table 8, including a list of specific genes identified in top-scoring regions (
Supplementary Data 1). Using genotyping assays for SNPs in 11 regions recovered by both SOM/HMM and CSS analyses, we found 91% of tested regions show significant enrichment of ecotypic alleles in independent marine and freshwater populations (
Supplementary Information). These results confirm that our experimental design successfully identifies both known and novel loci consistently associated with parallel evolution of distinct marine and freshwater ecotypes.
Compared to the genome overall, the 242 regions implicated in repeated marine-freshwater evolution show higher gene density (
Supplementary Fig. 8,
P<4.5×10
–13) and higher concentration of conserved non-coding sequences in intergenic regions (
Supplementary Fig. 9,
P<1.9×10
–11), likely reflecting a more complex regulatory architecture
33. Gene Ontology analysis shows significant enrichment of genes involved in cellular response to signals, behavioural interaction between organisms, amine and fatty acid metabolism, cell-cell junctions, and WNT signaling (
Supplementary Table 9). Changes in these biological processes, and in the individual genes defined by parallel divergence analysis, likely underlie recurrent differences in morphology, physiology, and behaviour previously described in marine and freshwater sticklebacks
7. For example, the
WNT7B and
WNT11 family members identified by the genomic survey have previously been implicated in a paracrine signaling pathway that controls kidney collecting tubule length and diameter
34. Fish living in freshwater produce copious hypotonic urine compared to marine fish
35, and long-term adaptation to freshwater may select for variants in the same developmental signaling pathways that polarize epithelial cell divisions and regulate kidney tubule formation in other animals.