The sample of 96 individuals included pairs of individuals from 25 local “populations” (typically sampled within a few hundred meters of each other, often much closer) as well as a worldwide survey of commonly used stock center accessions (Tables S1). Where possible, four populations were sampled from each of several regions.
The sample was generated by screening a larger set of accessions with a small number of markers to avoid inbred siblings or extensively heterozygous individuals (E. B., E. Stahl, C. T., M. N., M. K., and J. B., unpublished data). Accessions were genotyped using 11 unlinked markers (five microsatellites, two indel R-genes, and four housekeeping genes with previously identified polymorphisms). To ensure that individuals sampled from local populations were not part of inbred sibships, four (three in one case) individuals from each of 37 populations were tested. Polymorphism was found in 25 of these populations, and a pair of nonidentical individuals was selected at random from each (Table S1). Some accessions not from the same population were also found to be identical with respect to these markers (Col-0 and Lp2-2; Ts-1 and Shahdara), but these were included nonetheless. Five accessions were found to be heterozygous and were eliminated. Four of these were from the population samples, and one, Ms-0, was from the stock center. Further testing of two additional Ms-0 lines revealed one more heterozygote and one homozygote, which was included. In spite of these precautions, one sequenced stock center accession, Van-0, turned out to be extensively heterozygous and was eliminated from the analyses in this paper (bringing the sample size to 95).
We used direct, PCR-based sequencing of genomic DNA, with primers designed from the A. thaliana reference sequence to cover the genome relatively uniformly. To achieve uniform density of our fragments, the reference genome (releases January 7, 2002, and April 17, 2003) was first divided into equally spaced regions. The last 10 kb of each region then served as an input record to Primer3 (v. 0.6). The designed primer pairs returned from Primer3 for each region were then screened for uniqueness and quality. To screen for uniqueness, all primer pairs were BLASTed (BLAST v. 2.2.3) against both the reference genome and BAC datasets downloaded from the Arabidopsis Information Resource (http://www.arabidopsis.org/). Any primer pair that produced a hit in the same region (≤2,300 bp) was removed. Self-amplifying primers were also removed based on this same criterion. Additionally, primers with more than five BLAST hits against the reference were discarded. To improve the quality of each fragment, any primer pair that amplified a target sequence that contained a homonucleotide run of nine bases or more was removed.
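The primer screen described above can be sketched as follows. This is a minimal illustration, not the pipeline that was actually run: the function names, the argument layout, and the idea of passing in a precomputed flag for a nearby unintended BLAST hit are all assumptions made for the example.

```python
import re


def has_homonucleotide_run(sequence, min_run=9):
    """Return True if the sequence contains min_run or more identical bases in a row."""
    pattern = r"([ACGT])\1{%d,}" % (min_run - 1)
    return re.search(pattern, sequence.upper()) is not None


def passes_screen(amplicon, forward_hits, reverse_hits, nearby_extra_hit,
                  max_hits=5, min_run=9):
    """Apply the three filters to one primer pair (hypothetical helper).

    forward_hits, reverse_hits: BLAST hit counts of each primer against the reference.
    nearby_extra_hit: True if an unintended hit lies in the same region
    (<= 2,300 bp); deriving this flag from BLAST output is not shown here.
    """
    if nearby_extra_hit:
        return False
    if forward_hits > max_hits or reverse_hits > max_hits:
        return False
    if has_homonucleotide_run(amplicon, min_run):
        return False
    return True
```

A pair is kept only if it passes all three checks, so the order of the checks does not affect the outcome.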
All sequencing was done using ABI 3700 automated sequencers (Applied Biosystems, Foster City, California, United States). All fragments were sequenced in both directions.
Chromatograms were initially base-called with Phred (v. 0.020425.c) and trimmed based on quality value. The start and end of each read were trimmed until the average quality value was 25 in a window of ten bases, and internal bases were converted to missing data when their quality value was below ten. Accessions missing one read of data were trimmed more severely (different settings were used). A combination of Phrap (v. 0.020425.c) and ClustalW (v. 1.82) was used for producing alignments using a modified weight matrix that allowed us to incorporate quality values into the ClustalW algorithm. Alignments were then visually inspected and adjusted as necessary using Consed (v. 13.0). Polyphred (v. 4.20) was used to flag potential heterozygotes, which were confirmed by visual inspection of chromatograms.
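The end-trimming and masking rule can be illustrated with a short sketch. It assumes that "average quality value was 25" means a window mean of at least 25, that the whole qualifying window is retained, and that missing data are written as N; the function name and these details are illustrative assumptions only.

```python
def trim_read(bases, quals, window=10, min_mean=25, min_base_q=10):
    """Trim both ends of a read by window quality, then mask weak internal bases.

    Returns (masked_bases, kept_quals). An empty read is returned when no
    window reaches the required mean quality.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    n = len(bases)
    # Start positions of every window whose mean quality is high enough.
    good = [i for i in range(n - window + 1)
            if sum(quals[i:i + window]) >= min_mean * window]
    if not good:
        return "", []
    start, end = good[0], good[-1] + window
    kept_b, kept_q = bases[start:end], list(quals[start:end])
    # Bases below the per-base threshold become missing data.
    masked = "".join(b if q >= min_base_q else "N" for b, q in zip(kept_b, kept_q))
    return masked, kept_q
```

With a window rule of this kind, a single weak base at the edge of the first or last qualifying window can survive trimming; the masking step then converts it to missing data.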
Additional trimming was performed as necessary for accessions with multiple false polymorphisms and low-quality sequence after a visual inspection of chromatograms. Whenever two reads from the same accession disagreed, the final call was made by visually inspecting chromatograms unless the difference in quality value made the final call obvious.
Potential polymorphisms in each alignment were then verified by a second person. All alleles found only once or twice in the sample were verified by visually inspecting the chromatograms. When this inspection did not reveal a chromatogram peak clearly different from the other accessions, the base was changed to missing data. This would, if anything, produce a slight underestimate in alleles of frequency one and two. Higher-frequency polymorphisms with generally low-quality values (20 or lower) were also verified by checking the chromatograms.
A total of 876 high-quality fragment alignments were obtained from 979 PCR primers and used for the analyses in this paper. Of the remaining PCR primers, some failed at the stage of PCR amplification and sequencing, while some produced sequencing output that could not be base-called with certainty when the sequence quality was particularly low or when there was evidence that the primer pairs amplified two or more different products in some of the accessions.
To calculate genetic distances, we used a set of markers that have been genetically mapped to the Lister and Dean recombinant inbred lines and that also can be mapped to the AGI reference genome. Some markers were removed so that both physical position and genetic position were monotonically increasing functions.
To infer population structure and assign accessions to populations, we used a model-based clustering algorithm implemented in Structure v. 2.0 [12]. Since A. thaliana is largely homozygous, we used a haploid setting. We used the “linkage model” with “correlated allele frequencies” in Structure, where genetic distances (calculated by fitting a third-order polynomial to the Lister and Dean recombinant inbred mapping data) were used to indicate locus proximity. The algorithm was run with a burn-in length of 50,000 MCMC iterations and then 20,000 iterations for estimating the parameters. This was repeated ten times for each K (ranging from one to 17). In these analyses, each fragment-haplotype was treated as a marker at a multiallelic locus, so that two accessions had a different type if they differed at any site in the fragment.
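The haplotype coding can be sketched as below: every distinct fragment sequence becomes one allele of a multiallelic locus. How missing bases were handled is not stated, so this example simply compares sequences exactly; the function name and integer labels are hypothetical.

```python
def haplotype_alleles(fragment_sequences):
    """Map accession -> integer allele label for one fragment.

    Accessions with identical sequences share a label; any difference at any
    site produces a new label. Sequences are compared exactly (an assumption).
    """
    label_of = {}
    alleles = {}
    for accession, sequence in fragment_sequences.items():
        # Assign the next unused integer the first time a sequence is seen.
        alleles[accession] = label_of.setdefault(sequence, len(label_of) + 1)
    return alleles
```

For example, three accessions carrying `ACGT`, `ACGT`, and `ACGA` would receive labels 1, 1, and 2.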
The likelihood of the data increases with K from K = 1 until K = 7 (using the Wilcoxon two-sample test to compare the ten runs for each K; two-sided p = 0.001 for K = 7 versus K = 6). The likelihoods of K = 7 and K = 8 were similar (two-sided p = 0.97). For K > 7, the likelihoods of different runs were more variable than for K ≤ 7, with the added variability caused only by runs with lower likelihoods. Moreover, the additional clusters for K > 8 do not have a majority of the genome for any of the accessions. These observations taken together indicate that it is less meaningful to choose K > 8.
In displaying the output from Structure, we computed an average of the ten runs for each K. Because there are K! distinct permutations of the clusters that all correspond to equivalent assignments of membership coefficients to accessions, and because independent runs may produce different permutations, to compute an average we first permuted the clusters to align the solutions. For R runs, there are $(K!)^{R-1}$ ways of aligning clusters across runs. To determine which of the clusters of each of the other runs corresponds to a specific cluster in a given run, the symmetric similarity coefficient (SSC) was used with the matrices of membership coefficients (based on the genome-wide average). For a given K, the SSC was calculated for all combinations of pairs of runs:

$$\mathrm{SSC}(Q_i, Q_j) = 1 - \min_{P} \frac{\lVert Q_i - P(Q_j) \rVert_F}{\sqrt{\lVert Q_i - S \rVert_F \, \lVert P(Q_j) - S \rVert_F}},$$

where $Q_i$ and $Q_j$ are the membership matrices of runs i and j, P is a permutation; the minimum is taken over all permutations, S is a probability matrix of K columns where all elements equal 1/K, and $\lVert A \rVert_F$ is the Frobenius matrix norm [52]. This is a slight adaptation of the asymmetric similarity coefficient used in previous work [17]. For K = 2, the runs were permuted to the arrangement that maximizes the sum of SSC across pairs of runs, and an average of the membership matrices across runs was then taken. For K > 2, it was not feasible to test all possible arrangements; therefore, the following greedy algorithm was used. (1) Fix a permutation, $P_1$, of one (randomly chosen) run, $Q_1$. (2) Randomly choose a second run, $Q_2$, and fix the permutation, $P_2$, that maximizes the similarity of $P_2(Q_2)$ to $P_1(Q_1)$. (3) Continue sequentially with each remaining run, $Q_x$, where x = 3,…, R, and fix the permutation, $P_x$, that maximizes the mean similarity of $P_x(Q_x)$ to the already permuted runs $P_1(Q_1), \ldots, P_{x-1}(Q_{x-1})$ for the current run, $Q_x$. Because the choice of starting run can affect the result, we tested all ten possibilities for the starting run. For K = 2 to K = 8, there were thus 70 possible ways of starting the algorithm, and in only two of 70 possible cases was a different result obtained. These two solutions differed from the common solution by switching one pair of clusters in one run (2.5% of the clusters differed from the common solution), and switching one pair of clusters in two different runs (5%).
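The greedy alignment of runs can be sketched in a few lines. The sketch rests on two assumptions: that the similarity of two membership matrices takes the Frobenius-norm form 1 − ‖Qi − Qj‖ / sqrt(‖Qi − S‖ ‖Qj − S‖), with S the matrix whose entries all equal 1/K, and that step (3) maximizes the mean similarity to the runs already fixed. All function names are hypothetical, and membership matrices are plain lists of rows (accessions) by K columns (clusters).

```python
from itertools import permutations
from math import sqrt


def frobenius_distance(a, b):
    """Frobenius norm of the difference between two equally shaped matrices."""
    return sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


def permute_columns(q, perm):
    """Reorder the cluster columns of membership matrix q."""
    return [[row[p] for p in perm] for row in q]


def similarity(qi, qj):
    """Similarity of two matrices; undefined if either equals the uniform matrix S."""
    k = len(qi[0])
    s = [[1.0 / k] * k for _ in qi]
    scale = sqrt(frobenius_distance(qi, s) * frobenius_distance(qj, s))
    return 1.0 - frobenius_distance(qi, qj) / scale


def greedy_align(runs):
    """Align runs in the given order; the first run keeps its column order."""
    aligned = [runs[0]]
    for q in runs[1:]:
        k = len(q[0])
        # Pick the column order most similar, on average, to the fixed runs.
        best = max(permutations(range(k)),
                   key=lambda p: sum(similarity(a, permute_columns(q, p))
                                     for a in aligned) / len(aligned))
        aligned.append(permute_columns(q, best))
    return aligned


def average_runs(aligned):
    """Element-wise mean of aligned membership matrices."""
    n = len(aligned)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*aligned)]
```

Each step searches K! column orders, so the total work grows as R·K! rather than (K!) to the power R − 1; shuffling `runs` before the call gives the random starting run and visiting order.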
We tested for monophyly as follows. For every variable site in a fragment, each cluster was checked for the presence of both alleles as well as for the presence of both alleles outside the cluster. If a variable site in a fragment had both alleles within the cluster as well as outside the cluster, then the whole fragment was deemed nonmonophyletic for that specific cluster. Clusters that failed to show nonmonophyly for a fragment were considered monophyletic for that fragment. Fragments with less than five variable sites and clusters with less than five accessions were always considered to be nonmonophyletic.
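The monophyly criterion can be written as a small function. This is a sketch under stated assumptions: the alignment is a mapping from accession name to an equal-length sequence, `N` and `?` denote missing data, and cluster size is counted among the accessions present in the alignment. The names are hypothetical.

```python
MISSING = {"N", "?"}


def is_monophyletic(alignment, cluster, min_sites=5, min_cluster=5):
    """Return True if the fragment is monophyletic for the given cluster.

    alignment: dict accession -> sequence (all the same length).
    cluster: set of accession names belonging to the cluster.
    """
    inside = [seq for acc, seq in alignment.items() if acc in cluster]
    outside = [seq for acc, seq in alignment.items() if acc not in cluster]
    if len(inside) < min_cluster:
        return False  # small clusters are always treated as nonmonophyletic
    length = len(next(iter(alignment.values())))
    n_variable = 0
    for pos in range(length):
        in_alleles = {seq[pos] for seq in inside} - MISSING
        out_alleles = {seq[pos] for seq in outside} - MISSING
        if len(in_alleles | out_alleles) < 2:
            continue  # not a variable site
        n_variable += 1
        # Both alleles inside and both outside: the fragment fails.
        if len(in_alleles) > 1 and len(out_alleles) > 1:
            return False
    # Fragments with too few variable sites are treated as nonmonophyletic.
    return n_variable >= min_sites
```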
$F_{ST}$ for the inferred clusters was computed as:

$$F_{ST} = \frac{\pi_{total} - \pi_{within}}{\pi_{total}},$$

where $\pi_{total}$ is the average number of pairwise differences per site for all pairs of accessions, and $\pi_{within}$ is the average number of pairwise differences per site for all pairs within cluster i.
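As an illustration, a per-cluster FST can be computed from pairwise differences as below. The sketch assumes the common form (total − within) / total, with both terms measured as average pairwise differences per site, and it skips sites where either sequence has `N`; the function names are hypothetical.

```python
from itertools import combinations


def pairwise_differences_per_site(sequences):
    """Average number of pairwise differences per site over all pairs."""
    values = []
    for a, b in combinations(sequences, 2):
        sites = [(x, y) for x, y in zip(a, b) if x != "N" and y != "N"]
        if sites:  # pairs with no comparable site are skipped
            values.append(sum(x != y for x, y in sites) / len(sites))
    return sum(values) / len(values)


def cluster_fst(alignment, cluster):
    """FST of one cluster, assuming FST = (total - within) / total."""
    total = pairwise_differences_per_site(list(alignment.values()))
    within = pairwise_differences_per_site(
        [seq for acc, seq in alignment.items() if acc in cluster])
    return (total - within) / total
```

A cluster whose members are all identical gives a within-cluster value of zero and therefore an FST of one, provided the whole sample is variable.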
Of the total of 95 accessions, 40 were hierarchically sampled in pairs from four populations in each of five regions (Table S1). The total amount of variation among these 40 accessions, $\pi_{among40}$, was computed by taking the total average pairwise difference for all pairs of the 40 accessions, whereas the amount of variation within populations, $\pi_{withinpop}$, was calculated by taking the mean of the total average pairwise difference for the pairs of accessions in the 20 populations. The level of variation among geographical regions, $\pi_{amongreg}$, was computed as the difference between $\pi_{among40}$ and the mean of the total average pairwise differences for all pairs of accessions within regions. The level of variation among populations, $\pi_{amongpop}$, was calculated from the following expression:

$$\pi_{among40} = \pi_{withinpop} + \pi_{amongpop} + \pi_{amongreg}.$$
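The hierarchical partition can be sketched as follows. It assumes that the within-population, among-population, and among-region components are meant to sum to the total, so that the among-population term is the mean within-region value minus the mean within-population value. The input layout and names are hypothetical, and `N` is treated as missing.

```python
from itertools import combinations
from statistics import mean


def per_site_difference(a, b):
    """Proportion of differing sites between two sequences, ignoring N."""
    sites = [(x, y) for x, y in zip(a, b) if "N" not in (x, y)]
    return sum(x != y for x, y in sites) / len(sites)


def mean_pairwise(sequences):
    """Average pairwise difference per site over all pairs."""
    return mean(per_site_difference(a, b) for a, b in combinations(sequences, 2))


def partition_variation(samples):
    """samples: list of (region, population, sequence) tuples."""
    by_pop, by_region = {}, {}
    for region, pop, seq in samples:
        by_pop.setdefault((region, pop), []).append(seq)
        by_region.setdefault(region, []).append(seq)
    total = mean_pairwise([seq for _, _, seq in samples])
    within_pop = mean(mean_pairwise(v) for v in by_pop.values())
    within_region = mean(mean_pairwise(v) for v in by_region.values())
    return {
        "total": total,
        "within_pop": within_pop,
        "among_reg": total - within_region,
        # Assumed: the three components add up to the total.
        "among_pop": within_region - within_pop,
    }
```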
Genomic patterns of polymorphism
Correlations were identified between levels of polymorphism and local gene density or degree of duplication. The local gene density was measured as open reading frames per centimorgan in windows of size greater than or equal to 1 Mb (using genetically mapped markers from the Lister and Dean recombinant inbred data as endpoints). The number of open reading frames (excluding pseudogenes and RNA genes) from the annotated reference sequence that fell between these window endpoints was counted, and length in centimorgans of each window was estimated from the genetic distance of the markers used as window endpoints.
Correlations were quantified using Spearman's rank correlation, and the significance of the observed values was evaluated using 50,000 permutations that maintained the chromosomal order of all observations but that shuffled the relative positions of the two variables. (For each variable, the lists representing the consecutive values within each chromosome were concatenated in random order and direction to form a circle. The two circles were then randomly aligned with each other.) This is necessary to avoid inflated significance values due to autocorrelations along the chromosomes (of both variables). Using this procedure, the rank correlation between S in nonexon sequences and gene density is −0.27 (p = 0.0014), and the rank correlation between S in nonexon sequences and the negative log of the second-best BLAST e-value is 0.13 (p
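The circular permutation scheme can be sketched as below. Several details are assumptions made for the example: the test is two-sided, the p-value uses the common (count + 1) / (permutations + 1) correction, ties receive average ranks, and the function names are hypothetical.

```python
import random


def ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return cov / var ** 0.5 if var > 0 else 0.0


def random_circle(by_chromosome, rng):
    """Concatenate per-chromosome lists in random order and direction."""
    chroms = list(by_chromosome)
    rng.shuffle(chroms)
    circle = []
    for values in chroms:
        circle.extend(values if rng.random() < 0.5 else values[::-1])
    return circle


def circular_permutation_test(x_by_chrom, y_by_chrom, n_perm=50000, seed=0):
    """Return (observed correlation, permutation p-value)."""
    rng = random.Random(seed)
    observed = spearman([v for c in x_by_chrom for v in c],
                        [v for c in y_by_chrom for v in c])
    extreme = 0
    for _ in range(n_perm):
        cx = random_circle(x_by_chrom, rng)
        cy = random_circle(y_by_chrom, rng)
        shift = rng.randrange(len(cy))  # random alignment of the two circles
        cy = cy[shift:] + cy[:shift]
        if abs(spearman(cx, cy)) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)
```

Because each variable keeps its within-chromosome order, local autocorrelation is preserved in every permuted data set, which is what keeps the null distribution from being too narrow.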
To investigate the effect of population structure, all analyses (except those of population structure) were repeated with the outliers removed (Cvi-0, Mr-0, and all but one randomly chosen member of each closely related group). All conclusions remain qualitatively the same.