The amount of LD and the proportion of SNP pairs with r² > 0.8 showed similar patterns: heterogeneity among continental regions and higher homogeneity among populations within each geographical region, with the exception of the Americas. Although correlated, these two measures of overall LD capture different aspects: mean r² offers a broad picture of LD, while the proportion of r² values > 0.8 focuses on the higher end of the LD spectrum, where information redundancy between SNPs is high enough that this value is the usual threshold at which tagSNPs are designed. We have confirmed the previously observed gradient in LD, from Sub-Saharan Africa (with the lowest levels of LD) through successively increasing amounts of LD in Middle East-North Africa, Central South Asia, Europe, East Asia and Oceania, to America (with the highest amount of LD) [10, 12]. Previous observations were based either on a few genes and a similar geographical range of samples [10, 12] or, on the contrary, on a higher number of markers limited to a small number of populations, such as the three HapMap or Perlegen populations [15, 28]. A basic description of LD decline with distance has been published for a subset of the HGDP-CEPH panel for > 500 000 SNPs [25], outlining the general trends that we have analyzed in detail. We have explored the LD patterns in 39 worldwide populations by means of the r² measure of LD between 21 685 SNP pairs covering 211 autosomal gene regions. Moreover, we have specifically focused on those SNP pairs with high LD (so that each one can be used to tag the other) in a relevant subset of genes, as most of them may be implicated in common diseases. Note that our results may not apply to the whole genome, but they are highly relevant to candidate-gene association studies. In this context, we have extended the previous observation that more tagSNPs are needed in the Yoruba than in Europeans or Asians [29] to a wider range of African populations that show similarly low levels of LD. On average, 3.14% of the SNP pairs in our study showed r² > 0.8 in Sub-Saharan Africans vs. 7.52% in Europeans; that is, 2.4 times as many SNP pairs showed high levels of information redundancy in Europeans as in Africans. We found a general pattern of greater LD differences between continents than within them, which would imply that genetic association studies should be more easily replicated within than between continents, as previously indicated in a set of dopamine and serotonin pathway genes [30]. Similarly, our results point to a high transferability of tagSNPs within continents [31], with the exception of America. This exception reflects the extremely heterogeneous nature of the American populations, as seen, for instance, in their STR allele frequencies [32]. Apparently, after the bottleneck associated with the first colonization of the Americas, which increased LD, genetic drift has acted extensively to differentiate American populations in their allele and haplotype frequencies as well as in their levels of LD.
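The two summary statistics discussed above, mean r² and the proportion of SNP pairs with r² > 0.8, can be illustrated with a minimal sketch. It assumes phased haplotypes coded as 0/1 alleles, which may differ from how r² was estimated in this study; the function names and the toy data are hypothetical and serve only to show the calculation.

```python
from itertools import combinations


def r_squared(snp_a, snp_b):
    """r^2 between two biallelic SNPs, given 0/1 alleles on the same phased haplotypes."""
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("allele vectors must be non-empty and of equal length")
    p_a = sum(snp_a) / n                                   # frequency of allele 1 at SNP A
    p_b = sum(snp_b) / n                                   # frequency of allele 1 at SNP B
    p_ab = sum(a * b for a, b in zip(snp_a, snp_b)) / n    # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None                                        # a monomorphic SNP: r^2 is undefined
    d = p_ab - p_a * p_b                                   # disequilibrium coefficient D
    return d * d / denom


def ld_summary(snps, threshold=0.8):
    """Return (mean r^2, proportion of pairs with r^2 > threshold) over all SNP pairs."""
    values = []
    for snp_a, snp_b in combinations(snps, 2):
        r2 = r_squared(snp_a, snp_b)
        if r2 is not None:
            values.append(r2)
    if not values:
        raise ValueError("no polymorphic SNP pairs to summarize")
    mean_r2 = sum(values) / len(values)
    prop_high = sum(1 for v in values if v > threshold) / len(values)
    return mean_r2, prop_high


# Toy haplotype data (made up for illustration, not from the HGDP-CEPH panel).
snp1 = [0, 0, 1, 1, 0, 1, 0, 1]
snp2 = [0, 0, 1, 1, 0, 1, 0, 1]
snp3 = [0, 1, 0, 1, 0, 1, 0, 1]
mean_r2, prop_high = ld_summary([snp1, snp2, snp3])
print(mean_r2, prop_high)

# Fold difference implied by the percentages quoted in the text.
print(round(7.52 / 3.14, 1))  # 2.4
```

In the toy data one pair is perfectly correlated and the other two are weakly correlated, so the mean r² is moderate while only one pair in three exceeds the threshold. This is the sense in which the two statistics, although correlated, capture different aspects of LD.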
A role has been suggested for genetically isolated populations in genetic epidemiology because of their predicted high levels of LD, which would facilitate the detection of genes involved in complex diseases by indirect association [19]. In the HGDP-CEPH panel, several populations can be considered as cultural and genetic isolates (Additional file 2). Such populations showed moderate increases in the proportion of SNP pairs with r² > 0.8 per gene region when compared with the non-isolates in their respective continents. Conversely, if we take the proportion of SNP pairs with r² < 0.8 as a rough indication of the minimum proportion of SNPs that are needed to capture the haplotype variation in a gene region, then the difference in the number of SNPs that need to be typed in isolated populations compared to their non-isolate continental counterparts would be 0.4% in the European isolates, 1.8% in the Yakut, 2.7% in the Kalash, and 11.3% in the Surui. Thus, genotyping may be slightly cheaper in isolated than in outbred populations. However, association studies designed in the latter may have two practical advantages: i) larger sample sizes can usually be obtained in general populations, and ii) allele frequencies may be closer to those in reference HapMap populations, which allows more precise a priori statistical power calculations and avoids genotyping SNPs that turn out to be monomorphic.
It follows, then, that being labelled a population isolate on genetic, linguistic or cultural evidence does not by itself imply that LD is increased to a point that would justify a significant reduction in the genotyping burden for genetic association studies. This result agrees with a separate analysis [33] of the HGDP-CEPH panel, in which the microsatellite-based estimates of the θ = 4Nₑμ parameter were not significantly lower in isolated than in mainstream populations within each continent. Considering mutation rates (μ) as equal across populations, it follows that effective population sizes are not detectably lower in population isolates. A reduced effective population size, as is usually presumed for isolates, is precisely the condition that would increase LD in isolated populations. The levels of isolation required to decrease Nₑ significantly and subsequently increase LD appear to have been rare in human demographic history, at least in the populations sampled for the HGDP-CEPH panel. Examples of isolated populations with significantly increased LD are the Kuusamo Finns [21] and the Micronesian Kosrae [22]; in the HGDP-CEPH panel, the only isolated populations with consistently increased LD in all distance classes with respect to their continent are the Kalash (Central South Asia) and the Surui (America). The Kalash were noted as an outlier for their allele frequencies in 377 STRs [34], although a more recent survey of 642 690 SNPs failed to replicate this finding [24]; the Surui, even though all presumed related individuals have been dropped from the analysis, may share many recent common ancestors [35].
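The reasoning that links θ, effective population size and LD can be made concrete with a short sketch. It assumes a mutation rate shared across populations, as in the text, and uses a standard drift-recombination approximation for the expected r² (often attributed to Sved) that is not part of the original analysis. All numerical values and function names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def ne_from_theta(theta, mu):
    """Invert theta = 4 * Ne * mu to obtain the effective population size Ne."""
    if mu <= 0:
        raise ValueError("mutation rate must be positive")
    return theta / (4.0 * mu)


def expected_r2(ne, c):
    """Approximate expected r^2 at drift-recombination equilibrium: 1 / (1 + 4 * Ne * c).

    c is the recombination fraction between the two loci. This approximation
    ignores sampling error and mutation; it is an illustrative assumption.
    """
    return 1.0 / (1.0 + 4.0 * ne * c)


# Hypothetical values, not estimates from any population in the panel.
mu = 5e-4                      # assumed shared mutation rate
theta_mainstream = 20.0
theta_isolate = 10.0

ne_mainstream = ne_from_theta(theta_mainstream, mu)
ne_isolate = ne_from_theta(theta_isolate, mu)

# With equal mu, the ratio of Ne values equals the ratio of theta values.
print(ne_isolate / ne_mainstream, theta_isolate / theta_mainstream)

# A lower Ne predicts a higher expected r^2 at the same recombination fraction.
c = 1e-4
print(expected_r2(ne_mainstream, c), expected_r2(ne_isolate, c))
```

Under these assumptions, θ values that do not differ between isolates and mainstream populations imply similar Nₑ, and therefore similar expected LD at a given genetic distance.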
The present study was designed to mimic the conditions under which most genomewide association studies are performed, namely: i) focus on gene regions; ii) common SNPs, usually defined through a MAF threshold in a reference population; and iii) use of tagSNPs, often defined with an r² > 0.8 threshold. We have shown that, under these conditions, the number of SNPs that would need to be typed is only slightly lower in isolates. It is increasingly being recognized, and we provide empirical results to that effect, that the value of isolated populations in genetic epidemiology lies not in higher LD brought about by a presumably reduced Nₑ, but in other characteristics such as large and accessible families, deep genealogical records or low environmental variance.
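How an r² > 0.8 threshold translates into the number of SNPs to be typed can be sketched with a simple greedy tagging procedure. This is an illustrative simplification and not the tagging method used in the study: it picks, at each step, the untagged SNP that covers the most remaining untagged SNPs. The function name and the pairwise r² matrices are hypothetical.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Greedily choose tag SNPs from a symmetric matrix of pairwise r^2 values.

    A SNP is considered captured by a tag if it is the tag itself or if its
    r^2 with the tag is above the threshold. Ties go to the lowest index.
    """
    untagged = set(range(len(r2)))
    tags = []
    while untagged:
        def coverage(i):
            # Number of still-untagged SNPs that SNP i would capture.
            return sum(1 for j in untagged if j == i or r2[i][j] > threshold)

        best = max(sorted(untagged), key=coverage)
        captured = {j for j in untagged if j == best or r2[best][j] > threshold}
        tags.append(best)
        untagged -= captured
    return tags


# Hypothetical gene region with one block of high LD (SNPs 0, 1, 2).
high_ld = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.50, 0.10],
    [0.85, 0.50, 1.00, 0.20],
    [0.10, 0.10, 0.20, 1.00],
]
# Hypothetical region in which no pair exceeds the threshold.
low_ld = [
    [1.0, 0.3, 0.3, 0.3],
    [0.3, 1.0, 0.3, 0.3],
    [0.3, 0.3, 1.0, 0.3],
    [0.3, 0.3, 0.3, 1.0],
]
print(greedy_tag_snps(high_ld))
print(greedy_tag_snps(low_ld))
```

In the first toy region two tags capture all four SNPs, whereas in the second every SNP must be typed. The differences in the proportion of high-r² pairs between isolates and non-isolates reported here are small, so the corresponding savings in tags would be modest.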