With the advent of high-throughput chip genotyping technology, population structure can be detected at very high resolution (1). Results from the Human Genome Diversity Panel (HGDP) suggest that one dimension of genetic structure in human populations falls along geographic/continental lines. Similar evidence was derived from a study of 1056 individuals from 52 populations using 377 microsatellite markers (2) as well as a follow-up examination of the same individuals using 650 000 common single nucleotide polymorphisms (SNPs) (1). Although the HGDP is not a random sample of the world’s populations, the results are consistent with the hypothesis of a serial founder effect with a single origin in sub-Saharan Africa. The HGDP tended to sample from smaller, more isolated groups, and it is still too early to know how well the description arising from the HGDP data set correlates with what would be found in larger outbred population groups. Nonetheless, it is clear that geographic differentiation will have a significant impact on disease association studies. For populations that have been relatively isolated, self-identified ancestry may therefore be an effective proxy for population structure and hence useful in association studies (3). The meaning of this potential heterogeneity could likewise be of interest. Neutral variants will diverge due to random genetic drift, whereas functional variants may be under selective pressure due to adaptation to the new environments encountered during migration. Distinguishing these two sources of variation may be helpful in identifying true disease variants in association studies.
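The divergence of neutral variants by drift can be illustrated with a small simulation. The sketch below is only a toy under stated assumptions: a Wright-Fisher model with constant population size, no migration, mutation or selection, and a simple ratio of within-population to total heterozygosity as the measure of differentiation. The function names, population size and number of generations are hypothetical and are not taken from any of the studies cited here.

```python
import numpy as np

def drift(freqs, n_diploid, generations, rng):
    """Neutral Wright-Fisher drift: each generation resamples 2N allele copies."""
    p = np.asarray(freqs, dtype=float).copy()
    copies = 2 * n_diploid
    for _ in range(generations):
        p = rng.binomial(copies, p) / copies
    return p

def differentiation(p1, p2):
    """1 - (mean within-population heterozygosity / total heterozygosity)."""
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    return 1 - h_within.sum() / h_total.sum()

# Assumed toy setting: 2000 neutral loci, two daughter populations of
# 100 diploid individuals each, evolving independently for 50 generations.
rng = np.random.default_rng(1)
ancestral = rng.uniform(0.1, 0.9, size=2000)
pop_a = drift(ancestral, n_diploid=100, generations=50, rng=rng)
pop_b = drift(ancestral, n_diploid=100, generations=50, rng=rng)

fst_start = differentiation(ancestral, ancestral)  # identical populations
fst_drift = differentiation(pop_a, pop_b)          # after independent drift
print(f"before split: {fst_start:.3f}, after drift: {fst_drift:.3f}")
```

In this toy model the two populations start with identical allele frequencies and differ afterwards only because of random sampling, which is the baseline against which a locus under selection would have to stand out.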
In addition to its effect on the frequency of true causal mutations, geographic differentiation can create analytic challenges when data from two or more populations are combined. The potential for confounding has been recognized for many years, and a variety of statistical methods have been developed to control for false associations in this setting (4). The extension of GWA studies into more populations will likely mean that the significance of structure will have to be investigated at a much higher level of resolution. The issue of population structure will become even more complex as sample sizes increase (9). While it has generally been assumed that the HapMap provides a reasonable guide to coverage with tagging SNPs in the three regional populations studied in detail to date, a recent analysis based on extensive re-sequencing suggests this assumption may not be correct (10). Using near-complete data from 76 genes as a reference standard, it was found that even with the dense coverage provided by a gene chip such as the Affymetrix 6.0, only about 45% of SNPs were tagged (r² > 0.8) in the YRI sample (10). These findings demonstrate that current commercial chips may miss important variants in association studies in African-origin populations. In fact, very few chronic disease studies have been carried out among African populations, and Yorubans have served as the prototype in most intensive surveys of African genomes (11). An ongoing study sponsored by the NHGRI will add three additional populations from East Africa and offer important new insights into what must be a complex pattern of population structure within Africa. Indeed, a regional survey of Latin American populations has demonstrated an almost bewildering array of possible ancestral groupings, even though there is thought to be less geographic differentiation there than in Africa as a result of a shorter history of human occupation (14).
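The tagging criterion used above (a SNP counts as covered if some chip SNP has r² > 0.8 with it) can be made concrete with a short sketch. It is not the analysis of reference 10: it assumes unphased 0/1/2 genotype counts and uses the squared genotype correlation as a simple stand-in for haplotype-based r², and the function names and the tiny data set are hypothetical.

```python
import numpy as np

def pairwise_r2(genotypes):
    """Squared Pearson correlation between all pairs of SNP columns.

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts.
    Monomorphic SNPs should be removed first, since their correlation
    is undefined.
    """
    r = np.corrcoef(genotypes, rowvar=False)
    return r ** 2

def tagged_fraction(genotypes, chip_idx, threshold=0.8):
    """Fraction of off-chip SNPs with r^2 > threshold to at least one chip SNP."""
    r2 = pairwise_r2(genotypes)
    n_snps = genotypes.shape[1]
    off_chip = np.setdiff1d(np.arange(n_snps), chip_idx)
    best = r2[np.ix_(off_chip, chip_idx)].max(axis=1)
    return float(np.mean(best > threshold))

# Hypothetical toy data: 8 individuals, 4 SNPs. SNP 1 duplicates SNP 0,
# SNP 3 is SNP 0 with the alleles swapped, SNP 2 is only weakly correlated.
snp0 = np.array([0, 1, 2, 0, 1, 2, 0, 1])
snp2 = np.array([0, 0, 1, 1, 2, 2, 0, 1])
genotypes = np.column_stack([snp0, snp0, snp2, 2 - snp0])

# Treat SNP 0 as the only marker on the chip.
coverage = tagged_fraction(genotypes, chip_idx=np.array([0]))
print(f"fraction of off-chip SNPs tagged: {coverage:.2f}")
```

Applied to re-sequencing data from a given population, the same calculation yields a coverage figure that depends on that population's LD pattern, which is why a chip designed around one reference panel can tag a smaller share of variants elsewhere.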
Quantitative estimates of variation in haplotype frequency and structure have recently been obtained with the data in the HGDP (15) (Fig. 1). With similar reference data, it should be possible to test the consistency of the haplotype structure at the location of tagging SNPs that have been shown to be associated with disease risk. These data might provide insight into the generalizability of associations that are based on proxy markers rather than causal mutations. As noted, however, given the sampling frame for this study, it is not clear how well these estimates apply to large national populations; for example, many of the samples from China were from minority groups (15). Analyses that attempt to resolve issues related to finer-level structure will also have to devise new sets of ancestry informative markers (AIMs). On the other hand, given current technology, it may actually be more reasonable to use all the marker data available in GWA studies to infer ancestry, rather than AIMs alone, since the selection of AIMs is often biased toward the known population structure.
Figure 1. Haplotype cluster frequencies for 156 consecutive SNPs on chromosome 2 in the region surrounding the LCT gene (136.373–136.478 Mb). At each SNP, relative frequencies of haplotype clusters are displayed on a thin vertical line.
The development of methods based on principal components analysis (PCA) will be a substantial help in confronting the problem of more subtle structure. For example, even within a single ethnic population, such as European Americans, detectable population stratification still exists (17). PCA methods may also make it possible to pool participants from different geographic/ancestral origins into a single analytic sample. When dense markers are available, such as those from GWA studies, we have recently demonstrated that a PCA of a marker matrix can eliminate spurious association due to stratification and allow pooling of individual-level data from heterogeneous populations (4). PCA can use either AIMs or random markers without consideration of linkage disequilibrium (LD) among the markers. Further, the PCA approach is more powerful than both the genomic-control method (7) and meta-analysis that combines P-values from individual studies, and it is less computationally intensive than the Structure association method (6). Combining data from geographically distant populations can definitely enhance fine-mapping analyses. Experience with common variants in the structural gene associated with circulating levels of angiotensin-converting enzyme (ACE) demonstrated how pooling can increase power dramatically (4). Populations with a simpler haplotype structure, and thereby greater coverage with genome-wide tagging SNPs, can be useful for identifying broad regions of interest. Conversely, populations with shorter and more numerous haplotypes can help localize influential variants. It must also be recognized, however, that pooling ethnically different samples can, under some circumstances, introduce additional noise, since multiple population-specific variants (some of which may be rare) may affect the trait variation.
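How a PCA of the marker matrix can remove spurious association when heterogeneous samples are pooled may be easier to see in a small simulation. The sketch below is a generic illustration, not the specific procedure of reference 4: it assumes two populations that differ in allele frequencies and in mean trait value, no causal SNP at all, and the common approach of adding the leading principal component as a covariate in a linear regression. All names and simulation settings are hypothetical.

```python
import numpy as np

def top_pcs(genotypes, k):
    """Scores on the top-k principal components of the standardized markers."""
    g = genotypes - genotypes.mean(axis=0)
    sd = g.std(axis=0)
    keep = sd > 0                      # drop monomorphic markers
    g = g[:, keep] / sd[keep]
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :k] * s[:k]

def assoc_t(snp, trait, covars=None):
    """t statistic for the SNP slope in an ordinary least-squares regression."""
    n = len(trait)
    cols = [np.ones(n), snp]
    if covars is not None:
        cols.extend(covars.T)          # one column per covariate
    x = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(x, trait, rcond=None)
    resid = trait - x @ beta
    sigma2 = resid @ resid / (n - x.shape[1])
    cov = sigma2 * np.linalg.inv(x.T @ x)
    return beta[1] / np.sqrt(cov[1, 1])

# Assumed toy setting: 500 individuals per population, 500 unlinked SNPs.
rng = np.random.default_rng(0)
n_per, n_snps = 500, 500
pop = np.repeat([0, 1], n_per)
freq0 = rng.uniform(0.1, 0.9, n_snps)
freq1 = np.clip(freq0 + rng.normal(0.0, 0.1, n_snps), 0.05, 0.95)
geno = rng.binomial(2, np.where(pop[:, None] == 0, freq0, freq1))

# The trait depends on population membership only; no SNP is causal.
trait = 2.0 * pop + rng.normal(0.0, 1.0, 2 * n_per)

pcs = top_pcs(geno, k=1)
naive = np.mean([assoc_t(geno[:, j], trait) ** 2 for j in range(n_snps)])
adjusted = np.mean([assoc_t(geno[:, j], trait, pcs) ** 2 for j in range(n_snps)])
print(f"mean t^2 without PCs: {naive:.2f}, with 1 PC: {adjusted:.2f}")
```

Under the null hypothesis the mean squared t statistic should be close to 1. In this toy setting the unadjusted pooled analysis is strongly inflated, while including the leading component brings the statistics back toward their null expectation, so individual-level data from both populations can be analysed together.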