Although most researchers traditionally focus on the top few axes of variation in a dataset, substantial information about population structure exists in lower-ranked, chromosome-level PCs. Adjusting only for global ancestry among study subjects may lead to false positives when chromosomal (local) population structure is an important confounding factor [53]. Using chromosome-based analysis, we detected fine-scale substructure beyond the broad population-level classifications that had previously been explored in this dataset using genome-wide average estimates. The study of population structure at the chromosome level has broader practical relevance to researchers who use genetic and genomic approaches in gene mapping, because genetic diversity is directly related to recombination rate (meiosis), which differs among chromosomes, and because genes are not randomly distributed along chromosomes.
By restricting our analysis to each chromosome independently, instead of using global average estimates, we report for the first time that the number of fine-scale subpopulations is chromosome dependent. For example, chromosome 2 has two significant PCs that account for population differentiation, whereas chromosome X has 31. This result suggests that a sufficiently large number of PCs must be examined in order to find all the significant population differences. The variation in the number of chromosome-specific significant PCs might therefore reveal population structure that would have been missed if the average over all chromosomes had been used. Even though chromosome 1 is the largest chromosome, followed by chromosome 2, the number of significant PCs that account for structure is lower in both of these chromosomes than in the rest of the chromosomes. Chromosome size therefore does not predict the amount of detectable structure, much as genome size does not correlate with the biological complexity of organisms [54]. Interestingly, similar results were reported by Becquet et al
] in their study of chimpanzee population genetic structure. In plants, a recent study showed that the optimal number of subpopulations required to correct for population structure is trait dependent [56]. That study is a reminder that the number of subpopulations appropriate for one trait may not be optimal for other traits. The current analytical approach of using genome-wide average PCs as covariates controls for confounding due to global ancestry but not for confounding due to local (chromosome-based) ancestry. Recognising intra-chromosomal variation is increasingly important, especially when populations have been recently admixed.
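The contrast between genome-wide and chromosome-specific PCA can be illustrated with a minimal sketch. It assumes a genotype matrix of 0/1/2 allele counts and a chromosome label per SNP; the function names and the simulated data are hypothetical and are not the pipeline used in this study. The criterion for calling a PC significant is not restated in this section, so the sketch only reports the fraction of variance explained by the leading PCs of each chromosome.

```python
import numpy as np

def top_pc_variance(genotypes, n_pcs=10):
    """Fraction of variance explained by the leading PCs.

    genotypes: (individuals x SNPs) array of 0/1/2 allele counts.
    """
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean(axis=0)              # centre each SNP
    sd = g.std(axis=0)
    g = g[:, sd > 0] / sd[sd > 0]       # drop monomorphic SNPs, scale the rest
    singular_values = np.linalg.svd(g, compute_uv=False)
    variance = singular_values ** 2
    return (variance / variance.sum())[:n_pcs]

def per_chromosome_pca(genotypes, chrom_labels, n_pcs=10):
    """Run PCA separately on the SNPs of each chromosome."""
    genotypes = np.asarray(genotypes)
    chrom_labels = np.asarray(chrom_labels)
    return {
        int(c): top_pc_variance(genotypes[:, chrom_labels == c], n_pcs)
        for c in np.unique(chrom_labels)
    }

def simulate_two_populations(rng, n_per_pop=20, n_snps=200):
    """Toy data: chromosome 1 undifferentiated, chromosome 2 differentiated."""
    freqs_a = np.concatenate([np.full(n_snps, 0.5), np.full(n_snps, 0.2)])
    freqs_b = np.concatenate([np.full(n_snps, 0.5), np.full(n_snps, 0.8)])
    pop_a = rng.binomial(2, freqs_a, size=(n_per_pop, 2 * n_snps))
    pop_b = rng.binomial(2, freqs_b, size=(n_per_pop, 2 * n_snps))
    chrom = np.repeat([1, 2], n_snps)   # chromosome label of each SNP column
    return np.vstack([pop_a, pop_b]), chrom

if __name__ == "__main__":
    genotypes, chrom = simulate_two_populations(np.random.default_rng(0))
    for c, fractions in per_chromosome_pca(genotypes, chrom, n_pcs=3).items():
        print(c, np.round(fractions, 3))
```

In the toy data the leading PC of the differentiated chromosome explains far more variance than that of the undifferentiated one, a signal that a single genome-wide average would dilute.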
As with the chromosome-based PCA, DA shows that the classification of populations to their correct geographical regions of origin is chromosome dependent. For example, in our analysis, the number of CHB individuals correctly classified to their geographical region of origin ranged from 23 (for Chr 6) to 35 (for Chr X), while the number of correctly classified individuals in the JPT population ranged from 25 (for Chr 9) to 36 (for Chr 19). Given the growing interest in tracing ancestral origins or contributions in genetically mixed populations, DA is informative and appealing because misclassified individuals can be identified and grouped into appropriate populations prior to large-scale genotyping.
To avoid single-marker FST-based inferences of selection, which can be misleading [57], we carried out an in-depth investigation of the patterns of genetic variation in and around the highly differentiated loci, and of their effects on phenotype, using network/ontology analyses. We overlaid 126 genes (selected on the basis of FST > 0.5) onto the Ingenuity Pathways Knowledge Database (http://www.ingenuity.com). Using this approach, we confirmed the over-representation of genes implicated in hair and skin development (OCA2, HERC2, EDAR) in two of the top networks (Table ). EDA-A1 and EDA-A2 are two isoforms of ectodysplasin that are encoded by the anhidrotic ectodermal dysplasia (EDA) gene. Genetic variability in the EDA ligand has been associated with loss of hair, sweat glands and teeth [58]. The non-synonymous SNP rs1385699, identified within the EDA2 receptor gene (EDA2R), is fixed in both Asian populations, where, as an R57K substitution in EDA2R, it has a derived-allele (T) frequency of 100 per cent. The EDA2R gene product is involved in the positive regulation of NF-κB transcription factor activity specifically within the hair follicle, TNF receptor activity, embryonic development and apoptosis [60]. These genes were previously reported as candidates involved in human pigmentation phenotypes and in the development of skin cancer [61]. The most striking result of our more direct approach was the over-representation of canonical pathways related to androgen and oestrogen metabolism (Supplementary Figure S3) and of gene groups implicated in the functional categories of inflammation and of hair and skin development (Supplementary Figure S4).
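The FST > 0.5 filter used to select highly differentiated loci can be sketched as follows. This is an illustration only: it assumes a two-population comparison, a simple heterozygosity-based estimator, FST = (HT - HS) / HT, with equal population weights, and hypothetical SNP identifiers and allele frequencies. The estimator actually used in the study may differ.

```python
def wright_fst(p1, p2):
    """Two-population FST for one biallelic SNP from allele frequencies.

    Uses FST = (HT - HS) / HT with equal population weights
    (an assumption of this sketch).
    """
    p_bar = (p1 + p2) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)                      # pooled expected heterozygosity
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0  # mean within-population
    if h_t == 0.0:
        return 0.0   # monomorphic in both populations: no differentiation
    return (h_t - h_s) / h_t

def highly_differentiated(snps, threshold=0.5):
    """Return IDs of SNPs whose FST exceeds the threshold.

    snps: iterable of (snp_id, freq_in_pop1, freq_in_pop2) tuples.
    """
    return [snp_id for snp_id, p1, p2 in snps if wright_fst(p1, p2) > threshold]

if __name__ == "__main__":
    # Hypothetical allele frequencies, for illustration only.
    toy_snps = [("snp_a", 0.1, 0.9), ("snp_b", 0.3, 0.7), ("snp_c", 0.5, 0.5)]
    print(highly_differentiated(toy_snps))
```

A SNP passing the filter would then be mapped to its gene before network or ontology analysis; that mapping step is not shown here.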
Figure S3. Global canonical pathways of the 126 genes linked to genomic regions of major population differentiation. The significance threshold, shown in yellow, represents a p value of greater than 0.05. The first four sets of functions shown represent a p-value ...
Figure S4. The 16 most significant functional categories from IPA linked to the 126 genes of major population differentiation. The significance threshold, shown in yellow, represents a p value of greater than 0.05. Bars that are above the line indicate significant ...
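Over-representation of a pathway or functional category in a gene list is commonly assessed with a right-tailed hypergeometric (one-sided Fisher's exact) test. The sketch below shows that calculation under stated assumptions; it is not Ingenuity's implementation, and apart from the 126 selected genes the counts in the usage example are hypothetical.

```python
from math import comb

def enrichment_p_value(n_background, n_in_category, n_selected, n_overlap):
    """Right-tailed hypergeometric p-value: P(overlap >= n_overlap).

    n_background:  genes in the reference set
    n_in_category: background genes annotated to the category
    n_selected:    genes in the list under test
    n_overlap:     selected genes that fall in the category
    """
    total = comb(n_background, n_selected)
    upper = min(n_in_category, n_selected)
    tail = sum(
        comb(n_in_category, k) * comb(n_background - n_in_category, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

if __name__ == "__main__":
    # 126 selected genes as in the text; the other counts are hypothetical.
    p = enrichment_p_value(n_background=20000, n_in_category=200,
                           n_selected=126, n_overlap=6)
    print(f"p = {p:.3g}")
```

Exact integer arithmetic via `math.comb` avoids the rounding problems of multiplying many small probabilities, at the cost of speed for very large gene sets.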
In critically evaluating our results, it is important to note that our analyses, and hence our interpretations, are subject to several limitations. First, an important caveat in the use of population-level genetic databases such as HapMap is the ascertainment criterion that was imposed during the initial selection of polymorphic SNPs to be assayed [62]; the subsequent release of the HapMap database also focused primarily on common SNPs. The fundamental premise underpinning HapMap is the common disease/common variant (CD/CV) hypothesis [63].
Secondly, the HapMap study (Phase III) is currently being extended to include additional samples and diverse populations (http://www.hapmap.org). However, substantially fewer SNPs (~1.5 million) are genotyped in Phase III than were analysed in the present study, which reduces marker density and coverage. Such low coverage may miss important loci in regions of elevated molecular divergence in related populations, such as between CHB and JPT [64]. When whole-genome sequences (such as those of the 1000 Genomes Project, http://www.1000genomes.org) become widely available, the ability to use many rare variants to identify short shared genomic segments will perhaps allow routine identification of regional or village-level ancestries, given a suitably large and carefully collected reference sample [65]. The 1000 Genomes Project, which aims to provide a whole-genome sequence resource for at least 1,200 individuals sampled from multiple population groups globally, will be invaluable for understanding the practical consequences of SNP ascertainment biases.
Thirdly, a SNP with a large difference in allele frequency between populations is a strong candidate to explain large differences in disease prevalence between populations [67]. This is because disease is tightly linked to survival and reproductive success, so genes responsible for variation in disease should have the most differentiated SNP frequencies between human populations. Indeed, studies have suggested that genes associated with complex diseases such as cardiovascular disease and type 2 diabetes have been targets of positive natural selection [69]. If disease genes have often been targeted by selection, then identifying loci that have experienced selection may aid disease-related research [68]. Further studies are nevertheless required to determine the extent to which differences in allele frequencies between populations predict differences in disease prevalence between populations.
The study of population genetic structure between chromosomes is a fundamental issue in population biology because it helps us to obtain a deeper understanding of the ancestral population and the associated evolutionary processes. For example, understanding heterogeneity in chromosomal ancestry in an admixed population is important because it can be a confounding factor: variation in admixture levels among individuals across chromosomes can cause false-positive associations in genetic association studies. In addition, this analysis can be a source of statistical power for ancestry-phenotype correlation studies that use observed racial/ethnic differences to find mosaic regions of the genome and map loci influencing complex phenotypes [70]. The distribution of SNP density along chromosomes will inform us about chromosomal segments that are more susceptible to selective pressures or differential patterns. Understanding how chromosomal variations in ancestry relate to disease risk is a major challenge to the biomedical research community [71]. In the USA in particular, there has been significant intermixing among racial/ethnic groups, creating a complex pattern of ancestral populations that are a mosaic of multiple continental populations. The development of chromosome-based adjustment for population structure will provide higher-resolution genographic maps and offer investigators designing genetic association studies more powerful tools for detecting stratification.
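The confounding described above can be made concrete with a small simulation. In this minimal sketch, ancestry drives both the allele frequency of a SNP and the phenotype, while the SNP itself has no causal effect. The function names and all parameter values are hypothetical. For simplicity the true ancestry indicator is used as the covariate, standing in for a PC score estimated from the chromosome that carries the SNP.

```python
import numpy as np

def snp_effect(phenotype, snp, covariates=None):
    """Least-squares effect of a SNP on a phenotype, optionally adjusted.

    covariates: (individuals x k) array, e.g. PC scores from the chromosome
    carrying the SNP (an assumption of this sketch).
    """
    y = np.asarray(phenotype, dtype=float)
    columns = [np.ones_like(y), np.asarray(snp, dtype=float)]
    if covariates is not None:
        cov = np.asarray(covariates, dtype=float).reshape(len(y), -1)
        columns.extend(cov.T)            # one design column per covariate
    design = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]                       # coefficient on the SNP

def simulate_confounded_snp(rng, n=200):
    """Toy data: ancestry affects allele frequency and phenotype; the SNP is not causal."""
    ancestry = np.repeat([0.0, 1.0], n // 2)
    snp = rng.binomial(2, 0.2 + 0.6 * ancestry)
    phenotype = 2.0 * ancestry + rng.normal(0.0, 0.5, size=n)
    return phenotype, snp, ancestry

if __name__ == "__main__":
    y, snp, ancestry = simulate_confounded_snp(np.random.default_rng(1))
    print("unadjusted:", round(snp_effect(y, snp), 3))
    print("adjusted:  ", round(snp_effect(y, snp, ancestry), 3))
```

The unadjusted estimate is a spurious association produced entirely by stratification, and it shrinks towards zero once the ancestry covariate is included.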
The final question we need to answer is: what causes population differentiation? Humans have wide altitudinal and latitudinal distribution ranges, and hence different individuals may face very different environmental constraints and selection pressures. Population differentiation could arise as a result of geographical separation and subsequent drift and/or bottlenecks; natural selection (ie the local adaptation process by which organisms become adapted to their environments); differential admixture with other populations; and (possibly) different mutation rates (eg through differential exposure to ionising radiation, environmental toxins, etc.). A central theme in evolutionary biology is that natural selection acting on heritable phenotypic variation will result in adaptation and differentiation among local populations inhabiting environments that differ in their selective regimes [72]. Natural selection may confer an adaptive advantage on individuals in a specific environment if an allele provides a competitive advantage. Alleles under selection are likely to occur only in those geographical regions where they confer an advantage. Alleles associated with harmful traits decrease in frequency, while those associated with beneficial traits become more common. Local adaptation acting in concert with other processes (eg recombination) is sufficiently pervasive to confound measurements of population differentiation, making a single genome-wide measurement somewhat unreliable, especially when it is applied to any specific chromosome or region.
In summary: population differentiation at the genetic level is the result of numerous processes; differentiation is measurable and quantifiable by a variety of approaches; and most of the processes leading to differentiation affect all autosomes equally, except for natural selection, which produces extreme values that reflect local adaptation. We also note that, rather than some 'normal' distribution of FST values with exceptional values occasionally reflecting natural selection, there is substantial inter-chromosomal variation in the inferred patterns and characteristics of population structure. These inter- and intra-chromosomal variations, whether across the genome as a whole or along single chromosomes, may directly affect population divergence. This study underlines the potential of chromosome-based analysis of genome-wide data to quantify substructure in populations that might otherwise appear relatively homogeneous. Before embarking on a large-scale genomic study, it is crucial to control properly for chromosome-wise stratification/confounding and to predict population memberships.