SNP databases are important resources for genetic linkage, association, and admixture studies. Both academic and commercial groups are developing large numbers of genome-wide SNP datasets, and these databases now contain over 12.6 million SNPs. However, only a small fraction of these SNPs are well characterized and validated [21]. Users of these datasets commonly ask the following questions: What is the frequency spectrum of the SNPs in these databases? How are these SNPs distributed across different ethnic and geographic populations? What fraction of the total number of SNPs is already captured by these databases?
We mined and compared the HapMap SNP database against the Affymetrix 500 K and the gene-centric Illumina 100 K SNP chips. This comparison suggests that a relatively large fraction (> 80%) of SNPs in these databases do not meet the cutoff for acceptable markers as AIMs [10], which means that they are either of very low frequency or not ancestry informative between the two ancestral populations. As a result, we developed and present the AIM panels for each database individually. Our analyses showed that the SNP databases in their current state might have some limitations for studies of complex disorders, especially in different ethnic groups, as a result of incomplete or uneven representation of SNPs along the genome [23]. As indicated above, the different databases have different sets of SNPs. Because the SNP allele frequencies were determined by different genotyping labs that used different sample sizes and genotyping methods (see Methods), it would be difficult to perform several tests to assess data quality and identify sources of experimental variation. In critically evaluating our results, it is important to note that our analyses, and hence our interpretations, are subject to several limitations. First, many of our analyses relied on data derived from available databases whose contents are, and will continue to be for some time, in a state of change. Moreover, the allele frequencies across the platforms were based on different sets of DNA samples. Therefore, our results represent a snapshot based on currently available data, and it will be important to verify these results once the human genome annotation becomes more stable. Second, the SNP allele frequencies were determined by using relatively small sample sizes (see Methods), and stochastic variation could affect the robustness of our conclusions.
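To make the two points above concrete (the AIM cutoff and the effect of small sample sizes), the following minimal sketch computes the absolute allele frequency difference between two ancestral populations together with its binomial standard error. It is an illustration under stated assumptions, not the procedure used in this study: the function names, the cutoff value of 0.5, and the example frequencies and sample sizes are all hypothetical, and the standard error assumes independent sampling of 2n chromosomes per population.

```python
import math

# Illustrative cutoff only; the actual AIM cutoff is defined in the cited
# reference and is NOT reproduced here.
HYPOTHETICAL_MIN_DELTA = 0.5


def allele_freq_se(p: float, n_individuals: int) -> float:
    """Binomial standard error of an allele frequency estimated from
    n diploid individuals (2n sampled chromosomes)."""
    return math.sqrt(p * (1.0 - p) / (2 * n_individuals))


def delta_with_se(p_a: float, n_a: int, p_b: float, n_b: int):
    """Absolute allele frequency difference between populations A and B,
    and its standard error assuming the two samples are independent."""
    delta = abs(p_a - p_b)
    se = math.sqrt(allele_freq_se(p_a, n_a) ** 2 + allele_freq_se(p_b, n_b) ** 2)
    return delta, se


def passes_cutoff(p_a: float, p_b: float,
                  min_delta: float = HYPOTHETICAL_MIN_DELTA) -> bool:
    """True if the frequency difference reaches the (hypothetical) cutoff."""
    return abs(p_a - p_b) >= min_delta


if __name__ == "__main__":
    # Hypothetical SNP: allele frequency 0.9 in population A, 0.2 in B,
    # each estimated from 60 individuals.
    d, se = delta_with_se(0.9, 60, 0.2, 60)
    print(f"delta = {d:.3f}, SE = {se:.3f}, passes = {passes_cutoff(0.9, 0.2)}")
```

Because the standard error shrinks with the square root of the number of sampled chromosomes, a SNP whose estimated difference lies close to the cutoff can change classification when re-genotyped in a larger sample.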
Several studies have discussed how similar human populations are in their genetic constitution; hence, a large sample size may enable the detection of small differences in rare outcomes. Although we observed a strong correlation in allele frequencies between SNPs from different platforms (data not shown), confirming these allele frequency estimates in a larger sample will be important. The analytical caveats associated with each database are also debatable, such as how well Yorubans or CEU serve as surrogates for each ancestral population and how much of the data (for example, in HapMap) is transferable to the diverse populations of Africa, where there is extreme adaptive variation among countries.
Most studies consider Europe to be a relatively homogeneous population. Consequently, it has been argued that European population stratification does not represent a substantial source of bias in epidemiologic studies [36]. However, recent autosomal SNP studies have highlighted significant patterns of structure within Europe along a north-south axis [37], as well as several significant axes of stratification within Europe, most prominently in a northern-southeastern trend but also along an east-west axis. The study emphasized the importance of considering population stratification in studies using European and European-American individuals, and the need to develop EuroAIMs (European ancestry informative markers) for ancestry estimation and correction [38]. Moreover, the fundamental premise underpinning HapMap is the common disease/common variant (CD/CV) hypothesis [39]. How much information we can capture from rare variants is not clear [40].