Our study evaluated the allele frequency and genotype prevalence of polymorphisms that have known or proposed associations with common diseases in a large, minority-enriched, and nationally representative sample of the US population. This is the first relatively large-scale, population-based effort in the United States to obtain such data by race/ethnic group. These data and future planned analyses will serve as an important reference for investigations into US population structure, for examinations of gene–disease associations in other investigations of the NHANES data set, for calculation of attributable risk, and for use as a reference by researchers in the design of further studies to discover associations of alleles and genotypes with common diseases.
Estimates of allele frequency and genotype prevalence are available from a number of existing gene variant databases, including the International HapMap Project (21) and the SNP500Cancer Database (45). However, comparisons between NHANES III and such databases are limited because of significant differences in inclusion criteria, study populations, and classification of racial and ethnic groups between NHANES III and the other studies. Especially important is that these public databases function as genomic discovery tools. Consequently, their study populations were drawn largely from a small number of non-population-based samples. These small numbers of participants preclude accurate estimation of allele frequency and genotype prevalence, especially for rare variants or those that vary significantly by race and ethnicity. We compared two variants in MTHFR with other data resources and found substantial differences in allele frequencies. SNP500Cancer reports the C allele frequency of rs731236 (VDR) as 48.3% (95% confidence interval (CI): 35.3, 61.3) in non-Hispanic whites and as 35.4% (95% CI: 21.4, 49.4) in the African-American population. However, NHANES III estimates are 38.1% (95% CI: 36.0, 40.3) and 28.2% (95% CI: 26.8, 29.6), respectively. In conclusion, the NHANES III estimates of allele frequency and genotype prevalence in the US population are more representative and stable than are those calculated from previously available data.
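The difference in interval width quoted above can be translated into a rough effective sample size. The following is a minimal sketch, assuming a simple binomial (Wald) interval for an allele frequency; the function name `implied_chromosomes` is hypothetical. It is not the method used to produce either set of intervals: the NHANES III intervals come from survey-design-based variance estimation, so the figure for NHANES III should be read only as an effective number of chromosomes.

```python
def implied_chromosomes(freq, lower, upper, z=1.96):
    """Approximate number of chromosomes implied by a Wald confidence interval.

    Assumes half_width = z * sqrt(freq * (1 - freq) / n), i.e. simple random
    sampling of alleles. This is an illustrative assumption only.
    """
    half_width = (upper - lower) / 2.0
    return z ** 2 * freq * (1.0 - freq) / half_width ** 2


# C allele of rs731236 in non-Hispanic whites, as quoted in the text.
snp500 = implied_chromosomes(0.483, 0.353, 0.613)
nhanes = implied_chromosomes(0.381, 0.360, 0.403)
print(f"SNP500Cancer: about {snp500:.0f} chromosomes")
print(f"NHANES III:   about {nhanes:.0f} effective chromosomes")
```

Under this simplifying assumption, the SNP500Cancer interval corresponds to a few dozen individuals, whereas the NHANES III interval corresponds to an effective sample more than an order of magnitude larger.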
In this study, allele frequency (in 88 of 90 genetic variants) and genotype prevalence (in 87 of 90 variants) differed significantly by race/ethnic group. Non-Hispanic blacks had considerable differences in minor allele frequency compared with non-Hispanic whites, with almost one-quarter of variants differing by at least 20% (absolute difference). In contrast, less than 10% of variants differed by at least 20% in allele frequency between Mexican Americans and non-Hispanic whites. Differences in allele and genotype frequency could partially contribute to differences in disease occurrence between population subgroups. As an example, the Pro12Ala variant of PPARG (rs1801282) has been studied extensively in relation to type 2 diabetes, with the Pro allele (C) being associated with increased disease prevalence (50). This finding has been duplicated in some genome-wide association studies (13), although not in all populations (51). The higher CC genotype prevalence in non-Hispanic blacks (95.0%) compared with non-Hispanic whites (75.8%) may be a strong contributing factor to the increased risk of type 2 diabetes among non-Hispanic blacks, as this PPARG variant has been estimated to have a large population attributable risk of ~25% (50). Because differences in the occurrence of common human diseases among populations reflect variation in genetic factors, environmental factors, and their interaction, population-based genotype data, when coupled with other disease risk factors, will give us better insight into the causes of population differences in the occurrence of various diseases.
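To illustrate how genotype prevalence enters an attributable-risk calculation, the sketch below uses Levin's formula, PAR = p(RR - 1) / [1 + p(RR - 1)], where p is the prevalence of the risk genotype and RR is its relative risk. The function name `levin_par` is hypothetical, and the relative risk of 1.25 is an illustrative assumption, not an estimate from this study or from the cited literature. Only the two genotype prevalences are taken from the text.

```python
def levin_par(prevalence, relative_risk):
    """Population attributable risk by Levin's formula.

    prevalence:    proportion of the population carrying the risk genotype
    relative_risk: risk in carriers divided by risk in non-carriers
    """
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


ASSUMED_RR = 1.25  # hypothetical value, for illustration only

# CC genotype prevalence of rs1801282 as reported in the text.
for group, prevalence in [("non-Hispanic whites", 0.758),
                          ("non-Hispanic blacks", 0.950)]:
    par = levin_par(prevalence, ASSUMED_RR)
    print(f"{group}: PAR = {par:.1%} under the assumed relative risk")
```

For a fixed relative risk, the attributable risk rises with genotype prevalence, which is why subgroup-specific prevalence estimates matter for this type of calculation.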
On the other hand, allele frequency and genotype prevalence did not differ significantly between men and women for most of the genetic variants studied (≥97.8%). Similar findings on allele frequency or genotype prevalence by sex have also been reported in some large studies (32). Although we report statistically significant differences by age for approximately one-fifth of the genetic variants studied, most of these differences were no longer present after adjustment for race/ethnicity. This finding is likely attributable to the differences in age distribution between the race/ethnic groups. Some of the significant differences in allele frequencies by age may indicate survival advantage, and other studies have found variants in or near genes such as MTHFR and TNF to be associated with aging or longevity. However, few genes have been reproducibly shown to do so (59), and our results could be due to insufficient sample sizes or to statistical chance.
One concern is that NHANES III included multiple individuals from the same household (average, 1.59 individuals per household; range, 1–11), which could affect the estimation of allele and genotype frequencies. However, the estimates were calculated by using methods specifically designed to analyze data from surveys with complex designs. These methods apply the NHANES III sample weights and adjust the variance of each estimate for the correlation among observations. NHANES III is a population-based survey that reflects the actual and overall genetic structure of the general US population, which contains many related individuals within or between subpopulations. Thus, inclusion of related individuals in the NHANES III survey should enhance the generalizability of estimates derived from these data.
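The general idea of a design-based estimate can be sketched as follows. This is a minimal illustration, assuming a weighted ratio estimator with Taylor-linearized variance computed from primary sampling unit (PSU) totals within strata; it is not the software or exact procedure used in this study. The record layout and the function name `weighted_allele_frequency` are hypothetical.

```python
import math
from collections import defaultdict


def weighted_allele_frequency(records):
    """Weighted minor allele frequency and its linearized standard error.

    records: iterable of (stratum, psu, weight, minor_allele_count), where
    minor_allele_count is 0, 1, or 2 for each participant (assumed layout).
    """
    records = list(records)
    # Each participant contributes two alleles, scaled by the sample weight.
    total_alleles = sum(2.0 * w for _, _, w, _ in records)
    freq = sum(w * c for _, _, w, c in records) / total_alleles

    # Sum the linearized residuals within each PSU, so that correlated
    # observations (e.g., members of one household) are handled together.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, c in records:
        psu_totals[stratum][psu] += w * (c - 2.0 * freq) / total_alleles

    variance = 0.0
    for psus in psu_totals.values():
        n = len(psus)
        if n < 2:
            continue  # a single-PSU stratum adds no variance in this sketch
        mean = sum(psus.values()) / n
        variance += n / (n - 1) * sum((t - mean) ** 2 for t in psus.values())
    return freq, math.sqrt(variance)


# Toy data: one stratum, two PSUs, unequal weights.
toy = [("s1", "a", 3.0, 2), ("s1", "b", 1.0, 0)]
print(weighted_allele_frequency(toy))
```

Because residuals are summed within PSUs before the variance is computed, related individuals sampled from the same cluster do not artificially narrow the standard error.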
Some potential limitations of this study are notable. First, NHANES III categorizes race and ethnicity according to self-reported affiliation, as do most epidemiologic studies. There is considerable literature on the accuracy of this social measure as a proxy for genetic ancestry (61). Despite the possible misclassification or oversimplification of genetic ancestry, these data may help to elucidate the uncertain contribution of genetic variation (65) to the complex interactions among social, environmental, and behavioral influences in a diverse population that contribute to racial and ethnic health disparities. Another concern is that homozygotes were not detected for some rare polymorphisms in this study, and thus the statistical tests for these genetic variants may not be reliable. In addition, future studies of gene–disease associations and gene–environment interactions with rare variants may be limited by insufficient sample sizes when analyses are performed separately for each race/ethnic group and control for large numbers of variables.
In the near future, we plan to use race and ethnicity, as well as geographic information, to conduct a focused examination of the genetic substructure of the US population and subpopulations. This issue is generating increased interest, because latent population substructure has been discovered in populations previously thought to be relatively homogeneous (67). Such analyses are, therefore, especially important given the heterogeneity of the US population and the high levels of admixture within African-American and Mexican-American populations (69). Multiple studies demonstrate that population substructure must be taken into account in the design and interpretation of genetic association studies (67). Further research on population characteristics and genetic diversity will be invaluable in conducting genetic epidemiologic studies in the United States.
Determination of the prevalence of genetic polymorphisms associated with common diseases of public health importance in the US population and in subgroups of the population is a critical first step in evaluating the genetic epidemiology of complex diseases. These prevalence estimates can be used in predicting sample size requirements for future epidemiologic studies to evaluate genetic determinants of susceptibility to chronic and infectious diseases, the severity of disease, and interactions with other risk factors. Because data on genotype frequency are particularly sparse for non-Hispanic blacks and Mexican Americans, our estimates are useful in sample size calculations for studying the genomic contribution to the health of these populations. Investigations currently underway examine the associations of the reported genetic variants with select nutritional, biochemical, and clinical characteristics in the NHANES III data set that serve as markers or risk factors for numerous health outcomes. These outcomes include asthma and chronic obstructive pulmonary disease, diabetes, cardiovascular disease, viral infections, and osteoporosis.
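As an illustration of how a prevalence estimate feeds into study design, the sketch below computes the approximate number of cases needed in an unmatched case-control study with equal numbers of cases and controls. It assumes the standard normal-approximation formula for comparing two proportions, with the genotype prevalence in controls taken from a population estimate. The function name `cases_needed` and all input values are hypothetical and are not drawn from this study.

```python
import math
from statistics import NormalDist


def cases_needed(control_prevalence, odds_ratio, alpha=0.05, power=0.80):
    """Approximate cases required (equal number of controls assumed).

    control_prevalence: genotype prevalence among controls (population value)
    odds_ratio:         odds ratio the study should be able to detect
    """
    if odds_ratio == 1.0:
        raise ValueError("odds_ratio must differ from 1")
    p0 = control_prevalence
    # Genotype prevalence among cases implied by the target odds ratio.
    p1 = p0 * odds_ratio / (1.0 + p0 * (odds_ratio - 1.0))
    p_bar = (p0 + p1) / 2.0
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_beta * math.sqrt(p0 * (1.0 - p0) + p1 * (1.0 - p1))) ** 2
    return math.ceil(numerator / (p1 - p0) ** 2)


# Hypothetical inputs: a rare and a common genotype, same target odds ratio.
for prevalence in (0.05, 0.30):
    print(prevalence, cases_needed(prevalence, odds_ratio=1.5))
```

The required number of cases rises steeply as the genotype becomes rarer, so a prevalence estimate that is off by even a few percentage points can materially change the planned study size.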
With the recent successes of genome-wide association studies, the resource of the NHANES III DNA bank offers significant opportunities to move beyond investigations of candidate genes, as was done here. Many recent genome-wide association studies have uncovered replicable genetic associations with diseases such as breast (4), prostate (7), and colorectal (8) cancers; heart disease (11); diabetes (13); and obesity (18). However, many of these large-scale, case-control studies did not use representative samples of the underlying populations from which the cases were derived. NHANES is the only nationally representative, population-based sample survey that systematically collects physical, physiologic, imaging, laboratory, and interview data on a large number of individuals in the United States. Use of a whole-genome approach to assess the prevalence of genetic polymorphisms, including copy number variants, in the NHANES III DNA bank will be an important next step toward identifying genetic variants that can help to predict disease susceptibility and progression. This approach will also provide the basis for estimating the numbers of people in the United States who may benefit from genome-based tools, such as risk factor reduction; disease screening efforts; or diagnostic tests, drugs, or other preventive or therapeutic interventions. Current and future NHANES III prevalence estimates will be deposited into a publicly accessible database for research.
Thus, this first effort in NHANES begins to lay a strong scientific foundation for studying the impact of genetic variation on common diseases in the United States and for the future evaluation of biomarkers and diagnostic tests. Information derived from NHANES will provide an important reference and will enhance the translation of genomic information into clinical and public health practice.