

Cancer Epidemiol Biomarkers Prev. Author manuscript; available in PMC 2009 June 1.
Published in final edited form as:
PMCID: PMC2507870

Comparing Genetic Ancestry and Self-Described Race in African Americans Born in the United States and in Africa


Genetic association studies can be used to identify factors that may contribute to disparities in disease evident across different racial and ethnic populations. However, such studies may not account for potential confounding if study populations are genetically heterogeneous. Racial and ethnic classifications have been used as proxies for genetic relatedness. We investigated genetic admixture and developed a questionnaire to explore variables used in constructing racial identity in two cohorts – 50 African Americans (AAs) and 40 Nigerians. Genetic ancestry was determined by genotyping 107 ancestry informative markers. Ancestry estimates calculated with maximum likelihood estimation were compared with population stratification detected with principal component analysis. Ancestry was approximately 95% west African, 4% European, and 1% Native American in the Nigerian cohort and 83% west African, 15% European, and 2% Native American in the AA cohort. Therefore, self-identification as AA agreed well with inferred west African ancestry. However, the cohorts differed significantly in mean percentage west African and European ancestries (P < 0.0001) and in the variance for individual ancestry (P ≤ 0.01). Among AAs, no set of questionnaire items effectively estimated degree of west African ancestry, and self-report of a high degree of African ancestry in a three-generation family tree did not accurately predict degree of African ancestry. Our findings suggest that self-reported race and ancestry can predict ancestral clusters, but do not reveal the extent of admixture. Genetic classifications of ancestry may provide a more objective and accurate method of defining homogenous populations for the investigation of specific population-disease associations.

Keywords: African American, race, ancestry, genetic admixture


Genome-wide case-control association studies provide a powerful tool for investigating possible genetic factors that may contribute to the health disparities observed among different racial and ethnic populations. Populations with different ancestral backgrounds may carry different genetic variants, and these may contribute to the variations in disease incidence and outcomes seen in specific racial and ethnic groups(1). Association studies can most easily identify disease-associated alleles when study groups are genetically similar, sharing a similar ancestral background(2). However, individual ancestry is not an easily assayed, simple category, and consequently race continues to be used as a proxy for genetic relatedness in clinical and other biological studies(3-6). There is currently no consensus on how best to examine or characterize different racial or ethnic groups when designing and conducting such studies. Two main approaches have been used to approximate individual ancestry in biologic studies: (1) using self-identified race and ethnicity (SIRE) which may capture common environmental influences as well as ancestral background and (2) genotyping a panel of markers that show large frequency differentials between major geographic ancestral groupings(7, 8). Both approaches have limitations. Self-identified racial categories may not always consistently predict ancestral population clusters, and evidence suggests it may take large sample sizes and numerous markers to describe genetic clusters that correspond to SIRE groupings(9-11). Racial categories are also imprecise and inconsistent, since they may potentially vary within the same individual over time(12, 13). Furthermore, their use risks reinforcing racial divisions in society. On the other hand, more objective analyses that genotype markers which are highly informative for ancestry may not be economically practical and are limited by the requirement of serum or fresh tissue for DNA extraction. 
Genetically determined ancestry may not capture unmeasured social factors that may impact on differences in health outcomes. There are also unique ethical challenges when linking biological phenotypes with genetic markers for specific racial groups, and caution must always be used when attributing biological differences (e.g., disease risk, treatment response) to different populations.

Understanding the ancestral background of study subjects is most important in genetic studies of admixed populations, such as African Americans (AAs), who represent an admixture of Africans, Europeans, and Native Americans(14). Genetic studies have demonstrated that AAs form a diverse group with percent European admixture estimated to range between 7 and 23%(14-16). Genotyping of self-identified AAs participating in the Cardiovascular Health Study (CHS) revealed that among self-reported Africans there are differences in genetic ancestry that are correlated with some clinically important endpoints(15).

In this study, we compared the degree of genetic admixture in two cohorts – AAs and Nigerians – by genotyping a panel of 107 single nucleotide polymorphisms (SNPs) that are highly informative for ancestry (ancestry informative markers, AIMs). We developed a 26-item questionnaire to explore the variables used in constructing racial identity. We assessed how well self-reported race and ancestry matched genetic ancestry, as determined using our panel of AIMs. We also tested the association between questionnaire responses and degree of west African ancestry to identify questions and combinations of questions that may serve as proxies for estimating the proportion of west African ancestry. Specifically, we assessed whether self-report of grandparents' ancestry among African Americans could be used to predict genetic ancestry.

Materials And Methods

Study Subjects

Study subjects were self-identified United States (US)-born AAs or Nigerian immigrants from either the Yoruba or Ibo cultures. The west African ancestral population that contributed to our panel of AIMs consisted of 37 people from West Africa, the majority of whom were from Nigeria (please refer to Ancestral Populations section and unpublished correspondence). Furthermore, a significant number of Nigerians from the Yoruba and Ibo cultures were known to either work or reside in the communities where recruitment was planned (unpublished correspondence). Thus, we chose Nigerian immigrants as the comparison group. Subjects were recruited from the Washington Heights and Brooklyn communities of New York through postings, newspaper advertisements, and word of mouth, and through discussion with investigators at the Brooklyn campus of Long Island University (LIU). Study recruitment was conducted in collaboration with the Herbert Irving Comprehensive Cancer Center (HICCC) Research Recruitment and Minority Outreach Core, a shared facility for recruitment and retention of human subjects in clinical research, which maintains a strong commitment to recruiting minority subjects to clinical trials. This study was approved by the Columbia University Medical Center (CUMC) Institutional Review Board (IRB; Protocol AAAA1500) and Long Island University IRB. Subjects were screened for participation using the following criteria: AAs – subjects identified themselves as AA, were born in the US, and identified both parents as AAs who were born in the US; Nigerians – subjects were from the Yoruba or Ibo cultures, and subjects were either born in Nigeria or both parents were born in Nigeria and immigrated to the US. These were the only entry criteria for study participation, and medical history information was not collected during screening.
Subjects consented to participation in the study, donated blood specimens, and completed questionnaires at the General Clinical Research Center at the Irving Center for Clinical Research of Columbia University, CUMC (ICCR; Protocol 3324). Subjects received $40 compensation after completing all aspects of the study.

Ancestral Populations

Ancestral groups that were studied consisted of west African, European, and Native American populations. The west African ancestral population consisted of 37 people from West Africa, and DNA samples were provided by Paul McKeigue. We specifically chose to focus on West Africa since the history of AAs reflects the forced migration of slaves from mainly West Africa(14). The European population consisted of 42 European American samples from Coriell's North American Caucasian panel. The Native American population consisted of 15 people who were Mayan and 15 who were Nahua, with DNA samples provided by Mark Shriver.

Selection Of AIMs

The AIMs used in this study were bi-allelic SNPs that were selected from the Affymetrix 100K SNP chip (Affymetrix, Santa Clara, CA) based on “informativeness” for ancestry in the ancestral population samples genotyped. We used an iterative process for selecting our AIMs since the AA population is a mixture of three ancestral populations: west Africans, Europeans, and Native Americans. For each of the three possible pairs of ancestral populations, we identified markers where the difference in allele frequency (δ) was at least 0.5 between any two ancestral populations. Once we identified such markers, we selected a group of 107 AIMs that were adequately distributed across the genome, with the markers being far enough apart that they were in linkage equilibrium in the ancestral populations. The average distance between markers was about 2.4 × 10^7 bp.
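The δ criterion described above can be sketched in a few lines of Python. This is only an illustration: the marker names and allele frequencies are hypothetical (they are not taken from Table 1), the function names are invented for this example, and the reading that a marker qualifies when at least one population pair reaches δ ≥ 0.5 is an assumption about the selection rule. The genome-spacing and linkage-equilibrium steps are not modeled.

```python
from itertools import combinations

# Hypothetical allele frequencies (NOT from Table 1): marker -> frequency of
# one allele in each ancestral population.
FREQS = {
    "snp_a": {"west_african": 0.90, "european": 0.10, "native_american": 0.15},
    "snp_b": {"west_african": 0.40, "european": 0.45, "native_american": 0.50},
    "snp_c": {"west_african": 0.20, "european": 0.30, "native_american": 0.85},
}


def pairwise_deltas(freqs):
    """Return {(pop1, pop2): |p1 - p2|} for every pair of populations."""
    return {
        (a, b): abs(freqs[a] - freqs[b])
        for a, b in combinations(sorted(freqs), 2)
    }


def informative_markers(table, threshold=0.5):
    """Keep markers whose delta reaches the threshold for at least one pair.

    Assumption: one qualifying population pair is enough to retain a marker.
    """
    selected = {}
    for marker, freqs in table.items():
        deltas = pairwise_deltas(freqs)
        if max(deltas.values()) >= threshold:
            selected[marker] = deltas
    return selected


selected = informative_markers(FREQS)
for marker, deltas in selected.items():
    print(marker, {pair: round(d, 2) for pair, d in deltas.items()})
```

In this toy table, the marker with similar frequencies in all three populations is dropped, while the two markers with a large frequency differential in at least one pairing are retained.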

Table 1 lists the AIMs examined, their rs number, chromosomal location, calculated allele frequencies in each ancestral population, and delta values for each ancestral population pairing.

Table 1
Table of 107 Ancestry Informative Markers

Genotyping approximately 100 AIMs was predicted to provide estimates of ancestry with correlation coefficients greater than 0.9 for the true individual ancestral proportions. Simulations with different numbers of AIMs using each of three major methods to estimate ancestry (maximum likelihood estimation, ADMIXMAP, and Structure) by Tsai et al. indicate that the Pearson's product moment correlation coefficient for agreement between individual ancestry estimates and true individual ancestry proportions is 0.79-0.81, 0.87-0.88, and 0.93 for 25, 50, and 100 AIMs, respectively(17). These simulations are based on a model where the markers being tested have a mean informativeness of 0.15. Furthermore, 100 markers were predicted to be an adequate number for identifying admixture proportions in a three-way population admixture as seen among AAs(17).

Genotype Analysis

All study participants signed an informed consent document for blood donation and DNA preparation and testing. Peripheral venous blood samples (12 mL) were collected by venipuncture from each participant in tubes containing EDTA. DNA extraction was conducted using the Gentra DNA isolation platform (Minneapolis, MN) as previously described. Briefly, whole blood was combined with RBC Lysis Solution and then centrifuged to isolate the buffy coat. Peripheral blood leukocytes were then lysed with Cell Lysis Solution and mixed with Protein Precipitation Solution. This mixture was centrifuged and the protein pellet was discarded. DNA was precipitated from the supernatant using 100% isopropanol and cleaned with 70% ethanol. Final DNA concentrations were within the range of 220-660 μg/mL.

Genotyping of AIMs was performed using iPLEX reagents and protocols for multiplex polymerase chain reaction (PCR), single base primer extension (SBE), and generation of mass spectra, as per the manufacturer's instructions (for complete details see iPLEX Application Note, Sequenom, San Diego, CA). Genotyping was conducted at the Functional Genomics Core, Children's Hospital Oakland Research Institute (Oakland, CA). Four multiplexed assays containing 28, 27, 26, and 26 SNPs (total = 107 SNPs) were performed using each DNA sample. Briefly, initial multiplexed PCR was performed in 5-μL reactions on 384-well plates containing 5 ng of genomic DNA. Reactions contained 0.5 U HotStar Taq polymerase (Qiagen, Valencia, CA), 100 nM primers, 1.25X HotStar Taq buffer, 1.625 mM MgCl2, and 500 μM dNTPs. Following enzyme activation at 94 °C for 15 min, DNA was amplified with 45 cycles of 94 °C × 20 sec, 56 °C × 30 sec, 72 °C × 1 min, followed by a 3-min extension at 72 °C. Unincorporated dNTPs were removed using shrimp alkaline phosphatase (0.3 U, Sequenom, San Diego, CA). Single-base extension was carried out by addition of SBE primers at concentrations from 0.625 μM (low molecular weight primers) to 1.25 μM (high molecular weight primers) using iPLEX enzyme and buffers (Sequenom) in 9-μL reactions. Reactions were desalted, and SBE products were measured using the MassARRAY Compact system (Sequenom). Mass spectra were analyzed using TYPER software (Sequenom) to generate genotype calls and allele frequencies.

Development Of Study Questionnaire

We developed a 26-item questionnaire that explored beliefs about race, ethnicity, and nationality. This questionnaire asked participants standard demographic information, including gender, age, household income, and place of birth. The questionnaire included closed-ended questions that have been used previously to measure Racial-Ethnic Identity (REI) using the Racial-Ethnic Identity Subscales. This tool, based on the Oyserman, Gant, and Ager tripartite model of REI, assesses REI by measuring connectedness, embedded achievement, and awareness of racism(18). Using the 5-point Likert response scale (1 = strongly disagree, 2 = disagree, 3 = neither agree nor disagree, 4 = agree, 5 = strongly agree), participants indicated their agreement with 13 statements that tested all three aspects of REI. Alpha reliability for each of these aspects has been reported in the 0.6 to 0.7 range(19). The 26-item study questionnaire also used the Likert response scale to assess the importance of different physiognomic characteristics when estimating African ancestry. The questionnaire included a three-generation family tree in which participants filled in the race and birth-country of their grandparents and parents. To test response reliability, two separate questions that asked participants to write down their ethnicity were included in the questionnaire (questions 12 and 19).

Calculation Of Population Admixture And Estimates Of Ancestry

Population admixture proportions were calculated and compared using two methods, maximum likelihood estimates (MLE) and principal components analysis (PCA). The MLE approach was implemented in a JAVA program (available from S. Huntsman upon request). This program uses information on allele frequencies from each ancestral population and all the study participants(20, 21). (For a detailed description of the application of MLE to admixture proportion calculations, please refer to Tsai, 2005)(17). PCA was applied to the genotype data for the study participants using S-PLUS 7.0 for Windows (Insightful Corporation, Seattle, WA) to order participants by degree of genetic variation. The first principal component indicates a continuous axis of genetic variation which codes the greatest degree of variance among study participants.
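To make the MLE idea concrete, the sketch below estimates one individual's ancestry proportions from genotypes at bi-allelic markers, given fixed allele frequencies in each ancestral population. It is not the JAVA program used in this study: the expectation-maximization update is one common way to maximize this kind of likelihood and is substituted here as an assumption, the function names are invented, and the frequencies and genotypes are hypothetical toy values. Frequencies are assumed to lie strictly between 0 and 1.

```python
import math


def log_likelihood(q, genotypes, freqs):
    """Binomial log-likelihood (up to a constant) of ancestry proportions q.

    genotypes[j] is the count (0, 1, or 2) of the reference allele at marker j;
    freqs[j][k] is that allele's frequency in ancestral population k.
    """
    ll = 0.0
    for g, f in zip(genotypes, freqs):
        p = sum(qk * fk for qk, fk in zip(q, f))  # expected allele frequency
        ll += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return ll


def estimate_ancestry(genotypes, freqs, iterations=500):
    """EM updates for q with the ancestral allele frequencies held fixed."""
    k = len(freqs[0])
    q = [1.0 / k] * k              # start from equal proportions
    n_alleles = 2 * len(genotypes)
    for _ in range(iterations):
        new_q = [0.0] * k
        for g, f in zip(genotypes, freqs):
            p = sum(qk * fk for qk, fk in zip(q, f))
            for i in range(k):
                # Expected number of this marker's alleles drawn from population i.
                new_q[i] += g * q[i] * f[i] / p
                new_q[i] += (2 - g) * q[i] * (1 - f[i]) / (1 - p)
        q = [v / n_alleles for v in new_q]
    return q


# Hypothetical frequencies, columns: (west African, European, Native American).
FREQS = [
    (0.90, 0.10, 0.20),
    (0.80, 0.20, 0.10),
    (0.10, 0.90, 0.30),
    (0.20, 0.10, 0.90),
    (0.85, 0.15, 0.50),
    (0.10, 0.20, 0.80),
]
GENOTYPES = [2, 2, 0, 0, 2, 0]  # hypothetical individual

q_hat = estimate_ancestry(GENOTYPES, FREQS)
print([round(v, 3) for v in q_hat])
```

Each update keeps the proportions non-negative and summing to one, so no explicit constraint handling is needed. With only six toy markers the estimate is very coarse, which is the practical reason a panel on the order of 100 AIMs is used.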

Statistical Methods

Independent t-tests with equal variances not assumed were used to compare mean admixture proportions in the two cohorts. Levene's test for equality of variances was used to compare variances for each ancestral grouping in the two cohorts. Cohen's kappa measurement of reliability for questionnaire responses was calculated by comparing responses on two questions that asked for the same information with slightly different wording. Questionnaire responses were analyzed with univariate chi-square (χ²) analyses to identify significant correlations between participants' responses and percentage west African ancestry. PCA and factor analysis were used to identify a set of questionnaire items that predicted degree of west African ancestry. PCA was applied to determine if the 26-item questionnaires could be reduced to fewer variables that describe the total variation in questionnaire responses. Factor analysis was then used to determine which questionnaire items contributed most to the variance in questionnaire responses. Logistic regression was applied to assess for an association between the identified factors and percentage west African ancestry. Self-reported grandparent race was used to assess the sensitivity, specificity, and positive predictive value of self-reported ancestry in a three-generation family tree for calculated degree of west African ancestry. Statistical significance was defined as P < 0.05. The statistical software packages used were SPSS 14.0 for Windows (SPSS Inc., Chicago, IL) (t-tests, Levene's test, Cohen's kappa, χ² analyses) and S-PLUS 7.0 for Windows (Insightful Corporation) (factor analysis, PCA).
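For readers unfamiliar with the unequal-variance form of the t-test, the following sketch computes the test statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library. The analyses in this study were run in SPSS; the function name and the two small samples of ancestry proportions below are hypothetical and serve only to show the arithmetic.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for a two-sample t-test without assuming equal variances."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances (n - 1)
    se2 = va / na + vb / nb                          # squared standard error
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical individual west African ancestry proportions in two cohorts.
cohort_a = [0.95, 0.93, 0.97, 0.96, 0.94]
cohort_b = [0.80, 0.90, 0.70, 0.85, 0.75]

t_stat, dof = welch_t(cohort_a, cohort_b)
print(round(t_stat, 2), round(dof, 2))
```

The P value would then be read from a t distribution with the (generally non-integer) degrees of freedom returned here; that step is omitted because it requires a distribution function outside the standard library.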


Participant Characteristics

50 AAs and 40 Nigerians from the New York metropolitan area participated in this study. Participant characteristics are shown in Table 2. Mean age, median household income bracket, and mean years of schooling all varied significantly between the two cohorts (P < 0.05). These demographic variables were therefore included in the factor analysis of the questionnaire items.

Table 2
Demographic Characteristics of Study Participants

Population Substructure and Admixture in the Two Cohorts

Genetic scoring using AIMs and MLE revealed that ancestry was approximately 95% west African, 4% European, and 1% Native American in the Nigerian cohort and 83% west African, 15% European, and 2% Native American in the AA cohort (Table 3). The mean percentage of west African and European ancestries varied significantly between the two cohorts (P < 0.0001). The mean percentage of Native American ancestry did not vary significantly between the two cohorts (P = 0.087). Furthermore, the variance of individual ancestry within each cohort differed significantly between the two groups for all three geographic ancestries (P = 0.002, P < 0.0001, and P = 0.011 for the variance of west African, European, and Native-American ancestry, respectively; data not shown).

Table 3
Ancestry Admixture Proportions of Study Participants

We analyzed AA study participants according to quartiles of increasing percentage of west African ancestry to determine how well self-reported race accorded with inferred genetic population cluster (Table 4). Of the 50 participants who self-identified as AA, only one was found to have a minority (i.e., < 50%) of west African ancestry upon genotype testing.

Table 4
Quartiles of Percentage African Ancestry (African-American Subjects)

Reliability of Questionnaire Responses

To assess the level of intra-participant reliability of the questionnaire responses, a pair of redundant questions was inserted into the questionnaire (questions 12 and 19). These items asked participants to indicate their ethnicity. The first question asked participants what they consider their ethnicity to be; the second question asked participants what they record when ethnicity information is requested on forms or surveys. The Cohen's kappa measurement of agreement for responses to these two questions was 0.658, indicating good reliability of questionnaire responses.
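Cohen's kappa corrects the raw agreement between two sets of categorical answers for the agreement expected by chance. A minimal sketch of the calculation follows; the function name and the four paired responses are hypothetical and are not the study data, which were analyzed in SPSS.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Cohen's kappa for two equal-length lists of categorical responses."""
    n = len(first)
    # Observed proportion of participants giving the same answer twice.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Agreement expected by chance from the marginal label frequencies.
    c1, c2 = Counter(first), Counter(second)
    expected = sum(c1[label] * c2[label] for label in c1) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical answers to two redundant ethnicity questions.
question_12 = ["A", "A", "B", "B"]
question_19 = ["A", "A", "B", "A"]

kappa = cohens_kappa(question_12, question_19)
print(kappa)  # observed 0.75, expected 0.5, so kappa = 0.5
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, so the reported value of 0.658 sits well above chance but short of perfect consistency.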

Correlation of Genetic Ancestry with Self-Reported Ancestry

For the AA cohort, we used a three-generation family tree to assess the correlation between self-reported grandparent race and genetic ancestry calculated from the AIMs. For this analysis, we dichotomized responses for race into those consistent with African ancestry and those consistent with non-African ancestry. Responses of “African,” “African-American,” and “black” were all keyed as consistent with African ancestry. Using this definition, all of the AAs described that the race of at least three of their four grandparents was consistent with African ancestry. Thus a self-report that three or more grandparents were of a race consistent with African ancestry had a sensitivity of 100%, specificity of 0%, and positive predictive value of 80% for determining a calculated west African ancestry of 76-100%.
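The reported sensitivity, specificity, and positive predictive value follow directly from a 2 × 2 table in which every participant screens positive. The counts below are not stated in the text; they are inferred from the cohort size (50) and the 80% positive predictive value, and should be read as an assumption used to show the arithmetic. The function name is likewise invented for this example.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and positive predictive value from 2x2 counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


# Inferred counts (assumption): all 50 AAs reported at least three grandparents
# of a race consistent with African ancestry, so there are no negative reports.
metrics = diagnostic_metrics(
    tp=40,  # positive report, calculated west African ancestry of 76-100%
    fp=10,  # positive report, calculated west African ancestry below 76%
    fn=0,   # negative report, ancestry of 76-100% (none: nobody reported negative)
    tn=0,   # negative report, ancestry below 76% (none)
)
print(metrics)  # sensitivity 1.0, specificity 0.0, ppv 0.8
```

Because no participant gave a negative report, the test cannot separate the two groups: the positive predictive value simply equals the proportion of the cohort with high calculated ancestry.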

We performed univariate χ² analyses to identify any questionnaire items that were significantly associated with percentage west African ancestry. West African ancestry was divided into high and low percentage ancestry at the mean value of 85% among AAs. When the entire dataset was analyzed, the following questionnaire items were found to significantly predict a high percentage of west African ancestry: birthplace, self-described nationality, language spoken at home, number of generations in one's family that have lived in the US, self-described ethnicity, and a high estimation of the importance for one's community that one succeed in school (P < 0.05). However, these effects were lost when the US-born AA cohort was examined alone (data not shown), and we therefore did not test these factors for an independent effect. Thus no single questionnaire item significantly predicted the degree of west African ancestry in the US-born AA cohort.

We next tried to identify a group of questions that might serve as a proxy for estimating west African ancestry. Using PCA, we found that the two top components explain 42% of the total variation in the data. Factor analysis was then used to identify two factors, each consisting of a group of questions, which explain this large portion of the variance in survey responses. The first factor included the REI subscales, participants' rating of the importance of different physiognomic characteristics when estimating African ancestry, their rating of agreement with the statement that “people of African ancestry share physical traits,” and their rating of agreement with the statement, “I have similar physical traits to other people in my racial/ethnic group.” The second factor consisted of an average of household income bracket and years of schooling. Scores from each factor were used in a logistic regression with percentage west African ancestry in the AA cohort. Unfortunately, the regression analysis revealed no significant association between responses on these two sets of questions (i.e., first two components) and degree of west African ancestry (P = 0.47).

Comparison of Different Methods to Estimate Ancestry

We applied PCA to the AIMs genotypes of our study population to investigate the variance in both the AIMs selected to determine ancestry and the actual study subjects. We attempted to identify a first principal component consisting of the AIMs that describe the largest part of the variance in genotype frequencies among participants. This could be used to order participants by degree of genetic variation. The relative level of genetic variation should approximate degree of African ancestry and would then be compared with the ancestry estimates calculated using MLE. Unfortunately, we were unable to identify a principal AIMs component using this method (Figure 1). As shown in Figure 1, all 107 AIMs appeared to be fairly independent and unrelated. This is consistent with the method by which the AIMs were selected to be widely spaced throughout the genome. (Please refer to Materials and Methods – Selection of AIMs.) Interestingly, when we performed PCA to investigate the variance in actual study subjects (Figure 2), we found that the first principal component separated out two groups of subjects that significantly varied in percentage west African ancestry as determined using the 107 AIMs and MLE (P < 0.0001). The first principal component explained 62% of the variability in west African ancestry and consisted of 62 of the study participants. For the subjects included in the first principal component, the average and median percentage west African ancestries were 93% and 95%, respectively. In other words, the majority of study subjects in both cohorts (AA and Nigerian) were found to be highly similar, and these “more similar” subjects comprised the principal component. For the remaining 28 subjects that were not included in the first principal component, the average and median percentage west African ancestries were 78% and 81%, respectively.

Figure 1
Principal Components Analysis of the Ancestry Informative Markers
Figure 2
Principal Components Analysis of Participant Genotype Frequencies


In this study, we genotyped a panel of 107 AIMs to investigate the degree of west African ancestry among AAs and Nigerians and administered a questionnaire to explore the variables used by each cohort in constructing racial identity. We found that while nearly all self-identified AAs had a majority of west African ancestry, AAs had significantly more European admixture and greater admixture variability than the Nigerians. Self-report of a high degree of African ancestry in a three-generation family tree did not accurately predict the degree of west African ancestry calculated from our AIMs. Analysis of questionnaire responses revealed that no simple question proxy effectively estimates the degree of west African ancestry among US-born AAs. However, relative degree of west African ancestry could be effectively determined using both MLE and PCA. The results of our study thus suggest that while self-identified race could identify a cohort of individuals with a high degree (>80%) of west African ancestry, an admixture-matched case-control design may be more accurate and objective for conducting genetic association studies in admixed populations.

There are conflicting reports in the literature regarding the ability of self-identified race to serve as an accurate predictor of population clusters. In our study, self-reported race generally accorded well with inferred genetic population cluster. The mean west African ancestry among the AA participants was 83%, and only one of the participants in the US-born AA cohort did not have a majority of west African ancestry on genotype analysis. Thus, based on our genotyping data, if members of the AA cohort were assigned to one of the five major population genetic clusters (African, Caucasian, Pacific Islander, East Asian, and Native American, as defined by Risch, 2002)(7), all but one of the participants would be classified together in the African cluster. (Of note, our study is limited by the relatively small sample sizes of both cohorts, which may lead to skewing of both mean ancestry estimates and collected questionnaire information. Thus, the population level inferences should be taken with caution.) In contrast, in a previous study, Wilson et al. used microsatellite markers to analyze individuals from eight populations and observed that genetically inferred clusters corresponded poorly to commonly used racial labels(11). However, this study used far fewer markers and also classified Ethiopians as Blacks and New Guineans as Asians, while more recent population studies suggest that the genetic ancestries of these groups are European and Pacific Islander, respectively(7). Other studies have found that given sufficient numbers of markers and sample sizes, self-defined race may correspond well with inferred genetic clusters. Rosenberg et al. tested the correspondence of predefined population groups with those inferred from individual multilocus genotypes and found general agreement between the genetic and predefined populations(9). Tang et al. studied 3,636 subjects participating in the Family Blood Pressure Program who identified themselves as belonging to one of four racial groups (white, African American, East Asian, and Hispanic). Subsequent genetic cluster analysis using microsatellite markers produced four major clusters with near-perfect correspondence with the four racial categories(10).

The AA cohort in our study had a mean of 15% European admixture, which is consistent with previous reports of a range of 7-23% European admixture among US AAs(14-16). Of note, the estimates of 4% European and 1% Native American ancestry in the Nigerian population are likely attributable to bias in the ML estimates arising from the limited number of markers. We found that among participants, there was a significantly higher proportion of admixture and higher variability in admixture proportions in the US-born AA cohort compared to a population that emigrated from Africa (i.e., Nigerians) (Table 3). The significant variation in individual ancestry estimates among the AA cohort suggests that this group, like the CHS AA cohort(15), represents a diverse population consisting of several subpopulations. For participation in the AA cohort, subjects identified both parents as AAs who were born in the US. Although data regarding grandparental race were not used to screen study participation, these data were collected through a three-generation family tree during administration of the questionnaire. In this study population, all AA subjects described that the race of at least three of their four grandparents was consistent with African ancestry. Individuals and society have historically classified children of mixed race ancestry as AA, even when one parent is Caucasian, Asian, or Native American. For AAs, this is a remnant of the “Jim Crow” laws and the “One Drop” rule or “Rule of Hypodescent”. Thus, identification as AA would still occur in cases where the parents and grandparents were of mixed race ancestry. This could also contribute to the greater European admixture and greater admixture variability seen in the AA cohort.

The two cohorts were found to differ significantly in income bracket and education level, raising the possibility of a confounding relationship between socioeconomic status (SES) and degree of west African ancestry. In fact, others have found significant interactions between SES, genetic ancestry, and disease outcome(22). In our study, however, analyses of income and education within each cohort suggested that SES is unlikely to represent an important bias in our study population, as neither income nor education significantly correlated with degree of west African ancestry within either group (data not shown). SES is a complex construct, operates on multiple levels, and may be time dependent(23). It is therefore possible that there were unmeasured confounders for which no amount of correction would control. Generalizations cannot be made regarding the relationship between ancestry and SES, and the potential confounding effect of SES must be addressed specifically in individual studies.

The limited ability of self-reported race to effectively reveal population substructure was also seen in a recent study that compared population structure inferred from individual ancestry estimates with self-reported race(24). In a case-control study of early-onset lung cancer, Barnholtz-Sloan et al. reported that the frequency of the drug-metabolizing gene GSTM1 null “risk” genotype varied both by individual European ancestry and by case-control status within self-reported race, particularly among the AA study participants. Furthermore, they found that genetic risk models that adjusted for European ancestry provided a better fit for this relationship between GSTM1 genotype and lung cancer risk compared with the model that adjusted for self-reported race. The results of this and other studies suggest that the likelihood of identifying disease-susceptibility loci will be lower in studies that rely on less accurate measures of population stratification (e.g., self-reported race)(25). Thus, genetic classifications of ancestry may provide a more objective and accurate method of defining homogeneous populations which can be used to investigate specific population-disease associations.

Because cost and feasibility issues may discourage the incorporation of admixture testing in the design of both preclinical and clinical studies, we developed a questionnaire to search for questions or combinations of questions that may reliably serve as a proxy for west African ancestry. We are not aware of any previous reports that have investigated relationships between factors used by individuals in constructing racial identity and individual ancestry estimates as determined through genotyping. When the entire dataset (i.e., two cohorts) was examined as a whole, several questionnaire items were found to have a significant association with percentage west African ancestry. Many of these items appear to be related to characteristics of an immigrant population (e.g., Nigerians), such as birthplace, self-described nationality, language spoken at home, number of family generations living in the US, self-described ethnicity, and estimated importance of one's success in school to one's community. However, when the AA cohort was examined separately, no question or set of questions significantly predicted degree of west African ancestry, as determined both by univariate analysis and factor analysis of survey items. Self-reported ancestry using a three-generation family tree also could not accurately predict degree of west African ancestry. Although reported grandparent race was highly sensitive for ancestry, it was not specific. Since all participants in our study reported that at least three grandparents were of a race consistent with African ancestry, this information could not distinguish those who actually had a high degree of African ancestry. The lack of specificity of reported grandparent race likely is due to the imprecision of racial categories.
Our family tree analysis was limited by the relatively similar background of our study participants; for example, all AA participants indicated that three or all of their grandparents were of a race consistent with African ancestry. Thus, studying a population with a greater degree of admixture may be more appropriate for investigating the utility of a three-generation family tree in predicting degree of African ancestry. A recent study by Burnett et al., however, suggests that self-reported ancestry may have poor reliability(26). In this study, Burnett et al. prospectively asked siblings to list the countries of origin of both parents. Participants in this study were recruited at the Mayo Clinic and were primarily Caucasian. Nevertheless, Burnett et al. found that only 49% of sibling pairs agreed completely on the countries of origin of both parents and this agreement only increased to 68% when named countries were postcoded into six population genetic clusters (Eurasia, East Asia, Oceania, America, Africa, and the Kalash group of Pakistan).
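The contrast between sensitivity and specificity for reported grandparent race can be made concrete with a short sketch. The ancestry fractions and the 0.80 cutoff for "high" ancestry below are hypothetical assumptions, not study data or a threshold used in this study; the example shows only why an indicator that every participant endorses has perfect sensitivity but no specificity.

```python
def sensitivity_specificity(test_positive, truly_positive):
    """Return (sensitivity, specificity) for paired boolean sequences."""
    pairs = list(zip(test_positive, truly_positive))
    tp = sum(1 for t, a in pairs if t and a)
    fn = sum(1 for t, a in pairs if not t and a)
    tn = sum(1 for t, a in pairs if not t and not a)
    fp = sum(1 for t, a in pairs if t and not a)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Hypothetical genotype-based ancestry fractions (NOT study data).
ancestry = [0.95, 0.90, 0.85, 0.70, 0.60]
threshold = 0.80  # illustrative cutoff for "high" west African ancestry
truly_high = [a >= threshold for a in ancestry]

# Every participant reports at least three grandparents of a race
# consistent with African ancestry, so the indicator is always positive.
reports_three_plus = [True] * len(ancestry)

sens, spec = sensitivity_specificity(reports_three_plus, truly_high)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Because the indicator never takes a negative value, it identifies everyone with high ancestry but cannot rule out anyone with lower ancestry, whatever cutoff is chosen.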

Applying PCA to the AIMs genotypes of our study population, we were unable to identify a principal component that could be used to order participants by relative degree of west African ancestry for comparison with the percentage west African ancestry calculated using MLE. However, PCA of the study participants' genotype frequencies was able to identify a cluster of subjects with highly similar SNP distributions. The majority of subjects (both AA and Nigerian) with a high percentage of MLE-calculated west African ancestry were included in the first principal component, indicating an overall concordance between individual ancestry estimates calculated with MLE and subject groupings with PCA. A few subjects included in the first principal component had a lower percentage west African ancestry calculated with MLE. We suspect this difference may result from methodological differences and our use of point estimates rather than confidence intervals for estimated ancestry in our MLE calculations.

We began this study to investigate the degree of admixture among self-reported AAs following our previous studies of breast cancer tumor biology in AAs(3, 4, 27). Genetic heterogeneity in study subjects could impair the ability of a study to detect true biological differences between racially defined, apparently uniform groups. We have found that genetic ancestry proportions can vary significantly within groups of individuals who would self-identify as the same racial group. Our work suggests that to maximize the predictive value of clinical inferences from genome-wide association studies, one must consider within- as well as between-population association. Thus, while self-identified race can identify a cohort of individuals with a high degree of African ancestry, admixture-matched case-control studies will be more effective in studying differences in disease incidence and outcomes in specific racial populations.


We wish to acknowledge Dr. I. Bernard Weinstein for invaluable advice during the preparation of this manuscript and Scott Huntsman for assistance in analyzing the AIM data. We also wish to thank Dr. Wendy K. Chung, Grace Ajuluchukwu, Marline Anderson-Slater, and Barbara Castro for assistance with this project.

Grant support: Long Island University/Herbert Irving Comprehensive Cancer Center Minority Institution/Cancer Center Partnership (CA91372, CA101388, CA101598) (AKJ, NSC, MGB); NIH fellowship T32CA09529 (RY); National Center for Minority Health Disparities Center of Excellence in Nutritional Genomics (5P60MD00022) (KBB); Tobacco-Related Disease Research Program New Investigator Award (15KT-0008) (SC); and National Institute of Heart, Lung and Blood (R01 HL078885), RWJ Amos Medical Faculty Development Award, NCMHD Health Disparities Scholar, Extramural Clinical Research Loan Repayment Program for Individuals from Disadvantaged Backgrounds (2001-2003), Sandler Center for Basic Research in Asthma and the Sandler Family Supporting Foundation (EGB).