The recent revolution in genetics promises enormous gains for understanding and improving health. Across genome-wide association studies since 2007, genetic variants at nearly 100 regions of the genome have been associated with an increased risk for diseases with complex genetic causes, such as diabetes, inflammatory bowel disease, heart disease, and cancer (Chanock and Hunter, 2008). Twenty-eight specific genetic variants have been linked to cancers of the breast, prostate, colon, lung, and skin (Easton and Eeles, 2008). Research is progressing rapidly (Lin et al., 2006) to determine the risks conferred by newly discovered types of genetic variation, such as copy-number variants (Feuk et al., 2006), and to elucidate the joint effects of multiple genetic variants in concert with non-genetic factors.
However, how to use genetic information to better develop, target, and evaluate policies for population-level disease prevention remains hotly debated (Pharoah et al., 2008; Gail, 2008). Although the genetic variants found so far are common, each has only a small effect on disease risk, and so modifies disease risk only slightly for most individuals. Nevertheless, reliable identification of population subgroups at high disease risk has major implications for population health (Pharoah et al., 2008).
As genetic findings accrue, evaluating their potential impact on population health requires population-representative data. In response to this pressing need, the Centers for Disease Control and Prevention and the National Cancer Institute have collaborated to conduct genotyping on a subset of the Third National Health and Nutrition Examination Survey (NHANES III). NHANES III is a nationally representative household-interview and medical examination survey of the U.S. non-institutionalized civilian population, conducted from 1988 to 1994 by the National Center for Health Statistics (NCHS) (NCHS, 1994). The nationally representative sample is obtained from a complex, stratified, multistage probability sample design with unequal selection probabilities.
These NHANES genetic data are the first U.S.-population-based genetic data. The continuing NHANES survey is the first major periodic official health survey in the world to collect genetic data. These data are a unique and valuable resource for analyzing the distribution of genetic variation in the U.S. and for estimating the potential population impact of genomic strategies for disease prevention. In addition, NHANES III oversamples non-Hispanic blacks and Mexican-Americans, important yet genetically understudied populations who also suffer from health disparities. These NHANES III data will integrate existing social, environmental, behavioral, and biologic data with genetic data to understand the determinants of health and health disparities in the U.S. (Chang et al., 2009).
However, before these impending analyses can be conducted, accurate information about familial relationships within households must be available. Related individuals in a household cannot be treated as an independent sample for genetic analyses. NHANES III collected no self-reported family relationship information. Instead, family relationships were reported with respect to a single person in the household, who is often not in the sample (U.S. Department of Health and Human Services (DHHS), National Center for Health Statistics, 1996; see HFRELR). As a result, it is impossible to determine exactly the reported relationship between two sample members. For example, one cannot presume that the adult female sample persons in a household are the mothers of the child or youth sample persons in that household. Thus the data on reported family relationships within NHANES III households are incomplete and inconclusive with regard to the actual biological relatedness of family members.
We use the NHANES III genetic data to infer familial relationships within NHANES III households. DNA labs usually track biosamples using what is colloquially called a 'DNA fingerprint' (more properly, a DNA profile), a system of DNA loci useful for forensic identification. One popular system is the AmpFlSTR® Identifiler® PCR Amplification Kit (Applied Biosystems, Foster City, CA, USA). Identifiler® contains the DNA loci used by the Combined DNA Index System (CODIS; http://www.fbi.gov/hq/lab/html/codis1.htm), which is commonly used by law enforcement agencies for forensic identification. While these loci have a track record for determining whether two DNA profiles are from the same person (or, equivalently, identical twins), the performance of these loci for inferring family relationships more distant than identical twins is less well understood (Bieber et al., 2006).
We assess the use of the Identifiler® DNA loci for inferring family relationships with nationally representative survey data. We compare two methods that estimate the likelihood ratio that a pair of household members have a hypothesized relationship versus being unrelated. The first method (the “exact method”; Evett and Weir, 1998, Ch. 5-8) uses allele frequencies, and the second (the “IBS (Identical By State) method”; Presciuttini et al., 2002) uses only the fact that alleles match between individuals. The exact method extracts information from matches on rare alleles, because matching on a rare allele is more indicative of a familial relationship than matching on a common allele. However, the IBS method does not require allele frequencies and is thus robust to inaccurate or inappropriate allele frequencies. Since the genotyped DNA samples were cell lysates with widely varying DNA concentrations, we modified both methods to account for genotyping errors. Finally, we used a modification of the exact method to account for “cryptic relatedness” (Devlin and Roeder, 1999) (also called population substructure): the fact that all ostensibly unrelated humans still share small amounts of DNA from distant common ancestors. Cryptic relatedness implies that ostensibly unrelated individuals have a residual relatedness, which can violate the independence assumptions of standard methods for relationship inference. We assess how much cryptic relatedness reduces the evidence in favor of familial relationships. We also hope that this work will introduce survey statisticians to the swiftly arriving era of genetic data from surveys.
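To make the contrast between the two approaches concrete, the following minimal sketch computes a single-locus likelihood ratio from allele frequencies and, separately, the raw count of alleles shared identical by state. It is an illustration under stated assumptions, not the implementation used in this paper: it assumes Hardy-Weinberg proportions, ignores genotyping error and cryptic relatedness, and uses hypothetical allele frequencies (`FREQS`, with a pooled `other` allele). All function names are invented for the example. A relationship is encoded by the standard probabilities (k0, k1, k2) that a pair shares 0, 1, or 2 alleles identical by descent.

```python
from collections import Counter
from itertools import product

# Hypothetical allele frequencies at one locus (must sum to 1).
FREQS = {"10": 0.25, "13": 0.05, "other": 0.70}

# Probabilities of sharing 0, 1, or 2 alleles identical by descent.
PARENT_CHILD = (0.0, 1.0, 0.0)
FULL_SIBLINGS = (0.25, 0.5, 0.25)


def genotype_prob(genotype, freqs):
    """Hardy-Weinberg probability of an unordered genotype (a, b)."""
    a, b = genotype
    return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]


def joint_prob_one_ibd(g1, g2, freqs):
    """Joint probability of g1 and g2 when exactly one allele is shared by descent."""
    total = 0.0
    # Enumerate the shared allele and each person's other (independent) allele.
    for shared, other1, other2 in product(freqs, repeat=3):
        if sorted((shared, other1)) == sorted(g1) and sorted((shared, other2)) == sorted(g2):
            total += freqs[shared] * freqs[other1] * freqs[other2]
    return total


def likelihood_ratio(g1, g2, freqs, k):
    """Likelihood ratio of the relationship encoded by k versus unrelated."""
    k0, k1, k2 = k
    unrelated = genotype_prob(g1, freqs) * genotype_prob(g2, freqs)
    identical = sorted(g1) == sorted(g2)
    related = (
        k0 * unrelated
        + k1 * joint_prob_one_ibd(g1, g2, freqs)
        + k2 * genotype_prob(g1, freqs) * identical
    )
    return related / unrelated


def ibs_shared(g1, g2):
    """Number of alleles (0, 1, or 2) that two genotypes share identical by state."""
    return sum((Counter(g1) & Counter(g2)).values())


if __name__ == "__main__":
    rare = likelihood_ratio(("13", "13"), ("13", "13"), FREQS, PARENT_CHILD)
    common = likelihood_ratio(("10", "10"), ("10", "10"), FREQS, PARENT_CHILD)
    print("match on rare allele:", round(rare, 6))
    print("match on common allele:", round(common, 6))
    print("IBS count for both pairs:", ibs_shared(("13", "13"), ("13", "13")))
```

With these hypothetical frequencies, a parent-child pair in which both members are homozygous for the rare allele 13 yields a likelihood ratio of about 1/0.05 = 20, versus about 1/0.25 = 4 for the common allele 10, even though both pairs share two alleles identical by state. Under an assumption of independent loci, per-locus ratios would be multiplied across loci.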
A by-product of our work is the first explicitly nationally representative and ethnically specific estimates of these important allele frequencies. Our allele frequency estimates could be relevant to forensic calculations requiring U.S. population-based allele frequencies.
1.1 Data Description
During the second phase of NHANES III (1991-1994), lymphocytes were frozen and cell lines were immortalized to create a DNA bank. Genetic variation data were collected from 7,159 participants aged 12 years and older. DNA was extracted by cell lysis, and the genotyping used in this paper was conducted by the Core Genotyping Facility at the National Cancer Institute (http://cgf.nci.nih.gov). See Chang et al. (2009) for all details.
We use genetic data from Identifiler® for each participant. Identifiler® tests for genetic variants at 15 DNA loci called Short Tandem Repeats (STRs). STRs are multiple copies of an identical DNA sequence arranged in direct succession in a particular region of a chromosome (Butler, 2006). For example, the DNA locus D7S820 is shown in Figure 1. This locus is on chromosome 7 (hence the D7). In the middle of this locus, the tetranucleotide sequence gata is repeated 13 times. The number of repeats names the genetic variant (called an allele), and a person has two alleles (one on each copy of chromosome 7, inherited from the mother and the father). D7S820 typically has 6-14 gata repeats. However, there can be variants in the repeated sequence motif as well; for example, the allele named 13.1 has an extra DNA base inserted in the sequence of 13 “gata” repeats in D7S820. See Butler (2006) for details on each possible allele.
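The allele-naming rule described above (an allele is named by its number of tandem repeats) can be sketched in a few lines. This is only an illustration: the flanking sequence below is made up rather than the real D7S820 sequence, the function name is hypothetical, and the sketch counts whole repeats only, so it would not distinguish variant alleles such as 13.1.

```python
import re


def count_tandem_repeats(sequence, motif="gata"):
    """Return the number of motif copies in the longest uninterrupted tandem run."""
    pattern = re.compile(f"(?:{re.escape(motif)})+", re.IGNORECASE)
    run_lengths = [m.end() - m.start() for m in pattern.finditer(sequence)]
    return max(run_lengths, default=0) // len(motif)


# Hypothetical locus: made-up flanks around 13 upper-case GATA repeats.
example_locus = "ttcc" + "GATA" * 13 + "aacg"
allele_name = count_tandem_repeats(example_locus)
print("allele:", allele_name)
```

For the made-up sequence above, the longest run contains 13 repeats, so the allele would be named 13.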
Figure 1. DNA locus D7S820 with the tetranucleotide motif repeat gata shown in upper case. This version has 13 gata repeats, so it is named allele 13. The locus is broken into chunks of length 10 for ease of counting the position of each nucleotide (the numbers give the position ...).
Identifiler® contains the 13 CODIS loci commonly used by law enforcement agencies for forensic identification: TPOX, CSF1PO, D5S818, D13S317, D16S539, TH01, D18S51, D7S820, VWA, FGA, D3S1358, D8S1179, and D21S11; Identifiler® also includes D19S433 and D2S1338. Both CODIS and Identifiler® also include the locus AMEL, but AMEL provides information only on sex. For all details on these loci, see Butler (2006).
A fictitious example of a participant's DNA profile is in . Each allele at each locus is shown, e.g. 13/10 means alleles 13 and 10 are observed. The pair of alleles is called the genotype. We also have the demographic variables of race/ethnicity, sex, and age. Sex and age for each pair of household members can help narrow down the possible familial relationships, and ethnicity is needed to select the proper allele frequencies to use in relationship inference. Given a feasible region of familial relationships, we use the genetic information to infer family relationships.
From the 7159 participants, we excluded 346 due to poor DNA quality or low DNA concentration (samples with less than 250 relative fluorescence units; these samples had data at fewer than 12 of the 16 Identifiler® loci). Furthermore, we excluded 72 participants who had a mismatch between the reported sex and the genetically determined sex from AMEL (indicative of poor data quality), yielding 6741 participants. The distribution of genotyped household size is 1:2781, 2:1070, 3:329, 4:137, 5:27, 6:13, 7:3, 8:4, 9:1, and 11:1. The genotyped household size does not count individuals who were not genotyped. Thus 3960 were in multiple-person households, yielding 3610 possible pairs of genotyped relatives within households. The 2781 participants who are the only genotyped member of their household are included to estimate allele frequencies.
To estimate nationally representative allele frequencies, NCHS statisticians provided a sample weight for each participant to weight our dataset up to the U.S. population. We categorized the race/ethnicity of participants as 'non-Hispanic White', 'non-Hispanic Black', and 'Mexican-American'. Participants who self-identified as Mexican-American in NHANES III represent a heterogeneous race-ethnic population of primarily Hispanic American Indian and Hispanic White. Because specific information on which current Office of Management and Budget categorization each of these participants represents is not available, we will use the term Mexican-American for the purposes of this publication.
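Two pieces of arithmetic in this description can be sketched in code: the count of possible within-household pairs, which is the sum of n(n-1)/2 over households of genotyped size n, and a sample-weighted allele frequency. The sketch is illustrative only. The household counts and exclusion numbers are taken from the text, but the function names, the toy genotypes, and the toy weights are hypothetical, and the weighted estimator shown (each participant's two alleles counted with that participant's sample weight) is an assumed simple form rather than the estimator used in this paper.

```python
from collections import defaultdict
from math import comb

# Genotyped household size -> number of households, as listed in the text.
HOUSEHOLD_SIZES = {1: 2781, 2: 1070, 3: 329, 4: 137, 5: 27,
                   6: 13, 7: 3, 8: 4, 9: 1, 11: 1}

# Participants remaining after the two exclusion steps described in the text.
REMAINING = 7159 - 346 - 72


def within_household_pairs(size_counts):
    """Total number of distinct pairs of genotyped members within households."""
    return sum(n_households * comb(size, 2)
               for size, n_households in size_counts.items())


def weighted_allele_frequencies(genotypes, weights):
    """Allele frequencies with each participant's two alleles weighted by the sample weight."""
    totals = defaultdict(float)
    for (allele_a, allele_b), weight in zip(genotypes, weights):
        totals[allele_a] += weight
        totals[allele_b] += weight
    denominator = 2 * sum(weights)  # two alleles per participant
    return {allele: total / denominator for allele, total in totals.items()}


if __name__ == "__main__":
    print("participants after exclusions:", REMAINING)
    print("within-household pairs:", within_household_pairs(HOUSEHOLD_SIZES))
    # Toy data: two hypothetical participants with hypothetical sample weights.
    toy = weighted_allele_frequencies([("13", "10"), ("10", "10")], [1000.0, 3000.0])
    print("toy weighted frequencies:", toy)
```

Applied to the household distribution above, the pair count reproduces the 3610 possible pairs stated in the text, and the exclusion arithmetic reproduces the 6741 remaining participants.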