Developing, targeting, and evaluating genomic strategies for population-based disease prevention require population-based data. In response to this urgent need, genotyping has been conducted within the Third National Health and Nutrition Examination Survey (NHANES III), the nationally-representative household-interview health survey in the U.S. However, before these genetic analyses can occur, family relationships within households must be accurately ascertained. Unfortunately, reported family relationships within NHANES III households based on questionnaire data are incomplete and inconclusive with regard to the actual biological relatedness of family members. We inferred family relationships within households using DNA fingerprints (Identifiler®) that contain the DNA loci used by law enforcement agencies for forensic identification of individuals. However, performance of these loci for relationship inference is not well understood. We evaluated two competing statistical methods for relationship inference on pairs of household members: an exact likelihood ratio relying on allele frequencies, and an Identical By State (IBS) likelihood ratio that only requires matching alleles. We modified these methods to account for genotyping errors and population substructure. The two methods usually agree on the rankings of the most likely relationships. However, the IBS method underestimates the likelihood ratio by not accounting for the informativeness of matching rare alleles. The likelihood ratio is sensitive to estimates of population substructure, and parent-child relationships are sensitive to the specified genotyping error rate. These loci were unable to distinguish second-degree relationships and cousins from being unrelated. The genetic data are also useful for verifying reported relationships and identifying data quality issues. An important by-product is the first set of explicitly nationally-representative estimates of allele frequencies at these ubiquitous forensic loci.
The recent revolution in genetics promises enormous gains for understanding and improving health. In all genome-wide association studies since 2007, genetic variants at nearly 100 regions of the genome have been associated with an increased risk for diseases with complex genetic causes, such as diabetes, inflammatory bowel disease, heart disease, and cancer (Chanock and Hunter, 2008). Twenty-eight specific genetic variants have been linked to cancers of the breast, prostate, colon, lung, and skin (Easton and Eeles, 2008). Research is progressing rapidly (Lin et al., 2006) to determine risks conferred by newly-discovered types of genetic variation such as copy-number variants (Feuk et al., 2006), and to elucidate the joint effects of multiple genetic variants in concert with non-genetic factors.
However, how to use genetic information to better develop, target, and evaluate policies for population-level disease prevention remains hotly debated (Pharoah et al., 2008; Gail, 2008). Although the genetic variants found so far are common, each has a small effect on disease risk and so modifies disease risk only slightly for most individuals. However, reliable identification of population subgroups at high disease risk has major implications for population health (Pharoah et al., 2008).
As genetic findings accrue, evaluating their potential impact on population health requires population-representative data. In response to this pressing need, the Centers for Disease Control and Prevention and the National Cancer Institute have collaborated to conduct genotyping on a subset of the Third National Health and Nutrition Examination Survey (NHANES III). NHANES III is the nationally-representative household-interview and medical examination survey of the U.S. non-institutionalized civilian population conducted from 1988-1994 by the National Center for Health Statistics (NCHS) (NCHS, 1994). The nationally representative sample is obtained from a complex, stratified, multistage probability sample design with unequal selection probabilities.
These NHANES genetic data are the first U.S.-population-based genetic data. The continuing NHANES survey is the first major periodic official health survey in the world to collect genetic data. These data are a unique and paramount resource for analyzing the distribution of genetic variation in the U.S. and for estimating the potential population impact of genomic strategies for disease prevention. In addition, NHANES III oversamples non-Hispanic blacks and Mexican-Americans, important yet genetically understudied populations who also suffer from health disparities. These NHANES III data will integrate existing social, environmental, behavioral, and biologic data with genetic data to understand the determinants of health and health disparities in the U.S. (Chang et al., 2009).
However, before these impending analyses can be conducted, accurate information about familial relationships within households must be available. Related individuals in a household cannot be treated as an independent sample for genetic analyses. NHANES III collected no self-reported family relationship information. Instead, family relationships were reported with respect to a single person in the household who is often not in the sample (U.S. Department of Health and Human Services (DHHS). National Center for Health Statistics., 1996, see HFRELR). As a result, it is impossible to determine exactly the reported relationship between two sample members. For example, one cannot presume that the adult female sample persons in the household are the mothers of the children/youth sample people in the household. Thus the data on reported family relationships within NHANES III households are incomplete and inconclusive with regards to actual biological relatedness of family members.
We use the NHANES III genetic data to infer familial relationships within NHANES III households. DNA labs usually track biosamples using what is colloquially called a 'DNA fingerprint' (more properly, a DNA profile), a system of DNA loci useful for forensic identification. One popular system is AmpFlSTR® Identifiler® PCR Amplification Kit (Applied Biosystems, Foster City, CA, USA). Identifiler® contains the DNA loci used by the Combined DNA Index System (CODIS; http://www.fbi.gov/hq/lab/html/codis1.htm) that is commonly used by law enforcement agencies for forensic identification. While these loci have a track-record for addressing if two DNA profiles are from the same person (or, equivalently, identical twins), the performance of these loci for inferring family relationships more distant than identical twin is less understood (Bieber et al., 2006).
We assess the use of the Identifiler® DNA loci for inferring family relationships with nationally-representative survey data. We compared two methods that estimate the likelihood ratio that a pair of household members have a hypothesized relationship versus being unrelated. The first method (“exact method” (Evett and Weir, 1998, Ch. 5-8)) uses allele frequencies and the second (“IBS (Identical By State) method” (Presciuttini et al., 2002)) uses only the fact that alleles match between individuals. The exact method extracts information out of matches on rare alleles, as matching rare alleles are more indicative of a familial relationship than matching common alleles. However, the IBS method does not require allele frequencies and is thus robust to inaccurate or inappropriate allele frequencies. Since the genotyped DNA samples were cell lysates with widely varying DNA concentrations, we modified both methods to account for genotyping errors. Finally, we used a modification of the exact method to account for “cryptic relatedness” (Devlin and Roeder, 1999) (also called population substructure): the fact that all ostensibly unrelated humans still share small amounts of DNA from distant common ancestors. Cryptic relatedness implies that ostensibly unrelated individuals have a residual relatedness, which can violate the independence assumptions of standard methods for relationship inference. We assess how much cryptic relatedness reduces the evidence in favor of familial relationships. We also hope that this work will introduce survey statisticians to the swiftly-arriving era of genetic data from surveys.
A by-product of our work is the first set of explicitly nationally-representative and ethnicity-specific estimates of these important allele frequencies. Our allele frequency estimates could be relevant to forensic calculations requiring U.S. population-based allele frequencies.
During the second phase of NHANES III (1991-1994), lymphocytes were frozen and cell lines were immortalized to create a DNA bank. Genetic variation data were collected from 7,159 participants aged 12 years and older. DNA was extracted by cell lysis and the genotyping used in this paper was conducted by the Core Genotyping Facility at the National Cancer Institute (http://cgf.nci.nih.gov). See (Chang et al., 2009) for all details.
We use genetic data from Identifiler® for each participant. Identifiler® tests for genetic variants at 15 DNA loci called Short Tandem Repeats (STRs). STRs are multiple copies of an identical DNA sequence arranged in direct succession in a particular region of a chromosome (Butler, 2006). For example, the DNA locus D7S820 is in Figure 1. This locus is on chromosome 7 (hence the D7). In the middle of this locus, the tetranucleotide sequence gata is repeated 13 times. The number of repeats names the genetic variant (called an allele), and a person has two alleles (one on each chromosome 7 inherited from the mother and father). D7S820 typically has 6-14 gata repeats. However, there can be variants in the repeated sequence motif as well; for example, the allele named 13.1 has an extra DNA base inserted in the sequence of 13 “gata” repeats in D7S820. See (Butler, 2006) for details on each possible allele.
Identifiler® contains the 13 CODIS loci commonly used by law enforcement agencies for forensic identification: TPOX, CSF1PO, D5S818, D13S317, D16S539, TH01, D18S51, D7S820, VWA, FGA, D3S1358, D8S1179, D21S11; Identifiler® also includes D19S433 and D2S1338. Both CODIS and Identifiler® also have the STR AMEL, but AMEL provides information only on sex. For all details on these loci, see (Butler, 2006).
A fictitious example of a participant's DNA profile is in Figure 1. Each allele at each locus is shown, e.g. 13/10 means alleles 13 and 10 are observed. The pair of alleles is called the genotype. We also have the demographic variables of race/ethnicity, sex, and age. Sex and age for each pair of household members can help narrow down the possible familial relationships, and ethnicity is needed to select the proper allele frequencies to use in relationship inference. Given a feasible region of familial relationships, we use the genetic information to infer family relationships.
From the 7159 participants, we excluded 346 due to poor DNA quality or low DNA concentration (samples with less than 250 relative fluorescence units; these samples had data at fewer than 12 of the 16 Identifiler® loci). Furthermore, 72 participants who had a mismatch between the reported sex and the genetically-determined (AMEL) sex (indicative of poor data quality) were excluded, yielding 6741 participants. The distribution of genotyped household size is 1:2781, 2:1070, 3:329, 4:137, 5:27, 6:13, 7:3, 8:4, 9:1, and 11:1. The genotyped household size does not count individuals who were not genotyped. Thus 3960 were in multiple-person households, yielding 3610 possible pairs of genotyped relatives within households. The 2781 participants who are the only genotyped member of their household are included to estimate allele frequencies. To estimate nationally-representative allele frequencies, NCHS statisticians provided a sample weight for each participant to weight our dataset up to the U.S. population. We categorized the race/ethnicity of participants as ‘non-Hispanic White’, ‘non-Hispanic Black’, and ‘Mexican-American’. Participants who self-identified as Mexican-American in NHANES III represent a heterogeneous race-ethnic population of primarily Hispanic American Indian and Hispanic White. Because specific information on which current Office of Management and Budget categorization each of these participants represents is not available, we will use the term Mexican-American for the purposes of this publication.
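The count of within-household pairs follows from the household-size distribution, since a household with n genotyped members contributes n(n-1)/2 pairs. A minimal check, with the distribution transcribed from the paragraph above (the variable and function names are illustrative, not from the paper):

```python
from math import comb

# Genotyped household size -> number of households, as listed in the text.
HOUSEHOLD_SIZES = {1: 2781, 2: 1070, 3: 329, 4: 137, 5: 27,
                   6: 13, 7: 3, 8: 4, 9: 1, 11: 1}

def count_within_household_pairs(size_counts):
    """Sum n-choose-2 over household sizes, weighted by household count."""
    return sum(n_households * comb(size, 2)
               for size, n_households in size_counts.items())

total_pairs = count_within_household_pairs(HOUSEHOLD_SIZES)
print(total_pairs)
```

Single-person households contribute no pairs, so they matter only for allele frequency estimation.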
Denote the genotype (the pair of alleles) for participant $k$ at locus $j$ as $l_{j,k}$. The full DNA profile of the 15 Identifiler® loci is $P_k = (l_{1,k}, l_{2,k}, \ldots, l_{15,k})$. Statistical evidence in favor of a hypothesized familial relationship $R$ (such as parent-child, full-siblings, etc.) between the two participants providing DNA profiles $(P_1, P_2)$ is measured by the likelihood ratio (LR)

$$\mathrm{LR} = \frac{P(P_1, P_2 \mid R)}{P(P_1, P_2 \mid \text{unrelated})}. \qquad (1)$$
We use the exact method and the IBS method to compute the LR as well as maximum likelihood estimates of relationships.
The likelihood for two profiles $P_1, P_2$ within the same household given a relationship $R$ is

$$P(P_1, P_2 \mid R) = \prod_{j=1}^{15} P(l_{j,1}, l_{j,2} \mid R), \qquad (2)$$

because the loci are on different chromosomes and are thus independent.
To make further progress, relationships can be parameterized in terms of Identical-By-Descent (IBD) probabilities. Two people can share 0, 1, or 2 alleles IBD. For the pair in Table 1, they share 0 alleles IBD at D13S317, at most 1 allele IBD at D16S539 (at most 1 because their matching allele 11 could be from two different ancestors, so they could share 0 alleles IBD), and at most 2 alleles IBD at CSF1PO. The probability of sharing $i$ alleles IBD is denoted by $k_i$, with $k_0 + k_1 + k_2 = 1$.
All familial relationships are defined by their IBD probabilities (Thompson, 1991) (Table 2). For example, a person must share both alleles IBD with himself or his monozygotic (identical) twin. Two unrelated people cannot share any alleles IBD (they can merely appear to share alleles due to chance; their shared alleles would be from different ancestors and so cannot be IBD). Since each parent contributes 1 allele at each locus to their child, a parent and child must share exactly 1 allele IBD. For example, the pair in Table 1 cannot be the same person or monozygotic twins, and they cannot be parent-child because they share no alleles at D13S317. As Table 2 shows, the 2nd degree relationships (grandparent-grandchild, uncle-nephew, half-sibling) have the exact same IBD probabilities and so cannot be distinguished based on IBD alone. However, IBD plus age information usually suffices. We note that 2nd degree relationships can be distinguished from each other by using correlated genetic loci (McPeek and Sun, 2000).
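Using IBD, the likelihood for two profiles is

$$P(P_1, P_2 \mid R) = \prod_{j} \sum_{i=0}^{2} P(l_{j,1}, l_{j,2} \mid \mathrm{IBD} = i)\, P(\mathrm{IBD} = i \mid R). \qquad (3)$$

The second term, $P(\mathrm{IBD} = i \mid R)$, is given by the $k_0$, $k_1$, and $k_2$ probabilities for each relationship in Table 2. The genotype probabilities $P(l_{j,1}, l_{j,2})$ are independent of $R$ given the IBD sharing because IBD defines which alleles are fixed (from a common ancestor) and the others are random.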
The genotype probability calculations $P(l_{j,1}, l_{j,2} \mid \mathrm{IBD} = i)$ are in Table 3 (Thompson, 1991). Although the Identifiler® alleles are labeled by numbers, we label the four alleles in a pair of genotypes generically by A, B, C, D. The constant factors of two and four in Table 3 reflect the fact that alleles within a genotype are unordered (i.e. AB is equivalent to BA). When IBD=0, the two genotypes are independent, so $P(l_{j,1}, l_{j,2} \mid \mathrm{IBD} = 0) = P(l_{j,1})P(l_{j,2})$. When IBD=2, the two genotypes are completely dependent, so $P(l_{j,1}, l_{j,2} \mid \mathrm{IBD} = 2) = P(l_{j,1}) = P(l_{j,2})$. When IBD=1, the calculations are more complex, e.g.
because the conditional probability involves the probabilities of having an A, the probability the IBD allele is indeed A (1), and the probability that the IBD allele is in the second position (0.5) (first term) or in the first position (0.5) (second term). See (Wagner et al., 2006; Evett and Weir, 1998) for more details. In Table 3, we also represent each probability as the probability over the alleles not IBD, so that, e.g. $P(l_{j,1} = AA, l_{j,2} = AA \mid \mathrm{IBD} = 1) = P(AAA)$. This notation will be convenient for section 2.1.1 that relaxes the assumption of independent non-IBD alleles.
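The per-locus calculation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: alleles are taken to be independent draws (no population substructure, no genotyping error), the IBD=1 case is computed by brute-force enumeration of the shared allele and the two non-IBD alleles rather than by looking up Table 3, and the `IBD_PROBS` values are the standard textbook probabilities, assumed here to match Table 2. All function names are hypothetical.

```python
from itertools import product

# Standard (k0, k1, k2) for non-inbred relationships; assumed to match Table 2.
IBD_PROBS = {
    "parent-child": (0.0, 1.0, 0.0),
    "full-sibling": (0.25, 0.5, 0.25),
    "2nd-degree": (0.5, 0.5, 0.0),
    "cousin": (0.75, 0.25, 0.0),
    "unrelated": (1.0, 0.0, 0.0),
}

def genotype_prob(g, freq):
    """P(unordered genotype g) when its two alleles are independent draws."""
    a, b = g
    return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]

def joint_prob(g1, g2, ibd, freq):
    """P(g1, g2 | IBD = ibd) for two unordered genotypes at one locus."""
    g1, g2 = tuple(sorted(g1)), tuple(sorted(g2))
    if ibd == 0:  # no shared alleles: the genotypes are independent
        return genotype_prob(g1, freq) * genotype_prob(g2, freq)
    if ibd == 2:  # both alleles shared: the genotypes must be identical
        return genotype_prob(g1, freq) if g1 == g2 else 0.0
    # IBD = 1: sum over the shared allele s and the non-IBD alleles a and b.
    total = 0.0
    for s, a, b in product(freq, repeat=3):
        if tuple(sorted((s, a))) == g1 and tuple(sorted((s, b))) == g2:
            total += freq[s] * freq[a] * freq[b]
    return total

def locus_lr(g1, g2, relationship, freq):
    """Contribution of one locus to the LR of `relationship` vs. unrelated."""
    k = IBD_PROBS[relationship]
    numerator = sum(k[i] * joint_prob(g1, g2, i, freq) for i in range(3))
    return numerator / joint_prob(g1, g2, 0, freq)
```

With made-up frequencies such as `{"A": 0.1, "B": 0.2, "C": 0.7}`, the parent-child LR for genotypes AA and AB works out to 1/(2 × 0.1) = 5, and it grows as the shared allele becomes rarer, which is the information the exact method exploits. A pair with no alleles in common (AA and BB) gets a parent-child LR of zero.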
We note that throughout this paper, we assume that no participants within households are inbred and that all participants in different households are unrelated. Furthermore, we assume that no loci are missing data in any way informative of relatedness; in this way, the product (2) can be safely taken over the observed loci alone.
Identifiler® allele frequencies vary between ethnicities, and even within an ethnicity, allele frequencies can vary between subpopulations (Budowle et al., 2001). For example, the particular European/African ancestry of the non-Hispanic whites/blacks in NHANES III is not collected, and allele frequencies can vary within these groups. Heterogeneous allele frequencies between the unknown subpopulations cause intraclass correlation of alleles within ethnicity (Devlin and Roeder, 1999), violating the independence assumption required to calculate the genotype probabilities of Table 3. Furthermore, all unrelated humans still share small amounts of DNA IBD from distant common ancestors, and this cryptic relatedness (Devlin and Roeder, 1999) results in nebulous subpopulations that further increase the intraclass correlation.
Genotype probability calculations can be extended to account for the intraclass correlation of alleles within ethnicity, called $F_{ST}$ (Wright, 1969) (or the coancestry coefficient (Evett and Weir, 1998)). $F_{ST}$ is positive and can be interpreted as the probability that two alleles are IBD from an unknown common ancestor. $F_{ST}$ is accounted for by using a Dirichlet-Multinomial distribution (Evett and Weir, 1998, pg. 123-5). Writing $p_A$ for the frequency of allele A, the genotype probability calculation for a single participant is altered as (Balding and Nichols, 1994)

$$P(AA) = F_{ST}\, p_A + (1 - F_{ST})\, p_A^2,$$

since with probability $F_{ST}$ the two A's are IBD from a distant common ancestor. Similarly,

$$P(AB) = 2\,(1 - F_{ST})\, p_A\, p_B,$$

since different alleles cannot be IBD. The genotype probability calculations for pairs of genotypes (for IBD=0) can be calculated using the Dirichlet-Multinomial recursion relation (Balding and Nichols, 1995)

$$P(A \mid n_A \text{ copies of } A \text{ among } n \text{ observed alleles}) = \frac{n_A F_{ST} + (1 - F_{ST})\, p_A}{1 + (n - 1) F_{ST}}.$$
The recursion relation is used to calculate the probability of observing the alleles that are not IBD; these alleles are no longer independent but have correlation $F_{ST}$. For concreteness, the genotype probability calculations for these IBD=0 alleles are
For example, P(AAAA) = P(A|AAA)P(AAA). Under independent alleles P(A|AAA) = P(A), but with positive intraclass correlation FST , the probability of observing another A given that 3 A's have been observed is higher and is specified by the recursion relation. Similarly, P(ABCD) = P(D|ABC)P(ABC) and P(D|ABC) conditions on not having observed D before, so the probability of observing D decreases.
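The sequential calculation in the example above can be sketched as follows. The sketch assumes the usual Balding-Nichols form of the Dirichlet-multinomial sampling formula: the probability that the next allele is A, given n alleles already observed of which n_A are A, is (n_A·FST + (1 − FST)·p_A) / (1 + (n − 1)·FST). Function names are illustrative, and the result is the probability of an ordered allele sequence, so the constant factors for unordered genotypes still have to be applied as in Table 3.

```python
def next_allele_prob(allele, observed, freq, fst):
    """P(next allele | alleles already observed) under the assumed
    Dirichlet-multinomial (Balding-Nichols) sampling formula."""
    n = len(observed)
    n_same = observed.count(allele)
    return (n_same * fst + (1.0 - fst) * freq[allele]) / (1.0 + (n - 1) * fst)

def ordered_sequence_prob(alleles, freq, fst):
    """P(ordered allele sequence), built up one allele at a time, e.g.
    P(AAAA) = P(A) * P(A|A) * P(A|AA) * P(A|AAA)."""
    prob = 1.0
    for i, allele in enumerate(alleles):
        prob *= next_allele_prob(allele, list(alleles[:i]), freq, fst)
    return prob
```

With `fst = 0` the function reduces to the product of allele frequencies. With a positive `fst`, repeated alleles (AAAA) become more probable than under independence and sequences of all-different alleles (ABCD) become less probable, which is the pattern described in the text.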
To calculate the final genotype probability, plug in the above expressions into Table 3. Expressions for the LR accounting for FST exist (Ayres, 2000), but the above likelihood contributions are needed for maximum likelihood estimation of relationships via estimating the IBD probabilities k0, k1, k2.
The IBS (Identical By State) method (Chakraborty and Jin, 1993; Presciuttini et al., 2002) estimates the LR using only the fact that alleles match at each locus. For the example pair of Table 1, CSF1PO is considered a match on two alleles, D13S317 a match on zero alleles, and D16S539 a match on one allele. The IBS method relies on heterozygosity $H_j$, the probability that the two alleles at locus $j$ are different. The probability that $i = 0, 1, 2$ alleles match at locus $j$ in profiles $P_1$ and $P_2$ for a given relationship $R$ is denoted $z(i \mid H_j, R)$. As a function of heterozygosity at each locus, for each familial relationship and each $i = 0, 1, 2$, the $z(i \mid H_j, R)$ are well approximated empirically by cubic functions with little residual variation (Presciuttini et al., 2002, Fig. 1). The empirical estimates $\hat{z}(i \mid H_j, R)$ of the cubic functions are available (Presciuttini et al., 2002, Table 2), and the IBS method estimates the LR in (1) as

$$\widehat{\mathrm{LR}}_{\mathrm{IBS}} = \prod_{j} \frac{\hat{z}(i_j \mid H_j, R)}{\hat{z}(i_j \mid H_j, \text{unrelated})}, \qquad (4)$$

where $i_j$ is the observed number of matching alleles at locus $j$.
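The only genotype-level input the IBS method needs is the number of matching alleles at each locus. A minimal sketch of that count, treating each genotype as a multiset of two alleles; the convention that a homozygote and a heterozygote sharing one allele (e.g. 11/11 and 11/12) count as one match is an assumption here, and the function name is illustrative:

```python
from collections import Counter

def ibs_match_count(g1, g2):
    """Number of alleles (0, 1, or 2) that two genotypes share by state,
    computed as the size of the multiset intersection."""
    overlap = Counter(g1) & Counter(g2)
    return sum(overlap.values())
```

For example, `ibs_match_count((13, 10), (10, 13))` is 2 regardless of allele order, and two genotypes with no allele in common give 0.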
The IBS method does not distinguish between types of alleles and has no need for allele frequencies, and thus loses information versus the exact method by ignoring the rarity or commonality of matches. But the IBS method is robust when allele frequencies are unavailable or inappropriate. For example, in a mass disaster (such as a plane crash), appropriate allele frequencies may be unavailable for matching DNA from the remains with DNA samples provided by relatives. For another example, it is unclear how appropriate allele frequencies from non-U.S.-population-based databases of DNA profiles are for use in the general population.
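As noted in section 1.1, the DNA was extracted by cell lysis, a sub-optimal method of DNA extraction that could introduce more genotyping errors than ordinarily expected. We adopt a simple model that the true genotype is observed with probability $1 - \epsilon$, but with probability $\epsilon$, the observed genotype is drawn randomly from the population (Broman and Weber, 1998; Epstein et al., 2000). The genotype probability calculations of section 2.1 and 2.1.1 in Table 3 are altered as

$$P_{\epsilon}(l_{j,1}, l_{j,2} \mid \mathrm{IBD} = i) = (1 - \epsilon)^2\, P(l_{j,1}, l_{j,2} \mid \mathrm{IBD} = i) + \left(1 - (1 - \epsilon)^2\right) P(l_{j,1}, l_{j,2} \mid \mathrm{IBD} = 0), \qquad (5)$$

because a randomly drawn genotype from the population has no IBD sharing. The exact LR under errors takes a simple form. By (3), the contribution each locus $j$ makes to the usual LR is

$$\mathrm{LR}_j = \frac{\sum_{i=0}^{2} k_i\, P(l_{j,1}, l_{j,2} \mid \mathrm{IBD} = i)}{P(l_{j,1}, l_{j,2} \mid \mathrm{IBD} = 0)}.$$

With errors (denoting $e = 1 - (1 - \epsilon)^2$), the contribution is now

$$\mathrm{LR}_j^{(e)} = (1 - e)\, \mathrm{LR}_j + e.$$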
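A small numerical illustration of the error model: if both observed genotypes at a locus are correct with probability (1 − ε)² and the pair otherwise behaves like an unrelated pair, the per-locus LR becomes a mixture of the error-free LR and 1. The sketch below assumes that mixture form, (1 − e)·LR + e with e = 1 − (1 − ε)²; the function names are illustrative.

```python
def pair_error_prob(eps):
    """Probability e that at least one of the two genotypes is in error."""
    return 1.0 - (1.0 - eps) ** 2

def locus_lr_with_error(lr, eps):
    """Per-locus LR after allowing for genotyping error rate `eps`,
    under the assumed mixture form (1 - e) * LR + e."""
    e = pair_error_prob(eps)
    return (1.0 - e) * lr + e
```

The practical consequence shows up at a locus where a putative parent and child share no allele: the error-free contribution is 0, which vetoes the relationship outright, whereas with ε = 2% the contribution is e ≈ 0.04, a heavy penalty that the remaining loci can still outweigh.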
Since the IBS LR is meant to estimate the exact LR, we can use the above functional form to modify the contributions to the IBS LR in (4).
We note that STRs can, on rare occasions, spontaneously mutate. Thus the overall genotyping error rate combines both measurement error and mutation. Since the cells were crudely lysed to extract the DNA, we believe that measurement error dominates the error rate parameter.
Instead of hypothesis testing for relationships, the best relationship can be directly estimated with maximum-likelihood estimates of the IBD probabilities $k_0$, $k_1$ and $k_2$ (Milligan, 2003). We maximize the exact likelihood (3), modifying the genotype probability calculation $P(l_{j,1}, l_{j,2} \mid \mathrm{IBD} = i)$ to account for errors as in (5) and for cryptic relatedness as in section 2.1.1. Within the simplex formed by $k_0$, $k_1$ and $k_2$, the feasible region of maximization for non-inbred families is $4 k_0 k_2 \le k_1^2$ (Thompson, 1991).
Allele frequencies for the U.S. and for each ethnicity (non-Hispanic white, non-Hispanic black, Mexican American) were estimated in the standard way of estimating a proportion using sample weights in a Horvitz-Thompson estimator (ignoring finite population corrections) (Raj, 1968, pg. 42).
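A weighted allele frequency at one locus can be sketched as a weighted proportion: each participant contributes two alleles, each counted with that participant's sample weight. The sketch uses the ratio form (weighted allele count divided by twice the sum of weights), which is an assumption about the details of the estimator; it also assumes complete genotypes at the locus, and the function name is illustrative.

```python
from collections import defaultdict

def weighted_allele_freqs(genotypes, weights):
    """Survey-weighted allele frequencies at a single locus.

    genotypes: list of (allele, allele) pairs, one per participant
    weights:   list of sample weights, in the same order
    """
    totals = defaultdict(float)
    for (a, b), w in zip(genotypes, weights):
        totals[a] += w  # each of the two alleles carries the person's weight
        totals[b] += w
    denom = 2.0 * sum(weights)
    return {allele: total / denom for allele, total in totals.items()}
```

With equal weights this reduces to the ordinary allele-counting estimate; ethnicity-specific frequencies follow by restricting the inputs to participants of one ethnicity.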
To assess the informativeness of a locus, we calculate the entropy of its allele frequency distribution. The entropy is the sum over each allele $i$ of $-p_i \ln(p_i)$, where $p_i$ is the frequency of allele $i$. A locus with high entropy will have many alleles and low allele frequencies, and so can better distinguish people than a low entropy locus. Figure 2 plots the entropies for the U.S., each ethnicity, and for each locus, ordered by the entropy of the loci for the U.S. The least informative locus is TPOX, for which only two alleles account for over 75% of its alleles; at D2S1338 the top two alleles account for only 35% of its alleles. The allele distributions for non-Hispanic blacks generally have more entropy than those for other ethnicities, especially for the least- and most-informative loci. Thus the non-Hispanic blacks appear to have more genetic diversity than the other ethnicities in NHANES III.
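The entropy measure is a one-line computation. A minimal sketch (the function name is illustrative, and alleles with zero frequency are skipped because they contribute nothing to the sum):

```python
from math import log

def allele_entropy(freqs):
    """Entropy of an allele frequency distribution: sum of -p * ln(p)."""
    return -sum(p * log(p) for p in freqs if p > 0)
```

A locus with k equally frequent alleles has entropy ln(k), the maximum for k alleles, so entropy rises both with the number of alleles and with how evenly their frequencies are spread.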
Figure 3 shows the actual allele frequencies. Alleles are ordered by frequency in the U.S. population and alleles with frequency < 1% are not shown. Some loci have only 5 alleles with frequency >= 1%; some have as many as 11. Most loci have many alleles with frequency 1 – 5%, and the presence of such alleles can be very informative for inferring family relationships. D2S1338 has the highest entropy, due to having 11 alleles, many with frequency 1 – 5%. D3S1358 has a flat allele distribution, but only 5 alleles, so has low entropy. TPOX has a sharply dropping distribution, emphasizing its low entropy. There are clear ethnic differences in allele frequencies at many loci, especially for non-Hispanic blacks (e.g. D13S317, CSF1PO, D18S51).
We classified the most likely relationship for a pair of household members by the highest exact or IBS LR in favor of that relationship. If each LR for each relationship for a pair is less than one, we classify the pair as most likely unrelated. When the pair reported the same ethnicity, the LR used the allele frequencies for that ethnicity. The overall U.S. allele frequencies were used for the LRs for the 54 pairs reporting different ethnicities.
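The classification rule in this paragraph can be written compactly. A minimal sketch, assuming the LRs for one pair are held in a dictionary keyed by relationship name (the data layout and function name are illustrative):

```python
def classify_pair(lrs):
    """Return the relationship with the highest LR, or 'unrelated' when
    every LR is below one."""
    best = max(lrs, key=lrs.get)
    if lrs[best] < 1.0:
        return "unrelated"
    return best
```

For example, `classify_pair({"parent-child": 0.0, "full-sibling": 250.0, "2nd-degree": 12.0})` returns `"full-sibling"`. The LRs passed in would be computed with the ethnicity-specific or overall U.S. allele frequencies, as described above.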
Table 4 classifies the most likely relationship for a pair of household members (highest LR in favor of that relationship) by the exact ($F_{ST} = 0$ and $\epsilon = 0$) and IBS methods. The two methods strongly agree on which pairs are most likely parent-child or siblings. No IBS LR for cousin was presented in Presciuttini et al. (2002), so for the cousin pairs by the exact method, the IBS LR parcels them out to 2nd degree and unrelated. The Spearman correlations of the exact and IBS LR for parent-child, sibling, and 2nd-degree are 0.97, 0.98, 0.94 respectively, underscoring that the two methods rank pairs very similarly. This strong agreement changes negligibly with different $F_{ST}$ or $\epsilon$.
We considered whether the most likely relationship is consistent with the reported ages. Only 5% (59) of the parent-child pairs had an age difference of under 16 years and 9% (42) of the sibling pairs had an age difference over 25 years. A more refined analysis might attribute these pairs to another likely relationship consistent with the ages of the pair. Furthermore, seven pairs had identical observed DNA profiles, implying that they are either identical twins or they are the same individual (some of these pairs have differing ages or ethnicities).
Figure 4 plots the exact and IBS LRs for three relationships, limiting to pairs where either the exact or IBS LR is greater than one. For siblings, the IBS LR underestimates the exact LR (a smooth loess curve is added to make this clear). For parent-child and 2nd-degree, the underestimation is pronounced at higher exact LRs, where the true parent-child or 2nd-degree pairs are likely to be. So while the two methods agree on the ranking of relationships, they can disagree on the quantification of the LR.
We did not estimate error rates or FST from our complex survey data, but instead assessed sensitivity to plausible values. We observed a 1% sex mismatch rate (section 1.1), suggesting that perhaps a 2% error rate overall is reasonable; we also considered 0% and 4%. A National Research Council report recommends using FSTs of 1 – 3% (National Research Council II Report, 1996). We considered FSTs of 0%, 1%, and 3%.
Table 5 shows the distribution of the exact LR for the most likely relationship by $F_{ST}$. The most striking observation is the rather low LRs for 2nd-degree, cousin, and unrelated, suggesting that the Identifiler® loci are not informative enough to conclusively determine these three relationships. Second, the parent-child and sibling LRs are sensitive to $F_{ST}$, with median LRs changing by factors of 3-7 as $F_{ST}$ increases, and Q3 LRs changing by factors of 10. Although these intraclass correlations ($F_{ST}$) are small, the exact LR changes substantially because the exact method derives powerful information from matching on rare alleles (Ayres, 2000). Any non-zero $F_{ST}$ implies that a match on rare alleles could well be a result of sharing unknown distant relatives rather than sharing a close familial relationship. This result is analogous to the inflation of the variance under cluster sampling with a small intraclass correlation but large clusters (Korn and Graubard, 1999).
Table 6 shows the counts of the best relationship (by exact LR) by error rates and FST . Increasing the error rate increases the number of parent-child relationships because a single genotyping error causing a perfect mismatch at a locus eliminates the possibility of a parent-child relationship. Allowing for an error rate removes this possibility, allowing the other loci to contribute meaningfully to the parent-child LR. Sibling relationships are not sensitive to error rates. 2nd-degree and cousin relationships are sensitive, mostly because the LRs in favor of these relationships are very small and are vulnerable to small changes. Increasing FST decreases the counts of parent-child and sibling relationships on the order of 5%. Thus FST has little effect on the determination of the most likely relationship, but strongly affects the quantification of the LR in its favor.
Table 7 shows the distribution of the exact LR by most likely relationship, by ethnicity. We fixed $\epsilon = 2\%$, $F_{ST} = 1\%$ as our most plausible values. When parent-child is most likely, the LR for non-Hispanic blacks tends to be the highest, possibly due to greater entropy in the non-Hispanic black allele frequencies. However, when sibling is most likely, the LRs seem somewhat more comparable, although somewhat lower for Mexican Americans.
Without complete and conclusive reporting of family history, we cannot formally verify how close the inferred familial relationships are to the truth. But as an approximation to truly unrelated individuals, we treat as unrelated the 500 household pairs in which either member is over the age of 40, the ages are within 12 years of each other, and the members are of opposite sexes. Most likely, these are married or unmarried couples, i.e., pairs who are highly likely to be unrelated. We use the LR assuming $\epsilon = 2\%$ and $F_{ST} = 1\%$. To make decisions about relationships, we have set LR cutoffs. To be conservative, we consider an LR > $10^4$ to be strong evidence for the relationship. Since we expect far more unrelated individuals than second-degree relationships, we consider an LR < $10^3$ to be evidence for being unrelated. We are equivocal for LRs between $10^3$ and $10^4$.
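The cutoffs in this paragraph amount to a three-way decision rule. A minimal sketch, with the thresholds taken from the text as 10^4 and 10^3; treating an LR that falls exactly on a threshold as equivocal is an assumption, and the function name and return labels are illustrative:

```python
def interpret_lr(lr, strong_cutoff=1e4, unrelated_cutoff=1e3):
    """Map an LR for the most likely relationship to a decision."""
    if lr > strong_cutoff:
        return "strong evidence for relationship"
    if lr < unrelated_cutoff:
        return "unrelated"
    return "equivocal"
```

Under this rule an LR of 457 is read as unrelated, an LR of 6000 as equivocal, and an LR of 10^6 as strong evidence for the relationship.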
Of these 500 pairs, the LR maximizes at unrelated for 382. Another 17 and 86 LRs maximize at half-sibling or cousin, respectively. The maximum LR for half-sibling is only 16 and for cousin is merely 3, indicating that each of these 103 pairs is most likely unrelated. Four pairs had maximum LR at parent-child (with maximum LR of 457), but these pairs all had age differences under 7 years, so they are not parent-child relationships, and their small LRs indicate that they are most likely unrelated. The remaining 11 pairs have maximum LR for full siblings; 6 have LR under 1000 (most likely unrelated), one has LR of 6000 (equivocal), and the remaining four have LRs of $10^6$, $10^7$, $10^8$, and $10^{13}$ (overwhelmingly full sibling). Thus we believe the genetic data naturally infer the 495 unrelated pairs in this group of 500 pairs and identify another 4 who are most likely full siblings, leaving only one pair unresolved.
The LRs can be used to help infer the most likely family structure within each household. We can infer family structure only amongst household members with genotyping results. We use only the exact LR assuming $\epsilon = 2\%$ and $F_{ST} = 1\%$. We restrict our presentation to households with two or three persons with genotyping results (88% of the multiple-person households) as larger households are more likely to contain half-siblings and cousins, non-immediate relationships that our LR has little ability to detect. We assume that all pairs with LRs maximizing at half-sibling or cousin are truly unrelated. A thorough analysis that infers family structure by carefully considering age, race, ethnicity, and other demographic variables, especially to account for household members without genotyping results (who are not in our dataset), is beyond the scope of this article.
For the 1070 households with two individuals with genotyping results, the LR maximizes at: 576 unrelated, 350 parent-child, and 144 full-sibling relationships. Purely full-sibling relationships in a two-person household may imply the presence of other household members for whom we do not have genotyping results. For the 329 households with three individuals with genotyping results, the LR maximizes at: 95 2-parent 1-child trios, 81 parent-child plus an unrelated person, 51 single parent raising two full-siblings, 19 unrelated person raising 2 full-siblings, 53 completely unrelated, and 30 at inbred or impossible family structures.
We computed maximum-likelihood estimates of the IBD probabilities k0, k1 and k2 numerically using the Nelder-Mead simplex algorithm as implemented by the R function optim(). We fixed the genotyping error rate at 2% and FST = 1% as our most plausible values. It took 5 hours on a 3 GHz Pentium 4 computer to compute IBD probabilities for all 3610 pairs.
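To illustrate what such an estimate involves, the sketch below maximizes a pair's likelihood over (k0, k1, k2). It rests on two assumptions that are not taken from the analysis above. First, the likelihood at each locus is assumed to be a k-weighted mixture of the probabilities of the observed genotype pair given 0, 1, or 2 alleles shared IBD; these per-locus probabilities are treated as given inputs, and the numbers used are hypothetical. Second, the sketch swaps in an EM iteration in place of Nelder-Mead, because EM keeps the estimates on the simplex without extra constraints and needs only the standard library.

```python
import math


def log_likelihood(k, locus_probs):
    """Log-likelihood of k = (k0, k1, k2) given per-locus P(genotypes | i alleles IBD)."""
    return sum(math.log(sum(kj * pj for kj, pj in zip(k, p))) for p in locus_probs)


def estimate_ibd(locus_probs, n_iter=500, tol=1e-10):
    """EM for the mixture weights k0, k1, k2 with the per-locus probabilities held fixed."""
    k = [1 / 3, 1 / 3, 1 / 3]  # start in the interior of the simplex
    for _ in range(n_iter):
        totals = [0.0, 0.0, 0.0]
        for p in locus_probs:
            weighted = [kj * pj for kj, pj in zip(k, p)]
            denom = sum(weighted)
            for j in range(3):
                totals[j] += weighted[j] / denom  # E-step: posterior IBD state at this locus
        new_k = [t / len(locus_probs) for t in totals]  # M-step: average the posteriors
        converged = max(abs(a - b) for a, b in zip(new_k, k)) < tol
        k = new_k
        if converged:
            break
    return k


# Hypothetical per-locus probabilities (P0, P1, P2); not NHANES III values.
example_loci = [
    (0.010, 0.040, 0.100),
    (0.020, 0.030, 0.000),
    (0.005, 0.020, 0.070),
    (0.030, 0.010, 0.000),
]
k_hat = estimate_ibd(example_loci)
print([round(x, 3) for x in k_hat])
```

Under the mixture assumption the log-likelihood is concave in k, so a general-purpose optimizer such as Nelder-Mead and this EM iteration target the same maximum.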
Table 8 shows the distribution of the estimated IBD probabilities by ethnicity. For parent-child pairs, the distributions of the estimated k1 and k2 differ little by ethnicity. For siblings, non-Hispanic whites in our NHANES III sample tend to have the highest estimated k1 and k2, followed by non-Hispanic blacks for k2. In particular, the median estimated k2 of both groups exceeds the expected 0.25, and the non-Hispanic white median estimated k1 = 0.524 also exceeds the expected 0.5. These slight elevations suggest that non-Hispanic white siblings in our NHANES III sample may be more closely related than expected, and suggest the presence of cryptic relatedness.
An advantage of estimating IBD probabilities is flagging potentially non-standard relationships. Twenty pairs had an estimated k2 with 0.4 ≤ k2 < 1; these are non-standard (possibly inbred) familial relationships for which we do not compute an LR.
The NHANES III genetics data will be an unparalleled resource for incorporating genetics into a comprehensive understanding of the determinants of health in the U.S. and for developing, targeting, and evaluating policies for disease prevention that use genetic information. However, these analyses are handicapped until family relationships amongst household members are inferred. We evaluated two methods for relationship inference, the exact and IBS methods, and find that while they often agree on the most likely relationship, the IBS method generally underestimates the LR in favor of a relationship. This underestimation occurs because the exact method can take advantage of the informativeness of matches on rare alleles. The IBS method is robust to inadequate or inappropriate allele frequencies, but the NHANES III allele frequencies are estimated from a large population-based sample, so the exact method seems appropriate (notwithstanding possible concern about the stability of the less common (1 – 5%) allele frequency estimates). Accounting for genotyping errors and FST is critical for quantifying the LR, but has little effect on which relationship is judged most likely. The genotyping error rate has the most effect on parent-child relationships. LRs for non-Hispanic blacks tend to be most informative because their allele frequencies have the most entropy.
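As a brief illustration of the entropy comparison, the Shannon entropy of a locus's allele frequency distribution is larger when the frequencies are spread over many alleles, so a chance match between unrelated people is less likely and an observed match carries more information. The sketch below uses hypothetical frequencies, not NHANES III estimates, and the function name is illustrative.

```python
import math


def allele_entropy(freqs):
    """Shannon entropy (in bits) of one locus's allele frequency distribution."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return -sum(p * math.log2(p) for p in freqs if p > 0)


# Hypothetical frequencies at a single locus; not NHANES III estimates.
concentrated = [0.70, 0.20, 0.10]        # one dominant allele
spread_out = [0.25, 0.25, 0.25, 0.25]    # four equally common alleles

print(allele_entropy(concentrated), allele_entropy(spread_out))
```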
Other health surveys worldwide have begun collecting genetic information, such as the Health 2000 survey (Samani et al., 2008) and the Canadian Health Measures Survey. We expect that future health surveys will routinely collect genetic information. Our results are likely relevant to other surveys since Identifiler® is run for specimen tracking by most DNA labs.
Furthermore, even if family relationships are believed to be accurately reported, genetic data are critical to verify reported relationships and identify data quality issues. We noted an observed gender/chromosomal sex mismatch rate of 1% (so overall would be about 2%). In our experience with other datasets, we have observed non-trivial sex mismatch rates of 1-4%. In section 3.2, we found that 5% of the parent-child pairs and 9% of the full-sibling pairs had implausible reported ages, seven participants were either identical twins with household members or else duplicates in the data, and twenty pairs may have non-standard (inbred) relationships. Furthermore, our methods could also be used to detect unsuspected familial relationships across households. Discrepancies could reflect either misreporting by household members (perhaps from lack of knowledge of true paternity) or specimen handling/analysis problems in the survey. Regardless of the source of the discrepancy, the genetic analysis helps identify such problems.
Both the exact and IBS methods can also be viewed from the perspective of probabilistic record linkage (Herzog et al., 2007) because they compare two data vectors for matches. For relationship inference, there are "record linkages" of different types based on the different possible relationships. The difference between the exact and IBS methods is whether to extract information from the commonality or rareness of matches, akin to a similar debate in the record linkage literature (Herzog et al., 2007, Ch. 9).
While the LR can be used to infer the most likely familial relationship, estimating IBD probabilities has two advantages. First, IBD probabilities provide a continuous measure of the amount of DNA shared by two household members. Relatives of a given type have the IBD sharing in Table 2 only on average; the estimated IBD probabilities estimate the true IBD sharing of a particular pair. Second, IBD probability estimates can improve regression modeling of survey data, regardless of whether the model uses genetic information, by specifying the correlation structure of a continuous outcome measured on household members. For example, a simplified model for the correlation of two household members' outcomes y1, y2 using the average IBD sharing I = 0·k0 + 1·k1 + 2·k2 is
(Lange, 2002, pg 101). Specified within-household correlation of outcomes can be exploited in regression modeling to improve efficiency of parameter estimates (Korn and Graubard, 1999). This correlation matrix applies regardless of whether the model involves genetic information.
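As a small worked example of the average IBD sharing I = 0·k0 + 1·k1 + 2·k2, the sketch below evaluates it at the expected IBD probabilities of the standard non-inbred relationships. The values in the dictionary are the usual textbook expectations and are assumed, not confirmed, to match Table 2; the names are illustrative. The sketch does not implement the correlation model itself.

```python
# Expected (k0, k1, k2) for standard non-inbred relationships (textbook values,
# assumed here to agree with Table 2).
EXPECTED_IBD = {
    "unrelated": (1.0, 0.0, 0.0),
    "cousin": (0.75, 0.25, 0.0),
    "half-sibling": (0.5, 0.5, 0.0),
    "full-sibling": (0.25, 0.5, 0.25),
    "parent-child": (0.0, 1.0, 0.0),
}


def average_ibd_sharing(k0, k1, k2):
    """I = 0*k0 + 1*k1 + 2*k2, the expected number of alleles shared IBD at a locus."""
    if abs(k0 + k1 + k2 - 1.0) > 1e-9:
        raise ValueError("IBD probabilities must sum to 1")
    return 0 * k0 + 1 * k1 + 2 * k2


for relationship, k in EXPECTED_IBD.items():
    print(relationship, average_ibd_sharing(*k))
```

Full siblings and parent-child pairs have the same average sharing (I = 1) despite different k0, k1, k2, so I summarizes the amount of sharing rather than the type of relationship.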
Our ethnicity-specific allele frequency estimates are unique because they are explicitly nationally-representative. Comparing our estimates to other established estimates (Budowle et al., 2001; Einum and Scarpetta, 2004) shows agreement on common allele frequencies, but disagreements on rarer (1 – 5%) allele frequencies that are the most informative yet most vulnerable to small uncertainties. Our estimates may be helpful for calculating the probability that a given DNA profile matches by chance with a random individual from the U.S. population, or to infer whether individuals in a genetic database may be relatives of an individual with a given DNA profile (Bieber et al., 2006).
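A chance-match probability of this kind is commonly computed with the product rule. The sketch below is a simplified illustration under stated assumptions: Hardy-Weinberg proportions within each locus, independence across loci, and no FST correction (which a fuller calculation in the spirit of this article would include). The locus names, allele labels, and frequencies are hypothetical, not NHANES III estimates.

```python
def genotype_frequency(allele_a, allele_b, freqs):
    """Hardy-Weinberg frequency of one genotype: p^2 if homozygous, 2pq otherwise."""
    p, q = freqs[allele_a], freqs[allele_b]
    return p * p if allele_a == allele_b else 2 * p * q


def random_match_probability(profile, freqs_by_locus):
    """Multiply genotype frequencies across loci, assuming independent loci."""
    prob = 1.0
    for locus, (a, b) in profile.items():
        prob *= genotype_frequency(a, b, freqs_by_locus[locus])
    return prob


# Hypothetical allele frequencies and profile; not NHANES III estimates.
freqs_by_locus = {
    "locus_1": {"12": 0.30, "13": 0.05},
    "locus_2": {"8": 0.10, "9": 0.40},
}
profile = {"locus_1": ("12", "13"), "locus_2": ("8", "8")}

print(random_match_probability(profile, freqs_by_locus))
```

Because rare allele frequencies enter as small multiplicative factors, modest uncertainty in them can shift the result by a large relative amount, which is why disagreements on the 1 – 5% alleles matter.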
Our STR loci are not informative enough to conclusively determine whether pairs are 2nd-degree relatives, cousins, or unrelated. Relationship inference could be improved by using large numbers of Single Nucleotide Polymorphisms (SNPs) instead of STRs. The available SNPs in NHANES III were chosen as candidate polymorphisms from a priori hypotheses for association with diseases of potential public health significance (Chang et al., 2009). These SNPs are unlikely to be included as a group in SNP panels used for sample tracking, quality control and assessment of cryptic relatedness. However, future NHANES genotyping may involve dense genotyping panels including over one million SNPs, and these data will be important for resolving distant relationships.
We thank the National Center for Health Statistics for use of their Research Data Center to conduct this research. This research was supported in part by the Intramural Research Program of the NIH/National Cancer Institute.
Conflict of Interest: None declared.