Search tips
Search criteria 


Logo of sagmbStatistical Applications in Genetics and Molecular BiologySubmit to Statistical Applications in Genetics and Molecular BiologySubscribeStatistical Applications in Genetics and Molecular Biology
Stat Appl Genet Mol Biol. 2009 January 1; 8(1): 37.
Published online 2009 September 9. doi:  10.2202/1544-6115.1469
PMCID: PMC2861329

Identifying Individuals in a Complex Mixture of DNA with Unknown Ancestry


A new test was recently developed that could use a high-density set of single nucleotide polymorphisms (SNPs) to determine whether a specific individual contributed to a mixture of DNA. The test statistic compared the genotype for the individual to the allele frequencies in the mixture and to the allele frequencies in a reference group. This test requires the ancestries of the reference group to be nearly identical to those of the contributors to the mixture. Here, we first quantify the bias, the increase in type I and type II error, when the ancestries are not well matched. Then, we show that the test can also be biased if the number of subjects in the two groups differ or if the platforms used to measure SNP intensities differ. We then introduce a new test statistic and a test that only requires the ancestries of the reference group to be similar to the individual of interest, and show that this test is not only robust to the number of subjects and platform, but also has increased power of detection. The two tests are compared on both HapMap and simulated data.

1. Introduction

Given a mixture of DNA samples from numerous individuals, it is often desirable to determine whether a specific individual contributes DNA to that mixture. Using forensics as an example, this mixture can be a specimen from a crime scene, and the goal can be determining whether a suspect’s DNA is included in that specimen. Many methods have been proposed to identify the presence of an individual within a mixture. Most of them focus on cases where only a few people contribute to the mixture. These methods usually compare short tandem repeats (STR) in the mixture to those in the individual (Fung and Hu, 2002; Balding, 2003; Foreman et al., 2003). When only interested in males, the comparison can be limited to STRs on the Y chromosome (Jobling and Gill, 2004). In cases where the DNA has degraded, a second approach, comparing Mitochondrial DNA (mtDNA), specifically the hypervariable region, between mixture and individual can be a better alternative (Stoneking et al., 1991). The limitations of each method were discussed in Homer et al. (2008), but in brief, methods based on STRs require the DNA to be in good condition and methods based on mtDNA, even when augmented by informative SNPs, can have limited discriminatory power (Homer et al., 2008).

In a ground-breaking paper, Homer et. al. propose a new method using a genome-wide set of SNPs (Homer et al., 2008) to identify an individual in a mixture. Their method compares YM={YM1,,YMN}, where YM j is the proportion or frequency of the “A” allele in the mixture at SNP j, with Y0={Y01,,Y0N}, where Y0 j is the proportion of the “A” allele in the individual. Clearly, the possible values of Y0 j are 0, 0.5, and 1. Note that we have assumed there are N SNPs and have labeled the two possible alleles at each SNP as A and B. Not only is their new method based on a more robust technique, it has higher resolution, able to identify an individual in a mixture containing thousands of subjects. This method also highlighted the potential for a new problem, important enough to be discussed in Science (Couzin, 2008) and require an immediate statement by the NIH. Their results proved that the practice of publicly releasing group-level results from case/control GWAS studies may jeopardize participants’ anonymity.

The test statistic, T, proposed by Homer et. al., compares the similarity between Y0 and YM to the similarity between Y0 and YR, where YR are the allele frequencies in a reference mixture. If T is large enough, the corresponding test will reject the null hypothesis that the individual of interest does not contribute DNA to the mixture. For this test to perform well, the subjects for the reference group must be carefully chosen so that their ancestral composition matches that of the subjects contributing to the mixture (e.g. if Caucasians contribute to the mixture, then the reference group must also be Caucasian), or only a select subgroup of ancestry independent SNPs can be used (Kidd et al., 2006). The need for similar reference groups is clear. Assume we fail to select a similar reference group, and the individual is more similar, in terms of ancestry, to those subjects contributing to the mixture. Then, even if the subject’s DNA is absent from the mixture, Y0 will be more similar to YM, T will be large, and the test will likely result in a false positive. Similarly, if the individual and the reference group are more similar, the test can lead to a false negative. The obvious problem is that the identities, and therefore the ancestries, of the individuals in the mixture are unknown, and this required matching can be very difficult.

The first goal of this paper is to quantify the magnitude of the bias, type I error, and type II error that can occur if the ancestries of the two groups are poorly matched. Then we identify two other possible sources of bias, the type of platform (e.g. Illumina, Affymetrix) and the number of subjects in the reference group. We use HapMap data to further quantify the extent of the bias in real samples (The International HapMap Consortium, 2003). After demonstrating the severity of the potential problems caused by using T, we propose a new statistic that only requires selecting a reference group with an ancestry that matches the ancestry of the individual of interest. This is a far simpler task, as the individual’s ancestry is usually known. Other benefits of the new statistic are that it is unbiased and has a known null distribution. We then compare the performance of the two statistics using simulated data. The paper is therefore ordered as follows. Section II starts by introducing notation and then continues with a discussion of the properties of both statistics and their associated tests. Section III demonstrates the performance of the statistics and their associated tests using both HapMap and simulated data. Finally, section IV contains our brief concluding remarks.

2. Methods

2.1. Notation

Let the individual of interest be indexed by 0, and let the nR subjects in the reference population and the nM subjects in the mixture be indexed by {1, ...,nR + nM}. For subject i, i [set membership] {0, 1, ...,nR + nM}, let Ri = 1 (Mi = 1) if subject i is in the reference group (mixture), Ri = 0 (Mi = 0) otherwise. Unless explicitly stated otherwise, order the subjects so the first nM belong to the mixture (i.e. Mi = 1 for 1 ≤ inM). Note that we use the shorthand that “subject i is in the mixture” or “subject i is in the mixture group” if subject i contributes DNA to the mixture. Let ei be the ancestry, or ethnicity, of subject i. We start by assuming homogenous groups and therefore ei [set membership] {e0, eM, eR}, with ei = eR if Ri = 1, and ei = eM if Mi = 1.

Let there be N SNPs in the study. Let Qi j1 and Qi j2 indicate whether the minor allele (relative to ethnicity e0) is on chromosome 1 and 2, respectively, for subject i at SNP j. Let Yi j = 0.5 (Qi j1 + Qi j2) be the genotype for subject i at SNP j; Yi j [set membership] {0, 0.5, 1}. We define the population allele frequencies by pM j [equivalent] P(Qi jk = 1 | Mi = 1, i > 0), pR j [equivalent] P(Qi jk = 1 | Ri = 1, i > 0), and p0 j [equivalent] P(Q0 jk = 1) where k [set membership] {1, 2}. For calculating genotype probabilities, we assume Hardy-Weinberg Equilibrium.

Most genotyping platforms directly measure fluorescent intensity, which should be proportional to allele frequency. We shall label the intensity measures for allele A and allele B at SNP j from the mixture group as IAM j and IBM j. These intensities are measurements from a “pooled sample”. We label the ratio of intensities by γM j = IAM j / (IAM j + IBM j). If intensity measurements are also available for individuals, then we can calculate a constant kM j using known formulas (Pearson et al., 2007; Macgregor et al., 2008) and use a better proxy for the A allele frequency, γM j = IAM j / (IAM j + kM jIBM j). Similarly, define {IARj, IBRj, γR j, kR j} for the reference group. Therefore, the distance measure at each SNP from the Homer et. al. paper can be defined as


Their test statistic can then be defined as


where DL11NjDL1(j) and μ0 = 0. In this manuscript, we will also discuss


and the corresponding values DL2 and TL2. The “L1” and “L2” subscripts emphasize the distance measure used for the statistic. We discuss the case when intensities perfectly reflect underlying allele frequencies and define X (j) [equivalent] | Y0 j[p with hat]R j | – | Y0 j[p with hat]M j|, where p^MjnM1i:Mi=1Yij and p^RjnR1i:Ri=1Yij.

In general, we assume that γM j and γR j can be described by the following statistical models:

γMj=i:Mi=1YijnM+βMj+εMj=p^Mj+βMj+εMj     γRj=i:Ri=1YijnR+βRj+εRj=p^Rj+βRj+εRj

where εMj  N(0,σMj2), εRj  N(0,σRj2), and {βM j, βR j} are SNP/platform specific biases. Therefore, the error is now linearly associated with the ratio of intensities, instead of the intensities themselves. As studies tend to exclude SNPs with rare minor alleles, we assume that 0 ≤ γM j, γR j ≤ 1 will generally hold. This description of γM j and γR j as normal variables is a simplification, but we use it only to show the existence of and then approximate expected biases. Note that the properties of our suggested test statistic do not depend on the validity of equation 4. We define βM={βM1,,βMN}, βR={βR1,,βRN}, σM2={σM12,,σMN2}, and σR2={σR12,,σRN2}. Finally, we use the abbreviation Θ to represent the set of all parameters, Θ={nM,nR,βM,βR,σR2,σM2,e0,eM,eR}.

2.2. DL1 : The Original Test

The definition of any test requires a statement of the null hypothesis and the list of outcomes that would lead to the rejection of that null. Therefore, the desired null hypothesis is that the individual of interest is not in the mixture, M0 = 0, and the original test is to reject this null when TL1, with μ0 = 0, is large. The next goal is to calculate the appropriate threshold, type I error rate, and power. Unfortunately, TL1 is not a pivotal statistic, in that its distribution still depends on the parameter set Θ. Stating M0 = 0 does not lead to a single distribution of TL1. Therefore, for the original paper to discuss the type I error rate and power, there needed to be three additional assumptions

  1. Identical Ethnicities: e0 = eM = eR
  2. Identical Sample Sizes: nM = nR
  3. Identical Platforms: βM j = βR j and σMj2=σRj2  j

With these added assumptions, TL1 no longer depends on Θ and they could choose a threshold tα so that the type I error of the test would be


The power was then calculated as


Unfortunately, if one or more of these three assumptions are false, the type I error rate and power can differ from α and βPOW. Here, we consider a test to be biased for a given Θ if P(TL1 > tα |M0 = 0, Θ) ≠ α or P(TL1 > tα |M0 = 1, Θ) ≠ βPOW. In the following sections, we show the magnitude of the bias when each of the assumptions is violated.

2.2.1. DL1 : Bias from Ancestry

In this section, we show that unless the ancestries of the reference and mixture groups are extremely well matched, the false positive rate can easily be near 1, as opposed to the predicted α. Ignoring the sample size and platform parameters, there are seven possible scenarios (Table 1). The original paper thoroughly discussed scenarios H0 and H1 with assumptions II and III holding, but only warned of potential bias in the other cases (i.e P(TL1 > tα |M0 = 0, eM, eR, e0) ≠ α and P(TL1 > tα |M0 = 1, eM, eR, e0) ≠ βPOW for all combinations of e0, eM, and eR). Here, we want to estimate the magnitude of the type I error for the scenario Hf p where the result is likely to be a false positive (Table 1). Under Hf p, the individual is not in the mixture, but only the ancestries of the individual of interest and the mixture are the same (e0 = eMeR).

Table 1:
There are seven scenarios depending on whether the individual of interest is in the mixture (column 2) and depending on whether the ethnicity of the individual of interest (e0), the ethnicity of the mixture group (eM), and the ethnicity of the reference ...

Estimating the true type I error rate requires understanding the distribution of TL1, or equivalently, DL1 (j), for the three scenarios, H0, H1, and Hf p. As the allele frequencies at many of the SNPs will likely be the same for all populations, p0 j = pR j = pM j, the type I error rate will depend on s, defined to be the proportion of all SNPs that are ancestry independent, and the differences between pR j and pM j for the other (1 – s)N SNPs. To start, we calculate the asymptotic distribution of DL1 (j) (see appendix 5.1), the fundamental component of TL1, in all three scenarios. When n is large, n [equivalent] nM = nR, and βR j = βM jj,

H0:DL1(j)  N(μ0(j),σ2(j))H1:DL1(j)  N(μ1(j),σ2(j))Hf p:DL1(j)  N(μf p(j),σf p2(j))


μ0(j)=0μ1(j)=1n2pMj(1pMj)2μfp(j)=(pMjpRj)(12(1pMj)2) if pRj<0.5            pMj(2pMj26pMj+3)+pRj(12pMj2) if pRj>0.5σ2(j)=pMj(1pMj)n+σMj2+σRj2σfp2(j)=12npMj(1pMj)+12npRj(1pRj)+σRj2+σMj2+        (pMjpRj)24pMj(2pMj)(1pMj)2

Figure 1 shows the type I error rate as a function of s assuming that pM j is uniformly distributed over the interval (0, 0.5), pR j = pM j at SNPs 1,...,sN, and pR j is uniformly distributed over the interval (0,0.5) for the remaining (1-s)N SNPs independent of pM j. This will overestimate the type I error rate for a given s, because, in truth, pR j will often be relatively close to pM j. Nevertheless, this simplified example demonstrates the magnitude of the type I error rate. Figure 1 shows that when nM is large (i.e. nM = 1000), over 99% of all SNPs would need to be ancestry independent to prevent the false positive rate from being near 1. Therefore, the false positive rate can easily equal or exceed the power of the test. Next, we can calculate the value of s needed for the false positive rate to equal the power, which is essentially when μ1 =[mu]f p, where μ1 [equivalent] E[μ1 (j)] and [mu]f p [equivalent] (1 – s) E[μf p (j)]. In appendix 5.2, assuming the allele frequencies have the above uniform distributions, we show that μ1 = 0.23/n and [mu]f p = 0.062(1 – s). Therefore, if 1 – s = 3.7/n, μ1 = [mu]f p. If our mixture contained 1000 subjects, the reference and mixture groups would only need to differ at 0.37% of all SNPs for the false positive rate to exceed the power. Had we allowed the allele frequencies for those (1-s)N SNPs to be randomly distributed over the interval [0, 1], even fewer SNPs would need to differ. Here, μ1 = [mu]f p if 1 – s = 1.2/n, and only 0.12% of all SNPs would need to differ.

Figure 1:
The false positive rate can easily exceed power when tests are based on TL1. Using the normal approximations given in equations 7 and 8, we plot the false positive rate when Hf p is true, showing its decrease with the proportion, s, of SNPs that are ancestry ...

2.2.2. DL1 : Bias from Sample Size

In this section, we show that unless the number of subjects in the mixture and reference groups are equal, either type I or type II error can be higher than expected. Here, we keep assumptions I and III, identical ancestries and platforms, and examine the test’s bias if nM and nR are finite and unequal. Before discussing the origin and extent of this bias, we state a useful fact, which is proven in appendix 5.3.

Inequality of Absolute Values:

Let p < 0.5, p1  N(p,σ12), p2  N(p,σ22), and p1 [perpendicular] p2.

If σ12>σ22, then E [|0.5 – p1|] > E [|0.5– p2|]

Our goal is to show that if nMnR, then E[DL1 (j)] ≠ 0 when M0 = 0, even under otherwise ideal conditions. This is demonstrated empirically by figure 2. A more complete discussion follows. Without loss of generality, assume that we have a large reference sample and nR > nM. Then var([p with hat]M j) > var ([p with hat]R j). Because [p with hat]R j is a more precise estimate of p, the expected L1 distance between Y0 j and [p with hat]R j tends to be smaller than the expected distance between Y0 j and [p with hat]M j, or, equivalently, E[DL1 (j)] < 0. This would be a trivial statement if distance were measured in L2. For L1, we will show that E[Xj] < 0, which is essentially equivalent to E[DL1 (j)] < 0. Recall, X (j) [equivalent] | Y0 j[p with hat]R j | – | Y0 j[p with hat]M j |. To use the Inequality of Absolute Values, we start by assuming p^Rj  N(p,nR1σp2) and p^Mj  N(p,nM1σp2), and decompose X (j) as




It is immediately clear that when the individual of interest is not in the mixture, E[Xj1] = E[[p with hat]R j] – E[[p with hat]M j] = 0 and E[Xj0] = E[[p with hat]R j] – E[[p with hat]M j] = 0. With our assumptions of normality and the above inequality, E[Xj0.5] < 0. Consequently, E[Xj] < 0. For an even more intuitive understanding, let nR ≈ ∞ and [p with hat]R j = pR j [equivalent] 0.5 Then, clearly |0.5 – [p with hat]R j | = 0, and Xj0.5 < 0.

Figure 2:
E[DL1 (j)] ≠ 0 even when pR j = pM j [equivalent] p. In this example, where we let nR = ∞ ([p with hat]R j = p), we find that E[DL1 (j)] decreases as the number of individuals in the mixture shrinks toward 0. We plot E[DL1 (j)] (y-axis) ...

2.2.3. DL1: Bias from Platform

In this section, we show that unless the allele intensities for the mixture and reference group are measured on the same platform, the test may be biased, often leading to an increase in type II error. Normally, we need not concern ourselves with platform bias, as we can ensure that both samples will be measured on the same platform. Unfortunately, this is not the case here. For the mixture, we have a sample of DNA and we can choose our preferred platform. For the reference group, we often do not have access to actual samples of DNA, and instead base our estimates of allele frequencies on previously recorded genotypes for a group of individuals. Comparing γM j and nR1i=Ri=1Yij is equivalent to comparing allele frequencies measured on two different platforms. Also, even if the reference sample were an actual mixture of DNA, that sample might no longer be available for analysis.

First, if βMβR, then E[DL1] ≠ 0 when M0 = 0, even under otherwise ideal conditions. Simple algebra shows that if model (4) were true,


Here, we presume E[DL1 (j)] > 0 implies E[DL1] > 0, or that j=1NβMjj=1NβRj. Fortunately, this bias can be easily removed by substituting γMj*γMjβ^Mj and γRj*γRjβ^Rj into the equations for DL1 (j), where {[beta]M j, [beta]R j} is an unbiased approximation of {βM j, βR j}, or by using the appropriate constants kM j and kR j for calculating γM j and γR j.

However, this does not eliminate the potential bias caused by platform. If σMj2σRj2, then E[| Y0 jγR j | – | Y0 jγM j |] ≠ 0. The origin of this confounding is identical to that for the nMnR case. Here, we would assume that p^Mj+εMj  N(p,σ12) and p^Rj+εRj  N(p,σ22). Next, note that even if E[[beta]M j] = βM j, E[[beta]R j] = βR j, and σMj2=σRj2, there can still be bias if var ([beta]M j) ≠ var ([beta]R j), as E[|Y0jγRj*||Y0jγMj*|]0. The origin of the confounding is easily identified if we assume that p^Mj+β^MjN(p,σ12) and p^Rj+β^RjN(p,σ22). The main importance of these statements is that the reference sample cannot simply be the average of known genotypes. In this case σRj2=0 and βR j = 0 (i.e. var ([beta]R j) = 0). This would create a considerable test bias.

2.3. DL2 : Switching from L1 to L2

Using the L2 distance (recall DL2 (j) = (Y0 jγR j)2 – (Y0 jγM j)2) greatly simplifies calculations. Under the null hypothesis H0, if we keep assumptions II and III, the mean and variance of DL2 (j) can be easily calculated as (see appendex 5.4):


With the same assumptions, under the alternative hypothesis, H1, the mean of DL2 (j) is


and the variance can be approximated by σ2 (j). In contrast to DL1, we have estimated the parameters for the normal approximation of DL2 without assuming large n. Equivalently, the approximation of NDL2N(0,N1jσ2(j)) requires only DL2 (j1) [perpendicular] DL2 (j2) ∀ j1j2 and N being large. In addition to the ease of L2, we found the statistic TL2 to outperform TL1. This can be verified by simulation (data not shown) or by comparing the two normal approximations for TL1 and TL2. In the online supplementary material, we compare the power of the statistics TL1 and TL2 for specific values of n, N, p, and σ2, assuming independent SNPs and pM j ~ uniform[0,0.5]. The plots suggest a mild improvement using TL2.

2.4. DL2*: A New Statistic

Our ideal goal would be to develop a pivotal statistic that, when M0 = 0, is a) independent of the individual of interest’s ethnicity; b) independent of the number of individuals and the ethnicity of those individuals in the mixture; and c) independent of the platform chosen to analyze the mixture. Although we could find no such statistic, we can take advantage of being able to easily identify an individuals’ ethnicity by genotyping a small set of SNPs (< 0.01N). The previous requirement of identifying the ethnic composition of the mixture is a far more difficult task. Therefore, we introduce a new statistic that, given e0, will have a N(0,1) distribution when M0 = 0 regardless of the remaining parameters. The key difference in deriving this statistic is that the individuals in the reference group will be selected to have the same ancestry as the individual of interest, as opposed to the same ancestral composition as the mixture. Note that the small set of SNPs used to identify the individuals’ ancestries can be removed from the later analyses without greatly diminishing power. A suggested list of SNPs will be made available by the authors in the near future. Moreover, we found that the suggested statistic will lead to a test with increased power. The details of the statistic follow.

To describe the new statistic, we change the notation slightly as we no longer have Yi j values for subjects in the mixture. Therefore, in addition to having γM j and Y0 j, 1 ≤ jN, choose nR subjects for a reference group and let Yi j be the genotype for subject i, 1 ≤ inR at SNP j. Note that i is bounded by nR instead of nR + nM. Select the reference subjects so their ancestry is similar to that of the individual of interest (i.e. ei = e0 if Ri = 1).

Step 1:

Create nR + 1 reference samples, γR0, γR1,,γRnR, where γRk={γRk1,,γRkN}, k [set membership] {0, ...,nR} and


Here, σR2=0 in model (4). We have immediately removed a source of variation.

Step 2:

Measure the distance between each individual and the appropriate reference group, (Yi jγRi j)2, and compare the resulting value to the distance between that individual and the mixture


Because the σR2 term is absent from DiL2*(j), var(DiL2*(j))var(DL2(j)). In fact,


All variance and covariance calculations will assume n [equivalent] nR = nM and pj [equivalent] pR j = p0 j = pM j.

Step 3:

We then average those differences over all reference subjects,


to obtain an expected difference between the distance to the reference sample and the distance to the mixture, under the null hypothesis. The covariance between two terms in the sum is (see appendix 5.4)

cov(D1L2*(j),D2L2*(j))=   2σMj4+(2pj2pj2)σMj2n+14pj34pj2+2pj312pj4n3+18pj+118pj252pj3+114pj4n4

Each term on the right side of equation 18 must be positive, so the covariance must be positive (i.e. cov(D1L2*(j),D2L2*(j))>0).

Step 4:

Compare D0L2*(j) to this averaged value


By noting the exchangeability between any two terms Di1L2*(j) and Di2L2*(j), we can calculate the variance of DL2*(j)


The key result is that, except for those values of σMj* where the power is essentially 1 regardless of the chosen statistic, var(DL2*(j))<var(DL2(j)). As the means for the L2 statistics are the same, the smaller variance suggests that our proposed statistic TL2*, based on DL2*, will usually have higher power than the original TL2. This improvement is demonstrated in our simulations. Moreover, the statistic was designed so E[DL2*(j)|H0]=0 if the ancestries of the reference group and individual of interest are matched correctly, regardless of platform and sample size.

Step 5:

Average over all SNPs to get


Because of the large number of SNPs, the CLT suggests that NDL2*N(0,σD2) under the null hypothesis. When allowing for dependency,


Therefore, σD2 can be estimated by


In practice, we restrict the double sum to j1j2 and | j1j2 | ≤ 500. In the on-line supplementary material, we use the HapMap samples to show N1|j1j2|>500(DL2*(j1)DL2*)(DL2*(j2)DL2*)0. We now define our new statistic,


and note that TL2*N(0,1). For purposes of future comparisons, we also define DL1*(j), DL1*, and TL1* by replacing the L2 distance with the L1 distance in DL2*(j), DL2*, and TL2*.

Although TL2* appears to be one of the better performing statistics, it is not necessarily the most intuitive. We first tried a simpler alternative, DL2s, but as we discuss here, this statistic proved to have extremely low power. Let




This alternative also has the desirable characteristics that neither platform nor sample size can invalidate the equalities E[(Y0 jγM j)2] = E[(Yi jγM j)2] and E[DLs(j2)]=0, and that the ancestry of the reference group need only match that of the individual of interest. The problem with such a simple approach is that the var(DL2s)n×var(DL2). To understand the origin of this vast difference between variances, we turn to a simple example, where Yi j [set membership] {0, 1} and P(Yi j = 1) = pjM. Here, DL2(j){2(γMjγRj)+γRj2γMj2,γRj2γMj2}, and the var(DL2 (j)), under H0, asymptotically approaches 0. In contrast, DL2s(j){(1γMj)2+k(YR,γMj),γMj2+k(YR,γMj)}, where k(YR,γMj) is the appropriate function of the reference group. Without the cancelation of the Y0 j terms in the statistic, asymptotically, var(DLs(j2))pjM(1pjM)(1+4pjM2). Therefore, we designed DL2* to not only avoid potential bias, but retain the advantage gained by the cancelation of Y0 js in DL1. Here we also note that the variance of DL2 (j) or any similar statistic will be minimized when E[γM j] = E[γR j].

This same simple example, where Yi j [set membership] {0, 1} and P(Yi j = 1) = p jM, illustrates why the original statistic, DL1 (j) = | Y0 jγM j | – | Y0 jγR j |, or any of our adaptations, needs to include the reference term, | Y0 jγR j |. Under the null hypothesis, var(| Y0 jγR j | – |Y0 jγM j |) = 2p jM (1 – p jM) /n → 0. In contrast, without the cancelation of Y0 j terms, var(|Y0jγMj|)pjM5pjM2+8pjM34pjM4. Even if the ethnicities were well matched, the variance of the genome-wide statistic would be proportional to the sum of these variances over the entire genome, not just over the subset of informative SNPs. For choosing the best statistic, the variance will be the deciding value. The additional term | Y0 jγR j | does not change the expected difference between the test statistic under the null and alternative hypotheses.

2.5. Data and Simulations

The power and type I error for the test based on TL1 has already been thoroughly described when e0 = eR = eM. Here, we use simulations to accomplish two goals. First, we show that tests based on TL2* are more powerful than tests based on TL1. Second, we show that if the threshold tα is chosen so that the type I error rate is α when e0 = eR = eM (equation 5), but, in fact, assumption I is not true, then the true type I error rate can greatly exceed α.

The simulations generated distributions of TL1, TL2, TL1*, and TL2* for each of the three scenarios, H0, H1, and Hf p. Recall that the denominator for TL1 was N1j(DL1(j)DL1)2, which assumes independence across SNPs. Therefore, in order to provide a fair comparison, we let the genotypes be independently distributed and let the denominator for TL2* be N1j(DL2*(j)DL2*)2.

First, we simulated 10,000 datasets (and 10,000 values of each test statistic) under H0, with e0 = eM = eR, nM = nR, and βM = βR = 0. Each dataset is simulated as follows. For each of 5000 SNPs (when nM = 100) or 50,000 SNPs (nM = 1000), we randomly generated p0 j, the frequency of allele ‘A’ for the individual of interest, from a uniform(0.1,0.5) distribution. Then, we set pM j = pR0j = pRM j = p0 j, where pM j, pR0 j, and pRM j are the ‘A’ allele frequencies for subjects in the mixture, reference group R0 (selected to match the individual of interest), and reference group RM (selected to match the mixture). Given the allele frequencies and assuming Hardy-Weinberg Equilibrium, we then generated the genotypes for the individual of interest ( (Y0)), the nm individuals in the mixture (YM), and the nR individuals in each reference group (YR0, YRM). Given knowledge of the individuals in the mixture and reference groups, we generated the microarray intensities at each SNP as γMj=nM1i:Mi=1Yij+εMj and γRj=nR1R0i=1Yij, where εMj  N(0,σM2). Simulations were performed for a range of σM2 values. The results are a single dataset {Y0,YM,YR0,YR} and a set of values {TL1,TL2,TL1*,TL2*}. After simulating all 10,000 datasets under H0, we denote the 99th percentile of the four test statistics by tαL1, tαL2, t^αL1*, and t^αL2*.

Next, we simulated 1000 datasets under H1. To create the datasets, we repeated the above steps, but let γM(j)=nM1(Y0j+i=2nMYij)+εMj. Then the power for the original test was the propotion of these 1000 TL1 values exceeding tαL1. A similar estimation of power was performed for TL2, TL1*, and TL2*. Finally, we simulated datasets under Hf p. The type I error rate will depend on how the allele frequencies differ between the two ancestries. In one case, we assumed that a proportion, s, of the SNPs to be ancestry independent, (i.e. pM j = pR0j = pRM j = p0 j). For j > s × 5000 or j > s × 50, 000, we still enforced pM j = pR0j = p0 j, but generated pRM j from a uniform(0.1,0.5) distribution. In the second case, we assumed that the allele frequencies differed at all SNPs and let pRM j = (1 + kj)±1 p0 j, where {k1, ..., kN} was a vector of equally spaced points over the range 0 to 0.15, 0.20, and 0.25 when nM = 100 and 0 to 0.03, 0.05, and 0.07 when nM = 1000. The superscript ± 1 indicates that the exponent was randomly generated to be −1 or 1 with equal probability. Here, we can quantify the extent of the difference between the two ethnicities by the measure λ, the variance inflation factor typical in case/control studies. For nM = 100, the three ranges correspond to a λ of 1.16, 1.36, and 1.65, and for nM = 1000 the three ranges correspond to 1.09, 1.32, and 1.67. For each of these cases, we simulated 1000 datasets and defined the type I error rate to be the percentage of these 1000 TL1 (TL2) values exceeding tαL1 (tαL2).

To better quantify the type I error rate possible in real experiments, we used the 90 CEU and 45 Japanese individuals from the HapMap samples. For each CEU individual i, we selected 9 unrelated CEU individuals to create a positive mixture, γM+=0.1(Yi+k=19Yk)+εi, or 10 unrelated individuals to create a negative mixture, γM=0.1(k=110Yk)+εi, where εi{εij,,εiN}, εij ~ N(0,0.012) and εi j1 [perpendicular] εi j2 if j1j2. To achieve meaningful levels of power, we chose N = 1000. For each CEU individual, 11 reference groups were similarly created where γRt included t Japanese individuals and 10 – t CEU individuals. We calculated the distribution of TL1 (36000 simulations, 90 subjects × (400,000 SNPs/1000)=400 sets of SNPs) under H0, H1, and Hf p. We chose the threshold tαL1 so that α = 0.05 if H0 were true and then calculated the probability of rejection under H1 (i.e. power) and under Hf p (i.e. type I error).

3. Results

3.1. Comparison of Power

Our first goal is to show that tests based on TL2* are more powerful than tests based on TL1. As the results for the two sets of simulations, nM = nR = 1000 and nM = nR = 100 were similar, we limit our discussion to the latter. Recall, we chose to empirically estimate the 99th percentile of the test statistics under H0. This was a necessity because our reference mixture is an average of known genotypes, σM2σR2, and as discussed in section 2.2.3, TL1 cannot be expected to follow a N(0, 1) distribution. Under H0, when σM = 0.03, 0.05, 0.07, and 0.09, the averages of TL1 over the 10,000 simulations were −0.76, −1.71, −2.77, and −3.91 respectively. Similarly, the averages of TL2 were −1.85, −4.13, −6.46, and −8.68 for those same values of σM. The empirical variances were close to 1. In contrast, TL2* followed a N(0, 1) distribution, as promised, and this empirical step was superfluous. The empirical means for TL1* and TL2* were near 0, –0.015 ≤ T¯L1* ≤ –0.010 and –0.024 ≤ T¯L2* ≤ –0.015, regardless of σM. Moreover, not only were the empirical variances near 1, the empirical 99th percentiles were approximately 2.326 (2.29 ≤ t^αL1* ≤ 2.32 and 2.31 ≤ t^αL2* ≤ 2.32).

For each value of σM, we then calculated the power for the four tests, based on TL1, TL2, TL1*, and TL2*. The alternative hypothesis was H1 with the other parameters unchanged. These values are plotted against signal noise or σM in Figure 3. Clearly, power is larger when the statistics are based on L2 and when the reference group is matched to the individual of interest. Note, had we included an error term in the reference mixture (and possibly attained TL1, TL2 ~ N (0, 1)), the benefit of using the new statistic would have been larger.

Figure 3:
Tests based on TL2* have the highest power. Power for tests based on the four statistics, TL1, TL2, TL1*, and TL2*, are plotted for multiple values of σM2 or noise. Other parameters in the simulation were nM = nR = 100, N = 5000, s = 1, and ...

3.2. Examination of False Positive Rate

Our second goal is to show that when assumption I is violated, the type I error rate can be much larger than α. Again, we presume tα was selected so equation 5 holds. Simulations showed the quick increase in the false positive rate, P(TL1 > tα | Hf p), as the proportion of ancestry-independent SNPs decreased in those tests based on TL1. Parameter values were chosen so that the ideal test (e0 = eM = eR), for both nM = 100 and nM = 1000, would have had power = 0.87 to detect an individual’s presence. Clearly, unless s is close to 1, the false positive rate exceeds this estimated power. Because we have finite sample sizes and σM2σR2, we cannot expect the predicted power and false positive rate to be equal when 1 – s = 3.7/n (Appendix 5.2). However, the simulations can confirm that the two are approximately equal when μ1 =μf p and that value of 1 – s which results in equality is proportional to 1/n. When the power and false positive rate cross in Figure 4 for nM = 100, we find μ1 = 2.40, μf p = 2.41, and 1 – s = 0.088. Similarly for nM = 1000, we find μ1 = 2.65, μf p = 2.64, and 1 – s = 0.0089. For simulations under the second set of pRM j, we found the false positive rates to be 0.11, 0.30, and 0.61 when nM = 100 and 0.035, 0.15, and 0.51 when nM = 1000.

Figure 4:
The type I error rate for tests based on TL1 can exceed power. Type I error rate increases as the proportion, s, of ancestry independent SNPs decreases. Other parameters in the simulation were N=5000 (50,0000), σM2=0.0352(0.012), and nM = nR ...

The false positive rate for the HapMap samples were calculated when the mixture and reference groups each had 10 subjects. Both the individual of interest and the subjects in the mixture were from the CEU population. As the ratio of Japenese:CEU individuals increased in the reference group, the false positive rate increased from the α-level, 0.05, to near 1. When the ratio exceeded 8:2, the false positive rate exceeded the estimated power.

4. Discussion

As the overall popularity of SNP microarray technology increases and the cost of the technology decreases, there will likely be a shift from STRs to genome-wide sets of SNPs as the preferred method for DNA identification. Therefore, databases designed to store genetic identifiers for individuals will likely store genotypes for sets of SNPs in the future. Coupled with our earlier discussion, there will be three main advantages of using high-density SNPs to determine whether an individual contributes DNA to a mixture: accessibility, higher resolving power and the ability to work with low quality, degraded, DNA.

Here, we have demonstrated that tests based on TL1 can suffer from inflated type I and type II errors if the mixture contains individuals with unknown ancestries. Therefore, we introduced a new test statistic, TL2* and an accompanying test that only require matching the ancestry of the reference group to that of the individual of interest. Even if the individual of interest has a mixture of ancestries, there should still be some subjects, with a similar mixture of ancestries, that can be used for comparison. This test is also robust to platform and sample size. We showed that both switching from the L1 to the L2 measure and switching to the new type of statistic increased the power to detect the individual of interest. Therefore, TL2* is not only more robust than TL1, it tends to have increased power.

Figure 5:
The type I error rate for tests based on TL1 can exceed power in simulations based on HapMap samples. The individual of interest and individuals in the mixture group were from the CEU population. The individuals in the reference group were a mixture from ...

Supplementary Material


additional power comparison


We would like to thank Dr. David Craig for providing us with data from his original article and to the anonymous reviewers for their helpful comments. This work was supported by NIH GM59507.

5. Appendix

5.1.  Distribution of DL1 (j)

To estimate the distribution of DL1 (j), we make the following assumptions, where g [set membership] {R, M}:

  1. Model (4) is true.
  2. βR j = βM j = 0
  3. n [equivalent] nM = nR
    And restrict the SNPs examined to those satisfying
  4. |pg j – 0.5 | >δ
    for some small δ such that
  5. P(|[p with hat]g jpg j | > δ /2) ≈ 0.
  6. P(|εg j | > δ /2) ≈ 0.

Assumptions 5 and 6 promise that n is large enough and the magnitude of εg j is small enough so that the signs inside the absolute values are determined by Y0 jpg j. For much of the exposition, we will still leave the βR j and βM j terms in the equations, although we assume they are small enough to not affect the sign inside the absolute values, to demonstrate their effects on bias and variance had they not been assumed to equal 0. To start, divide DL1 (j) into three components



Dj1i=(n+1)2n1Yijn1Y1jni=2n1YijnβRjεRj+βMj+εMjDj0.5i=(n+1)2n0.5Yijn0.5Y1jn                                        i=2n0.5YijnβRjεRj+βMj+εMj if pRj<0.5Dj0.5i=(n+1)2nYij0.5n0.5Y1jn                                        i=2n0.5Yijn+βRj+εRj+βMj+εMj if pRj>0.5Dj0i=(n+1)2nYijn             Y1jni=2nYijn+βRj+εRjβMjεMj

By definition pM j < 0.5. We next calculate the expected values of each component under the two assumptions, H1 and Hf p. We know that E[Yij]=pgj2+pgj(1pgj)=pgj, where g indicates group. Therefore, assuming scenario H1, where without loss of generality, we let the individual of interest be subject 1 of the mixture,




We immediately see that


Let us assume that βM j = βR j, then we get


Next we calculate the expected value under H f p,

E[Dj1|Hfp]=pMjpRjβRj+βMjE[Dj0.5|Hfp]=pMjpRjβRj+βMj if pRj<0.5E[Dj0.5|Hfp]=pMj+pRj1+βRj+βMj if pRj>0.5E[Dj0|Hfp]=pRjpMj+βRjβMj

We immediately see that

E[DL1(j)|Hfp,pRj<0.5]=(pMjpRj+βMjβRi)(12(1pMj)2)E[DL1(j)|Hfp,pRj>0.5]=pMj(2pMj26pMj+3)+pRj(12pMj2)+                                                 βMj(12(1pMj)2)+βRj(12pMj2)

For simplification, let us assume that βM j = βR j, then we get

jE[DL1(j)|Hfp]=j:pRj<0.5(pMjpRj)(12(1pMj)2)+                           j:pRj>0.5pMj(2pMj26pMj+3)+pRj(12(1pMj)2

Now, we turn our attention to the variance of DL1 (j) and recall that


We calculate the var(DL1 (j) | Y0 j), assuming all individuals within a group are unrelated,

E[var(DL1(j)|Y0j)]=12npRj(1pRj)+var(Y1j)n2+                              n12n2pMj(1pMj)+σRj2+σMj2

Under H0, where E[DL1 (j) | Y0 j] = 0 and pR j = pM j, we see that


For simplification, we again assume that βM j = βR j and therefore


Under H1, we find that


and for large n, we see that E[var(DL1 (j) | Yj0, H1)] ≈ E[var(DL1 (j) | Yj0, H0)]. Similarly, because E[DL1 (j) | Y0 j, H1] = E[DL1 (j) |Y0 j, H0] + O( 1n), then var(E[DL1(j)|Y0j,H0])=var(E[DL1(j)|Y0j,H1])(1+O(1n))+O(1n2). Therefore, we can assume that var(DL1 (j) | H1) ≈ var(DL1 (j) | H0).

Next, we turn to scenario Hf p. Clearly,

E[var(DL1(j)|Y0j,Hf p)]=12npMj(1pMj)+12npRj(1pRj)+σRj2+σMj2

To calculate the var(E[DL1 (j) | Y0 j, Hf p]), we use the notation pΔj = pM jpR j and βΔj =βM jβR j, then under Hf p, if we let pR j < 0.5,

var(E[DL1(j)|Y0j,Hfp])=                    [(pΔjβΔj)24pMj(2pMj)(1pMj)2]

Now, if we assume that the minor allele in the mixture is the same as the minor allele in the reference group, or equivalently, that we have at least identified a reference population of moderately similar composition, then we can be satisfied that equation (43) is a satisfactory approximation of var(E[DL1 (j) | Y0 j, Hf p]). Again, with the simplification that βM j = βR j, we find

var(DL1(j)|Hfp)=12npMj(1pMj)+12npRj(1pRj)+σRj2+σMj2+                                    [(pMjpRj)24pMj(2pMj)(1pMj)2]

5.2.  Estimating when μ1 = [mu]f p

Here, we provide a rough approximation of the percentage of SNPs that would need to be ancestry dependent for E[μ1 (j)] = (1 – s) E[μf p (j)]. We base this approximation on the assumption that pM j and pR j are uniform random variables over the intervals [0,0.5] and [0,1] respectively. Start by calculating E[DL1 (j) | pR j < 0.5] from equation 35, letting βM j = βR j where the expectation is over pR j and pM j.

E[DL1(j)|pR<0.5,Hfp]   =400.500.5(pMjpRj)(4pMj12pMj2)dpRjdpMj   =400.500.5(4pRjpMj+4pMj2+pRjpMj+2pMjpMj22pMj3)dpRjdMj   =0.062

Had we assumed pR j > 0.5, E[DL1 (j) | pR j > 0.5, Hf p] = 0.32. Combining the two equalities, we find E[DL1 (j) | Hf p] = 0.19. Obviously, from an evolutionary perspective, pR and pM should often be similar in value, but we ignore that here for simplicity. Furthermore, assume that only some proportion, 1 – s, of those N SNPs differ among ethnicities. Then E[μ f p (j)] = 0.19 and [mu] f p [equivalent] (1 – s) E[μf p (j)] = 0.19(1 – s).

Next, from equation 33, we know nE [DL1 (j)] = 2pM j (1 – pM j)2, and again assuming pM j is a uniform random variable over the interval [0,0.5], we find E[DL1 (j) | H1] = .22/n. Therefore, across all SNPs, assuming no linkage, μ1 [equivalent] E[μ1 (j)] = .22/n.

Therefore, in order for μ1 = [mu] f p, we would need .22/n = 0.19 (1 – s), which shows that false positives can be expected so long as the percentage of ancestry dependent SNPs, 1 – s, exceeds 1.16/n. Remember, that the difference between pR and pM is likely to be overstated by the above simplifications, and therefore this 1.16/n will be low.

5.3.  Inequality of Absolute Values (proof)

Theorem: Let X1N(p,σ12), X2N(p,σ22), g(X) = | 0.5–X |, X1 [perpendicular] X2, σ12>σ22, and p < 0.5. Then


Let Z1 = F1 (X1) and Z2 = F2 (X2), where F1 and F2 are their respective cumulative distribution functions. Then

E[g(X2)]E[g(X1)]=g(X2)f2(X2)dX2g(X1)f1(X1)dX1=01g(F21(Z2))dZ201g(F11(Z1))dZ1=00.5g(F21(Z2))+g(F21(1Z2))dZ2                00.5g(F11(Z1))+g(F11(1Z1))dZ1

There are two possibilities. Case A) Assume F21(1Z2)<0.5. Then, by linearity of g(X),


Clearly, by similar logic,


Case B: F21(1Z2)0.5. Then, clearly


Therefore, for case B we also see


Taking the expectation over both scenarios gives the desired result, E[g (X2)] – E[g (X1)] < 0.

5.4.  Distribution of DL2 (j) and DL2*(j)

The distribution for DL2 (j) and DL2*(j) can be described by the variances and covariances of terms in the following vector:


by simple calculation, the var(V1) can be described by:


In this section, as we focus on only a single locus, we have dropped the subscript ‘j’ from Yi j, pj and σj2. Next, by expanding the terms, and assuming n [equivalent] nM = nR, σR2=σM2 and βR = βM = 0, we get

DL2(j)=       i:Ri=12Y0Yin2Y0εR+i:Ri=1k:Rk=1YiYkn2+i:Ri=12YiεRn+εR2       (i:Mi=12Y0Yin2Y0εM+i:Mi=1k:Mk=1YiYkn2+i:Mi=12YiεMn+εM2)


var(DL2(j))=(8n+4n(n1)n4var(Y1Y2)+(8+8n)var(Y1εM)+                 2n3var(Y12)+2var(εM2)+                 (16n8n2+16n2)cov(Y1Y2,Y1Y3)+                 (8n3)cov(Y1Y2,Y12)+(88n)cov(Y1εM,Y2εM)

By substituting the appropriate values from the variance matrix for V1, we get the variance given in section 2.3. Using the same concept of expanding the terms and straight forward calculation, we were able to find covar( D1L2*,D2L2*).


  • Balding David J. Likelihood-based inference for genetic correlation coefficients. Theoretical Population Biology. 2003;63(3):221–230. doi: 10.1016/S0040-5809(03)00007-8. [PubMed] [Cross Ref]
  • Couzin Jennifer. Genetic privacy: Whole genome data not anonymous, challenging assumptions. Science. 2008 Sep;321(5894):1268–1374. doi: 10.1126/science.321.5894.1278. [PubMed] [Cross Ref]
  • Foreman LA, Champod C, Evett JA, Lambert S Pope. Interpreting dna evidence: A review. International Statistical Review. 2003;71:473–495.
  • Fung Wing K, Hu Yue-Qing. Evaluating mixed stains with contributors of different ethnic groups under the nrc-ii recommendation 4.1. Statistics in Medicine. 2002 Nov;21:3583–3593. doi: 10.1002/sim.1313. [PubMed] [Cross Ref]
  • Homer Nils, Szelinger Szabolcs, Redman Margot, Duggan David, Tembe Waibhav, Muehling Jill, Pearson John V, Stephan Dietrich A, Nelson Stanley F, Craig David W. Resolving individuals contributing trace amounts of dna to highly complex mixtures using high-density snp genotyping microarrays. PLoS Genet. 2008 Aug;4(8):e1000167. doi: 10.1371/journal.pgen.1000167. [PMC free article] [PubMed] [Cross Ref]
  • Jobling Mark A, Gill Peter. Encoded evidence: Dna in forensic analysis. Nat Rev Genet. 2004 Oct;5:739–751. doi: 10.1038/nrg1455. [PubMed] [Cross Ref]
  • Kidd Kenneth K, Pakstis Andrew J, Speed William C, Grigorenko Elena L, Kajuna Sylvester LB, Karoma Nganyirwa J, Kungulilo Selemani, Kim Jong-Jin, Lu Ru-Band, Odunsi Adekunle, Okonofua Friday, Parnas Josef, Schulz Leslie O, Zhukova Olga V, Kidd Judith R. Developing a snp panel for forensic identification of individuals. Forensic Science International. 2006;164(1):20–32. doi: 10.1016/j.forsciint.2005.11.017. [PubMed] [Cross Ref]
  • Macgregor Stuart, Zhao Zhen Zhen, Henders Anjali, Nicholas Martin G, Montgomery Grant W, Visscher Peter M. Highly cost-efficient genome-wide association studies using dna pools and dense snp arrays. Nucleic Acids Res. 2008;36:e35. doi: 10.1093/nar/gkm1060. [PMC free article] [PubMed] [Cross Ref]
  • Pearson JV, Huentelman MJ, Halperin RF, Tembe WD, Melquist S, et al. Identification of the genetic basis for complex disorders by use of poolingbased genomewide single-nucleotide-polymorphism association studies. American Journal of Human Genetics. 2007;80:126–139. doi: 10.1086/510686. [PubMed] [Cross Ref]
  • Stoneking M, Hedgecock D, Higuchi RG, Vigilant L, Erlich HA. Population variation of human mtdna control region sequences detected by enzymatic amplification and sequence-specific oligonucleotide probes. American Journal of Human Genetics. 1991;48(2):370–382. [PubMed]
  • The International HapMap Consortium The international hapmap project. Nature. 2003 Dec;426:789–796. [PubMed]

Articles from Statistical Applications in Genetics and Molecular Biology are provided here courtesy of Berkeley Electronic Press