


Stat Appl Genet Mol Biol. 2009 January 1; 8(1): 37.

Published online 2009 September 9. doi: 10.2202/1544-6115.1469

PMCID: PMC2861329

Copyright © 2009 The Berkeley Electronic Press. All rights reserved


A new test was recently developed that uses a high-density set of single nucleotide polymorphisms (SNPs) to determine whether a specific individual contributed to a mixture of DNA. The test statistic compares the genotype of the individual to the allele frequencies in the mixture and to the allele frequencies in a reference group. The test requires the ancestries of the reference group to be nearly identical to those of the contributors to the mixture. Here, we first quantify the bias, the increase in type I and type II error, that occurs when the ancestries are not well matched. Then, we show that the test can also be biased if the numbers of subjects in the two groups differ or if the platforms used to measure SNP intensities differ. We then introduce a new test statistic, and an associated test, that only requires the ancestry of the reference group to be similar to that of the individual of interest, and show that this test is not only robust to the number of subjects and platform, but also has increased power of detection. The two tests are compared on both HapMap and simulated data.

Given a mixture of DNA samples from numerous individuals, it is often desirable to determine whether a specific individual contributed DNA to that mixture. Using forensics as an example, the mixture can be a specimen from a crime scene, and the goal can be to determine whether a suspect’s DNA is included in that specimen. Many methods have been proposed to identify the presence of an individual within a mixture. Most of them focus on cases where only a few people contribute to the mixture. These methods usually compare short tandem repeats (STRs) in the mixture to those in the individual (Fung and Hu, 2002; Balding, 2003; Foreman et al., 2003). When only males are of interest, the comparison can be limited to STRs on the Y chromosome (Jobling and Gill, 2004). In cases where the DNA has degraded, a second approach, comparing mitochondrial DNA (mtDNA), specifically the hypervariable region, between the mixture and the individual can be a better alternative (Stoneking et al., 1991). The limitations of each method were discussed in Homer et al. (2008), but in brief, methods based on STRs require the DNA to be in good condition, and methods based on mtDNA, even when augmented by informative SNPs, can have limited discriminatory power (Homer et al., 2008).

In a ground-breaking paper, Homer *et al.* (2008) propose a new method that uses a genome-wide set of SNPs to identify an individual in a mixture. Their method compares
${\overrightarrow{Y}}_{M}=\{{Y}_{M1},\dots ,{Y}_{MN}\}$, where *Y*_{M j} is the proportion, or frequency, of the “A” allele in the mixture at SNP *j*, with the corresponding allele frequencies for the individual of interest and for a reference group.

The test statistic, *T*, proposed by Homer *et al.*, compares the similarity between
${\overrightarrow{Y}}_{0}$ and
${\overrightarrow{Y}}_{M}$ to the similarity between
${\overrightarrow{Y}}_{0}$ and
${\overrightarrow{Y}}_{R}$, where
${\overrightarrow{Y}}_{R}$ are the allele frequencies in a reference mixture. If *T* is large enough, the corresponding test will reject the null hypothesis that the individual of interest does not contribute DNA to the mixture. For this test to perform well, the subjects for the reference group must be carefully chosen so that their ancestral composition matches that of the subjects contributing to the mixture (e.g. if Caucasians contribute to the mixture, then the reference group must also be Caucasian), or only a select subgroup of ancestry independent SNPs can be used (Kidd et al., 2006). The need for similar reference groups is clear. Assume we fail to select a similar reference group, and the individual is more similar, in terms of ancestry, to those subjects contributing to the mixture. Then, even if the subject’s DNA is absent from the mixture,
${\overrightarrow{Y}}_{0}$ will be more similar to
${\overrightarrow{Y}}_{M}$, *T* will be large, and the test will likely result in a false positive. Similarly, if the individual and the reference group are more similar, the test can lead to a false negative. The obvious problem is that the identities, and therefore the ancestries, of the individuals in the mixture are unknown, and this required matching can be very difficult.

The first goal of this paper is to quantify the magnitude of the bias, type I error, and type II error that can occur if the ancestries of the two groups are poorly matched. Then we identify two other possible sources of bias: the type of platform (e.g. Illumina, Affymetrix) and the number of subjects in the reference group. We use HapMap data to further quantify the extent of the bias in real samples (The International HapMap Consortium, 2003). After demonstrating the severity of the potential problems caused by using *T*, we propose a new statistic that only requires selecting a reference group with an ancestry that matches the ancestry of the individual of interest. This is a far simpler task, as the individual’s ancestry is usually known. Other benefits of the new statistic are that it is unbiased and has a known null distribution. We then compare the performance of the two statistics using simulated data. The paper is therefore organized as follows. Section II starts by introducing notation and then continues with a discussion of the properties of both statistics and their associated tests. Section III demonstrates the performance of the statistics and their associated tests using both HapMap and simulated data. Finally, Section IV contains our brief concluding remarks.

Let the individual of interest be indexed by 0, and let the *n** _{R}* subjects in the reference population and the

Let there be *N* SNPs in the study. Let *Q*_{i j 1} and *Q*_{i j 2} indicate whether the minor allele (relative to ethnicity *e*_{0}) is on chromosome 1 and 2, respectively, for subject *i* at SNP *j*. Let *Y*_{i j} = 0.5(*Q*_{i j 1} + *Q*_{i j 2}).

Most genotyping platforms directly measure fluorescent intensity, which should be proportional to allele frequency. We shall label the intensity measures for allele A and allele B at SNP *j* from the mixture group as *I*_{A M j} and *I*_{B M j}.

$${D}_{{L}_{1}}(j)=|{Y}_{0j}-{\gamma}_{Rj}|-|{Y}_{0j}-{\gamma}_{Mj}|$$

(1)

Their test statistic can then be defined as

$${T}_{{L}_{1}}=\frac{\sqrt{N}({D}_{{L}_{1}}-{\mu}_{0})}{\sqrt{{N}^{-1}{\sum}_{j}{({D}_{{L}_{1}}(j)-{D}_{{L}_{1}})}^{2}}}$$

(2)

where
${D}_{{L}_{1}}\equiv \frac{1}{N}{\sum}_{j}{D}_{{L}_{1}}(j)$ and *μ*_{0} = 0. In this manuscript, we will also discuss

$${D}_{{L}_{2}}(j)={({Y}_{0j}-{\gamma}_{Rj})}^{2}-{({Y}_{0j}-{\gamma}_{Mj})}^{2}$$

(3)

and the corresponding values *D*_{L}_{2} and *T*_{L}_{2}. The “*L*_{1}” and “*L*_{2}” subscripts emphasize the distance measure used for the statistic. We discuss the case when intensities perfectly reflect underlying allele frequencies and define $X(j)\equiv |{Y}_{0j}-{\widehat{p}}_{Rj}|-|{Y}_{0j}-{\widehat{p}}_{Mj}|$.
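As a concrete illustration of equations 1 and 2, the statistic takes only a few lines to compute. This is a minimal numerical sketch, not the authors' code; the function name `t_l1` and the array-based interface are our own, and the inputs are assumed to be numpy arrays of matching length.

```python
import numpy as np

def t_l1(y0, gamma_r, gamma_m, mu0=0.0):
    """Homer-style statistic T_L1 (equations 1 and 2).

    y0      : genotypes of the individual of interest, coded 0 / 0.5 / 1
    gamma_r : reference-group allele frequencies, one entry per SNP
    gamma_m : mixture allele frequencies, one entry per SNP
    """
    # D_L1(j) = |Y_0j - gamma_Rj| - |Y_0j - gamma_Mj|, equation (1)
    d = np.abs(y0 - gamma_r) - np.abs(y0 - gamma_m)
    d_bar = d.mean()
    # plug-in standard deviation from the per-SNP differences
    sd = np.sqrt(np.mean((d - d_bar) ** 2))
    return np.sqrt(d.size) * (d_bar - mu0) / sd  # equation (2)
```

By construction, swapping the reference and mixture frequencies flips the sign of the statistic (with *μ*_{0} = 0), which makes a convenient sanity check.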

In general, we assume that *γ*_{M j} and *γ*_{R j} can be modeled as

$$\begin{array}{l}{\gamma}_{Mj}=\frac{{\sum}_{i:{M}_{i}=1}{Y}_{ij}}{{n}_{M}}+{\beta}_{Mj}+{\epsilon}_{Mj}={\widehat{p}}_{Mj}+{\beta}_{Mj}+{\epsilon}_{Mj}\\ {\gamma}_{Rj}=\frac{{\sum}_{i:{R}_{i}=1}{Y}_{ij}}{{n}_{R}}+{\beta}_{Rj}+{\epsilon}_{Rj}={\widehat{p}}_{Rj}+{\beta}_{Rj}+{\epsilon}_{Rj}\end{array}$$

(4)

where
${\epsilon}_{Mj}\sim N(0,{\sigma}_{Mj}^{2})$,
${\epsilon}_{Rj}\sim N(0,{\sigma}_{Rj}^{2})$, and {*β*_{M j}, *β*_{R j}} are fixed, platform-dependent biases.

The definition of any test requires a statement of the null hypothesis and the list of outcomes that would lead to the rejection of that null. Therefore, the desired null hypothesis is that the individual of interest is not in the mixture, *M*_{0} = 0, and the original test is to reject this null when *T*_{L}_{1}, with *μ*_{0} = 0, is large. The next goal is to calculate the appropriate threshold, type I error rate, and power. Unfortunately, *T*_{L}_{1} is not a pivotal statistic, in that its distribution still depends on the parameter set Θ. Stating *M*_{0} = 0 does not lead to a single distribution of *T*_{L}_{1}. Therefore, for the original paper to discuss the type I error rate and power, three additional assumptions were needed:

- Identical Ethnicities: *e*_{0} = *e*_{M} = *e*_{R}
- Identical Sample Sizes: *n*_{M} = *n*_{R}
- Identical Platforms: *β*_{M j} = *β*_{R j} and ${\sigma}_{Mj}^{2}={\sigma}_{Rj}^{2}$ ∀ *j*

With these added assumptions, *T*_{L}_{1} no longer depends on Θ and they could choose a threshold *t** _{α}* so that the type I error of the test would be

$$P({T}_{{L}_{1}}>{t}_{\alpha}|{M}_{0}=0,{e}_{0}={e}_{M}={e}_{R},{n}_{M}={n}_{R},{\overrightarrow{\beta}}_{M}={\overrightarrow{\beta}}_{R},{\overrightarrow{\sigma}}_{M}^{2}={\overrightarrow{\sigma}}_{R}^{2})=\alpha $$

(5)

The power was then calculated as

$$P({T}_{{L}_{1}}>{t}_{\alpha}|{M}_{0}=1,{e}_{0}={e}_{M}={e}_{R},{n}_{M}={n}_{R},{\overrightarrow{\beta}}_{M}={\overrightarrow{\beta}}_{R},{\overrightarrow{\sigma}}_{M}^{2}={\overrightarrow{\sigma}}_{R}^{2})={\beta}_{POW}$$

(6)

Unfortunately, if one or more of these three assumptions are false, the type I error rate and power can differ from *α* and *β** _{POW}*. Here, we consider a test to be biased for a given Θ if

In this section, we show that unless the ancestries of the reference and mixture groups are extremely well matched, the false positive rate can easily be near 1, as opposed to the predicted *α*. Ignoring the sample size and platform parameters, there are seven possible scenarios (Table 1). The original paper thoroughly discussed scenarios *H*_{0} and *H*_{1} with assumptions II and III holding, but only warned of potential bias in the other cases (i.e *P*(*T*_{L}_{1} *> t** _{α}* |

There are seven scenarios depending on whether the individual of interest is in the mixture (column 2) and depending on whether the ethnicity of the individual of interest (*e*_{0}), the ethnicity of the mixture group (*e*_{M}), and the ethnicity of the reference **...**

Estimating the true type I error rate requires understanding the distribution of *T*_{L}_{1}, or equivalently, *D*_{L}_{1} (*j*), for the three scenarios, *H*_{0}, *H*_{1}, and *H** _{f p}*. As the allele frequencies at many of the SNPs will likely be the same for all populations,

$$\begin{array}{r}{H}_{0}:{D}_{{L}_{1}}(j)\sim N({\mu}_{0}(j),{\sigma}^{2}(j))\\ {H}_{1}:{D}_{{L}_{1}}(j)\sim N({\mu}_{1}(j),{\sigma}^{2}(j))\\ {H}_{fp}:{D}_{{L}_{1}}(j)\sim N({\mu}_{fp}(j),{\sigma}_{fp}^{2}(j))\end{array}$$

(7)

where

$$\begin{array}{l}{\mu}_{0}(j)=0\\ {\mu}_{1}(j)=\frac{1}{n}2{p}_{Mj}{(1-{p}_{Mj})}^{2}\\ {\mu}_{fp}(j)=\left\{\begin{array}{ll}({p}_{Mj}-{p}_{Rj})(1-2{(1-{p}_{Mj})}^{2})&\text{if }{p}_{Rj}<0.5\\ -{p}_{Mj}(2{p}_{Mj}^{2}-6{p}_{Mj}+3)+{p}_{Rj}(1-2{p}_{Mj}^{2})&\text{if }{p}_{Rj}>0.5\end{array}\right.\\ {\sigma}^{2}(j)=\frac{{p}_{Mj}(1-{p}_{Mj})}{n}+{\sigma}_{Mj}^{2}+{\sigma}_{Rj}^{2}\\ {\sigma}_{fp}^{2}(j)=\frac{1}{2n}{p}_{Mj}(1-{p}_{Mj})+\frac{1}{2n}{p}_{Rj}(1-{p}_{Rj})+{\sigma}_{Rj}^{2}+{\sigma}_{Mj}^{2}+{({p}_{Mj}-{p}_{Rj})}^{2}4{p}_{Mj}(2-{p}_{Mj}){(1-{p}_{Mj})}^{2}\end{array}$$

(8)
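The normal approximations in equations 7 and 8 can be evaluated numerically to see how quickly the false positive rate departs from *α*. The sketch below is our own, hedged illustration: it hard-codes the *α* = 0.05 threshold and, for simplicity, treats every SNP as having the same pair (*p*_{M j}, *p*_{R j}).

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mu_fp(p_m, p_r):
    """Per-SNP mean of D_L1(j) under H_fp, equation (8)."""
    if p_r < 0.5:
        return (p_m - p_r) * (1.0 - 2.0 * (1.0 - p_m) ** 2)
    return -p_m * (2.0 * p_m ** 2 - 6.0 * p_m + 3.0) + p_r * (1.0 - 2.0 * p_m ** 2)

def var_fp(p_m, p_r, n, s2_m=0.0, s2_r=0.0):
    """Per-SNP variance of D_L1(j) under H_fp, equation (8)."""
    return (p_m * (1.0 - p_m) / (2.0 * n) + p_r * (1.0 - p_r) / (2.0 * n)
            + s2_r + s2_m
            + (p_m - p_r) ** 2 * 4.0 * p_m * (2.0 - p_m) * (1.0 - p_m) ** 2)

def approx_fpr(p_m, p_r, n, n_snps, s2_m=0.0, s2_r=0.0):
    """Approximate P(T_L1 > t_0.05 | H_fp) when all SNPs share (p_m, p_r)."""
    z_a = 1.6449  # Phi^{-1}(0.95), the alpha = 0.05 threshold
    shift = math.sqrt(n_snps) * mu_fp(p_m, p_r) / math.sqrt(var_fp(p_m, p_r, n, s2_m, s2_r))
    return 1.0 - phi(z_a - shift)
```

With even a modest ancestry difference (*p*_{M j} = 0.4 versus *p*_{R j} = 0.3) and *N* = 5000 SNPs, this approximation puts the false positive rate near 1 rather than 0.05.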

Figure 1 shows the type I error rate as a function of *s* assuming that *p** _{M j}* is uniformly distributed over the interval (0, 0.5),

The false positive rate can easily exceed power when tests are based on *T*_{L}_{1}. Using the normal approximations given in equations 7 and 8, we plot the false positive rate when *H*_{f p} is true, showing its decrease with the proportion, *s*, of SNPs that are ancestry **...**

In this section, we show that unless the numbers of subjects in the mixture and reference groups are equal, either type I or type II error can be higher than expected. Here, we keep assumptions I and III, identical ancestries and platforms, and examine the test’s bias when *n*_{M} ≠ *n*_{R}.

Let *p* < 0.5,
${p}_{1}\sim N(p,{\sigma}_{1}^{2})$,
${p}_{2}\sim N(p,{\sigma}_{2}^{2})$, with *p*_{1} independent of *p*_{2}.

If
${\sigma}_{1}^{2}>{\sigma}_{2}^{2}$, then *E*[|0.5 − *p*_{1}|] > *E*[|0.5 − *p*_{2}|].
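The lemma is easy to check by simulation. The snippet below is an illustrative Monte Carlo check, not part of the paper; the sample size and parameter values are arbitrary.

```python
import random

def mean_abs_dev_half(p, sigma, reps=100000, seed=7):
    """Monte Carlo estimate of E|0.5 - p_i| for p_i ~ N(p, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += abs(0.5 - rng.gauss(p, sigma))
    return total / reps
```

For *p* = 0.3, the larger sampling variance (σ₁ = 0.2 versus σ₂ = 0.05) visibly inflates the expected distance from 0.5, which is exactly the mechanism by which unequal sample sizes bias *T*_{L}_{1}.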

Our goal is to show that if *n*_{M} ≠ *n*_{R}, the test based on *T*_{L}_{1} can be biased. We begin by writing

$$X(j)={X}_{j1}1({Y}_{0j}=1)+{X}_{j0.5}1({Y}_{0j}=0.5)+{X}_{j0}1({Y}_{0j}=0)$$

(9)

where

$$\begin{array}{c}{X}_{j1}=|1-{\widehat{p}}_{Rj}|-|1-{\widehat{p}}_{Mj}|\\ {X}_{j0.5}=|0.5-{\widehat{p}}_{Rj}|-|0.5-{\widehat{p}}_{Mj}|\\ {X}_{j0}=|0-{\widehat{p}}_{Rj}|-|0-{\widehat{p}}_{Mj}|\end{array}$$

(10)

It is immediately clear that when the individual of interest is not in the mixture, *E*[*X*_{j}_{1}] = *E*[* _{R j}*] –

In this section, we show that unless the allele intensities for the mixture and reference group are measured on the same platform, the test may be biased, often leading to an increase in type II error. Normally, we need not concern ourselves with platform bias, as we can ensure that both samples will be measured on the same platform. Unfortunately, this is not the case here. For the mixture, we have a sample of DNA and we can choose our preferred platform. For the reference group, we often do not have access to actual samples of DNA, and instead base our estimates of allele frequencies on previously recorded genotypes for a group of individuals. Comparing *γ** _{M j}* and
${n}_{R}^{-1}{\sum}_{i:{R}_{i}=1}{Y}_{ij}$ is equivalent to comparing allele frequencies measured on two different platforms. Also, even if the reference sample were an actual mixture of DNA, that sample might no longer be available for analysis.

First, if *β*_{M j} ≠ *β*_{R j}, then

$$E[{D}_{{L}_{1}}(j)]\approx ({\beta}_{Mj}-{\beta}_{Rj})(1-2{(1-{p}_{Mj})}^{2})$$

(11)

Here, we presume *E*[*D*_{L}_{1} (*j*)] > 0 implies *E*[*D*_{L}_{1}] > 0, or that
${\sum}_{j=1}^{N}{\beta}_{Mj}\ne {\sum}_{j=1}^{N}{\beta}_{Rj}$. Fortunately, this bias can be easily removed by substituting
${\gamma}_{Mj}^{*}\equiv {\gamma}_{Mj}-{\widehat{\beta}}_{Mj}$ and
${\gamma}_{Rj}^{*}\equiv {\gamma}_{Rj}-{\widehat{\beta}}_{Rj}$ into the equations for *D*_{L}_{1} (*j*), where ${\widehat{\beta}}_{Mj}$ and ${\widehat{\beta}}_{Rj}$ are estimates of the platform-specific biases.

However, this does not eliminate the potential bias caused by platform. If
${\sigma}_{Mj}^{2}\ne {\sigma}_{Rj}^{2}$, then *E*[| *Y*_{0} * _{j}* –

Using the *L*_{2} distance (recall ${D}_{{L}_{2}}(j)={({Y}_{0j}-{\gamma}_{Rj})}^{2}-{({Y}_{0j}-{\gamma}_{Mj})}^{2}$), the mean and variance of *D*_{L}_{2} (*j*) under *H*_{0} can be written as

$$\begin{array}{ll}{\mu}_{0}& =0\\ {\sigma}^{2}(j)& =4{p}_{Mj}^{2}{\sigma}_{Mj}^{2}+4{p}_{Mj}{\sigma}_{Mj}^{2}+4{\sigma}_{Mj}^{4}-8{p}_{Mj}^{2}{\sigma}_{Mj}^{2}+\\ & \frac{1}{n}(2{p}_{Mj}^{2}-4{p}_{Mj}^{3}+2{p}_{Mj}^{4}-4{p}_{Mj}^{2}{\sigma}_{Mj}^{2}+4{p}_{Mj}{\sigma}_{Mj}^{2})+\\ & \frac{1}{{n}^{2}}({p}_{Mj}^{2}-2{p}_{Mj}^{3}+{p}_{Mj}^{4})+\\ & \frac{1}{{n}^{3}}(\frac{{p}_{Mj}}{4}-\frac{7}{4}{p}_{Mj}^{2}+3{p}_{Mj}^{3}-\frac{3}{2}{p}_{Mj}^{4})\end{array}$$

(12)

With the same assumptions, under the alternative hypothesis, *H*_{1}, the mean of *D*_{L}_{2} (*j*) is

$${\mu}_{1}(j)=\frac{1}{n}{p}_{Mj}(1-{p}_{Mj})$$

(13)

and the variance can be approximated by σ^{2} (*j*). In contrast to *D*_{L}_{1}, we have estimated the parameters for the normal approximation of *D*_{L}_{2} without assuming large *n*. Equivalently, the approximation of
$\sqrt{N}{D}_{{L}_{2}}\sim N(0,{N}^{-1}{\sum}_{j}{\sigma}^{2}(j))$ requires only ${D}_{{L}_{2}}({j}_{1})\perp {D}_{{L}_{2}}({j}_{2})$ ∀ *j*_{1} ≠ *j*_{2} and *N* being large. In addition to the ease of *L*_{2}, we found the statistic *T*_{L}_{2} to outperform *T*_{L}_{1}. This can be verified by simulation (data not shown) or by comparing the two normal approximations for *T*_{L}_{1} and *T*_{L}_{2}. In the online supplementary material, we compare the power of the statistics *T*_{L}_{1} and *T*_{L}_{2} for specific values of *n*, *N*,
$\overrightarrow{p}$, and σ^{2}, assuming independent SNPs and *p** _{M j}* ~ uniform[0,0.5]. The plots suggest a mild improvement using

Our ideal goal would be to develop a pivotal statistic that, when *M*_{0} = 0, is a) independent of the ethnicity of the individual of interest; b) independent of the number of individuals in the mixture and the ethnicity of those individuals; and c) independent of the platform chosen to analyze the mixture. Although we could find no such statistic, we can take advantage of being able to easily identify an individual’s ethnicity by genotyping a small set of SNPs (< 0.01*N*). The previous requirement of identifying the ethnic composition of the mixture is a far more difficult task. Therefore, we introduce a new statistic that, given *e*_{0}, will have an N(0, 1) distribution when *M*_{0} = 0, regardless of the remaining parameters. The key difference in deriving this statistic is that the individuals in the reference group will be selected to have the same ancestry as the individual of interest, as opposed to the same ancestral composition as the mixture. Note that the small set of SNPs used to identify the individuals’ ancestries can be removed from the later analyses without greatly diminishing power. A suggested list of SNPs will be made available by the authors in the near future. Moreover, we found that the suggested statistic leads to a test with increased power. The details of the statistic follow.

To describe the new statistic, we change the notation slightly as we no longer have *Y** _{i j}* values for subjects in the mixture. Therefore, in addition to having

Create *n** _{R}* + 1 reference samples,
${\overrightarrow{\gamma}}_{R0}$,
${\overrightarrow{\gamma}}_{R1},\dots ,{\overrightarrow{\gamma}}_{R{n}_{R}}$, where
${\overrightarrow{\gamma}}_{Rk}=\{{\gamma}_{Rk1},\dots ,{\gamma}_{RkN}\}$,

$${\gamma}_{Rkj}=\frac{{\sum}_{i=0,i\ne k}^{{n}_{R}}{Y}_{ij}}{{n}_{R}}$$

(14)

Here, ${\sigma}_{R}^{2}=0$ in model (4). We have immediately removed a source of variation.
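Equation 14 simply averages every genotype except subject *k*'s. A vectorized sketch (our own notation, assuming a (*n*_{R} + 1) × *N* numpy array of genotypes with the individual of interest in row 0):

```python
import numpy as np

def loo_references(y_ref):
    """Leave-one-out reference frequencies gamma_Rk, equation (14).

    y_ref : (n_R + 1, N) array of genotypes coded 0 / 0.5 / 1;
            row 0 is the individual of interest.
    Row k of the result averages every row except row k, divided by n_R.
    """
    n_r = y_ref.shape[0] - 1
    total = y_ref.sum(axis=0)            # column sums over all n_R + 1 subjects
    return (total[None, :] - y_ref) / n_r
```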

Measure the distance between each individual and the appropriate reference group, (*Y** _{i j}* –

$${D}_{i{L}_{2}}^{*}(j)={({Y}_{ij}-{\gamma}_{Rij})}^{2}-{({Y}_{ij}-{\gamma}_{Mj})}^{2}$$

(15)

Because the ${\sigma}_{R}^{2}$ term is absent from ${D}_{i{L}_{2}}^{*}(j)$, $\mathit{\text{var}}({D}_{iL{}_{2}}^{*}(j))\le \mathit{\text{var}}({D}_{{L}_{2}}(j))$. In fact,

$$\mathit{\text{var}}({D}_{i{L}_{2}}^{*}(j))=\mathit{\text{var}}({D}_{{L}_{2}}(j))-2{\sigma}_{Rj}^{2}[{p}_{j}(1-{p}_{j})+\frac{{p}_{j}(1-{p}_{j})}{n}+{\sigma}_{Rj}^{2}]$$

(16)

All variance and covariance calculations will assume *n* ≡ *n*_{R} = *n*_{M}.

We then average those differences over all reference subjects,

$${\overline{D}}_{{L}_{2}}^{*}(j)=\frac{{\sum}_{i=1}^{{n}_{R}}{D}_{i{L}_{2}}^{*}(j)}{{n}_{R}}$$

(17)

to obtain an expected difference between the distance to the reference sample and the distance to the mixture, under the null hypothesis. The covariance between two terms in the sum is (see appendix 5.4)

$$\begin{array}{l}\mathit{\text{cov}}({D}_{1{L}_{2}}^{*}(j),{D}_{2{L}_{2}}^{*}(j))=\\ 2{\sigma}_{Mj}^{4}+\frac{(2{p}_{j}-2{p}_{j}^{2}){\sigma}_{Mj}^{2}}{n}+\frac{\frac{1}{4}{p}_{j}-\frac{3}{4}{p}_{j}^{2}+2{p}_{j}^{3}-\frac{1}{2}{p}_{j}^{4}}{{n}^{3}}+\frac{\frac{-1}{8}{p}_{j}+\frac{11}{8}{p}_{j}^{2}-\frac{5}{2}{p}_{j}^{3}+\frac{11}{4}{p}_{j}^{4}}{{n}^{4}}\end{array}$$

(18)

Each term on the right side of equation 18 must be positive, so the covariance must be positive (i.e. $\mathit{\text{cov}}({D}_{1{L}_{2}}^{*}(j),{D}_{2{L}_{2}}^{*}(j))>0$).

Compare ${D}_{0{L}_{2}}^{*}(j)$ to this averaged value

$${D}_{{L}_{2}}^{*}(j)={D}_{0{L}_{2}}^{*}(j)-{\overline{D}}_{{L}_{2}}^{*}(j)$$

(19)

By noting the exchangeability between any two terms ${D}_{{i}_{1}{L}_{2}}^{*}(j)$ and ${D}_{{i}_{2}{L}_{2}}^{*}(j)$, we can calculate the variance of ${D}_{{L}_{2}}^{*}(j)$

$$\mathit{\text{var}}({D}_{{L}_{2}}^{*}(j))=[1+\frac{1}{n}][\mathit{\text{var}}({D}_{0{L}_{2}}^{*}(j))-\mathit{\text{cov}}({D}_{1{L}_{2}}^{*}(j),{D}_{2{L}_{2}}^{*}(j))]$$

(20)

The key result is that, except for those values of
${\sigma}_{Mj}^{2}$ where the power is essentially 1 regardless of the chosen statistic,
$\mathit{\text{var}}({D}_{{L}_{2}}^{*}(j))<\mathit{\text{var}}({D}_{{L}_{2}}(j))$. As the means for the *L*_{2} statistics are the same, the smaller variance suggests that our proposed statistic
${T}_{{L}_{2}}^{*}$, based on
${D}_{{L}_{2}}^{*}$, will usually have higher power than the original *T*_{L}_{2}. This improvement is demonstrated in our simulations. Moreover, the statistic was designed so
$E[{D}_{{L}_{2}}^{*}(j)|{H}_{0}]=0$ if the ancestries of the reference group and individual of interest are matched correctly, regardless of platform and sample size.

Average over all SNPs to get

$${D}_{{L}_{2}}^{*}=\frac{{\sum}_{j}{D}_{{L}_{2}}^{*}(j)}{N}$$

(21)

Because of the large number of SNPs, the CLT suggests that $\sqrt{N}{D}_{{L}_{2}}^{*}\sim N(0,{\sigma}_{D}^{2})$ under the null hypothesis. When allowing for dependency,

$$\mathit{\text{var}}(\frac{{\sum}_{j}{D}_{{L}_{2}}^{*}(j)}{\sqrt{N}})={N}^{-1}\sum _{j}\mathit{\text{var}}({D}_{{L}_{2}}^{*}(j))+{N}^{-1}\sum \sum _{{j}_{1}\ne {j}_{2}}\mathit{\text{cov}}({D}_{{L}_{2}}^{*}({j}_{1}),{D}_{{L}_{2}}^{*}({j}_{2}))$$

(22)

Therefore, ${\sigma}_{D}^{2}$ can be estimated by

$${\widehat{\sigma}}_{D}^{2}=\frac{{\sum}_{j}{({D}_{{L}_{2}}^{*}(j)-{D}_{{L}_{2}}^{*})}^{2}}{N}+\frac{\sum {\sum}_{{j}_{1}\ne {j}_{2}}({D}_{{L}_{2}}^{*}({j}_{1})-{D}_{{L}_{2}}^{*})({D}_{{L}_{2}}^{*}({j}_{2})-{D}_{{L}_{2}}^{*})}{N}$$

(23)

In practice, we restrict the double sum to *j*_{1} ≠ *j*_{2} and | *j*_{1} – *j*_{2} | ≤ 500. In the on-line supplementary material, we use the HapMap samples to show
${N}^{-1}\sum {\sum}_{|{j}_{1}-{j}_{2}|>500}({D}_{{L}_{2}}^{*}({j}_{1})-{D}_{{L}_{2}}^{*})({D}_{{L}_{2}}^{*}({j}_{2})-{D}_{{L}_{2}}^{*})\approx 0$. We now define our new statistic,

$${T}_{{L}_{2}}^{*}=\frac{\sqrt{N}{D}_{{L}_{2}}^{*}}{\sqrt{{\widehat{\sigma}}_{D}^{2}}}$$

(24)

and note that
${T}_{{L}_{2}}^{*}\sim N(0,1)$. For purposes of future comparisons, we also define
${D}_{{L}_{1}}^{*}(j)$,
${D}_{{L}_{1}}^{*}$, and
${T}_{{L}_{1}}^{*}$ by replacing the *L*_{2} distance with the *L*_{1} distance in
${D}_{{L}_{2}}^{*}(j)$,
${D}_{{L}_{2}}^{*}$, and
${T}_{{L}_{2}}^{*}$.
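Putting steps (14) through (24) together, the whole construction of ${T}_{{L}_{2}}^{*}$ fits in one short function. This is a hedged sketch rather than the authors' implementation: the names are ours, genotypes are assumed to arrive as a (*n*_{R} + 1) × *N* numpy array with the individual of interest in row 0, and the covariance window is exposed as a parameter (the paper uses |*j*_{1} − *j*_{2}| ≤ 500).

```python
import numpy as np

def t_l2_star(y_ref, gamma_m, max_lag=500):
    """Proposed statistic T*_L2, equations (14)-(24).

    y_ref   : (n_R + 1, N) genotypes coded 0 / 0.5 / 1; row 0 is the
              individual of interest, rows 1..n_R the reference subjects
    gamma_m : length-N mixture allele frequencies
    """
    n_r, n_snps = y_ref.shape[0] - 1, y_ref.shape[1]
    # Leave-one-out reference frequencies, equation (14)
    gamma_r = (y_ref.sum(axis=0)[None, :] - y_ref) / n_r
    # Per-subject distance differences D*_{i,L2}(j), equation (15)
    d_star = (y_ref - gamma_r) ** 2 - (y_ref - gamma_m[None, :]) ** 2
    # Contrast subject 0 against the reference average, equations (17) and (19)
    d_j = d_star[0] - d_star[1:].mean(axis=0)
    d_bar = d_j.mean()
    # Variance estimate including nearby-SNP covariances, equation (23)
    c = d_j - d_bar
    sigma2 = np.mean(c ** 2)
    for lag in range(1, min(max_lag, n_snps - 1) + 1):
        sigma2 += 2.0 * np.sum(c[:-lag] * c[lag:]) / n_snps
    return np.sqrt(n_snps) * d_bar / np.sqrt(sigma2)  # equation (24)
```

On simulated data with independent SNPs, the statistic sits near N(0, 1) when the individual is absent from the mixture and moves far to the right when the individual is one of the contributors.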

Although
${T}_{{L}_{2}}^{*}$ appears to be one of the better performing statistics, it is not necessarily the most intuitive. We first tried a *s*impler alternative,
${D}_{{L}_{2}}^{s}$, but as we discuss here, this statistic proved to have extremely low power. Let

$${D}_{{L}_{2}}^{s}=\frac{{\sum}_{j=1}^{N}{D}_{{L}_{2}}^{s}(j)}{N}$$

(25)

where

$${D}_{{L}_{2}}^{s}(j)={({Y}_{0j}-{\gamma}_{Mj})}^{2}-\frac{{\sum}_{i=1}^{{n}_{R}}{({Y}_{ij}-{\gamma}_{Mj})}^{2}}{{n}_{R}}$$

(26)
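For comparison, the simpler statistic of equations 25 and 26 is even shorter to compute. The sketch below uses our own notation, again assuming genotypes in a (*n*_{R} + 1) × *N* numpy array with the individual of interest in row 0.

```python
import numpy as np

def d_l2_simple(y_ref, gamma_m):
    """The simpler alternative D^s_L2, equations (25) and (26).

    y_ref   : (n_R + 1, N) genotypes coded 0 / 0.5 / 1; row 0 is the
              individual of interest, rows 1..n_R the reference subjects
    gamma_m : length-N mixture allele frequencies
    """
    dist = (y_ref - gamma_m[None, :]) ** 2      # squared distance to the mixture
    d_j = dist[0] - dist[1:].mean(axis=0)       # D^s_L2(j), equation (26)
    return d_j.mean()                           # equation (25)
```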

This alternative also has the desirable characteristics that neither platform nor sample size can invalidate the equalities *E*[(*Y*_{0} * _{j}* –

This same simple example, where *Y*_{i j} ∈ {0, 1} and

The power and type I error for the test based on *T*_{L}_{1} have already been thoroughly described when *e*_{0} = *e*_{R} = *e*_{M}.

The simulations generated distributions of *T*_{L}_{1}, *T*_{L}_{2},
${T}_{{L}_{1}}^{*}$, and
${T}_{{L}_{2}}^{*}$ for each of the three scenarios, *H*_{0}, *H*_{1}, and *H** _{f p}*. Recall that the denominator for

First, we simulated 10,000 datasets (and 10,000 values of each test statistic) under *H*_{0}, with *e*_{0} = *e*_{M} = *e*_{R}.

Next, we simulated 1000 datasets under *H*_{1}. To create the datasets, we repeated the above steps, but let
${\gamma}_{M}(j)={n}_{M}^{-1}({Y}_{0j}+{\sum}_{i=2}^{{n}_{M}}{Y}_{ij})+{\epsilon}_{Mj}$. Then the power for the original test was the proportion of these 1000 *T*_{L}_{1} values exceeding *t*_{α}. A similar estimation of power was performed for *T*_{L}_{2},
${T}_{{L}_{1}}^{*}$, and
${T}_{{L}_{2}}^{*}$. Finally, we simulated datasets under *H** _{f p}*. The type I error rate will depend on how the allele frequencies differ between the two ancestries. In one case, we assumed that a proportion,

To better quantify the type I error rate possible in real experiments, we used the 90 CEU and 45 Japanese individuals from the HapMap samples. For each CEU individual *i*, we selected 9 unrelated CEU individuals to create a positive mixture,
${\gamma}_{M+}=0.1({Y}_{i}+{\sum}_{k=1}^{9}{\overrightarrow{Y}}_{k})+{\overrightarrow{\epsilon}}_{i}$, or 10 unrelated individuals to create a negative mixture,
${\gamma}_{M-}=0.1({\sum}_{k=1}^{10}{\overrightarrow{Y}}_{k})+{\overrightarrow{\epsilon}}_{i}$, where
${\overrightarrow{\epsilon}}_{i}\equiv \{{\epsilon}_{i1},\dots ,{\epsilon}_{iN}\}$, *ε*_{i j} ~ *N*(0, 0.01²) and ${\epsilon}_{i{j}_{1}}\perp {\epsilon}_{i{j}_{2}}$ if *j*_{1} ≠ *j*_{2}. To achieve meaningful levels of power, we chose *N* = 1000. For each CEU individual, 11 reference groups were similarly created where *γ*_{R t} included

Our first goal is to show that tests based on
${T}_{{L}_{2}}^{*}$ are more powerful than tests based on *T*_{L}_{1}. As the results for the two sets of simulations, *n** _{M}* =

For each value of σ* _{M}*, we then calculated the power for the four tests, based on

Tests based on
${T}_{{L}_{2}}^{*}$ have the highest power. Power for tests based on the four statistics, *T*_{L}_{1}, *T*_{L}_{2},
${T}_{{L}_{1}}^{*}$, and
${T}_{{L}_{2}}^{*}$, are plotted for multiple values of
${\sigma}_{M}^{2}$ or noise. Other parameters in the simulation were *n*_{M} = *n*_{R} = 100, *N* = 5000, *s* = 1, and **...**

Our second goal is to show that when assumption I is violated, the type I error rate can be much larger than *α*. Again, we presume *t** _{α}* was selected so equation 5 holds. Simulations showed the quick increase in the false positive rate,

The type I error rate for tests based on *T*_{L}_{1} can exceed power. Type I error rate increases as the proportion, *s*, of ancestry independent SNPs decreases. Other parameters in the simulation were *N* = 5000 (50,000),
${\sigma}_{M}^{2}={0.035}^{2}({0.01}^{2})$, and *n*_{M} = *n*_{R} **...**

The false positive rate for the HapMap samples was calculated when the mixture and reference groups each had 10 subjects. Both the individual of interest and the subjects in the mixture were from the CEU population. As the ratio of Japanese:CEU individuals increased in the reference group, the false positive rate increased from the *α*-level, 0.05, to near 1. When the ratio exceeded 8:2, the false positive rate exceeded the estimated power.

As the overall popularity of SNP microarray technology increases and the cost of the technology decreases, there will likely be a shift from STRs to genome-wide sets of SNPs as the preferred method for DNA identification. Therefore, databases designed to store genetic identifiers for individuals will likely store genotypes for sets of SNPs in the future. Coupled with our earlier discussion, there will be three main advantages of using high-density SNPs to determine whether an individual contributes DNA to a mixture: accessibility, higher resolving power, and the ability to work with low-quality, degraded DNA.

Here, we have demonstrated that tests based on *T*_{L}_{1} can suffer from inflated type I and type II errors if the mixture contains individuals with unknown ancestries. Therefore, we introduced a new test statistic,
${T}_{{L}_{2}}^{*}$ and an accompanying test that only require matching the ancestry of the reference group to that of the individual of interest. Even if the individual of interest has a mixture of ancestries, there should still be some subjects, with a similar mixture of ancestries, that can be used for comparison. This test is also robust to platform and sample size. We showed that both switching from the *L*_{1} to the *L*_{2} measure and switching to the new type of statistic increased the power to detect the individual of interest. Therefore,
${T}_{{L}_{2}}^{*}$ is not only more robust than *T*_{L}_{1}, it tends to have increased power.

We would like to thank Dr. David Craig for providing us with data from his original article and the anonymous reviewers for their helpful comments. This work was supported by NIH GM59507.

To estimate the distribution of *D*_{L}_{1} (*j*), we make the following assumptions, where *g* ∈ {*R*, *M*}:

1. Model (4) is true.
2. *β*_{R j} = *β*_{M j} = 0
3. *n* ≡ *n*_{M} = *n*_{R}

and restrict the SNPs examined to those satisfying

4. |*p*_{g j} − 0.5| > δ for some small δ such that
5. *P*(|${\widehat{p}}_{gj}-{p}_{gj}$| > δ/2) ≈ 0
6. *P*(|*ε*_{g j}| > δ/2) ≈ 0

Assumptions 5 and 6 promise that *n* is large enough and the magnitude of *ε** _{g j}* is small enough so that the signs inside the absolute values are determined by

$${D}_{{L}_{1}}(j)\approx {D}_{j1}1({Y}_{0j}=1)+{D}_{j0.5}1({Y}_{0j}=0.5)+{D}_{j0}1({Y}_{0j}=0)$$

(27)

where

$$\begin{array}{l}{D}_{j1}\equiv {\sum}_{i=(n+1)}^{2n}\frac{1-{Y}_{ij}}{n}-\frac{1-{Y}_{1j}}{n}-\sum _{i=2}^{n}\frac{1-{Y}_{ij}}{n}-{\beta}_{Rj}-{\epsilon}_{Rj}+{\beta}_{Mj}+{\epsilon}_{Mj}\\ {D}_{j0.5}\equiv {\sum}_{i=(n+1)}^{2n}\frac{0.5-{Y}_{ij}}{n}-\frac{0.5-{Y}_{1j}}{n}-\sum _{i=2}^{n}\frac{0.5-{Y}_{ij}}{n}-{\beta}_{Rj}-{\epsilon}_{Rj}+{\beta}_{Mj}+{\epsilon}_{Mj}\quad \text{if }{p}_{Rj}<0.5\\ {D}_{j0.5}\equiv {\sum}_{i=(n+1)}^{2n}\frac{{Y}_{ij}-0.5}{n}-\frac{0.5-{Y}_{1j}}{n}-\sum _{i=2}^{n}\frac{0.5-{Y}_{ij}}{n}+{\beta}_{Rj}+{\epsilon}_{Rj}+{\beta}_{Mj}+{\epsilon}_{Mj}\quad \text{if }{p}_{Rj}>0.5\\ {D}_{j0}\equiv {\sum}_{i=(n+1)}^{2n}\frac{{Y}_{ij}}{n}-\frac{{Y}_{1j}}{n}-\sum _{i=2}^{n}\frac{{Y}_{ij}}{n}+{\beta}_{Rj}+{\epsilon}_{Rj}-{\beta}_{Mj}-{\epsilon}_{Mj}\end{array}$$

(28)

By definition *p** _{M j}* < 0.5. We next calculate the expected values of each component under the two assumptions,

$$E[{D}_{j1}|{H}_{1}]=\frac{1}{n}-\frac{{p}_{Mj}}{n}-{\beta}_{Rj}+{\beta}_{Mj}$$

(29)

$$E[{D}_{j0.5}|{H}_{1}]=\frac{1}{2n}-\frac{{p}_{Mj}}{n}-{\beta}_{Rj}+{\beta}_{Mj}$$

(30)

$$E[{D}_{j0}|{H}_{1}]=\frac{{p}_{Mj}}{n}+{\beta}_{Rj}-{\beta}_{Mj}$$

(31)

We immediately see that

$$nE[{D}_{{L}_{1}}(j)|{H}_{1}]=2{p}_{Mj}{(1-{p}_{Mj})}^{2}+n({\beta}_{Mj}-{\beta}_{Rj})(1-2{(1-{p}_{Mj})}^{2})$$

(32)

Let us assume that *β*_{M j} = *β*_{R j}. Then

$$\sum _{j}E[{D}_{{L}_{1}}(j)|{H}_{1}]=\frac{{\sum}_{j}2{p}_{Mj}{(1-{p}_{Mj})}^{2}}{n}$$

(33)

Next we calculate the expected value under *H* * _{f p}*,

$$\begin{array}{ccc}E[{D}_{j1}|{H}_{fp}]& =& {p}_{Mj}-{p}_{Rj}-{\beta}_{Rj}+{\beta}_{Mj}\hfill \\ E[{D}_{j0.5}|{H}_{fp}]& =& {p}_{Mj}-{p}_{Rj}-{\beta}_{Rj}+{\beta}_{Mj}\quad \text{if }{p}_{Rj}<0.5\hfill \\ E[{D}_{j0.5}|{H}_{fp}]& =& {p}_{Mj}+{p}_{Rj}-1+{\beta}_{Rj}+{\beta}_{Mj}\quad \text{if }{p}_{Rj}>0.5\hfill \\ E[{D}_{j0}|{H}_{fp}]& =& {p}_{Rj}-{p}_{Mj}+{\beta}_{Rj}-{\beta}_{Mj}\hfill \end{array}$$

(34)

We immediately see that

$$\begin{array}{l}E[{D}_{{L}_{1}}(j)|{H}_{fp},{p}_{Rj}<0.5]=({p}_{Mj}-{p}_{Rj}+{\beta}_{Mj}-{\beta}_{Rj})(1-2{(1-{p}_{Mj})}^{2})\\ E[{D}_{{L}_{1}}(j)|{H}_{fp},{p}_{Rj}>0.5]=-{p}_{Mj}(2{p}_{Mj}^{2}-6{p}_{Mj}+3)+{p}_{Rj}(1-2{p}_{Mj}^{2})+{\beta}_{Mj}(1-2{(1-{p}_{Mj})}^{2})+{\beta}_{Rj}(1-2{p}_{Mj}^{2})\end{array}$$

(35)

For simplification, let us assume that *β*_{M j} = *β*_{R j} = 0. Then

$$\sum _{j}E[{D}_{{L}_{1}}(j)|{H}_{fp}]=\sum _{j:{p}_{Rj}<0.5}({p}_{Mj}-{p}_{Rj})(1-2{(1-{p}_{Mj})}^{2})+\sum _{j:{p}_{Rj}>0.5}\left[-{p}_{Mj}(2{p}_{Mj}^{2}-6{p}_{Mj}+3)+{p}_{Rj}(1-2{p}_{Mj}^{2})\right]$$

(36)

Now, we turn our attention to the variance of *D*_{L}_{1} (*j*) and recall that

$$\mathit{\text{var}}({D}_{{L}_{1}}(j))=E[\mathit{\text{var}}({D}_{{L}_{1}}(j)|{Y}_{0j})]+\mathit{\text{var}}(E[{D}_{{L}_{1}}(j)|{Y}_{0j}])$$

(37)

We calculate the *var*(*D*_{L}_{1} (*j*) | *Y*_{0} * _{j}*), assuming all individuals within a group are unrelated,

$$\begin{array}{l}E[\mathit{\text{var}}({D}_{{L}_{1}}(j)|{Y}_{0j})]=\frac{1}{2n}{p}_{Rj}(1-{p}_{Rj})+\frac{\mathit{\text{var}}({Y}_{1j})}{{n}^{2}}+\\ \frac{n-1}{2{n}^{2}}{p}_{Mj}(1-{p}_{Mj})+{\sigma}_{Rj}^{2}+{\sigma}_{Mj}^{2}\end{array}$$

(38)

Under *H*_{0}, where *E*[*D*_{L}_{1} (*j*) | *Y*_{0} * _{j}*] = 0 and

$$\mathit{\text{var}}(E[{D}_{{L}_{1}}(j)|{Y}_{0j},{H}_{0}])=[{({\beta}_{Mj}-{\beta}_{Rj})}^{2}4{p}_{Mj}(2-{p}_{Mj}){(1-{p}_{Mj})}^{2}]$$

(39)

For simplification, we again assume that *β*_{M j} = *β*_{R j}, leaving

$$\mathit{\text{var}}({D}_{{L}_{1}}(j)|{H}_{0})=\frac{1}{n}{p}_{Mj}(1-{p}_{Mj})+{\sigma}_{Rj}^{2}+{\sigma}_{Mj}^{2}$$

(40)
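To see how equation (38) reduces to equation (40) under *H*_{0}, note that with *p*_{R j} = *p*_{M j}, *β*_{M j} = *β*_{R j}, and *var*(*Y*_{1 j}) = *p*_{M j}(1 – *p*_{M j})/2 (the binomial variance for a single unrelated diploid individual; an assumption we make explicit here), the coefficients of *p*_{M j}(1 – *p*_{M j}) combine as:

```latex
\left[\frac{1}{2n}+\frac{1}{2{n}^{2}}+\frac{n-1}{2{n}^{2}}\right]{p}_{Mj}(1-{p}_{Mj})
=\frac{n+1+(n-1)}{2{n}^{2}}\,{p}_{Mj}(1-{p}_{Mj})
=\frac{1}{n}\,{p}_{Mj}(1-{p}_{Mj})
```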

Under *H*_{1}, we find that

$$E[\mathit{\text{var}}({D}_{{L}_{1}}(j)|{Y}_{j0},{H}_{1})]=\frac{2n-1}{2{n}^{2}}{p}_{Mj}(1-{p}_{Mj})+{\sigma}_{Rj}^{2}+{\sigma}_{Mj}^{2}$$

(41)

and for large *n*, we see that *E*[*var*(*D*_{L}_{1} (*j*) | *Y*_{j}_{0}, *H*_{1})] ≈ *E*[*var*(*D*_{L}_{1} (*j*) | *Y*_{j}_{0}, *H*_{0})]. Similarly, because *E*[*D*_{L}_{1} (*j*) | *Y*_{0} * _{j}*, *H*_{1}] behaves like its *H*_{0} counterpart for large *n*, *var*(*D*_{L}_{1} (*j*) | *H*_{1}) ≈ *var*(*D*_{L}_{1} (*j*) | *H*_{0}).

Next, we turn to scenario *H** _{f p}*. Clearly,

$$E[\mathit{\text{var}}({D}_{{L}_{1}}(j)|{Y}_{0j},{H}_{fp})]=\frac{1}{2n}{p}_{Mj}(1-{p}_{Mj})+\frac{1}{2n}{p}_{Rj}(1-{p}_{Rj})+{\sigma}_{Rj}^{2}+{\sigma}_{Mj}^{2}$$

(42)

To calculate *var*(*E*[*D*_{L}_{1} (*j*) | *Y*_{0} * _{j}*, *H** _{f p}*]), we find

$$\begin{array}{l}\mathit{\text{var}}(E[{D}_{{L}_{1}}(j)|{Y}_{0j},{H}_{fp}])=\\ [{({p}_{\Delta j}-{\beta}_{\Delta j})}^{2}4{p}_{Mj}(2-{p}_{Mj}){(1-{p}_{Mj})}^{2}]\end{array}$$

(43)

Now, if we assume that the minor allele in the mixture is the same as the minor allele in the reference group (equivalently, that we have identified a reference population of moderately similar composition), then equation (43) is a satisfactory approximation of *var*(*E*[*D*_{L}_{1} (*j*) | *Y*_{0} * _{j}*, *H** _{f p}*]). Combining equations (42) and (43) then gives

$$\begin{array}{l}\mathit{\text{var}}({D}_{{L}_{1}}(j)|{H}_{fp})=\frac{1}{2n}{p}_{Mj}(1-{p}_{Mj})+\frac{1}{2n}{p}_{Rj}(1-{p}_{Rj})+{\sigma}_{Rj}^{2}+{\sigma}_{Mj}^{2}+\\ [{({p}_{Mj}-{p}_{Rj})}^{2}4{p}_{Mj}(2-{p}_{Mj}){(1-{p}_{Mj})}^{2}]\end{array}$$

(44)

Here, we provide a rough approximation of the percentage of SNPs that would need to be ancestry dependent for *E*[*μ*_{1} (*j*)] = (1 – *s*) *E*[*μ** _{f p}* (*j*)]. Assuming *p*_{M j} and *p*_{R j} are independently uniform on (0, 0.5), we have

$$\begin{array}{l}E[{D}_{{L}_{1}}(j)|{p}_{R}<0.5,{H}_{fp}]\\ =4{\int}_{0}^{0.5}{\int}_{0}^{0.5}({p}_{Mj}-{p}_{Rj})(4{p}_{Mj}-1-2{p}_{Mj}^{2})d{p}_{Rj}d{p}_{Mj}\\ =4{\int}_{0}^{0.5}{\int}_{0}^{0.5}(-4{p}_{Rj}{p}_{Mj}+4{p}_{Mj}^{2}+{p}_{Rj}-{p}_{Mj}+2{p}_{Rj}{p}_{Mj}^{2}-2{p}_{Mj}^{3})d{p}_{Rj}d{p}_{Mj}\\ =0.062\end{array}$$

(45)
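The double integral in equation (45) can be checked numerically; the following sketch (plain Python, midpoint rule; the grid size is an arbitrary choice) reproduces the reported value:

```python
# Numerical check of the double integral in equation (45), using a simple
# midpoint rule on a 400 x 400 grid (grid size is an arbitrary choice).
N = 400
h = 0.5 / N
total = 0.0
for i in range(N):
    pM = (i + 0.5) * h
    f = 4 * pM - 1 - 2 * pM ** 2  # second factor of the integrand
    for k in range(N):
        pR = (k + 0.5) * h
        total += (pM - pR) * f * h * h
total *= 4  # factor 4 from the two uniform(0, 0.5) densities
# total is approximately 0.0625, matching the 0.062 reported in (45)
```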

Had we assumed *p** _{R j}* > 0.5, the analogous calculation, using the second case of equation (35), would apply.

Next, from equation 33, we know *nE* [*D*_{L}_{1} (*j*)] = 2*p** _{M j}* (1 – *p** _{M j}*).

Therefore, in order for *μ*_{1} = *μ** _{f p}*, we would need roughly 22% of the SNPs to be ancestry dependent.

Theorem: Let
${X}_{1}\sim N(p,{\sigma}_{1}^{2})$ and
${X}_{2}\sim N(p,{\sigma}_{2}^{2})$ be independent, let *g*(*X*) = | 0.5–*X* |, and suppose
${\sigma}_{1}^{2}>{\sigma}_{2}^{2}$ and *p* < 0.5. Then

$$E[g({X}_{1})]>E[g({X}_{2})]$$

(46)

Proof: Let *Z*_{1} = *F*_{1} (*X*_{1}) and *Z*_{2} = *F*_{2} (*X*_{2}), where *F*_{1} and *F*_{2} are the respective cumulative distribution functions. Then

$$\begin{array}{ccc}E[g({X}_{2})]-E[g({X}_{1})]& =& {\int}_{-\infty}^{\infty}g({X}_{2}){f}_{2}({X}_{2})d{X}_{2}-{\int}_{-\infty}^{\infty}g({X}_{1}){f}_{1}({X}_{1})d{X}_{1}\\ & =& {\int}_{0}^{1}g({F}_{2}^{-1}({Z}_{2}))d{Z}_{2}-{\int}_{0}^{1}g({F}_{1}^{-1}({Z}_{1}))d{Z}_{1}\\ & =& {\int}_{0}^{0.5}g({F}_{2}^{-1}({Z}_{2}))+g({F}_{2}^{-1}(1-{Z}_{2}))d{Z}_{2}-\\ & & {\int}_{0}^{0.5}g({F}_{1}^{-1}({Z}_{1}))+g({F}_{1}^{-1}(1-{Z}_{1}))d{Z}_{1}\end{array}$$

(47)

There are two possibilities. Case A) Assume ${F}_{2}^{-1}(1-{Z}_{2})<0.5$. Then, since *g* is linear below 0.5,

$$\frac{g({F}_{2}^{-1}({Z}_{2}))+g({F}_{2}^{-1}(1-{Z}_{2}))}{2}=g\left(\frac{{F}_{2}^{-1}({Z}_{2})+{F}_{2}^{-1}(1-{Z}_{2})}{2}\right)=g(p)=0.5-p$$

(48)

Since *g* is convex, Jensen's inequality similarly gives

$$\frac{g({F}_{1}^{-1}({Z}_{1}))+g({F}_{1}^{-1}(1-{Z}_{1}))}{2}\ge 0.5-p=\frac{g({F}_{2}^{-1}({Z}_{2}))+g({F}_{2}^{-1}(1-{Z}_{2}))}{2}$$

(49)

Case B: ${F}_{2}^{-1}(1-{Z}_{2})\ge 0.5$. Then, clearly

$$\begin{array}{c}g({F}_{1}^{-1}(1-{Z}_{1}))>g({F}_{2}^{-1}(1-{Z}_{2}))\\ g({F}_{1}^{-1}({Z}_{1}))>g({F}_{2}^{-1}({Z}_{2}))\end{array}$$

(50)

Therefore, for case B we also see

$$\frac{g({F}_{1}^{-1}({Z}_{1}))+g({F}_{1}^{-1}(1-{Z}_{1}))}{2}\ge \frac{g({F}_{2}^{-1}({Z}_{2}))+g({F}_{2}^{-1}(1-{Z}_{2}))}{2}$$

(51)

Taking the expectation over both scenarios gives the desired result, *E*[*g* (*X*_{2})] – *E*[*g* (*X*_{1})] < 0.
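A quick Monte Carlo experiment illustrates the theorem; the values of *p*, *σ*_{1}, *σ*_{2}, and the sample size below are illustrative choices, not values from the paper:

```python
import random

# Monte Carlo illustration of the theorem: with X1 ~ N(p, s1^2) and
# X2 ~ N(p, s2^2), s1 > s2 and p < 0.5, the larger-variance variable
# has the larger expected distance from 0.5.
random.seed(0)
p, s1, s2 = 0.3, 0.3, 0.05   # illustrative parameters
n = 200_000
m1 = sum(abs(0.5 - random.gauss(p, s1)) for _ in range(n)) / n
m2 = sum(abs(0.5 - random.gauss(p, s2)) for _ in range(n)) / n
# m1 exceeds m2, i.e. E[g(X1)] > E[g(X2)]
```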

The distribution for *D*_{L}_{2} (*j*) and
${D}_{{L}_{2}}^{*}(j)$ can be described by the variances and covariances of terms in the following vector:

$${V}_{1}=[{Y}_{{i}_{1}},{\epsilon}_{M},{Y}_{{i}_{1}}{Y}_{{i}_{2}},{Y}_{{i}_{1}}{\epsilon}_{M},{Y}_{{i}_{1}}^{2},{\epsilon}_{M}^{2},{Y}_{{i}_{1}}{Y}_{{i}_{3}},{Y}_{{i}_{2}}{\epsilon}_{M}]\prime $$

(52)

By direct calculation, *var*(*V*_{1}) is given by:

$$\left(\begin{array}{cccccccc}\frac{1}{2}p-\frac{1}{2}{p}^{2}& 0& \frac{1}{2}{p}^{2}-\frac{1}{2}{p}^{3}-\frac{3}{4}{p}^{4}& 0& \frac{1}{4}p+\frac{1}{4}{p}^{2}-\frac{1}{2}{p}^{3}& 0& \frac{1}{2}{p}^{2}-\frac{1}{2}{p}^{3}& 0\\ 0& {\sigma}^{2}& 0& p{\sigma}^{2}& 0& 0& 0& p{\sigma}^{2}\\ \frac{1}{2}{p}^{2}-\frac{1}{2}{p}^{3}& 0& \frac{1}{4}{p}^{2}+\frac{1}{2}{p}^{3}-\frac{3}{4}{p}^{4}& 0& \frac{1}{4}{p}^{2}+\frac{1}{4}{p}^{3}-\frac{1}{2}{p}^{4}& 0& \frac{1}{2}{p}^{3}-\frac{1}{2}{p}^{4}& 0\\ 0& p{\sigma}^{2}& 0& (\frac{1}{2}{p}^{2}+\frac{1}{2}p){\sigma}^{2}& 0& 0& 0& {p}^{2}{\sigma}^{2}\\ \frac{1}{4}p-\frac{1}{4}{p}^{2}-\frac{1}{2}{p}^{3}& 0& \frac{1}{4}{p}^{2}+\frac{1}{4}{p}^{3}-\frac{1}{2}{p}^{4}& 0& \frac{1}{8}p+\frac{5}{8}{p}^{2}-\frac{1}{2}{p}^{3}-\frac{1}{4}{p}^{4}& 0& \frac{1}{4}{p}^{2}+\frac{1}{4}{p}^{3}-\frac{1}{2}{p}^{4}& 0\\ 0& 0& 0& 0& 0& 2{\sigma}^{4}& 0& 0\\ \frac{1}{2}{p}^{2}-\frac{1}{2}{p}^{3}& 0& \frac{1}{2}{p}^{3}-\frac{1}{2}{p}^{4}& 0& \frac{1}{4}{p}^{2}+\frac{1}{4}{p}^{3}-\frac{1}{2}{p}^{4}& 0& \frac{1}{4}{p}^{2}+\frac{1}{2}{p}^{3}-\frac{3}{4}{p}^{4}& 0\\ 0& {p}^{2}{\sigma}^{2}& 0& (\frac{1}{2}{p}^{2}+\frac{1}{2}p){\sigma}^{2}& 0& 0& 0& p{\sigma}^{2}\end{array}\right)$$

In this section, as we focus on only a single locus, we have dropped the subscript ‘j’ from *Y** _{i j}* and the error terms. Expanding the squared differences, we have

$$\begin{array}{l}{D}_{{L}_{2}}(j)=\\ \frac{-{\sum}_{i:{R}_{i}=1}2{Y}_{0}{Y}_{i}}{n}-2{Y}_{0}{\epsilon}_{R}+\frac{{\sum}_{i:{R}_{i}=1}{\sum}_{k:{R}_{k}=1}{Y}_{i}{Y}_{k}}{{n}^{2}}+\frac{{\sum}_{i:{R}_{i}=1}2{Y}_{i}{\epsilon}_{R}}{n}+{\epsilon}_{R}^{2}-\\ (\frac{-{\sum}_{i:{M}_{i}=1}2{Y}_{0}{Y}_{i}}{n}-2{Y}_{0}{\epsilon}_{M}+\frac{{\sum}_{i:{M}_{i}=1}{\sum}_{k:{M}_{k}=1}{Y}_{i}{Y}_{k}}{{n}^{2}}+\frac{{\sum}_{i:{M}_{i}=1}2{Y}_{i}{\epsilon}_{M}}{n}+{\epsilon}_{M}^{2})\end{array}$$

(53)

and

$$\begin{array}{l}\mathit{\text{var}}({D}_{{L}_{2}}(j))=(\frac{8}{n}+\frac{4n(n-1)}{{n}^{4}})\mathit{\text{var}}({Y}_{1}{Y}_{2})+(8+\frac{8}{n})\mathit{\text{var}}({Y}_{1}{\epsilon}_{M})+\\ \frac{2}{{n}^{3}}\mathit{\text{var}}({Y}_{1}^{2})+2\mathit{\text{var}}({\epsilon}_{M}^{2})+\\ (-\frac{16}{n}-\frac{8}{{n}^{2}}+\frac{16}{{n}^{2}})\mathit{\text{cov}}({Y}_{1}{Y}_{2},{Y}_{1}{Y}_{3})+\\ (-\frac{8}{{n}^{3}})\mathit{\text{cov}}({Y}_{1}{Y}_{2},{Y}_{1}^{2})+(-8+\frac{8}{n})\mathit{\text{cov}}({Y}_{1}{\epsilon}_{M},{Y}_{2}{\epsilon}_{M})\end{array}$$

(54)

By substituting the appropriate values from the variance matrix for *V*_{1}, we get the variance given in section 2.3. Using the same approach of expanding the terms and straightforward calculation, we were able to find *cov*(${D}_{1{L}_{2}}^{*},{D}_{2{L}_{2}}^{*}$).
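Several entries of the variance matrix for *V*_{1} can be spot-checked by exact enumeration. The sketch below assumes the genotype coding *Y* = *G*/2 with *G* ~ Binomial(2, *p*), so that *Y* takes the values 0, 0.5, 1 (our reading of the coding, consistent with the matrix entries):

```python
# Spot-check entries of var(V1) by exact enumeration over Y in {0, 0.5, 1}.
p = 0.3  # illustrative allele frequency

dist = {0.0: (1 - p) ** 2, 0.5: 2 * p * (1 - p), 1.0: p ** 2}

def ev(f):
    """Expectation of f(Y) under the genotype distribution."""
    return sum(f(y) * pr for y, pr in dist.items())

EY = ev(lambda y: y)  # equals p
# Entry (1,1): var(Y_i1) = p/2 - p^2/2
var_Y = ev(lambda y: y * y) - EY ** 2
# Entry (5,5): var(Y_i1^2) = p/8 + 5p^2/8 - p^3/2 - p^4/4
var_Y2 = ev(lambda y: y ** 4) - ev(lambda y: y * y) ** 2
# Entry (3,7): cov(Y1*Y2, Y1*Y3) = E[Y1^2]E[Y2]E[Y3] - E[Y1Y2]E[Y1Y3]
#            = p^3/2 - p^4/2 for independent Y1, Y2, Y3
cov_12_13 = ev(lambda y: y * y) * EY * EY - (EY * EY) * (EY * EY)
```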

- Balding David J. Likelihood-based inference for genetic correlation coefficients. Theoretical Population Biology. 2003;63(3):221–230. doi: 10.1016/S0040-5809(03)00007-8.
- Couzin Jennifer. Genetic privacy: Whole-genome data not anonymous, challenging assumptions. Science. 2008 Sep;321(5894):1278. doi: 10.1126/science.321.5894.1278.
- Foreman LA, Champod C, Evett IW, Lambert JA, Pope S. Interpreting DNA evidence: A review. International Statistical Review. 2003;71:473–495.
- Fung Wing K, Hu Yue-Qing. Evaluating mixed stains with contributors of different ethnic groups under the NRC-II recommendation 4.1. Statistics in Medicine. 2002 Nov;21:3583–3593. doi: 10.1002/sim.1313.
- Homer Nils, Szelinger Szabolcs, Redman Margot, Duggan David, Tembe Waibhav, Muehling Jill, Pearson John V, Stephan Dietrich A, Nelson Stanley F, Craig David W. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 2008 Aug;4(8):e1000167. doi: 10.1371/journal.pgen.1000167.
- Jobling Mark A, Gill Peter. Encoded evidence: DNA in forensic analysis. Nat Rev Genet. 2004 Oct;5:739–751. doi: 10.1038/nrg1455.
- Kidd Kenneth K, Pakstis Andrew J, Speed William C, Grigorenko Elena L, Kajuna Sylvester LB, Karoma Nganyirwa J, Kungulilo Selemani, Kim Jong-Jin, Lu Ru-Band, Odunsi Adekunle, Okonofua Friday, Parnas Josef, Schulz Leslie O, Zhukova Olga V, Kidd Judith R. Developing a SNP panel for forensic identification of individuals. Forensic Science International. 2006;164(1):20–32. doi: 10.1016/j.forsciint.2005.11.017.
- Macgregor Stuart, Zhao Zhen Zhen, Henders Anjali, Martin Nicholas G, Montgomery Grant W, Visscher Peter M. Highly cost-efficient genome-wide association studies using DNA pools and dense SNP arrays. Nucleic Acids Res. 2008;36:e35. doi: 10.1093/nar/gkm1060.
- Pearson JV, Huentelman MJ, Halperin RF, Tembe WD, Melquist S, et al. Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies. American Journal of Human Genetics. 2007;80:126–139. doi: 10.1086/510686.
- Stoneking M, Hedgecock D, Higuchi RG, Vigilant L, Erlich HA. Population variation of human mtDNA control region sequences detected by enzymatic amplification and sequence-specific oligonucleotide probes. American Journal of Human Genetics. 1991;48(2):370–382.
- The International HapMap Consortium. The International HapMap Project. Nature. 2003 Dec;426:789–796.

Articles from Statistical Applications in Genetics and Molecular Biology are provided here courtesy of **Berkeley Electronic Press**
