Complex Mixture Constructions
A total of 8 complex mixtures were constructed (See ). All DNA samples were quantified in triplicate using the Quant-iT PicoGreen dsDNA Assay Kit by Invitrogen (Carlsbad, CA). For accuracy, an eight-point standard curve was prepared using Human Genomic DNA from Roche Diagnostics (Cat#: 11691112001, Indianapolis, IN). The median concentration was calculated for each individual DNA sample.
Mixtures are composed partially of HapMap individuals empirically evaluated on the Illumina 550 K v3, Illumina 450S Duo, and Affymetrix 5.0 microarrays.
Mixtures A1, A2, B1, and B2: Equimolar Mixtures of HapMap Individuals
Shown in , two main mixtures (mixtures A and B) were each composed in duplicate, resulting in a total of 4 mixtures. Mixture A was composed of 41 HapMap CEU individuals (14 trios minus one individual) and mixture B was composed of 47 HapMap CEU individuals (16 trios minus one individual).
Mixture C1: 90% NA12752 and 10% NA07048
Two CEU males were combined in a single mixture so that one individual (NA12752) contributed 90% (675 ng) of the DNA in the mixture, while the other individual (NA07048) contributed only 10% (75 ng) of the DNA in the mixture by concentration.
Mixture C2: 90% NA10839 and 10% NA07048
Two CEU individuals, a female and a male, were combined in a single mixture so that one individual (NA10839) contributed 90% (675 ng) of the DNA in the mixture, while the other individual (NA07048) contributed only 10% (75 ng) of the DNA in the mixture by concentration.
Mixture D1: 99% NA12752 and 1% NA07048
Two CEU males were combined in a single mixture so that one individual (NA12752) contributed 99% (742.5 ng) of the DNA in the mixture, while the other individual (NA07048) contributed only 1% (7.5 ng) of the DNA in the mixture by concentration.
Mixture D2: 99% NA10839 and 1% NA07048
Two CEU individuals, a female and a male, were combined in a single mixture so that one individual (NA10839) contributed 99% (742.5 ng) of the DNA in the mixture, while the other individual (NA07048) contributed only 1% (7.5 ng) of the DNA in the mixture by concentration.
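The masses quoted for the two-person mixtures are consistent with a total of 750 ng per mixture (675 ng + 75 ng, or 742.5 ng + 7.5 ng). The following is a minimal sketch of that arithmetic only; the function name `component_masses` is hypothetical and the snippet does not describe the pipetting protocol itself.

```python
def component_masses(total_ng, minor_fraction):
    """Split a total DNA mass between a major and a minor contributor."""
    if not 0.0 < minor_fraction < 1.0:
        raise ValueError("minor_fraction must lie strictly between 0 and 1")
    minor_ng = total_ng * minor_fraction
    major_ng = total_ng - minor_ng
    return major_ng, minor_ng


# Mixtures C1/C2 (10% minor contributor) and D1/D2 (1% minor contributor),
# each assumed to total 750 ng as implied by the masses quoted in the text.
for label, fraction in (("C", 0.10), ("D", 0.01)):
    major, minor = component_masses(750.0, fraction)
    print(f"Mixture {label}: major {major:.1f} ng, minor {minor:.1f} ng")
```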
Mixture E: 50% Mixture A1 and 50% Mixture of 184 Equimolar Caucasians
Two mixtures were combined into a single mixture so that each of the original mixtures contributed the same amount of genomic DNA by volume to the final mixture. The CAU2 mixture contained 184 Caucasian control individuals obtained from the Coriell Cell Repository. Mixture A1 was constructed as above and contained 41 CEU individuals.
Mixture F: 50% Mixture B2 and 50% Mixture of 184 Equimolar Caucasians
Two mixtures were combined into a single mixture so that each mixture contributed the same amount of genomic DNA by volume to the final mixture. The CAU3 mixture contained 184 Caucasian control individuals obtained from the Coriell Cell Repository. Mixture B2 was constructed as above.
Mixture G: 5% Mixture A2 and 95% Mixture of 184 Equimolar Caucasians
Two mixtures were combined into a single mixture, with Mixture A2 comprising 5% of the mixture and the CAU3 mixture comprising 95%. The CAU3 mixture contained 184 Caucasian control individuals obtained from the Coriell Cell Repository. Mixture A2 was constructed as above.
Mixture H: 5% Mixture B1 and 95% Mixture of 184 Equimolar Caucasians
Two mixtures were combined into a single mixture, with Mixture B1 comprising 5% of the mixture and the CAU2 mixture comprising 95%. The CAU2 mixture contained 184 Caucasian control individuals obtained from the Coriell Cell Repository. Mixture B1 was constructed as above.
Four cohorts were assayed on the Illumina (San Diego, CA) HumanHap550 Genotyping BeadChip v3, one cohort was assayed on the Illumina HumanHap450S Duo, and three cohorts were assayed on the Affymetrix (Emeryville, CA) Genome-Wide Human SNP 5.0 array, with each cohort being assayed on a single chip. Probe intensity values were extracted for analysis from the file folders generated by the BeadScan software for the Illumina platform, and from Affymetrix GTYPE 4.008 software for the Affymetrix data, as described in previous studies.
Theoretical Derivation of Test-Statistic
We recognize there are multiple approaches to derive a test-statistic to evaluate the hypothesis that a person is within a mixture, and these are discussed further in later sections. Here we take a frequentist rather than a Bayesian approach, recognizing that both are possible and each has unique advantages.
An overview of our approach is described in , and this method can be summarized as the cumulative sum of allele shifts over all available SNPs, where the shift's sign is defined by whether the individual of interest is closer to a reference sample or closer to the given mixture. We first introduce our method in terms of genotyping a given SNP for a single person, which addresses the original design of SNP genotyping microarrays for the field of human genetics. We then proceed to adapt our method for mixtures and pooled data.
To give insight into the intuition behind our method, we present for a given SNP three different scenarios for the possible allele frequency of the person of interest corresponding to the genotypes AA, AB, and BB.
Current genotyping microarray technology can assay millions of SNPs. Genotypes are expected to result from an assay and data is categorical in nature, e.g. AA, AB, BB, or NoCall, where A and B symbolically represent the two alleles for a biallelic SNP. However, as evident from copy number, calling algorithm, and pooling-based GWA studies, raw preprocessed data from SNP genotyping arrays is typically in the form of allele intensity measurements that are proportional to the quantity of the “A” and “B” alleles hybridized to a specific probe (also termed a feature) on a microarray. Individual probe intensity measurements are derived from the fluorescence measurement of a single bead (e.g. Illumina) or 5 micron square on a flat surface (e.g. Affymetrix). On a genotyping array, multiple probes are present per SNP at either a fixed number of copies (Affymetrix) or a variable number of copies (Illumina). For example, recent generation Affymetrix arrays typically have 3 to 4 probes for the A allele and B allele respectively, whereas Illumina arrays have a random number of probes averaging approximately 18 probes per allele. With 500,000+ SNPs, there are millions of probes (or features) on a SNP genotyping array. One should note that there are considerably different sample preparation chemistries prior to hybridization between SNP genotyping platforms and thus probes behave differently on the respective platforms.
Before we discuss resolving mixtures, we summarize ‘genotype calling’ in the context of data from a single individual at a single SNP. SNP genotyping algorithms typically begin by transforming normalized data into a ratio or polar coordinates. For simplicity, we will utilize the ratio transformation $Y_j = A_j/(A_j + k_j B_j)$, where $A_j$ is the probe intensity for the A allele and $B_j$ is the probe intensity for the B allele for the $j$th SNP. Multiple papers have shown that the $Y_j$ transformation approximates allele frequency, where $k_j$ is the SNP-specific correction factor accounting for experimental bias and is easily calculated from individual genotyping data. Thus with this transformation, $Y_j$ is an estimate of the A allele frequency (termed $p_A$) for each SNP. Since most individuals contain two copies of the genome for autosomal SNPs, values for the A allele frequency ($p_A$) in a single individual may be 100%, 50%, or 0% for the genotypes AA, AB, or BB, respectively. Equivalently, $Y_j$ will be approximately 1, 0.5, or 0, varying from these values due to measurement noise. For example, assuming $k_j = 1$, probe intensity measurements of $A_j = 450$ and $B_j = 550$ yield $Y_j = 0.45$, and this SNP would likely be called AB. For a single individual, we thus expect to see a trimodal distribution for $Y$ across all SNPs since only AA, AB, or BB genotype calls are expected. However, in a mixture of multiple individuals, the assumptions of the genotype-calling algorithm are invalid, since only AA, AB, BB, or NoCall are given regardless of the number of pooled chromosomes.
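The ratio transformation and the worked example (A = 450, B = 550, k = 1) can be sketched as below. This is an illustration rather than any vendor's calling algorithm: the function names, the fixed cluster centres (1, 0.5, 0 for AA, AB, BB under Y = A/(A + kB)), and the tolerance of 0.2 are assumptions made for this sketch, whereas production callers fit clusters per SNP.

```python
def allele_frequency_estimate(a_intensity, b_intensity, k=1.0):
    """Ratio transformation Y = A / (A + k * B) for a single SNP."""
    denominator = a_intensity + k * b_intensity
    if denominator <= 0:
        raise ValueError("intensities must give a positive denominator")
    return a_intensity / denominator


def naive_call(y, tolerance=0.2):
    """Toy genotype call: nearest expected cluster, or NoCall if too far.

    Cluster centres and tolerance are illustrative assumptions only.
    """
    centres = {"AA": 1.0, "AB": 0.5, "BB": 0.0}
    genotype, centre = min(centres.items(), key=lambda item: abs(y - item[1]))
    return genotype if abs(y - centre) <= tolerance else "NoCall"


# Worked example from the text: A = 450, B = 550, k = 1.
y = allele_frequency_estimate(450, 550)
print(y, naive_call(y))
```

A pooled sample with an intermediate value such as Y = 0.75 falls between clusters and is returned as NoCall by this toy caller, which illustrates why categorical calls discard the information that the relative intensities still carry for mixtures.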
However, this does not prevent us from extracting information and meaning from the relative probe intensity data. In our approach, we compare allele frequency estimates from our mixture (termed $M$, where $M_j = A_j/(A_j + k_j B_j)$) to estimates of the mean allele frequencies of a reference population. The selection of the reference population is important and will be discussed later. For now, we assume that the reference population has a similar ancestral make-up as that of the mixture. We refer to having similar population substructure, ethnicity, or ancestral components interchangeably, and define similar ancestral components for an individual or mixture as having similar allele frequencies across all SNPs. We let $Y_{i,j}$ be the allele frequency estimate for the individual $i$ and SNP $j$, where $Y_{i,j} \in \{0, 0.5, 1\}$, from a SNP genotyping array. We then compare absolute values for two differences. The first difference $|Y_{i,j} - M_j|$ measures how the allele frequency of the mixture $M_j$ at SNP $j$ differs from the allele frequency of the individual $Y_{i,j}$ for SNP $j$. The second difference $|Y_{i,j} - Pop_j|$ measures how the reference population's allele frequency $Pop_j$ differs from the allele frequency of the individual $Y_{i,j}$ for each SNP $j$. The values for $Pop_j$ can be determined from an array of equimolar pooled samples or from databases containing genotype data of various populations. Taking the difference between these two differences, we obtain the distance measure used for individual $Y_i$:
$$D(Y_{i,j}) = |Y_{i,j} - Pop_j| - |Y_{i,j} - M_j| \qquad (1)$$
Under the null hypothesis that the individual is not in the mixture, $D(Y_{i,j})$ approaches zero since the mixture and reference population are assumed to have similar allele frequencies due to having similar ancestral components. Under the alternative hypothesis, $D(Y_{i,j}) > 0$ since we expect that $M_j$ is shifted away from the reference population by $Y_i$'s contribution to the mixture. In the case of $D(Y_{i,j}) < 0$, $Y_i$ is more ancestrally similar to the reference population than to the mixture, and thus less likely to be in the mixture. Consistent with the explanation of , $D(Y_{i,j})$ is positive when $Y_{i,j}$ is closer to $M_j$ and $D(Y_{i,j})$ is negative when $Y_{i,j}$ is closer to $Pop_j$. By sampling 500 K+ SNPs, one would generally expect $D(Y_{i,j})$ to follow a normal distribution due to the central limit theorem. In our analysis, we take a one-sample t-test for this individual, sampled across all SNPs, and thus obtain the test statistic:
$$T(Y_i) = \frac{E(D(Y_i)) - \mu_0}{SD(D(Y_i))/\sqrt{s}} \qquad (2)$$
In equation (2), $E(D(Y_i))$ is the mean of $D(Y_{i,j})$ over all SNPs $j$, we assume $\mu_0$ is the mean of $D(Y_k)$ over individuals $Y_k$ not in the mixture, $SD(D(Y_i))$ is the standard deviation of $D(Y_{i,j})$ for all SNPs $j$ and individual $Y_i$, and $s$ is the number of SNPs. We assume $\mu_0$ is zero since a random individual $Y_k$ should be equally distant from the mixture and the mixture's reference population and so
$$T(Y_i) = \frac{E(D(Y_i))}{SD(D(Y_i))/\sqrt{s}} \qquad (3)$$
Under the null hypothesis $T(Y_i)$ is zero and under the alternative hypothesis $T(Y_i) > 0$. In order to account for subtle differences in ancestry between the individual, mixture, and reference populations among other biases we normalize our allele frequency estimates to a reference population.
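The distance measure and one-sample t statistic described above can be sketched in a few lines of standard-library Python. This is a minimal sketch, not the released software: the function names are hypothetical, the inputs are plain lists of per-SNP allele frequency estimates, the toy values are made up for illustration, and the one-sided normal p-value is an assumption based on the stated expectation that the statistic is approximately normal for large numbers of SNPs.

```python
import math
import statistics


def distance(y, m, pop):
    """Per-SNP distance D = |Y - Pop| - |Y - M| (positive when Y is nearer M)."""
    return [abs(yj - pj) - abs(yj - mj) for yj, mj, pj in zip(y, m, pop)]


def t_statistic(y, m, pop, mu0=0.0):
    """One-sample t statistic of D across SNPs: (mean(D) - mu0) / (SD(D) / sqrt(s))."""
    d = distance(y, m, pop)
    s = len(d)
    if s < 2:
        raise ValueError("at least two SNPs are required")
    sd = statistics.stdev(d)
    if sd == 0:
        raise ValueError("D has zero variance across SNPs")
    return (statistics.mean(d) - mu0) / (sd / math.sqrt(s))


def upper_tail_p(t):
    """One-sided p-value under a standard normal approximation (assumption)."""
    return 0.5 * math.erfc(t / math.sqrt(2))


# Toy values for four SNPs (illustrative only, not measured data).
y = [1.0, 0.5, 0.0, 1.0]        # individual of interest
pop = [0.6, 0.4, 0.3, 0.5]      # reference population
m = [0.7, 0.45, 0.25, 0.6]      # mixture, shifted toward the individual
t = t_statistic(y, m, pop)
print(t, upper_tail_p(t))
```

As a sanity check on the sign convention, in the idealized noise-free case where the mixture is exactly the reference population plus a fraction f of the individual, M = (1 - f) Pop + f Y, the distance reduces to D = f |Y - Pop|, which is never negative.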
Ancestry and Reference Populations
Different populations will have different mean SNP allele frequencies based on ancestry, admixture, and population bottlenecks. An obvious assumption of this type of analysis is that the reference population must be either (a.) accurately matched in terms of ancestral composition to the mixture and person of interest or (b.) limited to analysis of SNPs with minimal (or known) bias towards ancestry. It is first important to recognize that any single SNP will have only a small effect on the overall test-statistic. Moreover, it is realistic that ancestry of the reference population could be determined by analysis of a small subset of SNPs, followed by analysis of a person's contribution to the mixture with a separate set of SNPs (recognizing that nearly 500,000 SNPs are assayed). In the absence of SNP-specific ancestral information towards allele frequency, as was assumed in our study, we can also use normalization methods that leverage the fact that we have assayed hundreds of thousands of SNPs and consequently have largely sampled the distribution of the test-statistic. In essence, we fit the test-statistic to a second reference population matched to the individual of interest to account for ancestry differences that do not affect the overall distribution of allele frequencies. Thus, under the assumption of similar test-statistic distributions, normalizing SNP data from the mixture to a reference population reduces the effect of systematic biases on allele frequency from the microarray or, to an extent, towards ancestry, at a cost of power.
While not necessary in this study, the effect of ancestry on allele frequency could be more directly managed by SNP selection combined with extensive allele frequency data across multiple ancestrally diverse populations. Ideally, one would use a subset of SNPs to identify ancestry of the individual and then match them to a reference population. Moreover, SNPs that are stable for allele frequency across populations (low Fst) or that have a common distribution of allele frequencies would be preferable. Identifying such a set of SNPs and more appropriately considering ancestral biases are reserved for future database studies whereby genotype data of an ancestrally diverse set of individuals is available.
Pre-compiled UNIX binaries are available for a software implementation of our method and can be found at http://public.tgen.org/dcraig/deciphia. Our software is able to run analysis using raw data from either Affymetrix or Illumina or by using genotype calls. The software is also able to normalize our test statistic using the reference population and/or adjust the mean test statistic using a specified individual. Additionally, the user can restrict the SNPs considered to a subset of the total available SNPs. For raw input data we are able to match the distribution of signal intensities for each raw data file to that of the mixture input file (see platform specific analysis). Finally, multiple test statistics and distance calculations are implemented including our original test statistic, Pearson correlation, Spearman rank correlation and Wilcoxon sign test.
Platform Specific Analysis
With the Affymetrix platform we were able to use genotypes for each individual and found similar results with the Illumina platform. Additionally, we were able to use the raw CEL files from the HapMap dataset found at http://www.HapMap.org. To overcome the differences in distribution of signal intensity between CEL files, we matched the distribution of the signal intensities to the distribution of the mixture's CEL file. This was achieved by ordering allele frequencies on a given chip (and allele frequencies in the mixture). We then substituted the $i$th allele frequency from the mixture of interest for the $i$th allele frequency on the given chip. Without this adjustment, there was difficulty resolving any individual in any mixture due to the fact that we did not account for off-target cross-hybridization. This type of adjustment is the preferred type of normalization method when raw data is available for the mixture, person of interest, and reference population.
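The rank-based substitution described above amounts to matching the empirical distribution of one chip to that of the mixture. A minimal sketch follows, assuming both inputs are equal-length lists of per-SNP allele frequency estimates; the function name and the toy values are hypothetical, and ties are broken by input order, which the text does not specify.

```python
def match_distribution(chip_values, mixture_values):
    """Replace each chip value with the mixture value holding the same rank."""
    if len(chip_values) != len(mixture_values):
        raise ValueError("both inputs must cover the same number of SNPs")
    # SNP indices of the chip, ordered from smallest to largest value.
    order = sorted(range(len(chip_values)), key=lambda idx: chip_values[idx])
    sorted_mixture = sorted(mixture_values)
    matched = [0.0] * len(chip_values)
    for rank, idx in enumerate(order):
        # The i-th smallest chip value takes the i-th smallest mixture value.
        matched[idx] = sorted_mixture[rank]
    return matched


chip = [0.90, 0.10, 0.55, 0.40]      # toy values, illustrative only
mixture = [0.30, 0.62, 0.48, 0.71]
print(match_distribution(chip, mixture))
```

After matching, the chip has exactly the same set of values as the mixture while preserving the chip's own rank order across SNPs.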
For the Illumina platform we used the genotypes from the HapMap dataset for both the person of interest and the reference populations instead of raw intensity values as we had for the Affymetrix platform. For the mixture we used raw intensity values. This set of data mimics the case where raw data may not be available but genotype calls are available. We use a simple method to reduce errors between different microarrays, where we normalize each microarray by dividing by the mean channel intensity for each respective channel. This was performed on the raw data for the mixture only. We note that this platform specific adjustment is not needed when the raw data for a person's genotype is present on the same platform. In the Illumina specific example, we utilized only the calls from the HapMap without having platform specific genotype data. Theoretically, it should be possible to use a library of $Y_i$ means for AA, AB, and BB to map genotype calls to expected $Y_i$ values for each SNP for individually genotyped samples, but this was not necessary for our analysis.
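The two Illumina-specific steps, dividing each channel by its mean intensity and representing genotype calls as allele frequencies, can be sketched as follows. The names are hypothetical, the intensities are toy values, and the idealized mapping AA = 1, AB = 0.5, BB = 0 is an assumption used in place of the per-SNP library of cluster means mentioned in the text.

```python
def normalize_channel(intensities):
    """Divide one channel's intensities by that channel's mean intensity."""
    mean_intensity = sum(intensities) / len(intensities)
    if mean_intensity <= 0:
        raise ValueError("mean channel intensity must be positive")
    return [value / mean_intensity for value in intensities]


# Idealized expected A-allele frequency per genotype call (assumption).
GENOTYPE_TO_Y = {"AA": 1.0, "AB": 0.5, "BB": 0.0}


def calls_to_frequencies(calls):
    """Map genotype calls to allele frequencies; NoCall and others become None."""
    return [GENOTYPE_TO_Y.get(call) for call in calls]


# Toy mixture intensities for three SNPs (illustrative only).
a_channel = normalize_channel([400.0, 600.0, 1000.0])
b_channel = normalize_channel([500.0, 500.0, 500.0])
mixture_y = [a / (a + b) for a, b in zip(a_channel, b_channel)]
print(mixture_y)
print(calls_to_frequencies(["AA", "AB", "BB", "NoCall"]))
```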
Simulation was used to test the efficacy of using high-density SNP genotyping data for resolving mixtures. The key variables of the simulation are: the number of SNPs $s$, the fraction $f$ of the total DNA mixture contributed by our person of interest $Y_i$, and the variance or noise inherent to assay probes $v_p$. In the simulations, theoretical mixtures were composed by randomly sampling individuals from the 58C Wellcome Trust Case-Control Consortium (WTCCC) dataset. After removing duplicates, relatives and other data anomalies, a total of 1423 individuals remained for sampling. The genotype calls for these individuals were provided from the WTCCC and were previously genotyped on the Affymetrix 500 K platform. Within each simulation, we randomly chose $N$ individuals to be equally represented in our mixture and then computed the mean allele frequency ($Y_i$) of our mixture for each SNP. SNPs $j$ with an observed $Y_{ij}$ below 0.05 or above 0.95 in the reference population were removed due to their potential for having false positives and low inherent information content.
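The two bookkeeping steps in this paragraph, averaging genotypes over the N sampled individuals and discarding SNPs with extreme reference frequencies, can be sketched as below. The function names and the toy genotypes are hypothetical; genotypes are assumed to be coded as A-allele frequencies of 0, 0.5, or 1.

```python
def mixture_frequency(genotypes):
    """Per-SNP mean allele frequency of an equimolar mixture.

    genotypes: one list per individual, each holding 0.0, 0.5 or 1.0 per SNP.
    """
    n_individuals = len(genotypes)
    return [sum(column) / n_individuals for column in zip(*genotypes)]


def informative_snps(reference_frequency, low=0.05, high=0.95):
    """Indices of SNPs whose reference frequency is within [low, high]."""
    return [j for j, p in enumerate(reference_frequency) if low <= p <= high]


# Four toy individuals typed at three SNPs (illustrative only).
genotypes = [
    [1.0, 0.5, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.5, 0.5],
    [1.0, 1.0, 0.0],
]
reference = [0.99, 0.45, 0.10]
kept = informative_snps(reference)
mixture = mixture_frequency(genotypes)
print(kept, [mixture[j] for j in kept])
```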
We then simulated a microarray that would contain a mean of 16 probes for simplicity, approximating the mean number of probes found on the Illumina 550 K, Illumina 450S Duo and Affymetrix 5.0 platforms (18.5, 14.5 and 4 respectively). For each SNP $j$ we added to the $Y_{ij}$ of each probe a Gaussian noise based on the previously measured probe variance. When fixed, we set probe variance to 0.006 when simulating Affymetrix 5.0 arrays, and to 0.001 for both Illumina 550 K and Illumina 450S Duo arrays. The allele frequency of the mixture was then calculated to be the mean of these probe values. A mixture size of $N$ is equivalent to saying that an individual's DNA represents $f = 1/N$ of the total DNA in the mixture. We tested equimolar mixtures ranging from 10 individuals to 1,000 individuals. Using this design, we tested each individual for their presence where they contributed between 10% and 0.1% genomic DNA to the total mixture. To obtain significance levels (p-values) for testing the null hypothesis, we sampled from the normal distribution. We note that we do not have enough samples to test the tail of our distribution and therefore our p-values are not completely accurate (e.g. below $10^{-6}$). Nonetheless, p-values are expected to be sufficiently accurate to qualitatively assess the limits of our method.
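A scaled-down version of this simulation design can be sketched with the standard library alone. Several choices here are assumptions made to keep the example self-contained and are not taken from the study: genotypes are drawn under Hardy-Weinberg proportions from population frequencies sampled uniformly between 0.05 and 0.95 (instead of being resampled from WTCCC genotype calls), the true population frequency is used as the reference, every SNP has exactly 16 probes, and the p-value uses a one-sided normal approximation. All function names are hypothetical.

```python
import math
import random
import statistics


def draw_genotype(p, rng):
    """A-allele frequency (0, 0.5 or 1) of one diploid genotype under Hardy-Weinberg."""
    return ((rng.random() < p) + (rng.random() < p)) / 2.0


def t_statistic(y, m, pop):
    """T = mean(D) / (SD(D) / sqrt(s)) with D = |Y - Pop| - |Y - M|."""
    d = [abs(yj - pj) - abs(yj - mj) for yj, mj, pj in zip(y, m, pop)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


def simulate(n_individuals=10, n_snps=5000, probes=16, probe_var=0.001, seed=0):
    """Return T for one mixture member and for one individual outside the mixture."""
    rng = random.Random(seed)
    # Assumed reference frequencies, restricted to the retained 0.05-0.95 range.
    pop = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]
    people = [[draw_genotype(p, rng) for p in pop] for _ in range(n_individuals + 1)]
    members, outsider = people[:n_individuals], people[n_individuals]
    noise_sd = math.sqrt(probe_var)
    mixture = []
    for column in zip(*members):
        true_frequency = sum(column) / n_individuals
        # Each probe reports the true frequency plus Gaussian noise; average them.
        readings = [true_frequency + rng.gauss(0.0, noise_sd) for _ in range(probes)]
        mixture.append(sum(readings) / probes)
    return t_statistic(members[0], mixture, pop), t_statistic(outsider, mixture, pop)


t_in, t_out = simulate()            # N = 10, so f = 1/N = 10%
p_in = 0.5 * math.erfc(t_in / math.sqrt(2))
print(f"member T = {t_in:.2f} (p = {p_in:.2e}), non-member T = {t_out:.2f}")
```

Raising `n_individuals` while holding `n_snps` fixed shrinks the member's statistic roughly in proportion to f = 1/N, which gives a qualitative feel for how many SNPs are needed to resolve a trace contributor.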
To examine empirically the efficacy of our method we formed various known mixtures of DNA from HapMap individuals and genotyped the mixtures on three different platforms. Listed in and detailed in the methods are the compositions of the different mixtures formed and the platforms they were assayed across. The use of mixtures of HapMap individuals has several advantages. First, we can be confident of the genotype calls because in most cases more than one platform has been used to identify the consensus genotype. Second, trios are available, which allow for evaluation of identifying an individual using a relative's genotype data. Third, by using mixtures of multiple HapMap individuals we can evaluate our ability to resolve each individual within the mixture. Therefore we have constructed simple two-person mixtures as well as complex mixtures containing contributions from 40+ individuals. For each mixture, we used the HapMap CEU individuals not present in the mixture as our reference population for the mixture.