Stat Appl Genet Mol Biol. 2009 January 1; 8(1): 24.
Published online 2009 May 5. doi:  10.2202/1544-6115.1433
PMCID: PMC2861316

Incorporating Duplicate Genotype Data into Linear Trend Tests of Genetic Association: Methods and Cost-Effectiveness

Abstract

The genome-wide association (GWA) study is an increasingly popular way to attempt to identify the causal variants in human disease. Duplicate genotyping (or re-genotyping) a portion of the samples in a GWA study is common, though it is typical for these data to be ignored in subsequent tests of genetic association. We demonstrate a method for including duplicate genotype data in linear trend tests of genetic association which yields increased power. We also consider the cost-effectiveness of collecting duplicate genotype data and find that when the relative cost of genotyping to phenotyping and sample acquisition costs is less than or equal to the genotyping error rate it is more powerful to duplicate genotype the entire sample instead of spending the same money to increase the sample size. Duplicate genotyping is particularly cost-effective when SNP minor allele frequencies are low. Practical advice for the implementation of duplicate genotyping is provided. Free software is provided to compute asymptotic and permutation based tests of association using duplicate genotype data as well as to aid in the duplicate genotyping design decision.

Introduction

Genetic tests of association often utilize case-control study designs in order to identify possible genetic factors contributing to the etiology of a complex disease (Amos 2007, Sasieni 1997). Examining the whole genome simultaneously through genome-wide association (GWA) studies has become an increasingly popular and effective method of determining genetic association. While high costs of GWA studies are still a limiting factor, they continue to become more economically plausible with advances in technology that identify single nucleotide polymorphism (SNP) genotypes at decreasing costs (Amos 2007).

Despite these technological advances, the misclassification of genotypes by SNP technology (genotyping errors) remains a persistent issue. Genotyping error rates are low in many instances (~0.1–0.2% or lower; Saunders et al. 2007, Tintle et al. 2005, Fridley et al. 2008, Heid et al. 2008, Pompanon et al. 2005). However, these error rates are not uniform across all SNPs and some SNPs have measurably larger genotyping error rates (Pompanon et al. 2005). The impact of genotyping errors on case-control tests of genotype-phenotype association is well known. Specifically, non-differential errors (genotyping error rates are the same regardless of phenotype) have no effect on type I error, but do cause inflated type II error (i.e. reduce power) (Gordon & Ott 2001, Gordon et al. 2002, Ahn et al. 2007). Genotyping errors are particularly detrimental to power when the minor SNP allele frequency is low (Gordon et al. 2002, Ahn et al. 2007, Kang et al. 2004, Gordon & Finch 2005).

In addition to laboratory and technology-based approaches to reducing genotyping errors, which seek to address errors at their source, some have proposed the consideration of genotyping errors when designing the study. For example, double sampling (Gordon et al. 2004, Gordon et al. 2007) uses a perfect genotype mechanism (like gene sequencing) on a subset of the sample. Another recent paper discusses how to incorporate genotyping errors when optimizing a two-stage design (Zuo et al. 2008). A third approach involves replicate genotyping (Fridley et al. 2008, Rice & Holmans 2003, Tintle et al. 2007, Lai et al. 2007, Bonin et al. 2004), which means genotyping a random subset of individuals in the sample two or more times, instead of only once.

Duplicate genotyping has been proposed by many for quality control reasons (e.g. Rice & Holmans 2003, Bonin et al. 2004) and it is now a fairly common practice (Tintle et al. 2005, Fridley et al. 2008). Traditionally, duplicate genotype data were ignored in the subsequent statistical analyses and were used only as an initial assessment of data quality. Recently, however, a method was proposed to incorporate duplicate genotype data in standard χ² tests (2 degrees of freedom) of genotype-phenotype association on 2x3 tables (Tintle et al. 2007).

Subsequently, Tintle et al. (to appear) demonstrated the cost-effectiveness of duplicate genotyping (i.e. more power) for use in χ² tests (2 degrees of freedom) when genotyping costs are low relative to phenotyping and sample acquisition costs. It was found that, as a general rule, duplicate genotyping the entire sample increases power when relative genotype to phenotype/sample acquisition costs do not exceed the genotyping error rate. Additionally, when the minor SNP allele frequency is low, duplicate genotyping the entire sample can be cost-effective even when relative costs are greater than the genotyping error rate.

The linear trend test of association (LTT), first proposed by Cochran (1954) and Armitage (1955), has been suggested by many (Sasieni 1997, Slager & Schaid 2001, Freidlin et al. 2002, Zheng et al. 2003, Zheng & Gastwirth 2006) as a method for analyzing SNP genotype data since it can incorporate information about the disease mode of inheritance, and thus increase statistical power by narrowing the focus of the alternative hypothesis. Recently, Ahn et al. (2007) demonstrated the impact of genotyping errors on the LTT. Also, Gordon et al. (2007) demonstrated how to use the LTT when double sample data are collected. In this paper, we demonstrate how to include duplicate genotype data in a LTT. We also explore the utility of including duplicate genotype data in subsequent tests of association if they have been collected for quality control reasons. Lastly, we evaluate the cost-effectiveness of designing a study to collect duplicate genotype data for analysis with the LTT.

Methods

Sampling Strategy

We consider a sampling strategy where a fraction of the entire sample, r (r ∈ [0,1]), is randomly selected to be genotyped exactly twice, while the remaining fraction of the sample, (1–r), is genotyped exactly once. We assume that all samples have been phenotyped as either a “case” or a “control.”

Genotyping Error Assumptions

  1. Let εi,j be the probability of an individual of genotype i being classified as genotype j. Following the error model of Douglas et al. (2002), we assume that ε1,2 = ε2,1 = ε2,3 = ε3,2 and ε1,3 = ε3,1 = 0.
  2. We assume non-differential genotyping errors, meaning that the probability of genotyping errors is the same for each individual in the sample, regardless of case or control status.
  3. We assume that genotyping error probabilities are independent and remain constant from the first to second genotyping. Specifically, we mean that the probability of a genotyping error does not change for an individual’s second genotyping, and is not dependent upon whether they were incorrectly genotyped the first time.
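The three assumptions above can be sketched in code. This is a minimal illustration rather than the authors' implementation: it assumes a single error rate `eps` for the four adjacent-genotype misclassifications (so a true heterozygote is called correctly with probability 1 − 2·eps), and the function names are hypothetical.

```python
from itertools import combinations_with_replacement


def error_matrix(eps):
    """Row g, column j: probability that true genotype g+1 is called as j+1.

    Assumes eps_12 = eps_21 = eps_23 = eps_32 = eps and eps_13 = eps_31 = 0.
    """
    return [
        [1 - eps, eps, 0.0],
        [eps, 1 - 2 * eps, eps],
        [0.0, eps, 1 - eps],
    ]


def single_call_probs(true_probs, eps):
    """Observed genotype probabilities when each sample is genotyped once."""
    e = error_matrix(eps)
    return [sum(true_probs[g] * e[g][j] for g in range(3)) for j in range(3)]


def duplicate_call_probs(true_probs, eps):
    """Probabilities of each unordered pair of calls (i, j), i <= j, when a
    sample is genotyped twice with independent errors (assumption 3)."""
    e = error_matrix(eps)
    probs = {}
    for i, j in combinations_with_replacement(range(3), 2):
        orderings = 1 if i == j else 2  # calls (i, j) and (j, i) are the same pair
        probs[(i + 1, j + 1)] = orderings * sum(
            true_probs[g] * e[g][i] * e[g][j] for g in range(3)
        )
    return probs
```

Because the errors are non-differential (assumption 2), the same two functions would be applied to the case and the control genotype probabilities.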

Notation

  • δm = the frequency of allele m at the SNP marker. In this paper we assume the SNP is bi-allelic, and, thus, m=U,V. We also assume that the SNP marker allele associated with the disease is allele 2.
  • ζn = the frequency of risk allele n at the disease locus. In this paper we assume the disease locus is bi-allelic and we denote the risk allele as B and the non-risk allele as A. Thus, n=A,B.
  • hmn = the frequency of the mn haplotype; that is, the frequency of having both the m allele at the SNP marker and risk allele n at the disease locus. Thus, $\sum_{m=U,V}\sum_{n=A,B} h_{mn} = 1$.
  • D = the unstandardized measure of linkage disequilibrium between the SNP marker, V, and the disease risk allele, B. Thus, D = hVB − δVζB.
  • r² = the measure of the correlation between the SNP marker and the disease risk allele = D²/(δUδVζAζB).
  • ρ = r²/max(r²) = a measure of the correlation of SNP allele V and disease risk allele B as a fraction of their maximum possible correlation. As pointed out by Amos (2007), max(r²) < 1 unless δV = ζB. We also note that for any values of δV and ζB, max(r²) is attained when D′ = 1, where D′ = D/min(δUζB, δVζA).
  • ϕ = the disease prevalence in the population.
  • fj1j2 = the penetrance of the disease given genotype j1j2 at the disease locus. Thus, fBB is the probability someone who is BB at the disease locus (homozygote for the risk allele) has the disease, fAB is the probability someone who is AB at the disease locus (heterozygote for the risk allele) has the disease, and fAA is the probability someone who is AA at the disease locus (homozygote for the non-risk allele) has the disease.
  • γ = a general relative risk of disease parameter which is used to compute genotype specific relative risks (γBB and γAB) in ways that are dependent upon the mode of inheritance of the disease (dominant, additive, recessive).
  • pi = the probability of genotype i in the cases, i=1,2,3
  • qi = the probability of genotype i in the controls, i=1,2,3
  • pi* = the probability of observing genotype i in the cases assuming there are genotyping errors, and the sample is genotyped exactly once i=1,2,3
  • qi* = the probability of observing genotype i in the controls assuming there are genotyping errors, and the sample is genotyped exactly once i=1,2,3
  • pij* = the probability of observing genotype i once and genotype j once in the cases assuming there are genotyping errors, and the sample is genotyped exactly twice i=1,2,3, j=1,2,3 and ij.
  • qij* = the probability of observing genotype i once and genotype j once in the controls assuming there are genotyping errors, and the sample is genotyped exactly twice i=1,2, 3, j=1,2,3 and ij.
  • T = the total number of cases
  • S = the total number of controls
  • N = T + S = the total sample size
  • k = S/T = ratio of controls to cases
  • c = the relative cost of genotyping to phenotyping/sample acquisition
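To make the linkage disequilibrium quantities in the notation concrete, the following sketch computes D, r², D′ and ρ from a haplotype frequency and the two allele frequencies. It is illustrative only: the function name is hypothetical, and the D′ expression is the one given above, which applies when D ≥ 0.

```python
def ld_measures(h_vb, delta_v, zeta_b):
    """Return (D, r2, D_prime, rho) for marker allele V and risk allele B.

    h_vb    : frequency of the VB haplotype
    delta_v : frequency of marker allele V
    zeta_b  : frequency of risk allele B
    Uses D' = D / min(delta_U * zeta_B, delta_V * zeta_A), i.e. assumes D >= 0.
    """
    delta_u = 1 - delta_v
    zeta_a = 1 - zeta_b
    denom = delta_u * delta_v * zeta_a * zeta_b

    d = h_vb - delta_v * zeta_b
    r2 = d ** 2 / denom

    d_max = min(delta_u * zeta_b, delta_v * zeta_a)  # value of D when D' = 1
    d_prime = d / d_max
    max_r2 = d_max ** 2 / denom
    rho = r2 / max_r2
    return d, r2, d_prime, rho
```

For example, with δV = ζB = 0.2 and hVB = 0.2 the marker and risk alleles always occur together, and the function returns r² = D′ = ρ = 1 (up to floating-point rounding).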

Contingency Tables for a Study with Duplicate Genotype Data

When a fraction (r) of the sample has been duplicate genotyped, and the SNP marker under consideration has three possible genotypes (1=UU, 2=UV and 3=VV), data can be summarized into two tables, as shown in Tables 1a and 1b. We assume that an equal fraction of both cases and controls has been duplicate genotyped.

Table 1a
Single genotyped data
Table 1b
Duplicate genotyped data

Using a weighting strategy for duplicate genotype data presented by Tintle et al. (2007), Tables 1a and 1b can be combined into a single table (Table 1c) as follows:

$$t'_i = t_i + t_{ii} + 0.5\,t_{ij} + 0.5\,t_{ik} \tag{1}$$

where i ≠ j and i ≠ k, with a similar equation for the controls. As shown in Tintle et al. (2007), using equal weights (0.5) for the inconsistently identified individuals is optimal.
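The allocation in Equation (1) can be sketched as follows for one phenotype group. The data layout is an assumption made for illustration: single-genotyped counts are a list indexed by genotype, and duplicate-genotyped counts are a dictionary keyed by the unordered pair of calls.

```python
def combine_counts(single, duplicate):
    """Combine single- and duplicate-genotyped counts as in Equation (1).

    single    : [t_1, t_2, t_3], counts of samples genotyped once
    duplicate : {(i, j): count} for unordered call pairs with i <= j
    Returns [t'_1, t'_2, t'_3].
    """
    combined = [float(count) for count in single]
    for (i, j), count in duplicate.items():
        if i == j:
            # Consistent duplicate calls count fully toward genotype i.
            combined[i - 1] += count
        else:
            # Inconsistent calls are split equally between the two genotypes.
            combined[i - 1] += 0.5 * count
            combined[j - 1] += 0.5 * count
    return combined
```

The combined counts sum to the group size, so every individual contributes a total weight of one to the combined table.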

Table 1c
Combined data contingency table

Disease Modes of Inheritance

We consider three disease modes of inheritance (MOI): Dominant (γBB = γAB = γ), Additive (γAB = γ, γBB = 2γ − 1), and Recessive (γBB = γ, γAB = 1), where γBB = the relative risk of disease for a participant with two copies of the risk allele (fBB / fAA) and γAB = the relative risk of disease for a participant with one copy of the risk allele (fAB / fAA). Note that when γBB = γAB = 1 the null hypothesis of no association between genotype and disease is true.
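As an illustration of how the three modes of inheritance translate γ into genotype-specific relative risks, consider the sketch below. The penetrance calculation additionally assumes Hardy-Weinberg proportions at the disease locus, which is an assumption of this example and is not stated in the text; the function names are hypothetical.

```python
def genotype_relative_risks(gamma, moi):
    """Return (gamma_AB, gamma_BB) for a general relative risk and an MOI."""
    if moi == "dominant":
        return gamma, gamma
    if moi == "additive":
        return gamma, 2 * gamma - 1
    if moi == "recessive":
        return 1.0, gamma
    raise ValueError(f"unknown mode of inheritance: {moi}")


def penetrances(prevalence, zeta_b, gamma, moi):
    """Return (f_AA, f_AB, f_BB).

    ASSUMPTION (not from the paper): genotypes at the disease locus are in
    Hardy-Weinberg proportions, so that
    prevalence = f_AA * (P(AA) + P(AB) * gamma_AB + P(BB) * gamma_BB).
    """
    gamma_ab, gamma_bb = genotype_relative_risks(gamma, moi)
    zeta_a = 1 - zeta_b
    mean_relative_risk = (
        zeta_a ** 2 + 2 * zeta_a * zeta_b * gamma_ab + zeta_b ** 2 * gamma_bb
    )
    f_aa = prevalence / mean_relative_risk
    return f_aa, gamma_ab * f_aa, gamma_bb * f_aa
```

With γ = 1 every mode of inheritance returns relative risks of 1, which corresponds to the null hypothesis.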

Linear Trend Test

As noted earlier, the linear trend test (LTT) is a powerful choice for the analysis of case-control studies of genetic association because of the ability to include information about the disease mode of inheritance (Sasieni 1997, Slager & Schaid 2001, Zheng & Gastwirth 2006). The traditional LTT statistic is U/σU, where U is a statistic based on the disease mode of inheritance and the observed cell counts in the 2x3 contingency table (e.g. Table 1a), and σU is estimated based on the observed cell counts in the same table. In this paper, we extend the traditional version of the test to include duplicate genotype data, proposing the LTTd (see Results: Finding the LTTd Statistic). In short, the LTTd uses the strategy proposed by Tintle et al. (2007) to allocate individuals who have been inconsistently duplicate genotyped to each of the two genotypes to which they have been genotyped (see Equation (1)). This strategy, however, means that the resulting contingency table of phenotype-genotype (Table 1c) no longer has a multinomial distribution due to increased covariance between cells, requiring the introduction of the LTTd.

In developing the LTTd, we also address the issue of bias in the σU estimate. Freidlin et al. (2002) demonstrated that the method of estimating σU as considered by Slager and Schaid (2001) was biased and thus provided invalid results. Zheng and Gastwirth (2006) consider two alternatives to the Slager and Schaid approach which they call “case-control” (cc) and “control” (c). The Slager and Schaid method estimates σU assuming that the null hypothesis of no genotype-phenotype association is true. The cc method estimates σU without the restriction of the null hypothesis being true, whereas the c method is similar but only uses the sample of controls. Zheng and Gastwirth find, and we confirmed in our own attempts to implement the method, that the c method increases the type I error in some cases (results not shown). Thus, we choose to base our results only on the cc method.

Simulation Study

To confirm that the empirical distribution of the LTTd (derived later; see Results: Finding the LTTd Statistic) follows the theoretical asymptotic distribution (χ² with 1 degree of freedom) for practical sample sizes, we conducted a simulation study (see Table 2 for parameters and values used).

Table 2
Parameter values for the simulation study

We examined all possible combinations of parameter values and so a total of 17,496 settings were evaluated. The simulation study was conducted as follows:

Step 1. For given values of δV, ζB, ρ, ϕ, γ, and the disease MOI, the true genotype probabilities (pi and qi, i=1,2,3) were computed. See Ahn et al. (2007) for details.

Step 2. The true genotype probabilities (pi and qi, i=1,2,3) were then adjusted to reflect the genotyping error rate (ε), yielding pij*, qij*, pi* and qi*. See Tintle et al. (2007) for details.

Step 3. For given values of k, r, and n, and the observed single and duplicate genotyping probabilities (pij*, qij*, pi* and qi*) found in Step 2, entries in Tables 1a and 1b were randomly simulated. For each combination of parameter values in Table 2, 2,000 random tables were simulated. In cases where γ = 1 (null hypothesis is true), the type I error rate was analyzed by comparing the nominal significance level α (we examined 0.05, 0.005, and 0.0002) with the empirical α level. In cases where γ = 1.25 or γ = 2.00 (i.e. the alternative hypothesis is true), the empirical power was compared to the theoretical power (see equation (A3) in the Appendix).
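Step 3 can be sketched as a pair of multinomial draws for one phenotype group. This is an illustrative sketch, not the authors' simulation code: the function name, the rounding of the number of duplicated samples and the dictionary layout for duplicate calls are all assumptions.

```python
import random


def simulate_group(n, r, single_probs, duplicate_probs, rng):
    """Simulate one row of Table 1a and one row of Table 1b for n samples.

    n               : number of cases (or controls)
    r               : fraction of the group that is genotyped twice
    single_probs    : [p_1*, p_2*, p_3*], observed single-call probabilities
    duplicate_probs : {(i, j): p_ij*} for unordered call pairs with i <= j
    rng             : a random.Random instance (for reproducibility)
    """
    n_dup = round(r * n)        # assumption: round to the nearest integer
    n_single = n - n_dup

    single_calls = rng.choices([1, 2, 3], weights=single_probs, k=n_single)
    pairs = list(duplicate_probs)
    pair_weights = [duplicate_probs[p] for p in pairs]
    dup_calls = rng.choices(pairs, weights=pair_weights, k=n_dup)

    single_counts = [single_calls.count(g) for g in (1, 2, 3)]
    dup_counts = {p: dup_calls.count(p) for p in pairs}
    return single_counts, dup_counts
```

Calling the function once with the case probabilities and once with the control probabilities yields one simulated data set; repeating this 2,000 times per setting mirrors the design described in Step 3.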

Cost-effectiveness Computational Study

We completed a computational study comparing theoretical power values for different values of r (duplicate genotyping percentage), c (relative genotyping costs) and other parameters. Table 3 shows the settings used for this study. We examined all 10,368 possible combinations of parameter values based on Table 3.

Table 3
Parameters and values for the computational study

The computational study was carried out as follows:

  • Step 1. Assuming there are no genotyping errors, for given values of δV, ζ B, ρ, ϕ, γ and the disease type, the genotype probabilities were computed as if no duplicates were obtained. These values were then used to find the sample size needed (N0) to yield the specified power level (80% or 95%).
  • Step 2. Find the budget (B) needed to conduct the study with no duplicates as B = (1+c)N0, where c is the genotyping cost per person relative to phenotyping/acquisition cost.
  • Step 3. Assuming there is duplicate genotyping (r>0), the sample size that can be obtained for the same budget, B, is found as $N_r = B/(1 + c(1+r))$. Nr can then be used in the power computation formula (A3) in the Appendix to find the power using duplicate genotyping for that sample size. Then we find the optimal value of r that yields the largest power of the test. All computations used α=0.0002.
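The budget arithmetic in Steps 2 and 3 is simple enough to sketch directly. Costs are expressed in units of the per-person phenotyping/acquisition cost, as in the text; the function names are hypothetical.

```python
def study_budget(n0, c):
    """Budget for n0 samples genotyped once: B = (1 + c) * N0."""
    return (1 + c) * n0


def affordable_sample_size(budget, c, r):
    """Sample size for a fixed budget when a fraction r is genotyped twice:
    N_r = B / (1 + c * (1 + r))."""
    return budget / (1 + c * (1 + r))


# Example: 1,000 samples are affordable without duplicates at c = 0.01.
b = study_budget(1000, 0.01)
n_all_duplicated = affordable_sample_size(b, 0.01, 1.0)  # about 990 samples
```

In this example, duplicate genotyping everyone costs roughly ten samples, which is the trade-off that the power comparison evaluates.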

Results

Finding the LTTd Statistic

Zheng and Gastwirth (2006) present a test statistic for the LTT as $Z = U/\sqrt{V}$, where $U = \sum_i x_i\left(\frac{S}{N}t_i - \frac{T}{N}s_i\right)$ and V is an estimate of the variance of U. Tintle et al. (2007) showed that, by using the allocation strategy (Equation (1)), Tables 1a and 1c estimate the same quantities. Thus, the numerator of the Zheng and Gastwirth Z statistic becomes:

$$U_d = \sum_i x_i\left(\frac{S}{N}t'_i - \frac{T}{N}s'_i\right) \tag{2}$$

According to the Central Limit Theorem, Table 1c has an approximately multivariate normal distribution (see also Tintle et al. 2007). The expected value of Ud under the null hypothesis ($p_i^* = q_i^*$ for all i) is zero (see Equation (A1) in the Appendix). Thus,

$$\mathrm{LTT}_d = \frac{U_d}{\sqrt{\mathrm{Var}(U_d)}} \tag{3}$$

has a standard normal distribution and, therefore, $(\mathrm{LTT}_d)^2$ has a χ² distribution with 1 degree of freedom. Following the results of Zheng and Gastwirth (2006), the expression for Var(Ud) follows from (A2; Appendix), using $p_i^* = r_i/R_s$, $q_i^* = s_i/S_s$ and $p_{ij}^* = r_{ij}/R_d$, $q_{ij}^* = s_{ij}/S_d$. Equation (A2) also accounts for additional covariance between cells in Table 1c from using the allocation strategy.
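The numerator of the statistic, Ud, can be computed directly from the combined table. The sketch below covers only this numerator; the variance in Equation (A2) is not implemented here. The genotype scores are passed in by the caller; a commonly used additive coding such as (0, 0.5, 1) is an assumption of the example, not something specified in the text.

```python
def trend_numerator(case_counts, control_counts, scores):
    """U_d = sum_i x_i * ((S/N) * t'_i - (T/N) * s'_i).

    case_counts    : [t'_1, t'_2, t'_3] from the combined table
    control_counts : [s'_1, s'_2, s'_3] from the combined table
    scores         : [x_1, x_2, x_3], genotype scores chosen for the MOI
    """
    t = sum(case_counts)       # total number of cases, T
    s = sum(control_counts)    # total number of controls, S
    n = t + s
    return sum(
        x * ((s / n) * tc - (t / n) * sc)
        for x, tc, sc in zip(scores, case_counts, control_counts)
    )


# Example with fractional counts produced by the allocation strategy.
u_d = trend_numerator([15, 27.5, 8.5], [30, 15, 6], [0, 0.5, 1])
```

When cases and controls have identical combined counts, the function returns zero, in line with the zero expectation of Ud under the null hypothesis.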

Simulation Results for LTTd

As described earlier (Methods: Simulation study), a simulation study was conducted to ensure that nominal type I and type II error rates obtained using the asymptotic theory of the LTTd were maintained in practice. First we consider the distribution of LTTd if the null hypothesis is true (γ = 1) and then the distribution of LTTd if the alternative hypothesis is true (γ ≠ 1).

Simulation Results for LTTd under the Null Hypothesis

For each combination of parameter values, a 99% confidence interval was found for the empirical α. For both the dominant and additive models, nominal type I error rates were maintained empirically regardless of sample size since an expected number of simulation settings had a 99% confidence interval on the empirical α that did not contain the nominal α (1.2% and 1.2% for dominant and additive, respectively, for α=0.05, 1.2% and 1.3% for the α=0.005 level and 1.1% and 0.7% for the α=0.0002 level). Nominal type I error rates were maintained empirically for the recessive model as long as the minimum cell count in Table 1c was at least 5 (detailed results not shown).
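A confidence interval of this kind can be sketched as below. The text does not state which interval was used; the normal-approximation (Wald) interval here is an assumption chosen for simplicity, and the function name is hypothetical.

```python
from statistics import NormalDist


def empirical_rate_ci(rejections, n_sims, level=0.99):
    """Normal-approximation confidence interval for an empirical rejection
    rate (empirical alpha or empirical power) from n_sims simulated tables.

    ASSUMPTION: a Wald interval; the paper does not specify the interval type.
    """
    p_hat = rejections / n_sims
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * (p_hat * (1 - p_hat) / n_sims) ** 0.5
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Example: 100 rejections in 2,000 simulated tables at nominal alpha = 0.05.
low, high = empirical_rate_ci(100, 2000)
```

With 2,000 simulations per setting and a 99% interval, about 1% of settings would be expected to have an interval that misses the nominal value by chance alone, which is the benchmark used in the paragraph above.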

Simulation Results for LTTd under the Alternative Hypothesis

The LTTd statistic generally gives comparable theoretical and empirical power values across all simulation settings for the additive and dominant models as long as expected cell counts in Table 1c are at least 5. For each combination of parameter values, a 99% confidence interval was placed on the empirical power. For both the dominant and additive models when the minimum cell count in Table 1c was at least 5, an expected number of simulation settings had a 99% confidence interval on the empirical power that did not contain the theoretical power (0.9% and 1.2% for dominant and additive, respectively, for α=0.05, 1.3% and 0.9% for the α=0.005 level and 1.7% and 1.3% for the α=0.0002 level). When the minimum cell count was less than 5 in the dominant and additive models, the empirical power was often still very close to theoretical power (results not shown). The recessive model with small δV showed significant differences between theoretical and empirical power (detailed results not shown), though theoretical power and empirical power were similar for larger values of δV.

Recommendations for Use of a Permutation Test

Based on the simulation study, differences in theoretical and empirical type I and type II errors are possible when the recessive disease model is used, in cases where at least one cell count in the grouped table is less than 5, or in cases where the total sample is less than 1,000 individuals. In these cases we recommend estimating p-values for the LTTd by permuting phenotype status instead of using the asymptotic theory provided above. A permutation based p-value is available in our software (see Results: Software).
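A permutation p-value of the kind recommended above can be sketched as follows. This is not the authors' software: it permutes case/control labels across individuals, uses |Ud| as the permutation statistic rather than the studentized LTTd, and adds one to the numerator and denominator of the p-value; all of these are simplifying assumptions of the example.

```python
import random


def allocation(calls):
    """Genotype weights for one individual: a single call gets weight 1,
    two calls get weight 0.5 each (Equation (1))."""
    weights = [0.0, 0.0, 0.0]
    for g in calls:
        weights[g - 1] += 1.0 / len(calls)
    return weights


def u_d(weights, is_case, scores):
    """U_d computed individual by individual from allocation weights."""
    n = len(weights)
    t = sum(is_case)            # number of cases
    s = n - t                   # number of controls
    total = 0.0
    for w, case in zip(weights, is_case):
        score = sum(x * wi for x, wi in zip(scores, w))
        total += (s / n) * score if case else -(t / n) * score
    return total


def permutation_p_value(calls, is_case, scores, n_perm=2000, seed=0):
    """Two-sided permutation p-value obtained by shuffling phenotype labels.

    calls   : list of tuples, e.g. (2,) for one call or (1, 2) for two calls
    is_case : list of booleans, True for cases
    """
    rng = random.Random(seed)
    weights = [allocation(c) for c in calls]
    observed = abs(u_d(weights, is_case, scores))
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if abs(u_d(weights, labels, scores)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Shuffling labels keeps each individual's single or duplicate calls intact, so the permutation distribution reflects the extra covariance introduced by the allocation strategy without requiring the variance formula.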

Example

In Tintle et al. (2007), duplicate genotype data from a case-control study on bipolar disorder was presented for a SNP with inconsistently genotyped individuals where all individuals were duplicate genotyped. We present this data here (Table 4) in the form of Table 1b using a linear trend test for analysis to demonstrate the utility of the methods just developed.

Table 4
Duplicate genotype data from a study of bi-polar disorder

Tintle et al. (2007) report a p-value of 0.061 from the χ² test (2 degrees of freedom) ignoring inconsistently identified individuals and 0.064 from the test including inconsistencies. Using our software and assuming an additive mode of inheritance, the linear trend test just presented yields a p-value of 0.0230 ignoring inconsistents, 0.0241 including inconsistents using the method shown above and 0.0245 using a permutation test with 2000 permutations.

Cost-effectiveness of Duplicate Genotyping Using Previously Collected Data

Initially, we consider the case of including previously collected quality control data in the test of association, so that the duplicate genotypes add no cost (c = 0). In every case examined, the power of the LTTd is higher when the duplicate genotype data are included than when they are not. In other words, it is better to include the duplicate genotype data in subsequent tests of association than to ignore inconsistencies and treat the data as missing. This result is consistent with the results of Tintle et al. (2007) for the χ² test (2 degrees of freedom) of association.

Evaluating the Cost-effectiveness of Collecting Duplicate Genotype Data

The most important case, however, is when c>0. That is, when we view the collection of duplicate genotype data as an a priori design decision, and thus must account for the cost of collecting the duplicates for a fraction, r, of the sample.

Given a fixed budget, in 49.2% of cases examined (see Table 3) where c>0, duplicate genotyping the entire sample (r=1) was found to be the most cost-effective design strategy (yields the highest power). In all remaining cases, r=0 provided the highest power. Thus, the optimal strategy is always “all or nothing.”

In order to characterize situations where duplicate genotyping will be cost-effective, logistic regression models were used with all parameters predicting whether or not duplicate genotyping the entire sample was the most cost-effective design. Three parameters (δV, c and ε) had the strongest relationship with cost-effectiveness. Relative cost, c, had the strongest relationship (Wald χ²=1424.2, p<0.0001), genotyping error rate ε also had a very strong relationship (Wald χ²=1259.2, p<0.0001) and minor allele frequency (δV) was also strongly related (Wald χ²=465.6, p<0.0001). As minor allele frequency (δV) declined, costs (c) declined, or genotyping error rate (ε) increased, duplicate genotyping the entire sample was more likely to be the optimal design decision.

Table 5 shows the percentage of cases examined in the computational study where duplicate genotyping the entire sample is the most effective design decision for different values of c (relative genotyping costs) and ε (genotyping error rate).

Table 5
Percent of cases where genotyping is cost-effective

Table 5 demonstrates a general rule of thumb: duplicate genotyping the entire sample will always be cost-effective (regardless of δV) if c ≤ ε. Table 5 also demonstrates that duplicate genotyping is sometimes cost-effective when c > ε. While details are not shown in the table, when δV is small, duplicate genotyping can be cost-effective even when c > ε.

Example Power Values

Table 6 provides power values for a specific example. Specifically, we present power under different values of δV, ε, r and c for a disease with a prevalence of 2.5%, disease allele frequency of 5%, equal number of cases and controls (k=1), and a SNP marker and disease allele that are in perfect linkage disequilibrium (ρ = 1).

Table 6
Example of comparative power values

The sample size needed to yield 80% power was calculated assuming there was no genotyping error. Column III shows the power for that sample size, after taking the genotyping error into account. When the marker frequency is low and/or the genotyping error rate is larger, Column III demonstrates that power can be significantly impacted by genotyping errors. Columns IV–VII then reduce the sample size to maintain the budget, reflecting the additional cost of collecting duplicate genotype data on all samples at different genotype costs (c). As c decreases, ε increases and δV decreases, Columns IV–VIII demonstrate that duplicate genotyping becomes more cost-effective. We note that in no case does duplicate genotyping meet or exceed the error-free power of 80% (detailed results not shown); however, duplicate genotyping can successfully mitigate some of the power loss due to genotyping errors.

Note that the power values in Table 6 are from a specific example. Please use our software (Results: Software) to investigate power at values specific to your research situation while keeping in mind the rule of thumb presented in Table 5.

Recommendations for Use

In practice, duplicate genotyping should be considered when relative genotype to phenotype/sample acquisition costs do not exceed the expected SNP genotyping error rate. A more detailed treatment of practical considerations when using duplicate genotyping is provided in Tintle et al. (to appear). We summarize three main considerations here.

First, calculations provided in this manuscript consider only a single SNP. However, in practice, the decision to duplicate genotype will need to be made for an entire set of SNPs (e.g. all of the SNPs on a chip). In these cases using the same rule of thumb (duplicate genotype if c ≤ ε) is appropriate, where the ε used is the minimum error rate expected for any single SNP.

Second, if there is concern that some samples may be of low quality and, thus, have higher genotyping error rates than other samples (a violation of genotyping error assumption #2), the error rate, ε, used in the c ≤ ε rule of thumb should be the minimum expected ε for the high quality samples. Note, however, that we are still assuming non-differential errors in this case. Differential errors may increase the type I error rate, and are not considered in this manuscript.

Third, GWA studies are typically conducted in two stages where all markers are genotyped on a sample of individuals, and then a subset of the markers is genotyped on a sample of additional individuals. When considering the use of duplicate genotyping in two-stage studies, the decision on the use of duplicate genotyping should be made separately at each stage since the relative cost of genotyping to phenotyping will be different at each stage.

Software

To facilitate the utilization of the methods discussed in this paper, we provide two companion pieces of software for this work. The first computes the LTTd statistic and provides an asymptotic and permutation p-value. The second provides power computations for different genotyping costs, allele frequencies and error rates to assist in the duplicate genotyping design decision. Software is available at http://math.hope.edu/tintle/duplicate.html (source code written in R).

Conclusions

This work demonstrates how duplicate genotype data can be included in a linear trend test (LTT) of genetic association. Duplicate genotype data are included in the LTT through a weighting strategy and a subsequent adjustment of the variance of the LTT statistic yielding the LTTd. We demonstrate via simulation that the asymptotic null and alternative distributions of the LTTd statistic are obtained with reasonably small sample sizes in most cases. Both asymptotic and permutation test p-values are available in the free companion software.

We demonstrate that in the case of no duplicate genotyping costs (e.g. the data has already been collected) including the duplicate data in the LTTd always increases statistical power. This confirms a similar result in Tintle et al. (2007).

We also consider the cost-effectiveness of designing a study to collect duplicate genotype data, and find that when the relative cost of genotyping to phenotype/sample acquisition costs (c) is less than or equal to the genotyping error rate (ε), collecting duplicate genotype data on the entire sample is cost-effective. Further, we find that the optimal amount of duplicate genotyping, in these cases, will always involve duplicate genotyping the entire sample. In a two-stage GWA study for a complex disease, if a relatively small set of SNPs is being followed up at stage 2 and it will be costly to enroll more subjects, duplicate genotyping may be cost-effective since relative genotyping to phenotyping/acquisition costs c will be low.

Since the rule of thumb just described is conservative, it is important to note that duplicate genotyping will be cost-effective in many situations when c > ε. This rule was provided to allow researchers to quickly assess the cost-effectiveness of duplicate genotyping on a large scale. It is quite likely that, even if c > ε, duplicate genotyping may provide moderate power gains for SNPs with low minor SNP allele frequency. Our software should be used to determine cost-effectiveness of duplicate genotyping for specific experimental conditions.

We assume that genotyping errors are independent from the first to second genotyping (genotyping error assumption #3) and that genotyping error rates are non-differential (genotyping error assumption #2). Future work is needed to extend results to consider differential genotyping errors when duplicate genotyping. Further reading on sources of genotyping error and their impact on analyses can be found in Bonin et al. (2004) and Gordon and Finch (2005). We also assume that duplicate genotyping is applied to a random subsample of size nr. Further work is necessary to explore optimizing the value of r depending upon phenotype or initial genotype classification.

When collected, duplicate genotype data should always be included in the subsequent test of association and in many realistic cases duplicate genotype data should be collected on the entire sample.

Acknowledgments

Bryce Borchers, Marshall Brown and Brian McLellan contributed equally to this work. This project was funded in part by a grant from the National Institutes of Health, R15-HG004543. The content is solely the responsibility of the authors and does not necessarily represent the official view of the National Human Genome Research Institute or the National Institutes of Health. Additionally, this project received support from the Tanis Fund for Statistics Research.

Appendix

Finding the Mean of Ud

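$$E(U_d) = E\left(\sum_i x_i\left(\frac{S}{N}t'_i - \frac{T}{N}s'_i\right)\right) = \sum_i x_i\left(\frac{S\,T\,p_i^*}{N} - \frac{T\,S\,q_i^*}{N}\right) = \frac{ST}{N}\sum_i x_i\left(p_i^* - q_i^*\right) \tag{A1}$$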

Finding the Variance of Ud

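$$
\begin{aligned}
\mathrm{Var}(U_d) &= \mathrm{Var}\left(\sum_{i=1}^{3} x_i\left(\frac{S}{N}t'_i - \frac{T}{N}s'_i\right)\right) \\
&= \left(\frac{S}{N}\right)^2 \mathrm{Var}\left(\sum_{i=1}^{3} x_i t'_i\right) + \left(\frac{T}{N}\right)^2 \mathrm{Var}\left(\sum_{i=1}^{3} x_i s'_i\right) \\
&= \left(\frac{S}{N}\right)^2 \left[\sum_{i=1}^{3}\left(x_i^2\,\mathrm{Var}(t'_i) + 2\sum_{\substack{j=2 \\ i<j}}^{3} x_i x_j\,\mathrm{Cov}(t'_i, t'_j)\right)\right] \\
&\quad + \left(\frac{T}{N}\right)^2 \left[\sum_{i=1}^{3}\left(x_i^2\,\mathrm{Var}(s'_i) + 2\sum_{\substack{j=2 \\ i<j}}^{3} x_i x_j\,\mathrm{Cov}(s'_i, s'_j)\right)\right]
\end{aligned}
\tag{A2}
$$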

Below we show how to find Var(t1) and Cov (t1, t2), other terms can be found similarly.

Var(t'i)=Var(t1+t11+12ij(t12+t13))=Var(t1)+Var(t11)+14Var(t12)+14Var(t13)+ Cov(t11,t12)+Cov(t11,t13)+12Cov(t12,t13)=Tsp1*(1p1*)+TD(p11*(1p11*)+14p12*(1p12*)+14p13*(1p13*)p11*p12*p11*p13*12p12*p13*)

Cov(t'1,t'2)=E[t'1t'2]E[t'1]·E[t'2] whereE[t'1]=E[t1]+E[t11]+12E[t12]+12E[t13]=NTsp1+NTd(p11*+12p12*+12p13*)E[t'2]=E[t2]+E[t22]+12E[t12]+12E[t23]=NTsp2+NTd(p22*+12p12*+12p23*)E[t'1t'2]=E[t1t2]+E[t1]E[t22]+E[t11]E[t2]+E[t11t22]+12(E[t1]E[t12]+E[t1]E[t23]+E[t13]E[t2])+12(E[t11t12]+E[t11t23]+E[t12t2]+E[t22t12]+E[t22t13])+14(E[t12t12]+E[t12t23]+E[t12t13]+E[t12t23])

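We make the following substitutions in $E[t'_1 t'_2]$ as appropriate:

$$
\begin{aligned}
E[t_i] &= T_s\,p_i^*, \qquad E[t_{ij}] = T_D\,p_{ij}^* \quad \text{for } i = j \text{ or } i \neq j, \\
E[t_i t_i] &= \mathrm{Var}(t_i) + E[t_i]^2, \qquad E[t_{ii}t_{ii}] = \mathrm{Var}(t_{ii}) + E[t_{ii}]^2, \\
E[t_i t_j] &= \mathrm{Cov}(t_i,t_j) + E[t_i]E[t_j] \quad \text{for } i \neq j, \text{ and} \\
E[t_{ij}t_{ik}] &= \mathrm{Cov}(t_{ij},t_{ik}) + E[t_{ij}]E[t_{ik}] \quad \text{for } j \neq k.
\end{aligned}
$$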
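The variance and covariance terms above can be checked numerically. The following is a minimal sketch, assuming (as the variance expression implies) that the single-genotyped counts and the duplicate-genotyped pair counts are independent multinomials with T_s and T_D trials. The helper `multinomial_cov`, the category ordering, and all numeric values are hypothetical and are not taken from the paper's software.

```python
def multinomial_cov(n, probs, a, b):
    """Cov(sum_k a_k c_k, sum_k b_k c_k) for counts c ~ Multinomial(n, probs)."""
    mean_ab = sum(a_k * b_k * p_k for a_k, b_k, p_k in zip(a, b, probs))
    mean_a = sum(a_k * p_k for a_k, p_k in zip(a, probs))
    mean_b = sum(b_k * p_k for b_k, p_k in zip(b, probs))
    return n * (mean_ab - mean_a * mean_b)


# Hypothetical design and observed frequencies (illustration only).
T_s, T_D = 600, 400                          # cases genotyped once / twice
p_single = (0.60, 0.30, 0.10)                # p_1*, p_2*, p_3*
# Duplicate categories in the order 11, 12, 13, 22, 23, 33.
p_dup = (0.58, 0.02, 0.00, 0.29, 0.01, 0.10)

# Weights defining t'_1 = t_1 + t_11 + (t_12 + t_13)/2 and
#                  t'_2 = t_2 + t_22 + (t_12 + t_23)/2.
w1_single, w1_dup = (1, 0, 0), (1, 0.5, 0.5, 0, 0, 0)
w2_single, w2_dup = (0, 1, 0), (0, 0.5, 0, 1, 0.5, 0)

# Independence of the two parts lets their contributions add.
var_t1 = (multinomial_cov(T_s, p_single, w1_single, w1_single)
          + multinomial_cov(T_D, p_dup, w1_dup, w1_dup))
cov_t1_t2 = (multinomial_cov(T_s, p_single, w1_single, w2_single)
             + multinomial_cov(T_D, p_dup, w1_dup, w2_dup))

# Closed form for Var(t'_1) from the appendix, for comparison.
p1 = p_single[0]
p11, p12, p13 = p_dup[0], p_dup[1], p_dup[2]
closed_form = T_s * p1 * (1 - p1) + T_D * (
    p11 * (1 - p11) + 0.25 * p12 * (1 - p12) + 0.25 * p13 * (1 - p13)
    - p11 * p12 - p11 * p13 - 0.5 * p12 * p13
)

print(var_t1, closed_form, cov_t1_t2)
```

Writing each primed count as a weighted sum of multinomial counts avoids expanding the cross-product expectations by hand; the bilinear covariance formula yields the same quantity as the term-by-term substitution.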

Power of the LTTd

Following Zheng and Gastwirth (2006), the asymptotic power of the LTTd can be computed using the following formula:

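$$
1 - \Phi\left(\frac{z_{1-\alpha/2}\,\sigma_d - \mu_d}{\sigma_d}\right) + \Phi\left(\frac{z_{\alpha/2}\,\sigma_d - \mu_d}{\sigma_d}\right) = 1 - \Phi\left(z_{1-\alpha/2} - \frac{\mu_d}{\sigma_d}\right) + \Phi\left(z_{\alpha/2} - \frac{\mu_d}{\sigma_d}\right) \tag{A3}
$$

where $\sigma_d = \sqrt{\mathrm{Var}(U_d)}$, $\mu_d = E(U_d)$, and $\Phi$ is the standard normal cumulative distribution function.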
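The power formula (A3) is straightforward to evaluate once the mean and variance of the statistic are known. The sketch below uses only the Python standard library; the function name `lttd_power` and the example inputs are hypothetical, and the code is an illustration of the formula rather than the software distributed with the paper.

```python
from math import sqrt
from statistics import NormalDist


def lttd_power(mu_d, var_d, alpha=0.05):
    """Asymptotic two-sided power from (A3), given E(U_d) and Var(U_d)."""
    std_normal = NormalDist()               # standard normal distribution
    shift = mu_d / sqrt(var_d)              # mu_d / sigma_d
    z_upper = std_normal.inv_cdf(1 - alpha / 2)
    z_lower = std_normal.inv_cdf(alpha / 2)
    return (1 - std_normal.cdf(z_upper - shift)
            + std_normal.cdf(z_lower - shift))


# Hypothetical mean and variance of the test statistic (illustration only).
power = lttd_power(mu_d=40.0, var_d=150.0, alpha=0.05)
print(f"asymptotic power = {power:.3f}")
```

Two sanity checks follow from the formula: when the mean is zero the power equals the significance level, and the power depends on the mean only through its absolute value.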

References:

  • Ahn K, Haynes C, Kim W, Fleur RS, Gordon D, Finch SJ. The effects of SNP genotyping errors on the power of the Cochran-Armitage linear trend test for case/control association studies. Annals of Human Genetics. 2007;71(Pt 2):249–261. doi: 10.1111/j.1469-1809.2006.00318.x.
  • Amos CI. Successful design and conduct of genome-wide association studies. Human Molecular Genetics. 2007;16(2):R220–5. doi: 10.1093/hmg/ddm161.
  • Armitage P. Tests for linear trends in proportions and frequencies. Biometrics. 1955;11:375–386. doi: 10.2307/3001775.
  • Bonin A, Bellemain E, Bronken Eidesen P, Pompanon F, Brochmann C, Taberlet P. How to track and assess genotyping errors in population genetics studies. Molecular Ecology. 2004;13(11):3261–3273. doi: 10.1111/j.1365-294X.2004.02346.x.
  • Cochran WG. Some methods for strengthening the common chi-squared tests. Biometrics. 1954;10:417–451. doi: 10.2307/3001616.
  • Douglas JA, Skol AD, Boehnke M. Probability of detection of genotyping errors and mutations as inheritance inconsistencies in nuclear-family data. American Journal of Human Genetics. 2002;70(2):487–495. doi: 10.1086/338919.
  • Freidlin B, Zheng G, Li Z, Gastwirth JL. Trend tests for case-control studies of genetic markers: power, sample size and robustness. Human Heredity. 2002;53(3):146–152. doi: 10.1159/000064976.
  • Fridley BL, Turner ST, Chapman AB, Rodin AS, Boerwinkle E, Bailey KR. Reproducibility of genotypes as measured by the Affymetrix GeneChip 100K Human Mapping Array set. Computational Statistics and Data Analysis. 2008;52:5367–5374. doi: 10.1016/j.csda.2008.05.020.
  • Gordon D, Finch SJ. Consequences of error. In: Dunn MJ, Jorde LB, Little PFR, Subramanian S, editors. Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. Wiley; 2005.
  • Gordon D, Haynes C, Yang Y, Kramer PL, Finch SJ. Linear trend tests for case-control genetic association that incorporate random phenotype and genotype misclassification error. Genetic Epidemiology. 2007;31:853–870. doi: 10.1002/gepi.20246.
  • Gordon D, Finch SJ, Nothnagel M, Ott J. Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Human Heredity. 2002;54(1):22–33. doi: 10.1159/000066696.
  • Gordon D, Ott J. Assessment and management of single nucleotide polymorphism genotype errors in genetic association analysis. Pacific Symposium on Biocomputing. 2001:18–29.
  • Gordon D, Yang Y, Haynes C, Finch SJ, Mendell NR, Brown AM, Haroutunian V. Increasing power for tests of genetic association in the presence of phenotype and/or genotype error by use of double-sampling. Statistical Applications in Genetics and Molecular Biology. 2004;3:Article 26. doi: 10.2202/1544-6115.1085.
  • Heid IM, Lamina C, Kuchenhoff H, Fischer G, Klopp N, Kolz M, Grallert H, Vollmert C, Wagner S, Huth C, Muller J, Muller M, Hunt SC, Peters A, Paulweber B, Wichmann HE, Kronenberg F, Illig T. Estimating the single nucleotide polymorphism genotype misclassification from routine double measurements in a large epidemiologic sample. American Journal of Epidemiology. 2008;168(8):878–889. doi: 10.1093/aje/kwn208.
  • Kang SJ, Gordon D, Finch SJ. What SNP genotyping errors are most costly for genetic association studies? Genetic Epidemiology. 2004;26(2):132–141. doi: 10.1002/gepi.10301.
  • Lai RZ, Zhang H, Yang YN. Repeated measurement sampling in genetic association analysis with genotyping errors. Genetic Epidemiology. 2007;31(2):143–153. doi: 10.1002/gepi.20197.
  • Pompanon F, Bonin A, Bellemain E, Taberlet P. Genotyping errors: causes, consequences and solutions. Nature Reviews Genetics. 2005;6(11):847–859. doi: 10.1038/nrg1707.
  • Rice KM, Holmans P. Allowing for genotyping error in analysis of unmatched case-control studies. Annals of Human Genetics. 2003;67(Pt 2):165–174. doi: 10.1046/j.1469-1809.2003.00020.x.
  • Sasieni PD. From genotypes to genes: doubling the sample size. Biometrics. 1997;53(4):1253–1261. doi: 10.2307/2533494.
  • Saunders IW, Brohede J, Hannan GN. Estimating genotyping error rates from Mendelian errors in SNP array genotypes and their impact on inference. Genomics. 2007;90(3):291–296. doi: 10.1016/j.ygeno.2007.05.011.
  • Slager SL, Schaid DJ. Case-control studies of genetic markers: power and sample size approximations for Armitage's test for trend. Human Heredity. 2001;52(3):149–153. doi: 10.1159/000053370.
  • Tintle NL, Gordon D, Van Bruggen D, Finch SJ. The cost-effectiveness of duplicate genotyping for testing genetic association. Annals of Human Genetics. To appear.
  • Tintle NL, Ahn K, Mendell NR, Gordon D, Finch SJ. Characteristics of replicated single-nucleotide polymorphism genotypes from COGA: Affymetrix and Center for Inherited Disease Research. BMC Genetics. 2005;6(Suppl 1):S154. doi: 10.1186/1471-2156-6-S1-S154.
  • Tintle NL, Gordon D, McMahon FJ, Finch SJ. Using duplicate genotyped data in genetic analyses: testing association and estimating error rates. Statistical Applications in Genetics and Molecular Biology. 2007;6:Article 4.
  • Zheng G, Freidlin B, Li Z, Gastwirth JL. Choice of scores in trend tests for case-control studies of candidate-gene associations. Biometrical Journal. 2003;45(3):335–348. doi: 10.1002/bimj.200390016.
  • Zheng G, Gastwirth JL. On estimation of the variance in Cochran-Armitage trend tests for genetic association using case-control studies. Statistics in Medicine. 2006;25(18):3150–3159. doi: 10.1002/sim.2250.
  • Zuo Y, Zou G, Wang J, Zhao H, Liang H. Optimal two-stage design for case-control association analysis incorporating genotyping errors. Annals of Human Genetics. 2008;72(Pt 3):375–387. doi: 10.1111/j.1469-1809.2007.00419.x.

Articles from Statistical Applications in Genetics and Molecular Biology are provided here courtesy of Berkeley Electronic Press