Stat Appl Genet Mol Biol. 2009 January 1; 8(1): 24.

Published online 2009 May 5. doi: 10.2202/1544-6115.1433

PMCID: PMC2861316

Copyright © 2009 The Berkeley Electronic Press. All rights reserved

The genome-wide association (GWA) study is an increasingly popular way to attempt to identify the causal variants in human disease. Duplicate genotyping (or re-genotyping) a portion of the samples in a GWA study is common, though it is typical for these data to be ignored in subsequent tests of genetic association. We demonstrate a method for including duplicate genotype data in linear trend tests of genetic association which yields increased power. We also consider the cost-effectiveness of collecting duplicate genotype data and find that when the relative cost of genotyping to phenotyping and sample acquisition costs is less than or equal to the genotyping error rate it is more powerful to duplicate genotype the entire sample instead of spending the same money to increase the sample size. Duplicate genotyping is particularly cost-effective when SNP minor allele frequencies are low. Practical advice for the implementation of duplicate genotyping is provided. Free software is provided to compute asymptotic and permutation based tests of association using duplicate genotype data as well as to aid in the duplicate genotyping design decision.

Genetic tests of association often utilize case-control study designs in order to identify possible genetic factors contributing to the etiology of a complex disease (Amos 2007, Sasieni 1997). Examining the whole genome simultaneously through genome-wide association (GWA) studies has become an increasingly popular and effective method of determining genetic association. While high costs of GWA studies are still a limiting factor, they continue to become more economically plausible with advances in technology that identify single nucleotide polymorphism (SNP) genotypes at decreasing costs (Amos 2007).

Despite these technological advances, the misclassification of genotypes by SNP technology (genotyping errors) remains a persistent issue. Genotyping error rates are low in many instances (~0.1–0.2% or lower; Saunders et al. 2007, Tintle et al. 2005, Fridley et al. 2008, Heid et al. 2008, Pompanon et al. 2005). However, these error rates are not uniform across all SNPs and some SNPs have measurably larger genotyping error rates (Pompanon et al. 2005). The impact of genotyping errors on case-control tests of genotype-phenotype association is well known. Specifically, non-differential errors (genotyping error rates are the same regardless of phenotype) have no effect on type I error, but do cause inflated type II error (i.e. reduce power) (Gordon & Ott 2001, Gordon et al. 2002, Ahn et al. 2007). Genotyping errors are particularly detrimental to power when the minor SNP allele frequency is low (Gordon et al. 2002, Ahn et al. 2007, Kang et al. 2004, Gordon & Finch 2005).

In addition to laboratory and technology-based approaches to reducing genotyping errors, which seek to address errors at their source, some have proposed the consideration of genotyping errors when designing the study. For example, double sampling (Gordon et al. 2004, Gordon et al. 2007) uses a perfect genotype mechanism (like gene sequencing) on a subset of the sample. Another recent paper discusses how to incorporate genotyping errors when optimizing a two-stage design (Zuo et al. 2008). A third approach involves replicate genotyping (Fridley et al. 2008, Rice & Holmans 2003, Tintle et al. 2007, Lai et al. 2007, Bonin et al. 2004), which means genotyping a random subset of individuals in the sample two or more times, instead of only once.

Duplicate genotyping has been proposed by many for quality control reasons (e.g. Rice & Holmans 2003, Bonin et al. 2004) and it is now a fairly common practice (Tintle et al. 2005, Fridley et al. 2008). Traditionally, duplicate genotype data were ignored in the subsequent statistical analyses. The data were simply used as an initial assessment of data quality. Recently, however, a method was proposed to incorporate duplicate genotype data in standard ${\chi}_{2}^{2}$ tests of genotype-phenotype association on 2x3 tables (Tintle et al. 2007).

Subsequently, Tintle et al. (to appear) demonstrated the cost-effectiveness of duplicate genotyping (i.e. more power) for use in ${\chi}_{2}^{2}$ tests when genotyping costs are low relative to phenotyping and sample acquisition costs. It was found that, as a general rule, duplicate genotyping the entire sample increases power when relative genotype to phenotype/sample acquisition costs don’t exceed the genotyping error rate. Additionally, when the minor SNP allele frequency is low, duplicate genotyping the entire sample can be cost-effective even when relative costs are greater than the genotyping error rate.

The linear trend test of association (LTT), first proposed by Cochran (1954) and Armitage (1955), has been suggested by many (Sasieni 1997, Slager & Schaid 2001, Freidlin et al. 2002, Zheng et al. 2003, Zheng & Gastwirth 2006) as a method for analyzing SNP genotype data since it can incorporate information about the disease mode of inheritance, and thus increase statistical power by narrowing the focus of the alternative hypothesis. Recently, Ahn et al. (2007) demonstrated the impact of genotyping errors on the LTT. Also, Gordon et al. (2007) demonstrated how to use the LTT when double sample data are collected. In this paper, we demonstrate how to include duplicate genotype data in a LTT. We also explore the utility of including duplicate genotype data in subsequent tests of association if they have been collected for quality control reasons. Lastly, we evaluate the cost-effectiveness of designing a study to collect duplicate genotype data for analysis with the LTT.

We consider a sampling strategy where a fraction of the entire sample, *r* (*r* ∈ [0,1]), is randomly selected to be genotyped exactly twice, while the remaining fraction of the sample, (1–*r*), is genotyped exactly once. We assume that all samples have been phenotyped as either a “case” or a “control.”

1. Let *ε*_{i,j} be the probability of an individual of genotype *i* being classified as genotype *j*. Following the error model of Douglas et al. (2002), we assume that *ε*_{1,2} = *ε*_{2,1} = *ε*_{2,3} = *ε*_{3,2} and *ε*_{1,3} = *ε*_{3,1} = 0.
2. We assume non-differential genotyping errors, meaning that the probability of genotyping errors is the same for each individual in the sample, regardless of case or control status.
3. We assume that genotyping error probabilities are independent and remain constant from the first to second genotyping. Specifically, we mean that the probability of a genotyping error does not change for an individual’s second genotyping, and is not dependent upon whether they were incorrectly genotyped the first time.
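The error assumptions above can be illustrated with a minimal Python sketch. It assumes that the common misclassification probability between adjacent genotypes is a single value, called `eps` here (the name and the numeric inputs are illustrative, not taken from the paper), and it uses the independence assumption to obtain the probabilities of each unordered pair of observed genotypes after two genotypings.

```python
from itertools import combinations_with_replacement


def error_matrix(eps):
    """Return a 3x3 matrix whose entry [i][j] is P(observed j+1 | true i+1).

    Assumption: adjacent genotypes are confused with probability eps and
    homozygote-to-homozygote errors have probability zero.
    """
    return [
        [1 - eps, eps, 0.0],
        [eps, 1 - 2 * eps, eps],
        [0.0, eps, 1 - eps],
    ]


def observed_single(p, eps):
    """Observed genotype probabilities after a single genotyping."""
    e = error_matrix(eps)
    return [sum(p[g] * e[g][i] for g in range(3)) for i in range(3)]


def observed_duplicate(p, eps):
    """Probabilities of each unordered observed pair (i <= j) after two
    independent genotypings with the same error probabilities."""
    e = error_matrix(eps)
    out = {}
    for i, j in combinations_with_replacement(range(3), 2):
        orderings = 1 if i == j else 2  # a discordant pair can occur in two orders
        out[(i + 1, j + 1)] = orderings * sum(
            p[g] * e[g][i] * e[g][j] for g in range(3)
        )
    return out
```

Under these assumptions the duplicate probabilities are consistent with the single-genotyping ones: the concordant probability for genotype *i* plus half of each discordant probability involving *i* equals the single-genotyping probability of observing *i*.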

- *δ*_{m} = the frequency of allele *m* at the SNP marker. In this paper we assume the SNP is bi-allelic, and, thus, *m* = *U*, *V*. We also assume that the SNP marker allele associated with the disease is allele *V*.
- *ζ*_{n} = the frequency of risk allele *n* at the disease locus. In this paper we assume the disease locus is bi-allelic and we denote the risk allele as *B* and the non-risk allele as *A*. Thus, *n* = *A*, *B*.
- *h*_{mn} = the frequency of the *mn* haplotype; that is, the frequency of having both the *m* allele at the SNP marker and allele *n* at the disease locus. Thus, $\sum _{m=U,V}\sum _{n=A,B}{h}_{mn}=1$.
- *D* = the unstandardized measure of linkage disequilibrium between the SNP marker allele, *V*, and the disease risk allele, *B*. Thus, *D* = *h*_{VB} − *δ*_{V}*ζ*_{B}.
- *r*^{2} = the measure of the correlation between the SNP marker and the disease risk allele = *D*^{2}/(*δ*_{U}*δ*_{V}*ζ*_{A}*ζ*_{B}).
- *ρ* = *r*^{2}/max(*r*^{2}) = a measure of the correlation of SNP allele *V* and disease risk allele *B* as a fraction of their maximum possible correlation. As pointed out by Amos (2007), max(*r*^{2}) < 1 unless *δ*_{V} = *ζ*_{B}. We also note that for any values of *δ*_{V} and *ζ*_{B}, max(*r*^{2}) is attained when *D*′ = 1, where *D*′ = *D*/min(*δ*_{U}*ζ*_{B}, *δ*_{V}*ζ*_{A}).
- *ϕ* = the disease prevalence in the population.
- *f*_{j1j2} = the penetrance of the disease given genotype *j*_{1}*j*_{2} at the disease locus. Thus, *f*_{BB} is the probability someone who is BB at the disease locus (homozygote for the risk allele) has the disease, *f*_{AB} is the probability someone who is AB at the disease locus (heterozygote for the risk allele) has the disease, and *f*_{AA} is the probability someone who is AA at the disease locus (homozygote for the non-risk allele) has the disease.
- *γ* = a general relative risk of disease parameter which is used to compute genotype specific relative risks (*γ*_{BB} and *γ*_{AB}) in ways that are dependent upon the mode of inheritance of the disease (dominant, additive, recessive).
- *p*_{i} = the probability of genotype *i* in the cases, *i* = 1,2,3.
- *q*_{i} = the probability of genotype *i* in the controls, *i* = 1,2,3.
- ${p}_{i}^{*}$ = the probability of observing genotype *i* in the cases assuming there are genotyping errors, and the sample is genotyped exactly once, *i* = 1,2,3.
- ${q}_{i}^{*}$ = the probability of observing genotype *i* in the controls assuming there are genotyping errors, and the sample is genotyped exactly once, *i* = 1,2,3.
- ${p}_{ij}^{*}$ = the probability of observing genotype *i* once and genotype *j* once in the cases assuming there are genotyping errors, and the sample is genotyped exactly twice, *i* = 1,2,3, *j* = 1,2,3 and *i* ≤ *j*.
- ${q}_{ij}^{*}$ = the probability of observing genotype *i* once and genotype *j* once in the controls assuming there are genotyping errors, and the sample is genotyped exactly twice, *i* = 1,2,3, *j* = 1,2,3 and *i* ≤ *j*.
- *T* = the total number of cases.
- *S* = the total number of controls.
- *N* = *T* + *S* = the total sample size.
- *k* = *S*/*T* = ratio of controls to cases.
- *c* = the relative cost of genotyping to phenotyping/sample acquisition.
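As a small illustration of the linkage disequilibrium quantities defined above, the following Python sketch computes *D*, *r*^{2} and *D*′ from the marker allele frequency, the risk allele frequency and the *VB* haplotype frequency. The function name and the example inputs are hypothetical, and the *D*′ expression is the one given above, which applies when *D* is positive.

```python
def ld_measures(delta_v, zeta_b, h_vb):
    """Return (D, r^2, D') for marker allele V and disease risk allele B.

    delta_v: frequency of SNP marker allele V
    zeta_b:  frequency of disease risk allele B
    h_vb:    frequency of the VB haplotype
    """
    delta_u = 1 - delta_v
    zeta_a = 1 - zeta_b
    d = h_vb - delta_v * zeta_b
    r2 = d ** 2 / (delta_u * delta_v * zeta_a * zeta_b)
    # D' as defined in the text (valid for D > 0)
    d_prime = d / min(delta_u * zeta_b, delta_v * zeta_a)
    return d, r2, d_prime


# Equal allele frequencies: r^2 can reach 1 when D' = 1.
print(ld_measures(0.3, 0.3, 0.3))
# Unequal allele frequencies: D' = 1 but r^2 stays below 1.
print(ld_measures(0.2, 0.4, 0.2))
```

The second call illustrates the remark attributed to Amos (2007): when the two allele frequencies differ, the maximum attainable *r*^{2} is below 1 even at *D*′ = 1.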

When a fraction *(r)* of the sample has been duplicate genotyped, and the SNP marker under consideration has three possible genotypes (1=*UU*, 2=*UV* and 3=*VV*), data can be summarized into two tables, as shown in *Tables 1a* and *1b*. We assume that an equal fraction of both cases and controls has been duplicate genotyped.

Using a weighting strategy for duplicate genotype data presented by Tintle et al. (2007), *Tables 1a* and *1b* can be combined into a single table (*Table 1c*) as follows:

$$t{\text{'}}_{i}={t}_{i}+{t}_{ii}+0.5({t}_{ij})+0.5({t}_{ik})$$

(1)

where *i* ≠ *j* and *i* ≠ *k*, with a similar equation for the controls. As shown in Tintle et al. (2007), using equal weights (0.5) for the inconsistently identified individuals is optimal.
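The weighting in Equation (1) can be sketched in Python as follows. The data layout (a list of single-genotyped counts and a dictionary of duplicate-genotyped counts keyed by unordered genotype pair) and the example counts are assumptions made for illustration only.

```python
def combine_counts(single, duplicate):
    """Apply the Equation (1) weighting to one phenotype group.

    single:    [t_1, t_2, t_3], counts for individuals genotyped once
    duplicate: {(i, j): t_ij} with i <= j, counts for individuals
               genotyped twice (genotypes coded 1, 2, 3)
    Returns the combined counts [t'_1, t'_2, t'_3].
    """
    combined = [float(x) for x in single]
    for (i, j), count in duplicate.items():
        if i == j:
            # consistently genotyped: full weight to genotype i
            combined[i - 1] += count
        else:
            # inconsistently genotyped: half weight to each observed genotype
            combined[i - 1] += 0.5 * count
            combined[j - 1] += 0.5 * count
    return combined


# Hypothetical counts for the cases
single_cases = [50, 30, 20]
duplicate_cases = {(1, 1): 40, (1, 2): 2, (1, 3): 0,
                   (2, 2): 25, (2, 3): 4, (3, 3): 10}
print(combine_counts(single_cases, duplicate_cases))
```

Because each individual contributes a total weight of one, the combined counts still sum to the number of individuals in the group.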

We consider three disease modes of inheritance (MOI): Dominant (*γ** _{BB}* =

As noted earlier, the linear trend test (LTT) is a powerful choice for the analysis of case-control studies of genetic association because of the ability to include information about the disease mode of inheritance (Sasieni 1997, Slager & Schaid 2001, Zheng & Gastwirth 2006). The traditional LTT statistic is $U/{\sigma}_{U}$, where *U* is a statistic based on the disease mode of inheritance and the observed cell counts in the 2x3 contingency table (e.g. *Table 1a*), and *σ*_{U} is estimated based on the observed cell counts in the same table. In this paper, we extend the traditional version of the test to be able to include duplicate genotype data, proposing the *LTT*_{d} (see *Results: Finding the LTT statistic*). In short, the *LTT*_{d} uses the strategy proposed by Tintle et al. (2007) to allocate individuals who have been inconsistently duplicate genotyped to each of the two genotypes to which they have been genotyped (see Equation (

In developing the *LTT _{d}*, we also address the issue of bias in the

To confirm that the empirical distribution of the *LTT** _{d}* (Derived later, see

We examined all possible combinations of parameter values and so a total of 17,496 settings were evaluated. The simulation study was conducted as follows:

*Step 1.* For given values of *δ*_{V}*, ζ*_{B,}*ρ*, *ϕ*,*γ*, and the disease MOI the true genotype probabilities (*p i* and

*Step 2.* The true genotype probabilities (*p** _{i}* and

*Step 3.* For given values of *k*, *r*, and *n*, and the observed single and duplicate genotyping probabilities (${p}_{ij}^{*}$, ${q}_{ij}^{*}$, ${p}_{i}^{*}$ and ${q}_{i}^{*}$) found in step 2, entries into *Tables 1a* and *1b* were randomly simulated. For each combination of parameter values in *Table 2*, 2,000 random tables were simulated. In cases where *γ* = 1 (null hypothesis is true), the type I error rate was analyzed by comparing the nominal significance level α (we examined 0.05, 0.005, and 0.0002) with the empirical α level. In cases where *γ* = 1.25 or *γ* = 2.00 (i.e. the alternative hypothesis is true), the empirical power was compared to the theoretical power (see equation *(A3)* in the *Appendix*).
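A minimal Python sketch of the table simulation in Step 3 is given below for one phenotype group. It assumes multinomial sampling of the single-genotyped and duplicate-genotyped individuals, implemented with the standard library; the function name, the rounding of the duplicate count and the data layout are illustrative choices, not the authors' implementation.

```python
import random


def simulate_group(n, r, p_single, p_dup, rng):
    """Simulate observed counts for one phenotype group.

    n:        number of individuals in the group
    r:        fraction genotyped twice
    p_single: [p*_1, p*_2, p*_3], observed single-genotyping probabilities
    p_dup:    {(i, j): p*_ij} with i <= j, observed duplicate probabilities
    rng:      a random.Random instance (for reproducibility)
    Returns (single_counts, duplicate_counts).
    """
    n_dup = round(n * r)
    n_single = n - n_dup

    # individuals genotyped once: one draw from {1, 2, 3} each
    draws = rng.choices([1, 2, 3], weights=p_single, k=n_single)
    single = [draws.count(g) for g in (1, 2, 3)]

    # individuals genotyped twice: one draw of an unordered pair each
    pairs = list(p_dup)
    pair_draws = rng.choices(pairs, weights=[p_dup[k] for k in pairs], k=n_dup)
    duplicate = {pair: pair_draws.count(pair) for pair in pairs}
    return single, duplicate
```

Calling the function once with the case probabilities and once with the control probabilities gives one simulated pair of tables; repeating this 2,000 times per parameter setting mirrors the design described above.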

We completed a computational study comparing theoretical power values for different values of *r* (duplicate genotyping percentage), *c* (relative genotyping costs) and other parameters. *Table 3* shows the settings used for this study. We examined all 10,368 possible combinations of parameter values based on *Table 3*.

The computational study was carried out as follows:

- Step 1. Assuming there are no genotyping errors, for given values of *δ*_{V}, *ζ*_{B}, *ρ*, *ϕ*, *γ* and the disease type, the genotype probabilities were computed as if no duplicates were obtained. These values were then used to find the sample size needed (*N*_{0}) to yield the specified power level (80% or 95%).
- Step 2. Find the budget (*B*) needed to conduct the study if no duplicates are collected as: *B* = (1+*c*)*N*_{0}, where *c* is the genotyping cost per person relative to phenotyping/acquisition cost.
- Step 3. Assuming there is duplicate genotyping (*r*>0), the sample size that can be obtained for the same budget, *B*, is found as ${N}_{r}=\frac{B}{1+c(1+r)}$. *N*_{r} can then be used in the power computation formula *(A3)* in the *Appendix*, to find the power using duplicate genotyping for that sample size. Then we find the optimal value of *r* that yields the largest power of the test. All computations used α=0.0002.
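The budget arithmetic in Steps 2 and 3 can be illustrated with a short Python sketch. The function names and the example values of *N*_{0} and *c* are hypothetical.

```python
def budget(n0, c):
    """Budget B = (1 + c) * N_0 for a study with no duplicate genotyping."""
    return (1 + c) * n0


def sample_size_with_duplicates(b, c, r):
    """Sample size N_r = B / (1 + c * (1 + r)) affordable under budget B
    when a fraction r of the sample is genotyped twice."""
    return b / (1 + c * (1 + r))


# Hypothetical example: N_0 = 1000 individuals, relative genotyping cost c = 0.01
b = budget(1000, 0.01)
for r in (0.0, 0.5, 1.0):
    print(r, sample_size_with_duplicates(b, 0.01, r))
```

With these illustrative values, duplicate genotyping the entire sample (*r* = 1) reduces the affordable sample size by roughly 1%, which is the trade-off the power comparison then evaluates.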

Zheng and Gastwirth (2006) present a test statistic for the LTT as $Z=\frac{U}{\sqrt{V}}$, where $U=\sum _{i}{x}_{i}\left(\frac{S}{N}{t}_{i}-\frac{T}{N}{s}_{i}\right)$ and *V* is an estimate of the variance of *U*. Tintle et al. (2007) showed that by using the allocation strategy in *Equation (1)*, *Tables 1a* and *1c* estimate the same quantities. Thus, the numerator of the Zheng and Gastwirth *Z* statistic becomes:

$${U}_{d}=\sum _{i}{x}_{i}\left(\frac{S}{N}t{\text{'}}_{i}-\frac{T}{N}s{\text{'}}_{i}\right)$$

(2)
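A minimal Python sketch of the numerator in Equation (2) follows. The genotype scores *x*_{i} depend on the assumed mode of inheritance; the additive scores (0, 0.5, 1) used in the example are a common choice for the Cochran-Armitage test and are an assumption here, as are the function name and the counts.

```python
def u_d(scores, cases, controls):
    """Numerator of the trend statistic computed from combined counts.

    scores:   (x_1, x_2, x_3), genotype scores for the assumed mode of inheritance
    cases:    [t'_1, t'_2, t'_3], combined (weighted) case counts
    controls: [s'_1, s'_2, s'_3], combined (weighted) control counts
    """
    t_total = sum(cases)      # T, total number of cases
    s_total = sum(controls)   # S, total number of controls
    n_total = t_total + s_total
    return sum(
        x * (s_total / n_total * t_i - t_total / n_total * s_i)
        for x, t_i, s_i in zip(scores, cases, controls)
    )


# Hypothetical combined counts with additive scores
print(u_d((0, 0.5, 1), [40, 40, 20], [50, 40, 10]))
```

When the case and control genotype distributions are identical the statistic is zero, in line with its null expectation.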

According to the Central Limit Theorem, the cell counts in *Table 1c* have an approximately multivariate normal distribution (see also Tintle et al. 2007). The expected value of *U*_{d} under the null hypothesis (*p*^{*}_{i}=*q*^{*}_{i} for all *i*) is zero (see Equation *(A1)* in the *Appendix*). Thus,

$$LT{T}_{d}=\frac{{U}_{d}}{\sqrt{Var({U}_{d})}}$$

(3)

has an approximately standard normal distribution under the null hypothesis and, therefore, (*LTT*_{d})^{2} has an approximate ${\chi}_{1}^{2}$ distribution. Following the results of Zheng and Gastwirth (2006) the expression for *Var*(*U*_{d}) follows from

As described earlier (*Methods: Simulation study*), a simulation study was conducted to ensure that nominal type I and type II error rates obtained using the asymptotic theory of the *LTT _{d}* were maintained in practice. First we consider the distribution of

For each combination of parameter values, a 99% confidence interval was found for the empirical α. For both the dominant and additive models, nominal type I error rates were maintained empirically regardless of sample size since an expected number of simulation settings had a 99% confidence interval on the empirical α that did not contain the nominal α (1.2% and 1.2% for dominant and additive, respectively, for α=0.05, 1.2% and 1.3% for the α=0.005 level and 1.1% and 0.7% for the α=0.0002 level). Nominal type I error rates were maintained empirically for the recessive model as long as the minimum cell count in *Table 1c* was at least 5 (detailed results not shown).
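The text does not state which interval method was used for the empirical α. As one possible illustration, the sketch below computes a normal-approximation (Wald) confidence interval for a rejection proportion; the choice of interval, the function name and the example counts are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def empirical_alpha_ci(rejections, n_sims, level=0.99):
    """Wald confidence interval for the empirical type I error rate.

    rejections: number of simulated tables in which the null was rejected
    n_sims:     number of simulated tables
    level:      confidence level of the interval
    """
    p_hat = rejections / n_sims
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n_sims)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Hypothetical setting: 100 rejections in 2,000 simulated tables, nominal alpha 0.05
low, high = empirical_alpha_ci(100, 2000)
print(low <= 0.05 <= high)
```

For very small nominal levels such as 0.0002, few or no rejections are expected in 2,000 tables, so a normal approximation is rough and an exact binomial interval may be preferable.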

The *LTT** _{d}* statistic generally gives comparable theoretical and empirical power values across all simulation settings for the additive and dominant models as long as expected cell counts in

Based on the simulation study, differences in theoretical and empirical type I and type II errors are possible when the recessive disease model is used, in cases where at least one cell count in the grouped table is less than 5, or in cases where the total sample is less than 1,000 individuals. In these cases we recommend estimating p-values for the *LTT** _{d}* by permuting phenotype status instead of using the asymptotic theory provided above. A permutation based p-value is available in our software (see

In Tintle et al. (2007), duplicate genotype data from a case-control study on bipolar disorder was presented for a SNP with inconsistently genotyped individuals where all individuals were duplicate genotyped. We present this data here (*Table 4*) in the form of *Table 1b* using a linear trend test for analysis to demonstrate the utility of the methods just developed.

Tintle et al. (2007) report a p-value of 0.061 from the ${\chi}_{2}^{2}$ test ignoring inconsistently identified individuals and 0.064 from the test including inconsistencies. Using our software and assuming an additive mode of inheritance, the linear trend test just presented yields a p-value of 0.0230 when inconsistently genotyped individuals are ignored, 0.0241 when they are included using the method shown above, and 0.0245 using a permutation test with 2000 permutations.

Initially, we consider an instance of including previously collected quality control data in the test of association (*c* = 0). In every case examined, the power of the *LTT*_{d} is higher when the duplicate genotype data are included than when they are not. In other words, it is better to include the duplicate genotype data in subsequent tests of association than to ignore inconsistencies and treat the data as missing. This result is consistent with the results of Tintle et al. (2007) for the ${\chi}_{2}^{2}$ test of association.

The most important case, however, is when *c*>0, that is, when we view the collection of duplicate genotype data as an *a priori* design decision, and thus must account for the cost of collecting the duplicates for a fraction, *r*, of the sample.

Given a fixed budget, in 49.2% of cases examined (see *Table 3*) where *c*>0, duplicate genotyping the entire sample (*r*=1) was found to be the most cost-effective design strategy (yields the highest power). In all remaining cases, *r*=0 provided the highest power. Thus, the optimal strategy is always “all or nothing.”

In order to characterize situations where duplicate genotyping will be cost-effective, logistic regression models were used with all parameters predicting whether or not duplicate genotyping the entire sample was the most cost-effective design. Three parameters (*δ** _{V}*,

*Table 5* shows the percentage of cases examined in the computational study where duplicate genotyping the entire sample is the most effective design decision for different values of *c* (relative genotyping costs) and *ε* (genotyping error rate).

*Table 5* demonstrates a general rule of thumb: duplicate genotyping the entire sample will always be cost-effective (regardless of *δ** _{V}* ) if

*Table 6* provides power values for a specific example. Specifically, we present power under different values of *δ** _{V}*,

The sample size needed to yield 80% power was calculated assuming there was no genotyping error. *Column III* shows the power for that sample size, after taking the genotyping error into account. When the marker frequency is low and/or the genotyping error rate is large, *Column III* demonstrates that power can be substantially reduced by genotyping errors. *Columns IV–VII* then reduce the sample size to maintain the budget, reflecting the additional cost of collecting duplicate genotype data on all samples at different genotype costs *(c)*. As *c* decreases, the genotyping error rate increases and *δ*_{V} decreases,

Note that the power values in *Table 6* are from a specific example. Please use our software (*Results: Software*) to investigate power at values specific to your research situation while keeping in mind the rule of thumb presented in *Table 5*.

In practice, duplicate genotyping should be considered when relative genotype to phenotype/sample acquisition costs do not exceed the expected SNP genotyping error rate. A more detailed treatment of practical considerations when using duplicate genotyping is provided in Tintle et al. (to appear). We summarize three main considerations here.

First, calculations provided in this manuscript consider only a single SNP. However, in practice, the decision to duplicate genotype will need to be made for an entire set of SNPs (e.g. all of the SNPs on a chip). In these cases using the same rule of thumb (duplicate genotype if *c* ≤ *ε*, where *ε* is the genotyping error rate) is appropriate, where the *ε* used is the minimum error rate expected for any single SNP.

Second, if there is concern that some samples may be of low quality and, thus, have higher genotyping error rates than other samples (a violation of genotyping error assumption #2), the genotyping error rate, *ε*, used in the *c* ≤ *ε* rule of thumb should be the minimum expected for the high quality samples. Note, however, that we are still assuming non-differential errors in this case. Differential errors may increase the type I error rate, and are not considered in this manuscript.

Third, GWA studies are typically conducted in two stages where all markers are genotyped on a sample of individuals, and then a subset of the markers is genotyped on a sample of additional individuals. When considering the use of duplicate genotyping in two-stage studies, the decision on the use of duplicate genotyping should be made separately at each stage since the relative cost of genotyping to phenotyping will be different at each stage.

To facilitate the utilization of the methods discussed in this paper, we provide two companion pieces of software for this work. The first computes the *LTT** _{d}* statistic and provides an asymptotic and permutation p-value. The second provides power computations for different genotyping costs, allele frequencies and error rates to assist in the duplicate genotyping design decision. Software is available at http://math.hope.edu/tintle/duplicate.html (source code written in

This work demonstrates how duplicate genotype data can be included in a linear trend test (*LTT)* of genetic association. Duplicate genotype data are included in the *LTT* through a weighting strategy and a subsequent adjustment of the variance of the *LTT* statistic yielding the *LTT** _{d}*. We demonstrate via simulation that the asymptotic null and alternative distributions of the

We demonstrate that in the case of no duplicate genotyping costs (e.g. the data has already been collected) including the duplicate data in the *LTT _{d}* always increases statistical power. This confirms a similar result in Tintle et al. (2007).

We also consider the cost-effectiveness of designing a study to collect duplicate genotype data, and find that when the relative cost of genotyping to phenotype/sample acquisition costs (*c*) is less than or equal to the genotyping error rate (*ε*), collecting duplicate genotype data on the entire sample is cost-effective. Further, we find that the optimal amount of duplicate genotyping, in these cases, will always involve duplicate genotyping the entire sample. In a two-stage GWA study for a complex disease, if a relatively small set of SNPs is being followed up at stage 2 and it will be costly to enroll more subjects, duplicate genotyping may be cost-effective since relative genotyping to phenotyping/acquisition costs *c* will be low.

Since the rule-of-thumb just described is conservative, it is important to note that duplicate genotyping will be cost-effective in many situations when *c* exceeds the genotyping error rate (*c* > *ε*). This rule was provided to allow researchers to quickly assess the cost-effectiveness of duplicate genotyping on a large scale. It is quite likely that, even if *c* > *ε*, duplicate genotyping may provide moderate power gains for SNPs with low minor SNP allele frequency. Our software should be used to determine cost-effectiveness of duplicate genotyping for specific experimental conditions.

We assume that genotyping errors are independent from the first to second genotyping (genotyping error assumption #3) and that genotyping error rates are non-differential (genotyping error assumption #2). Future work is needed to extend results to consider differential genotyping errors when duplicate genotyping. Further reading on sources of genotyping error and their impact on analyses can be found in Bonin et al. (2004) and Gordon and Finch (2005). We also assume that duplicate genotyping is applied to a random subsample of size *nr*. Further work is necessary to explore optimizing the value of *r* depending upon phenotype or initial genotype classification.

When collected, duplicate genotype data should always be included in the subsequent test of association and in many realistic cases duplicate genotype data should be collected on the entire sample.

Bryce Borchers, Marshall Brown and Brian McLellan contributed equally to this work. This project was funded in part by a grant from the National Institutes of Health, R15-HG004543. The content is solely the responsibility of the authors and does not necessarily represent the official view of the National Human Genome Research Institute or the National Institutes of Health. Additionally, this project received support from the Tanis Fund for Statistics Research.

$$\begin{array}{l}E({U}_{d})=E\left(\sum _{i}{x}_{i}\left(\frac{S}{N}t{\text{'}}_{i}-\frac{T}{N}s{\text{'}}_{i}\right)\right)\\ =\sum _{i}{x}_{i}\left(\frac{ST{p}_{i}^{*}}{N}-\frac{TS{q}_{i}^{*}}{N}\right)=\frac{ST}{N}\sum _{i}{x}_{i}({p}_{i}^{*}-{q}_{i}^{*})\end{array}$$

(A1)

$$\begin{array}{l}Var({U}_{d})=Var\left(\sum _{i=1}^{3}{x}_{i}\left(\frac{S}{N}t{\text{'}}_{i}-\frac{T}{N}s{\text{'}}_{i}\right)\right)={\left(\frac{S}{N}\right)}^{2}Var\left(\sum _{i=1}^{3}{x}_{i}t{\text{'}}_{i}\right)+{\left(\frac{T}{N}\right)}^{2}Var\left(\sum _{i=1}^{3}{x}_{i}s{\text{'}}_{i}\right)\\ ={\left(\frac{S}{N}\right)}^{2}\left[\sum _{i=1}^{3}\left({x}_{i}^{2}Var(t{\text{'}}_{i})+2\sum _{\underset{i<j}{j=2}}^{3}{x}_{i}{x}_{j}Cov(t{\text{'}}_{i},t{\text{'}}_{j})\right)\right]\\ +{\left(\frac{T}{N}\right)}^{2}\left[\sum _{i=1}^{3}\left({x}_{i}^{2}Var(s{\text{'}}_{i})+2\sum _{\underset{i<j}{j=2}}^{3}{x}_{i}{x}_{j}Cov(s{\text{'}}_{i},s{\text{'}}_{j})\right)\right]\end{array}$$

(A2)

Below we show how to find *Var*(*t*′_{1}) and *Cov*(*t*′_{1}, *t*′_{2}); other terms can be found similarly.

$$\begin{array}{l}Var(t'_{1})=Var\left({t}_{1}+{t}_{11}+\frac{1}{2}({t}_{12}+{t}_{13})\right)=Var({t}_{1})+Var({t}_{11})+\frac{1}{4}Var({t}_{12})+\frac{1}{4}Var({t}_{13})\\ +Cov({t}_{11},{t}_{12})+Cov({t}_{11},{t}_{13})+\frac{1}{2}Cov({t}_{12},{t}_{13})\\ ={T}_{s}{p}_{1}^{*}(1-{p}_{1}^{*})+{T}_{D}\left({p}_{11}^{*}(1-{p}_{11}^{*})+\frac{1}{4}{p}_{12}^{*}(1-{p}_{12}^{*})+\frac{1}{4}{p}_{13}^{*}(1-{p}_{13}^{*})-{p}_{11}^{*}{p}_{12}^{*}-{p}_{11}^{*}{p}_{13}^{*}-\frac{1}{2}{p}_{12}^{*}{p}_{13}^{*}\right)\end{array}$$

$$\begin{array}{l}Cov(t'_{1},t'_{2})=E\left[t'_{1}t'_{2}\right]-E\left[t'_{1}\right]\cdot E\left[t'_{2}\right]\text{ where}\\ E\left[t'_{1}\right]=E[{t}_{1}]+E[{t}_{11}]+\frac{1}{2}E\left[{t}_{12}\right]+\frac{1}{2}E\left[{t}_{13}\right]={T}_{s}{p}_{1}^{*}+{T}_{D}\left({p}_{11}^{*}+\frac{1}{2}{p}_{12}^{*}+\frac{1}{2}{p}_{13}^{*}\right)\\ E\left[t'_{2}\right]=E[{t}_{2}]+E[{t}_{22}]+\frac{1}{2}E\left[{t}_{12}\right]+\frac{1}{2}E\left[{t}_{23}\right]={T}_{s}{p}_{2}^{*}+{T}_{D}\left({p}_{22}^{*}+\frac{1}{2}{p}_{12}^{*}+\frac{1}{2}{p}_{23}^{*}\right)\\ E\left[t'_{1}t'_{2}\right]=E[{t}_{1}{t}_{2}]+E[{t}_{1}]E[{t}_{22}]+E[{t}_{11}]E[{t}_{2}]+E[{t}_{11}{t}_{22}]+\frac{1}{2}(E[{t}_{1}]E[{t}_{12}]+E[{t}_{1}]E[{t}_{23}]+E[{t}_{12}]E[{t}_{2}]+E[{t}_{13}]E[{t}_{2}])\\ +\frac{1}{2}(E[{t}_{11}{t}_{12}]+E[{t}_{11}{t}_{23}]+E[{t}_{22}{t}_{12}]+E[{t}_{22}{t}_{13}])+\frac{1}{4}(E\left[{t}_{12}{t}_{12}\right]+E\left[{t}_{12}{t}_{23}\right]+E\left[{t}_{12}{t}_{13}\right]+E\left[{t}_{13}{t}_{23}\right])\end{array}$$

where we make the following substitutions in *E*[*t*′_{1}*t*′_{2}] as appropriate:

$$\begin{array}{l}E\left[{t}_{i}\right]={T}_{s}{p}_{i}^{*},\quad E\left[{t}_{ij}\right]={T}_{D}{p}_{ij}^{*}\text{ for }i=j\text{ or }i\ne j,\quad E\left[{t}_{i}{t}_{i}\right]=Var({t}_{i})+E{\left[{t}_{i}\right]}^{2},\\ E\left[{t}_{ii}{t}_{ii}\right]=Var({t}_{ii})+E{\left[{t}_{ii}\right]}^{2},\quad E\left[{t}_{i}{t}_{j}\right]=Cov({t}_{i},{t}_{j})+E\left[{t}_{i}\right]E\left[{t}_{j}\right]\text{ for }i\ne j,\text{ and}\\ E\left[{t}_{ij}{t}_{ik}\right]=Cov\left({t}_{ij},{t}_{ik}\right)+E\left[{t}_{ij}\right]E\left[{t}_{ik}\right]\text{ for }j\ne k.\end{array}$$

Following the results of Zheng and Gastwirth (2006), asymptotic power for the *LTT*_{d} can be computed using the following formula:

$$1-\varphi \left(\frac{{z}_{1-\alpha /2}{\sigma}_{d}-{\mu}_{d}}{{\sigma}_{d}}\right)+\varphi \left(\frac{{z}_{\alpha /2}{\sigma}_{d}-{\mu}_{d}}{{\sigma}_{d}}\right)=1-\varphi \left({z}_{1-\alpha /2}-\frac{{\mu}_{d}}{{\sigma}_{d}}\right)+\varphi \left({z}_{\alpha /2}-\frac{{\mu}_{d}}{{\sigma}_{d}}\right)$$

(A3)

where ${\sigma}_{d}=\sqrt{Var({U}_{d})}$, ${\mu}_{d}=E({U}_{d})$, and $\varphi$ denotes the standard normal cumulative distribution function.
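The right-hand side of formula (A3) can be evaluated with the Python standard library, as in the sketch below. The function name and the example inputs are hypothetical; in practice the mean and standard deviation of *U*_{d} would come from Equations (A1) and (A2).

```python
from statistics import NormalDist


def asymptotic_power(mu_d, sigma_d, alpha):
    """Two-sided asymptotic power from formula (A3).

    mu_d:    expected value of U_d under the alternative
    sigma_d: standard deviation of U_d
    alpha:   nominal two-sided significance level
    """
    nd = NormalDist()  # standard normal distribution
    z_upper = nd.inv_cdf(1 - alpha / 2)
    z_lower = nd.inv_cdf(alpha / 2)
    shift = mu_d / sigma_d
    return 1 - nd.cdf(z_upper - shift) + nd.cdf(z_lower - shift)


# Hypothetical inputs: power grows as the standardized mean moves away from zero
for shift in (0.0, 2.0, 4.0, 6.0):
    print(shift, asymptotic_power(shift, 1.0, 0.0002))
```

When the expected value of *U*_{d} is zero the formula returns the nominal significance level, which is a convenient sanity check on an implementation.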

- Ahn K, Haynes C, Kim W, Fleur RS, Gordon D, Finch SJ. 2007. The effects of SNP genotyping errors on the power of the Cochran-Armitage linear trend test for case/control association studies Annals of Human Genetics 71Pt 2249–261.26110.1111/j.1469-1809.2006.00318.x [PubMed] [Cross Ref]
- Amos CI. Successful design and conduct of genome-wide association studies. Human molecular genetics. 2007;16(2):R220–5. doi: 10.1093/hmg/ddm161. [PMC free article] [PubMed] [Cross Ref]
- Armitage P. Tests for linear trends in proportions and frequencies. Biometrics. 1955;11:375–386. doi: 10.2307/3001775. [Cross Ref]
- Bonin A, Bellemain E, Bronken Eidesen P, Pompanon F, Brochmann C, Taberlet P. How to track and assess genotyping errors in population genetics studies. Molecular ecology. 2004;13(11):3261–3273. doi: 10.1111/j.1365-294X.2004.02346.x. [PubMed] [Cross Ref]
- Cochran WG. Some methods for strengthening the common chi-squared tests. Biometrics. 1954;10:417–451. doi: 10.2307/3001616. [Cross Ref]
- Douglas JA, Skol AD, Boehnke M. Probability of detection of genotyping errors and mutations as inheritance inconsistencies in nuclearfamily data. American Journal of Human Genetics. 2002;70(2):487–495. doi: 10.1086/338919. [PubMed] [Cross Ref]
- Freidlin B, Zheng G, Li Z, Gastwirth JL. Trend tests for casecontrol studies of genetic markers: power, sample size and robustness. Human heredity. 2002;53(3):146–152. doi: 10.1159/000064976. [PubMed] [Cross Ref]
- Fridley BL, Turner ST, Chapman AB, Rodin AS, Boerwinkle E, Bailey KR. Reproducibiilty of genotypes as measured by the affymetrix GeneChip 100K Human Mapping Array set. Computational Statistics and Data Analysis. 2008;52:5367–5374. doi: 10.1016/j.csda.2008.05.020. [PMC free article] [PubMed] [Cross Ref]
- Gordon D, Finch SJ. “Consequences of Error” In: Dunn MJ, Jorde LB, Little PFR, Subramanian S, Wiley, editors. Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. 2005.
- Gordon D, Haynes C, Yang Y, Kramer PL, Finch SJ. Linear Trend Tests for Case-Control Genetic Association that Incorporate Random Phenotype and Genotype Misclassification Error. Genetic epidemiology. 2007;31:853–870. doi: 10.1002/gepi.20246. [PubMed] [Cross Ref]
- Gordon D, Finch SJ, Nothnagel M, Ott J. Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Human heredity. 2002;54(1):22–33. doi: 10.1159/000066696. [PubMed] [Cross Ref]
- Gordon D, Ott J. Assessment and management of single nucleotide polymorphism genotype errors in genetic association analysis. Pacific Symposium on Biocomputing. 2001:18–29. [PubMed]
- Gordon D, Yang Y, Haynes C, Finch SJ, Mendell NR, Brown AM, Haroutunian V. Increasing power for tests of genetic association in the presence of phenotype and/or genotype error by use of double-sampling. Statistical applications in genetics and molecular biology. 2004;3:Article 26. doi: 10.2202/1544-6115.1085. [PubMed] [Cross Ref]
- Heid IM, Lamina C, Kuchenhoff H, Fischer G, Klopp N, Kolz M, Grallert H, Vollmert C, Wagner S, Huth C, Muller J, Muller M, Hunt SC, Peters A, Paulweber B, Wichmann HE, Kronenberg F, Illig T. Estimating the single nucleotide polymorphism genotype misclassification from routine double measurements in a large epidemiologic sample. American Journal of Epidemiology. 2008;168(8):878–889. doi: 10.1093/aje/kwn208. [PMC free article] [PubMed] [Cross Ref]
- Kang SJ, Gordon D, Finch SJ. What SNP genotyping errors are most costly for genetic association studies? Genetic epidemiology. 2004;26(2):132–141. doi: 10.1002/gepi.10301. [PubMed] [Cross Ref]
- Lai RZ, Zhang H, Yang YN. Repeated measurement sampling in genetic association analysis with genotyping errors. Genetic epidemiology. 2007;31(2):143–153. doi: 10.1002/gepi.20197. [PubMed] [Cross Ref]
- Pompanon F, Bonin A, Bellemain E, Taberlet P. Genotyping errors: causes, consequences and solutions. Nature reviews. Genetics. 2005;6(11):847–859. doi: 10.1038/nrg1707. [PubMed] [Cross Ref]
- Rice KM, Holmans P. Allowing for genotyping error in analysis of unmatched case-control studies. Annals of Human Genetics. 2003;67(Pt 2):165–174. doi: 10.1046/j.1469-1809.2003.00020.x. [PubMed] [Cross Ref]
- Sasieni PD. From genotypes to genes: doubling the sample size. Biometrics. 1997;53(4):1253–1261. doi: 10.2307/2533494. [PubMed] [Cross Ref]
- Saunders IW, Brohede J, Hannan GN. Estimating genotyping error rates from Mendelian errors in SNP array genotypes and their impact on inference. Genomics. 2007;90(3):291–296. doi: 10.1016/j.ygeno.2007.05.011. [PubMed] [Cross Ref]
- Slager SL, Schaid DJ. Case-control studies of genetic markers: power and sample size approximations for Armitage's test for trend. Human heredity. 2001;52(3):149–153. doi: 10.1159/000053370. [PubMed] [Cross Ref]
- Tintle NL, Gordon D, Van Bruggen D, Finch SJ. The cost-effectiveness of duplicate genotyping for testing genetic association. Annals of Human Genetics. (to appear). [PMC free article] [PubMed]
- Tintle NL, Ahn K, Mendell NR, Gordon D, Finch SJ. Characteristics of replicated single-nucleotide polymorphism genotypes from COGA: Affymetrix and Center for Inherited Disease Research. BMC genetics. 2005;6(Suppl 1):S154. doi: 10.1186/1471-2156-6-S1-S154. [PMC free article] [PubMed] [Cross Ref]
- Tintle NL, Gordon D, McMahon FJ, Finch SJ. Using duplicate genotyped data in genetic analyses: testing association and estimating error rates. Statistical applications in genetics and molecular biology. 2007;6:Article 4. [PubMed]
- Zheng G, Freidlin B, Li Z, Gastwirth JL. Choice of Scores in Trend Tests for Case-control studies of candidate-gene associations. Biometrical journal. 2003;45(3):335–348. doi: 10.1002/bimj.200390016. [Cross Ref]
- Zheng G, Gastwirth JL. On estimation of the variance in Cochran-Armitage trend tests for genetic association using case-control studies. Statistics in medicine. 2006;25(18):3150–3159. doi: 10.1002/sim.2250. [PubMed] [Cross Ref]
- Zuo Y, Zou G, Wang J, Zhao H, Liang H. Optimal two-stage design for case-control association analysis incorporating genotyping errors. Annals of Human Genetics. 2008;72(Pt 3):375–387. doi: 10.1111/j.1469-1809.2007.00419.x. [PMC free article] [PubMed] [Cross Ref]

Articles from Statistical Applications in Genetics and Molecular Biology are provided here courtesy of **Berkeley Electronic Press**
