The genome-wide association (GWA) study is an increasingly popular way to attempt to identify the causal variants in human disease. Duplicate genotyping (or re-genotyping) a portion of the samples in a GWA study is common, though these data are typically ignored in subsequent tests of genetic association. We demonstrate a method for including duplicate genotype data in linear trend tests of genetic association which yields increased power. We also consider the cost-effectiveness of collecting duplicate genotype data and find that when the cost of genotyping relative to phenotyping and sample acquisition costs is less than or equal to the genotyping error rate, it is more powerful to duplicate genotype the entire sample than to spend the same money on increasing the sample size. Duplicate genotyping is particularly cost-effective when SNP minor allele frequencies are low. Practical advice for the implementation of duplicate genotyping is provided. Free software is provided to compute asymptotic and permutation-based tests of association using duplicate genotype data, as well as to aid in the duplicate genotyping design decision.
Genetic tests of association often utilize case-control study designs in order to identify possible genetic factors contributing to the etiology of a complex disease (Amos 2007, Sasieni 1997). Examining the whole genome simultaneously through genome-wide association (GWA) studies has become an increasingly popular and effective method of determining genetic association. While high costs of GWA studies are still a limiting factor, they continue to become more economically plausible with advances in technology that identify single nucleotide polymorphism (SNP) genotypes at decreasing costs (Amos 2007).
Despite these technological advances, the misclassification of genotypes by SNP technology (genotyping errors) remains a persistent issue. Genotyping error rates are low in many instances (~0.1–0.2% or lower; Saunders et al. 2007, Tintle et al. 2005, Fridley et al. 2008, Heid et al. 2008, Pompanon et al. 2005). However, these error rates are not uniform across all SNPs and some SNPs have measurably larger genotyping error rates (Pompanon et al. 2005). The impact of genotyping errors on case-control tests of genotype-phenotype association is well known. Specifically, non-differential errors (genotyping error rates are the same regardless of phenotype) have no effect on type I error, but do cause inflated type II error (i.e. reduce power) (Gordon & Ott 2001, Gordon et al. 2002, Ahn et al. 2007). Genotyping errors are particularly detrimental to power when the minor SNP allele frequency is low (Gordon et al. 2002, Ahn et al. 2007, Kang et al. 2004, Gordon & Finch 2005).
In addition to laboratory and technology-based approaches to reducing genotyping errors, which seek to address errors at their source, some have proposed the consideration of genotyping errors when designing the study. For example, double sampling (Gordon et al. 2004, Gordon et al. 2007) uses a perfect genotype mechanism (like gene sequencing) on a subset of the sample. Another recent paper discusses how to incorporate genotyping errors when optimizing a two-stage design (Zuo et al. 2008). A third approach involves replicate genotyping (Fridley et al. 2008, Rice & Holmans 2003, Tintle et al. 2007, Lai et al. 2007, Bonin et al. 2004), which means genotyping a random subset of individuals in the sample two or more times, instead of only once.
Duplicate genotyping has been proposed by many for quality control reasons (e.g. Rice & Holmans 2003, Bonin et al. 2004) and it is now a fairly common practice (Tintle et al. 2005, Fridley et al. 2008). Traditionally, duplicate genotype data were ignored in the subsequent statistical analyses. The data were simply used as an initial assessment of data quality. Recently, however, a method was proposed to incorporate duplicate genotype data in standard tests of genotype-phenotype association on 2x3 tables (Tintle et al. 2007).
Subsequently, Tintle et al. (to appear) demonstrated the cost-effectiveness of duplicate genotyping (i.e. more power) for use in tests when genotyping costs are low relative to phenotyping and sample acquisition costs. It was found that, as a general rule, duplicate genotyping the entire sample increases power when the cost of genotyping relative to phenotyping and sample acquisition does not exceed the genotyping error rate. Additionally, when the minor SNP allele frequency is low, duplicate genotyping the entire sample can be cost-effective even when relative costs are greater than the genotyping error rate.
The linear trend test of association (LTT), first proposed by Cochran (1954) and Armitage (1955), has been suggested by many (Sasieni 1997, Slager & Schaid 2001, Freidlin et al. 2002, Zheng et al. 2003, Zheng & Gastwirth 2006) as a method for analyzing SNP genotype data since it can incorporate information about the disease mode of inheritance, and thus increase statistical power by narrowing the focus of the alternative hypothesis. Recently, Ahn et al. (2007) demonstrated the impact of genotyping errors on the LTT. Also, Gordon et al. (2007) demonstrated how to use the LTT when double sample data are collected. In this paper, we demonstrate how to include duplicate genotype data in a LTT. We also explore the utility of including duplicate genotype data in subsequent tests of association if they have been collected for quality control reasons. Lastly, we evaluate the cost-effectiveness of designing a study to collect duplicate genotype data for analysis with the LTT.
We consider a sampling strategy where a fraction of the entire sample, r (r ∈ [0,1]), is randomly selected to be genotyped exactly twice, while the remaining fraction of the sample, (1–r), is genotyped exactly once. We assume that all samples have been phenotyped as either a “case” or a “control.”
When a fraction (r) of the sample has been duplicate genotyped, and the SNP marker under consideration has three possible genotypes (1=UU, 2=UV and 3=VV), data can be summarized into two tables, as shown in Tables 1a and 1b. We assume that an equal fraction of both cases and controls has been duplicate genotyped.
where i ≠ j and i ≠ k, with a similar equation for the controls. As shown in Tintle et al. (2007), using equal weights (0.5) for the inconsistently identified individuals is optimal.
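The allocation strategy can be illustrated with a minimal sketch. The function name `allocate_counts` and the 3x3 layout of the duplicate counts (row = first call, column = second call) are assumptions made for illustration; they are not taken from the companion software. Each inconsistently genotyped individual contributes the weight 0.5 to each of the two genotypes observed, so the combined counts still sum to the number of individuals.

```python
def allocate_counts(single, duplicate, weight=0.5):
    """Combine single and duplicate genotype counts for one phenotype group.

    single:    list of 3 counts for individuals genotyped once.
    duplicate: 3x3 list; duplicate[i][j] counts individuals called genotype i
               on the first run and genotype j on the second (assumed layout).
    weight:    share given to each of the two genotypes of an inconsistent pair.
    """
    combined = []
    for i in range(3):
        # Individuals genotyped once, plus consistently duplicated individuals.
        total = single[i] + duplicate[i][i]
        # Inconsistent pairs involving genotype i contribute a fractional count.
        for j in range(3):
            if j != i:
                total += weight * (duplicate[i][j] + duplicate[j][i])
        combined.append(total)
    return combined


# Hypothetical counts for one phenotype group (not data from the paper).
single_counts = [50, 30, 20]
duplicate_counts = [[40, 2, 0],
                    [1, 25, 1],
                    [0, 0, 10]]
combined_counts = allocate_counts(single_counts, duplicate_counts)
```

With equal weights the combined table keeps one unit of count per individual, which is why it can stand in for a single-genotype table in the numerator of the test.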
We consider three disease modes of inheritance (MOI): Dominant (γBB = γAB = γ), Additive (γAB = γ, γBB = 2γ − 1), and Recessive (γBB = γ, γAB = 1), where γBB = the relative risk of disease for a participant with two copies of the risk allele (fBB / fAA) and γAB = the relative risk of disease for a participant with one copy of the risk allele (fAB / fAA). Note that when γBB = γAB = 1 the null hypothesis of no association between genotype and disease is true.
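The three modes of inheritance map a single relative risk γ to the pair (γAB, γBB). A minimal sketch of that mapping follows; the function name `relative_risks` is hypothetical.

```python
def relative_risks(gamma, moi):
    """Return (gamma_AB, gamma_BB) for a given relative risk and mode of inheritance."""
    if moi == "dominant":
        return gamma, gamma
    if moi == "additive":
        return gamma, 2 * gamma - 1
    if moi == "recessive":
        return 1.0, gamma
    raise ValueError(f"unknown mode of inheritance: {moi}")


# Under every mode, gamma = 1 gives gamma_AB = gamma_BB = 1 (the null hypothesis).
null_risks = [relative_risks(1.0, m) for m in ("dominant", "additive", "recessive")]
```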
As noted earlier, the linear trend test (LTT) is a powerful choice for the analysis of case-control studies of genetic association because of the ability to include information about the disease mode of inheritance (Sasieni 1997, Slager & Schaid 2001, Zheng & Gastwirth 2006). The traditional LTT statistic is $Z = U/\sigma_U$, where U is a statistic based on the disease mode of inheritance and the observed cell counts in the 2x3 contingency table (e.g. Table 1a), and σU is estimated based on the observed cell counts in the same table. In this paper, we extend the traditional version of the test to be able to include duplicate genotype data, proposing the LTTd (see Results: Finding the LTTd statistic). In short, the LTTd uses the strategy proposed by Tintle et al. (2007) to allocate individuals who have been inconsistently duplicate genotyped to each of the two genotypes to which they have been genotyped (see Equation (1)). This strategy, however, means that the resulting contingency table of phenotype-genotype (Table 1c) no longer has a multinomial distribution due to increased covariance between cells, requiring the introduction of the LTTd.
In developing the LTTd, we also address the issue of bias in the σU estimate. Freidlin et al. (2002) demonstrated that the method of estimating σU as considered by Slager and Schaid (2001) was biased and thus provided invalid results. Zheng and Gastwirth (2006) consider two alternatives to the Slager and Schaid approach which they call “case-control” (cc) and “control” (c). The Slager and Schaid method estimates σU assuming that the null hypothesis of no genotype-phenotype association is true. The cc method estimates σU without the restriction of the null hypothesis being true, whereas the c method is similar but only uses the sample of controls. Zheng and Gastwirth find, and we confirmed in our own attempts to implement the method, that the c method increases the type I error in some cases (results not shown). Thus, we choose to base our results only on the cc method.
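To make the idea of estimating σU without the null restriction concrete, the sketch below computes a trend statistic for a single-genotype 2x3 table, estimating the variance of the score separately in cases and in controls. This is one plausible reading of the cc approach, not a transcription of Zheng and Gastwirth's formula, and it does not include the extra covariance terms that the LTTd needs for Table 1c. The genotype scores in `SCORES` are a customary choice and are an assumption here, as is the function name `trend_z`.

```python
import math

# Scores for genotypes (UU, UV, VV) under each mode of inheritance (assumed).
SCORES = {
    "dominant": (0.0, 1.0, 1.0),
    "additive": (0.0, 0.5, 1.0),
    "recessive": (0.0, 0.0, 1.0),
}


def trend_z(cases, controls, moi="additive"):
    """Trend statistic U / sqrt(V) with V estimated separately by group."""
    x = SCORES[moi]
    n_cases, n_controls = sum(cases), sum(controls)
    p = [count / n_cases for count in cases]        # case genotype frequencies
    q = [count / n_controls for count in controls]  # control genotype frequencies

    # U: difference in mean genotype score between cases and controls.
    u = sum(xi * (pi - qi) for xi, pi, qi in zip(x, p, q))

    def score_variance(freqs):
        mean = sum(xi * f for xi, f in zip(x, freqs))
        return sum(xi ** 2 * f for xi, f in zip(x, freqs)) - mean ** 2

    # Variance of U from the two independent multinomial samples.
    # Note: this is zero (and the division fails) if the scored genotypes
    # show no variation in either group.
    var_u = score_variance(p) / n_cases + score_variance(q) / n_controls
    return u / math.sqrt(var_u)


# Hypothetical counts: the V allele is more common among cases.
z_example = trend_z([200, 200, 100], [300, 150, 50], moi="additive")
```

A pooled variance computed under the null hypothesis would correspond to the classical Cochran-Armitage form instead; the two differ only in how the denominator is estimated.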
To confirm that the empirical distribution of the LTTd (derived later, see Results: Finding the LTTd statistic) follows the theoretical asymptotic distribution (standard normal) for practical sample sizes, we conducted a simulation study (see Table 2 for parameters and values used).
We examined all possible combinations of parameter values and so a total of 17,496 settings were evaluated. The simulation study was conducted as follows:
Step 1. For given values of δV, ζB, ρ, ϕ, γ, and the disease MOI, the true genotype probabilities (pi and qi, i=1,2,3) were computed. See Ahn et al. (2007) for details.
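A simplified version of Step 1 can be sketched as follows. It assumes Hardy-Weinberg proportions in the population and that the genotyped marker is itself the disease locus (perfect linkage disequilibrium and equal allele frequencies); the general calculation in Ahn et al. (2007) is not reproduced here. The function and argument names are hypothetical.

```python
def case_control_genotype_probs(risk_allele_freq, prevalence, gamma_ab, gamma_bb):
    """True genotype probabilities in cases and controls (simplified sketch).

    Assumes Hardy-Weinberg proportions and that the marker is the disease locus.
    Genotype order: zero, one, two copies of the risk allele.
    """
    d = risk_allele_freq
    population = [(1 - d) ** 2, 2 * d * (1 - d), d ** 2]
    relative_risk = [1.0, gamma_ab, gamma_bb]

    # Baseline penetrance chosen so that overall disease prevalence is matched.
    baseline = prevalence / sum(g * rr for g, rr in zip(population, relative_risk))
    penetrance = [baseline * rr for rr in relative_risk]
    if max(penetrance) > 1:
        raise ValueError("relative risks imply a penetrance above 1")

    # Bayes' rule: genotype distribution conditional on phenotype.
    cases = [g * f / prevalence for g, f in zip(population, penetrance)]
    controls = [g * (1 - f) / (1 - prevalence) for g, f in zip(population, penetrance)]
    return cases, controls


# Example: prevalence 2.5%, risk allele frequency 5%, additive model with gamma = 2.
case_probs, control_probs = case_control_genotype_probs(0.05, 0.025, 2.0, 3.0)
```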
Step 2. The true genotype probabilities (pi and qi, i=1,2,3) were then adjusted to reflect the genotyping error rate, yielding the observed single and duplicate genotyping probabilities. See Tintle et al. (2007) for details.
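Step 2 can be sketched under an explicitly assumed error model. Here each genotype is misread with probability `error_rate`, split equally between the two other genotypes; this symmetric model is an illustration only and may differ from the model in Tintle et al. (2007). The two genotyping runs are treated as independent given the true genotype, in line with the independence assumption stated later in the paper. All names are hypothetical.

```python
def observed_probs(true_probs, error_rate):
    """Observed single and duplicate genotype probabilities (assumed error model)."""
    # misread[g][i] = P(observe genotype i | true genotype g), symmetric errors.
    misread = [[1 - error_rate if g == i else error_rate / 2 for i in range(3)]
               for g in range(3)]

    # One genotyping run: marginalize over the true genotype.
    single = [sum(true_probs[g] * misread[g][i] for g in range(3))
              for i in range(3)]

    # Two runs, independent given the true genotype.
    duplicate = [[sum(true_probs[g] * misread[g][i] * misread[g][j] for g in range(3))
                  for j in range(3)]
                 for i in range(3)]
    return single, duplicate


single_obs, duplicate_obs = observed_probs([0.81, 0.18, 0.01], 0.01)
```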
Step 3. For given values of k, r, and n, and the observed single and duplicate genotyping probabilities found in Step 2, entries in Tables 1a and 1b were randomly simulated. For each combination of parameter values in Table 2, 2,000 random tables were simulated. In cases where γ = 1 (null hypothesis is true), the type I error rate was analyzed by comparing the nominal significance level α (we examined 0.05, 0.005, and 0.0002) with the empirical α level. In cases where γ = 1.25 or γ = 2.00 (i.e. the alternative hypothesis is true), the empirical power was compared to the theoretical power (see Equation (A3) in the Appendix).
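The random generation of table entries in Step 3 amounts to multinomial sampling. The sketch below draws the single and duplicate genotype counts for one phenotype group using only the Python standard library; the function name, the rounding of n·r, and the example probabilities are assumptions for illustration.

```python
import random


def simulate_tables(n, r, single_probs, duplicate_probs, rng):
    """Draw single (length 3) and duplicate (3x3) genotype counts for one group."""
    n_duplicate = round(n * r)      # individuals genotyped twice
    n_single = n - n_duplicate      # individuals genotyped once

    single = [0, 0, 0]
    for genotype in rng.choices(range(3), weights=single_probs, k=n_single):
        single[genotype] += 1

    # The nine (first call, second call) cells form one multinomial draw.
    cells = [(i, j) for i in range(3) for j in range(3)]
    weights = [duplicate_probs[i][j] for i, j in cells]
    duplicate = [[0] * 3 for _ in range(3)]
    for i, j in rng.choices(cells, weights=weights, k=n_duplicate):
        duplicate[i][j] += 1
    return single, duplicate


# Hypothetical observed probabilities (not values from Table 2).
example_single = [0.80, 0.18, 0.02]
example_duplicate = [[0.790, 0.005, 0.000],
                     [0.005, 0.175, 0.003],
                     [0.000, 0.003, 0.019]]
sim_single, sim_duplicate = simulate_tables(
    1000, 0.25, example_single, example_duplicate, random.Random(1))
```

Repeating the draw for cases and controls and for each replicate gives the collection of simulated tables on which rejection rates are counted.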
We completed a computational study comparing theoretical power values for different values of r (duplicate genotyping percentage), c (relative genotyping costs) and other parameters. Table 3 shows the settings used for this study. We examined all 10,368 possible combinations of parameter values based on Table 3.
The computational study was carried out as follows:
Zheng and Gastwirth (2006) present a test statistic for the LTT as $Z = U/\sqrt{V}$, where U is the trend statistic computed from the 2x3 contingency table and V is an estimate of the variance of U. Tintle et al. (2007) showed that, by using the allocation strategy (Equation (1)), Tables 1a and 1c estimate the same quantities. Thus, the numerator of the Zheng and Gastwirth Z statistic becomes:
According to the Central Limit Theorem, Table 1c has an approximately multivariate normal distribution (see also Tintle et al. 2007). The expected value of Ud under the null hypothesis (p*i = q*i for all i) is zero (see Equation (A1) in the Appendix). Thus, $LTT_d = U_d / \sqrt{\mathrm{Var}(U_d)}$ has an approximately standard normal distribution and, therefore, $(LTT_d)^2$ has an approximately $\chi^2$ distribution with 1 degree of freedom. Following the results of Zheng and Gastwirth (2006), the expression for Var(Ud) is given in Equation (A2) in the Appendix. Equation (A2) also accounts for additional covariance between cells in Table 1c from using the allocation strategy.
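Once a standardized statistic is in hand, the asymptotic two-sided p-value follows directly from the standard normal distribution. The sketch below shows the conversion; it is equivalent to the upper tail of a χ² distribution with 1 degree of freedom evaluated at the squared statistic. The function name is hypothetical.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a statistic that is standard normal under the null.

    Equals P(chi-square with 1 df > z**2), since erfc(|z| / sqrt(2)) = 2 * (1 - Phi(|z|)).
    """
    return math.erfc(abs(z) / math.sqrt(2))


p_at_critical_value = two_sided_p_value(1.959964)  # close to 0.05
```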
As described earlier (Methods: Simulation study), a simulation study was conducted to ensure that nominal type I and type II error rates obtained using the asymptotic theory of the LTTd were maintained in practice. First we consider the distribution of LTTd if the null hypothesis is true (γ = 1) and then the distribution of LTTd if the alternative hypothesis is true (γ ≠ 1).
For each combination of parameter values, a 99% confidence interval was found for the empirical α. For both the dominant and additive models, nominal type I error rates were maintained empirically regardless of sample size since an expected number of simulation settings had a 99% confidence interval on the empirical α that did not contain the nominal α (1.2% and 1.2% for dominant and additive, respectively, for α=0.05, 1.2% and 1.3% for the α=0.005 level and 1.1% and 0.7% for the α=0.0002 level). Nominal type I error rates were maintained empirically for the recessive model as long as the minimum cell count in Table 1c was at least 5 (detailed results not shown).
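The check described above can be sketched as a confidence interval on a rejection proportion. The paper does not state which interval was used; the normal-approximation (Wald) interval below is an assumption, as are the function name and the example counts.

```python
import math
from statistics import NormalDist


def proportion_ci(rejections, replicates, level=0.99):
    """Normal-approximation confidence interval for an empirical rejection rate."""
    p_hat = rejections / replicates
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / replicates)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Hypothetical setting: 110 rejections in 2,000 simulated tables at nominal alpha = 0.05.
lower, upper = proportion_ci(110, 2000)
nominal_alpha_covered = lower <= 0.05 <= upper
```

If the nominal level is maintained, roughly 1% of settings are expected to produce a 99% interval that misses the nominal α, which is the benchmark against which the reported percentages are read.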
The LTTd statistic generally gives comparable theoretical and empirical power values across all simulation settings for the additive and dominant models as long as expected cell counts in Table 1c are at least 5. For each combination of parameter values, a 99% confidence interval was placed on the empirical power. For both the dominant and additive models, when the minimum cell count in Table 1c was at least 5, an expected number of simulation settings had a 99% confidence interval on the empirical power that did not contain the theoretical power (0.9% and 1.2% for dominant and additive, respectively, for α=0.05, 1.3% and 0.9% for the α=0.005 level and 1.7% and 1.3% for the α=0.0002 level). When the minimum cell count was less than 5 in the dominant and additive models, the empirical power was often still very close to theoretical power (results not shown). The recessive model with small δV showed significant differences between theoretical and empirical power (detailed results not shown), though theoretical power and empirical power were similar for larger values of δV.
Based on the simulation study, differences between theoretical and empirical type I and type II errors are possible when the recessive disease model is used, in cases where at least one cell count in the grouped table is less than 5, or in cases where the total sample is less than 1,000 individuals. In these cases we recommend estimating p-values for the LTTd by permuting phenotype status instead of using the asymptotic theory provided above. A permutation-based p-value is available in our software (see Results: Software).
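A permutation p-value obtained by shuffling phenotype status can be sketched as follows. The sketch is not the companion software: it uses the absolute difference in mean genotype score between cases and controls as the test statistic (rather than the standardized LTTd), gives an inconsistently duplicate genotyped individual the average of the two observed scores (matching the 0.5 weights), and adds one to numerator and denominator, a common convention. All names and data are hypothetical.

```python
import random


def permutation_p_value(scores, is_case, n_perm=2000, seed=0):
    """Permutation p-value for a difference in mean genotype score.

    scores:  one genotype score per individual (average of two scores if the
             individual was duplicate genotyped inconsistently).
    is_case: one boolean per individual, True for cases.
    """
    rng = random.Random(seed)

    def statistic(labels):
        case_scores = [s for s, flag in zip(scores, labels) if flag]
        control_scores = [s for s, flag in zip(scores, labels) if not flag]
        return abs(sum(case_scores) / len(case_scores)
                   - sum(control_scores) / len(control_scores))

    observed = statistic(is_case)
    labels = list(is_case)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute phenotype status, keep genotypes fixed
        if statistic(labels) >= observed - 1e-12:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_perm + 1)


# Hypothetical data: 50 cases and 50 controls with clearly different scores.
example_scores = [1.0] * 40 + [0.0] * 10 + [0.0] * 40 + [1.0] * 10
example_labels = [True] * 50 + [False] * 50
p_perm = permutation_p_value(example_scores, example_labels, n_perm=500)
```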
In Tintle et al. (2007), duplicate genotype data from a case-control study on bipolar disorder was presented for a SNP with inconsistently genotyped individuals where all individuals were duplicate genotyped. We present this data here (Table 4) in the form of Table 1b using a linear trend test for analysis to demonstrate the utility of the methods just developed.
Tintle et al. (2007) report a p-value of 0.061 from the test ignoring inconsistently identified individuals and 0.064 from the test including inconsistencies. Using our software and assuming an additive mode of inheritance, the linear trend test just presented yields a p-value of 0.0230 when ignoring inconsistently genotyped individuals, 0.0241 when including them using the method shown above, and 0.0245 using a permutation test with 2,000 permutations.
Initially, we consider an instance of including previously collected quality control data in the test of association. In every case examined, the power of the LTTd is higher when the duplicate genotype data are included than when they are not. In other words, it is better to include the duplicate genotype data in subsequent tests of association than to ignore inconsistencies and treat the data as missing. This result is consistent with the results of Tintle et al. (2007) for the test of association.
The most important case, however, is when c>0. That is, when we view the collection of duplicate genotype data as an a priori design decision, and thus must account for the cost of collecting the duplicates for a fraction, r, of the sample.
Given a fixed budget, in 49.2% of cases examined (see Table 3) where c>0, duplicate genotyping the entire sample (r=1) was found to be the most cost-effective design strategy (yields the highest power). In all remaining cases, r=0 provided the highest power. Thus, the optimal strategy is always “all or nothing.”
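The fixed-budget trade-off can be made concrete with a small sketch. It assumes a simple cost model in which phenotyping and sample acquisition cost one unit per individual and each genotyping run costs c units, so an individual costs 1 + c when genotyped once and 1 + 2c when genotyped twice. This cost model and the function name are assumptions for illustration, not the exact model in the companion software.

```python
def affordable_sample_size(n_single, c, r):
    """Sample size affordable when a fraction r is duplicate genotyped.

    n_single: sample size affordable with no duplicate genotyping.
    c:        cost of one genotyping run relative to phenotyping/acquisition.
    r:        fraction of the sample genotyped twice.
    """
    budget = n_single * (1 + c)          # total budget in per-individual units
    cost_per_individual = 1 + c * (1 + r)
    return budget / cost_per_individual


# With c = 0.01, duplicate genotyping everyone costs about 1% of the sample.
n_all_duplicated = affordable_sample_size(1000, 0.01, 1.0)
```

The design question is then whether the power gained from fewer misclassified genotypes outweighs the power lost from the smaller sample.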
In order to characterize situations where duplicate genotyping will be cost-effective, logistic regression models were used with all parameters predicting whether or not duplicate genotyping the entire sample was the most cost-effective design. Three parameters (δV, c and the genotyping error rate) had the strongest relationship with cost-effectiveness. Relative cost, c, had the strongest relationship (Wald χ2=1424.2, p<0.0001), genotyping error rate also had a very strong relationship (Wald χ2=1259.2, p<0.0001) and minor allele frequency (δV) was also strongly related (Wald χ2=465.6, p<0.0001). As minor allele frequency (δV) declined, costs (c) declined, or the genotyping error rate increased, duplicate genotyping the entire sample was more likely to be the optimal design decision.
Table 5 shows the percentage of cases examined in the computational study where duplicate genotyping the entire sample is the most effective design decision for different values of c (relative genotyping costs) and the genotyping error rate.
Table 5 demonstrates a general rule of thumb: duplicate genotyping the entire sample will always be cost-effective (regardless of δV) if c is less than or equal to the genotyping error rate. Table 5 also demonstrates that duplicate genotyping is sometimes cost-effective when c exceeds the genotyping error rate. While details are not shown in the table, this occurs in particular when δV is small.
Table 6 provides power values for a specific example. Specifically, we present power under different values of δV, the genotyping error rate, r and c for a disease with a prevalence of 2.5%, disease allele frequency of 5%, equal number of cases and controls (k=1), and a SNP marker and disease allele that are in perfect linkage disequilibrium (ρ = 1).
The sample size needed to yield 80% power was calculated assuming there was no genotyping error. Column III shows the power for that sample size, after taking the genotyping error into account. When the marker frequency is low and/or the genotyping error rate is larger, Column III demonstrates that power can be significantly impacted by genotyping errors. Columns IV–VII then reduce the sample size to maintain the budget, reflecting the additional cost of collecting duplicate genotype data on all samples at different genotype costs (c). As c decreases, the genotyping error rate increases and δV decreases, Columns IV–VIII demonstrate that duplicate genotyping becomes more cost-effective. We note that in no case does duplicate genotyping meet or exceed the error-free power of 80% (detailed results not shown); however, duplicate genotyping can successfully mitigate some of the power loss due to genotyping errors.
Note that the power values in Table 6 are from a specific example. Please use our software (Results: Software) to investigate power at values specific to your research situation while keeping in mind the rule of thumb presented in Table 5.
In practice, duplicate genotyping should be considered when relative genotype to phenotype/sample acquisition costs do not exceed the expected SNP genotyping error rate. A more detailed treatment of practical considerations when using duplicate genotyping is provided in Tintle et al. (to appear). We summarize three main considerations here.
First, calculations provided in this manuscript consider only a single SNP. However, in practice, the decision to duplicate genotype will need to be made for an entire set of SNPs (e.g. all of the SNPs on a chip). In these cases, using the same rule of thumb (duplicate genotype if c is less than or equal to the genotyping error rate) is appropriate, where the error rate used is the minimum error rate expected for any single SNP.
Second, if there is concern that some samples may be of low quality and, thus, have higher genotyping error rates than other samples (a violation of genotyping error assumption #2), the error rate used in the rule of thumb should be the minimum expected for the high quality samples. Note, however, that we are still assuming non-differential errors in this case. Differential errors may increase the type I error rate, and are not considered in this manuscript.
Third, GWA studies are typically conducted in two stages where all markers are genotyped on a sample of individuals, and then a subset of the markers is genotyped on a sample of additional individuals. When considering the use of duplicate genotyping in two-stage studies, the decision on the use of duplicate genotyping should be made separately at each stage since the relative cost of genotyping to phenotyping will be different at each stage.
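Applied to a set of SNPs, the rule of thumb compares the relative cost c with the smallest per-SNP error rate expected. The sketch below encodes only that comparison; the function name and the example error rates are hypothetical, and a negative answer does not rule out cost-effectiveness, since the rule is conservative.

```python
def rule_of_thumb_favors_duplicates(c, expected_error_rates):
    """True when c does not exceed the smallest expected per-SNP error rate."""
    return c <= min(expected_error_rates)


# Hypothetical chip with expected error rates between 0.1% and 1%.
chip_error_rates = [0.001, 0.002, 0.01]
cheap_genotyping = rule_of_thumb_favors_duplicates(0.0005, chip_error_rates)
costly_genotyping = rule_of_thumb_favors_duplicates(0.005, chip_error_rates)
```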
To facilitate the utilization of the methods discussed in this paper, we provide two companion pieces of software for this work. The first computes the LTTd statistic and provides an asymptotic and permutation p-value. The second provides power computations for different genotyping costs, allele frequencies and error rates to assist in the duplicate genotyping design decision. Software is available at http://math.hope.edu/tintle/duplicate.html (source code written in R).
This work demonstrates how duplicate genotype data can be included in a linear trend test (LTT) of genetic association. Duplicate genotype data are included in the LTT through a weighting strategy and a subsequent adjustment of the variance of the LTT statistic yielding the LTTd. We demonstrate via simulation that the asymptotic null and alternative distributions of the LTTd statistic are obtained with reasonably small sample sizes in most cases. Both asymptotic and permutation test p-values are available in the free companion software.
We demonstrate that in the case of no duplicate genotyping costs (e.g. the data has already been collected) including the duplicate data in the LTTd always increases statistical power. This confirms a similar result in Tintle et al. (2007).
We also consider the cost-effectiveness of designing a study to collect duplicate genotype data, and find that when the relative cost of genotyping to phenotype/sample acquisition costs (c) is less than or equal to the genotyping error rate, collecting duplicate genotype data on the entire sample is cost-effective. Further, we find that the optimal amount of duplicate genotyping, in these cases, will always involve duplicate genotyping the entire sample. In a two-stage GWA study for a complex disease, if a relatively small set of SNPs is being followed up at stage 2 and it will be costly to enroll more subjects, duplicate genotyping may be cost-effective since relative genotyping to phenotyping/acquisition costs c will be low.
Since the rule of thumb just described is conservative, it is important to note that duplicate genotyping will be cost-effective in many situations when c exceeds the genotyping error rate. This rule was provided to allow researchers to quickly assess the cost-effectiveness of duplicate genotyping on a large scale. It is quite likely that, even if c exceeds the genotyping error rate, duplicate genotyping may provide moderate power gains for SNPs with low minor allele frequency. Our software should be used to determine cost-effectiveness of duplicate genotyping for specific experimental conditions.
We assume that genotyping errors are independent from the first to second genotyping (genotyping error assumption #3) and that genotyping error rates are non-differential (genotyping error assumption #2). Future work is needed to extend results to consider differential genotyping errors when duplicate genotyping. Further reading on sources of genotyping error and their impact on analyses can be found in Bonin et al. (2004) and Gordon and Finch (2005). We also assume that duplicate genotyping is applied to a random subsample of size nr. Further work is necessary to explore optimizing the value of r depending upon phenotype or initial genotype classification.
When collected, duplicate genotype data should always be included in the subsequent test of association and in many realistic cases duplicate genotype data should be collected on the entire sample.
Bryce Borchers, Marshall Brown and Brian McLellan contributed equally to this work. This project was funded in part by a grant from the National Institutes of Health, R15-HG004543. The content is solely the responsibility of the authors and does not necessarily represent the official view of the National Human Genome Research Institute or the National Institutes of Health. Additionally, this project received support from the Tanis Fund for Statistics Research.
Below we show how to find Var(t′1) and Cov(t′1, t′2); other terms can be found similarly.
Where we make the following substitutions in E[t′1, t′2] as appropriate:
Following the results of Zheng and Gastwirth (2006), asymptotic power for the LTTd can be computed using the following formula
Where and μd=E(Ud).
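For orientation, the general shape of an asymptotic power calculation for a two-sided test based on a normally distributed statistic can be sketched as below. This is the textbook normal-approximation form with a single standard deviation for the statistic; it is not a transcription of Equation (A3), and the argument names `mu` (the mean of the statistic under the alternative) and `sigma` (its standard deviation) are assumptions.

```python
from statistics import NormalDist


def two_sided_power(mu, sigma, alpha=0.05):
    """Approximate power of a two-sided level-alpha test of a normal statistic.

    mu:    expected value of the unstandardized statistic under the alternative.
    sigma: standard deviation of the statistic (assumed equal under null and
           alternative in this simplified sketch).
    """
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    shift = mu / sigma
    # Probability of falling in the upper or the lower rejection region.
    return normal.cdf(-z_crit + shift) + normal.cdf(-z_crit - shift)


# With mu = 0 the rejection probability reduces to alpha.
power_under_null = two_sided_power(0.0, 1.0, alpha=0.05)
```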