There are two major causes of missing calls. One is poor DNA sample quality: such samples often fail to amplify and to generate fluorescence signals strong enough to rise above the background. The other arises when an observation, i.e., a readout of fluorescence signals, cannot be assigned unequivocally to any genotype cluster and is therefore subjected to the 'no-call' procedure. In this report, we focus mainly on missing calls caused by the failure to assign an observation to any genotype cluster.
Nature of no-calls: results of sequencing
To evaluate the nature of no-calls in practice, four widely used high-throughput genotyping platforms were included in this study: GenomeLab™ SNPstream Genotyping System (Beckman Coulter, Los Angeles), BeadLab SNP Genotyping System (Illumina, San Diego), TaqMan® SNP Genotyping Assays (ABI, Foster City) and GeneChip® Human Mapping 500 K Array Set (Affymetrix, Santa Clara). Eight SNPs were selected, and their equivocal observations (no-calls) were regenotyped by sequencing. The criteria for selecting SNPs and samples for sequencing are presented in Methods.
At each locus, the genotype distribution of the called data produced by the respective genotyping technology was compared with that of the no-calls, whose genotypes were obtained by sequencing (Table ). Statistically significant differences were observed for SNPstream, Illumina and GeneChip 500 K, indicating that MCB indeed exists in widely used genotyping technologies and would lead to biased estimates of allele and genotype frequencies. The genotype-specific call-rates c_i (i = AA, Aa, aa) were calculated; most were above 0.95, but the call-rate could be as low as 0.75 for GeneChip 500 K.
The regenotyping results for the different genotyping platforms.
In the subsequent sections, to explore the effects introduced by MCB, we propose a model of the nature of no-calls. Typically, an equivocal observation arises as follows (Fig. ). For the data points that lie between the cluster of homozygotes of minor alleles (AA) and the cluster of heterozygotes (Aa), the real genotypes could be either homozygotes of minor alleles (Scenario I) or heterozygotes (Scenario II). For those that lie between the cluster of homozygotes of major alleles (aa) and the cluster of heterozygotes (Aa), the real genotypes could be either heterozygotes (Scenario III) or homozygotes of major alleles (Scenario IV). When the observations that cannot be called unequivocally are discarded, Scenario II is equivalent to Scenario III. To facilitate discussion, we assume that no-calls occur in only one specific genotype, with genotype-specific call-rate c (0 ≤ c ≤ 1).
Figure 1 The sketch of genotyping calling. The points shown in 'x' represent no-calls due to the failure of being assigned to any clusters of genotype unequivocally. When the data points lie between the cluster of homozygotes of minor alleles (AA) and that of (more ...)
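The no-call model above can be illustrated with a minimal sketch. It assumes that the true genotypes follow Hardy-Weinberg proportions, which is an assumption of this example rather than something stated in the text above; the function names are hypothetical. It drops a fraction (1 - c) of one genotype and reports the overall call-rate and the MAF that would be estimated from the remaining calls.

```python
def hwe_freqs(maf):
    """True genotype frequencies (AA, Aa, aa); A is the minor allele.

    Hardy-Weinberg proportions are an assumption of this sketch.
    """
    p, q = maf, 1.0 - maf
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def observed_under_mcb(maf, c, missing_genotype):
    """Discard a fraction (1 - c) of one genotype and renormalise.

    missing_genotype is "AA" (Scenario I), "Aa" (Scenarios II and III)
    or "aa" (Scenario IV).
    Returns (overall_call_rate, observed_freqs, estimated_maf).
    """
    kept = hwe_freqs(maf)
    kept[missing_genotype] *= c          # only this genotype loses calls
    call_rate = sum(kept.values())       # fraction of samples still called
    observed = {g: f / call_rate for g, f in kept.items()}
    estimated_maf = observed["AA"] + 0.5 * observed["Aa"]
    return call_rate, observed, estimated_maf


if __name__ == "__main__":
    for genotype in ("AA", "Aa", "aa"):
        rate, _, est = observed_under_mcb(0.35, 0.80, genotype)
        print(f"missing in {genotype}: call-rate={rate:.4f}, "
              f"estimated MAF={est:.4f}")
```

Under these assumptions, a locus with MAF = 0.35 and c = 0.80 in AA keeps an overall call-rate of about 0.9755, while the estimated MAF falls below 0.35.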
Effect of MCB on association studies
The development of high-throughput genotyping technologies has made association studies a widely used approach for identifying disease loci underlying complex traits. We now examine the effect of MCB on association studies using various disease models and statistical tests.
Power is of special importance in association studies [31]. To investigate the effect of MCB on the power of association studies, MCB was introduced into the disease models (see Methods). Here, the sample size is 500 for each of the case and control groups, and MCB is identical in both groups.
It has been commonly assumed that missing calls lead to power loss through decreased sample size, but this holds only in the absence of MCB. In the presence of MCB, power is affected by both the reduced sample size and the biased estimation of allele and genotype frequencies (see Fig. , Additional file 1). The change in power caused by MCB is usually larger than that caused by UBM. For example, for a locus with MAF = 0.25 (power ≈ 80% in the genotypic χ² test), the power loss caused by UBM is less than 5% under all of the disease models when c = 0.80, whereas the change can be around, or even more than, 10% under MCB (Table ).
The power under different modes of inheritance at a significance level of 0.05.
For the χ² test based on genotype frequencies, MCB always leads to power loss, in all scenarios and under all disease models, compared with the null (no missing calls) (see Fig. and Additional file 2). For the χ² test based on allele frequencies, however, power can even increase in some scenarios because of the biased estimation of allele frequency (see Fig. and Additional file 1). In the presence of MCB, the power of the genotypic χ² test appears to be more robust than that of the allelic χ² test (see Fig. , Additional file 1, and Table ).
The changes in power vary across disease models. In summary: for the dominant disease model (hAA), the power change is largest in Scenario IV (missing in aa); for the recessive disease model (hAA), it is largest in Scenario I (missing in AA); for the overdominant disease model (hAa), it is largest in Scenarios II & III (missing in Aa); and for the additive disease model (hAA), the power change in Scenario I is similar to that in Scenario IV, especially when c is small, even though the reduction in sample size differs greatly between the two (see Additional file 1). The influence under the multiplicative disease model (hAA) is similar to that under the additive model (Fig. ).
In addition, although the minor allele A is the susceptibility allele in the current disease-model settings (in the overdominant model, Aa is the susceptible genotype), the conclusions drawn above also hold when A is protective (data not shown). Moreover, in the disease model with equal penetrances (hAA = 0.01, hAa = 0.01, and haa = 0.01), the type-I error rate under MCB remains 0.05, indicating that MCB does not inflate the false positive rate of association studies, provided that the extent of missing calls is identical in cases and controls.
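The power comparisons in this section can be approximated analytically. The following is a minimal sketch, not the procedure described in Methods: it assumes Hardy-Weinberg proportions in the population, derives case and control genotype frequencies from the penetrances, and approximates the power of the genotypic χ² test (2 degrees of freedom) with a noncentral χ² distribution whose noncentrality is the Pearson statistic evaluated at the expected counts. All function names and the example penetrances are hypothetical.

```python
import math

# 5% critical value of a central chi-square with 2 df: exp(-x / 2) = 0.05
CRIT_DF2 = -2.0 * math.log(0.05)


def hwe_freqs(maf):
    p, q = maf, 1.0 - maf
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def case_control_freqs(maf, pen):
    """Genotype frequencies in cases and controls for penetrances pen[g]."""
    pop = hwe_freqs(maf)
    prevalence = sum(pen[g] * pop[g] for g in pop)
    cases = {g: pen[g] * pop[g] / prevalence for g in pop}
    controls = {g: (1.0 - pen[g]) * pop[g] / (1.0 - prevalence) for g in pop}
    return cases, controls


def noncentral_chi2_sf_df2(x, lam, terms=200):
    """P(X > x) for a noncentral chi-square (2 df, noncentrality lam).

    Poisson mixture of central chi-squares with 2 + 2j df, whose survival
    function is exp(-x/2) * sum_{k<=j} (x/2)^k / k!. Adequate for the
    moderate values of lam that arise with a few hundred samples.
    """
    half_x, half_lam = x / 2.0, lam / 2.0
    weight = math.exp(-half_lam)   # Poisson probability of j = 0
    term = 1.0                     # (x/2)^j / j!
    partial = 1.0                  # running sum of the terms up to j
    total = 0.0
    for j in range(terms):
        total += weight * math.exp(-half_x) * partial
        weight *= half_lam / (j + 1)
        term *= half_x / (j + 1)
        partial += term
    return total


def genotypic_power(maf, pen, n_cases=500, n_controls=500,
                    c=1.0, missing=None):
    """Approximate power of the genotypic test; MCB is equal in both groups."""
    cases, controls = case_control_freqs(maf, pen)
    table = []
    for n, freqs in ((n_cases, cases), (n_controls, controls)):
        row = {g: n * f for g, f in freqs.items()}
        if missing is not None:
            row[missing] *= c      # no-calls remove (1 - c) of one genotype
        table.append(row)
    grand_total = sum(sum(row.values()) for row in table)
    lam = 0.0
    for row in table:
        row_total = sum(row.values())
        for g in row:
            col_total = table[0][g] + table[1][g]
            expected = row_total * col_total / grand_total
            lam += (row[g] - expected) ** 2 / expected
    return noncentral_chi2_sf_df2(CRIT_DF2, lam)
```

With equal penetrances the noncentrality is zero whether or not one genotype loses calls, so the sketch returns 0.05 in both cases, in line with the type-I error result reported above. For example, `genotypic_power(0.25, {"AA": 0.0225, "Aa": 0.015, "aa": 0.01}, c=0.8, missing="AA")` can be compared with the same call without `c` and `missing` to see the power change for one hypothetical multiplicative model.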
Tradeoff between MCB and genotyping errors
In the previous section, we showed that MCB is common in current genotyping technologies and that it can seriously affect subsequent analyses and lead to false conclusions. The key issue is how to deal with the equivocal observations that are responsible for MCB. Two options are available. The first is to discard the observations of borderline quality using the 'no-call' procedure, which may lead to MCB. The second is to assign these observations to one of the genotypes at the cost of increased genotyping errors. Here, we compare the overall outcome (allele frequency estimation and power of association studies) of these two options and offer guidelines for minimizing, in different scenarios, the biases caused by the equivocal observations. In addition, we evaluate the overall call-rate and the genotyping error rate under each option in order to re-examine current QC standards.
To facilitate the presentation, the genotype-specific call-rate c was set to 0.80, a moderate MCB as shown previously. For the second option, we assumed all of the equivocal observations are called and the proportion of accurate calls among these equivocal ones is denoted by conf (0 ≤ conf ≤ 1). The genotyping error rate increases with decreasing conf. For instance, if conf = 1, it means all of the equivocal observations are called accurately; whereas, if conf = 0, all of the equivocal ones are misclassified.
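The second option can be sketched as follows. This is an illustration under two assumptions not spelled out in the paragraph above: true genotypes follow Hardy-Weinberg proportions, and a misclassified equivocal observation is assigned to the neighbouring cluster it lies next to (for example, a true AA in Scenario I is miscalled as Aa). The names `SCENARIOS` and `called_with_errors` are hypothetical.

```python
# (true genotype, genotype assigned when the call is wrong) per scenario.
# The adjacent-cluster rule is an assumption of this sketch.
SCENARIOS = {
    "I": ("AA", "Aa"),
    "II": ("Aa", "AA"),
    "III": ("Aa", "aa"),
    "IV": ("aa", "Aa"),
}


def hwe_freqs(maf):
    p, q = maf, 1.0 - maf
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def called_with_errors(maf, c, conf, scenario):
    """Call every equivocal observation; a fraction conf is called correctly.

    Returns (observed_freqs, genotyping_error_rate, estimated_maf).
    """
    true_g, wrong_g = SCENARIOS[scenario]
    freqs = hwe_freqs(maf)
    equivocal = (1.0 - c) * freqs[true_g]      # borderline observations
    miscalled = (1.0 - conf) * equivocal       # moved to the wrong cluster
    freqs[true_g] -= miscalled
    freqs[wrong_g] += miscalled
    estimated_maf = freqs["AA"] + 0.5 * freqs["Aa"]
    return freqs, miscalled, estimated_maf


if __name__ == "__main__":
    for conf in (0.0, 0.5, 1.0):
        _, err, est = called_with_errors(0.35, 0.80, conf, "I")
        print(f"conf={conf:.1f}: error rate={err:.4f}, estimated MAF={est:.4f}")
```

Because all observations are called, the overall call-rate stays at 1 and the cost appears only as the genotyping error rate, which grows linearly as conf decreases.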
Both MCB and genotyping errors can lead to inaccurate estimation of allele and genotype frequencies and, in turn, to distorted association results. When the 'no-call' procedure is applied to observations of borderline quality, we showed earlier that the bias in estimation is determined only by MAF and c. When the equivocal observations are called, the bias in allele frequency estimation depends on conf in addition to MAF and c. In particular, the bias, reflected by the change in the estimated MAF, increases with decreasing conf (see Additional file 3). For fixed MAF and c, the biases caused by MCB and by genotyping errors can be compared directly: the bias introduced by MCB is fixed, whereas the bias caused by genotyping errors changes monotonically with conf. Interestingly, when conf is large enough (above the solid line in Fig. ), the bias in MAF estimation caused by genotyping errors is smaller than that caused by MCB, so it is more beneficial to call the equivocal observations in this case (grey area above the line in Fig. ). However, when conf is below the solid line in Fig. , the bias caused by genotyping errors is larger, and the 'no-call' procedure is recommended (area below the line in Fig. ). It should be noted that the bias in allele frequency estimation caused by MCB is greatest in Scenario I, where it exceeds the bias caused by genotyping errors even at the highest error rate (conf = 0). Therefore, the 'no-call' procedure should not be used in Scenario I if the objective is to minimize the bias in MAF estimation (Fig. ).
Figure 3 Effects of MCB and genotyping errors on MAF estimation. A) illustrates the overall call-rate for the loci with different MAFs in the presence of MCB (c = 0.8). B) illustrates the threshold of conf by a solid line. If the equivocal observations can be (more ...)
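The conf threshold for MAF estimation can be reproduced in outline with a short sketch. It rests on the same assumptions as before, neither of which is taken from the text: Hardy-Weinberg proportions for the true genotypes, and misclassification into the adjacent cluster. Under these assumptions the bias from calling is proportional to (1 - conf), so the threshold follows from the ratio of the two biases. The helper names are hypothetical.

```python
# Fraction of minor alleles carried by each genotype.
DOSE = {"AA": 1.0, "Aa": 0.5, "aa": 0.0}


def hwe_freqs(maf):
    p, q = maf, 1.0 - maf
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def bias_no_call(maf, c, true_g):
    """Bias of the estimated MAF when equivocal observations are discarded."""
    freqs = hwe_freqs(maf)
    freqs[true_g] *= c
    total = sum(freqs.values())
    estimate = sum(DOSE[g] * f for g, f in freqs.items()) / total
    return estimate - maf


def bias_called(maf, c, conf, true_g, wrong_g):
    """Bias of the estimated MAF when equivocal observations are all called."""
    moved = (1.0 - c) * (1.0 - conf) * hwe_freqs(maf)[true_g]
    return moved * (DOSE[wrong_g] - DOSE[true_g])


def conf_threshold(maf, c, true_g, wrong_g):
    """Smallest conf at which calling is no more biased than no-call.

    A result of 0.0 means calling is never worse, whatever the accuracy.
    """
    worst_called = abs(bias_called(maf, c, 0.0, true_g, wrong_g))
    if worst_called == 0.0:
        return 0.0
    ratio = abs(bias_no_call(maf, c, true_g)) / worst_called
    return max(0.0, 1.0 - ratio)


if __name__ == "__main__":
    cases = {"I": ("AA", "Aa"), "II": ("Aa", "AA"),
             "III": ("Aa", "aa"), "IV": ("aa", "Aa")}
    for name, (true_g, wrong_g) in cases.items():
        thr = conf_threshold(0.25, 0.80, true_g, wrong_g)
        print(f"Scenario {name}: call if conf >= {thr:.3f}")
```

In this sketch the Scenario I threshold is 0 for any MAF up to 0.5, which agrees with the observation above that MCB is more biased than genotyping errors in Scenario I even when conf = 0.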
Next, we explore how association studies are affected by MCB and by genotyping errors. Here, MCB (c = 0.80) and genotyping errors (c = 0.80, 0 ≤ conf ≤ 1) are assumed to be identical for the case and control groups. MCB and genotyping errors were each introduced into the disease models with various modes of inheritance (Table ), as described in Methods. The power under MCB was discussed previously; the power under genotyping errors is shown in Additional file 3. For the χ² test based on genotype frequencies, genotyping errors sometimes have no effect on association studies, i.e., in Scenario I and Scenario II for the dominant disease model (hAA); otherwise, they may cause power loss. For the χ² test based on allele frequencies, they may occasionally cause a power gain because of the biased estimation of allele frequency. Although the effect of genotyping errors on power varies in a complicated way across scenarios and disease models, for a given scenario and disease model the power either does not change or changes monotonically with conf (see Additional file 3). Therefore, as in the analysis of allele frequency estimation above, a threshold of conf (indicated by the solid lines in Fig. , and Additional file 4B) is expected as well. If conf is below the threshold, the power loss caused by genotyping errors is larger than that caused by MCB, and the 'no-call' procedure should be used (area below the line in Fig. , and Additional file 4B). Otherwise, it is better to call the observations of borderline quality at the cost of genotyping errors (grey area above the line in Fig. , and Additional file 4B). For instance, for a locus with MAF = 0.35, when the 'no-call' procedure is applied to the equivocal observations in Scenario I (c = 0.80), the overall call-rate still reaches 97.6%. For the allelic χ² test under the multiplicative disease model, the power is 87.7% under MCB, compared with 92.1% for the null. However, if these equivocal observations are called, the power with genotyping errors is at least 88.1%, even when they are completely misclassified (conf = 0). This indicates that, to reduce the power loss caused by the equivocal observations, it is more beneficial to call them with a genotyping error rate of 2.5% than to apply 'no-call' with an overall call-rate of 97.5% (Fig. ).
Figure 4 Effects of MCB and genotyping errors on association studies under multiplicative disease model. A) illustrates the overall call-rate for the loci with different MAFs in the presence of MCB (c = 0.8). B) illustrates the threshold of conf by a solid line. (more ...)
As shown above, the commonly used 'no-call' principle for observations of borderline quality is not always the best choice. Weighing the effects on association studies and on allele frequency estimation, it is preferable to force the calling of equivocal observations, even though the calls can be erroneous, when they lie between the cluster of homozygotes of minor alleles and that of heterozygotes (i.e., in Scenario I & Scenario II). When the equivocal observations lie between the cluster of homozygotes of major alleles and the cluster of heterozygotes (i.e., in Scenario III & Scenario IV), the loss of power introduced by MCB is more pronounced than that introduced by genotyping errors if these observations can be called accurately; but when calling accuracy cannot be guaranteed (conf is small), the power loss is dominated by genotyping errors and it may be better to invoke the 'no-call' procedure. In addition, different disease models may call for different decisions on how to handle these equivocal observations. A program called QC-Tradeoff, which suggests whether the 'no-call' procedure should be applied in order to minimize the biases caused by the equivocal observations, is available online (http://humpopgenfudan.cn/en/resource/download.html).
In the above analyses of genotyping errors, we assumed, to facilitate the discussion, that all of the equivocal observations were called. Here, we extend this to a general model to explore the joint effects of MCB and genotyping errors. We assume that only (100 × α)% of the equivocal observations (0 ≤ α ≤ 1) are called, with accuracy still denoted by conf. The genotype frequencies in this joint model are given in Additional file 7. When α = 0, this model is equivalent to the model of MCB; when α = 1, it is equivalent to the model of genotyping errors. We introduced this joint model (c = 0.8; conf = 0.0, 0.25, 0.5, 0.75, 1.0; and 0 ≤ α ≤ 1) into the disease models listed in Table . Fig. and Additional file 6 illustrate the power of association tests in the presence of both MCB and genotyping errors, together with the corresponding overall call-rate and genotyping error rate. An interesting finding is that the effect of the equivocal observations on power always changes monotonically with α from 0 (the model of MCB) to 1 (the model of genotyping errors). This indicates that, to minimize the biases that the equivocal observations introduce into association studies, the appropriate procedure is either to no-call all of them, resulting in MCB, or to call all of them, at the cost of genotyping errors, as discussed above.
Figure 5 Joint effects of MCB and genotyping errors on association studies under multiplicative disease model. A) illustrates the overall call-rate for the loci with different values of α in the joint models of MCB and genotyping errors (MAF = 0.25, c (more ...)
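The joint model can be sketched in the same style. The frequencies actually used are those in Additional file 7, which is not reproduced here, so this is only an illustration under stated assumptions: Hardy-Weinberg proportions, misclassification into the adjacent cluster, and a genotyping error rate defined as the miscalled fraction among called genotypes. The function name `joint_model` is hypothetical.

```python
def hwe_freqs(maf):
    p, q = maf, 1.0 - maf
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def joint_model(maf, c, conf, alpha, true_g, wrong_g):
    """Call a fraction alpha of the equivocal observations, discard the rest.

    Returns (overall_call_rate, genotyping_error_rate, observed_freqs).
    """
    freqs = hwe_freqs(maf)
    equivocal = (1.0 - c) * freqs[true_g]
    dropped = (1.0 - alpha) * equivocal            # no-called (MCB part)
    miscalled = alpha * (1.0 - conf) * equivocal   # called, but wrongly
    freqs[true_g] -= dropped + miscalled
    freqs[wrong_g] += miscalled
    call_rate = sum(freqs.values())                # equals 1 - dropped
    observed = {g: f / call_rate for g, f in freqs.items()}
    error_rate = miscalled / call_rate             # among called genotypes
    return call_rate, error_rate, observed


if __name__ == "__main__":
    for alpha in (0.0, 0.5, 1.0):
        rate, err, _ = joint_model(0.25, 0.80, 0.5, alpha, "AA", "Aa")
        print(f"alpha={alpha:.1f}: call-rate={rate:.4f}, error rate={err:.4f}")
```

Setting alpha to 0 recovers the pure no-call case (error rate 0, reduced call-rate), and setting it to 1 recovers the pure genotyping-error case (call-rate 1), matching the two limits described above.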
Given how the bias in allele/genotype frequency estimation and in association studies depends on the magnitude of MCB and genotyping errors, it should be possible to develop a strategy that minimizes the bias by choosing proper cut-offs for call-rate and genotyping error rate (see Discussion).