Effect of allele frequency parameters on HWD
Our simulation shows that, in the absence of experimental errors, the HWD measure θ only increases with r, supporting the view that duplication acts in the direction of increasing observed heterozygotes.
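The direction of this effect can be illustrated with a minimal sketch. It is not the simulation used in this study. It assumes that each haplotype carries two copies with probability r, that every copy carries allele A independently with frequency p, and that an individual is called heterozygous whenever both alleles are present among its copies. The inbreeding coefficient stands in here for θ, and the function name is hypothetical.

```python
def observed_hwd(p, r):
    """Return (observed heterozygosity, inbreeding coefficient) under a toy model.

    Assumptions (illustrative, not taken from the study's simulation):
    - each haplotype carries two copies with probability r, one copy otherwise;
    - every copy carries allele A independently with probability p;
    - a genotype is called AA (or BB) only if all copies in the individual are A (or B).
    """
    q = 1.0 - p
    hap_all_a = (1.0 - r) * p + r * p * p   # haplotype shows only allele A
    hap_all_b = (1.0 - r) * q + r * q * q   # haplotype shows only allele B
    hom_a = hap_all_a ** 2
    hom_b = hap_all_b ** 2
    het = 1.0 - hom_a - hom_b               # both alleles present: called heterozygous
    p_obs = hom_a + het / 2.0               # allele frequency inferred from the calls
    expected_het = 2.0 * p_obs * (1.0 - p_obs)
    f = 1.0 - het / expected_het            # negative f means excess heterozygotes
    return het, f


for r in (0.0, 0.25, 0.5, 0.75, 1.0):
    het, f = observed_hwd(0.3, r)
    print(f"r={r:.2f}  observed het={het:.3f}  f={f:+.3f}")
```

Under these assumptions f is zero at r = 0 and becomes more negative as r grows, that is, the excess of observed heterozygotes increases with the frequency of the two-copy allele.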
Probability that an HWE-violating SNP is in a CNV
Our results suggest that copy number variation can be a major contributor to HWD, even assuming that CNVs tend to have small variant frequencies, especially at a low observed SNP minor allele frequency and a large sample size. Segmental duplication has a major effect at a higher observed SNP minor allele frequency. A genotyping error rate of about 1% made little difference to P(CNV|HWD). At a genotyping error rate of 5% or higher, CNV or SD is less likely to be the cause of HWD.
Our results show that the probability of a SNP being in a duplicated region given HWD depends on the observed allele frequency. When the observed minor allele frequency is high, HWD tends to be due to duplication, whereas when it is small, HWD is mainly due to SNP genotyping error and random variation. This is mainly because the effect of duplication can be buffered at low observed minor allele frequencies.
Hosking et al. analyzed 36 HWE-violating SNPs and concluded that 58% of these cases were due to genotyping errors. This is an average that does not depend on observed minor allele frequency, but it is consistent with our result under a 5% genotyping-error model. The authors found 14% ‘non-specific’ cases, in which a primer/probe set can bind to multiple regions in the genome. These 14% may fall within either annotated segmental duplications or copy number variations. The other 28% showed no identifiable reason for HWD. Some of these cases may belong to a previously unannotated SD or CNV.
Prior knowledge of r
For the prior distribution of r, we incorporated estimates from previous studies of CNVs. Fredman et al. estimated through an in silico analysis that 3.7% of validated SNPs and 13.1% of nonvalidated SNPs were found in segmental duplicons. We interpret this as 7% on average, considering that 65.2% of the SNPs used in their analysis were validated. This is similar to but slightly higher than the estimated proportion of SD in the genome. We simply used the reported genomic proportions of CNV or of SD as the prior probabilities of a SNP being in a CNV or an SD. Considering the previous reports that SNPs are enriched in CNVs, using the genomic proportions as prior probabilities is conservative in estimating the posterior probability of CNVs and SDs.
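The 7% figure is a weighted average of the two proportions quoted from Fredman et al. The short calculation below reproduces it from the numbers given above; the variable names are ours.

```python
# Figures quoted from Fredman et al. in the text above.
validated_share = 0.652          # fraction of SNPs that were validated
pct_validated_in_sd = 3.7        # % of validated SNPs in segmental duplicons
pct_nonvalidated_in_sd = 13.1    # % of nonvalidated SNPs in segmental duplicons

# Weight each proportion by the share of SNPs it applies to.
overall_pct = (validated_share * pct_validated_in_sd
               + (1.0 - validated_share) * pct_nonvalidated_in_sd)
print(f"{overall_pct:.2f}% of SNPs in segmental duplicons")
```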
Our beta prior assumes that about 50% of CNVs have a minor allele frequency (MAF) above 3.5%, and that about 13% and 1.5% have MAFs above 10% and 20%, respectively, which is approximately consistent with Iafrate et al.'s estimate: 12% of the CNVs identified by Iafrate et al. had >10% MAF and 3% had >20% MAF. More conservative estimates have been reported as well. A recent study of about 1200 North American individuals estimated that more than 93% of CNV regions (CNVRs) have less than 1% MAF; only 1% of the CNVRs analyzed had MAF >5%. The authors suggested that CNVs are not likely to affect SNP association studies seriously because of the low MAF. According to these recent estimates, a more realistic prior distribution of r would be even more skewed than the beta distribution that we have used. On the other hand, another recent study by Wong et al. detected 3,654 CNVs, 800 of which had at least 3% frequency, indicating a higher estimate for CNV minor allele frequencies.
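The tail probabilities quoted for the prior can be checked against any candidate beta distribution. The sketch below uses a hypothetical Beta(1, 19). The shape parameters are not stated in this section; they are an assumption chosen only because a first shape parameter of 1 gives a closed-form tail, P(X > x) = (1 - x)^b, and because the resulting values land near the quoted ones.

```python
def beta1_tail(x, b):
    """P(X > x) for X ~ Beta(1, b), whose survival function is (1 - x)**b."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    return (1.0 - x) ** b


B = 19  # hypothetical shape parameter, for illustration only
for threshold in (0.035, 0.10, 0.20):
    print(f"P(MAF > {threshold:.3f}) = {beta1_tail(threshold, B):.3f}")
```

With this assumed shape the three tails come out at roughly 0.51, 0.14 and 0.014, close to but not identical to the 50%, 13% and 1.5% described above.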
SNP genotyping errors
Genotyping error rates for Sequenom (San Diego, California, USA), Illumina (San Diego, California, USA) and other new methods were reported as less than 1% (personal communication, Cantor). Sources and types of genotyping errors may vary and such heterogeneous effects were not considered in our model.
Cox and Kraft showed that HWE tests have low power in detecting genotyping errors. This means that most genotyping errors do not cause departure from HWE. Our study indicates that once a SNP violates HWE, there is a good chance that genotyping errors, as well as segmental duplication or copy number variation, are involved when the genotyping error rate is above 5%. These two results are not contradictory; they view the problem from different angles. As seen in the likelihood of HWD given no CNV or SD (), the sensitivity of detecting genotyping errors using HWD is very low. However, the relative contribution of genotyping error can become large when other factors are even less likely to cause HWD.
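The distinction between a low likelihood and a large posterior share follows from Bayes' rule. The sketch below is not our model: the function name is hypothetical and the priors and likelihoods are made-up numbers chosen only to show how a scenario that rarely produces HWD can still account for most observed HWD when its prior is large.

```python
def posterior_by_cause(priors, likelihoods):
    """Bayes' rule over mutually exclusive explanations of an observed HWD.

    priors[c]      : prior probability of scenario c (should sum to 1)
    likelihoods[c] : P(HWD observed | scenario c)
    """
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(joint.values())
    return {c: joint[c] / total for c in joint}


# Hypothetical numbers chosen only to illustrate the argument.
priors = {"duplicated": 0.05, "not duplicated": 0.95}
likelihoods = {"duplicated": 0.50, "not duplicated": 0.06}  # 0.06: error and chance only
for cause, prob in posterior_by_cause(priors, likelihoods).items():
    print(f"P({cause} | HWD) = {prob:.2f}")
```

With these inputs HWD is rare in the absence of duplication, yet the non-duplicated scenario still takes most of the posterior mass because its prior is so much larger.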
HWE violation and association studies
Hunter et al. proposed including HWE-deviated SNPs in case-control association studies because association tests do not assume HWE. Trikalinos et al., however, showed through a meta-analysis of 591 previous association studies that HWE-violating samples gave significantly different results in association testing. Taken together, we adopt the view that association tests do not assume HWE but can be affected by HWD, partly because these tests do assume that the SNPs are not in duplicated regions. Thus, it seems useful to know the effect of duplication on the HWE of a SNP.
Independence and HWE assumptions
Although at least some CNVs are generated in tandem, the extent to which tandem and interspersed duplications contribute to the entire CNV space is unknown. As for segmental duplication, 45% and 47% are tandem and interchromosomal, respectively, in humans, indicating the possibility of abundant interspersed CNVs. Our assumption of independence between duplicate sites may not hold if they are tandem and in linkage disequilibrium.
In addition, we assumed that an underlying CNV itself is under HWE. Sebat et al. suggest that CNVs might be under negative selection. A recent survey of experimentally identified CNVs by Nguyen et al. revealed that human CNVs are significantly enriched in telomeric and centromeric regions and in protein-coding genes, indicating nonneutral evolution of CNVs. However, the extent to which such selective pressures can affect the HWE of a CNV has yet to be studied.
Nguyen et al. also revealed that CNVs are associated with high synonymous and nonsynonymous substitution rates, indicating that the assumption of a constant SNP rate across duplicated and nonduplicated regions may not hold. Other factors may also affect the priors for SNP allele frequencies, including nonuniform allele frequency distribution and gene conversion.
Our model assumes duplication, genotyping error and random variation as the only sources of HWD. In reality, there are other sources of HWD. One of them is noise in the actual population. Shoemaker et al. noted that a population is not under perfect Hardy-Weinberg equilibrium. In their analysis, the authors used an inbreeding coefficient of |fA| < 0.03 as the limit of HWD in human populations, as suggested by a National Research Council report (National Research Council 1996). The inbreeding coefficient is one of the proposed measures of HWD, and fA = 0 indicates HWE. Our study assumes that the population is under perfect HWE at each locus. Sampling of individuals in real experiments is not perfectly random and can be another source of bias.
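For reference, the inbreeding coefficient can be computed directly from genotype counts. The sketch below assumes the standard definition, one minus the ratio of observed to HWE-expected heterozygosity, which may differ in detail from the estimator used by Shoemaker et al.; the function name and the counts are made up for illustration.

```python
def inbreeding_coefficient(n_aa, n_ab, n_bb):
    """f = 1 - (observed heterozygosity) / (heterozygosity expected under HWE)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # sample frequency of allele A
    expected_het = 2 * p * (1 - p)
    if expected_het == 0:
        raise ValueError("monomorphic sample: f is undefined")
    return 1 - (n_ab / n) / expected_het


# Made-up counts: exact HWE proportions, then an excess of heterozygotes.
for counts in ((250, 500, 250), (200, 600, 200)):
    f = inbreeding_coefficient(*counts)
    print(counts, f"f={f:+.3f}", "within |f| < 0.03" if abs(f) < 0.03 else "outside limit")
```

A negative value indicates an excess of heterozygotes, which is the direction expected from duplication, and a positive value indicates a deficit.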
Our model does not consider the effect of population admixture. Population admixture is an important confounding factor in case-control studies, and it is known to cause deviation from HWE, as we mentioned in the background section of our manuscript. Nevertheless, with sample size <1000, population admixture can be detected by HWE testing only when f>0.4 and k>0.2, where f is the allele frequency difference between the mixed populations and k is the proportion of the minor population. A recent study indicates that most populations do not satisfy this criterion. Thus, we assume that population admixture has a minor effect on HWE in most populations. It would be helpful to incorporate the admixture effect into our model once we obtain sufficient knowledge about the degree of population difference in CNVs. Our study focuses on the relative contribution of genotyping errors and the duplication effect.
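Why admixture needs a large allele frequency difference to be visible can be seen from the size of the heterozygote deficit it produces. The sketch below pools two populations that are each in HWE and reports the apparent inbreeding coefficient, written F here to avoid confusion with the allele frequency difference f in the text. It illustrates the magnitude of the effect only and does not reproduce the cited power analysis; the allele frequencies are hypothetical.

```python
def admixture_inbreeding(p1, p2, k):
    """Apparent inbreeding coefficient when two HWE populations are pooled.

    p1, p2 : frequency of allele A in the major and the minor population
    k      : proportion of the sample drawn from the minor population
    """
    p_bar = (1 - k) * p1 + k * p2
    het_pooled = (1 - k) * 2 * p1 * (1 - p1) + k * 2 * p2 * (1 - p2)
    het_expected = 2 * p_bar * (1 - p_bar)
    return 1 - het_pooled / het_expected


# Hypothetical allele frequencies with differences of 0.1 and 0.4 between populations.
for p1, p2, k in ((0.5, 0.4, 0.2), (0.7, 0.3, 0.2)):
    print(f"diff={abs(p1 - p2):.1f}  k={k}  F={admixture_inbreeding(p1, p2, k):.3f}")
```

The deficit scales with k(1 - k) times the squared allele frequency difference, so a modest difference between the mixed populations yields a value of F that is very small.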
Our study shows that the degree of HWD increases with respect to r, the frequency of two-copy alleles. Duplication acts in the direction of increasing observed heterozygotes. The results of our Bayesian analysis suggest that copy number variation can be a major contributor to HWD, when sample size is large and genotyping error is small. The relative contribution of CNV and SD to HWD varies with observed SNP allele frequency.