The family-based association study design is popular because it controls for population admixture/substructure by using the non-transmitted parental alleles as control alleles. In family-based studies, genotype data from unrelated individuals are usually unavailable for evaluating deviations from HWE. It is therefore common practice to carry out genotype quality control by testing for deviations from HWE in the parental or unaffected sibling genotype data. When trio data are used in association studies, the parents included in the analysis are not phenotyped and can be either unaffected or affected for the trait under study. Even when parents of affected probands are truly unaffected, they have a higher probability than the general population of being susceptibility allele carriers. For a fixed trait prevalence, this probability increases with increasing genotypic RR. In family-based studies, unaffected siblings are especially useful when parental data are missing. In case-control studies, unaffected siblings are not commonly used as controls because power is reduced compared with analyses that use unrelated controls. An exception is the study of dizygotic twins, where the unaffected co-twin is employed as a control. The advantage of this design is that cases and controls are matched on environmental factors, since co-twins share many environmental and intrauterine exposures.
For most current genome-wide association studies, a large sample and a small α value (i.e. ≤1 × 10−7) are used to provide adequate power to detect associations while guarding against false positive results due to multiple testing. However, even studies with thousands of subjects are often underpowered at genome-wide significance levels when the genotypic RR is low (≤1.2). In this study we used a small α value, i.e. 1 × 10−7, and a large sample size, i.e. 5,000 trios. This sample size was selected to provide sufficient power to detect an association for a wide variety of genotypic RRs and allele frequencies.
Although testing for deviations from HWE in genotype data from controls or unaffected family members is often used as quality control to detect markers with genotyping error, deviation from HWE can also be caused by other factors. In this study it is demonstrated that family ascertainment can also cause deviations from HWE in the genotype data of parents and unaffected siblings at the disease/trait susceptibility locus. Two measures are calculated: the HWD coefficient and the power to reject the null hypothesis of HWE. It is shown that the power to detect a deviation from HWE due to a true association is negligible for a sample of 5,000 trios at an α level of 1 × 10−7. For a specific HWD coefficient and allele frequency, the power varies with sample size and α level. The genotypic RR also plays an important role in the strength of deviation from HWE, with higher genotypic RRs causing larger deviations of the HWD coefficient from 0. For the parental genotype data for 5,000 trios, under a multiplicative model with an allele frequency of 0.45 and a genotypic RR of 1.5, the power is 8.3 × 10−6 at an α level of 1 × 10−7 and increases to 0.175 at an α level of 0.05. Likewise, sample size affects power: for the same example at an α level of 1 × 10−7, the power is 5.6 × 10−7 for 1,000 trios and increases to 5.15 × 10−5 for 10,000 trios.
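The dependence of power on the HWD coefficient, allele frequency, sample size and α level can be sketched numerically. The example below is a minimal sketch and not necessarily the calculation used in this study: it assumes the usual asymptotic 1-df chi-square test of HWE, with noncentrality parameter n D² / (p² q²), where n is the number of genotyped individuals. The function names and the input values in the usage section are illustrative assumptions.

```python
import math
from statistics import NormalDist


def normal_cdf(x):
    """Standard normal CDF, written with erfc so small tail values stay accurate."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def hwe_test_power(d, p, n, alpha):
    """Approximate power of a 1-df chi-square test of HWE.

    d     : HWD coefficient in the sampled group (assumed known)
    p     : allele frequency in the sampled group
    n     : number of genotyped individuals
    alpha : significance level of the HWE test
    """
    q = 1.0 - p
    # Assumed noncentrality parameter of the HWE chi-square statistic.
    ncp = n * d * d / (p * p * q * q)
    shift = math.sqrt(ncp)
    # Two-sided normal critical value; its square is the chi-square cutoff.
    z = -NormalDist().inv_cdf(alpha / 2.0)
    # A noncentral chi-square with 1 df is (Z + shift)^2, so the power is
    # the probability that |Z + shift| exceeds z.
    return normal_cdf(shift - z) + normal_cdf(-shift - z)


if __name__ == "__main__":
    # Hypothetical inputs: a small negative HWD coefficient in 10,000 individuals.
    for alpha in (0.05, 1e-4, 1e-7):
        print(alpha, hwe_test_power(d=-0.0025, p=0.5, n=10_000, alpha=alpha))
```

When d is 0 the function returns α, as expected for a test of the correct size. For a fixed nonzero d, the returned power rises with n and with α, which is the qualitative pattern described above.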
Deviations from HWE at the functional locus do not occur only because of ascertainment through families. When individuals are excluded from the control group because they have the phenotype under study, deviations from HWE are also observed in the control genotype data at the disease/trait susceptibility locus. In this situation the HWD coefficient is negative for all genetic models except the dominant model, for which it is positive. Although the largest deviation from HWE is observed for the multiplicative model, the magnitude of the deviation is only marginally greater than 0. For a fixed genotypic RR, the HWD coefficient increases with increasing disease prevalence. If the controls are collected from the general population without any exclusion criteria and the assumptions of HWE are not violated, no deviation from HWE will be observed in the genotype data.
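The effect of excluding affected individuals from the control group can be illustrated with a short calculation. This is a minimal sketch under stated assumptions: the general population is in HWE, genotype frequencies in screened controls are proportional to the population frequency times one minus the penetrance, and the baseline penetrance is derived from the prevalence and the two genotypic RRs. The function name and parameter names are hypothetical.

```python
def hwd_in_screened_controls(p, prevalence, rr_het, rr_hom):
    """HWD coefficient at the trait locus among unaffected (screened) controls.

    p          : population frequency of the risk allele
    prevalence : trait prevalence in the general population
    rr_het     : genotypic RR for carriers of one risk allele
    rr_hom     : genotypic RR for carriers of two risk alleles
    """
    q = 1.0 - p
    hwe = [q * q, 2.0 * p * q, p * p]  # 0, 1 and 2 risk alleles
    rr = [1.0, rr_het, rr_hom]
    # Baseline penetrance chosen so that the penetrances average to the prevalence.
    baseline = prevalence / sum(g * r for g, r in zip(hwe, rr))
    penetrance = [baseline * r for r in rr]
    if max(penetrance) > 1.0:
        raise ValueError("penetrance above 1; lower the prevalence or the RRs")
    # Genotype distribution conditional on being unaffected.
    unaffected = [g * (1.0 - f) for g, f in zip(hwe, penetrance)]
    total = sum(unaffected)  # equals 1 - prevalence
    freq = [u / total for u in unaffected]
    p_ctrl = freq[2] + 0.5 * freq[1]  # risk-allele frequency among controls
    # Observed minus expected homozygote frequency.
    return freq[2] - p_ctrl ** 2


if __name__ == "__main__":
    # Illustrative values only: allele frequency 0.3, prevalence 0.1, RR 1.5.
    print("multiplicative:", hwd_in_screened_controls(0.3, 0.1, 1.5, 2.25))
    print("dominant:      ", hwd_in_screened_controls(0.3, 0.1, 1.5, 1.5))
    print("recessive:     ", hwd_in_screened_controls(0.3, 0.1, 1.0, 1.5))
```

Under these assumptions the sign of the coefficient follows the sign of (1 − f0)(1 − f2) − (1 − f1)², where f0, f1 and f2 are the three penetrances. That quantity is positive only for the dominant model among those listed.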
The HWD coefficient reflects the difference between the observed homozygote frequency and the corresponding expected frequency under HWE. Negative values indicate an excess of heterozygous genotypes and a deficiency of homozygous genotypes, while positive HWD coefficients indicate the opposite. Negative HWD coefficients are indicative of genotyping error under a random error model and when homozygous genotypes are incorrectly called as heterozygous genotypes [17]. Under all genetic models considered, HWD coefficients are negative for the parental genotype data (fig. ). For the additive and dominant models, HWD coefficients are also negative in the unaffected sibling genotype data (fig. ). Therefore, it is important not to assume that negative HWD coefficients indicate genotyping errors when they are observed in parental and unaffected sibling genotype data.
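As a concrete illustration of the definition above, the HWD coefficient can be computed from genotype counts as the observed homozygote frequency minus the squared allele frequency. This is a minimal sketch. The function name and the counts in the usage lines are made up for illustration.

```python
def hwd_coefficient(n_aa, n_ab, n_bb):
    """HWD coefficient D for allele A from genotype counts.

    D = observed frequency of AA minus the frequency expected under HWE (p^2).
    D < 0: excess of heterozygotes; D > 0: excess of homozygotes.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p_aa = n_aa / n
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    return p_aa - p * p


if __name__ == "__main__":
    print(hwd_coefficient(25, 50, 25))  # genotype counts in exact HWE proportions
    print(hwd_coefficient(20, 60, 20))  # more heterozygotes than expected
    print(hwd_coefficient(30, 40, 30))  # fewer heterozygotes than expected
```

For a biallelic marker the same value is obtained whichever homozygote is used, because both homozygote frequencies deviate from their HWE expectations by the same amount D.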
The deviation from HWE caused by a true association can be further compounded by genotyping error and population substructure. The influence of genotyping error on the HWD coefficients depends on the error model: genotyping error can create an excess of either homozygous or heterozygous genotypes [17]. Genotyping error usually does not have a large effect on HWD coefficients unless the genotyping error rate is high. Genotyping error at the disease/trait susceptibility locus in the parental and unaffected sibling genotype data can either attenuate or amplify the deviation of the HWD coefficient from 0; in turn, this will affect the power to detect a deviation from HWE. The absolute power will depend on the genetic model, genotypic RR, type of genotyping error, frequency of genotyping error, allele frequency, sample size and α value. Population substructure always creates an excess of homozygotes when the subpopulations have different allele frequencies. When the allele frequency difference between the two populations is large, the deviation from HWE can be dominated by population substructure, and the HWD coefficients shift from negative to positive or become more positive.
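The homozygote excess produced by population substructure can be checked with a small calculation. This is a minimal sketch, assuming two subpopulations that are each in HWE and are pooled in proportions w and 1 − w. The function name and example frequencies are illustrative.

```python
def pooled_hwd(p1, p2, w):
    """HWD coefficient after pooling two subpopulations that are each in HWE.

    p1, p2 : allele frequencies in the two subpopulations
    w      : proportion of the pooled sample drawn from subpopulation 1
    """
    # Homozygote frequency in the pooled sample.
    hom = w * p1 ** 2 + (1.0 - w) * p2 ** 2
    # Allele frequency in the pooled sample.
    p_bar = w * p1 + (1.0 - w) * p2
    return hom - p_bar ** 2


if __name__ == "__main__":
    # Illustrative values: equal mixing of subpopulations with frequencies 0.2 and 0.6.
    print(pooled_hwd(0.2, 0.6, 0.5))
    # Identical allele frequencies give no deviation.
    print(pooled_hwd(0.3, 0.3, 0.4))
```

Algebraically the result equals w(1 − w)(p1 − p2)², the variance of the allele frequency across subpopulations, so it is never negative and grows with the square of the allele frequency difference.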
All calculations are based upon pedigrees with one affected proband. If calculations were carried out for kindreds with multiple affected offspring, the genotype probabilities for the 9 parental mating types would be modified. As the number of affected offspring increases, the probability that the parents are susceptibility allele carriers also increases, since the probability that the affected offspring are phenocopies is reduced. For the unaffected siblings, calculations are carried out conditional on their parents having one affected offspring. Based upon the probability of each possible mating type, the probability of each of the three possible genotypes is then calculated conditional on the offspring being unaffected.
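The mating-type calculation described above can be sketched as follows. This is a minimal sketch rather than the study's own implementation. It assumes a biallelic locus, HWE and random mating in the general population, Mendelian transmission, and that the phenotypes of two offspring are independent given the parental genotypes. All function names and the penetrance values in the usage section are illustrative.

```python
from itertools import product


def transmission(gf, gm):
    """P(offspring carries 0, 1 or 2 risk alleles | parental risk-allele counts)."""
    tf, tm = gf / 2.0, gm / 2.0  # chance that each parent transmits the risk allele
    return [(1 - tf) * (1 - tm), tf * (1 - tm) + (1 - tf) * tm, tf * tm]


def family_genotype_distributions(p, penetrance):
    """Genotype distributions of a parent and of an unaffected sibling,
    conditional on the family having one affected proband.

    p          : population frequency of the risk allele
    penetrance : [f0, f1, f2], probability of being affected with 0, 1, 2 risk alleles
    """
    q = 1.0 - p
    hwe = [q * q, 2.0 * p * q, p * p]
    parent = [0.0, 0.0, 0.0]
    sibling = [0.0, 0.0, 0.0]
    # Loop over the 9 ordered parental mating types.
    for gf, gm in product(range(3), repeat=2):
        t = transmission(gf, gm)
        p_affected = sum(t[c] * penetrance[c] for c in range(3))
        # Joint probability of the mating type and an affected proband.
        weight = hwe[gf] * hwe[gm] * p_affected
        parent[gf] += weight  # one parent's marginal; the other is symmetric
        for c in range(3):
            # Sibling has genotype c and is unaffected.
            sibling[c] += weight * t[c] * (1.0 - penetrance[c])
    parent = [x / sum(parent) for x in parent]
    sibling = [x / sum(sibling) for x in sibling]
    return parent, sibling


def hwd(freq):
    """HWD coefficient from genotype frequencies ordered by risk-allele count."""
    allele = freq[2] + 0.5 * freq[1]
    return freq[2] - allele ** 2


if __name__ == "__main__":
    # Illustrative multiplicative model: baseline penetrance 0.05, genotypic RR 1.5.
    f0, rr = 0.05, 1.5
    parents, sibs = family_genotype_distributions(0.45, [f0, f0 * rr, f0 * rr ** 2])
    print("parents:", parents, hwd(parents))
    print("unaffected siblings:", sibs, hwd(sibs))
```

Extending the sketch to families with several affected offspring would amount to raising `p_affected` to the number of affected offspring inside the loop, which shifts weight toward mating types with more risk alleles.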
The similarity between probands and unaffected siblings is due to low penetrance of susceptibility loci since unaffected siblings and probands can share a large proportion of high risk genotypes. When the penetrances were raised to high values, the patterns of HWD in unaffected siblings showed dramatic differences from probands (data not shown) since the probability that the unaffected sibling is a susceptibility allele carrier is greatly diminished.
For both the deviation from HWE and the power to reject the null hypothesis of HWE, results are shown for population allele frequencies ranging from 0.05 to 0.95. Although a disease susceptibility allele is unlikely to have a high frequency (e.g. ≥0.5), such high allele frequencies are not unusual for variants involved in normal human variation.
Unless samples from probands, parents and unaffected siblings are genotyped in different batches, the type of genotyping error and the error rates are expected to be consistent across these groups. Therefore, a pattern of deviation from HWE in the proband data that differs from the pattern observed in parental or unaffected sibling genotype data could indicate that the deviation is due to an association and not to genotyping errors. For the recessive and multiplicative models, the pattern of deviation from HWE in the parental genotype data differs from that in the proband genotype data. For the proband genotype data there is no deviation from HWE under the multiplicative model and the deviation is positive under the recessive model, while for the parental genotype data the deviation is negative under both the multiplicative and recessive models. However, even though the HWD coefficients are negative for the parental genotype data, the divergence from 0 is not large, especially under the recessive model. For unaffected sibling genotype data, the deviation from HWE is weaker than for the proband genotype data under the same genetic model and there is no difference in direction, with the exception of the multiplicative model, where D = 0 for the proband genotype data. In most circumstances, differences in the deviation from HWE between the proband genotype data and either the parental or the unaffected sibling genotype data are difficult to distinguish from random variability.
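The proband patterns described above (D = 0 under the multiplicative model, D > 0 under the recessive model) can be reproduced with a short calculation. This is a minimal sketch, assuming HWE in the general population and proband genotype frequencies proportional to the population genotype frequency times the penetrance. The function name and penetrance values are illustrative.

```python
def proband_hwd(p, penetrance):
    """HWD coefficient at the trait locus among affected probands.

    p          : population frequency of the risk allele
    penetrance : [f0, f1, f2], probability of being affected with 0, 1, 2 risk alleles
    """
    q = 1.0 - p
    hwe = [q * q, 2.0 * p * q, p * p]
    # Genotype distribution conditional on being affected.
    weighted = [g * f for g, f in zip(hwe, penetrance)]
    total = sum(weighted)  # equals the trait prevalence
    freq = [x / total for x in weighted]
    p_case = freq[2] + 0.5 * freq[1]  # risk-allele frequency among probands
    return freq[2] - p_case ** 2


if __name__ == "__main__":
    f0, rr = 0.05, 1.5  # illustrative baseline penetrance and genotypic RR
    print("multiplicative:", proband_hwd(0.45, [f0, f0 * rr, f0 * rr ** 2]))
    print("recessive:     ", proband_hwd(0.45, [f0, f0, f0 * rr]))
```

Under these assumptions the sign of D in probands is the sign of f0 f2 − f1², which is exactly 0 when f1 = f0 × RR and f2 = f0 × RR², and positive when only the homozygous risk genotype has elevated penetrance.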
In family-based studies, erroneous genotypes can bias linkage or association analyses. Mendelian inconsistencies are usually used to detect errors in family-based studies. Errors such as incorrect pedigree structure and sample mix-ups usually cause a large proportion of markers to display Mendelian inconsistencies and are therefore easily detected. However, genotyping errors, which often depend on the genotyping method, are more difficult to detect because they are often compatible with Mendelian inheritance. Undetected genotyping errors can increase type I and II error rates. Detection of genotyping error via deviation from HWE is often carried out in unrelated controls from case-control association studies [16, 17]. However, deviation from HWE is not necessarily caused by genotyping errors and may be due to chance, population admixture/stratification, inbreeding, selection or copy number variants [18, 19, 20, 21, 22]. In this article it is demonstrated that, in genotype data obtained from parents and unaffected siblings of probands, deviation from HWE at the trait locus can be due to ascertainment through the probands. However, the deviations are not large, and at a genome-wide association study α value the power to detect deviations from HWE at the functional locus is low. Deviations from HWE in either parental or unaffected sibling genotype data can be used to flag markers for potential genotyping error. For these markers, the cluster quality score should be examined for potential problems. Information on duplicate samples and Mendelian inconsistencies may give further evidence of genotyping error. Genotypes can also be confirmed by obtaining genotyping results from another platform. Additionally, high rates of missing genotype data (e.g. >0.05) may also indicate problems with genotyping error.
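The flagging step described above can be sketched as a simple per-marker check. This is a minimal sketch, not a prescribed pipeline: it assumes a 1-df chi-square goodness-of-fit test of HWE on parental or unaffected sibling genotype counts and uses the 0.05 missing-rate cut-off mentioned in the text. The HWE α threshold, the function names and the counts are illustrative assumptions, and a flagged marker is only a candidate for follow-up, not a confirmed error.

```python
import math


def hwe_chi_square_pvalue(n_aa, n_ab, n_bb):
    """P-value of the 1-df chi-square goodness-of-fit test of HWE."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or p == 1.0:
        return 1.0  # monomorphic marker: the test is not informative
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: P(|Z| > sqrt(chi2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def flag_marker(n_aa, n_ab, n_bb, n_missing, hwe_alpha=1e-7, max_missing=0.05):
    """Return the reasons a marker should be examined further (empty if none)."""
    reasons = []
    called = n_aa + n_ab + n_bb
    if n_missing / (called + n_missing) > max_missing:
        reasons.append("high missing rate")
    if hwe_chi_square_pvalue(n_aa, n_ab, n_bb) < hwe_alpha:
        reasons.append("deviation from HWE")
    return reasons


if __name__ == "__main__":
    # Made-up genotype counts for three markers.
    print(flag_marker(2500, 5000, 2500, n_missing=20))
    print(flag_marker(3500, 3000, 3500, n_missing=20))
    print(flag_marker(2500, 5000, 2500, n_missing=900))
```

Markers returned with a non-empty list would then go through the checks listed above: cluster quality scores, duplicate concordance, Mendelian inconsistencies and, where available, confirmation on another platform.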