Cryptic relatedness was suggested to be an important source of confounding in population-based association studies (PBAS). The magnitude and manner of cryptic relatedness affecting the performance of PBAS of continuous traits remain to be investigated. We simulated a set of related samples through biased sampling and inbreeding, and evaluated the power and type I error rates of simple association tests (SAT) without correcting for cryptic relatedness. We also used extended likelihood ratio tests (ELRT) to conduct PBAS accounting for cryptic relatedness, and compared it with genomic control (GC). Cryptic relatedness decreased the power as well as increased the type I error rates of SAT in both biased sampling and inbreeding models. The impact of cryptic relatedness on the performance of SAT appeared to be limited in the biased sampling model. However, cryptic relatedness in inbred populations may result in excessive false positive results of SAT. Compared with SAT and GC, ELRT obtained improved power and type I error rates under various scenarios. Ignoring cryptic relatedness may increase spurious association results in PBAS. Our ELRT provides a novel approach to control cryptic relatedness in PBAS of human continuous traits.
Population-based association studies (PBAS) are a powerful strategy for susceptibility gene mapping of human complex diseases [1 3 3]. With the rapid development of high-throughput genotyping technologies, genome-wide PBAS are widely used to identify causal alleles of human complex diseases, such as osteoporosis, diabetes and obesity [4 5 6 7]. Nonetheless, an outstanding issue complicating PBAS is population structure, which can cause spurious association results and limit the robustness and efficiency of PBAS [8 9 10].
Population structure mainly refers to population stratification and cryptic relatedness. In contrast to population stratification, which has been extensively studied and well addressed by Pritchard et al. [11, 12], Zhu et al. [13, 14], Zhang et al., Chen et al., Price et al. and Devlin and Roeder, information about the impact of cryptic relatedness on PBAS is limited. Cryptic relatedness, in which some or all subjects in a study sample are related, was suggested to be an important source of confounding in PBAS [18, 19]. Because PBAS assume that the individuals in a study sample are independent, cryptic relatedness may invalidate these statistical tests and reduce the robustness and efficiency of PBAS. Biased sampling and inbreeding are two major causes of cryptic relatedness.
To date, only a few studies have assessed the impact of cryptic relatedness on case-control studies, and none have addressed quantitative traits. The magnitude and manner in which cryptic relatedness affects the performance of PBAS of human continuous traits remain to be investigated. There are several PBAS methods that can take individual relationships into account [18, 20 21 22]. However, most of these methods require known individual relationships, which are usually uncertain or unavailable in practice. When individual relationships are not known in advance, genomic control (GC) was suggested to control cryptic relatedness in PBAS [18, 19]. GC was originally developed to correct for population stratification and has been found to be conservative in stratified populations [23 24 25]. Information about the performance of GC in correcting for cryptic relatedness in PBAS is limited.
Variance component models were first introduced to genetic studies in the 20th century [26, 27]. Fisher divided the total phenotypic variance of a quantitative trait into environmental variance and genetic variance due to additive, dominance, and epistatic genetic effects. By including the variance components of genes linked to particular loci, variance component models have been widely used for genetic linkage and association mapping of human complex diseases [28 29 30 31].
PLINK is a popular genome-wide PBAS software package, which can estimate individual genome-wide identity by descent (IBD) sharing coefficients using genotypic data from seemingly unrelated individuals. The estimated genome-wide IBD sharing coefficients can be used to infer individual relationships, which can then be included in a variance component model to control the impact of cryptic relatedness on PBAS. To the best of our knowledge, no work on the performance of PLINK in IBD estimation has been reported so far.
In this study, we simulated a set of related samples through biased sampling and inbreeding, and evaluated the power and type I error rates of simple association tests (SAT) without correcting for cryptic relatedness. Based on a variance component model, we also used extended likelihood ratio tests (ELRT) to conduct PBAS accounting for cryptic relatedness, and compared it with GC in both biased sampling and inbreeding models. Our study aims to assess how serious confounding from cryptic relatedness is in PBAS of continuous traits, and to develop an efficient PBAS approach to control cryptic relatedness.
PLINK is first applied to genotypic data to estimate genome-wide IBD sharing coefficients for each pair of individuals. The estimated IBD sharing coefficients can then be converted to kinship coefficients: Kij = 0.5 × Pij1 + Pij2, where Kij represents the kinship coefficient between individuals i and j; Pij1 and Pij2 are estimated by PLINK, and denote the genome-wide probabilities that individuals i and j share one and two allele(s) IBD, respectively. Based on the inferred kinship coefficients, classical likelihood ratio tests are extended to conduct PBAS accounting for individual relationships. Suppose that genotypic and phenotypic data of n individuals were collected. The log-likelihood functions under the null hypothesis (H0) and the alternative hypothesis (H1) can be expressed as
where a is the phenotypic effect of the candidate locus; β is an n × 1 vector of fixed effects; σpoly and σe are two n × 1 vectors representing polygenic and environmental effects, respectively; y is an n × 1 vector of observed phenotypic values; Ω is an n × n kinship coefficient matrix with elements Kij (i, j = 1, 2, 3,…, n); Z is an n × 1 vector of individual genotypes at the candidate locus; I is an n × n identity matrix. The log-likelihood ratio test statistic U can be written as
where L0 and L1 are the maximized log-likelihood values estimated under H0 and H1, respectively.
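A standard variance component formulation consistent with the symbols defined above can be written as follows. This is a sketch under assumptions rather than a confirmed form: σpoly and σe are treated as scalar variance components (as in the phenotype simulation model), Ω enters the covariance without an additional factor of 2 because Kij = 0.5 × Pij1 + Pij2 already equals 1 for an individual paired with itself, and U is assumed to be referred to a chi-square distribution with one degree of freedom.

```latex
\begin{aligned}
\Sigma &= \Omega\,\sigma_{poly} + I\,\sigma_{e},\\
\ln L(H_0) &= -\tfrac{n}{2}\ln(2\pi) - \tfrac{1}{2}\ln\lvert\Sigma\rvert
              - \tfrac{1}{2}\,(y-\beta)^{\top}\Sigma^{-1}(y-\beta),\\
\ln L(H_1) &= -\tfrac{n}{2}\ln(2\pi) - \tfrac{1}{2}\ln\lvert\Sigma\rvert
              - \tfrac{1}{2}\,(y-\beta-aZ)^{\top}\Sigma^{-1}(y-\beta-aZ),\\
U &= 2\,(L_1 - L_0).
\end{aligned}
```

Under this formulation, H0 corresponds to a = 0, and the two models differ only in the mean of y, so U measures how much the candidate locus improves the fit once relatedness is absorbed by Σ.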
We considered two common cryptic relatedness models: biased sampling and inbreeding. Genotype data of 1,000 bi-allelic loci were simulated for each individual. Allele frequencies of the 1,000 loci were randomly generated from a beta distribution in the first generation. Recombination rates were set to 1.0 × 10−8 for all pairs of adjacent loci. Mutation rates were set to 1.0 × 10−5 for each locus. All loci were assumed to be in Hardy-Weinberg equilibrium and to recombine and mutate randomly during genotype simulations.
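The first-generation genotype simulation can be sketched as below. This is a minimal illustration, not the authors' R code: the beta shape parameters are not given in the text, so the values `alpha=2.0` and `beta=2.0` are placeholders, and the function names are hypothetical. Hardy-Weinberg equilibrium is obtained by drawing the two alleles at each locus independently.

```python
import random

def simulate_allele_frequencies(n_loci, alpha=2.0, beta=2.0, rng=None):
    """Draw one allele frequency per locus from Beta(alpha, beta).

    The shape parameters are assumptions; the text does not report them.
    """
    rng = rng or random.Random()
    return [rng.betavariate(alpha, beta) for _ in range(n_loci)]

def simulate_founder(freqs, rng):
    """Return two haplotypes (lists of 0/1 alleles) for one unrelated founder.

    Drawing both alleles independently with probability p gives genotype
    proportions in Hardy-Weinberg equilibrium.
    """
    hap_a = [int(rng.random() < p) for p in freqs]
    hap_b = [int(rng.random() < p) for p in freqs]
    return hap_a, hap_b

rng = random.Random(1)
freqs = simulate_allele_frequencies(1000, rng=rng)
founder = simulate_founder(freqs, rng)
# Genotype coded as the count of allele 1 at each locus (0, 1 or 2).
genotype = [a + b for a, b in zip(*founder)]
```

Keeping the two haplotypes separate, rather than storing only the 0/1/2 genotype, makes it possible to pass alleles on to children in later generations.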
For the biased sampling model, we simulated 400 nuclear families with two parents and four children in each family. Two parents were first simulated based on the randomly generated allele frequencies, and then randomly mated and recombined to generate four children in each family. We randomly selected one parent and two children from each of the families as related individuals in the total sample (400 individuals). The remaining unrelated individuals in the total sample were obtained by randomly selecting one individual from each of the remaining families.
For the inbreeding model, our simulation procedure includes two stages. In stage 1, based on the randomly generated allele frequencies, a small unrelated population was first simulated as the founder population. The founder population was then randomly mated and recombined for several generations to generate a population of 3,200 individuals, with non-overlapping generations and random pairing of parents in each generation. Four children were simulated in each family, so that the population size doubled in each generation of stage 1. In stage 2, the simulated population was randomly mated and recombined for five further generations to obtain an inbred population, again with non-overlapping generations and random pairing of parents. Population size was kept constant in stage 2. Finally, 400 subjects were randomly selected from the simulated inbred population (3,200 individuals) as the study sample.
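The mating step shared by both models, in which two parents produce children through recombination and mutation, can be sketched as follows. This is an illustrative sketch under assumptions: each parent is stored as a pair of haplotypes, recombination is modeled as a switch between the two parental haplotypes with the stated probability between adjacent loci, and mutation flips an allele with the stated per-locus probability. All function names are hypothetical.

```python
import random

def make_gamete(parent, recomb_rate, mut_rate, rng):
    """Build one gamete from a parent given as a pair of haplotypes."""
    strand = rng.randrange(2)            # start on a random parental haplotype
    gamete = []
    for locus in range(len(parent[0])):
        if locus > 0 and rng.random() < recomb_rate:
            strand = 1 - strand          # crossover between adjacent loci
        allele = parent[strand][locus]
        if rng.random() < mut_rate:
            allele = 1 - allele          # bi-allelic mutation flips the allele
        gamete.append(allele)
    return gamete

def make_child(mother, father, recomb_rate=1.0e-8, mut_rate=1.0e-5, rng=random):
    """A child receives one gamete from each parent."""
    return (make_gamete(mother, recomb_rate, mut_rate, rng),
            make_gamete(father, recomb_rate, mut_rate, rng))

def make_family(freqs, n_children=4, rng=random):
    """Simulate two unrelated parents and their children (a nuclear family)."""
    def founder():
        return tuple([int(rng.random() < p) for p in freqs] for _ in range(2))
    mother, father = founder(), founder()
    children = [make_child(mother, father, rng=rng) for _ in range(n_children)]
    return mother, father, children

# Hypothetical small example: 50 loci, all with allele frequency 0.2.
mother, father, children = make_family([0.2] * 50, rng=random.Random(3))
```

For the inbreeding model, the same `make_child` step would be applied repeatedly to randomly paired parents drawn from the previous generation.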
A bi-allelic quantitative trait locus (QTL) was assumed to be associated with an individual quantitative phenotype. The QTL was randomly selected from those of the 1,000 simulated loci with 0.18 ≤ minor allele frequency ≤ 0.22. An additive genetic model was used for quantitative phenotype simulation. Let yj be the phenotypic value of individual j; the linear model is expressed as
$$y_j = \beta + a z_j + p_j + e_j$$
where β is a fixed effect; zj is the genotype of individual j at the QTL (zj = 0, 1 or 2); a is the additive genetic effect of the QTL; and pj is the residual polygenic effect of individual j attributed to other potential susceptibility loci. During the simulation, pj was randomly generated from a normal distribution with mean 0 and variance σpoly in the first generation. In the second and later generations (for the inbreeding model), pj was set to the average of the two parents' pj values, which ensured phenotypic relatedness among family members due to polygenic effects. ej is the residual environmental effect of individual j, following a zero-mean normal distribution with variance σe.
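The phenotype model described above can be sketched in a few lines. This is a hedged illustration: the function names are hypothetical, and σpoly and σe are passed as variances (`var_poly`, `var_e`), following their description in the text.

```python
import random

def founder_polygenic(var_poly, rng):
    """First generation: p_j ~ Normal(0, var_poly)."""
    return rng.gauss(0.0, var_poly ** 0.5)

def offspring_polygenic(p_mother, p_father):
    """Later generations: p_j is the average of the two parents' values."""
    return 0.5 * (p_mother + p_father)

def phenotype(z, a, p, var_e, beta=0.0, rng=random):
    """y_j = beta + a * z_j + p_j + e_j, with e_j ~ Normal(0, var_e)."""
    e = rng.gauss(0.0, var_e ** 0.5)
    return beta + a * z + p + e

rng = random.Random(7)
p_mom = founder_polygenic(0.3, rng)
p_dad = founder_polygenic(0.3, rng)
p_kid = offspring_polygenic(p_mom, p_dad)
y_kid = phenotype(z=1, a=0.25, p=p_kid, var_e=0.7, rng=rng)  # illustrative values
```

Because siblings share the same mid-parent value, their phenotypes are correlated through pj even when their QTL genotypes differ.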
Proportions of sib pairs and polygenic variance in the biased sampling model and the founder population sizes in the inbreeding model were controlled to model various relatedness levels. The simulated QTL was assumed to explain 2% of phenotypic variation in both biased sampling and inbreeding models. Detailed parameter designs are presented in Table 1.
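One way to choose the additive effect a so that the QTL explains a given fraction of phenotypic variance is sketched below. It assumes Hardy-Weinberg equilibrium at the QTL, so that the QTL variance is 2p(1 − p)a², and assumes the total phenotypic variance is the sum of the QTL, polygenic and environmental variances. The numeric inputs are illustrative, not the values from Table 1.

```python
import math

def qtl_variance(a, maf):
    """Variance contributed by an additive bi-allelic QTL under HWE."""
    return 2.0 * maf * (1.0 - maf) * a * a

def additive_effect(fraction, maf, var_poly, var_e):
    """Solve fraction = Vq / (Vq + var_poly + var_e) for the effect size a."""
    residual = var_poly + var_e
    return math.sqrt(fraction * residual /
                     ((1.0 - fraction) * 2.0 * maf * (1.0 - maf)))

# Illustrative inputs: MAF 0.2, polygenic variance 0.3, environmental variance 0.7.
a = additive_effect(0.02, 0.2, 0.3, 0.7)
vq = qtl_variance(a, 0.2)
explained = vq / (vq + 0.3 + 0.7)
```

With these inputs, `explained` recovers the target fraction of 0.02 up to floating-point error.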
Individual kinship coefficients of the study sample (400 individuals) were first inferred by PLINK. To assess the possible bias caused by kinship coefficient inference, we also recorded the real kinship coefficients for each pair of individuals in each simulation for the biased sampling model. The simulated genotypic and phenotypic data were analyzed by SAT, GC, ELRT using PLINK-inferred kinship coefficients (ELRTP), and ELRT using real kinship coefficients (ELRTR, only for the biased sampling model). For each parameter setting, 1,000 simulations were conducted. Power and type I error rates were calculated as the proportions of positive results (P values ≤ 0.05) obtained at the simulated QTL with and without a phenotypic effect, respectively, in 1,000 simulations. All the simulations and ELRT analyses were implemented in R.
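The power and type I error calculation amounts to an empirical rejection rate over replicates, which can be sketched as follows. The helper `chi2_1df_pvalue` assumes that a test statistic is referred to a chi-square distribution with one degree of freedom; that reference distribution is an assumption here, and both function names are hypothetical.

```python
import math

def chi2_1df_pvalue(stat):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom.

    Uses the identity P(X > x) = erfc(sqrt(x / 2)) for X ~ chi-square(1).
    """
    return math.erfc(math.sqrt(stat / 2.0))

def rejection_rate(p_values, alpha=0.05):
    """Proportion of replicates with P value <= alpha.

    Applied to replicates simulated with a QTL effect this estimates power;
    applied to replicates simulated without an effect it estimates the
    type I error rate.
    """
    return sum(1 for p in p_values if p <= alpha) / len(p_values)

# Illustrative statistics from four hypothetical replicates.
stats = [0.2, 1.5, 4.1, 7.9]
p_values = [chi2_1df_pvalue(s) for s in stats]
rate = rejection_rate(p_values)
```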
The mean inflation factors estimated by GC under various scenarios are presented in Table 1. The performances of SAT, ELRTP, ELRTR and GC in the biased sampling and inbreeding models are detailed in the following:
Proportions of related subjects and polygenic variances were varied to investigate the potential effect of biased sampling on PBAS. As shown in Table 2, as the proportion of related subjects increased from 0.0 to 0.3, we observed a consistent decrease in power (from 88.8 to 83.6%) as well as an increase in type I error rates (from 5.0 to 7.0%) for SAT. Compared with GC, both ELRTP and ELRTR obtained higher power and lower type I error rates at each proportion of related subjects that we investigated. The performance of ELRTR was slightly better than that of ELRTP. In addition, the performance of GC showed trends similar to those of SAT, and GC obtained the lowest power (from 88.0 to 81.8%) across the studied proportions of related subjects.
Table 3 provides an overview of comparison results with respect to polygenic variances. With polygenic variances increasing from 0.2 to 0.4, SAT presented consistently decreasing power (from 85.6 to 83.6%) as well as increasing type I error rates (from 4.8 to 7.0%). ELRTP and ELRTR had similar performances and performed better in power and type I error rates than GC at each polygenic variance investigated.
Table 4 summarizes the association test results of the four methods in inbred populations. We observed a high type I error rate of 7.3% for SAT at a founder population size of 100. When the founder population size increased to 200 or 400, SAT obtained normal type I error rates (≤ 5%). Compared with GC, ELRTP generally showed higher power (from 95.0 to 96.0%) and lower type I error rates (from 4.2 to 4.9%) within the range of founder population sizes we investigated.
To assess how important it is to consider cryptic relatedness in PBAS of human continuous traits, we simulated a set of related samples through biased sampling and inbreeding, and investigated the power and type I error rates of SAT. We found that biased sampling decreased the power as well as increased the type I error rates of SAT. However, the confounding from biased sampling was limited in our study. For instance, even when 30% of the samples were closely related sib pairs, the type I error rate of SAT increased only to 7.0%. The effects of biased sampling on the power of SAT also appeared to be limited in our study. To investigate the impact of biased sampling on the performance of SAT, we simulated extremely related samples, which are unlikely to arise in practice. Our simulation results are consistent with Voight and Pritchard's study, which assessed the effect of cryptic relatedness on case-control studies through theoretical derivation. Based on our simulation results and on the aforementioned study, we suggest that the impact of biased sampling on PBAS might be limited and could generally be ignored in practice.
Due to their inherent advantages, some inbred populations, such as founder and island populations, were recommended for PBAS [9, 34 35 36 36]. Some or all individuals from these inbred populations are usually related because of their common ancestry. Information about the potential impact of inbreeding on PBAS of continuous traits is limited. In our study, we observed a high type I error rate of 7.3% for SAT when the founder population size was 100. With founder population sizes increasing to 200 or 400, type I error rates of SAT decreased to normal levels (≤ 5%). Our simulation results suggest that cryptic relatedness in inbred populations might increase spurious results in PBAS of continuous traits. For PBAS conducted in small and closely related inbred populations, cryptic relatedness should be carefully addressed.
Because cryptic relatedness may be a serious problem in some situations [18, 19], we extended classical likelihood ratio tests to conduct PBAS accounting for cryptic relatedness (ELRT), and compared it with GC under various scenarios. ELRT presented improved power and type I error rates compared to GC in both biased sampling and inbreeding models. It should be emphasized that ELRT uses genome-wide IBD sharing coefficients estimated by PLINK to infer individual kinship coefficients, and does not require known individual relationships. On the other hand, the performance of ELRT may be affected by the accuracy of the genome-wide IBD sharing coefficient estimates. To assess this possible effect, in the biased sampling model we compared the performance of ELRT using the kinship coefficients inferred by PLINK (ELRTP) with that of ELRT using the real kinship coefficients obtained from simulations (ELRTR). The performance of ELRTP was close to that of ELRTR under various scenarios, which may indicate that PLINK estimates genome-wide IBD sharing coefficients well, and suggests no significant effect of kinship coefficient inference on the performance of ELRT in our study. Additionally, we observed that the computational cost of ELRT increased significantly with sample size, due to the large kinship coefficient matrix used by ELRT. For example, running ELRT on a data set with 2,000 samples and 1,000 markers requires about 26 hours of computation time (Intel Xeon dual quad-core CPUs with 4 GB of memory), which is usually acceptable for real studies.
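The conversion from pairwise IBD estimates to the matrix Ω used by ELRT, Kij = 0.5 × Pij1 + Pij2, can be sketched as below. This is an illustrative sketch: the function names are hypothetical, the input is assumed to be a dictionary of (P1, P2) values per pair rather than PLINK's actual output file, and the diagonal is set to 1 on the assumption that an individual shares two alleles IBD with itself.

```python
def kinship_from_ibd(p1, p2):
    """K_ij = 0.5 * P_ij1 + P_ij2, as defined in the Methods."""
    return 0.5 * p1 + p2

def build_omega(n, ibd_pairs):
    """Assemble the symmetric n x n matrix Omega as a list of lists.

    ibd_pairs maps (i, j) -> (P1, P2); pairs not listed are treated as
    unrelated (K = 0). Diagonal entries are set to 1.0 (assumption).
    """
    omega = [[0.0] * n for _ in range(n)]
    for i in range(n):
        omega[i][i] = 1.0
    for (i, j), (p1, p2) in ibd_pairs.items():
        k = kinship_from_ibd(p1, p2)
        omega[i][j] = k
        omega[j][i] = k
    return omega

# Expected IBD sharing: full sibs (P1 = 0.5, P2 = 0.25), parent-child (P1 = 1, P2 = 0).
omega = build_omega(3, {(0, 1): (0.5, 0.25), (0, 2): (1.0, 0.0)})
```

Because Ω holds n × n entries, its storage and inversion grow quickly with sample size, which is in line with the computational cost noted above.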
It should be noted that using PLINK to identify related individuals and excluding them from subsequent analyses may also help to decrease the impact of cryptic relatedness on PBAS. However, it may be difficult to define a suitable exclusion criterion in practice. A criterion that is too strict may significantly decrease the sample size and power of PBAS, while one that is too loose may not eliminate the spurious associations caused by cryptic relatedness. GC is a popular PBAS method correcting for population stratification and cryptic relatedness. In our study, GC generally showed moderately decreasing power and moderately increasing type I error rates with increasing relatedness levels in both biased sampling and inbreeding models. The performance of GC appeared to be slightly affected by relatedness levels.
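For reference, the GC adjustment can be sketched as follows. This is a generic median-based version, not necessarily the variant used in this study: the inflation factor λ is estimated from test statistics at markers assumed to be unlinked to the trait, it is bounded below by 1 (a common convention, assumed here), and each statistic is divided by λ. Function names are hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def inflation_factor(null_stats):
    """Median-based estimate of lambda, bounded below by 1 (assumption)."""
    return max(1.0, statistics.median(null_stats) / CHI2_1DF_MEDIAN)

def gc_adjust(stat, lam):
    """Deflate a 1-df chi-square statistic by the inflation factor."""
    return stat / lam

# Illustrative statistics at hypothetical null markers.
null_stats = [0.3, 0.6, 0.9, 1.2, 2.5]
lam = inflation_factor(null_stats)
adjusted = gc_adjust(4.0, lam)
```

Since every statistic is divided by the same λ, the adjustment controls inflation on average but also shrinks true signals, which fits the moderate loss of power observed for GC.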
In summary, our study results show that cryptic relatedness may decrease the power as well as increase the type I error rates of PBAS of continuous traits. The impact of cryptic relatedness caused by biased sampling on PBAS is limited. In contrast, cryptic relatedness in inbred populations may be serious and should be carefully addressed. Our ELRT provides a novel approach to control spurious results caused by cryptic relatedness in PBAS of human continuous traits.
Investigators of this work were partially supported by grants from NIH (R01 AR050496, R21 AG027110, R01 AG026564, P50 AR055081 and R21 AA015973). The study also benefited from grants from the National Science Foundation of China, the Huo Ying Dong Education Foundation, HuNan Province, Xi’an Jiaotong University, and the Ministry of Education of China.