Genome-wide association studies (GWASs) have been widely used to identify common variants that contribute to variation in complex human phenotypes and diseases. Pedigree integrity is crucial to the performance of family-based GWASs, as well as to the analysis of population-based data with unknown family structure. High-throughput genotyping performed in a GWAS presents new opportunities for pedigree error detection, using millions of SNPs to assess the degree of relationship between a pair of individuals. With these opportunities come the challenges of accounting for linkage disequilibrium among typed markers while managing the computational resources needed to analyze the large amount of genotype data. Compared to linkage studies, association studies also require consideration of population substructure, misreported race and ethnicity, and unreported familial relationships among samples recruited as unrelated individuals.
One well-developed approach for relationship inference in linkage studies offers fully parametric methods for sib pairs (Boehnke and Cox, 1997) and extensions to general pedigrees (McPeek and Sun, 2000), using hidden Markov models (HMMs) to calculate multipoint marker probabilities that are incorporated into a likelihood framework to assess evidence in support of particular pair-wise relationships. Because full multipoint marker probabilities are considered, computational demands increase with the number of markers genotyped, making analysis of GWAS SNPs for all pairs of individuals prohibitive. A simple method, known as GRR (Graphical Representation of Relationship errors; Abecasis et al.), clusters readily available non-parametric estimates of the mean and standard deviation (SD) of identical by state (IBS) statistics at a series of markers for each pair of relatives. GRR identifies outliers of clusters as relationship errors. Performance of the clustering algorithm used to classify relative pairs depends on the panel of genetic markers, the underlying allele frequencies of genetic markers for different individuals, and the number of individuals genotyped. If certain pairs of individuals do not cluster, whether because of limited sample size or because of different underlying allele frequencies between pairs (e.g. in the presence of population structure), GRR fails to detect the pedigree errors. One efficient implementation of relationship inference in GWAS data is available in a widely used software package, PLINK (Purcell et al.). The identical-by-descent (IBD) statistics between each pair of individuals are estimated from the average IBS together with sample-level allele frequency estimates at each SNP, under Hardy–Weinberg Equilibrium (HWE) assumptions.
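The per-pair IBS summaries that GRR clusters, as described above, can be illustrated with a minimal sketch. The function names, the 0/1/2 minor-allele-count coding and the use of the population SD are assumptions made for illustration only; this is not GRR's actual code.

```python
from statistics import mean, pstdev


def ibs_counts(geno_a, geno_b):
    """Per-marker IBS (0, 1 or 2 alleles shared) for two individuals.

    Assumed coding: each genotype is a minor-allele count (0, 1 or 2),
    and None marks a missing call. Markers missing in either individual
    are skipped.
    """
    shared = []
    for a, b in zip(geno_a, geno_b):
        if a is None or b is None:
            continue
        # Two genotypes differing by k allele copies share 2 - k alleles by state.
        shared.append(2 - abs(a - b))
    return shared


def ibs_summary(geno_a, geno_b):
    """Mean and (population) SD of IBS across markers for one pair."""
    ibs = ibs_counts(geno_a, geno_b)
    return mean(ibs), pstdev(ibs)


# Hypothetical toy genotypes at six markers.
person_1 = [0, 1, 2, 1, 0, 2]
person_2 = [0, 1, 0, 2, 1, 2]
ibs_mean, ibs_sd = ibs_summary(person_1, person_2)
```

Plotting the mean against the SD for every pair gives the kind of two-dimensional summary in which relative types are expected to form clusters. Because the values depend on allele frequencies, pairs drawn from populations with different frequencies need not fall into the same clusters.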
All popular algorithms for relationship inference depend on reliable estimates of allele frequencies at each SNP, assuming a homogeneous population without stratification (Abecasis et al.; Boehnke and Cox, 1997; Lynch and Ritland, 1999; McPeek and Sun, 2000; Purcell et al.). Recent GWAS analytic advances for association mapping have incorporated the presence of unknown family and population structure (Choi et al.; Kang et al.; Thornton and McPeek, 2010; Yang et al.; Zhang et al.); however, algorithms to estimate family relationships remain based on the assumption of population homogeneity. In samples with undetected population substructure, this strong assumption of population homogeneity leads to biased results, systematically inflating the degree of relatedness among individuals of the same racial group.
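The direction of this bias can be seen in a small worked calculation. The sketch below uses a deliberately simplified moment estimator of the probability that a pair shares no alleles IBD: the observed rate of IBS = 0 divided by the rate expected for unrelated individuals, which under HWE is 2p²q² at a SNP with allele frequency p. The estimator, the function names and the toy allele frequencies are illustrative assumptions, not the procedure of any particular package.

```python
def expected_ibs0_rate(p):
    """P(IBS = 0) at one SNP for two unrelated individuals under HWE.

    IBS = 0 requires opposite homozygotes (AA with aa, in either order),
    which has probability 2 * p^2 * q^2.
    """
    q = 1.0 - p
    return 2.0 * p * p * q * q


def naive_z0(freqs_true, freqs_assumed):
    """Simplified moment estimate of P(IBD = 0) for an unrelated pair.

    The numerator is the IBS = 0 rate the pair is expected to show given
    the allele frequencies of the population they actually come from; the
    denominator is the rate predicted from the frequencies the analyst assumes.
    """
    observed = sum(expected_ibs0_rate(p) for p in freqs_true)
    expected = sum(expected_ibs0_rate(p) for p in freqs_assumed)
    return observed / expected


# Hypothetical allele frequencies at four SNPs in two subpopulations.
pop1 = [0.1, 0.2, 0.8, 0.9]
pop2 = [0.9, 0.8, 0.2, 0.1]
# Frequencies estimated from an equal mixture treated as homogeneous.
pooled = [(a + b) / 2.0 for a, b in zip(pop1, pop2)]

z0_correct = naive_z0(pop1, pop1)    # correct frequencies: 1.0
z0_pooled = naive_z0(pop1, pooled)   # pooled frequencies: about 0.27
```

In this toy configuration, two unrelated individuals from the first subpopulation show far fewer opposite-homozygote markers than the pooled frequencies predict, so the estimated probability of sharing no alleles IBD falls well below 1 and the pair appears related.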
Current approaches to relationship and population structure inference are somewhat circular. Relationship inference relies on correct specification of a homogeneous subpopulation (Purcell et al.), while the detection of population structure relies on the correct identification of unrelated individuals (Zhu et al.). In addition to lacking robustness to population structure, existing approaches do not apply to small datasets, e.g. for comparison of a single pair of individuals, or relationship inference on a single pedigree.
We present a novel framework for relationship inference, Kinship-based INference for Genome-wide association studies (KING), together with a rapid algorithm for relationship inference appropriate for use on samples with thousands of individuals genotyped at millions of autosomal SNPs, consistent with the scale typically achieved in a GWAS. Within this framework we present two methods: (i) KING-homo, derived under the assumption of population homogeneity, and (ii) KING-robust, which provides robust relationship inference in the presence of population substructure. The estimated pedigree information provided by KING (such as kinship coefficients) can be used to verify relationships, reconstruct pedigrees and conduct genetic association tests without relying on self-reported pedigree information. Our computationally efficient and flexible approach allows automated pedigree error detection, and is amenable to datasets involving a very small number of individuals, as encountered in forensic DNA analysis.