Using the FHS sample, we have demonstrated that imputation generally improves statistical power for GWAS, even when the imputed individuals have not been genotyped at all. Using GWAS top hits identified with observed genotypes as positive controls, we explored how different quality control thresholds for incorporating the genotypes of imputed individuals affect the association results. To perform imputation in large pedigrees that are too complex to be handled with realistic computing power, we developed an algorithm that splits and trims large pedigrees into sub-pedigrees so as to optimize the number of most closely related genotyped relatives for each un-genotyped or poorly genotyped individual. Unlike imputation of un-genotyped variants using an external reference, family-based imputation imputes un-genotyped or poorly genotyped individuals. Because these individuals have no genotypes, or no good-quality genotypes, the actual imputation quality cannot be computed. Imputation certainty is therefore used as the measure of imputation quality.
The Third Generation cohort has better imputation certainty than the previous two generations because more Third Generation individuals have at least one genotyped parent, or more genotyped relatives. The proportion of imputed Third Generation subjects having at least one genotyped parent is 91.9%, versus 0.2% and 32.3% for imputed individuals in the Original and Offspring cohorts, respectively. The average numbers of genotyped 1st degree relatives per imputed individual for the Original, Offspring and Third Generation cohorts among the 3121 imputed individuals are 6.1, 5.3 and 9.2, respectively. In addition, the average numbers of genotyped 1st, 2nd, and 3rd degree relatives for the Original, Offspring and Third Generation cohorts among the 1187 individuals are 7.9, 7.1 and 9.8, respectively. This explains why the average imputation certainty is higher in the 1187 individuals than in all 3121 individuals for each generation. The 1187 individuals in split sub-pedigrees have higher mean imputation certainty than the rest of the imputed sample because imputed individuals in large pedigrees (which therefore need splitting) generally have more genotyped relatives, and the most informative of these relatives are retained in the split sub-pedigrees created by our algorithm. When imputation certainty is regressed on the numbers of well-genotyped 1st, 2nd, and 3rd degree relatives, the bit size and the number of members in the sub-pedigree, the number of genotyped 1st degree relatives and the number of members in the sub-pedigree are positively associated with imputation certainty with p-values , respectively. This indicates that most of the information is contributed by the genotyped 1st degree relatives. Because the proposed algorithm relies on the relationships within a large pedigree, its results are inherently sensitive to any pedigree misspecification.
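The two pedigree-level quantities discussed above, the number of genotyped 1st degree relatives and the bit size, can be illustrated on a toy pedigree. The sketch below is not the splitting algorithm itself. The pedigree, the IDs and the genotyping statuses are invented for illustration, and the bit size uses the conventional Lander-Green definition (twice the number of non-founders minus the number of founders), which is assumed rather than confirmed to be the definition used in the regression.

```python
# Toy pedigree: id -> (father, mother); founders have (None, None).
# All IDs and genotyping statuses are hypothetical.
PEDIGREE = {
    "gf": (None, None), "gm": (None, None),
    "p1": ("gf", "gm"), "p2": ("gf", "gm"),
    "sp": (None, None),
    "c1": ("p1", "sp"), "c2": ("p1", "sp"),
}
GENOTYPED = {"gm", "p2", "sp", "c1", "c2"}


def first_degree_relatives(person, pedigree):
    """Return the parents, full siblings and children of `person`."""
    father, mother = pedigree[person]
    relatives = {p for p in (father, mother) if p is not None}
    for other, (f, m) in pedigree.items():
        if other == person:
            continue
        # Full sibling: both parents are known and shared.
        if father is not None and mother is not None and (f, m) == (father, mother):
            relatives.add(other)
        # Child: `person` is one of the parents.
        if person in (f, m):
            relatives.add(other)
    return relatives


def count_genotyped_first_degree(person, pedigree, genotyped):
    """Number of genotyped 1st degree relatives of `person`."""
    return len(first_degree_relatives(person, pedigree) & genotyped)


def bit_size(pedigree):
    """Conventional bit size: 2 * non-founders - founders (assumed definition)."""
    founders = sum(1 for f, m in pedigree.values() if f is None and m is None)
    non_founders = len(pedigree) - founders
    return 2 * non_founders - founders


print(count_genotyped_first_degree("p1", PEDIGREE, GENOTYPED))
print(bit_size(PEDIGREE))
```

In this toy pedigree, `p1` is un-genotyped but has a genotyped mother, sibling and two children, which is the kind of configuration in which family-based imputation is expected to be most informative.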
The imputation works well: only 23 and 368 SNPs have a MAF difference greater than 0.1 and 0.05, respectively (maximum 0.218), between the well-genotyped and the filtered imputed samples. The 23 SNPs have 498 Mendelian errors on average, which suggests that the number of Mendelian errors is an additional useful criterion for selecting SNPs for imputation. If all 3121 imputed individuals are used, MAF is in general underestimated. This confirms the necessity of an imputation certainty filter and supports the validity of our GWAS results based on the incorporated genotype data.
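The allele-frequency comparison described above can be sketched as follows. The genotype coding (0/1/2 copies of the coded allele, `None` for missing), the SNP names and the data are assumptions made for illustration. The frequency of the same coded allele is compared in both samples, so that an allele that is minor in one sample and major in the other is not masked.

```python
def allele_freq(genotypes):
    """Frequency of the coded allele from 0/1/2 genotype codes; None = missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))


def flag_discordant(observed, imputed, threshold):
    """Return SNPs whose coded-allele frequency differs by more than `threshold`."""
    flagged = []
    for snp, obs_genotypes in observed.items():
        f_obs = allele_freq(obs_genotypes)
        f_imp = allele_freq(imputed.get(snp, []))
        if f_obs is None or f_imp is None:
            continue  # frequency not estimable in one of the samples
        if abs(f_obs - f_imp) > threshold:
            flagged.append(snp)
    return flagged


# Hypothetical toy data: SNP -> genotype codes per individual.
observed = {
    "snpA": [0, 1, 0, 0, 1, 0],
    "snpB": [1, 1, 0, 2],
    "snpC": [0, 0, 1, 0, 0],
}
imputed = {
    "snpA": [0, 0, 1, 0],
    "snpB": [0, 0, 1, None],
    "snpC": [1, 0, 0],
}

print(flag_discordant(observed, imputed, 0.05))
print(flag_discordant(observed, imputed, 0.10))
```

SNPs flagged at the stricter threshold could then be cross-checked against their Mendelian error counts before being used for imputation.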
When incorporating imputed genotype data with observed genotype data, we considered various combinations of thresholds on person-specific certainty and SNP-specific certainty to filter out genotypes and individuals with lower imputation certainty. Although statistical significance improved for most threshold combinations and most of the top SNPs, there are still a few cases (rs4681 for fibrinogen and rs10186236 for HDL) with no improvement for any threshold combination. The failure to strengthen the statistical significance is unlikely to be due to low imputation certainty, as the average certainties in the incorporated imputed individuals for rs4681 and rs10186236 are 97.1% (top 3rd) and 96.2% (top 4th), respectively, and the improvement does not seem to be associated with high imputation certainty. The lack of improvement may be due to heterogeneity in phenotypes between imputed and genotyped individuals and/or noise in the imputed genotypes. Even though family-based imputation increases the sample size, which in theory increases power, it also introduces noise through the uncertainty in the imputed genotypes. No combination of imputation certainty thresholds consistently gives better results than the others. The thresholds we adopted for our sample (person-specific certainty >0.5 and SNP-specific certainty >0.6) seem to work well. They may be applicable to other studies, but their sensitivity should still be examined when they are applied to a different study. In addition, imputation certainty decreases as MAF increases. One can thus take MAF into consideration when applying the SNP-specific imputation certainty threshold during quality filtering.
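A two-level certainty filter of this kind can be sketched as below. The definitions are assumptions made for illustration only: person-specific certainty is taken as the mean certainty of an individual across SNPs, and SNP-specific certainty as the certainty of that individual's imputed genotype at a given SNP. The function name, data layout and values are hypothetical, and only the default thresholds (0.5 and 0.6) come from the text.

```python
from statistics import mean


def filter_imputed(certainty, person_min=0.5, snp_min=0.6):
    """Keep imputed genotypes that pass both certainty thresholds.

    certainty: person -> {snp: certainty in [0, 1]} (assumed layout).
    Returns person -> set of SNPs whose imputed genotype is retained.
    """
    kept = {}
    for person, per_snp in certainty.items():
        # Drop the whole individual if the mean certainty is too low.
        if not per_snp or mean(per_snp.values()) <= person_min:
            continue
        # Within a retained individual, keep only sufficiently certain genotypes.
        kept[person] = {snp for snp, c in per_snp.items() if c > snp_min}
    return kept


# Hypothetical certainties for three imputed individuals.
certainty = {
    "a": {"s1": 0.90, "s2": 0.55, "s3": 0.70},
    "b": {"s1": 0.30, "s2": 0.40, "s3": 0.65},
    "c": {"s1": 0.61, "s2": 0.45, "s3": 0.50},
}

print(filter_imputed(certainty))
```

Running the same function over a grid of `person_min` and `snp_min` values is one simple way to carry out the sensitivity examination recommended for other studies.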
With both the imputed sample (person-specific certainty >0.5 and SNP-specific certainty >0.6) and the genotyped sample included in GWAS of Alzheimer disease, fibrinogen, HDL and uric acid, the statistical significance is strengthened for 7 out of 9 independent genome-wide significant loci, or for 98 out of 146 genome-wide significant SNPs, while the inflation measured by the genomic control factor (λ) remains similar to that of GWAS using genotyped individuals only. Among the 7 loci, APOC1 for Alzheimer disease has the smallest increase in sample size, but its proportional reduction in the standard error of beta is the largest, and so is its increase in statistical significance. In general, one would expect the proportional reduction in the standard error to be similar to the proportional increase in the square root of the sample size. The disproportionate change in this case arises because Alzheimer disease is more common in the added imputed sample: there are 164 cases (5.1%) among the 3192 genotyped individuals and 30 cases (10.4%) among the 288 added imputed individuals. Except for the association between the DPP10 locus and HDL, all associations have been previously reported or confirmed by meta-analysis. Figures S5 to S13 are the regional association plots generated by SNAP for the 9 loci, based on GWAS using the incorporated genotype data. rs4420638 is 340 bp and 10297 bp away from APOC1 and the well-known APOE gene, respectively. In the FHS 550K data, no SNP is genotyped in APOE and no SNP is in high linkage disequilibrium with rs4420638. Therefore, as shown in Figure S5 for incident Alzheimer disease, rs4420638 is the only genome-wide significant SNP in the ±200 kb region around itself. In addition, rs4420638 is strongly associated with APOE, with p-value 3.3×10⁻⁸, as previously reported in FHS. The association between rs1800588 (LIPC) and HDL becomes genome-wide significant after including the imputed sample. Similarly, the association between rs13148356 in SLC2A9 and uric acid becomes genome-wide significant after imputation, which indicates that rs13148356 is also likely a truly associated variant that was missed when analyzing genotyped individuals only. The associations of rs4681 (FGB) with fibrinogen and rs10186236 (DPP10) with HDL are slightly weakened, but the latter association has not previously been reported.
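The sample-size argument for the Alzheimer disease locus can be checked with a short calculation using the counts quoted above. The first ratio applies the usual approximation that the standard error scales with the inverse square root of the sample size. The second ratio replaces the sample size with the number of cases, a rough heuristic for uncommon outcomes that is an assumption of this sketch and not part of the original analysis.

```python
import math


def expected_se_ratio(n_before, n_after):
    """Expected SE(after) / SE(before) if SE is proportional to 1 / sqrt(n)."""
    return math.sqrt(n_before / n_after)


# Counts quoted in the text for Alzheimer disease.
n_genotyped, n_added = 3192, 288
cases_genotyped, cases_added = 164, 30

# Case proportions in the genotyped and the added imputed individuals.
prop_genotyped = cases_genotyped / n_genotyped
prop_added = cases_added / n_added

# Expected SE ratio based on total sample size.
ratio_by_n = expected_se_ratio(n_genotyped, n_genotyped + n_added)

# Expected SE ratio based on the number of cases (heuristic assumption).
ratio_by_cases = expected_se_ratio(cases_genotyped, cases_genotyped + cases_added)

print(f"case proportion: {prop_genotyped:.1%} genotyped, {prop_added:.1%} added")
print(f"expected SE reduction from sample size: {1 - ratio_by_n:.1%}")
print(f"expected SE reduction from case count:  {1 - ratio_by_cases:.1%}")
```

Under these assumptions, adding 288 individuals would reduce the standard error by roughly 4% on the basis of sample size alone, but by roughly 8% on the basis of case count, which is in line with the explanation that the higher case proportion in the added sample drives the larger-than-expected gain.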
Our results demonstrate that the proposed algorithm for splitting and trimming large pedigrees for IBD-based imputation works well, and that including the imputed sample together with the genotyped sample in GWAS generally strengthens the association signals at loci whose associations are already well established. We identified a plausible imputation quality filtering threshold based on FHS. Further examination may be required before this threshold is generalized to other studies.