The NARAC data consisted of 868 cases of rheumatoid arthritis (RA) and 1194 controls genotyped on the 550 k Illumina SNP chip. Four regions were selected on chromosome 1, each consisting of 30 consecutive SNPs, representing regions with disease association (PTPN22 [9, 10] and PADI4 [11, 12]) and without disease association, and with high or low LD. SNPs deviating from Hardy-Weinberg equilibrium (HWE) (p < 0.001) or with call rates below 95% were removed before analysis.
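The SNP quality-control filter described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which HWE test was used or in which subjects it was applied, so a 1-df chi-square goodness-of-fit test on all called genotypes is assumed here, and the function names `hwe_chisq_pvalue` and `passes_qc` are hypothetical.

```python
from math import erfc, sqrt


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for HWE from genotype counts.

    Assumption: the paper does not name its HWE test; an exact test
    would be an equally plausible choice.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: the test is uninformative
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df.
    return erfc(sqrt(chi2 / 2))


def passes_qc(genotypes, hwe_threshold=0.001, min_call_rate=0.95):
    """Apply the call-rate and HWE filters to one SNP.

    genotypes: minor-allele counts (0, 1, 2) per subject, None if uncalled.
    """
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    if len(called) / len(genotypes) < min_call_rate:
        return False
    p_value = hwe_chisq_pvalue(called.count(0), called.count(1), called.count(2))
    return p_value >= hwe_threshold
```

A SNP is retained only if `passes_qc` returns `True`, that is, if its call rate is at least 95% and its HWE p-value is not below 0.001.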
Two scenarios were considered: 1) imputation of "untyped" markers and 2) imputation to combine two datasets.
Scenario 1
A set of genotyped SNPs was removed completely and subsequently imputed for all subjects. LD plots for the regions as well as a list of removed SNPs are provided by Fridley et al. in this volume [13]. For null regions 1 and 2, seven and eight SNPs were removed, respectively. For the PTPN22 region, two datasets were created with four SNPs excluded in addition to either the most strongly associated SNP (rs2476601) or the two SNPs flanking rs2476601. A similar approach was taken for the PADI4 region, with rs6683201 or the two SNPs flanking rs6683201 removed in addition to five other SNPs.
Scenario 2
To represent the combined analysis of data from two studies, cases and controls were randomly assigned to two cohorts, resulting in 434 cases and 597 controls per cohort. Genotypes at 10 randomly selected SNPs from each region were removed for all individuals in cohort 1. A second, non-overlapping set of 10 random SNPs was removed in cohort 2. Thus, in each region, 10 SNPs were genotyped in both cohorts, 10 were genotyped only in cohort 1 and imputed in cohort 2, and 10 were genotyped only in cohort 2 and imputed in cohort 1.
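The scenario 2 design can be illustrated with a short sketch. It is not the code used in the study: the function name `make_scenario2`, the integer subject and SNP identifiers, and the random seed are all assumptions made for illustration. Cases and controls are shuffled separately so that each cohort receives half of each.

```python
import random


def make_scenario2(case_ids, control_ids, n_snps=30, n_masked=10, seed=0):
    """Split subjects into two cohorts and pick disjoint SNP sets to mask.

    Returns (cohort1, cohort2, masked_in_1, masked_in_2), where the masked
    sets hold the indices of SNPs whose genotypes are deleted in that cohort.
    """
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    cases, controls = list(case_ids), list(control_ids)
    rng.shuffle(cases)
    rng.shuffle(controls)
    half_cases, half_controls = len(cases) // 2, len(controls) // 2
    cohort1 = cases[:half_cases] + controls[:half_controls]
    cohort2 = cases[half_cases:] + controls[half_controls:]

    # Draw 2 * n_masked distinct SNPs, then split them so the sets cannot overlap.
    drawn = rng.sample(range(n_snps), 2 * n_masked)
    masked_in_1 = set(drawn[:n_masked])
    masked_in_2 = set(drawn[n_masked:])
    return cohort1, cohort2, masked_in_1, masked_in_2
```

With 868 cases, 1194 controls, and 30 SNPs per region, this yields 434 cases and 597 controls per cohort and leaves 10 SNPs genotyped in both cohorts.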
Imputation was performed using IMPUTE v 0.4.1 [2], MACH v 1.0.16 [4], fastPHASE v 1.2.3 [8], and PLINK v 0.99 [7]. Haplotypes of the 60 HapMap CEU founders were used as the reference data to run IMPUTE, MACH, and PLINK for scenarios 1 and 2, and to run fastPHASE for scenario 1. For fastPHASE under scenario 2, only the samples from the NARAC data were used. Programs were run with default options except as follows. To ensure convergence of MACH, each dataset was run with 150 iterations (the "--rounds 150" option); the "--dose" option was also used with MACH. For imputation of untyped SNPs (scenario 1), the IMPUTE options "-exclude_SNPs file -impute_excluded" were used, while for imputation under scenario 2 the "-pgs" option was used. Full details of the commands used may be obtained from the authors on request.
Our assessment of error rates focused on the proportion of incorrect genotypes obtained by imputing the most likely genotype for each missing value, regardless of the confidence in the imputation. Associations were assessed assuming log-additive allelic effects on RA risk. p-Values were calculated using the complete data and each set of imputed data. In addition, for scenario 2, association analyses using the "non-missing data" (genotypes available for only one cohort) were performed. Association tests based on imputed data used the "allele dose" from MACH (the estimated number of minor alleles, ranging from 0 to 2), the most likely genotypes imputed using fastPHASE and PLINK, and the posterior probabilities from IMPUTE. For IMPUTE, association tests were performed using the accompanying program SNPTEST, with the "-proper -frequentist 1" options.
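The quantities used in this assessment can be made concrete with a minimal sketch. Several points are assumptions rather than details from the study: the posterior probabilities are taken to be ordered as (P(0), P(1), P(2)) copies of the minor allele, the function names are hypothetical, and the association test shown is a score test for the log-additive logistic model (equivalent to an Armitage trend test), whereas the text does not specify the test statistic used and SNPTEST applies its own procedure.

```python
from math import erfc, sqrt


def best_guess(probs):
    """Most likely genotype (0, 1, or 2) given (P(0), P(1), P(2))."""
    return max(range(3), key=lambda g: probs[g])


def allele_dose(probs):
    """Expected number of minor alleles, ranging from 0 to 2."""
    return probs[1] + 2 * probs[2]


def genotype_error_rate(true_genotypes, posterior_probs):
    """Proportion of masked genotypes whose best-guess call is wrong.

    Every call counts, regardless of how confident the imputation was.
    """
    calls = [best_guess(p) for p in posterior_probs]
    wrong = sum(c != t for c, t in zip(calls, true_genotypes))
    return wrong / len(true_genotypes)


def trend_test_pvalue(status, dose):
    """Score test of a log-additive effect of dose on case status.

    status: 1 for cases, 0 for controls; dose: genotype or allele dose.
    Assumption: illustrates the log-additive model only; it is not the
    procedure implemented in SNPTEST or the other programs.
    """
    n = len(status)
    y_bar = sum(status) / n
    x_bar = sum(dose) / n
    score = sum((y - y_bar) * x for y, x in zip(status, dose))
    variance = y_bar * (1 - y_bar) * sum((x - x_bar) ** 2 for x in dose)
    if variance == 0:
        return 1.0  # no variation in status or in dose
    chi2 = score * score / variance
    return erfc(sqrt(chi2 / 2))  # chi-square survival function, 1 df
```

Best-guess calls discard imputation uncertainty, whereas the allele dose retains it, which is why the two forms of imputed data can give different p-values at the same SNP.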