Identify file formats ● TIMING ~5 min
1| For SNP data available in standard PED and MAP file formats, as in our CG study, follow option A. For SNP data available in standard binary file format, as in our GWA study, follow option B. The instructions provided here are for unpacking the sample data provided as gzipped files at
http://www.well.ox.ac.uk/ggeu/NPanalysis/. If using the .zip files provided as
supplementary Data 1 or
supplementary Data 2, please proceed directly to step 2.
▲
CRITICAL STEP The format in which genotype data are returned to investigators varies according to genome-wide SNP platforms and genotyping centers. We assume that genotypes have been called by the genotyping center, undergone appropriate quality control filters as described in a previous protocol
3 and returned as clean data in a standard file format.
- Standard PED and MAP file format
- Download the file ‘cg-data.tgz’.
- Type ‘tar -xvzf cg-data.tgz’ at the shell prompt to unpack the gzipped .tar file and create files ‘cg.ped’ and ‘cg.map’.
▲
CRITICAL STEP The simulated data used here have passed standard quality control filters: all individuals have a missing data rate of < 20%, and SNPs with a missing rate of > 5%, a MAF < 1% or an HWE
P value < 1 × 10
− 4 have already been excluded. These filters were selected in accordance with procedures described elsewhere
3 to minimize the influence of genotype-calling artifacts in a CG study.
- Standard binary file format
- Download the file ‘gwa-data.tgz’.
- Type ‘tar -xvzf gwa-data.tgz’ at the shell prompt to unpack the gzipped .tar file and obtain the standard binary files ‘gwa.bed’, ‘gwa.bim’ and ‘gwa.fam’ and the covariate file ‘gwa-covar.’
▲
CRITICAL STEP Optimized binary BED files contain the genotype information and the corresponding BIM/FAM files contain the map and pedigree information. The binary BED file is a compressed file that allows faster processing in PLINK and takes less storage space, thus facilitating the analysis of large-scale data sets
32.
▲
CRITICAL STEP The simulated data used here have passed standard quality control: all individuals have a missing data rate of < 10%. SNPs with a missing rate > 10%, a MAF < 1% or an HWE
P value < 1 × 10
− 5 have already been excluded. These filters were selected in accordance with procedures described elsewhere
3 to minimize the influence of genotype-calling artifacts in a GWA study.
? TROUBLESHOOTING
Basic descriptive summary ● TIMING ~5 min
2| To obtain a summary of MAFs in case and control populations and an estimate of the OR for association between the minor allele (based on the whole sample) and disease in the CG study, type ‘plink --file cg --assoc --out data’. In any of the PLINK commands in this protocol, replace the ‘--file cg’ option with the ‘--bfile gwa’ option to use the binary file format of the GWA data rather than the PED and MAP file format of the CG data.
▲ CRITICAL STEP PLINK always creates a log file called ‘data.log’, which includes details of the implemented commands, the number of cases and controls in the input files, any excluded data and the genotyping rate in the remaining data. This file is very useful for checking the software is successfully completing commands.
▲ CRITICAL STEP The options in a PLINK command can be specified in any order.
? TROUBLESHOOTING
3| Open the output file ‘data.assoc’. It has one row per SNP containing the chromosome [CHR], the SNP identifier [SNP], the base-pair location [BP], the minor allele [A1], the frequency of the minor allele in the cases [F_A] and controls [F_U], the major allele [A2] and statistical data for an allelic association test including the χ2-test statistic [CHISQ], the asymptotic P value [P] and the estimated OR for association between the minor allele and disease [OR].
? TROUBLESHOOTING
Single SNP tests of association ● TIMING ~5 min
4| When there are no covariates to consider, carry out simple χ2 tests of association by following option A. For inclusion of multiple covariates and covariate interactions, follow option B.
- Simple χ2
tests of association
- Create a file containing output from single SNP χ2 tests of association in the CG data by typing ‘plink --file cg --model --out data’. The command for the GWA data is ‘plink --bfile gwa --model --out data’.
▲ CRITICAL STEP Genotypic, dominant and recessive tests will not be conducted if any one of the cells in the table of case control by genotype counts contains less than five observations. This is because the χ2 approximation may not be reliable when cell counts are small. For SNPs with MAFs < 5%, a sample of more than 2,000 cases and controls would be required to meet this threshold and more than 50,000 would be required for SNPs with MAF < 1%. To change the threshold, use the ‘--cell’ option. For example, we could lower the threshold to 3 and repeat the χ2 tests of association by typing ‘plink --file cg --model --cell 3 --out data’.
- Open the output file ‘data.model’. It contains five rows per SNP, one for each of the association tests described in . Each row contains the chromosome [CHR], the SNP identifier [SNP], the minor allele [A1], the major allele [A2], the test performed [TEST: GENO (genotypic association); TREND (Cochran-Armitage trend); ALLELIC (allelic association); DOM (dominant model); and REC (recessive model)], the cell frequency counts for cases [AFF] and controls [UNAFF], the χ2 test statistic [CHISQ], the degrees of freedom for the test [DF] and the asymptotic P value [P].
- Test of association using logistic regression
- Create a file containing output of association tests based on logistic regression assuming a multiplicative model and including covariates in the GWA data by typing ‘plink --bfile gwa --logistic --covar gwa.covar --out data’.
▲
CRITICAL STEP To specify a genotypic, dominant or recessive model in place of a multiplicative model, include the model option --genotypic, --dominant or --recessive, respectively. To include sex as a covariate, include the option --sex. To specify interactions between covariates, and between SNPs and covariates, include the option --interaction. Open the output file ‘data.assoc.logistic’. If no model option is specified, the first row for each SNP corresponds to results for a multiplicative test of association. If the ‘--genotypic’ option has been selected, the first row will correspond to a test for additivity and the subsequent row to a separate test for deviation from additivity. If the ‘--dominant’ or ‘--recessive’ model options have been selected, then the first row will correspond to tests for a dominant or recessive model of association, respectively. If covariates have been included, each of these
P values is adjusted for the effect of the covariates. The
C ≥ 0 subsequent rows for each SNP correspond to separate tests of significance for each of the
C covariates included in the regression model. Finally, if the ‘--genotypic’ model option has been selected, there is a final row per SNP corresponding to a 2 d.f. LR test of whether both the additive and the deviation from additivity components of the regression model are significant. Each row contains the chromosome [CHR], the SNP identifier [SNP], the base-pair location [BP], the minor allele [A1], the test performed [TEST: ADD (multiplicative model or genotypic model testing additivity), GENO_2DF (genotypic model), DOMDEV (genotypic model testing deviation from additivity), DOM (dominant model) or REC (recessive model)], the number of missing individuals included [NMISS], the OR, the coefficient
z-statistic [STAT] and the asymptotic
P value [
P].▲
CRITICAL STEP ORs for main effects cannot be interpreted directly when interactions are included in the model; their interpretation depends on the exact combination of variables included in the model. Refer to a standard text on logistic regression for more details
36.
? TROUBLESHOOTING
Data visualization ● TIMING ~5 min
5| To create quantile-quantile plots to compare the observed association test statistics with their expected values under the null hypothesis of no association and so assess the number, magnitude and quality of true associations, follow option A. Note that quantile-quantile plots are only suitable for GWA studies comprising hundreds of thousands of markers. To create a Manhattan plot to display the association test
P values as a function of chromosomal location and thus provide a visual summary of association test results that draw immediate attention to any regions of significance, follow option B. To visualize the LD between sets of markers in an LD plot, follow option C. Manhattan and LD plots are suitable for both GWA and CG studies comprising any number of markers. Otherwise, create customized graphics for the visualization of association test output using customized simple R
31 commands
37 (not detailed here)).
- Quantile-quantile plot
- Start R software.
- Create a quantile-quantile plot ‘chisq.qq.plot.pdf’ with a 95% confidence interval based on output from the simple χ2 tests of association described in Step 4A for trend, allelic, dominant or recessive models, wherein statistics have a χ2 distribution with 1 d.f. under the null hypothesis of no association. Create the plot by typing
data < -read.table(“[path_to]/data.model”, header = TRUE); pdf(“[path_to]/chisq.qq.plot.pdf”); library(car);
obs < - data[data$TEST = = “[model]”,]$CHISQ; qqPlot(obs, distribution = ”chisq”, df = 1, xlab = ”Expected chi-squared
values”, ylab = “Observed test statistic”, grid = FALSE); dev.off()’,
where [path_to] is the appropriate directory path and [model] identifies the association test output to be displayed, and where [model] can be TREND (Cochran-Armitage trend); ALLELIC (allelic association); DOM (dominant model); or REC (recessive model). For simple χ2 tests of association based on a genotypic model, in which test statistics have a χ2 distribution with 2 d.f. under the null hypothesis of no association, use the option [df] = 2 and [model] = GENO. - Create a quantile-quantile plot ‘pvalue.qq.plot.pdf’ based on – log10 P values from tests of association using logistic regression described in Step 4B by typing
‘data < - read.table(“[path_to]/data.assoc.logistic”, header = TRUE); pdf(“[path_to]/pvalue.qq.plot.pdf”);
obs < - −log10(sort(data[data$TEST = = ”[model]”,]$P)); exp < - −log10( c(1:length(obs)) /(length(obs) + 1)); plot(exp,
obs, ylab = “Observed (−logP)”, xlab = ”Expected(−logP) “, ylim = c(0,20), xlim = c(0,7))
lines(c(0,7), c(0,7), col = 1, lwd = 2)
; dev.off()’,
where [path_to] is the appropriate directory path and [model] identifies the association test output to be displayed and where [model] is ADD (multiplicative model); GENO_2DF (genotypic model); DOMDEV (genotypic model testing deviation from additivity); DOM (dominant model); or REC (recessive model).
- Manhattan plot
- Start Haploview. In the ‘Welcome to Haploview’ window, select the ‘PLINK Format’ tab. Click the ‘browse’ button and select the SNP association output file created in Step 4. We select our GWA study χ2 tests of association output file ‘data.model’. Select the corresponding MAP file, which will be the ‘.map’ file for the pedigree file format or the ‘.bim’ file for the binary file format. We select our GWA study file ‘gwa.bim’. Leave other options as they are (ignore pairwise comparison of markers > 500 kb apart and exclude individuals with > 50% missing genotypes). Click ‘OK’.
- Select the association results relevant to the test of interest by selecting ‘TEST’ in the dropdown tab to the right of ‘Filter:’, ‘ = ’ in the dropdown menu to the right of that and the PLINK keyword corresponding to the test of interest in the window to the right of that. We select PLINK keyword ‘ALLELIC’ to visualize results for allelic tests of association in our GWA study. Click the gray ‘Filter’ button. Click the gray ‘Plot’ button. Leave all options as they are so that ‘Chromosomes’ is selected as the ‘X-Axis’. Choose ‘P’ from the drop-down menu for the ‘Y-Axis’ and ‘−log10′ from the corresponding dropdown menu for ‘Scale:’. Click ‘OK’ to display the Manhattan plot.
- To save the plot as a scalable vector graphics file, click the button ‘Export to scalable vector graphics:’ and then click the ‘Browse’ button (immediately to the right) to select the appropriate title and directory.
- LD plot
- Start R.
- Using the standard MAP file, create the locus information file required by Haploview for the CG data by typing
‘cg.map < - read.table(“[path_to]/cg.map”);
write.table(cg.map[,c(2,4)],“[path_to]/cg.hmap”, col.names = FALSE, row.names = FALSE, quote = FALSE)
where [path_to] is the appropriate directory path. - Start Haploview. In the ‘Welcome to Haploview’ window, select the ‘LINKAGE Format’ tab. Click the ‘browse’ button to enter the ‘Data File’ and select the PED file ‘cg.ped’. Click the ‘browse’ button to enter the ‘Locus Information File’ and select the file ‘cg.hmap’. Leave other options as they are (ignore pairwise comparison of markers > 500 kb apart and exclude individuals with > 50% missing genotypes). Click ‘OK’. Select the ‘LD Plot’ tab.
- To save the plot as a portable network graphics (PNG) file, click on the ‘File’ button; from the drop-down menu, select ‘Export current tab to PNG’. The appropriate title and directory can then be selected.
? TROUBLESHOOTING
Adjustment for multiple testing ● TIMING ~5 min
6| For CG studies, typically comprising hundreds of thousands of markers, control for multiple testing using Bonferroni’s adjustment (follow option A); Holm, Sidak or FDR (follow option B) methods; or permutation (follow option C). Although Bonferroni, Holm, Sidak and FDR are simple to implement, permutation testing is widely recommended for accurately correcting for multiple testing and should be used when computationally possible. For GWA studies, select an appropriate genome-wide significance threshold (follow option D).
- Bonferroni’s Adjustment
- Consider the total number of markers tested in the CG. For a FWER α = 0.05, derive the per-test significance rate α* by dividing α by the number of markers tested. In our CG study, we have 40 markers; therefore, α* = 0.05/40 = 0.00125. Markers with P values less than α* are then declared significant.
▲
CRITICAL STEP If some of the SNPs are in LD so that there are fewer than 40 independent tests, the Bonferroni correction will be too conservative. Use LD information from HapMap and SNPSpD (
http://genepi.qimr.edu.au/general/daleN/SNPSpD/)
35 to estimate the effective number of independent SNPs
1. Derive the per-test significance rate α* by dividing α by the effective number of independent SNPs.
- Holm, Sidak and FDR
- To obtain significance values adjusted for multiple testing for trend, dominant and recessive tests of association, include the --adjust option along with the model specification option --model-[x] (where [x] is ‘trend’, ‘rec’ or ‘dom’ to indicate whether trend, dominant or recessive test association P values, respectively, are to be adjusted for) in any of the PLINK commands described in Step 4A. For example, adjusted significance values for a Cochran-Armitage trend test of association in the CG data are obtained by typing ‘plink --file cg --adjust --model-trend --out data’. Obtain significance values adjusted for an allelic test of association by typing ‘plink --file cg --assoc –adjust --out data’.
- Open the output file ‘data.model.[x].adjusted’ for adjusted trend, dominant or recessive test association P values or ‘data.assoc.adjusted’ for adjusted allelic test of association P values. These files have one row per SNP containing the chromosome [CHR], the SNP identifier [SNP], the unadjusted P value [UNADJ] identical to that found in the original association output file, the genomic-control–adjusted P value [GC], the Bonferroni-adjusted P value [BONF], the Holm step-down–adjusted P value [HOLM], the Sidak single-step–adjusted P value [SIDAK_SS], the Sidak step-down–adjusted P value [SIDAK_SD], the Benjamini and Hochberg FDR control [FDR_BH] and the Benjamini and Yekutieli FDR control [FDR_BY]. To maintain a FWER or FDR of α = 0.05, only SNPs with adjusted P values less than α are declared significant.
- Permutation
- To generate permuted P values, include the --mperm option along with the number of permutations to be performed and the model specification option –model-[x] (where [x] is ‘gen’, ‘trend’, ‘rec’ or ‘dom’ to indicate whether genotypic, trend, dominant or recessive test association P values are to be permuted) in any of the PLINK commands described in Step 4A. For example, permuted P values based on 1,000 replicates for a Cochran-Armitage trend test of association are obtained by typing ‘plink --file cg --model --mperm 1000 --model-trend --out data’ and permuted P values based on 1,000 replicates for an allelic test of association are obtained by typing ‘plink --file cg --assoc –mperm 1000 --out data’.
- Open the output file ‘data.model.[x].mperm’ for permuted P values for genotypic, trend, dominant or recessive association tests or ‘data.assoc.mperm’ for permuted P values for allelic tests of association. These files have one row per SNP containing the chromosome [CHR], the SNP identifier [SNP], the point-wise estimate of the SNP’s significance [EMP1] and the family-wise estimate of the SNP’s significance [EMP2]. To maintain a FWER of α = 0.05, only SNPs with family-wise estimated significance of less than α are declared significant.
- Genome-wide significance threshold ● TIMING ~5 min
- Obtain per-SNP significance thresholds for a given FWER from Hoggart et al22. In our GWA study of 2,000 cases and 2,000 controls from a Caucasian population, the standard per-SNP significance threshold for a FWER of α = 0.05 is estimated at 12 × 10−8 using linear interpolation between the given value of 11 × 10−8 for studies with 1,000 cases and 1,000 controls and 15 × 10−8 for studies with 5,000 cases and 5,000 controls.
? TROUBLESHOOTING
Population stratification ● TIMING ~5 min
7| For CG studies, typically comprising hundreds of thousands of markers, calculate the inflation factor λ (follow option A). For GWA studies, obtain an unbiased evaluation of the inflation factor λ by using all testing SNPs (follow option B).
- Calculate the inflation factor λ for CG studies
- Assuming that PED and MAP files for null loci are available, obtain the inflation factor by specifying the null marker loci data files instead of the CG data files and including the --adjust option along with the model specification option --model-[x] (where [x] is ‘trend’, ‘rec’ or ‘dom’ to indicate whether an inflation factor based on a trend, dominant or recessive test of association, respectively, is to be calculated) in any of the PLINK commands described in Step 4A. For example, the inflation factor corresponding to a Cochran-Armitage trend test of association is obtained by typing ‘plink --file null --model --adjust --model-trend --out data’; the inflation factor corresponding to an allelic test of association is obtained by typing ‘plink --file null --assoc --adjust --out data’, where files ‘null.ped’ and ‘null.map’ are PED and MAP files for the case and control individuals at the null marker loci.
▲ CRITICAL STEP To assess the inflation factor in CG studies, an additional set of null marker loci, which are common SNPs not associated with the disease and not in LD with CG SNPs, must be available. We do not have any null loci data files available for our CG study.
Open the PLINK log file ‘data.log’ that records the inflation factor.
- Calculate the inflation factor λ for GWA studies
- To obtain the inflation factor, include the --adjust option in any of the PLINK commands described in Step 4B. For example, the inflation factor based on logistic regression tests of association for all SNPs and assuming multiplicative or genotypic models in the GWA study is obtained by typing ‘plink --bfile gwa --genotypic --logistic --covar gwa.covar --adjust --out data’.
- Open the PLINK log file ‘data.log’, which records the inflation factor. The inflation factor for our GWA study is 1, indicating that no population stratification is detected in our GWA data.
▲
CRITICAL STEP When the sample size is large, the inflation factor λ
1000, for an equivalent study of 1,000 cases and 1,000 controls, can be calculated by rescaling λ according to the following formula
? TROUBLESHOOTING
Step 1: If genotypes are not available in standard PED and MAP or binary file formats, both Goldsurfer2 (Gs2; see refs.
38,
39) and PLINK have the functionality to read other file formats (e.g., HapMap, HapMart, Affymetrix, transposed file sets and long-format file sets) and convert these into PED and MAP or binary file formats.
Steps 2–6: The default missing genotype character is ‘0′. PLINK can recognize a different character as the missing genotype by using the ‘--missing-genotype’ option. For example, specify a missing genotype character of ‘N’ instead of ‘0′ in Step 2 by typing ‘plink --file cg --assoc --missing-genotype N --out data’.
● TIMING
None of the programs used take longer than a few minutes to run. Displaying and interpreting the relevant information are the rate-limiting steps.