PLoS One. 2010; 5(11): e15006.

Published online 2010 November 16. doi: 10.1371/journal.pone.0015006

PMCID: PMC2982824

Amanda Ewart Toland, Editor

Ohio State University Medical Center, United States of America

Conceived and designed the experiments: LM SH YD. Performed the experiments: LM JY. Analyzed the data: LM JY. Wrote the paper: LM SH YD.

Received 2010 August 14; Accepted 2010 October 7.

Copyright Ma et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Complex diseases or phenotypes may involve multiple genetic variants and interactions between genetic, environmental and other factors. Genome-wide association studies (GWAS) have mostly used single-locus analysis and have identified genetic effects with multiple confirmations. Such confirmed single-nucleotide polymorphism (SNP) effects are likely to be true genetic effects, and ignoring them when testing new effects on the same phenotype decreases statistical power, because the omitted effects become a component of the residual variance. In this study, a multi-locus association test (MLT) is proposed for GWAS analysis conditional on SNPs with confirmed effects to improve statistical power. Analytical formulae for statistical power were derived and verified by simulation for MLT, which accounts for confirmed SNPs, and for the single-locus test (SLT), which does not. Statistical power of the two methods was compared in case studies using simulated data and the Framingham Heart Study (FHS) GWAS data. Results showed that MLT had increased statistical power over SLT. In the GWAS case study on four cholesterol phenotypes and serum metabolites, MLT improved statistical power by 5% to 38%, depending on the number and effect sizes of the conditional SNPs. In the analysis of HDL cholesterol (HDL-C) and total cholesterol (TC) in the FHS data, MLT conditional on confirmed SNPs from the GWAS catalog and NCBI yielded considerably more significant results than SLT.

Genome-wide association studies (GWAS) have identified genetic variants associated with a number of complex diseases or phenotypes [1], [2], [3], and some of these variants have been confirmed by several studies [1]. Published GWAS typically used a single-locus test (SLT), in which each variant is tested individually for association with a specific phenotype. Single-locus analysis may not be the best approach in the presence of confirmed SNP effects: if those true effects are omitted from the analysis, they become a component of the random residuals and decrease the statistical power for detecting new effects. Controlling for genetic background in single-marker tests has been well explored in QTL mapping, including Zeng's CIM [4] and Jansen's multiple regression method [5]. Roeder et al. [6] proposed using linkage genome scan results in GWAS to achieve higher power. In a recent meta-analysis of smoking behavior, a novel SNP was identified with genome-wide significance conditional on a known SNP [7]. Conditional analysis was also used in other GWAS meta-analyses [8]. In this study, we propose a multi-locus test (MLT) that tests each candidate SNP conditional on confirmed SNPs in order to increase the statistical power for detecting new SNP effects. We demonstrate that MLT has increased statistical power relative to SLT using analytical formulae derived in this study, simulation, case studies, and the Framingham Heart Study (FHS) data.

The multiple linear regression model for the MLT method can be expressed as:

(1)

where *μ*=the population mean of the phenotypic values, *Z** _{i}*=1×

(2)

A standard t-test can be used for testing the significance of the candidate SNP based on testing the following hypotheses, H_{0}: *β _{s}*=0, where

(3)

where *t _{α}* denotes the α% quantile value for t distribution and

(4)

where =the element at the *M*th row and *M*th column of variance-covariance matrix , and is a *M*×*M* variance-covariance matrix of and can be estimated as:

(5)

where **X** is the design matrix in Equation 1.
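As a minimal sketch of the test described above, the following assumes ordinary least squares with the coefficient variance-covariance matrix estimated as the inverse of X'X times the residual variance estimate. The function names and the choice to place the candidate SNP in the last column of the design matrix are illustrative assumptions, not taken from the article or from epiSNP.

```python
import math


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting.

    Assumes the matrix is non-singular (no duplicated or constant SNP columns).
    """
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_last_coef_t(X, y):
    """Return (estimate, standard error, t statistic) for the last column of X.

    X is a list of rows; by the convention assumed here, the first column is
    the intercept and the last column is the candidate SNP genotype code.
    """
    n, m = len(X), len(X[0])
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(m)] for a in range(m)]
    xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(m)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(m)) for a in range(m)]
    rss = sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
              for row, yi in zip(X, y))
    sigma2_hat = rss / (n - m)                     # residual variance estimate
    se = math.sqrt(inv[m - 1][m - 1] * sigma2_hat)  # last diagonal element
    return beta[-1], se, beta[-1] / se
```

Under these assumptions, an MLT-style design row would be `[1, confirmed_1, ..., confirmed_k, candidate]`, and an SLT-style row would be `[1, candidate]`; the same function then gives the t statistic for the candidate SNP in either model.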

For SLT, the statistical model is the same as Equation 1 except that the residual term is now a summation of confirmed effects and random residuals. The residual variance for the SLT model is no longer *σ*^{2} and has the following mathematical expression:

(6)

where var(*G** _{i}*) is calculated based on the

(7)

where *c* is the element at the (*p*+2)th row and (*p*+2)th column of matrix and **X** is the design matrix for the regression model of SLT. Therefore, the t-test statistic for SLT does not have a t-distribution but a t-distribution divided by a constant .
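To make the variance inflation concrete, here is a small sketch under stated assumptions: additive 0/1/2 genotype coding, Hardy-Weinberg and linkage equilibrium (so each confirmed SNP contributes 2p(1-p)a² to var(G)), and a large-sample normal approximation to the one-sided t-test. These assumptions and all function names are illustrative and are not the article's exact formulae. The input values mirror the simulation settings reported in this article (confirmed SNPs with allele frequencies 0.3 and 0.4 and effects 0.3 and 0.2 SD; a candidate SNP with frequency 0.2).

```python
import math
from statistics import NormalDist


def confirmed_variance(freqs, effects):
    """Variance contributed by omitted additive SNP effects.

    Assumes 0/1/2 coding, Hardy-Weinberg and linkage equilibrium.
    """
    return sum(2.0 * p * (1.0 - p) * a * a for p, a in zip(freqs, effects))


def slt_residual_variance(sigma2, freqs, effects):
    """Residual variance when confirmed effects are left in the residual."""
    return sigma2 + confirmed_variance(freqs, effects)


def approx_power(effect, freq, n, resid_var, alpha=0.05):
    """Large-sample approximation to one-sided power for a candidate SNP.

    Assumes the candidate SNP is uncorrelated with the other regressors,
    so its standard error is about sqrt(resid_var / (n * 2q(1-q))).
    """
    se = math.sqrt(resid_var / (n * 2.0 * freq * (1.0 - freq)))
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    return NormalDist().cdf(effect / se - z_crit)


sigma2 = 1.0
var_slt = slt_residual_variance(sigma2, [0.3, 0.4], [0.3, 0.2])
power_mlt = approx_power(0.1, 0.2, 2000, sigma2)   # confirmed effects fitted
power_slt = approx_power(0.1, 0.2, 2000, var_slt)  # confirmed effects omitted
print(var_slt, power_mlt, power_slt)
```

With these inputs the omitted effects add 0.0378 + 0.0192 = 0.057 to a unit residual variance, which is why the gain from conditioning is modest when confirmed effects are small and grows with their number and size.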

Similar to Equation 3, the power of the one-sided t-test can be formulated as:

(8)

where *t _{α}* denotes the α% quantile value for t-distribution with

The analytical formulae for statistical power for MLT accounting for confirmed effects and for SLT without accounting for confirmed effects (Equation 3 and Equation 8) were validated using simulated data of 2000 subjects, with 10,000 replicates for each combination of effect sizes of the candidate SNP and confirmed SNPs. The phenotypic values were simulated as the sum of a population mean, three additive SNP effects and a random error that followed a standard normal distribution. The three SNPs were simulated under Hardy-Weinberg equilibrium and linkage equilibrium with allele frequencies of 0.3, 0.4 and 0.2. The first two SNPs were assumed to have confirmed effects and the last SNP was assumed to be the candidate SNP. The candidate SNP was tested by the MLT and SLT methods in each replicate. Empirical power was calculated as the proportion of significant results among the 10,000 replicates. We fixed the effects of the two confirmed SNPs at 0.3 and 0.2 standard deviations of the residuals (SD) and varied the effect of the candidate SNP from 0.04 to 0.2 SD. Simulated statistical power was nearly identical to the predicted power for both MLT and SLT across different candidate SNP effect sizes (Table S1) and different sample sizes (Table S2). Having validated the power formulae, we calculated predicted statistical power for MLT and SLT for various effect sizes of the confirmed SNPs (Table 1).
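The simulation design described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code: it uses a standard normal quantile as a large-sample stand-in for the t quantile, defaults to far fewer replicates than 10,000 so that pure Python stays fast, and all function names are hypothetical.

```python
import random
from statistics import NormalDist


def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def t_last(X, y):
    """t statistic for the last column of X in an ordinary least-squares fit."""
    n, k = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    c = solve(xtx, [0.0] * (k - 1) + [1.0])[-1]  # last diagonal of (X'X)^-1
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    return beta[-1] / (c * rss / (n - k)) ** 0.5


def empirical_power(n=2000, reps=200, cand_effect=0.1, conf_effects=(0.3, 0.2),
                    freqs=(0.3, 0.4, 0.2), alpha=0.05, seed=1):
    """Return (MLT power, SLT power) as proportions of significant replicates."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1.0 - alpha)  # assumption: normal approximation
    hits_mlt = hits_slt = 0
    for _ in range(reps):
        rows_mlt, rows_slt, y = [], [], []
        for _ in range(n):
            # Two independent allele draws per SNP: HWE and linkage equilibrium.
            g = [(rng.random() < p) + (rng.random() < p) for p in freqs]
            y.append(conf_effects[0] * g[0] + conf_effects[1] * g[1]
                     + cand_effect * g[2] + rng.gauss(0.0, 1.0))
            rows_mlt.append([1.0, g[0], g[1], g[2]])  # conditional on confirmed SNPs
            rows_slt.append([1.0, g[2]])              # candidate SNP only
        hits_mlt += t_last(rows_mlt, y) > crit
        hits_slt += t_last(rows_slt, y) > crit
    return hits_mlt / reps, hits_slt / reps
```

Calling `empirical_power()` with the defaults uses the allele frequencies and confirmed effect sizes stated above; raising `reps` toward 10,000 narrows the Monte Carlo error at the cost of run time.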

We further evaluated predicted statistical power using reported effect sizes for some confirmed SNP effects. We collected all reported SNP effects for HDL cholesterol (HDL-C), LDL cholesterol (LDL-C), triglycerides (TG), total cholesterol (TC), and serum metabolites (SM) from the GWAS catalog [1]. The effect sizes and risk allele frequencies of those SNPs were extracted and used for the power calculations. SNPs in high linkage disequilibrium (LD) were filtered by keeping only the SNP with the largest effect size in each high-LD region. The final selection of confirmed SNP markers included 22 relatively independent SNPs with effect sizes of 0.07–0.24 SD for HDL-C, 24 SNPs with effect sizes of 0.07–0.35 SD for LDL-C, 13 SNPs with effect sizes of 0.06–0.42 SD for TG, and 19 SNPs with effect sizes of 0.06–0.24 SD for TC. For SM, we extracted five SNPs that explained 5.6 to 36.3 percent of the total phenotypic variation [1], [2]. Conditional on those known significant SNP effects, statistical power of MLT increased over that of SLT by about 4–5% for HDL-C, LDL-C, TG and TC. The pattern of the heatmap for statistical power was similar for these four traits, and the heatmap for HDL-C is shown in Figure 1A. The largest improvements were in the region where candidate SNPs had small effect sizes and large allele frequencies, or medium effect sizes and relatively low allele frequencies (0.1–0.2). The increase in statistical power of MLT over SLT was much larger for SM, varying from 10% to 30%, because of the large effect sizes of the known SNPs (Figure 1B).
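The LD filtering step can be illustrated with a greedy sketch: visit SNPs in order of decreasing absolute effect size and keep a SNP only if it is not in high LD with any SNP already kept. The r² threshold of 0.8, the SNP names and the effect values below are hypothetical; the article does not state the threshold or the procedure it used.

```python
def prune_by_effect(effects, r2, threshold=0.8):
    """Keep the largest-effect SNP from each group of SNPs in high LD.

    effects: dict mapping SNP name -> reported effect size.
    r2: dict mapping frozenset({snp_a, snp_b}) -> pairwise r squared;
        pairs that are absent are treated as unlinked (r squared of 0).
    """
    kept = []
    for name in sorted(effects, key=lambda s: abs(effects[s]), reverse=True):
        if all(r2.get(frozenset((name, k)), 0.0) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical inputs: snpA and snpB are in high LD, snpC is independent.
effects = {"snpA": 0.24, "snpB": 0.10, "snpC": 0.07}
r2 = {frozenset(("snpA", "snpB")): 0.95}
print(prune_by_effect(effects, r2))
```

In this toy input, `snpB` is dropped because it is in high LD with the larger-effect `snpA`, while `snpC` is retained.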

For GWAS analysis using real data, true statistical power is not observable, but MLT is expected to yield more significant results than SLT. To compare the observed statistical significance of MLT and SLT, we used the FHS GWAS data (version 2) from dbGaP [9], which had 6575 individuals with SNP genotypes from the 500k SNP panel. Of the 6575 individuals, 6078 had observations on HDL-C and 6431 had observations on TC. From the 500k SNP panel, 432,096 SNP markers with known locations and minor allele frequencies of 0.01 or greater were selected and tested. The original cholesterol measures deviated from normality and had outliers. Box-Cox transformation analysis [10] implemented in R [11] showed that the log-transformation was approximately the best transformation to achieve normality for HDL-C and TC. Transformed HDL-C was adjusted for age, age-squared, cholesterol treatment, blood sugar, body mass index, smoking status, number of cigars smoked, alcohol consumption and sex. Transformed TC was adjusted for blood sugar, body mass index, smoking status, and sex. SNP effects were tested using the generalized least squares version [12] of epiSNP [13]. From the GWAS catalog [1] and NCBI (http://www.ncbi.nlm.nih.gov), we selected six SNP markers (Table 2) with multiple confirmations. These six SNP markers were treated as independent of each other because their pairwise correlations, measured by R-squared, were nearly zero. Results showed that MLT had more significant results than SLT for both HDL-C and TC (Table 2). The first two markers in Table 2 had the largest improvement in observed significance (reduced P-value), while the remaining markers had only minor improvement.
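The Box-Cox step can be sketched with the standard profile log-likelihood, in which a maximizing λ of 0 corresponds to the log-transformation. This is a Python illustration on simulated, hypothetical data (a log-normal sample on a scale loosely resembling a cholesterol measure); it is not the R analysis used in the article, and it omits the covariates.

```python
import math
import random


def boxcox_loglik(y, lam):
    """Profile log-likelihood of the Box-Cox parameter for positive data y."""
    n = len(y)
    z = [math.log(v) if lam == 0 else (v ** lam - 1.0) / lam for v in y]
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n      # ML variance estimate
    jacobian = (lam - 1.0) * sum(math.log(v) for v in y)
    return -0.5 * n * math.log(var) + jacobian


def best_lambda(y, grid):
    """Return the grid value of lambda with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: boxcox_loglik(y, lam))


# Hypothetical data: log-normal values, so the log-transformation is correct.
rng = random.Random(0)
y = [rng.lognormvariate(4.0, 0.5) for _ in range(2000)]
grid = [k * 0.5 for k in range(-4, 5)]             # lambda from -2 to 2
print(best_lambda(y, grid))
```

On real phenotypes the maximizing λ is rarely exactly 0; a value close to 0 is what supports describing the log-transformation as approximately the best choice.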

In our analysis, we did not impute missing genotypes, so MLT had a smaller sample size than SLT. The observed significance would likely have been greater than that shown in Table 2 if the missing genotypes had been imputed using software such as MACH [14] or BEAGLE [15]. Although the improvement in statistical significance can be small in some cases, it is easily achieved with GWAS analysis software such as PLINK [16] and epiSNP [13], which provide options to incorporate covariates. Because it incorporates confirmed effects, MLT has fewer residual degrees of freedom and can be less robust than SLT. Fortunately, the sample size in typical GWAS (thousands) is large enough to accommodate a relatively small number of confirmed effects (tens). As more confirmed effects are reported in the literature, MLT becomes increasingly useful for identifying new genetic variants with smaller effects.

Table S1. Power comparison between analytic formulas and simulation for MLT (Power I) and SLT (Power II) with varied candidate SNP effect sizes, constant confirmed effect sizes (0.2 and 0.3 SDs for two confirmed SNPs) and constant sample size (2000). (DOC)


Table S2. Power comparison between analytic formulas and simulation for MLT (Power I) and SLT (Power II) with varied sample sizes, constant candidate SNP effect size (0.1 SD) and confirmed effect sizes (0.2 and 0.3 SDs for two confirmed SNPs). (DOC)


We are thankful to Dr. Andrew Clark and Dr. Alon Keinan for helpful discussions on this manuscript. We also thank the reviewers for their comments that improved the presentation of the manuscript.

**Competing Interests:** The authors have declared that no competing interests exist.

**Funding:** This research was supported in part by project MN-16-043 of the Agricultural Experiment Station at the University of Minnesota. The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University. This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or the NHLBI. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

1. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA. 2009;106:9362–9367.

2. Illig T, Gieger C, Zhai G, Römisch-Margl W, Wang-Sattler R, et al. A genome-wide perspective of genetic variation in human metabolism. Nat Genet. 2010;42(2):137–41.

3. Sandhu MS, Waterworth DM, Debenham SL, Wheeler E, Papadakis K, et al. LDL-cholesterol concentrations: a genome-wide association study. Lancet. 2008;371(9611):483–491.

4. Zeng Z-B. Precision mapping of quantitative trait loci. Genetics. 1994;136:1457–1468.

5. Jansen RC, Stam P. High resolution of quantitative traits into multiple loci via interval mapping. Genetics. 1994;136:1447–1455.

6. Roeder K, Bacanu SA, Wasserman L, Devlin B. Using linkage genome scans to improve power of association in genome scans. Am J Hum Genet. 2006;78:243–252.

7. Saccone NL, Culverhouse RC, Schwantes-An T-H, Cannon DS, Chen X, et al. Multiple Independent Loci at Chromosome 15q25.1 Affect Smoking Quantity: a Meta-Analysis and Comparison with Lung Cancer and COPD. PLoS Genet. 2010;6(8):e1001053. doi: 10.1371/journal.pgen.1001053.

8. Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, et al. Biological, clinical, and population relevance of 95 loci for blood lipids. Nature. 2010;466:707–713.

9. Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, et al. The NCBI dbGaP database of genotypes and phenotypes. Nat Genet. 2007;39:1181–1186.

10. Box GEP, Cox DR. An analysis of transformations (with discussion). J Roy Stat Soc B. 1964;26:211–252.

11. R Development Core Team. R. A language and environment for statistical computing. 2008. R Foundation for Statistical Computing. Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

12. Ma L, Amos CI, Da Y. Accounting for correlations among individuals for testing SNP single-locus and epistasis effects in Genome-wide association analysis (Abstract). 2008a. Plant & Animal Genomes XVI Conference, January 12–16, San Diego, CA. http://www.intl-pag.org/16/abstracts/PAG16_P11_903.html.

13. Ma L, Runesha HB, Dvorkin D, Garbe JR, Da Y. Parallel and serial computing tools for testing single-locus and epistatic SNP effects of quantitative traits in genome-wide association studies. BMC Bioinformatics. 2008b;9:315.

14. Li Y, Abecasis GR. Mach 1.0: Rapid Haplotype Reconstruction and Missing Genotype Inference. Am J Hum Genet. 2006;S79:2290.

15. Browning SR, Browning BL. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet. 2007;81:1084–1097.

16. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, et al. PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Hum Genet. 2007;81:559–575.
