Rare variants are believed to play an important role in disease etiology. Recent advances in high-throughput sequencing technology enable investigators to systematically characterize the genetic effects of both common and rare variants. We introduce several approaches that simultaneously test the effects of common and rare variants within a single-nucleotide polymorphism (SNP) set based on logistic regression models and logistic kernel machine models. Gene-environment interactions and SNP-SNP interactions are also considered in some of these models. We illustrate the performance of these methods using the unrelated individuals data from Genetic Analysis Workshop 17. Three true disease genes (FLT1, PIK3C3, and KDR) were consistently selected using the proposed methods. In addition, compared to logistic regression models, the logistic kernel machine models were more powerful, presumably because they reduced the effective number of parameters through regularization. Our results also suggest that a screening step is effective in decreasing the number of false-positive findings, which is often a big concern for association studies.
In this paper, we develop a powerful test for identifying SNP-sets that are predictive of survival with data from genome-wide association studies (GWAS). We first group typed SNPs into SNP-sets based on genomic features and then apply a score test to assess the overall effect of each SNP-set on the survival outcome through a kernel machine Cox regression framework. This approach uses genetic information from all SNPs in the SNP-set simultaneously and accounts for linkage disequilibrium (LD), leading to a powerful test with reduced degrees of freedom when the typed SNPs are in LD with each other. This type of test also has the advantage of capturing the potentially non-linear effects of the SNPs, SNP-SNP interactions (epistasis), and the joint effects of multiple causal variants. By simulating SNP data based on the LD structure of real genes from the HapMap project, we demonstrate that our proposed test is more powerful than the standard single SNP minimum p-value based test for association studies with censored survival outcomes. We illustrate the proposed test with a real data application.
cox model; genetic studies; gene-based analysis; kernel machine; multi-locus test; score test; single nucleotide polymorphism
We propose in this paper a powerful testing procedure for detecting a gene effect on a continuous outcome in the presence of possible gene-gene interactions (epistasis) in a gene set, e.g. a genetic pathway or network. Traditional tests for this purpose require a large number of degrees of freedom by testing the main effect and all the corresponding interactions under a parametric assumption, and hence suffer from low power. In this paper, we propose a powerful kernel machine based test. Specifically, our test is based on a garrote kernel method and is constructed as a score test. Here, the term garrote refers to an extra nonnegative parameter which is multiplied to the covariate of interest so that our score test can be formulated in terms of this nonnegative parameter. A key feature of the proposed test is that it is flexible and developed for both parametric and nonparametric models within a unified framework, and is more powerful than the standard test by accounting for the correlation among genes and hence often uses a much smaller degrees of freedom. We investigate the theoretical properties of the proposed test. We evaluate its finite sample performance using simulation studies, and apply the method to the Michigan prostate cancer gene expression data.
Garrote; Gene-gene interaction; Kernel machine; Mixed models; Restricted maximum likelihood; Score test; Semiparametric regression
Regional-based association analysis instead of individual testing of each SNP was introduced in genome-wide association studies to increase the power of gene mapping, especially for rare genetic variants. For regional association tests, the kernel machine-based regression approach was recently proposed as a more powerful alternative to collapsing-based methods. However, the vast majority of existing algorithms and software for the kernel machine-based regression are applicable only to unrelated samples. In this paper, we present a new method for the kernel machine-based regression association analysis of quantitative traits in samples of related individuals. The method is based on the GRAMMAR+ transformation of phenotypes of related individuals, followed by use of existing kernel machine-based regression software for unrelated samples. We compared the performance of kernel-based association analysis on the material of the Genetic Analysis Workshop 17 family sample and real human data by using our transformation, the original untransformed trait, and environmental residuals. We demonstrated that only the GRAMMAR+ transformation produced type I errors close to the nominal value and that this method had the highest empirical power. The new method can be applied to analysis of related samples by using existing software for kernel-based association analysis developed for unrelated samples.
Growing interest on biological pathways has called for new statistical methods for modeling and testing a genetic pathway effect on a health outcome. The fact that genes within a pathway tend to interact with each other and relate to the outcome in a complicated way makes nonparametric methods more desirable. The kernel machine method provides a convenient, powerful and unified method for multi-dimensional parametric and nonparametric modeling of the pathway effect.
In this paper we propose a logistic kernel machine regression model for binary outcomes. This model relates the disease risk to covariates parametrically, and to genes within a genetic pathway parametrically or nonparametrically using kernel machines. The nonparametric genetic pathway effect allows for possible interactions among the genes within the same pathway and a complicated relationship of the genetic pathway and the outcome. We show that kernel machine estimation of the model components can be formulated using a logistic mixed model. Estimation hence can proceed within a mixed model framework using standard statistical software. A score test based on a Gaussian process approximation is developed to test for the genetic pathway effect. The methods are illustrated using a prostate cancer data set and evaluated using simulations. An extension to continuous and discrete outcomes using generalized kernel machine models and its connection with generalized linear mixed models is discussed.
Logistic kernel machine regression and its extension generalized kernel machine regression provide a novel and flexible statistical tool for modeling pathway effects on discrete and continuous outcomes. Their close connection to mixed models and attractive performance make them have promising wide applications in bioinformatics and other biomedical areas.
We consider a semiparametric regression model that relates a normal outcome to covariates and a genetic pathway, where the covariate effects are modeled parametrically and the pathway effect of multiple gene expressions is modeled parametrically or nonparametrically using least-squares kernel machines (LSKMs). This unified framework allows a flexible function for the joint effect of multiple genes within a pathway by specifying a kernel function and allows for the possibility that each gene expression effect might be nonlinear and the genes within the same pathway are likely to interact with each other in a complicated way. This semiparametric model also makes it possible to test for the overall genetic pathway effect. We show that the LSKM semiparametric regression can be formulated using a linear mixed model. Estimation and inference hence can proceed within the linear mixed model framework using standard mixed model software. Both the regression coefficients of the covariate effects and the LSKM estimator of the genetic pathway effect can be obtained using the best linear unbiased predictor in the corresponding linear mixed model formulation. The smoothing parameter and the kernel parameter can be estimated as variance components using restricted maximum likelihood. A score test is developed to test for the genetic pathway effect. Model/variable selection within the LSKM framework is discussed. The methods are illustrated using a prostate cancer data set and evaluated using simulations.
BLUPs; Kernel function; Model/variable selection; Nonparametric regression; Penalized likelihood; REML; Score test; Smoothing parameter; Support vector machines
Numerous nonparametric approaches have been proposed in literature to detect differential gene expression in the setting of two user-defined groups. However, there is a lack of nonparametric procedures to analyze microarray data with multiple factors attributing to the gene expression. Furthermore, incorporating interaction effects in the analysis of microarray data has long been of great interest to biological scientists, little of which has been investigated in the nonparametric framework.
In this paper, we propose a set of nonparametric tests to detect treatment effects, clinical covariate effects, and interaction effects for multifactorial microarray data. When the distribution of expression data is skewed or heavy-tailed, the rank tests are substantially more powerful than the competing parametric F tests. On the other hand, in the case of light or medium-tailed distributions, the rank tests appear to be marginally less powerful than the parametric competitors.
The proposed rank tests enable us to detect differential gene expression and establish interaction effects for microarray data with various non-normally distributed expression measurements across genome. In the presence of outliers, they are advantageous alternative approaches to the existing parametric F tests due to the robustness feature.
Knowledge of simulated genetic effects facilitates interpretation of methodological studies. Genetic interactions for common disorders are likely numerous and weak. Using the 200 replicates of the Genetic Analysis Workshop 16 (GAW16) Problem 3 simulated data, we compared the statistical power to detect weak gene-gene interactions using a haplotype-based test in the UNPHASED software with genotypic mixed model (GMM) and additive mixed model (AMM) mixed linear regression model in SAS. We assumed a candidate-gene approach where a single-nucleotide polymorphism (SNP) in one gene is fixed and multiple SNPs are at the second gene. We analyzed the quantitative low-density lipoprotein trait (heritability 0.7%), modulated by simulated interaction of rs4648068 from 4q24 and another gene on 8p22, where we analyzed seven SNPs. We generally observed low power calculated per SNP (≤ 37% at the 0.05 level), with the haplotype-based test being inferior. Over all tests, the haplotype-based test performed within chance, while GMM and AMM had low power (~10%). The haplotype-based and mixed models detected signals at different SNPs. The haplotype-based test detected a signal in 50 unique replicates; GMM and AMM featured both shared and distinct SNPs and replicates (65 replicates shared, 41 GMM, 27 AMM). Overall, the statistical signal for the weak gene-gene interaction appears sensitive to the sample structure of the replicates. We conclude that using more than one statistical approach may increase power to detect such signals in studies with limited number of loci such as replications. There were no results significant at the conservative 10-7 genome-wide level.
Complex diseases are multifactorial in nature and can involve multiple loci with gene × gene and gene × environment interactions. Research on methods to uncover the interactions between those genes that confer susceptibility to disease has been extensive, but many of these methods have only been developed for sibling pairs or sibships. In this report, we assess the performance of two methods for finding gene × gene interactions that are applicable to arbitrarily sized pedigrees, one based on correlation in per-family nonparametric linkage scores and another that incorporates candidate loci genotypes as covariates into an affected relative pair linkage analysis. The power and type I error rate of both of these methods was addressed using the simulated Genetic Analysis Workshop 14 data. In general, we found detection of the interacting loci to be a difficult problem, and though we experienced some modest success there is a clear need to continue developing new methods and approaches to the problem.
Currently, most methods for detecting gene-gene interaction (GGI) in genomewide association studies (GWASs) are limited in their use of single nucleotide polymorphism (SNP) as the unit of association. One way to address this drawback is to consider higher level units such as genes or regions in the analysis. Earlier we proposed a statistic based on canonical correlations (CCU) as a gene-based method for detecting gene-gene co-association. However, it can only capture linear relationship and not nonlinear correlation between genes. We therefore proposed a counterpart (KCCU) based on kernel canonical correlation analysis (KCCA).
Through simulation the KCCU statistic was shown to be a valid test and more powerful than CCU statistic with respect to sample size and interaction odds ratio. Analysis of data from regions involving three genes on rheumatoid arthritis (RA) from Genetic Analysis Workshop 16 (GAW16) indicated that only KCCU statistic was able to identify interactions reported earlier.
KCCU statistic is a valid and powerful gene-based method for detecting gene-gene co-association.
Genome-wide association study (GWAS); Gene-gene co-association; Gene-gene interaction (GGI); Kernel canonical correlation analysis (KCCA)
In genetic association study, especially in GWAS, gene- or region-based methods have been more popular to detect the association between multiple SNPs and diseases (or traits). Kernel principal component analysis combined with logistic regression test (KPCA-LRT) has been successfully used in classifying gene expression data. Nevertheless, the purpose of association study is to detect the correlation between genetic variations and disease rather than to classify the sample, and the genomic data is categorical rather than numerical. Recently, although the kernel-based logistic regression model in association study has been proposed by projecting the nonlinear original SNPs data into a linear feature space, it is still impacted by multicolinearity between the projections, which may lead to loss of power. We, therefore, proposed a KPCA-LRT model to avoid the multicolinearity.
Simulation results showed that KPCA-LRT was always more powerful than principal component analysis combined with logistic regression test (PCA-LRT) at different sample sizes, different significant levels and different relative risks, especially at the genewide level (1E-5) and lower relative risks (RR = 1.2, 1.3). Application to the four gene regions of rheumatoid arthritis (RA) data from Genetic Analysis Workshop16 (GAW16) indicated that KPCA-LRT had better performance than single-locus test and PCA-LRT.
KPCA-LRT is a valid and powerful gene- or region-based method for the analysis of GWAS data set, especially under lower relative risks and lower significant levels.
Parametric linkage methods for quantitative trait locus mapping require explicit specification of the probability model of the quantitative trait and hence can lead to misleading linkage inferences when the model assumptions are not valid. Ghosh and Majumder developed a nonparametric regression method based on kernel-smoothing for linkage mapping of quantitative trait locus using squared differences in trait values of independent sib pairs, which is relatively more robust than parametric methods with respect to violations in distributional assumptions. In this study, we modify the above mentioned nonparametric regression method by considering local linear polynomials instead of the Nadaraya-Watson estimator and squared sums of sib-pair trait values in addition to squared differences to perform a genome-wide scan of rheumatoid factor-IgM levels on sib pairs in the Genetic Analysis Workshop 15 simulated data set. We obtain significant evidence of linkage very close to the quantitative trait locus controlling for RF-IgM. We find that the simultaneous use of squared differences and squared sums increases the power to detect linkage compared to using only squared differences. However, because of all the sib pairs are selected for rheumatoid arthritis, there is reduced variance of RF-IgM values, and empirical power to detect linkage is not very high. We also compare the performance of our method with two linear regression approaches: the classical Haseman-Elston method using squared sib-pair trait differences and its extension proposed by Elston et al. using mean-corrected sib-pair cross-products. We find that the proposed nonparametric method yields more power than the linear regression approaches.
Detection of discriminating patterns in gene expression data can be accomplished by using various methods of statistical learning. It has been proposed that sample pooling in this context would have negative effects; however, pooling cannot always be avoided. We propose a simulation framework to explicitly investigate the parameters of patterns, experimental design, noise, and choice of method in order to find out which effects on classification performance are to be expected. We use a two-group classification task and simulated gene expression data with independent differentially expressed genes as well as bivariate linear patterns and the combination of both. Our results show a clear increase of prediction error with pool size. For pooled training sets powered partial least squares discriminant analysis outperforms discriminance analysis, random forests, and support vector machines with linear or radial kernel for two of three simulated scenarios. The proposed simulation approach can be implemented to systematically investigate a number of additional scenarios of practical interest.
Testing multiple markers simultaneously not only can capture the linkage disequilibrium patterns but also can decrease the number of tests and thus alleviate the multiple-testing penalty. If a gene is associated with a phenotype, subjects with similar genotypes in this gene should also have similar phenotypes. Based on this concept, we have developed a general framework that is applicable to continuous traits. Two similarity-based tests (namely, SIMc and SIMp tests) were derived as special cases of the general framework. In our simulation study, we compared the power of the two tests with that of the single-marker analysis, a standard haplotype regression, and a popular and powerful kernel machine regression. Our SIMc test outperforms other tests when the average r-square (a measure of linkage disequilibrium) between the causal variant and the surrounding markers is larger than 0.3 or when the causal allele is common (say, frequency = 0.3). Our SIMp test outperforms other tests when the causal variant was introduced at common haplotypes (the maximum frequency of risk haplotypes > 0.4). We also applied our two tests to an adiposity data set to show their utility.
Haplotype; Similarity; Genomic distance; Linkage disequilibrium; Multi-marker test; Body-mass index; CPE gene
Many statistical methods for microarray data analysis consider one gene at a time, and they may miss subtle changes at the single gene level. This limitation may be overcome by considering a set of genes simultaneously where the gene sets are derived from prior biological knowledge. Limited work has been carried out in the regression setting to study the effects of clinical covariates and expression levels of genes in a pathway either on a continuous or on a binary clinical outcome. Hence, we propose a Bayesian approach for identifying pathways related to both types of outcomes. We compare our Bayesian approaches with a likelihood-based approach that was developed by relating a least squares kernel machine for nonparametric pathway effect with a restricted maximum likelihood for variance components. Unlike the likelihood-based approach, the Bayesian approach allows us to directly estimate all parameters and pathway effects. It can incorporate prior knowledge into Bayesian hierarchical model formulation and makes inference by using the posterior samples without asymptotic theory. We consider several kernels (Gaussian, polynomial, and neural network kernels) to characterize gene expression effects in a pathway on clinical outcomes. Our simulation results suggest that the Bayesian approach has more accurate coverage probability than the likelihood-based approach, and this is especially so when the sample size is small compared with the number of genes being studied in a pathway. We demonstrate the usefulness of our approaches through its applications to a type II diabetes mellitus data set. Our approaches can also be applied to other settings where a large number of strongly correlated predictors are present.
Gaussian random process; kernel machine; pathway
Accounting for interactions with environmental factors in association studies may improve the power to detect genetic effects and may help identifying important environmental effect modifiers. The power of unphased genotype-versus haplotype-based methods in regions with high linkage disequilibrium (LD), as measured by D', for analyzing gene × environment (gene × sex) interactions was compared using the Genetic Analysis Workshop 15 (GAW15) simulated data on rheumatoid arthritis with prior knowledge of the answers. Stepwise and regular conditional logistic regression (CLR) was performed using a matched case-control sample for a HLA region interacting with sex. Haplotype-based analyses were performed using a haplotype-sharing-based Mantel statistic and a test for haplotype-trait association in a general linear model framework. A step-down minP algorithm was applied to derive adjusted p-values and to allow for power comparisons. These methods were also applied to the GAW15 real data set for PTPN22.
For markers in strong LD, stepwise CLR performed poorly because of the correlation/collinearity between the predictors in the model. The power was high for detecting genetic main effects using simple CLR models and haplotype-based methods and for detecting joint effects using CLR and Mantel statistics. Only the haplotype-trait association test had high power to detect the gene × sex interaction.
In the PTPN22 region with markers characterized by strong LD, all methods indicated a significant genotype × sex interaction in a sample of about 1000 subjects. The previously reported R620W single-nucleotide polymorphism was identified using logistic regression, but the haplotype-based methods did not provide any precise location information.
GWAS has facilitated greatly the discovery of risk SNPs associated with complex diseases. Traditional methods analyze SNP individually and are limited by low power and reproducibility since correction for multiple comparisons is necessary. Several methods have been proposed based on grouping SNPs into SNP sets using biological knowledge and/or genomic features. In this article, we compare the linear kernel machine based test (LKM) and principal components analysis based approach (PCA) using simulated datasets under the scenarios of 0 to 3 causal SNPs, as well as simple and complex linkage disequilibrium (LD) structures of the simulated regions. Our simulation study demonstrates that both LKM and PCA can control the type I error at the significance level of 0.05. If the causal SNP is in strong LD with the genotyped SNPs, both the PCA with a small number of principal components (PCs) and the LKM with kernel of linear or identical-by-state function are valid tests. However, if the LD structure is complex, such as several LD blocks in the SNP set, or when the causal SNP is not in the LD block in which most of the genotyped SNPs reside, more PCs should be included to capture the information of the causal SNP. Simulation studies also demonstrate the ability of LKM and PCA to combine information from multiple causal SNPs and to provide increased power over individual SNP analysis. We also apply LKM and PCA to analyze two SNP sets extracted from an actual GWAS dataset on non-small cell lung cancer.
Identifying gene-environment (G × E) interactions has become a crucial issue in the past decades. Different methods have been proposed to test for G × E interactions in the framework of linkage or association testing. However, their respective performances have rarely been compared. Using Genetic Analysis Workshop 15 simulated data, we compared the power of four methods: one based on affected sib pairs that tests for linkage and interaction (the mean interaction test) and three methods that test for association and/or interaction: a case-control test, a case-only test, and a log-linear approach based on case-parent trios. Results show that for the particular model of interaction between tobacco use and Locus B simulated here, the mean interaction test has poor power to detect either the genetic effect or the interaction. The association studies, i.e., the log-linear-modeling approach and the case-control method, are more powerful to detect the genetic effect (power of 78% and 95%, respectively) and taking into account interaction moderately increases the power (increase of 9% and 3%, respectively). The case-only design exhibits a 95% power to detect G × E interaction but the type I error rate is increased.
Genome scan meta-analysis (GSMA) can prove very useful in detecting genetic effects too small to be detected in an individual linkage study and can also lead to more consistent results. In this paper, we propose a new kernel-based estimation procedure for GSMA. Instead of estimating identity by descent between markers, as performed in interval mapping approaches, we estimated directly the nonparametric linkage score between markers using a kernel procedure. The GSMA is then extended to take into account the kernel estimate of the nonparametric linkage score and its variance at a given chromosomal position. The method is applied to the rheumatoid arthritis genome scan data (Genetic Analysis Workshop 15 Problem 2).
To detect genetic association with common and complex diseases, two powerful yet quite different multi-marker association tests have been proposed, genomic distance-based regression (GDBR) (Wessel and Schork 2006, AJHG 79:821-833) and kernel-machine regression (KMR) (Kwee et al 2008, AJHG 82:386-397; Wu et al 2010, AJHG 86:929-942). GDBR is based on relating a multi-marker similarity metric for a group of subjects to variation in their trait values, while KMR is based on nonparametric estimates of the effects of the multiple markers on the trait through a kernel function or kernel matrix. Since the two approaches are both powerful and general, but appear quite different, it is important to know their specific relationships. In this report, we show that, under the condition that there is no other covariate, there is a striking correspondence between the two approaches for a quantitative or a binary trait: if the same positive semi-definite matrix is used as the centered similarity matrix in GDBR and as the kernel matrix in KMR, the F-test statistic in GDBR and the score test statistic in KMR are equal (up to some ignorable constants). The result is based on the connections of both methods to linear or logistic (random-effects) regression models.
F-test; Genome-wide association study; GWAS; multi-marker analysis; score test; SNP; SSU test
Compared to model-based approaches, nonparametric methods for quantitative trait loci mapping are more robust to deviations in distributional assumptions. In this study, we modify a nonparametric regression method and the "contrast function"- based regression method to analyze total cholesterol level in the younger cohort (the offspring generation) of the Genetic Analysis Workshop 13 simulated data set.
We obtained significant evidence of linkage near four of the six non-sex-specific genes in at least 30% of the replicates.
The proposed nonparametric method seems to be a powerful robust alternative to distribution-based methods.
A major goal of genetic association studies concerned with single nucleotide polymorphisms (SNPs) is the detection of SNPs exhibiting an impact on the risk of developing a disease. Typically, this problem is approached by testing each of the SNPs individually. This, however, can lead to an inaccurate measurement of the influence of the SNPs on the disease risk, in particular, if SNPs only show an effect when interacting with other SNPs, as the multivariate structure of the data is ignored. In this article, we propose a testing procedure based on logic regression that takes this structure into account and therefore enables a more appropriate quantification of importance and ranking of the SNPs than marginal testing. Since even SNP interactions often exhibit only a moderate effect on the disease risk, it can be helpful to also consider sets of SNPs (e.g. SNPs belonging to the same gene or pathway) to borrow strength across these SNP sets and to identify those genes or pathways comprising SNPs that are most consistently associated with the response. We show how the proposed procedure can be adapted for testing SNP sets, and how it can be applied to blocks of SNPs in linkage disequilibrium (LD) to overcome problems caused by LD.
Feature selection; GENICA; Importance measure; logicFS; Logic regression
Motivation: Association tests based on next-generation sequencing data are often under-powered due to the presence of rare variants and large amount of neutral or protective variants. A successful strategy is to aggregate genetic information within meaningful single-nucleotide polymorphism (SNP) sets, e.g. genes or pathways, and test association on SNP sets. Many existing methods for group-wise tests require specific assumptions about the direction of individual SNP effects and/or perform poorly in the presence of interactions.
Results: We propose a joint association test strategy based on two key components: a nonlinear supervised dimension reduction approach for effective SNP information aggregation and a novel kernel specially designed for qualitative genotype data. The new test demonstrates superior performance in identifying causal genes over existing methods across a large variety of disease models simulated from sequence data of real genes. In general, the proposed method provides an association test strategy that can (i) detect both rare and common causal variants, (ii) deal with both additive and interaction effect, (iii) handle both quantitative traits and disease dichotomies and (iv) incorporate non-genetic covariates. In addition, the new kernel can potentially boost the power of the entire family of kernel-based methods for genetic data analysis.
Availability: The method is implemented in MATLAB. Source code is available upon request.
Recently we proposed a novel two-step approach to test for pathway effects in disease progression. The goal of this approach is to study the joint effect of multiple single-nucleotide polymorphisms that belong to certain genes. By using random effects, our approach acknowledges the correlations within and between genes when testing for pathway effects. Gene-gene and gene-environment interactions can be included in the model. The method can be implemented with standard software, and the distribution of the test statistics under the null hypothesis can be approximated by using standard chi-square distributions. Hence no extensive permutations are needed for computations of the p-value. In this paper we adapt and apply the method to family data, and we study its performance for sequence data from Genetic Analysis Workshop 17. For the set of unrelated subjects, the performance of the new test was disappointing. We found a power of 6% for the binary outcome and of 18% for the quantitative trait Q1. For family data the new approach appears to perform well, especially for the quantitative outcome. We found a power of 39% for the binary outcome and a power of 89% for the quantitative trait Q1.
Although prospective logistic regression is the standard method of analysis for case-control data, it has been recently noted that in genetic epidemiologic studies one can use the “retrospective” likelihood to gain major power by incorporating various population genetics model assumptions such as Hardy-Weinberg-Equilibrium (HWE), gene-gene and gene-environment independence. In this article, we review these modern methods and contrast them with the more classical approaches through two types of applications (i) association tests for typed and untyped single nucleotide polymorphisms (SNPs) and (ii) estimation of haplotype effects and haplotype-environment interactions in the presence of haplotype-phase ambiguity. We provide novel insights to existing methods by construction of various score-tests and pseudo-likelihoods. In addition, we describe a novel two-stage method for analysis of untyped SNPs that can use any flexible external algorithm for genotype imputation followed by a powerful association test based on the retrospective likelihood. We illustrate applications of the methods using simulated and real data.
Case-control studies; Empirical-Bayes; Genetic epidemiology; Haplotypes; Model averaging; Model robustness; Model selection; Retrospective studies; Shrinkage