Haplotypes, the combinations of alleles at multiple loci along individual homologous chromosomes, define the functional units of a gene through which the underlying protein product is made (Clark 2004). Association studies based on haplotypes, which can capture interloci interactions as well as “indirect association” due to linkage disequilibrium (LD) with unobserved causal variants, can be a powerful approach to the discovery and characterization of the genetic basis of complex diseases (Schaid 2004). Thus, in recent years, there has been tremendous interest in developing methods for haplotype-based regression analysis of genetic epidemiologic data. A technical problem has been that traditional epidemiologic studies collect only locus-specific genotype data, which do not provide the “phase information,” that is, which alleles appear together at multiple loci along each individual chromosome. Statistically, the lack of phase information can be viewed as a special missing-data problem.
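To make the phase ambiguity concrete, the following minimal sketch enumerates the haplotype pairs that are consistent with a single unphased genotype. It assumes biallelic loci coded as minor-allele counts (0, 1, or 2); the function name `compatible_haplotype_pairs` and the 0/1 allele coding are illustrative choices, not notation from this article.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: sequence of minor-allele counts (0, 1, or 2), one per biallelic
    locus (an assumed coding). Haplotypes are returned as tuples of 0/1 alleles.
    """
    # For each locus, list the ordered allele assignments to the two chromosomes.
    options = []
    for count in genotype:
        if count == 0:
            options.append([(0, 0)])
        elif count == 2:
            options.append([(1, 1)])
        elif count == 1:
            # Heterozygous locus: the minor allele may sit on either chromosome.
            options.append([(0, 1), (1, 0)])
        else:
            raise ValueError("genotype entries must be 0, 1, or 2")

    pairs = set()
    for combo in product(*options):
        first = tuple(a for a, _ in combo)
        second = tuple(b for _, b in combo)
        # Sort so that (h1, h2) and (h2, h1) count as the same diplotype.
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)


if __name__ == "__main__":
    # A double heterozygote is compatible with two distinct haplotype pairs.
    for pair in compatible_haplotype_pairs((1, 1)):
        print(pair)
```

A genotype with at most one heterozygous locus has a single compatible pair, so its phase is known; with k ≥ 2 heterozygous loci there are 2^(k-1) compatible pairs, and this is the missing information that the regression methods must handle.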
For logistic regression analysis of unmatched case-control studies, two classes of methods have evolved. The “prospective” methods (Schaid, Rowland, Tines, Jacobson, and Poland 2002; Zhao, Li, and Khalid 2003; Lake et al. 2003) ignore the underlying retrospective nature of the case-control design. These methods are considered robust in the sense that they depend very weakly on the underlying assumptions of Hardy-Weinberg equilibrium (HWE) and gene-environment (G-E) independence, although the assumptions cannot be totally avoided because of the phase ambiguity problem. In contrast, “retrospective” methods (Epstein and Satten 2003; Stram et al. 2003; Satten and Epstein 2004; Spinka, Carroll, and Chatterjee 2005; Lin and Zeng 2006), which properly account for case-control sampling, can fully exploit the assumptions of HWE and G-E independence to gain major efficiency over the prospective methods. It is often debatable which of the two types of methods is more suitable for a particular study. Prospective estimates of haplotype effects and haplotype-environment interactions involving relatively rare haplotypes tend to be very imprecise. Retrospective methods can produce much more precise estimates of those parameters, but concern often remains about the potential for bias if the underlying assumptions are violated, a potential we observe in our simulations.
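As a reminder of what the HWE assumption supplies, the sketch below computes diplotype (unordered haplotype-pair) probabilities from haplotype frequencies under HWE, where the two haplotypes of an individual are treated as independent draws from the population. The haplotype labels and frequencies are made-up illustrative values, and `hwe_diplotype_probs` is a hypothetical helper name.

```python
def hwe_diplotype_probs(hap_freqs):
    """Return probabilities of unordered haplotype pairs under HWE.

    hap_freqs: dict mapping haplotype label -> population frequency
    (assumed to sum to 1). Under HWE a homozygous pair (h, h) has
    probability p_h**2 and a heterozygous pair (h, k) has 2 * p_h * p_k.
    """
    haps = sorted(hap_freqs)
    probs = {}
    for i, h1 in enumerate(haps):
        for h2 in haps[i:]:
            joint = hap_freqs[h1] * hap_freqs[h2]
            probs[(h1, h2)] = joint if h1 == h2 else 2.0 * joint
    return probs


if __name__ == "__main__":
    # Illustrative frequencies only; not taken from any data set.
    freqs = {"AB": 0.5, "Ab": 0.3, "aB": 0.2}
    for pair, prob in hwe_diplotype_probs(freqs).items():
        print(pair, round(prob, 4))
```

Under this assumption the diplotype distribution is fully determined by the haplotype frequencies, which is one source of the efficiency that retrospective methods gain and also of their sensitivity when HWE fails.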
The potential for bias in retrospective methods can be reduced by flexible modeling approaches that relax the underlying assumptions. Alternative population genetic models that relax the HWE assumption have been used for retrospective haplotype analysis of case-control data (Satten and Epstein 2004; Lin and Zeng 2006). It has also been shown that the assumption of G-E independence can be relaxed to a large extent by assuming that haplotypes are independent of E given the unphased genotypes, while allowing the conditional distribution of E given the unphased genotypes to remain completely unrestricted (Lin and Zeng 2006). These solutions can alleviate the concern about bias, but they are not completely satisfactory. First, models for relaxing the HWE assumption can capture only certain types of departures from the underlying constraints for the diplotype distribution and may not be able to model phenomena such as excess heterozygosity. Second, even if a completely nonparametric model for the G-E distribution is available, we may still be able to gain efficiency in the analysis of case-control data by exploiting the fact that HWE and G-E independence often do hold, approximately if not exactly. In the existing methods, if one uses a very general model for the distribution of G-E, then the concern about bias will be minimized, but inevitably efficiency will be lost.
Our main objective is to develop methods for haplotype-based analysis of case-control studies that can gain efficiency by exploiting the model assumptions of HWE and G-E independence for the underlying population and yet are resistant to bias when those assumptions are violated. The basic idea involves shrinkage of a “model-free” estimator that is robust to departures from HWE and G-E independence toward a “model-based” estimator that directly exploits those assumptions. The amount of “shrinkage” is sample-size and data adaptive, so that in large samples the method has no bias whether or not the assumptions of HWE and G-E independence hold, and yet the method can gain efficiency by shrinking the analysis toward HWE and G-E independence, but only to the extent that the data support the assumptions.
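The shrinkage idea can be illustrated with a deliberately simple scalar sketch. This is not the estimator developed in this article: it assumes a single parameter and a hypothetical data-adaptive weight equal to the estimated variance of the model-free estimator divided by that variance plus the squared difference between the two estimators. The names `shrink_toward_model_based`, `beta_mf`, `beta_mb`, and `var_mf`, and all numerical values, are illustrative assumptions.

```python
def shrink_toward_model_based(beta_mf, beta_mb, var_mf):
    """Shrink a model-free estimate toward a model-based estimate (scalar sketch).

    beta_mf: model-free estimate (robust to assumption violations)
    beta_mb: model-based estimate (efficient if the assumptions hold)
    var_mf:  estimated variance of beta_mf
    Returns (shrunken estimate, weight placed on the model-based estimate).
    """
    if var_mf < 0:
        raise ValueError("var_mf must be non-negative")
    diff = beta_mf - beta_mb
    denom = var_mf + diff * diff
    if denom == 0.0:
        # The two estimates coincide and carry no uncertainty: nothing to shrink.
        return beta_mb, 1.0
    # Assumed weight: close to 1 when the estimates agree relative to the noise
    # in beta_mf, close to 0 when they clearly disagree.
    weight = var_mf / denom
    return beta_mf + weight * (beta_mb - beta_mf), weight


if __name__ == "__main__":
    # Estimates agree up to sampling noise: most weight goes to the model-based one.
    print(shrink_toward_model_based(0.52, 0.50, 0.04))
    # Estimates disagree strongly relative to the variance: stay near model-free.
    print(shrink_toward_model_based(1.00, 0.20, 0.01))
```

In this sketch, as the sample size grows the variance of the model-free estimator shrinks toward zero, so any persistent disagreement between the two estimators drives the weight toward zero and the result toward the model-free estimate. This mirrors, in a simplified form, the large-sample robustness described above.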
There are several novel aspects of our proposal. First, in Section 2.2, we propose a novel retrospective likelihood approach to haplotype analysis of case-control data that is robust to the nature of the gene-environment distribution in the underlying population. Second, in Section 3, we develop an empirical Bayes (EB)-type shrinkage estimation approach and a rigorous asymptotic theory for it. The key difficulty is that the problem is semiparametric, in that there are infinite-dimensional nuisance parameters associated with the joint distribution of the gene and the environment. Our method overcomes this difficulty by focusing only on the parameters of interest. Third, in Section 4, we develop a penalized likelihood approach and asymptotic theory for it. The penalized likelihood involves shrinkage, not of a parameter or set of parameters toward zero as is usually done, but toward a model-based estimator, and it also overcomes the problem of infinite-dimensional nuisance parameters. Effectively, we shrink the difference between the model-free and model-based estimators toward zero. In Sections 5 and 6, we use simulation studies and two real data examples to illustrate that, unlike existing haplotype-based regression methods, whose utility depends crucially on specific model assumptions, the proposed shrinkage methods adapt themselves to a wide range of situations.
Finally, although the scientific focus of this article is haplotype-based case-control studies, the article makes a far more general contribution. Using modern shrinkage and penalization techniques to combine assumption-laden and assumption-free methods in semiparametric problems with infinite-dimensional nuisance parameters is an idea that transcends genetic association studies. We hope that our article will lead to further research in this more general area.