For haplotype-based studies, the underlying genetic covariate for a subject is defined by “diplotypes,” that is, the 2 haplotypes the individual carries in his/her pair of homologous chromosomes, where each haplotype is the combination of alleles at the loci of interest along an individual chromosome. Following the notation developed in Spinka and others (2005), let the diplotype status for a subject be $H^{di} = (H_1, H_2)$, where $H_1$ and $H_2$ denote the constituent haplotypes. We assume that there are $J$ possible haplotypes indexed by $h_j$ for $j = 1, \ldots, J$. The diplotypes are then indexed by $h^{di}_{j_1 j_2} = (h_{j_1}, h_{j_2})$, $j_1 = 1, \ldots, J$, $j_2 = 1, \ldots, J$. The diplotype data, however, are not directly observable. Instead, for each subject, the multilocus genotype data $G$ are observed, which contain information on the pair of alleles the individual carries at each individual locus but do not provide the phase information, that is, which combination of alleles appears along each of the individual chromosomes. Thus, the same genotype data $G$ could be consistent with multiple diplotypes. We denote by $\mathcal{H}(G)$ the set of all possible diplotypes that are consistent with the genotype data $G$.
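The phase ambiguity described above can be made concrete with a small enumeration. The following is a minimal sketch, assuming biallelic loci with alleles coded 0/1 and a genotype recorded as the count of allele 1 at each locus; the function name `consistent_diplotypes` and this coding are illustrative choices, not notation from the paper.

```python
from itertools import product


def consistent_diplotypes(genotype):
    """Return the set of unordered haplotype pairs compatible with an unphased genotype.

    genotype: tuple of allele-1 counts (0, 1 or 2), one entry per biallelic locus.
    Each haplotype is a tuple of 0/1 alleles; each diplotype is a sorted pair of haplotypes.
    """
    # For every locus, list the possible (chromosome 1, chromosome 2) allele assignments.
    per_locus = []
    for g in genotype:
        if g == 0:
            per_locus.append([(0, 0)])
        elif g == 2:
            per_locus.append([(1, 1)])
        elif g == 1:
            per_locus.append([(0, 1), (1, 0)])  # heterozygous locus: phase unknown
        else:
            raise ValueError("genotype entries must be 0, 1 or 2")

    diplotypes = set()
    for combo in product(*per_locus):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        diplotypes.add(tuple(sorted((h1, h2))))  # store as an unordered pair
    return diplotypes


if __name__ == "__main__":
    # Heterozygous at both loci: two distinct phasings are possible.
    for d in sorted(consistent_diplotypes((1, 1))):
        print(d)
```

Under this coding, a subject who is heterozygous at two loci is compatible with two diplotypes, while a subject with at most one heterozygous locus is compatible with exactly one, so only the former contributes phase uncertainty.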
Given the diplotype data $H^{di}$ and a set of environmental covariates $X$, we assume that the risk of the disease is given by the logistic regression model
$$\mathrm{logit}\{\mathrm{pr}(D = 1 \mid H^{di}, X)\} = \beta_0 + m(H^{di}, X, \beta_1) \tag{2.1}$$
for some known function $m(\cdot, \beta_1)$, where $D$ denotes the binary disease indicator. Often one further imposes structural assumptions on the odds ratio parameters $\beta_1$ by modeling the effect of the diplotypes through constituent haplotypes according to a “dominant,” “additive,” or “recessive” mode of effect (Wallenstein and others, 1998). For example, a logistic regression model which assumes an additive effect for each copy of a haplotype corresponds to
$$\mathrm{logit}[\mathrm{pr}\{D = 1 \mid H^{di} = (h_{j_1}, h_{j_2}), X\}] = \beta_0 + \beta_X X + \sum_{k=1}^{2} \beta_{h_{j_k}} + \sum_{k=1}^{2} \beta_{h_{j_k} X} X \tag{2.2}$$
where $\beta_X$ is the main effect of $X$, $\beta_{h_{j_k}}$ is the main effect of haplotype $h_{j_k}$, $k = 1, 2$, and $\beta_{h_{j_k} X}$ is the interaction effect of $X$ with haplotype $h_{j_k}$, $k = 1, 2$. Such modeling may be necessary due to identifiability considerations (Epstein and Satten, 2003) and is desirable when the effects of the haplotypes themselves are of direct scientific interest.
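To illustrate how an additive mode of effect builds the diplotype risk from haplotype-level parameters, here is a minimal sketch. It assumes a scalar exposure, haplotypes indexed by integers, and a reference haplotype whose effects are set to zero; the function name `additive_risk` and the list-based parameter layout are hypothetical.

```python
import math


def additive_risk(j1, j2, x, beta0, beta_x, beta_h, beta_hx):
    """pr(D = 1 | diplotype (h_j1, h_j2), X = x) under an additive haplotype model.

    beta_h[j]  : main effect of one copy of haplotype j (log odds ratio)
    beta_hx[j] : interaction of one copy of haplotype j with the exposure x
    A reference haplotype would carry zeros in both lists.
    """
    linear = beta0 + beta_x * x
    for j in (j1, j2):  # one contribution per haplotype copy carried
        linear += beta_h[j] + beta_hx[j] * x
    return 1.0 / (1.0 + math.exp(-linear))


if __name__ == "__main__":
    beta_h = [0.0, 0.7]    # haplotype 0 is the reference
    beta_hx = [0.0, 0.2]
    for pair in [(0, 0), (0, 1), (1, 1)]:
        print(pair, additive_risk(*pair, x=1.0, beta0=-2.0, beta_x=0.3,
                                  beta_h=beta_h, beta_hx=beta_hx))
```

Because each copy contributes the same increment on the logit scale, a homozygous carrier receives exactly twice the log odds shift of a heterozygous carrier. A dominant or recessive model would replace the loop with an indicator of carrying at least one copy or two copies, respectively.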
Unlike Spinka and others (2005), who assumed independence of $H^{di}$ and $X$, we assume a general polytomous logistic regression for the conditional distribution of $H^{di}$ given $X$:
$$\log\frac{\mathrm{pr}(H^{di} = h^{di}_{j_1 j_2} \mid X)}{\mathrm{pr}(H^{di} = h^{di}_{0} \mid X)} = \gamma_{0 j_1 j_2} + \gamma_{1 j_1 j_2}^{T} X \tag{2.3}$$
where $h^{di}_{0}$ is a chosen reference diplotype. Observe that model (2.3) allows association between $H^{di}$ and $X$ through the regression parameters $\gamma_{1 j_1 j_2}$. Let $\gamma_0$ and $\gamma_1$ denote the vectorized forms for the parameters $\gamma_{0 j_1 j_2}$ and $\gamma_{1 j_1 j_2}$. Let $q_{hap}(h^{di} \mid x, \gamma_0, \gamma_1)$ denote $\mathrm{pr}(H^{di} = h^{di} \mid X = x)$ as defined by model (2.3). We allow the marginal distribution of $X$, denoted by $F(x)$, to remain completely unspecified. If $H^{di}$ were directly observable, then, in principle, no further assumptions would be necessary, and one could estimate $\gamma_0$ and $\gamma_1$ together with the odds ratio parameters of the disease risk using the profile likelihood approach developed by Chatterjee and Carroll (2005). In the presence of phase ambiguity, however, the diplotypes are not directly observable, and further constraints on the parameters $\gamma_0$ and $\gamma_1$ are needed for identifiability. In the following, we show how certain natural genetic models can be used to impose these constraints.
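The conditional diplotype distribution under a polytomous logistic model can be evaluated as a softmax over diplotypes. The sketch below assumes a scalar exposure and a linear predictor of the form intercept plus slope times x, with both parameters fixed at zero for the reference diplotype; the function name `q_hap` and the dictionary layout are illustrative only.

```python
import math


def q_hap(x, gamma0, gamma1, reference):
    """Return {diplotype: pr(Hdi = diplotype | X = x)} for a polytomous logistic model.

    gamma0, gamma1: dicts keyed by diplotype holding intercepts and slopes.
    The entry for `reference` is treated as 0 regardless of the stored values.
    """
    scores = {}
    for d in gamma0:
        scores[d] = 0.0 if d == reference else gamma0[d] + gamma1[d] * x
    shift = max(scores.values())  # subtract the maximum for numerical stability
    weights = {d: math.exp(s - shift) for d, s in scores.items()}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


if __name__ == "__main__":
    gamma0 = {"ref": 0.0, "a": -1.0, "b": 0.5}
    gamma1 = {"ref": 0.0, "a": 0.8, "b": -0.4}
    print(q_hap(0.0, gamma0, gamma1, "ref"))
    print(q_hap(1.2, gamma0, gamma1, "ref"))
```

Setting every slope to zero makes the returned probabilities constant in x, which corresponds to the gene-environment independence assumption that the more general model relaxes.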
Given that genetic susceptibility may influence environmental exposures and not vice versa, for causal interpretation of parameters it is more natural to consider a model for the environmental exposures given the diplotypes. However, because the odds ratios associated with the distributions $[X \mid H]$ and $[H \mid X]$ are the same, the parameters in $\gamma_1$ can be interpreted as measures of “diplotype effects” on the distribution of exposure. Thus, it is natural to specify the $\gamma_1$ parameters according to a certain mode of effect of the underlying haplotypes. For example, assuming an additive effect for the haplotypes, one can write $\gamma_{1 j_1 j_2} = \gamma_{1, j_1} + \gamma_{1, j_2}$, which allows the diplotype effects to be determined by a reduced set of “haplotype effect” parameters $\gamma_{1, j}$; in this case, $\gamma_1$ would denote the vectorized form for the parameters $\gamma_{1, j}$. Similarly, other commonly used models, such as dominant or recessive models, could be used to impose natural constraints on the $\gamma_1$ parameters in model (2.3). We also observe that the parametric model (2.3), combined with the non-parametric distribution $F(x)$, imposes a semiparametric model on the distribution of $[X \mid H]$ with a density
$$dF(x \mid H^{di} = h^{di}) = \frac{q_{hap}(h^{di} \mid x, \gamma_0, \gamma_1)\, dF(x)}{\int q_{hap}(h^{di} \mid u, \gamma_0, \gamma_1)\, dF(u)}.$$
This class of semiparametric models includes the parametric submodel where $X \mid H^{di} = h^{di}$ follows a multivariate normal distribution with mean $\mu_{h^{di}}$ and common variance–covariance matrix $\Sigma$. In this case, it is easy to see that $\gamma_{1 j_1 j_2} = \Sigma^{-1}(\mu_{h^{di}_{j_1 j_2}} - \mu_{h^{di}_{0}})$, with $h^{di}_{0}$ the reference diplotype, which is a measure of the shift in the mean of the distribution of $X$ due to differences in the diplotypes.
The parameter $\gamma_0$ in model (2.3) defines the population diplotype frequencies for a baseline value of the exposure $X$. It is common to use population genetics models, such as Hardy-Weinberg equilibrium (HWE), to specify a relationship between diplotype and haplotype frequencies. However, observe that if the diplotypes can influence certain environmental exposures, then the frequencies of the diplotypes within exposure categories may not follow the HWE constraints although the underlying population, as a whole, may be in HWE. Thus, we instead assume that the population-level marginal haplotype-pair distribution follows HWE and is characterized by the parameters $\theta = (\theta_2, \ldots, \theta_J)$ so that the haplotype frequencies satisfy
$$\mathrm{pr}(h_j) = \frac{\exp(\theta_j)}{\sum_{k=1}^{J} \exp(\theta_k)}, \quad j = 1, \ldots, J,$$
where $h_1$ denotes the chosen reference haplotype and $\theta_1 = 0$. Let $Q(h^{di}; \theta)$ be the marginal frequency for the diplotype $h^{di}$. In the proposed model, $\gamma_0$ is then defined as an implicit function of $\gamma_1$, $\theta$, and $F(x)$ through the relationship
$$\int q_{hap}(h^{di} \mid x, \gamma_0, \gamma_1)\, dF(x) = Q(h^{di}; \theta).$$
Note that $F$ is left unspecified, and hence the proposed model is semiparametric.
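The implicit definition of the intercepts can be illustrated numerically. The sketch below makes several assumptions that are not from the paper: the exposure is scalar, the unspecified distribution of the exposure is replaced by the empirical distribution of a small sample `xs`, diplotypes are unordered pairs of haplotype indices, slopes follow the additive haplotype form, and the intercepts are found by a simple iterative scaling update. All function names are hypothetical, and this is not the estimation procedure of the paper.

```python
import math


def hwe_diplotype_freqs(theta):
    """Unordered diplotype frequencies under HWE; theta[j] are haplotype log odds, theta[0] = 0."""
    total = sum(math.exp(t) for t in theta)
    pi = [math.exp(t) / total for t in theta]
    n = len(pi)
    return {(a, b): (1 if a == b else 2) * pi[a] * pi[b]
            for a in range(n) for b in range(a, n)}


def conditional_probs(x, gamma0, gamma1):
    """Softmax over diplotypes with linear predictor gamma0[d] + gamma1[d] * x."""
    scores = {d: gamma0[d] + gamma1[d] * x for d in gamma0}
    shift = max(scores.values())
    weights = {d: math.exp(s - shift) for d, s in scores.items()}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


def marginal_probs(gamma0, gamma1, xs):
    """Average the conditional probabilities over the sample xs (empirical stand-in for F)."""
    marg = {d: 0.0 for d in gamma0}
    for x in xs:
        for d, p in conditional_probs(x, gamma0, gamma1).items():
            marg[d] += p / len(xs)
    return marg


def solve_gamma0(Q, gamma1, xs, reference, tol=1e-10, max_iter=10000):
    """Find intercepts so that the x-averaged conditional frequencies equal Q."""
    # Start from the solution that would hold if diplotypes were independent of x.
    gamma0 = {d: math.log(Q[d] / Q[reference]) for d in Q}
    for _ in range(max_iter):
        marg = marginal_probs(gamma0, gamma1, xs)
        if max(abs(marg[d] - Q[d]) for d in Q) < tol:
            break
        step = {d: math.log(Q[d] / marg[d]) for d in Q}
        # Subtracting the reference step keeps the reference intercept at exactly 0.
        gamma0 = {d: gamma0[d] + step[d] - step[reference] for d in Q}
    return gamma0


if __name__ == "__main__":
    Q = hwe_diplotype_freqs([0.0, -0.5, -1.0])
    hap_effect = [0.0, 0.4, -0.3]                       # additive haplotype effects on exposure
    gamma1 = {d: hap_effect[d[0]] + hap_effect[d[1]] for d in Q}
    xs = [-1.0, -0.3, 0.2, 0.9, 1.5]
    gamma0 = solve_gamma0(Q, gamma1, xs, reference=(0, 0))
    print(gamma0)
```

In this sketch the diplotype frequencies at any fixed exposure value generally deviate from HWE, yet their average over the exposure sample reproduces the HWE frequencies, which is the constraint the paragraph above describes.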