Our alternative hybrid approach starts with a random sample of affected individuals and a random sample of unaffected individuals. Cases and their parents are enrolled and genotyped but only mothers of controls and the controls themselves are genotyped. We are interested in testing for association between disease risk and the offspring and maternal genotypes at a di-allelic autosomal locus. We assume that the disease is rare and that Mendelian transmission probabilities hold for that locus in the underlying population, and hence among controls. Validly combining information from case and control families requires an assumption that any population structure is benign with respect to bias, an assumption that can be probed with the data at hand. Neither Hardy-Weinberg equilibrium nor random mating is required for validity.
Let p denote the frequency of the minor or ‘variant’ allele. Which allele is designated as the ‘variant’ has no effect on estimation or testing beyond the mathematical inversions. Let M, F and C represent the number of variant alleles (0,1,2) carried by the mother, father and child, respectively. D is an indicator variable for disease status, which is 1 for case families and 0 for control families.
Following Schaid and Sommer [1993], we define nine different mating types based on the number of variant copies carried by the mother and the father. These mating types along with their possible offspring genotypes lead to 15 possible (M,F,C) categories, and, consequently, one can imagine two 15-cell multinomial distributions of offspring and parental genotypes, one for control triads and one for case triads. In hybrid designs, typically the full (M,F,C) data are recorded for case families whereas only partial data are collected from control families; here, only (M,C) data. Initially we assume that any genotyping called for by the design is complete, but missing-data methods can be employed if some genotypes are missing [Weinberg, 1999a].
For control families, expected counts in the 15-cell multinomial can be modeled using Mendelian transmission probabilities and mating-type parameters (μmf), which are proportional to the frequencies of mother-father pairs with M=m and F=f in the source population. The expected counts for control-mother dyads (the last 7 lines of Table I) arise from the 15-cell multinomial by summing counts across the genotypes of possible fathers. The distribution of control-mother dyads has two noteworthy features. First, the (M,C) cells (0,2) and (2,0) are not possible, leaving only seven cells with non-zero expected counts. Second, and less obviously, the following relationship is a consequence of Mendelian transmission alone: when M=1, the expected count for C=1 is the sum of the expected counts for C=0 and C=2. This constraint reduces the available degrees of freedom contributed by the seven control-dyad cells from six to five.
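The collapse from triads to dyads, and the constraint for M=1, can be checked numerically. The following is a minimal sketch, not the authors' code: the function names and the mating-type weights are hypothetical, chosen only to be asymmetric in mother and father.

```python
from fractions import Fraction
from itertools import product

def transmit(copies):
    """P(parent transmits 0 or 1 variant copies), given 0, 1 or 2 copies carried."""
    return {0: (Fraction(1), Fraction(0)),
            1: (Fraction(1, 2), Fraction(1, 2)),
            2: (Fraction(0), Fraction(1))}[copies]

def mendel(m, f, c):
    """P(C=c | M=m, F=f) under Mendelian transmission."""
    pm, pf = transmit(m), transmit(f)
    return sum(pm[a] * pf[b] for a in (0, 1) for b in (0, 1) if a + b == c)

def control_dyads(mu):
    """Sum expected control-triad counts over fathers to get (M,C) dyad cells."""
    dyads = {}
    for m, f, c in product(range(3), repeat=3):
        t = mendel(m, f, c)
        if t > 0:  # only the 15 Mendelian-compatible triads contribute
            dyads[(m, c)] = dyads.get((m, c), 0) + mu[(m, f)] * t
    return dyads

# Hypothetical mating-type weights mu[(m, f)]; deliberately asymmetric.
weights = [400, 150, 20, 120, 90, 15, 10, 12, 5]
mu = {mf: Fraction(w) for mf, w in zip(product(range(3), repeat=2), weights)}

dyads = control_dyads(mu)
print(sorted(dyads))  # the seven possible (M,C) cells
print(dyads[(1, 1)] == dyads[(1, 0)] + dyads[(1, 2)])  # Mendelian constraint for M=1
```

Exact fractions are used so that the constraint can be tested with strict equality, and no symmetry of the weights is needed for it to hold.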
Table I. Expected counts of case-parent triads and control-mother dyads under mating asymmetry or mating symmetry.
For case families, expected counts in the 15-cell multinomial involve not only mating-type parameters and Mendelian probabilities but also four genetic relative risk parameters. We denote these relative risk parameters as follows: R1 (R2) is the relative risk for offspring carrying 1 (respectively, 2) copies of the variant compared to offspring carrying none; S1 (S2) is the relative risk for offspring whose mother carries 1 (respectively, 2) copies of the variant allele compared to offspring whose mother carries none. Combining the 15 cells for case-parent triads with the seven cells for control-mother dyads yields a 22-cell multinomial for the proposed design (Table I).
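As an illustration of how the 22 cells arise, the sketch below builds expected counts under a multiplicative risk model. It assumes that each case cell is the product of a normalizing constant B, the mating-type weight, the offspring and maternal relative risks, and the Mendelian probability. The function name and all numerical values are hypothetical.

```python
from itertools import product

# P(parent transmits 0 or 1 variant copies), indexed by copies carried.
TRANSMIT = {0: (1.0, 0.0), 1: (0.5, 0.5), 2: (0.0, 1.0)}

def mendel(m, f, c):
    """P(C=c | M=m, F=f) under Mendelian transmission."""
    return sum(TRANSMIT[m][a] * TRANSMIT[f][b]
               for a in (0, 1) for b in (0, 1) if a + b == c)

def expected_counts(mu, B, R, S):
    """Return (case-triad cells, control-dyad cells) as dictionaries.

    R = (1, R1, R2) and S = (1, S1, S2); B scales the case cells only
    (an assumption of this sketch).
    """
    case, ctrl = {}, {}
    for m, f, c in product(range(3), repeat=3):
        t = mendel(m, f, c)
        if t == 0:
            continue
        case[(m, f, c)] = B * mu[(m, f)] * R[c] * S[m] * t
        ctrl[(m, c)] = ctrl.get((m, c), 0.0) + mu[(m, f)] * t  # sum over fathers
    return case, ctrl

# Hypothetical parameter values for illustration only.
weights = [400, 150, 20, 120, 90, 15, 10, 12, 5]
mu = dict(zip(product(range(3), repeat=2), weights))
R1 = 1.5
case, ctrl = expected_counts(mu, B=0.8, R=(1.0, R1, 2.5), S=(1.0, 1.3, 1.8))
print(len(case), len(ctrl))  # 15 case-triad cells and 7 control-dyad cells
```

Within a single mating type the mating-type weight, B and the maternal risk cancel, so ratios of case cells depend only on the offspring risks and the Mendelian multipliers; for example, `case[(1, 1, 1)] / case[(1, 1, 0)]` equals 2·R1.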
For the case-parents design and for the hybrid design that uses parents of controls, the multinomial expected cell counts are all products of parameters and can be fitted using log-linear Poisson regression. Because the expected counts for control-mother dyads involve sums of parameters, however, the expected counts in Table I for the 22-cell multinomial are not themselves log-linear. A straightforward way to proceed with fitting either model is to regard the 22-cell multinomial as a version of the full 30-cell multinomial (15 cells each for case families and control families) that is missing genotype data for fathers of controls by design. Thus, one can use missing-data methods like the Expectation-Maximization (EM) algorithm [Dempster et al., 1977] in conjunction with a log-linear model for the 30-cell multinomial. Use of the EM algorithm also allows inclusion of any case-parent triads or control-mother dyads with missing genotypes that may arise through genotyping failure or incomplete ascertainment. For valid analysis, one must be able to assume that these genotype data are missing at random conditional on disease status and the observed genotypes. When control fathers are missing only by design, this assumption is satisfied without doubt because all of them are missing.
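Assuming a multiplicative model for risk and no bias from population structure, a log-linear model for the full 30-cell multinomial (corresponding to Table I, column 5) would be:

$$\ln E[N_{mfcd}] = \ln(\mu_{mf}) + \gamma\, I(d=1) + \beta_1 I(c=1) I(d=1) + \beta_2 I(c=2) I(d=1) + \alpha_1 I(m=1) I(d=1) + \alpha_2 I(m=2) I(d=1) + \ln(\mathrm{Off}_{mfc}) \qquad (1)$$

Here N_mfcd denotes the number of families with M=m, F=f, C=c and D=d, and I( ) is an indicator function which takes a value of 1 if the parenthetical condition is met and 0 otherwise.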
β1 and β2 denote the natural logarithms of the offspring genetic relative risks, R1 and R2, respectively; and α1 and α2 denote the natural logarithms of the maternal genetic relative risks, S1 and S2, respectively. γ corresponds to the natural logarithm of the normalizing factor B. The offset, denoted Offmfc, is the constant multiplier (1, ½ or ¼) given in Table I, column 5. In this model, the nine mating-type parameters, μmf, are common to both cases and controls and can be interpreted as proportional to the mating-type frequencies in the source population.
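The EM fitting strategy described above can be sketched in a few lines. This is an illustrative sketch, not the LEM implementation used in the paper. It assumes a log-linear form with nine shared mating-type terms, five case-only terms (γ, β1, β2, α1, α2) and the logarithm of the Mendelian multiplier as offset. The M-step is a plain iteratively reweighted least squares Poisson fit, and all names and parameter values are hypothetical. For a deterministic check, the "data" are exact expected counts generated from known parameters.

```python
import itertools
import numpy as np

TRANSMIT = {0: (1.0, 0.0), 1: (0.5, 0.5), 2: (0.0, 1.0)}

def mendel(m, f, c):
    """P(C=c | M=m, F=f) under Mendelian transmission."""
    return sum(TRANSMIT[m][a] * TRANSMIT[f][b]
               for a in (0, 1) for b in (0, 1) if a + b == c)

# The 15 possible triads, listed for controls (d=0) and then cases (d=1).
TRIADS = [t for t in itertools.product(range(3), repeat=3) if mendel(*t) > 0]
ROWS = [(m, f, c, d) for d in (0, 1) for (m, f, c) in TRIADS]

def design_row(m, f, c, d):
    x = np.zeros(14)
    x[3 * m + f] = 1.0  # ln(mu_mf), shared by cases and controls
    x[9:] = [d, d * (c == 1), d * (c == 2), d * (m == 1), d * (m == 2)]
    return x

X = np.array([design_row(*r) for r in ROWS])
OFFSET = np.log([mendel(m, f, c) for m, f, c, d in ROWS])

def fit_poisson(y, n_iter=25):
    """Poisson log-linear fit with offset via iteratively reweighted least squares."""
    mu = y + 0.5
    for _ in range(n_iter):
        z = np.log(mu) - OFFSET + (y - mu) / mu  # working response
        xtw = X.T * mu                           # Poisson weights equal the mean
        coef = np.linalg.solve(xtw @ X, xtw @ z)
        mu = np.exp(X @ coef + OFFSET)
    return coef, mu

def em_fit(case_counts, dyad_counts, n_iter=500):
    """case_counts: 15 counts in TRIADS order; dyad_counts: {(m, c): count}."""
    groups = {}
    for i, (m, f, c) in enumerate(TRIADS):  # control rows are indices 0..14
        groups.setdefault((m, c), []).append(i)
    y = np.concatenate([np.zeros(15), np.asarray(case_counts, dtype=float)])
    fitted = np.exp(OFFSET)  # start with all mating-type weights equal
    for _ in range(n_iter):
        for key, idx in groups.items():
            # E-step: split each observed dyad count across possible fathers.
            y[idx] = dyad_counts[key] * fitted[idx] / fitted[idx].sum()
        coef, fitted = fit_poisson(y)  # M-step on the pseudo-complete data
    return coef, fitted

# Hypothetical true values: nine mating-type weights, then B, R1, R2, S1, S2.
mu_true = np.array([400, 150, 20, 120, 90, 15, 10, 12, 5], dtype=float)
risk_true = np.log([0.8, 1.5, 2.5, 1.3, 1.8])
expected = np.exp(X @ np.concatenate([np.log(mu_true), risk_true]) + OFFSET)

dyads = {}
for (m, f, c), e in zip(TRIADS, expected[:15]):
    dyads[(m, c)] = dyads.get((m, c), 0.0) + e  # fathers of controls unobserved

coef, fitted = em_fit(expected[15:], dyads)
print(np.exp(coef[9:]))  # estimates of B, R1, R2, S1, S2
```

Only the seven dyad counts enter for controls, so the fit never sees the fathers' genotypes. Standard errors and test statistics are not produced by this sketch and would have to come from the observed-data likelihood.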
This model is fundamental to the analysis of data from our proposed design. Modifications of the model by omitting or including certain additional terms allow one to construct likelihood ratio tests (LRTs) of hypotheses about the genetic relative risk parameters or to test the assumptions about bias from population stratification or about mating symmetry. In addition, models addressing maternal-fetal incompatibility can be constructed by including two additional relative risk parameters [Sinsheimer et al., 2003]. All tests of assumptions that we subsequently develop under the four risk parameters in model (1) can be developed and applied without difficulty when these two additional risk parameters are present.
Likelihood ratio tests in this missing-data situation must be based on the observed-data likelihood, not the pseudo-complete-data likelihood. We have used the program LEM [van den Oord and Vermunt, 2000] in our subsequent analyses. LEM was designed to fit log-linear models with missing data via the EM algorithm. Consequently, dealing with missing fathers among controls or with other patterns of missing data does not require special programming. The program allows incorporation of the Mendelian constraints and returns valid test statistics as well as valid estimates for risk parameters and their standard errors. Examples of the LEM scripts that we used are available at http://www.niehs.nih.gov/research/atniehs/labs/bb/staff/weinberg/index.cfm#downloads.
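To make the distinction between the two likelihoods concrete, the sketch below computes a likelihood ratio statistic from the observed cells only: fitted counts for control triads are first summed over fathers, and the Poisson log-likelihood is then evaluated on the resulting dyad cells. It is a toy illustration rather than LEM output. The function names are hypothetical, the fitted values are invented, and only the M=1 control cells are shown.

```python
import math

def collapse_over_fathers(triads):
    """Sum fitted (m, f, c) control counts over f to get observed (m, c) cells."""
    dyads = {}
    for (m, f, c), e in triads.items():
        dyads[(m, c)] = dyads.get((m, c), 0.0) + e
    return dyads

def poisson_loglik(observed, fitted):
    """Poisson log-likelihood kernel: sum over observed cells of n*log(e) - e."""
    return sum(n * math.log(fitted[k]) - fitted[k] for k, n in observed.items())

def lrt_statistic(observed, fitted_null, fitted_alt):
    """Twice the difference in observed-data log-likelihoods."""
    return 2.0 * (poisson_loglik(observed, fitted_alt)
                  - poisson_loglik(observed, fitted_null))

# Hypothetical observed control dyads with M=1, and fitted triad counts
# from two hypothetical nested models (both obey Mendelian proportions).
observed = {(1, 0): 30.0, (1, 1): 50.0, (1, 2): 20.0}
alt_triads = {(1, 0, 0): 20.0, (1, 0, 1): 20.0, (1, 1, 0): 10.0,
              (1, 1, 1): 20.0, (1, 1, 2): 10.0, (1, 2, 1): 10.0, (1, 2, 2): 10.0}
null_triads = {(1, 0, 0): 15.0, (1, 0, 1): 15.0, (1, 1, 0): 12.5,
               (1, 1, 1): 25.0, (1, 1, 2): 12.5, (1, 2, 1): 10.0, (1, 2, 2): 10.0}

stat = lrt_statistic(observed,
                     collapse_over_fathers(null_triads),
                     collapse_over_fathers(alt_triads))
print(stat)
```

In a real analysis the statistic would be summed over all 22 observed cells and referred to a chi-squared distribution with degrees of freedom equal to the number of constrained parameters.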