Genet Epidemiol. Author manuscript; available in PMC 2007 December 6.

Published in final edited form as:

Genet Epidemiol. 2007 January; 31(1): 18–30.

doi: 10.1002/gepi.20189

PMCID: PMC2118060

NIHMSID: NIHMS33502

Genotype-based likelihood ratio tests (LRT) of association that examine maternal and parent-of-origin effects have been previously developed in the framework of log-linear and conditional logistic regression models. In the situation where parental genotypes are missing, the expectation maximization (EM) algorithm has been incorporated in the log-linear approach to allow incomplete triads to contribute to the likelihood ratio test. We present an extension to this model which we call the Combined_LRT that incorporates additional information from the genotypes of unaffected siblings to improve assignment of incompletely typed families to mating type categories, thereby improving inference of missing parental data. Using simulations involving a realistic array of family structures, we demonstrate the validity of the Combined_LRT under the null hypothesis of no association and provide power comparisons under varying levels of missing data and using sibling genotype data. We demonstrate the improved power of the Combined_LRT compared with the family-based association test (FBAT), another widely used association test. Lastly, we apply the Combined_LRT to a candidate gene analysis in Autism families, some of which have missing parental genotypes. We conclude that the proposed log-linear model will be an important tool for future candidate gene studies, for many complex diseases where unaffected siblings can often be ascertained and where epigenetic factors such as imprinting may play a role in disease etiology.

Increased attention has been paid in recent years to the importance of epigenetic alterations in the etiology of many disorders including Prader-Willi syndrome, Angelman syndrome, and cancer. Over sixty known genes are currently described in the imprinted genes and parent-of-origin (*POO*) effects database (http://www.otago.ac.nz/IGC). For other complex diseases such as diabetes, hereditary paragangliomas, schizophrenia, intra-uterine growth retardation, autism, and neural tube defects, a role for parent-of-origin genes has been strongly hypothesized, although specific genes have not been identified conclusively as yet [Temple et al. 1995; van Schothorst et al. 1996; Abel 2004; Samaco et al. 2005; Chatkupt et al. 1992].

Current statistical approaches for detecting *POO* effects are limited. A handful of methods exist that aim to detect linkage to genes subject to *POO* effects [Paterson et al. 1999; Strauch et al. 2000; Triepels et al. 2001; Shete and Amos 2002; Shete et al. 2003]. For candidate gene studies of dichotomous traits, log-linear and conditional logistic regression models have been used to design tests for *POO* effects in family triads consisting of a mother, father and affected offspring [Weinberg et al. 1998; Weinberg 1999b; Cordell et al. 2004]. The likelihood approach employed by these tests can be thought of as closely related to the transmission disequilibrium test (TDT) [Spielman et al 1993], but with the flexibility to model *POO* effects, maternal genotype effects, and gene-gene or gene-environment interactions. For a di-allelic locus, these likelihood-based tests resist bias due to population stratification and allow valid estimation of relative risks. The parameter in the model that corresponds to a parent-of-origin (*POO*) effect is the ratio of the relative risk associated with inheriting a single variant copy from the mother (with 0 copies as referent) to that associated with inheriting a single variant copy from the father. When parental data are missing, the expectation-maximization (EM) algorithm can be used to allow incompletely genotyped triads to contribute information to the likelihood-ratio test (LRT) [Weinberg 1999a].

In practice, additional family members are often routinely sampled. Information can be extracted from the genotypes of unaffected siblings to improve inference of missing parental genotype data. This approach has been illustrated for full siblings under the scenario where the child's risk is modeled directly, but risks related to maternal genotype effects or parent-of-origin effects have not been considered [Schaid and Li 1997]. Further, the formerly proposed approach requires the assumption of Hardy-Weinberg equilibrium (HWE).

In this paper, we present an LRT for examining child genotype risk, maternal genotype effects and *POO* effects, which takes advantage of unaffected sibling genotypes when genotype data are missing for one or both parents, but does not require the assumption of HWE. We follow the log-linear modeling approach described by Weinberg et al. [Weinberg et al. 1998]. Using computer simulations, we examine statistical power for the LRT under various scenarios and various nuclear family structures. The main goal of this paper is to examine the gain in power for an LRT that incorporates genotypes from unaffected siblings when one or both parents are missing genotypes and when maternal genotype effects and *POO* effects are modeled. First we review the log-linear modeling approach. We then describe a likelihood approach for inclusion of incomplete families with unaffected siblings using the EM algorithm. We compare LRT power calculations with previously published power results for a general class of family-based association tests (FBAT) [Lange and Laird 2002b]. Finally, we demonstrate the utility of the LRT in an application to data from a candidate gene study in a set of families with autism, some of which have missing parental genotypes and unaffected siblings.

Weinberg et al. [Weinberg et al. 1998] described a likelihood-based method for analysis of triad data, which applies to various scenarios where the child's risk for disease is dependent on the child's genotype, the maternal genotype, or the parental origin of transmission. Under this approach, case-parent triads that are genotyped can be categorized based on the number of copies of a particular allele carried by the mother, father, and child (denoted “M”, “F”, and “C”, respectively). For a di-allelic marker, the triads consisting of M, F and C can be categorized according to 15 possible triad genotype outcomes, i.e. following a 15-cell multinomial distribution. We assume that, in the population at large, the transmission of genotypes from parent to offspring follows the laws of Mendelian inheritance and that there is symmetry of genotypes in mothers and fathers (e.g. the frequency of (M=2, F=1) is the same as the frequency of (M=1, F=2)). With parental genotype symmetry, there are 6 distinct parental genotype pairs, referred to as “mating types” [Schaid and Sommer 1993]. We also assume that the child inheriting one copy of the variant allele experiences a proportional increase in risk compared to one inheriting no copies, where that proportional increase is the same across the possible parental genotypes.
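As a quick check of the counts quoted above, the following Python sketch enumerates the Mendelian-compatible (M, F, C) triads for a di-allelic marker and then collapses parental pairs under mother-father symmetry. The helper names (`TRANSMISSIBLE`, `possible_children`) are illustrative assumptions and are not part of the original method.

```python
from itertools import product

# Number of variant copies a parent carrying g copies can transmit.
TRANSMISSIBLE = {0: (0,), 1: (0, 1), 2: (1,)}

def possible_children(m, f):
    """Child genotypes (copies of the variant) compatible with parents m and f."""
    return sorted({a + b for a in TRANSMISSIBLE[m] for b in TRANSMISSIBLE[f]})

# All Mendelian-compatible (M, F, C) cells of the triad multinomial.
triad_cells = [(m, f, c)
               for m, f in product(range(3), repeat=2)
               for c in possible_children(m, f)]

# Under parental symmetry, (M=2, F=1) and (M=1, F=2) share one mating type.
mating_types = sorted({tuple(sorted((m, f), reverse=True))
                       for m, f, _ in triad_cells}, reverse=True)

print(len(triad_cells))    # 15
print(len(mating_types))   # 6
```

The enumeration relies only on Mendelian transmission, so it makes no assumption about allele frequencies or HWE.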

Table I shows the categories for the 15-cell multinomial for triads, which correspond to the hypothetical frequencies of a log-linear model for the inherited genotype by the affected child. Mating-type strata parameters are specified by *μ _{i}*, i=1,...,6. Under Scenario A where risk is directly related to the number of variants inherited by the offspring,

Frequency table with expected categories of case-parent triads under models for scenarios A (child risk model), B (maternal risk model), and C (*POO* risk model).

Using notation similar to that of Weinberg et al. (1998), the log of the expected cell counts, E_{(n_{M,F,C})}, based on the multinomial distribution, can be expressed for scenario A as

$${\mathrm{lnE}}_{\left({n}_{\mathrm{M},\mathrm{F},\mathrm{C}}\right)}={\mu}_{i}+{\beta}_{1}{I}_{\mathrm{C}=1}+{\beta}_{2}{I}_{\mathrm{C}=2}+\mathrm{ln}\left(2\right){I}_{\mathrm{M}=\mathrm{F}=\mathrm{C}=1}$$
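To make the scenario A equation concrete, here is a minimal Python sketch that evaluates the expected count for a single (M, F, C) cell. The function name and the mating-type numbering in `MATING_TYPE` are assumptions; the numbering was chosen to agree with the examples in the text, where mating type 2 is (M=1, F=2 or M=2, F=1) and the doubly heterozygous parents correspond to μ_4.

```python
import math

# Assumed numbering of the six mating types (unordered parental pairs).
MATING_TYPE = {(2, 2): 1, (2, 1): 2, (1, 2): 2, (2, 0): 3, (0, 2): 3,
               (1, 1): 4, (1, 0): 5, (0, 1): 5, (0, 0): 6}

def expected_count_scenario_a(m, f, c, mu, beta1, beta2):
    """exp(mu_i + beta1*I[C=1] + beta2*I[C=2] + ln(2)*I[M=F=C=1])."""
    log_e = mu[MATING_TYPE[(m, f)]] + beta1 * (c == 1) + beta2 * (c == 2)
    if (m, f, c) == (1, 1, 1):
        log_e += math.log(2)  # offset: the all-heterozygous cell is doubled
    return math.exp(log_e)

# Illustrative values only: all stratum parameters zero, no genotype effect.
mu = {i: 0.0 for i in range(1, 7)}
null_counts = [expected_count_scenario_a(1, 1, c, mu, 0.0, 0.0) for c in (0, 1, 2)]

# With beta1 = ln(1.8), the C=1 cell is 1.8 times the C=0 cell within a stratum.
beta1 = math.log(1.8)
ratio = (expected_count_scenario_a(1, 0, 1, mu, beta1, 0.0)
         / expected_count_scenario_a(1, 0, 0, mu, beta1, 0.0))
```

With no genotype effect, the three cells of the heterozygous-by-heterozygous mating type come out in the Mendelian 1:2:1 proportions, which is the purpose of the ln(2) offset.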

The term ln(2)*I*_{M=F=C=1} is an offset used to double the frequency of families in mating type (1,1,1) in which all individuals are heterozygous. The six mating-type stratum parameters are represented by *μ _{i}*. The model can be fit via Poisson regression software, for example SAS, provided ln(2)

$${\mathrm{lnE}}_{\left({n}_{\mathrm{M},\mathrm{F},\mathrm{C}}\right)}={\mu}_{i}+{\alpha}_{1}{I}_{\mathrm{M}=1}+{\alpha}_{2}{I}_{\mathrm{M}=2}+\mathrm{ln}\left(2\right){I}_{\mathrm{M}=\mathrm{F}=\mathrm{C}=1}$$

The relative risk for a mother with one or two copies of the disease allele (*S _{1}* or

$${\mathrm{lnE}}_{\left({n}_{\mathrm{M},\mathrm{F},\mathrm{C}}\right)}={\mu}_{i}+{\beta}_{1}{I}_{\mathrm{C}=1}+{\beta}_{2}{I}_{\mathrm{C}=2}+{\alpha}_{1}{I}_{\mathrm{M}=1}+{\alpha}_{2}{I}_{\mathrm{M}=2}+{\epsilon}_{M}{\mathrm{I}}_{\left(\mathrm{M}\text{-derived copy}\right)}{\mathrm{I}}_{\mathrm{C}=1}+\mathrm{ln}\left(2\right){I}_{\mathrm{M}=\mathrm{F}=\mathrm{C}=1}$$

The relative increase in the risk for a single copy of the variant inherited from the mother rather than the father, namely *I _{M}* can be estimated by exponentiating the estimate for

Families with missing parental genotype data can contribute to the log-linear model via the EM algorithm [Dempster et al. 1977; Weinberg 1999a]. For families where one or both parents are missing genotypes, unaffected full siblings can be used to probabilistically infer the missing parental genotypes. Our approach differs from that of Schaid and Li (1997) since the assumption of HWE is not necessary. We assume low penetrance for the disease-locus genotypes, so that unaffected siblings essentially receive random genotypes from their parents. Hence, genotypes from unaffected siblings of the proband are used in the expectation step of the EM algorithm, but their phenotype information is not informative. In this way, the unaffected siblings serve as partial surrogates for a missing parent. The low penetrance assumption is a common one that might be expected for a single locus contributing to a complex disease. The same assumption has been made for the statistical tests implemented for example in the TRANSMIT [Clayton 1999] and the association in the presence of linkage test (APL) [Martin et al. 2003] programs.

To illustrate our approach, consider a dataset consisting of ‘quad’ families each with two parents, one affected *and* one unaffected sibling. Twenty distinct genotype configurations can be counted. The probabilities for each resulting genotype configuration (Mother=M, Father=F, Affected sib=C, Unaffected sib=U) can be expressed in terms of the parental mating-type and penetrance parameters. For example, for the second parental mating-type (M=1, F=2 or M=2, F=1), four possible distinct genotype configurations are possible for the two offspring: (2,2), (1,2) (2,1) and (1,1). The corresponding theoretical frequencies can be given as: *f _{2}* (1-

Nuclear families were simulated based on the multinomial distributions for genotype configurations for three possible nuclear family types. We refer to ‘triads’ as families having a mother, father, and an affected offspring, ‘quads’ as families where the affected proband has one genotyped unaffected full sibling, and ‘quints’ as families in which the affected proband has two genotyped unaffected full siblings. We limited our simulations to families having at most two additional siblings, although generalization to more than two siblings is straightforward. We allowed genotype data for one or both parents to be missing randomly with respect to genotype. Family data were simulated to include maternal risks and *POO* effects. Studies that included 200 families were simulated 2000 times for each. For comparison, we analyzed the data in several ways. For each scenario, we compared the results for the fully genotyped dataset (assuming no missing data), with that from a dataset comprised only of completely genotyped families (by removing families assigned to have missing data), and also to a dataset that included the incompletely genotyped families (by applying the EM algorithm). Type-I error rates and power were estimated as the proportion of the 2000 simulated replicates that rejected the null hypothesis at a nominal significance level of 0.05.
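The simulation design described above can be sketched in Python as follows. This is not the authors' simulation code: it swaps in rejection sampling for the multinomial sampling described in the text, assumes random mating and HWE in the parents, models only child genotype risk (no maternal or *POO* effects and no siblings), and always masks the father when a parent is set to missing. All function names are hypothetical.

```python
import random

def simulate_case_triad(allele_freq, r1, r2, rng):
    """Draw one (M, F, C) triad ascertained through an affected child."""
    rel_risk = (1.0, r1, r2)
    max_risk = max(rel_risk)
    while True:
        mother = [rng.random() < allele_freq for _ in range(2)]
        father = [rng.random() < allele_freq for _ in range(2)]
        # Mendelian transmission: one randomly chosen allele from each parent.
        child = rng.choice(mother) + rng.choice(father)
        # Keep the family with probability proportional to the child's risk.
        if rng.random() < rel_risk[child] / max_risk:
            return sum(mother), sum(father), child

def simulate_study(n_families, allele_freq, r1, r2, frac_missing, rng):
    """Simulate a study and mask the father's genotype in a random subset."""
    study = []
    for _ in range(n_families):
        m, f, c = simulate_case_triad(allele_freq, r1, r2, rng)
        if rng.random() < frac_missing:
            f = None  # missing at random with respect to genotype
        study.append((m, f, c))
    return study

def rejection_rate(p_values, alpha=0.05):
    """Proportion of replicates rejecting at level alpha (Type I error or power)."""
    return sum(p < alpha for p in p_values) / len(p_values)

# One null replicate: 200 families, allele frequency 0.30, 30% missing a parent.
rng = random.Random(2007)
study = simulate_study(200, 0.30, 1.0, 1.0, 0.30, rng)
n_missing = sum(f is None for m, f, c in study)
```

In a full study, each of the 2000 replicates would be analyzed with the LRT and the resulting p-values passed to `rejection_rate`.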

To evaluate Type-I error, a disease allele frequency of 0.30 was used, and 30% of families were allowed to have missing genotype data from one of the parents. We simulated family data under the null hypothesis (*R _{1}*=

A primary contribution of this paper is the development of a version of the LRT that can accommodate datasets with a combination of different family structures. Previous work by Weinberg [Weinberg et al. 1998] showed that the LRT method based on the model of Table I is a valid measure of association in a stratified population. We show that the “Combined_LRT” (i.e. that exploiting information from unaffected siblings) remains valid as a test of child genotype risk in a stratified population that is not in HWE. We simulated a stratified population as a mixture of two distinct subpopulations that are each assumed to only marry within the subpopulation and each to be in HWE. The first subpopulation, which accounted for 80% of the overall mixed population, had an allele frequency of 0.10 and a background risk (i.e. risk when the child and the mother carry no copies of the variant allele) of 0.01. The second subpopulation accounted for the remaining 20% of the mixed population and had an allele frequency of 0.30 and a background risk of 0.05. This population stratification produces a strong association between the allele and the disease, as an artifact, unless one studies families. We carried out simulations so that in each subpopulation *R _{1}*=

For the power studies our goal was to examine whether having additional sibling data improved the power of the LRT when some parental genotypes were missing. We carried out simulations under the scenario where families were missing one or both parents. A disease allele frequency of 0.30 was assumed. When the child's genotype risk was to be simulated as a function of the child's genotype, the data were simulated under dominant, additive and multiplicative models, where *R _{1}*=

We performed power calculations to compare the LRT with previously published power studies for a family-based association test, FBAT [Lange and Laird 2002a, 2002b; Morris R. unpublished thesis]. For families with incomplete parental genotypes, FBAT computes a sufficient statistic for the missing parental genotypes from the partially observed parental genotypes and offspring genotype classification. Lange and Laird (2002b, Table I) calculated power for a bipolar disorder study consisting of 213 triads with both parents genotyped, 175 quads with one parent and one sibling, 175 quints with one parent and two siblings, 220 quads with both parents missing, and 220 quints with both parents missing. They assumed an additive penetrance model where *f*_{0}=0.03, *f*_{1}=0.02, and *f*_{2}=0.01. We demonstrate LRT power for the model where risk depends on the child's genotype, using the same additive model. We focus our comparisons on the results from Table I of their paper [Lange and Laird 2002b] when phenotypes from additional siblings were not used, since this assumption is also made for the LRT. We also compare the LRT and FBAT under a multiplicative model where *R _{1}*=1.8 and 200 families are simulated. We allow 20% of families to be completely genotyped, 40% to have one parent missing, and the remaining 40% to have two parents missing. When one parent is missing, we allow 20% of those families to have up to two additional unaffected siblings. Likewise, when both parents are missing, we allow 20% of those families to have up to two unaffected siblings.

The LRT provides relative risk parameter estimates, which can be used to evaluate the strength of an observed association. Calculation of an accompanying Wald confidence interval makes use of the fact that the maximum likelihood estimates (MLEs) are asymptotically normal. However, when the sample size is small or parameter estimates lie on the interval boundary, alternative methods may be necessary for computing more reliable confidence intervals. Moreover, when families with missing data are included in the analysis, the relative risk estimates are based on the EM algorithm. In that case, the Wald confidence intervals should not be computed using the naïve standard errors based on the pseudo-data, i.e. under the assumption that the expected counts represent the true data, as if there were no missing data. These confidence intervals would be too narrow and would lead to coverage percentages smaller than the nominal level. In order to calculate appropriate confidence intervals for risk estimates, we apply a bootstrap resampling approach to validly estimate the standard error. Starting with an original simulated sample of 200 families, we select a new sample of 200 families (with replacement) 1000 times. For each new sample, we determine a relative risk parameter estimate ($\widehat{\beta}$) and then calculate the variance over all 1000 randomly selected samples such that $\widehat{\mathrm{Var}}\left(\widehat{\beta}\right)=\sum _{i=1}^{n}\frac{{\left({\widehat{\beta}}_{i}-\bar{\beta}\right)}^{2}}{n-1}$, where $\bar{\beta}$ represents the mean relative risk parameter estimate and n=1000. We have written a SAS macro that calculates a bootstrap confidence interval for relative risks generated from the LRT, to be made publicly available.
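The bootstrap procedure described above can be sketched in Python as follows. This is not the authors' SAS macro. The estimator `toy_log_rr` is a hypothetical stand-in for the EM-based LRT fit (it simply takes the log ratio of transmitted to untransmitted variant counts), the toy data are invented for illustration, and forming the interval as the estimate plus or minus 1.96 bootstrap standard errors on the log scale is an assumption about how the bootstrap variance is used.

```python
import math
import random
import statistics

def bootstrap_se(families, estimator, n_boot=1000, seed=0):
    """Standard error of estimator(families) from resampling whole families."""
    rng = random.Random(seed)
    n = len(families)
    estimates = []
    for _ in range(n_boot):
        # Draw n families with replacement; the family is the resampling unit.
        resample = [families[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    # statistics.variance divides by (n_boot - 1), matching the formula above.
    return math.sqrt(statistics.variance(estimates))

def toy_log_rr(families):
    """Hypothetical stand-in estimator: log ratio of transmitted to untransmitted."""
    transmitted = sum(t for t, u in families)
    untransmitted = sum(u for t, u in families)
    return math.log((transmitted + 0.5) / (untransmitted + 0.5))

# Toy data: 200 families, each recorded as (transmitted, untransmitted).
families = [(1, 0)] * 120 + [(0, 1)] * 80
beta_hat = toy_log_rr(families)
se = bootstrap_se(families, toy_log_rr, n_boot=1000, seed=1)
ci = (math.exp(beta_hat - 1.96 * se), math.exp(beta_hat + 1.96 * se))
```

Resampling whole families rather than individuals keeps each family's missing-data pattern intact, so every bootstrap replicate would be refit with the EM algorithm in the real analysis.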

We applied the Combined_LRT to candidate gene data from a large autism study. We analyzed a single nucleotide polymorphism (SNP) in the *reelin* (RELN) gene, a gene that is involved in neurodevelopmental processes, notably in the proper formation of brain structures. Several studies have identified genetic associations with RELN markers in family-based studies of autism [Persico et al 2001; Zhang et al 2002; Skaar et al. 2005]. Here we analyzed a SNP (rs2075043) in exon 44 of the RELN gene which has been previously shown to be associated with risk using the pedigree disequilibrium test (PDT) in this dataset, but has not been tested for *POO* or maternal genotype effects [Skaar et al. 2005]. For this example, we analyzed a dataset consisting of 347 autism families, of which 289 were completely genotyped and 58 families (17%) had genotypes for only one parent. For the families with one missing parent, 11 had one unaffected sibling and 4 had two unaffected siblings. Inclusion criteria for this study are discussed in detail in Skaar et al (2005). We compared the Combined_LRT with the original LRT (EM_LRT), which can be used to analyze triads with missing parental data but does not take advantage of the contributions from unaffected siblings.

Table II shows the results of simulations performed under the null hypothesis of either no linkage or no association, when 30% of families were missing genotypes for one parent. The LRTs for triads, quads, and quints all show estimates of Type I error rates consistent with the nominal 0.05 level for a child genotype risk model (*R _{1}*=

Figure 1 shows results of power simulations for the scenario when the child's genotype risk is modeled. For these simulations, all families had genotypes for only one parent. A 15−19% gain in power was observed when one or two unaffected siblings were included. The gain in power was seen for additive, dominant and multiplicative models. Having one additional unaffected sibling when parental data were missing seemed to result in the largest gain in power. The second unaffected sibling did not appear to contribute much more information in comparison. Previously, Weinberg [1999a] noted that when more than 50% of 100 families studied were missing one parent, the estimated Type I error rate was >0.05. In our own simulations of 200 families we found the Type I error rates to be consistent with the 0.05 level even with 100% of families missing data for one parent. When both parents' genotypes were missing in all 200 families, the Type I error rates were 0.056 and 0.053 for quads and quints, respectively. These Type I error rates are statistically compatible with the nominal rate of 0.05, though slightly inflated. Because of concern about possibly elevated error rates, no power results are reported for the child risk model for the scenario when both parents are missing in all 200 families. Clearly, one cannot study either imprinting or maternal effects when both parents are missing.

Simulation-based power calculations when effects of the *child genotype on risk* are simulated and modeled. Results for 2000 samples of 200 families with 1 missing parent simulated under additive, dominant and multiplicative models where *R*_{1}=1.8 and the **...**

Results for power simulations where maternal genotype effects were modeled and the percent of families with missing parental genotypes varied are displayed in Figure 2. Power increased when the EM algorithm was used to incorporate incompletely typed families. Having one unaffected sibling led to additional gain, close to that of the fully genotyped dataset even when 40% of families were missing one parent. With 50% of families missing one parent, having unaffected siblings improved the power from 67% for the observed full dataset to 77% using incomplete triads with the EM algorithm, 82% with the addition of one unaffected sibling, and 85% with the addition of two unaffected siblings. In simulations where all 200 families were missing the father's genotypes, having additional unaffected siblings resulted in a substantial improvement in power. The power increased from 52% for observed full triads to 78% for quads and 83% for quints (results not shown).

Plot of estimated power based on 2000 simulations where maternal effects are modeled, shown as a function of the percentage of families where the father is missing genotype data. The model was simulated so that *R*_{1}=1, *R*_{2}=1, *S*_{1}=1.8, *S*_{2}=2.5, and *Im*=1. A **...**

Figure 3 shows results for power simulations for a *POO* effects model when varying percentages of families were missing genotypes for one parent. The power of the LRT using only completely genotyped families declined steadily as the amount of missing data increased. As expected, power was regained with an increasing availability of genotypes from unaffected siblings. When two unaffected siblings were included, the power was close to that of the fully genotyped dataset, even with 50% of families missing data. When 100% of families were missing data for one parent, the power for triads was 22% and increased to 37% with the EM, 43% for quads, and 45% for quints respectively (data not shown).

Plot of estimated power based on simulations where a *POO* effect is modeled and as a function of the percentage of families where **1 parent is missing** genotype data. The model was simulated so that *R*_{1}=1.8, *R*_{2}=2.5, *Im*=2.5 and *S*_{1}=*S*_{2}=1. A 1-df LRT was used **...**

We allowed varying percentages of families to have missing genotypes for both parents and modeled *POO* effects. The results of those simulations are shown in Figure 4. When both parents are missing, the gain in power from including additional siblings is less dramatic. Including one unaffected sibling increased the power from 29% to 34% when 50% of families were missing genotypes for both parents. Adding a second unaffected sibling increased the power to 41%.

Plot of estimated power based on 2000 simulations where a *POO* effect is modeled and as a function of the percentage of families where **both parents are missing** genotype data. The model was simulated so that *R*_{1}=1.8, *R*_{2}=2.5, *Im*=2.5 and *S*_{1}=*S*_{2}=1. A 1-df LRT **...**

Table III presents power comparisons for a 2 df LRT for child genotype risks with previously published results for FBAT from Table I of Lange and Laird (2002b) for families with only a single affected offspring. These results show that for this example, the power of the LRT is greater than FBAT, particularly for the lower allele frequencies. We also found that, under a multiplicative model where *R _{1}*=1.8 and with an allele frequency equal to 0.30, FBAT also had less power (75%) than the LRT (88%) to detect a difference in child genotype risk.

Table IV shows the results from the autism dataset from the Combined_LRT using unaffected siblings, and the EM_LRT presented in Weinberg (1999a). The additional information gained from inclusion of unaffected siblings from 14 families results in modestly more significant results for the child genotype risk and the *POO* model. Neither test shows significant maternal effects.

The LRT we present in this paper expands upon the previous work of Weinberg [1999a] by allowing unaffected siblings to contribute to the EM algorithm when parental data are unavailable. Overall, our power calculations indicate that using genotype data from unaffected siblings improves the power of the LRT to detect risk due to the child's genotype, maternal genotype, and *POO* effects under varying amounts of missing parental genotypes. When the child's genotype risk is of interest, having one unaffected sibling appears to result in the biggest gain in power in the models that we examined. The addition of a second unaffected sibling does not appear to add much power to the LRT in these scenarios. However, if one is studying a rare disease like autism or neural tube defects and power is severely constrained by the small number of affected individuals available, it may well be worth it to genotype additional unaffected siblings. We see a similar improvement in power for the test of maternal genotype risk, particularly when all families are missing one parent. As expected, incorporating genotypes of unaffected siblings also results in increased power to detect *POO* effects when parental data are missing for one or both parents. For the simulated examples, the power to detect *POO* effects is almost completely recovered to that of the fully genotyped dataset when two unaffected siblings are used and all families are missing one parent (see Figure 3).

As has been previously noted for the LRT, multiple affected siblings cannot be considered independent if a locus under investigation is in linkage disequilibrium with the true disease locus, or other loci contribute to susceptibility to the disease phenotype [Schaid and Sommer 1993; Weinberg et al. 1998; Martin et al. 2003]. Failing to account for the correlation between affected relatives complicates the interpretation of significant results by making it difficult to distinguish evidence for linkage alone from evidence for both linkage and association. Because of this, the LRT is not a valid test of association in the presence of linkage when multiple affected siblings are used. An alternative method for testing maternal genotype and *POO* effects in family data is the conditioning on parental genotypes (CEPG) test, which fits the same model as the LRT, except in a conditional logistic regression framework [Cordell 2004]. This test, as opposed to the LRT, can incorporate genotypes from multiple affected individuals in a family using the Huber-White “information sandwich” estimation [Huber 1967; Whitehead et al. 1982]. However, the LRT has improved power compared with the *CEPG* to detect *POO* and mother-child genotype effects since it incorporates *POO* ambiguous trios (1,1,1) via the EM algorithm [Cordell 2004]. In addition, although families with missing genotypes for one parent can be analyzed with the *CEPG*, this test does not implement the more efficient EM-based missing data likelihood method that is incorporated into the LRT.

Assumed genetic models can be fit to the log-linear model by restricting the maternal or offspring relative risk (RR), for example, by considering the homozygous and heterozygous RRs to be equal under a dominant model. Specification of an assumed model can improve the power of the LRT if the specified model reflects the true mode of inheritance. However, if the underlying mode of inheritance differs from the specified model, this can result in a substantial loss of power. Starr et al. (2005) showed that loss of power can be controlled by assuming a log-additive model.

We provided a comparison of the LRT and another family-based association test, FBAT, using data from a bipolar study. Our power calculations under an additive model suggest that the LRT has greater power than FBAT, regardless of allele frequencies. Since FBAT is limited to models of child genotype risk, additional power comparisons for maternal effects and *POO* effects could not be performed.

The application of the Combined_LRT to the autism candidate gene RELN demonstrates that the addition of unaffected siblings gives more significant results for the child genotype risk and *POO* model compared with an alternative EM_LRT that does not incorporate unaffected siblings. The significance of this SNP for the child risk model was seen previously using the PDT and Geno-PDT (global p-value = 0.028 and 0.033) [Skaar et al 2005], but the LRTs in this paper provide the first significant evidence of *POO* effects.

In summary, we have shown that the power of the LRT can be improved if unaffected siblings are used in the EM algorithm when parental genotypes are missing. This conclusion is documented through simulation-based power comparisons for the parent-of-origin and maternal effects models when varying proportions of parental genotypes are missing, and is supported by improved significance in an application to real data from a study of autism. Power calculations for the maternal genotype risk model had been previously performed for triads with missing parents (Weinberg 1999a), but these did not include unaffected siblings, nor had the power to detect *POO* effects been studied. We demonstrated the improved power of the LRT for offspring genotype risk over a similar family-based test, FBAT, under an additive and a multiplicative model. We also provided a bootstrap approach for calculation of appropriate confidence intervals for datasets with missing data, which has been implemented along with the LRT in a SAS macro that is available online. These results show that when nuclear families are studied and parental genotypes are missing, bias-resistant likelihood methods can be used to take full advantage of genotype data from unaffected siblings.

The authors would like to thank Min Shi and Norman Kaplan for their contributions to this paper. Clarice R. Weinberg's research was supported in part by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences. We are grateful for the generous support by grants from the National Institutes of Health (ES11375, NS39818, ES11961) and a Ruth L. Kirschstein National Research Service Award predoctoral fellowship (NS046249). We also wish to thank the patients with autism and their family members who participated in this study. The autism example was generated through funding from NIH grants NS26630 and NS36768.

Consider a sample of *N* family quads. Let *N _{ijkl}* indicate the observed number of families in which the mother (M), father (F), affected child (A), and unaffected full sibling (U) carry

1. From iteration *r*, we have estimates of cell probabilities ${\widehat{p}}_{ijkl}^{\left(r\right)}$ for all *i,j,k,l*. (For *r*=0 these are initial starting values.)

2. Expectation step at iteration *r*+*1*: Compute the expected cell counts based on estimated cell probabilities from previous iteration.

$${Y}_{ijkl}^{(r+1)}={N}_{ijkl}+{M}_{?jkl}\frac{{\widehat{p}}_{ijkl}^{\left(r\right)}}{\sum _{possible\phantom{\rule{thickmathspace}{0ex}}i}{\widehat{p}}_{ijkl}^{\left(r\right)}}+{F}_{i?kl}\frac{{\widehat{p}}_{ijkl}^{\left(r\right)}}{\sum _{possible\phantom{\rule{thickmathspace}{0ex}}j}{\widehat{p}}_{ijkl}^{\left(r\right)}}+{D}_{??kl}\frac{{\widehat{p}}_{ijkl}^{\left(r\right)}}{\sum _{possible\phantom{\rule{thickmathspace}{0ex}}ij}{\widehat{p}}_{ijkl}^{\left(r\right)}}$$

3. Maximization step at iteration *r*+1: Maximize the full-data likelihood,

$$\mathrm{log}\left(L\right)=\sum _{possible\phantom{\rule{thickmathspace}{0ex}}ijkl}{Y}_{ijkl}^{(r+1)}\mathrm{log}\phantom{\rule{thickmathspace}{0ex}}{p}_{ijkl},$$

to obtain new estimates of cell probabilities ${\widehat{p}}_{ijkl}^{(r+1)}=\frac{{Y}_{ijkl}^{(r+1)}}{N}$.

This process is repeated until the parameter estimates converge. The resulting cell probabilities ${p}_{ijkl}$ are then used in the log likelihood calculation for the observed data as follows:

$$\begin{array}{rl}\mathrm{log}\left(L\right)=&\sum _{possible\;ijkl}{N}_{ijkl}\;\mathrm{log}\left({p}_{ijkl}\right)+\sum _{possible\;jkl}{M}_{?jkl}\;\mathrm{log}\left(\sum _{possible\;i}{p}_{ijkl}\right)+\sum _{possible\;ikl}{F}_{i?kl}\;\mathrm{log}\left(\sum _{possible\;j}{p}_{ijkl}\right)\\ &+\sum _{possible\;kl}{D}_{??kl}\;\mathrm{log}\left(\sum _{possible\;ij}{p}_{ijkl}\right)\end{array}$$
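The three steps and the observed-data log likelihood above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' SAS macro. It assumes that genotypes are coded as 0, 1, or 2 copies of the variant allele, and that `M`, `F`, and `D` hold the counts of families with the mother missing, the father missing, and both parents missing, as the subscripts $?jkl$, $i?kl$, and $??kl$ suggest. It also assumes that *N* in step 3 is the total number of families (complete and incomplete), that all observed genotype patterns are Mendelian-consistent, and that the M-step is the unconstrained update $Y/N$ given in step 3. All function names and the example counts are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import log


def transmitted(g):
    """Variant-allele counts (0 or 1) that a parent carrying g copies can pass on."""
    return {0: (0,), 1: (0, 1), 2: (1,)}[g]


def possible_cells():
    """All (i, j, k, l) = (M, F, A, U) genotype quads consistent with Mendelian inheritance."""
    cells = []
    for i, j in product(range(3), repeat=2):
        kids = sorted({a + b for a in transmitted(i) for b in transmitted(j)})
        cells.extend((i, j, k, l) for k in kids for l in kids)
    return cells


CELLS = possible_cells()


def marginals(p):
    """Sums of p over the mother's genotype, the father's genotype, and both."""
    over_i, over_j, over_ij = defaultdict(float), defaultdict(float), defaultdict(float)
    for (i, j, k, l), prob in p.items():
        over_i[j, k, l] += prob
        over_j[i, k, l] += prob
        over_ij[k, l] += prob
    return over_i, over_j, over_ij


def share(num, den):
    """Fraction num/den, taken as 0 when the denominator is 0."""
    return num / den if den > 0 else 0.0


def e_step(p, N, M, F, D):
    """Step 2: expected complete-data counts Y given the current probabilities p."""
    over_i, over_j, over_ij = marginals(p)
    Y = {}
    for i, j, k, l in CELLS:
        c = (i, j, k, l)
        Y[c] = (N.get(c, 0)
                + M.get((j, k, l), 0) * share(p[c], over_i[j, k, l])
                + F.get((i, k, l), 0) * share(p[c], over_j[i, k, l])
                + D.get((k, l), 0) * share(p[c], over_ij[k, l]))
    return Y


def run_em(N, M, F, D, tol=1e-10, max_iter=1000):
    """Iterate steps 2 and 3 from uniform starting values until p stabilises."""
    total = sum(N.values()) + sum(M.values()) + sum(F.values()) + sum(D.values())
    p = {c: 1.0 / len(CELLS) for c in CELLS}
    for _ in range(max_iter):
        Y = e_step(p, N, M, F, D)
        new_p = {c: Y[c] / total for c in CELLS}  # step 3: p = Y / N
        change = max(abs(new_p[c] - p[c]) for c in CELLS)
        p = new_p
        if change < tol:
            break
    return p


def observed_loglik(p, N, M, F, D):
    """Observed-data log likelihood: incomplete families contribute marginal sums of p."""
    over_i, over_j, over_ij = marginals(p)
    ll = sum(n * log(p[c]) for c, n in N.items() if n)
    ll += sum(n * log(over_i[key]) for key, n in M.items() if n)
    ll += sum(n * log(over_j[key]) for key, n in F.items() if n)
    ll += sum(n * log(over_ij[key]) for key, n in D.items() if n)
    return ll


# Hypothetical counts, for illustration only.
N = {(1, 1, 1, 0): 10, (0, 1, 1, 0): 5}  # fully genotyped quads, keyed (M, F, A, U)
M = {(1, 1, 0): 4}                        # mother missing, keyed (F, A, U)
p_hat = run_em(N, M, F={}, D={})
print(p_hat[(1, 1, 1, 0)], observed_loglik(p_hat, N, M, {}, {}))
```

In this toy dataset the four families with a missing mother are split between the two maternal genotypes compatible with the observed father and children, in proportion to the current cell probabilities. Fitting a constrained log-linear model instead of the saturated update would change only the M-step.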

Here we show how to compute step 2 of the EM algorithm (Appendix A) for some example quad families, extending the triad calculations presented in Table I to quads. Refer to Table V for calculations of cell probabilities. For simplicity, we show only a subset of the complete table and do not model imprinting or maternal genotype effects.

$$\begin{array}{cc}\hfill {Y}_{1110}^{(r+1)}& ={N}_{1110}+{M}_{?110}\frac{{\widehat{p}}_{1110}^{\left(r\right)}}{\sum _{possible\phantom{\rule{thickmathspace}{0ex}}i}{\widehat{p}}_{i110}^{\left(r\right)}}+{F}_{1?10}\frac{{\widehat{p}}_{1110}^{\left(r\right)}}{\sum _{possible\phantom{\rule{thickmathspace}{0ex}}j}{\widehat{p}}_{1j10}^{\left(r\right)}}+{D}_{??10}\frac{{\widehat{p}}_{1110}^{\left(r\right)}}{\sum _{possible\phantom{\rule{thickmathspace}{0ex}}ij}{\widehat{p}}_{ij10}^{\left(r\right)}}\hfill \\ \hfill & ={N}_{1110}+({M}_{?110}+{F}_{1?10})\frac{{\widehat{\mu}}_{4}}{{\widehat{\mu}}_{4}+{\widehat{\mu}}_{5}}+{D}_{??10}\frac{{\widehat{\mu}}_{4}}{{\widehat{\mu}}_{4}+2{\widehat{\mu}}_{5}}\hfill \\ \hfill {Y}_{1111}^{(r+1)}& ={N}_{1111}+({M}_{?111}+{F}_{1?11})\frac{{\widehat{\mu}}_{4}}{{\widehat{\mu}}_{4}+\frac{1}{2}{\widehat{\mu}}_{5}+\frac{1}{2}{\widehat{\mu}}_{2}}+{D}_{??11}\frac{{\widehat{\mu}}_{4}}{{\widehat{\mu}}_{4}+{\widehat{\mu}}_{5}+{\widehat{\mu}}_{2}}\hfill \end{array}$$
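As a numerical check of the simplified expressions above, the following Python sketch evaluates the second form of ${Y}_{1110}^{(r+1)}$ and the expression for ${Y}_{1111}^{(r+1)}$. The function names, the family counts, and the values used for ${\widehat{\mu}}_{2}$, ${\widehat{\mu}}_{4}$, and ${\widehat{\mu}}_{5}$ are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def expected_y1110(n, m_missing, f_missing, both_missing, mu4, mu5):
    """Expected count for cell 1110 from the simplified expression."""
    one_parent = (m_missing + f_missing) * mu4 / (mu4 + mu5)
    both_parents = both_missing * mu4 / (mu4 + 2 * mu5)
    return n + one_parent + both_parents


def expected_y1111(n, m_missing, f_missing, both_missing, mu2, mu4, mu5):
    """Expected count for cell 1111 from the simplified expression."""
    one_parent = (m_missing + f_missing) * mu4 / (mu4 + 0.5 * mu5 + 0.5 * mu2)
    both_parents = both_missing * mu4 / (mu4 + mu5 + mu2)
    return n + one_parent + both_parents


# Hypothetical inputs: 10 complete quads, 2 with the mother missing,
# 2 with the father missing, and 3 with both parents missing.
y1110 = expected_y1110(10, 2, 2, 3, mu4=0.2, mu5=0.1)
y1111 = expected_y1111(10, 2, 2, 3, mu2=0.3, mu4=0.2, mu5=0.1)
print(y1110, y1111)
```

With these inputs, families missing one parent contribute 4 × 0.2/0.3 to cell 1110 and families missing both contribute 3 × 0.2/0.4, so the expected count rises from 10 to roughly 14.17.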

ELECTRONIC SOFTWARE INFORMATION

The LRT SAS macros, including a bootstrap LRT macro, and a user manual are available at: http://wwwchg.duhs.duke.edu/software/index.html (for LRT).

- Abel KM. Fetal origins of schizophrenia: testable hypotheses of genetic and environmental influences. Br J Psychiatry. 2004;184:383–385. [PubMed]
- Chatkupt S, Lucek PR, Koenigsberger MR, Johnson WG. Parental sex effect in spina bifida: a role for genomic imprinting? Am J Med Genet. 1992;44:508–512. [PubMed]
- Clayton D. A Generalization of the Transmission/Disequilibrium Test for Uncertain-Haplotype Transmission. Am J Hum Genet. 1999;65:1170–1177. [PubMed]
- Cordell HJ, Barratt BJ, Clayton DG. Case/pseudocontrol analysis in genetic association studies: A unified framework for detection of genotype and haplotype associations, gene-gene and gene-environment interactions, and parent-of-origin effects. Genet Epidemiol. 2004;26:167–185. [PubMed]
- Dempster AP, Laird NM, Rubin DB. Maximum Likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc B. 1977;B39:1–38.
- Lange C, Laird NM. On a general class of conditional tests for family-based association studies in genetics: The asymptotic distribution, the conditional power, and optimality considerations. Genet Epidemiol. 2002a;23:165–180. [PubMed]
- Lange C, Laird NM. Power calculations for a general class of family-based association tests: Dichotomous traits. Am J Hum Genet. 2002b;71:575–584. [PubMed]
- Lange C, DeMeo D, Silverman EK, Weiss ST, Laird NM. PBAT: Tools for Family-Based Association Studies. Am J Hum Genet. 2004;74:367. [PubMed]
- Martin ER, Bass MP, Gilbert JR, Pericak-Vance MA, Hauser ER. Genotype-based association test for general pedigrees: the genotype-PDT. Genet Epidemiol. 2003;25:203–213. [PubMed]
- Morris RW. Likelihood ratio tests for association with multiple disease susceptibility alleles, genotyping errors, or missing parental data. 2003. Unpublished PhD thesis, North Carolina State University.
- Paterson AD, DeLisi L, Faraone SV, Gejman PV, Goossens D, Hovatta I, Kaufmann CA, Klauck SM, Kunugi H, Levinson DF, Mors O, Norton N, Smalley SL. Sixth World Congress of Psychiatric Genetics X Chromosome Workshop. Am J Med Genet. 1999;88:279–286. [PubMed]
- Persico AM, D'Agruma L, Maiorano N, Totaro A, Militerni R, Bravaccio C, et al. Reelin gene alleles and haplotypes as a factor predisposing to autistic disorder. Mol Psychiatry. 2001;6:150–159. [PubMed]
- Samaco RC, Hogart A, LaSalle JM. Epigenetic overlap in autism-spectrum neurodevelopmental disorders: MECP2 deficiency causes reduced expression of UBE3A and GABRB3. Hum Mol Genet. 2005;14:483–492. [PMC free article] [PubMed]
- Schaid DJ, Li H. Genotype relative-risks and association tests for nuclear families with missing parental data. Genet Epidemiol. 1997;14:1113–1118. [PubMed]
- Schaid DJ, Sommer SS. Genotype relative risks: methods for design and analysis of candidate-gene association studies. Am J Hum Genet. 1993;53:1114–1126. [PubMed]
- Shete S, Amos CI. Testing for genetic linkage in families by a variance-components approach in the presence of genomic imprinting. Am J Hum Genet. 2002;70:751–757. [PubMed]
- Shete S, Zhou X, Amos CI. Genomic imprinting and linkage test for quantitative-trait Loci in extended pedigrees. Am J Hum Genet. 2003;73:933–938. [PubMed]
- Skaar DA, Shao Y, Haines JL, Stenger JE, Jaworski J, Martin ER, DeLong GR, Moore JH, McCauley JL, Sutcliffe JS, Ashley-Koch AE, Cuccaro ML, Folstein SE, Gilbert JR, Pericak-Vance MA. Analysis of the RELN gene as a genetic risk factor for autism. Mol Psychiatry. 2005;10:563–571. [PubMed]
- Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993;52:506–516. [PubMed]
- Starr JR, Hsu L, Schwartz SM. Assessing maternal genetic associations. A comparison of the log-linear approach to case-parent triad data and a case-control approach. Epidemiology. 2005;16:294–303. [PubMed]
- Strauch K, Fimmers R, Kurz T, Deichmann KA, Wienker TF, Baur MP. Parametric and nonparametric multipoint linkage analysis with imprinting and two-locus-trait models: application to mite sensitization. Am J Hum Genet. 2000;66:1945–1957. [PubMed]
- Temple IK, James RS, Crolla JA, Sitch FL, Jacobs PA, Howell WM, Betts P, Baum JD, Shield JP. An imprinted gene(s) for diabetes? Nat Genet. 1995;9:110–112. [PubMed]
- Triepels RH, Hanson BJ, van den Heuvel LP, Sundell L, Marusich MF, Smeitink JA, Capaldi RA. Human complex I defects can be resolved by monoclonal antibody analysis into distinct subunit assembly patterns. J Biol Chem. 2001;276:8892–8897. [PubMed]
- van Schothorst EM, Jansen JC, Bardoel AF, van der Mey AG, James MJ, Sobol H, Weissenbach J, Van Ommen GJ, Cornelisse CJ, Devilee P. Confinement of PGL, an imprinted gene causing hereditary paragangliomas, to a 2-cM interval on 11q22-q23 and exclusion of DRD2 and NCAM as candidate genes. Eur J Hum Genet. 1996;4:267–273. [PubMed]
- Weinberg C. Allowing for missing parents in genetic studies of case-parent triads. Am J Hum Genet. 1999a;64:1186–1193. [PubMed]
- Weinberg CR. Methods for detection of parent-of-origin effects in genetic studies of case-parents triads. Am J Hum Genet. 1999b;65:229–235. [PubMed]
- Weinberg CR, Wilcox AJ, Lie RT. A log-linear approach to case-parent-triad data: Assessing effects of disease genes that act either directly or through maternal effects and that may be subject to parental imprinting. Am J Hum Genet. 1998;62:969–978. [PubMed]
- Zhang H, Liu X, Zhang C, Mundo E, Macciardi F, Grayson DR, et al. Reelin gene alleles and susceptibility to autism spectrum disorders. Mol Psychiatry. 2002;7:1012–1017. [PubMed]
