Genotype-based likelihood ratio tests (LRT) of association that examine maternal and parent-of-origin effects have been previously developed in the framework of log-linear and conditional logistic regression models. In the situation where parental genotypes are missing, the expectation maximization (EM) algorithm has been incorporated in the log-linear approach to allow incomplete triads to contribute to the likelihood ratio test. We present an extension to this model which we call the Combined_LRT that incorporates additional information from the genotypes of unaffected siblings to improve assignment of incompletely typed families to mating type categories, thereby improving inference of missing parental data. Using simulations involving a realistic array of family structures, we demonstrate the validity of the Combined_LRT under the null hypothesis of no association and provide power comparisons under varying levels of missing data and using sibling genotype data. We demonstrate the improved power of the Combined_LRT compared with the family-based association test (FBAT), another widely used association test. Lastly, we apply the Combined_LRT to a candidate gene analysis in Autism families, some of which have missing parental genotypes. We conclude that the proposed log-linear model will be an important tool for future candidate gene studies, for many complex diseases where unaffected siblings can often be ascertained and where epigenetic factors such as imprinting may play a role in disease etiology.
Increased attention has been paid in recent years to the importance of epigenetic alterations in the etiology of many disorders including Prader-Willi syndrome, Angelman syndrome, and cancer. Over sixty known genes are currently described in the imprinted genes and parent-of-origin (POO) effects database (http://www.otago.ac.nz/IGC). For other complex diseases such as diabetes, hereditary paragangliomas, schizophrenia, intra-uterine growth retardation, autism, and neural tube defects, a role for parent-of-origin genes has been strongly hypothesized, although specific genes have not been identified conclusively as yet [Temple et al. 1995; van Schothorst et al. 1996; Abel 2004; Samaco et al. 2005; Chatkupt et al. 1992].
Current statistical approaches for detecting POO effects are limited. A handful of methods exist that aim to detect linkage to genes subject to POO effects [Paterson et al. 1999; Strauch et al. 2000; Triepels et al. 2001; Shete and Amos 2002; Shete et al. 2003]. For candidate gene studies of dichotomous traits, log-linear and conditional logistic regression models have been used to design tests for POO effects in family triads consisting of a mother, father and affected offspring [Weinberg et al. 1998; Weinberg 1999b; Cordell et al. 2004]. The likelihood approach employed by these tests can be thought of as closely related to the transmission disequilibrium test (TDT) [Spielman et al 1993], but with the flexibility to model POO effects, maternal genotype effects, and gene-gene or gene-environment interactions. For a di-allelic locus, these likelihood-based tests resist bias due to population stratification and allow valid estimation of relative risks. The parameter in the model that corresponds to a parent-of-origin (POO) effect is the ratio of the relative risk associated with inheriting a single variant copy from the mother (with 0 copies as referent) to that associated with inheriting a single variant copy from the father. When parental data are missing, the expectation-maximization (EM) algorithm can be used to allow incompletely genotyped triads to contribute information to the likelihood-ratio test (LRT) [Weinberg 1999a].
In practice, additional family members are often routinely sampled. Information can be extracted from the genotypes of unaffected siblings to improve inference of missing parental genotype data. This approach has been illustrated for full siblings under the scenario where the child's risk is modeled directly, but risks related to maternal genotype effects or parent-of-origin effects have not been considered [Schaid and Li 1997]. Further, the formerly proposed approach requires the assumption of Hardy-Weinberg equilibrium (HWE).
In this paper, we present an LRT for examining child genotype risk, maternal genotype effects and POO effects, which takes advantage of unaffected sibling genotypes when genotype data are missing for one or both parents, but does not require the assumption of HWE. We follow the log-linear modeling approach described by Weinberg et al. [1998]. Using computer simulations, we examine statistical power for the LRT under various scenarios and various nuclear family structures. The main goal of this paper is to examine the gain in power for an LRT that incorporates genotypes from unaffected siblings when one or both parents are missing genotypes and when maternal genotype effects and POO effects are modeled. First, we review the log-linear modeling approach. We then describe a likelihood approach for inclusion of incomplete families with unaffected siblings using the EM algorithm. We compare LRT power calculations with previously published power results for a general class of family-based association tests (FBAT) [Lange and Laird 2002b]. Finally, we demonstrate the utility of the LRT in an application to data from a candidate gene study in a set of families with autism, some of which have missing parental genotypes and unaffected siblings.
Weinberg et al. [1998] described a likelihood-based method for analysis of triad data, which applies to various scenarios where the child's risk for disease is dependent on the child's genotype, the maternal genotype, or the parental origin of transmission. Under this approach, case-parent triads that are genotyped can be categorized based on the number of copies of a particular allele carried by the mother, father, and child (denoted “M”, “F”, and “C”, respectively). For a di-allelic marker, the triads consisting of M, F and C can be categorized according to 15 possible triad genotype outcomes, i.e. following a 15-cell multinomial distribution. We assume that, in the population at large, the transmission of genotypes from parent to offspring follows the laws of Mendelian inheritance and that there is symmetry of genotypes in mothers and fathers (e.g. the frequency of (M=2, F=1) is the same as the frequency of (M=1, F=2)). With parental genotype symmetry, there are 6 distinct parental genotype pairs, referred to as “mating types” [Schaid and Sommer 1993]. We also assume that the child inheriting one copy of the variant allele experiences a proportional increase in risk compared to one inheriting no copies, where that proportional increase is the same across the possible parental genotypes.
Table I shows the categories for the 15-cell multinomial for triads, together with the hypothetical cell frequencies implied by a log-linear model for the genotype inherited by the affected child. Mating-type strata parameters are specified by μi, i=1,...,6. Under Scenario A, where risk is directly related to the number of variants inherited by the offspring, R1 and R2 represent the relative risks for a child with one or two copies of the variant allele, respectively, compared with a child with no copies. Under Scenario B, where the risk depends on the number of copies carried by the mother, S1 and S2 correspond to the risk associated with the mother having one or two copies of the variant, respectively, compared with no copies. Under the null hypothesis of no association or no linkage, the multinomial distribution is specified solely by the mating-type parameters μi, under the assumption of Mendelian inheritance for scenarios A and B. When POO effects are modeled, the relative risk for a child who inherits one copy of the variant allele from the father is represented by Rp (Scenario C). If a single copy is inherited from the mother, then the child's relative risk can be represented by ImRp [Weinberg 1999b]. If there are no POO effects then Im=1, while Im≠1 suggests a POO effect. To fit the model with POO effects, we partition the 111 cell into two cells according to the (unobserved) parent-of-origin, and use the EM algorithm to estimate the imprinting parameter by maximizing the likelihood iteratively, accounting for the fact that we cannot assign the parent of origin for 111 triads.
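The cell structure described for Scenario A can be enumerated directly from Mendelian transmission probabilities. The following is a minimal sketch, not the authors' code; the function names (`transmit`, `child_prob`, `triad_cells`) are hypothetical, and each cell's weight is expressed relative to the rarest offspring genotype within its mating type, which is an assumption about how the multipliers in Table I are scaled.

```python
from fractions import Fraction
from itertools import product

def transmit(parent_copies, allele):
    """Probability that a parent with 0, 1 or 2 variant copies transmits `allele` (1 = variant)."""
    p_variant = Fraction(parent_copies, 2)
    return p_variant if allele == 1 else 1 - p_variant

def child_prob(m, f, c):
    """Mendelian probability that parents carrying m and f copies have a child with c copies."""
    return sum(transmit(m, a) * transmit(f, b)
               for a, b in product((0, 1), repeat=2) if a + b == c)

def triad_cells(r1=1.0, r2=1.0):
    """Map each possible (M, F, C) cell to (mating type, relative frequency within that type)."""
    risk = {0: 1.0, 1: r1, 2: r2}
    cells = {}
    for m, f in product(range(3), repeat=2):
        probs = {c: child_prob(m, f, c) for c in range(3)}
        smallest = min(p for p in probs.values() if p > 0)
        mating_type = tuple(sorted((m, f), reverse=True))  # unordered parental pair
        for c, p in probs.items():
            if p > 0:  # Mendelian-incompatible cells are dropped
                cells[(m, f, c)] = (mating_type, float(p / smallest) * risk[c])
    return cells

cells = triad_cells(r1=1.8, r2=2.5)
for cell, (mating_type, weight) in sorted(cells.items(), reverse=True):
    print(cell, mating_type, weight)
```

The enumeration yields 15 cells in 6 mating types, and the (1,1,1) cell carries twice the Mendelian weight of the (1,1,0) cell, which is the doubling that the ln(2) offset accounts for in the log-linear model.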
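Using notation similar to Weinberg et al. (1998), the log of the expected cell counts, E(nM,F,C), based on the multinomial distribution, can be expressed for scenario A as

$$\ln E(n_{M,F,C}) = \mu_i + \beta_1 I_{C=1} + \beta_2 I_{C=2} + \ln(2)\, I_{M=F=C=1}$$

where I denotes an indicator variable that equals 1 when its subscripted condition holds and 0 otherwise.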
The term ln(2)IM=F=C=1 is an offset used to double the frequency of families in cell (M,F,C) = (1,1,1), in which all individuals are heterozygous. The six mating-type stratum parameters are represented by μi. The model can be fit via Poisson regression software, for example SAS, provided ln(2)IM=F=C=1 is declared as an offset. The relative risk for a child with one or two copies of the disease allele (R1 or R2) can be estimated by exponentiating the estimate for β1 or β2, respectively. For the maternal genotype risk model (scenario B of Table I), the log of the expected cell counts can be written as

$$\ln E(n_{M,F,C}) = \mu_i + \alpha_1 I_{M=1} + \alpha_2 I_{M=2} + \ln(2)\, I_{M=F=C=1}$$
The relative risk for a mother with one or two copies of the disease allele (S1 or S2) can be estimated by exponentiating the estimate for α1 or α2, respectively. If POO effects are evaluated (as in scenario C of Table I), then the model can be written as
The relative increase in the risk for a single copy of the variant inherited from the mother rather than the father, namely Im, can be estimated by exponentiating the estimate for εM. The LRTs for testing the child's genotype risk or the maternal genotype risk are 2-df tests. The LRT for testing for a POO effect is a 1-df test, because it involves a single parameter.
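As an illustration of the 2-df test for the child genotype risk model with complete triads, the sketch below fits β1 and β2 and computes the likelihood ratio statistic. It is a hedged stand-in rather than the Poisson regression fit described in the text: it profiles out the mating-type parameters (which gives the same likelihood ratio as a Poisson fit with one parameter per mating type) and uses a plain Newton iteration. The `CELLS` table, the function names and the example counts are all hypothetical.

```python
import math

# (M, F, C) -> (mating type index, Mendelian multiplier); the 111 cell is doubled
CELLS = {
    (2, 2, 2): (1, 1),
    (2, 1, 2): (2, 1), (2, 1, 1): (2, 1), (1, 2, 2): (2, 1), (1, 2, 1): (2, 1),
    (2, 0, 1): (3, 1), (0, 2, 1): (3, 1),
    (1, 1, 2): (4, 1), (1, 1, 1): (4, 2), (1, 1, 0): (4, 1),
    (1, 0, 1): (5, 1), (1, 0, 0): (5, 1), (0, 1, 1): (5, 1), (0, 1, 0): (5, 1),
    (0, 0, 0): (6, 1),
}

def stratum(s):
    """Cells (with multipliers) that belong to mating type s."""
    return [(cell, w) for cell, (mt, w) in CELLS.items() if mt == s]

def profile_loglik(counts, b1, b2):
    """Log-likelihood of the triad counts, conditional on mating type."""
    beta = (0.0, b1, b2)
    ll = 0.0
    for s in range(1, 7):
        members = stratum(s)
        z = sum(w * math.exp(beta[cell[2]]) for cell, w in members)
        for cell, w in members:
            ll += counts.get(cell, 0) * math.log(w * math.exp(beta[cell[2]]) / z)
    return ll

def fit_child_risk(counts, iterations=25):
    """Newton iterations for (beta1, beta2); fails if the information matrix is singular."""
    b = [0.0, 0.0]
    for _ in range(iterations):
        beta = (0.0, b[0], b[1])
        score = [0.0, 0.0]
        info = [[0.0, 0.0], [0.0, 0.0]]
        for s in range(1, 7):
            members = stratum(s)
            total = sum(counts.get(cell, 0) for cell, _ in members)
            z = sum(w * math.exp(beta[cell[2]]) for cell, w in members)
            p = [0.0, 0.0, 0.0]  # fitted probability of C = 0, 1, 2 within the stratum
            for cell, w in members:
                p[cell[2]] += w * math.exp(beta[cell[2]]) / z
                if cell[2] > 0:
                    score[cell[2] - 1] += counts.get(cell, 0)
            for k in (1, 2):
                score[k - 1] -= total * p[k]
                for l in (1, 2):
                    info[k - 1][l - 1] += total * (p[k] * (k == l) - p[k] * p[l])
        det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
        b[0] += (info[1][1] * score[0] - info[0][1] * score[1]) / det
        b[1] += (info[0][0] * score[1] - info[1][0] * score[0]) / det
    return b[0], b[1]

def lrt_child_risk(counts):
    """Return (LRT statistic, p-value on 2 df, estimated R1, estimated R2)."""
    b1, b2 = fit_child_risk(counts)
    stat = 2.0 * (profile_loglik(counts, b1, b2) - profile_loglik(counts, 0.0, 0.0))
    p_value = math.exp(-stat / 2.0)  # chi-square survival function for 2 df
    return stat, p_value, math.exp(b1), math.exp(b2)

# Hypothetical counts for 200 complete triads, for illustration only
example = {(2, 2, 2): 3, (2, 1, 2): 12, (2, 1, 1): 9, (1, 2, 2): 11, (1, 2, 1): 8,
           (2, 0, 1): 6, (0, 2, 1): 7, (1, 1, 2): 14, (1, 1, 1): 30, (1, 1, 0): 9,
           (1, 0, 1): 28, (1, 0, 0): 17, (0, 1, 1): 26, (0, 1, 0): 16, (0, 0, 0): 4}
print(lrt_child_risk(example))
```

Families with missing parental genotypes are not handled here; in the approach described in the text they would enter through expected counts produced by the EM algorithm.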
Families with missing parental genotype data can contribute to the log-linear model via the EM algorithm [Dempster et al. 1977; Weinberg 1999a]. For families where one or both parents are missing genotypes, unaffected full siblings can be used to probabilistically infer the missing parental genotypes. Our approach differs from that of Schaid and Li (1997) since the assumption of HWE is not necessary. We assume low penetrance for the disease-locus genotypes, so that unaffected siblings essentially receive random genotypes from their parents. Hence, genotypes from unaffected siblings of the proband are used in the expectation step of the EM algorithm, but their phenotype information is not informative. In this way, the unaffected siblings serve as partial surrogates for a missing parent. The low penetrance assumption is a common one that might be expected for a single locus contributing to a complex disease. The same assumption has been made for the statistical tests implemented for example in the TRANSMIT [Clayton 1999] and the association in the presence of linkage test (APL) [Martin et al. 2003] programs.
To illustrate our approach, consider a dataset consisting of ‘quad’ families each with two parents, one affected and one unaffected sibling. Twenty distinct genotype configurations can be counted. The probabilities for each resulting genotype configuration (Mother=M, Father=F, Affected sib=C, Unaffected sib=U) can be expressed in terms of the parental mating-type and penetrance parameters. For example, for the second parental mating-type (M=1, F=2 or M=2, F=1), four distinct genotype configurations (C,U) are possible for the two offspring: (2,2), (1,2), (2,1) and (1,1). The corresponding theoretical frequencies can be given as: f2(1-f2)μ2, f1(1-f2)μ2, f2(1-f1)μ2, and f1(1-f1)μ2, where fi is the probability of being affected given that one carries i copies of the variant allele and μj is a mating-type parameter for the jth stratum. If the penetrance is assumed to be low, then for each genotype 1-fi will be close to 1. Hence, the model can be reduced to the two relative risk parameters R1 and R2 by dividing f1 and f2 by f0, respectively. Thus the relative frequencies can be rewritten as: R2μ2, R1μ2, R2μ2, R1μ2. In families with full parental data, the unaffected siblings do not contribute to tests of genetic effects. However, in families with missing parental data, the genotypes of unaffected siblings of the proband can be used in the EM algorithm to improve the parental genotype inference. The steps for the EM algorithm that incorporates genotypes from unaffected siblings along with an example are outlined in Appendices A and B, respectively.
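To make the role of the unaffected sibling concrete, the sketch below computes the posterior distribution of a missing maternal genotype given the father, the affected child, and optionally one unaffected sibling. It is only an illustration of the kind of calculation an expectation step relies on, not the algorithm of Appendix A. The function names and the `pair_freq` values are hypothetical: `pair_freq[(m, f)]` stands for the current fitted frequency of the ordered parental pair, which differs from the μ parameters of Table I by constant Mendelian factors. The example fills it with Hardy-Weinberg products purely for convenience; the method in the text does not require HWE.

```python
from fractions import Fraction
from itertools import product

def transmit(parent_copies, allele):
    """Probability that a parent with 0, 1 or 2 variant copies transmits `allele` (1 = variant)."""
    p_variant = Fraction(parent_copies, 2)
    return p_variant if allele == 1 else 1 - p_variant

def child_prob(m, f, c):
    """Mendelian probability that parents carrying m and f copies have a child with c copies."""
    return sum(transmit(m, a) * transmit(f, b)
               for a, b in product((0, 1), repeat=2) if a + b == c)

def mother_posterior(pair_freq, f, c, u=None):
    """P(M = m | F = f, affected child c, optional unaffected sibling u) under Scenario A.

    The child's relative risk is the same for every candidate mother, so it cancels.
    Under low penetrance the sibling genotype is treated as a random Mendelian draw.
    """
    weights = {}
    for m in range(3):
        w = pair_freq[(m, f)] * float(child_prob(m, f, c))
        if u is not None:
            w *= float(child_prob(m, f, u))
        weights[m] = w
    total = sum(weights.values())
    if total == 0:
        raise ValueError("observed genotypes are not Mendelian-compatible")
    return {m: w / total for m, w in weights.items()}

# Hypothetical parental pair frequencies (allele frequency 0.30, for illustration only)
geno = (0.49, 0.42, 0.09)
pair_freq = {(m, f): geno[m] * geno[f] for m in range(3) for f in range(3)}

without_sib = mother_posterior(pair_freq, f=1, c=1)
with_sib = mother_posterior(pair_freq, f=1, c=1, u=0)
print(without_sib)
print(with_sib)
```

In this example the affected child alone does not discriminate among the possible mothers, whereas a sibling carrying no copies rules out a mother with two copies and shifts weight toward a mother with none.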
Nuclear families were simulated based on the multinomial distributions for genotype configurations for three possible nuclear family types. We refer to ‘triads’ as families having a mother, father, and an affected offspring, ‘quads’ as families where the affected proband has one genotyped unaffected full sibling, and ‘quints’ as families in which the affected proband has two genotyped unaffected full siblings. We limited our simulations to families having at most two additional siblings, although generalization to more than two siblings is straightforward. We allowed genotype data for one or both parents to be missing randomly with respect to genotype. Family data were simulated to include maternal risks and POO effects. Studies that included 200 families were simulated 2000 times for each. For comparison, we analyzed the data in several ways. For each scenario, we compared the results for the fully genotyped dataset (assuming no missing data), with that from a dataset comprised only of completely genotyped families (by removing families assigned to have missing data), and also to a dataset that included the incompletely genotyped families (by applying the EM algorithm). Type-I error rates and power were estimated as the proportion of the 2000 simulated replicates that rejected the null hypothesis at a nominal significance level of 0.05.
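A simulation of this general kind can be sketched as follows. This is not the authors' simulation program, which sampled from the multinomial distributions of genotype configurations; for brevity, the sketch assumes random mating and Hardy-Weinberg proportions when drawing parents, and ascertains the affected child by rejection sampling in proportion to the relative risk. The function names, the dictionary layout and the default seed are hypothetical.

```python
import random

def draw_genotype(rng, q):
    """Number of variant copies for a parent, assuming HWE with allele frequency q."""
    return (rng.random() < q) + (rng.random() < q)

def draw_child(rng, m, f):
    """Mendelian draw of a child's genotype from parents with m and f copies."""
    return (rng.random() < m / 2) + (rng.random() < f / 2)

def simulate_families(n, q, r1, r2, n_sibs, p_missing, seed=0):
    """Simulate n case families with n_sibs unaffected siblings each.

    With probability p_missing, one parent (mother or father, chosen at random)
    has its genotype set to None, independently of genotype.
    """
    rng = random.Random(seed)
    risk = (1.0, r1, r2)
    top = max(risk)
    families = []
    while len(families) < n:
        m, f = draw_genotype(rng, q), draw_genotype(rng, q)
        c = draw_child(rng, m, f)
        if rng.random() >= risk[c] / top:
            continue  # child not ascertained as affected; draw a new family
        # low penetrance: unaffected siblings receive random Mendelian genotypes
        sibs = [draw_child(rng, m, f) for _ in range(n_sibs)]
        if rng.random() < p_missing:
            if rng.random() < 0.5:
                m = None
            else:
                f = None
        families.append({"M": m, "F": f, "C": c, "U": sibs})
    return families

families = simulate_families(200, q=0.30, r1=1.8, r2=2.5, n_sibs=1, p_missing=0.3, seed=1)
print(sum(fam["M"] is None or fam["F"] is None for fam in families), "families with a missing parent")
```

Repeating such a simulation many times and recording how often the test rejects at the 0.05 level gives the Type-I error (under the null) or power (under the alternative) estimates described above.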
To evaluate Type-I error, a disease allele frequency of 0.30 was used, and 30% of families were allowed to have missing genotype data from one of the parents. We simulated family data under the null hypothesis (R1=R2=S1=S2=Im=1), so that all genotype risks were equal to 1. We evaluated the Type-I error under various scenarios: (A) when the model included only parameters for the child's risk, (B) when only maternal genotype effects were modeled, and (C) when POO was modeled against a background model with parameters for both maternal and offspring genetic effects. In scenario C, we compared the Type-I error rates when POO effects were correctly and incorrectly modeled. For this scenario, we simulated maternal genotype risks (S1=1 and S2=2) and POO genotype risk (Im =1), and carried out a test for POO effects. We compared the test of POO effects when we correctly included maternal risk parameters in the model to the test in which we failed to include these parameters. Previously, Weinberg [1999b] showed that a POO model that does not account for maternal genotype effects is invalid when maternal genotype effects are indeed present. We also tested the performance of the “Combined_LRT”, in scenarios involving a mixture of family types (i.e. a mixture of triads, quads, and quints in one dataset).
A primary contribution of this paper is the development of a version of the LRT that can accommodate datasets with a combination of different family structures. Previous work by Weinberg [Weinberg et al. 1998] showed that the LRT method based on the model of Table I is a valid measure of association in a stratified population. We show that the “Combined_LRT” (i.e. that exploiting information from unaffected siblings) remains valid as a test of child genotype risk in a stratified population that is not in HWE. We simulated a stratified population as a mixture of two distinct subpopulations that are each assumed to only marry within the subpopulation and each to be in HWE. The first subpopulation, which accounted for 80% of the overall mixed population, had an allele frequency of 0.10 and a background risk (i.e. risk when the child and the mother carry no copies of the variant allele) of 0.01. The second subpopulation accounted for the remaining 20% of the mixed population and had an allele frequency of 0.30 and a background risk of 0.05. This population stratification produces a strong association between the allele and the disease, as an artifact, unless one studies families. We carried out simulations so that in each subpopulation R1=R2.
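The size of the artifact created by this mixture can be worked out directly from the stated subpopulation parameters. The short calculation below is a sketch under the assumption that, within each subpopulation, genotype has no effect on risk (R1=R2=1); the variable and function names are hypothetical.

```python
def genotype_freqs(q):
    """Hardy-Weinberg genotype frequencies for 0, 1 and 2 variant copies."""
    return ((1 - q) ** 2, 2 * q * (1 - q), q ** 2)

# (share of mixed population, allele frequency, background risk), as given in the text
subpops = [
    (0.80, 0.10, 0.01),
    (0.20, 0.30, 0.05),
]

def pooled_risk_by_genotype(subpops):
    """Disease risk by genotype in the pooled population when genotype has no true effect."""
    risks = []
    for g in range(3):
        cases = sum(share * genotype_freqs(q)[g] * base for share, q, base in subpops)
        people = sum(share * genotype_freqs(q)[g] for share, q, _ in subpops)
        risks.append(cases / people)
    return risks

risks = pooled_risk_by_genotype(subpops)
spurious_rr = [r / risks[0] for r in risks]
print(spurious_rr)
```

Even though genotype is unrelated to risk within each subpopulation, an analysis of unrelated individuals from the pooled population would see apparent relative risks of roughly 1.6 and 2.5 for one and two copies, because the variant allele is more common in the higher-risk subpopulation.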
For the power studies, our goal was to examine whether having additional sibling data improved the power of the LRT when some parental genotypes were missing. We carried out simulations under the scenario where families were missing one or both parents. A disease allele frequency of 0.30 was assumed. When the child's genotype risk was to be simulated as a function of the child's genotype, the data were simulated under dominant, additive and multiplicative models, where R1=1.8 and all families were missing one parent. Specifically, for the additive model, we set R1=1.8 and R2=2.5, and for the multiplicative model we set R1=1.8 and R2=3.2. When the maternal genotype was important, we simulated the data so that S1=1.8 and S2=2.5. For the POO model, the data were simulated so that R1=1.8, R2=2.5, S1=S2=1, Im=2.5. We varied the amount of missing parental data by allowing 20%, 40%, 50% and 100% of families to have missing genotypes for one parent (randomly selected to be the mother or the father). We allowed for up to 50% of families to have both parents with missing genotypes.
We performed power calculations to compare the LRT with previously published power studies for a family-based association test, FBAT [Lange and Laird 2002a, 2002b; Morris R. unpublished thesis]. For families with incomplete parental genotypes, FBAT computes a sufficient statistic for the missing parental genotypes from the partially observed parental genotypes and offspring genotype classification. Lange and Laird (2002b, Table I) calculated power for a bipolar disorder study consisting of 213 triads with both parents genotyped, 175 quads with one parent and one sibling, 175 quints with one parent and two siblings, 220 quads with both parents missing, and 220 quints with both parents missing. They assumed an additive penetrance model where f0=0.03, f1=0.02, and f2=0.01. We demonstrate LRT power for the model where risk depends on the child's genotype, using the same additive model. We focus our comparisons on the results from Table I of their paper [Lange and Laird 2002b] when phenotypes from additional siblings were not used, since this assumption is also made for the LRT. We also compare the LRT and FBAT under a multiplicative model where R1=1.8 and 200 families are simulated. We allow 20% of families to be completely genotyped, 40% to have one parent missing, and the remaining 40% to have two parents missing. When one parent is missing, we allow 20% of those families to have up to two additional unaffected siblings. Likewise, when both parents are missing, we allow 20% of those families to have up to two unaffected siblings.
The LRT provides relative risk parameter estimates, which can be used to evaluate the strength of an observed association. Calculation of an accompanying Wald confidence interval makes use of the fact that the maximum likelihood estimates (MLEs) are asymptotically normal. However, when the sample size is small or parameter estimates lie on the interval boundary, alternative methods may be necessary for computing more reliable confidence intervals. Moreover, when families with missing data are included in the analysis, the relative risk estimates are based on the EM algorithm. In that case, the Wald confidence intervals should not be computed using the naïve standard errors based on the pseudo-data, i.e. under the assumption that the expected counts represent the true data, as if there were no missing data. These confidence intervals would be too narrow and would lead to coverage percentages smaller than the nominal level. In order to calculate appropriate confidence intervals for risk estimates, we apply a bootstrap resampling approach to validly estimate the standard error. Starting with an original simulated sample of 200 families, we select a new sample of 200 families (with replacement) 1000 times. For each new sample b, we determine a relative risk parameter estimate ($\hat{\theta}_b$) and then calculate the variance over all 1000 randomly selected samples such that $\widehat{\mathrm{Var}}(\hat{\theta}) = \frac{1}{n-1}\sum_{b=1}^{n}(\hat{\theta}_b - \bar{\theta})^2$, where $\bar{\theta}$ represents the mean relative risk parameter estimate and n=1000. We have written a SAS macro that calculates a bootstrap confidence interval for relative risks generated from the LRT, to be made publicly available.
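The resampling scheme can be sketched in a few lines. This is not the SAS macro mentioned above; it is a minimal Python illustration in which families are resampled with replacement and the spread of the re-estimated parameter supplies the standard error. Two choices here are assumptions rather than the authors' procedure: the interval is built on the log scale, and the estimator (`r1_from_mt5`) is a deliberately simple stand-in for the full LRT fit, using only families with one heterozygous and one non-carrier parent, in which the ratio of children with one copy to children with none estimates R1. The example data are hypothetical.

```python
import math
import random

def r1_from_mt5(families):
    """Toy estimator of R1 from (M, F, C) triads with one heterozygous, one non-carrier parent."""
    one = sum(1 for m, f, c in families if {m, f} == {0, 1} and c == 1)
    none = sum(1 for m, f, c in families if {m, f} == {0, 1} and c == 0)
    return (one + 0.5) / (none + 0.5)  # 0.5 guards against empty cells in a resample

def bootstrap_ci(families, estimator, n_boot=1000, seed=0, z=1.96):
    """Approximate 95% interval using a bootstrap standard error of the log estimate."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n_boot):
        resample = rng.choices(families, k=len(families))  # whole families, with replacement
        logs.append(math.log(estimator(resample)))
    mean = sum(logs) / n_boot
    var = sum((x - mean) ** 2 for x in logs) / (n_boot - 1)
    point = math.log(estimator(families))
    half = z * math.sqrt(var)
    return math.exp(point - half), math.exp(point + half)

# Hypothetical sample of 200 triads, for illustration only
families = ([(1, 0, 1)] * 54 + [(1, 0, 0)] * 30 +
            [(0, 1, 1)] * 54 + [(0, 1, 0)] * 30 + [(1, 1, 1)] * 32)
low, high = bootstrap_ci(families, r1_from_mt5)
print(r1_from_mt5(families), (low, high))
```

When incomplete families are included, the estimator passed to `bootstrap_ci` would rerun the whole EM fit on each resample, so that the uncertainty due to the missing genotypes is reflected in the interval.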
We applied the Combined_LRT to candidate gene data from a large autism study. We analyzed a single nucleotide polymorphism (SNP) in the reelin (RELN) gene, a gene that is involved in neurodevelopmental processes, notably in the proper formation of brain structures. Several studies have identified genetic associations with RELN markers in family-based studies of autism [Persico et al 2001; Zhang et al 2002; Skaar et al. 2005]. Here we analyzed a SNP (rs2075043) in exon 44 of the RELN gene which has been previously shown to be associated with risk using the pedigree disequilibrium test (PDT) in this dataset, but has not been tested for POO or maternal genotype effects [Skaar et al. 2005]. For this example, we analyzed a dataset consisting of 347 autism families, of which 289 were completely genotyped and 58 families (17%) had genotypes for only one parent. For the families with one missing parent, 11 had one unaffected sibling and 4 had two unaffected siblings. Inclusion criteria for this study are discussed in detail in Skaar et al (2005). We compared the Combined_LRT with the original LRT (EM_LRT), which can be used to analyze triads with missing parental data but does not take advantage of the contributions from unaffected siblings.
Table II shows the results of simulations performed under the null hypothesis of either no linkage or no association, when 30% of families were missing genotypes for one parent. The LRTs for triads, quads, and quints all show estimates of Type I error rates consistent with the nominal 0.05 level for a child genotype risk model (R1=R2=1), for a maternal genotype risk model (S1=S2 =1), and for a POO effects model (R1=R2=S1=S2=Im=1) (Table II). When maternal genotype effects are simulated so that S1=1 and S2=2, but parameters for the maternal genotype effects are not included in the model, the test for POO effects shows Type I error rates considerably larger than the 0.05 level (shown in row 4 of Table II), demonstrating that the test is biased when the wrong model is fitted. The Combined_LRT that models a mixture of family types is also valid under the null hypothesis for all scenarios except in the presence of maternal genotype effects when the risk due to maternal genotype effects is incorrectly modeled (shown in column 7 of Table II). Type I error rates for the “Combined_LRT” also remain consistent with the nominal 0.05 level, even in a stratified population (p=0.050) (data not shown).
Figure 1 shows results of power simulations for the scenario when the child's genotype risk is modeled. For these simulations, all families had genotypes for only one parent. A 15-19% gain in power was observed when one or two unaffected siblings were included. The gain in power was seen for additive, dominant and multiplicative models. Having one additional unaffected sibling when parental data were missing seemed to result in the largest gain in power. The second unaffected sibling did not appear to contribute much more information in comparison. Previously, Weinberg [1999a] noted that when more than 50% of 100 families studied were missing one parent, the estimated Type I error rate was >0.05. In our own simulations of 200 families, we found the Type I error rates to be consistent with the 0.05 level even with 100% of families missing data for one parent. When both parents' genotypes were missing in all 200 families, the Type I error rates were 0.056 and 0.053 for quads and quints, respectively. These Type I error rates are statistically compatible with the nominal rate of 0.05, though slightly inflated. Because of concern about possibly elevated error rates, no power results are reported for the child risk model for the scenario when both parents are missing in all 200 families. Clearly, one cannot study either imprinting or maternal effects when both parents are missing.
Results for power simulations where maternal genotype effects were modeled and the percent of families with missing parental genotypes varied are displayed in Figure 2. Power increased when the EM algorithm was used to incorporate incompletely typed families. Having one unaffected sibling led to additional gain, close to that of the fully genotyped dataset even when 40% of families were missing one parent. With 50% of families missing one parent, having unaffected siblings improved the power from 67% for the observed full dataset to 77% using incomplete triads with the EM algorithm, 82% with the addition of one unaffected sibling, and 85% with the addition of two unaffected siblings. In simulations where all 200 families were missing the father's genotypes, having additional unaffected siblings resulted in a substantial improvement in power. The power increased from 52% for observed full triads to 78% for quads and 83% for quints (results not shown).
Figure 3 shows results for power simulations for a POO effects model when varying percentages of families were missing genotypes for one parent. The power of the LRT using only completely genotyped families declined steadily as the amount of missing data increased. As expected, power was regained with an increasing availability of genotypes from unaffected siblings. When two unaffected siblings were included, the power was close to that of the fully genotyped dataset, even with 50% of families missing data. When 100% of families were missing data for one parent, the power for triads was 22% and increased to 37% with the EM, 43% for quads, and 45% for quints respectively (data not shown).
We allowed varying percentages of families to have missing genotypes for both parents and modeled POO effects. The results of those simulations are shown in Figure 4. When both parents are missing, the gain in power from including additional siblings is less dramatic. Including one unaffected sibling increased the power from 29% to 34% when 50% of families were missing genotypes for both parents. Adding a second unaffected sibling increased the power to 41%.
Table III presents power comparisons for a 2 df LRT for child genotype risks with previously published results for FBAT from Table I of Lange and Laird (2002b) for families with only a single affected offspring. These results show that for this example, the power of the LRT is greater than FBAT, particularly for the lower allele frequencies. We also found that, under a multiplicative model where R1=1.8 and with an allele frequency equal to 0.30, FBAT also had less power (75%) than the LRT (88%) to detect a difference in child genotype risk.
Table IV shows the results from the autism dataset from the Combined_LRT using unaffected siblings, and the EM_LRT presented in Weinberg (1999a). The additional information gained from inclusion of unaffected siblings from 14 families results in modestly more significant results for the child genotype risk and the POO model. Neither test shows significant maternal effects.
The LRT we present in this paper expands upon the previous work of Weinberg [1999a] by allowing unaffected siblings to contribute to the EM algorithm when parental data are unavailable. Overall, our power calculations indicate that using genotype data from unaffected siblings improves the power of the LRT to detect risk due to the child's genotype, maternal genotype, and POO effects under varying amounts of missing parental genotypes. When the child's genotype risk is of interest, having one unaffected sibling appears to result in the biggest gain in power in the models that we examined. The addition of a second unaffected sibling does not appear to add much power to the LRT in these scenarios. However, if one is studying a rare disease like autism or neural tube defects and power is severely constrained by the small number of affected individuals available, it may well be worthwhile to genotype additional unaffected siblings. We see a similar improvement in power for the test of maternal genotype risk, particularly when all families are missing one parent. As expected, incorporating genotypes of unaffected siblings also results in increased power to detect POO effects when parental data are missing for one or both parents. For the simulated examples, the power to detect POO effects is almost completely recovered to that of the fully genotyped dataset (see Figure 3) when two unaffected siblings are used and all families are missing one parent.
As has been previously noted for the LRT, multiple affected siblings cannot be considered independent if a locus under investigation is in linkage disequilibrium with the true disease locus, or if other loci contribute to susceptibility to the disease phenotype [Schaid and Sommer 1993; Weinberg et al. 1998; Martin et al. 2003]. Failing to account for the correlation between affected relatives complicates the interpretation of significant results by making it difficult to distinguish evidence for linkage alone from evidence for both linkage and association. Because of this, the LRT is not a valid test of association in the presence of linkage when multiple affected siblings are used. An alternative method for testing maternal genotype and POO effects in family data is the conditioning on exchangeable parental genotypes (CEPG) test, which fits the same model as the LRT, except in a conditional logistic regression framework [Cordell 2004]. This test, as opposed to the LRT, can incorporate genotypes from multiple affected individuals in a family using the Huber-White “information sandwich” estimation [Huber 1967; Whitehead et al. 1982]. However, the LRT has improved power compared with the CEPG to detect POO and mother-child genotype effects since it incorporates POO-ambiguous trios (1,1,1) via the EM algorithm [Cordell 2004]. In addition, although families with missing genotypes for one parent can be analyzed with the CEPG, this test does not implement the more efficient EM-based missing data likelihood method that is incorporated into the LRT.
Assumed genetic models can be fit to the log-linear model by restricting the maternal or offspring relative risk (RR), for example, by considering the homozygous and heterozygous RR's to be equal under a dominant model. Specification of an assumed model can improve the power of the LRT if the specified model reflects the true mode of inheritance. However, if the underlying mode of inheritance differs from the specified model, this can result in a substantial loss of power. Starr et al. (2005) showed that loss of power can be controlled by assuming a log-additive model.
We provided a comparison of the LRT and another family-based association test, FBAT, using data from a bipolar study. Our power calculations under an additive model suggest that the LRT has greater power than FBAT, regardless of allele frequencies. Since FBAT is limited to models of child genotype risk, additional power comparisons for maternal effects and POO effects could not be performed.
The application of the Combined_LRT to the autism candidate gene RELN demonstrates that the addition of unaffected siblings gives more significant results for the child genotype risk and POO model compared with an alternative EM_LRT that does not incorporate unaffected siblings. The significance of this SNP for the child risk model was seen previously using the PDT and Geno-PDT (global p-value = 0.028 and 0.033) [Skaar et al 2005], but the LRTs in this paper provide the first significant evidence of POO effects.
In summary, we have shown that the power of the LRT can be improved if unaffected siblings are used in the EM algorithm when parental genotypes are missing. This conclusion is documented through simulation-based power comparisons for the parent-of-origin and maternal effects models when varying proportions of parental genotypes are missing, and is supported by improved significance in an application to real data from a study of autism. Power calculations for the maternal genotype risk model had been performed previously for triads with missing parents [Weinberg 1999a], but these did not include unaffected siblings, nor had the power to detect POO effects been studied. We demonstrated the improved power of the LRT for offspring genotype risk over a similar family-based test, FBAT, under additive and multiplicative models. We also provided a bootstrap approach for calculating appropriate confidence intervals for datasets with missing data, which has been implemented along with the LRT in a SAS macro that is available online. These results show that when nuclear families are studied and parental genotypes are missing, bias-resistant likelihood methods can be used to take full advantage of genotype data from unaffected siblings.
The authors would like to thank Min Shi and Norman Kaplan for their contributions to this paper. Clarice R. Weinberg's research was supported in part by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences. We are grateful for the generous support provided by grants from the National Institutes of Health (ES11375, NS39818, ES11961) and a Ruth L. Kirschstein National Research Service Award predoctoral fellowship (NS046249). We also wish to thank the patients with autism and their family members who participated in this study. The autism example was generated through funding from NIH grants NS26630 and NS36768.
Consider a sample of N family quads. Let Nijkl denote the observed number of families in which the mother (M), father (F), affected child (A), and unaffected full sibling (U) carry i, j, k, and l copies of the variant allele, respectively. Assume that a proportion of families are missing genotypes for the mother, the father, or both parents. Let M?jkl denote the number of families in which the mother's genotype is unknown but the genotypes of the father, affected child, and unaffected sibling are available. Let Fi?kl denote the number of families in which only the father's genotype is unknown. Let D??kl denote the number of families in which the genotypes of both parents are unknown. At each iteration r, the algorithm maintains an estimate of the expected count and a fitted probability for every cell (i,j,k,l). The steps of the EM algorithm are outlined below:
1. From iteration r, we have estimates of cell probabilities for all i,j,k,l. (For r=0 these are initial starting values.)
2. Expectation step at iteration r+1: Compute the expected cell counts based on the estimated cell probabilities from the previous iteration.
3. Maximization step at iteration r+1: Maximize the full-data likelihood, treating the expected cell counts from step 2 as if they were observed, to obtain new estimates of the cell probabilities.
This process is repeated until the parameter estimates converge. The resulting cell probabilities pijkl are then used in the log likelihood calculation for the observed data as follows:
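As a sketch, assuming multinomial sampling of families and dropping additive constants, the observed-data log likelihood built from the counts defined above takes the standard form below. The summation layout is an assumption based on those definitions: a family with a missing parent contributes the log of the cell probabilities summed over that parent's possible genotypes.

```latex
% Assumed reconstruction from the definitions of N, M, F, D and p_{ijkl}
\begin{align*}
\ln L_{\mathrm{obs}}
  ={}& \sum_{i,j,k,l} N_{ijkl} \ln p_{ijkl}
     + \sum_{j,k,l} M_{?jkl} \ln\Bigl(\sum_{i} p_{ijkl}\Bigr) \\
   &+ \sum_{i,k,l} F_{i?kl} \ln\Bigl(\sum_{j} p_{ijkl}\Bigr)
     + \sum_{k,l} D_{??kl} \ln\Bigl(\sum_{i,j} p_{ijkl}\Bigr)
\end{align*}
```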
Here we show how to compute step 2 in the EM algorithm (Appendix A) for some examples of quad families. Refer to Table V for calculations of cell probabilities. Here we expand the calculations presented in Table I for triads to quads. For simplicity, we show only a subset of the complete table and do not model imprinting or maternal genotype effects.
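The expectation step for quads can be illustrated with a minimal Python sketch. It makes several assumptions that are not in the text: the function and array names are hypothetical, genotypes are coded as 0, 1, or 2 copies of the variant allele, and the maximization step uses an unconstrained (saturated) multinomial rather than the log-linear model fitted by the LRT. Families with a missing parent are split across that parent's possible genotypes in proportion to the current cell probabilities.

```python
import numpy as np

def mendelian_mask():
    """Boolean (3,3,3,3) array, True where both children are
    Mendelian-compatible with parental genotypes (i, j)."""
    def transmits(g):
        # Variant-allele copies (0 or 1) a parent with g copies can pass on.
        return {0: {0}, 1: {0, 1}, 2: {1}}[g]
    mask = np.zeros((3, 3, 3, 3), dtype=bool)
    for i in range(3):
        for j in range(3):
            kids = {a + b for a in transmits(i) for b in transmits(j)}
            for k in kids:
                for l in kids:
                    mask[i, j, k, l] = True
    return mask

def _share(p, axes):
    """Conditional probability of each cell given the observed genotypes,
    i.e. p divided by its sum over the missing-parent axes."""
    denom = p.sum(axis=axes, keepdims=True)
    return np.divide(p, denom, out=np.zeros_like(p), where=denom > 0)

def e_step(p, N, M, F, D):
    """Expected complete-data counts E[i,j,k,l].
    N[i,j,k,l]: complete quads; M[j,k,l]: mother missing;
    F[i,k,l]: father missing; D[k,l]: both parents missing."""
    E = N.astype(float).copy()
    E += M[None, :, :, :] * _share(p, 0)        # spread over mother genotype i
    E += F[:, None, :, :] * _share(p, 1)        # spread over father genotype j
    E += D[None, None, :, :] * _share(p, (0, 1))  # spread over both parents
    return E

def em(N, M, F, D, tol=1e-10, max_iter=1000):
    """EM with a saturated multinomial M-step (a stand-in assumption for
    the log-linear fit); starts uniform on Mendelian-feasible cells."""
    mask = mendelian_mask()
    p = mask / mask.sum()
    for _ in range(max_iter):
        E = e_step(p, N, M, F, D)
        p_new = E / E.sum()                     # M-step: relative frequencies
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p
```

Starting from probabilities that are zero on Mendelian-impossible cells keeps those cells at zero throughout, so incomplete families are only ever assigned to feasible mating types.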
ELECTRONIC SOFTWARE INFORMATION
The LRT SAS macros, including a bootstrap LRT macro, and a user manual are available at: http://wwwchg.duhs.duke.edu/software/index.html.