The proposed likelihood model is sufficiently flexible for general purpose usage. It accommodates nuclear families of any size, unrelated singletons and combinations of the two. As special cases it reduces to the conditional on parental genotypes model [30] in nuclear families with complete data, and to retrospective likelihood analysis of unrelated subjects [34]. It allows for missing data and uncertain haplotype phase using standard likelihood methods. It has similar operating characteristics to TRANSMIT [12], owing to the relation between their score functions given in the Appendix, but does so within an ordinary likelihood framework.
The main innovations are separation of association parameters in the parental and conditional terms in the likelihood, and conditioning on the inheritance vector. The former has been done implicitly by previous authors [5], who fit a saturated model to the parental mating type. Here, a distinction is made between genotype frequencies and association effects in the parents, which allows more parsimonious models to be fit, including haplotype coding under the HWE assumption. When the mating type model is saturated, all families are the same size, and all sibships have the same trait vector, the association effects cannot be identified in the parental terms and the present model is equivalent to previous work. A related approach is the decomposition of total association into between- and within-family components [20]. In a prospective design, the within-family association is a valid estimate in the presence of population stratification [24]. However, in the retrospective design used here, the frequency model does not factor out of the likelihood when the data are complete, and so must be correctly specified. This approach is therefore never valid under population stratification, unlike the present model.
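To make the between- and within-family decomposition concrete, the following is a minimal sketch of one common form of it, in which the between-family component is the sibship mean of an allele count and the within-family component is each sibling's deviation from that mean. The use of the sibship mean, the function name `between_within` and the input layout are assumptions made for illustration; they are not taken from the cited work or from UNPHASED.

```python
from collections import defaultdict


def between_within(genotypes):
    """Split genotype scores into between- and within-family components.

    genotypes: list of (family_id, allele_count) pairs, one per sibling
    (a hypothetical layout used only for this illustration).
    Returns a list of (family_id, between, within) in the input order, where
    `between` is the sibship mean and `within` is the deviation from it.
    """
    by_family = defaultdict(list)
    for family_id, allele_count in genotypes:
        by_family[family_id].append(allele_count)

    # Between-family component: mean allele count within each sibship.
    means = {fam: sum(counts) / len(counts) for fam, counts in by_family.items()}

    # Within-family component: deviation of each sibling from the sibship mean.
    return [(fam, means[fam], g - means[fam]) for fam, g in genotypes]


data = [("F1", 0), ("F1", 2), ("F2", 1), ("F2", 1)]
components = between_within(data)
```

By construction the within-family deviations sum to zero in each sibship, so they carry no information about differences in allele frequency between families.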
Conditioning on the inheritance vector was previously proposed in the context of conditioning on sufficient statistics for missing genotype data [9]. Those authors found a noticeable loss of power, owing to a large number of uninformative families, and preferred to use a cluster variance estimate in a test without conditioning. Here, the conditioning has been set into a missing data likelihood framework, in which all families are informative for association. Comparison with the APL program, which estimates the haplotype-sharing probabilities without conditioning on linkage, indicates that the cost in power from the additional conditioning is very small. Further simulations (data not shown) compared power with and without conditioning, when no linkage was assumed, and again found the loss in power to be small.
For combining family samples with unrelated subjects, the proposed approach is similar to that of Epstein et al. [35]. The main difference is that here, HWE is assumed in the singletons, which allows a simple adjustment for population heterogeneity, and reduction to a standard retrospective analysis when there are only singletons. The rationale is that HWE is a common working assumption for unrelated subjects, being somewhat ensured by standard quality control measures, and heterogeneity is quite likely between samples ascertained under different criteria. Adjustment for heterogeneous genotype frequencies is done through an indicator covariate, and this approach can also be used to combine samples of the same type coming from multiple populations.
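The role of an indicator covariate can be illustrated with a small sketch. It assumes, purely for illustration, that an allele frequency p is modelled as logit(p) = alpha + delta * z, where z = 0 for a reference sample and z = 1 for a second sample. The logit parameterization and the function names are assumptions; the text does not specify how the covariate enters the frequency model.

```python
import math


def logit(p):
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def frequency_with_indicator(alpha, delta, z):
    """Allele frequency under the assumed model logit(p) = alpha + delta * z."""
    return 1.0 / (1.0 + math.exp(-(alpha + delta * z)))


def fit_indicator(count_ref, n_ref, count_other, n_other):
    """Closed-form fit for two samples with one indicator covariate.

    alpha is the log odds of the allele in the reference sample; delta is the
    log odds ratio between the second sample and the reference sample.
    """
    alpha = logit(count_ref / n_ref)
    delta = logit(count_other / n_other) - alpha
    return alpha, delta


# Hypothetical allele counts: 30 of 100 in one sample, 45 of 100 in the other.
alpha, delta = fit_indicator(30, 100, 45, 100)
p_reference = frequency_with_indicator(alpha, delta, 0)
p_other = frequency_with_indicator(alpha, delta, 1)
```

With delta fixed at zero the two samples share one frequency; letting it vary allows each sample its own frequency while any association parameters remain common to both.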
In the simulations reported here, the UNPHASED implementation performed as well as the best available methods over a range of situations. In families with a single affected child, the operating characteristics were very similar to those of TRANSMIT. Indeed, when the parental association parameters are set to zero, the results of UNPHASED and TRANSMIT are nearly perfectly correlated, for reasons suggested in the Appendix. When the parameters are freely estimated, the correlation is weaker but the type-1 error and power are still similar. Estimation of the parental parameters is desirable for testing hypotheses in which some effects are nonzero, for estimating effect sizes and allowing for prior linkage in sibships. The additional estimation incurs a small cost in power.
The power of UNPHASED is generally higher than that of FBAT, at the cost of a small increase in type-1 error when there are missing genotypes and population stratification. The APL program had similar power to UNPHASED, but higher type-1 error under population stratification. PCPH and MITDT had similar power to UNPHASED in trios but cannot currently handle larger sibships.
PCPH is the locally optimal test among those making no assumption on the missing data, but when there is a strong effect and a high proportion of missing data, it loses power in comparison with UNPHASED. Its main advantage is that it is always robust to population stratification, but the compromise approach adopted here appears to incur only small increases in type-1 error. Of course, situations may be constructed in which the increase is much more severe, but in practice careful ascertainment and quality control measures such as HWE testing should ensure that undetected population stratification has only a minor effect on the proposed approach.
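As an illustration of the HWE testing mentioned above as a quality control measure, the following is a minimal sketch of the standard one-degree-of-freedom chi-square test for a biallelic marker. The function name and the genotype counts are hypothetical, and an exact test may be preferable when counts are small.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg equilibrium for one biallelic marker.

    Arguments are observed genotype counts. Returns (statistic, p_value),
    with the p-value taken from the chi-square distribution on 1 df.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)

    # Pearson statistic; cells with zero expectation are skipped.
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


stat_ok, p_ok = hwe_chi_square(25, 50, 25)     # counts in exact HWE proportions
stat_bad, p_bad = hwe_chi_square(50, 0, 50)    # complete absence of heterozygotes
```

A marker with a very small p-value would typically be flagged for review, since departure from HWE can indicate genotyping error or population structure.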
The methods may be adapted to categorical, time-to-onset and other traits, through appropriate specification of the trait distribution. The generalized linear model is a convenient representation for many distributions [37], although the retrospective formulation may lead to identifiability issues needing special treatment, as in the proposed model for normal traits. In general the multinomial regression approximation gives a valid test of β = 0, although it may not be powerful against strong effects and is less appropriate for testing other hypotheses.
In general pedigrees, a simple approach is to extract nuclear families and treat them as independent sampling units. This may ignore correlations between nuclear families in the presence of linkage, although the effect is likely to be small. It is possible to condition on the entire linkage information in a pedigree [42], although the impact of this conditioning may be more severe than in the case of sib pairs considered here. Likelihood models for association in general pedigrees remain an interesting subject for further work. A special case is a sample of sibships without parents. If more than one trait value is present, then the methods described here can be applied, but the analysis may be time-consuming. An alternative is to incorporate conditioning on the sufficient statistic for the missing parental genotypes [42] into a likelihood for the sibships only. This approach has not been pursued here, but offers potential for extending this work to situations in which it is currently not computationally efficient.
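The extraction of nuclear families from a general pedigree, described at the start of the preceding paragraph, can be sketched as follows. The record layout (individual, father, mother, with `None` marking a founder) and the function name are assumptions for illustration and do not describe the UNPHASED input format.

```python
from collections import defaultdict


def extract_nuclear_families(pedigree):
    """Group non-founders by their (father, mother) pair.

    pedigree: list of (individual_id, father_id, mother_id) tuples; founders
    have None for both parents (a hypothetical layout for this sketch).
    Returns a dict mapping (father_id, mother_id) to the list of children.
    """
    families = defaultdict(list)
    for individual, father, mother in pedigree:
        # Only individuals with both parents recorded belong to a sibship.
        if father is not None and mother is not None:
            families[(father, mother)].append(individual)
    return dict(families)


# Three-generation example: p1 is a child in one family and a parent in another.
pedigree = [
    ("gf", None, None), ("gm", None, None),
    ("p1", "gf", "gm"), ("p2", None, None),
    ("c1", "p1", "p2"), ("c2", "p1", "p2"),
]
nuclear = extract_nuclear_families(pedigree)
```

In this example individual `p1` appears in two nuclear families, which is the source of the correlation between sampling units that is ignored when the families are treated as independent.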