Home | About | Journals | Submit | Contact Us | Français |

**|**Springer Open Choice**|**PMC2953624

Formats

Article sections

- Abstract
- Introduction
- Methodology
- Simulation study
- Optimal matching
- Application to SGA data
- Conclusions
- References

Authors

Related links

Behavior Genetics

Behav Genet. 2010 May; 40(3): 404–414.

Published online 2009 December 24. doi: 10.1007/s10519-009-9322-8

PMCID: PMC2953624

Yudi Pawitan, Email: es.ik@natiwap.iduy.

Received 2008 September 18; Accepted 2009 November 28.

Copyright © Springer Science+Business Media, LLC 2009

Family data are used extensively in quantitative genetic studies to disentangle the genetic and environmental contributions to various diseases. Many family studies based their analysis on population-based registers containing a large number of individuals composed of small family units. For binary trait analyses, exact marginal likelihood is a common approach, but, due to the computational demand of the enormous data sets, it allows only a limited number of effects in the model. This makes it particularly difficult to perform joint estimation of variance components for a binary trait and the potential confounders. We have developed a data-reduction method of ascertaining informative families from population-based family registers. We propose a scheme where the ascertained families match the full cohort with respect to some relevant statistics, such as the risk to relatives of an affected individual. The ascertainment-adjusted analysis, which we implement using a pseudo-likelihood approach, is shown to be efficient relative to the analysis of the whole cohort and robust to mis-specification of the random effect distribution.

Family data have been used extensively for complex genetic modelling such as quantitative-trait linkage (e.g., Amos 1994; Blangero et al. 2001) or segregation analysis to separate genetic and environmental contributions to non-Mendelian diseases (e.g., Falconer 1965; Mather and Jinks 1977; Neale and Cardon 1992). Other than overcoming the sample size problem associated with twin studies, especially when the disease of interest has a low prevalence, family data potentially provide richer genetic information (e.g., Pawitan et al. 2004). However, this information is likely to be concentrated in ‘genetically loaded’ families, so that it is not efficient to collect data from, nor to analyse, all families from a population register. Non-random ascertainment is commonly used in genetics research to maximize the amount of information in the data for a given sample size (e.g., Elston and Sobel 1979). One of the most common methods of non-random ascertainment is to include families with at least one affected member: for variance component models this has been suggested, for example, in deAndrade and Amos (2000), Epstein et al. (2002), and Burton (2003). However, this sampling scheme may not be optimal, and in fact it has been shown (Glidden and Liang 2002; Noh et al. 2005) that the analysis of the ascertained data is sensitive to mis-specification of the random-effect distribution. In this paper we develop an efficient and robust method of ascertaining informative families from population-based family registers for the purpose of complex genetic modeling involving variance component analysis of a binary trait.

In epidemiological analyses, we often need or wish to account for confounding factors. For variance component analysis of a binary trait, the most straightforward way to adjust for potential confounders is to include them as covariates in a generalized linear mixed model (GLMM) (Breslow and Clayton 1993; Lee and Nelder 1996). Marginal likelihood provides a flexible computational approach, and can be extended to multivariate binary traits analysis, as we have demonstrated in an analysis of the the co-morbidity of schizophrenia and bipolar disorder (Yip et al. 2008; Lichtenstein et al. 2009). The exact marginal likelihood computation is also more efficient than other computational approaches such as the Gibbs sampling (Zeger and Karim 1991; Burton et al. 1999), but for family data, it is still slow because of the high-dimensional integration (e.g., Pawitan et al. 2004). Because of this limitation, recent likelihood-based methods in family data analysis are limited in their ability to handle general covariates.

To avoid the integration step, Noh et al. (2006) used a hierarchical-likelihood method with Laplace approximation. However, for the volume of data that is typical for family studies, the computational requirements of these methods are still enormous, so they cannot be used during the model building stage, where numerous exploratory analyses are performed. Moger et al. (2008) suggested case-cohort methods as a way of dealing with large population-based family data with survival traits. We adapted their idea here and extended the exact marginal likelihood approach to a pseudo-likelihood approach to analyze ascertained family data.

Intuitively, information about familial clustering comes from families with at least two affected members. Thus, provided the genetic information in the full data can be preserved, ascertainment of families with at least two affected members offers the potential for dramatic data reduction; see Sect. 2.1 for a specific example. For computational efficiency it is natural to first group families with the same configuration of disease status and covariates. A novel aspect in our method is an ascertainment of the family configurations rather than family units. We propose an optimized matching method where we ascertain family configurations that are most informative, while making sure that relevant features, such as the risk to relatives of cases, are similar in the sampled data and the full cohort.

To summarize the contribution of this paper, we have developed a method to facilitate routine exploratory analysis of large population-based family data sets, where interest is focused on estimating the genetic and environmental contributions to a binary trait with adjustment for confounding. We propose an ascertainment scheme, where families are first grouped by the pattern of the outcomes and covariates of their members, and the ascertainment is of family configurations rather than family units. In our application, all families with two affected members are sampled, and the remaining families are sub-sampled in such a way that the sampled data matches the whole cohort with respect to the odds ratio for affected siblings. Our ascertainment-adjusted analysis, which uses a pseudo-likelihood method, is robust against mis-specification of the random-effect distribution and has high efficiency versus exact likelihood. We illustrate our method in a substantive analysis of a population-based dataset of birth outcome in pairs of siblings.

For motivation and illustration we use the small-for-gestational-age (SGA) data as described in Svensson et al. (2006). This dataset was obtained by linkage of the Swedish Multi-Generation Register and Medical Birth Register. We include covariates that have been suggested as potential risk factors to SGA, such as maternal age, preeclampsia diagnosis in an earlier pregnancy, smoking and body-mass index (BMI). Due to availability of information on some of the covariates, our data covers the calendar period (1981–2001). As in Svensson et al. (2006), we identified pairs of full siblings, where both of them had at least one delivery recorded in the Medical Birth Register, and we collected the birth information from the different types of sibships: sister–sister, brother–brother and sister–brother pairs.

The final dataset consists of 326,629 family-pairs (pairs of siblings with their spouses and offspring). The optimally matched sample is ascertained from these data, and its performance is compared to the analysis of the full data. To limit the data to a manageable volume, we used the information from a maximum of 4 pregnancies from any sib-pair. There were 129,593 family-pairs with 2 pregnancies, 125,405 with 3 pregnancies and 71,631 with 4 pregnancies. There were 921,925 offspring between 1981–2001, among whom we observed 21,103 born small for their gestational age (SGA). The following table shows the distribution of the number of SGA offspring within the families of sib-pairs:

Ascertaining only family-pairs with at least two affected members, we can limit the case family-pairs to just 1,055 + 58 + 3 = 1,116, instead of 1,116 + 18,807 = 19,923 if we consider at least one affected member.

It has been shown previously (Svensson et al. 2006) that genetic factors, especially the fetal component, account for the majority of the liability of having an SGA birth. However, birth order was the only fixed covariate included in the model, while, as the authors pointed out, the genetic liability to SGA may be partly mediated by well-known maternal risk factors for SGA births, such as smoking and preeclampsia.

Let be the vector of binary outcomes from *n*_{i} members of family *i*, for *i* = 1,…, *N*. The families are assumed to be independent. Let *x*_{1},…, *x*_{N} be the corresponding covariate matrices, each of size *n*_{i} × *p*. Also available is the information on relationships between members of a family, thus determining structures such as full siblings, cousins, paternal-halfsibs, etc. Conditional on the random effect we assume *y*_{ij} to be an independent Bernoulli with parameter *p*_{ij}, following a general linear mixed model (GLMM)

where g() is a link funciton, is a *p*-vector of fixed regression parameters. The random parameter captures the dependencies between family members; the design vector *z*_{ij} shows the contribution of to the outcome. To complete the specification, we assume is normal with mean zero and variance where contains all the variance component parameters.

In GLMM framework, the logit link is the canonical link function for binary-trait models. However, there are at least two reasons why the probit link may be preferred. Firstly, the probit link fits directly in the liability model (Sham 1998), which is commonly used in biometrical genetics applications. Secondly, the probit link also led to a convenient computation of the marginal likelihood in terms of multivariate normal probabilities (Pawitan et al. 2004). Noh et al. (2006) illustrated that the parameters estimated from the two models with different link functions are comparable after adjustment by a simple scale factor.

Specifically for the SGA data, a family structure consists of a pair of nuclear families made by full siblings. The vector *y*_{i} is the pregnancy outcomes from the two families. (The pregnancies are treated as the offspring of the families.) We consider the model (now in vector notation)

1

where *m*_{i} is the vector of maternal effects, *f*_{i} the fetal effects, *c*_{i} the common couple environment effect and *s*_{i} the common sibling environment. The common couple environment is the unique environment created by the father and the mother, and the sibling environment is the common childhood and adolescent environment experienced by the siblings. The common family environment is the unique environment created by the father and mother, and the sibling environment is the common childhood environment experienced by the sisters. We assume that *m*_{i}~ *N*(0, σ_{m}^{2}*R*_{m}), *f*_{i}~*N*(0, σ _{f}^{2}*R*_{f}), *c*_{i}~*N*(0, σ_{c}^{2}*R*_{c}) and *s*_{i}~ *N*(0, σ_{s}^{2}*R*_{s}). To illustrate the discrepancy in the correlation matrices for the random effects, let assume a sister–sister pair family where each sibling had two preqnancies. The outcome *y*_{i} is a binary vector that indicates SGA status of the 4 pregnancies, and

The (1,2)-element of *R*_{m} is equal to one since the first two outcomes come from the same mother, i.e. the first sister. The (1,2)-element of *R*_{f} is 0.5 since the first two foetuses are full siblings and the (1,3)-element of *R*_{f} is 0.125 since it refers to a cousin pair. Similar reasoning applies to *R*_{c} and *R*_{s}. For more details, we refer the reader to Pawitan et al. (2004).

From the probit model, we have

where the *Z*_{j}’s are independent standard normal variates. Thus we have the marginal probability

2

3

4

where *q* is the dimension of and The vector *V*_{i} (*V*_{i1},…, *V*_{in_i}) is with

where denotes the matrix obtained by stacking the row vectors *z*_{ij}, and *I*_{i} is the *n*_{i} × *n*_{i} identity matrix. The upper bound if *y*_{ij} = 1, and *u*_{ij} = ∞ if *y*_{ij} = 0. Similarly, the lower bound *l*_{ij} = − ∞ if *y*_{ij} = 1, and if *y*_{ij} = 0. Computation of the normal probability (4) is done using a Monte–Carlo algorithm (Genz 1992).

Let where we make the parameters explicit; the total log-likelihood would then be

For the SGA data, *N* is of the order of 325,000 family-pairs. Since the evaluation of each probability requires a non-trivial Monte–Carlo integration, a naive approach is out of the question. Our problem is compounded by the fact that the resulting likelihood is not smooth, while we need to use a derivative-free optimization method.

The likelihood computation will obviously be faster if the data are grouped according to the configurations of the family outcomes and covariates The total likelihood can then be written as

5

where *w*_{k} is the number of families with the *k*th configuration, and *M* is the total number of configurations.

If the family data consist of information on binary outcomes and *p* binary covariates from *k* family members, then *M*≤2^{k(p+1)}, and the number of probability computations can be reduced by a factor of *N*/*M*. However, for analysis of families with up to 4 members, this grouping will substantially reduce the computation time when the we only use one or two covariates. As we increase the number of covariates, *M* increases rapidly, so even the grouped data become too large to analyse with exact methods. For the SGA data, with one covariate we have *M* = 185, but with 5 covariates it increases to more than 11,000.

For the grouped data, ascertainment is naturally done on the family configurations rather than on the family units. Let *S* = {1,…, *M*} be the index set of all family configurations, and suppose that *S* can be divided into two disjoint sets, *S* = *S*_{0} *S*_{1}, where *S*_{1} is the set of all families with at least *k* affected members, and *S*_{0} is the set of control families. In line with the usual case–control studies, we will keep all case-family configurations. Control-family configurations will in general be included with probability less than one.

Let *A*_{j} = 1 if family *j* is ascertained, and 0 otherwise, and *a*_{j} = *P*(*A*_{j} = 1). Typically *a*_{j} is a function of the number of affected members, but it can also be a function of covariates. Then the exact ascertainment-adjusted likelihood contribution from an observed is

where *k* runs over all possible configurations from the same covariate *x*_{j}, such that Note that the denominator needs the evaluation of probabilities for all families that might get ascertained, even if many of those are in fact unobserved. Thus the computational burden of the exact likelihood is still too demanding for routine analysis.

We instead consider a weighted-likelihood

6

which is clearly an unbiased estimate of the log-likelihood (5). The main advantage over the exact likelihood is that we only need to evaluate the probabilities for family configurations that are both observed and ascertained.

Because of the Monte Carlo approximation, the log-likelihood (6) is not smooth. We use the derivative-free Nelder-Mead simplex algorithm (Nelder and Mead 1965) to get near to the solution, then use the Gauss-Seidel method with the smoothed log-likelihood to arrive at the final solution. The statistical software R was used for all computations.

Standard inference in the pseudo-likelihood framework typically relies on the asymptotic normality of the estimates, with the so-called sandwich formula for the variance (e.g., Kalbfleisch and Lawless 1988). Unfortunately, for our problem, deriving the sandwich formula analytically is too complicated. So in our examples we use the bootstrap method on the grouped family data. Under the bootstrap sampling, the total frequencies of the grouped data have a multinomial distribution. Since, conditional on the sum, the collection of Poisson variates has multinomial distribution, we can approximate the bootstrap samples by generating Poisson variates with means given by the observed frequencies. This means we can generate the bootstrap samples of the grouped data quite fast. We use *B* = 25 bootstrap replicates, which are sufficient since we will only use the bootstrap to compute standard errors.

We will address three issues by simulation: (1) robustness, (2) efficiency and (3) case definition. Previous studies (Glidden and Liang 2002; Noh et al. 2005) indicated that the exact likelihood analysis of ascertained data is sensitive to mis-specification of the random-effect distribution. Noh et al. (2005) used a complex procedure based on hierarchical likelihood to perform a robust analysis. Here we show that robustness can also be achieved with the standard analysis if we also ascertain some proportion of the control group. Furthermore, even though the procedure in Noh et al. is robust, there is a severe loss of information from the case-only design. We show that we can retain most of the information in the full cohort by sampling all case families and a fraction of the control families. In our simulation we will compare a case-only vs case–control designs. In addition, we compare exact and pseudo-likelihood approaches for the ascertained data.

Typically a case family is defined as having at least one affected member. However, intuitively, information about variance components is captured by familial clustering, i.e., at least two affected members in the family. In our SGA problem we will also get a great computational advantage from this definition of a case family, as it leads to a substantial reduction in the number of case-family configurations. In the simulation, we will compare the efficiency of the estimates under different definitions of case family (at least one affected vs at least two affected members).

Following the example in Noh et al. (2005), we simulated a population of 100,000 families, each comprising *n*_{i} = 5 siblings (children only, parents not included). Additionally, considering the small family sizes in the real data, we also simulated families with *n*_{i} = 3 siblings. The binary outcomes are assumed Bernoulli with probability π_{ij}, which follows the logistic mixed-model

where *b*_{i} is assumed *N*(0, θ). We define two family-level covariates (i.e. fixed across members within a family), both generated to follow the standard uniform distribution. (Having siblings within a family share the same covariates simplify the computations, but with real data this is not necessary.) The fixed parameters are set at β_{0} = − 0.5, β_{1} = 0.15, β_{2} = 0.20. Given these assumptions, the basic likelihood contribution from the *i*th sibship is

where π_{i}π_{ij}, and ϕ(*b*_{i}) is the normal density for *b*_{i}. The integral can be computed very fast using the Gaussian quadrature method.

To assess the robustness against mis-specification, we generate the random effects *b*_{i} according to each of the following distributions:

The first model provides a partial check on the simulation, where the exact method should work well. The second is a heavy-tailed model, which has been shown to produce biased estimates (Noh et al. 2005). In both cases, *b*_{i} has a true variance θ = 4.5. (We also tried another heavy-tailed model where *t*(5) is the *t*-distribution with 5 degrees of freedom, and a skewed model where *u*_{i} follows the gamma distribution with shape parameter 2 and scale parameter 1. The results were similar and will not be shown here)

The means and standard deviations of the parameters estimates from 200 datasets are presented in Table 1. When the model is correct, i.e., the true random-effect distribution is normal, all procedures are consistent (scenarios A1 to A4). However, note the substantial increase of variance in the case-only design (A2). For estimation of θ in A2, the efficiency vs the full cohort is (0.085/0.306)^{2} = 0.08. Even when only 5% of the control families are included A3, the efficiency is (0.085/0.099)^{2} = 0.75. Lower efficiency is achieved for the regression parameters, but this can be improved by increasing the sampling proportion of the controls. Furthermore, the results also indicate that the pseudo-likelihood (A4) achieves similar results as the exact likelihood (A3).

Means and standard deviations from 200 simulations for full and different ascertained samples, assuming normal random effects the in logistic mixed-effect model, and using *n*_{i} = 5 siblings per family

If the true random effects are not normally distributed, but the model still assumes normality, then the case-only design (A6) can produce very misleading estimates. This problem was first presented in Glidden and Liang (2002), and investigated further in Noh et al. (2005). Also, the variances of the estimates are substantially larger than those from the full cohort. In contrast, analysis of the full cohort (A5) is quite resistant to the mis-specification. More importantly, by including 5% of the control families, both the exact and pseudo-likelihood analysis of the ascertained data (A7 and A8) produce very close results to those from the full cohort data. Again, the pseudo-likelihood (A8) achieves high efficiency compared to the exact likelihood (A7).

When we change the case-family definition from at least 1 to at least 2 affected members, the number of case families drops substantially from around 13,700 to 4,400, while the number of control families increases from around 86,300 to 95,600. To achieve a similar number of ascertained families as in panel A, we increase the proportion of controls to 10% (scenarios B3 and B4). Case-only analysis (B2 and B6) continue to have robustness problems. For the case–control designs, we obtain similar results for robustness and efficiency as obtained in panel A. Finally, Table 2 shows similar results for small family size *n*_{i} = 3.

Means and standard deviations from 200 simulations for full and different ascertained samples, assuming normal random effects the in logistic mixed-effect model

In summary, from the simulation study we learn three things that are directly relevant in our current problem:

- Inclusion of control families increase the robustness against mis-specification of the random-effect distribution.
- In this logistic mixed-model setting, the pseudo-likelihood has high efficiency vs the exact likelihood.
- We can define case families as those having at least two affected members with little loss of information/efficiency.

In our experience with the real SGA data, a direct application of the suggested sampling approach for the family case–control data does not work well: it often produces estimates that are very far from the full-likelihood estimates. This is mainly because the sampled data are often too different from the full data with respect to certain features that reflect the parameters of interest. Since the full data are available, we devise a scheme to match these features in the full data with the same features in the ascertained data.

The vector of unknown parameters can be divided into two groups: regression parameters and variance components. Hence, there are two types of statistics that are natural for matching:

- Estimates from an ordinary generalized linear model (GLM) (without the random effects),
- Odd-ratios (ORs, between family members) that capture familial risk.

We have shown previously (Yip et al. 2008) that ORs describing risk in relatives are good proxy measures of the magnitude of variance components. So if the ORs from the sampled data are similar to the ORs from the full data, then we would expect the estimates of variance components from the two datasets to be of the same magnitude. Similar thinking applies to the estimation of the regression parameters. It is of course important that the estimation of the ordinary GLM and ORs can be done extremely fast even for the full data, so we perform the following scheme:

- Sample case and control families from the full data with the desired ascertainment probabilities.
- Obtain ordinary GLM estimates and ORs from the full data and from the sampled data, where the latter estimates account for the ascertainment.
- Compute the criteriawhere
*h*is the number of ORs and*p*is the number of covariates. Combine the criteria into*Q*=*Q*_{1}+*Q*_{2}from each sampled data. - Repeat the procedure a large number of times, and select the sampled data that minimizes
*Q*. In our examples, the best sample was chosen from 1000 samples. Once the sample is chosen, the estimation of the mixed model is based on the weighted likelihood (6).

While the ascertainment process looks complex, the principle is quite simple, i.e. we try to ascertain ‘balanced’ samples, where the balance is determined by the ORs and regression coefficients that are observed in the full data. In general, the process belongs to a stratified or two-stage sampling method. Had the data been much simpler, e.g. consisting only of families of size two and the condition involves only a single OR, then the ascertainment process becomes more transparent. In this situation, we ascertain all the case families, then sample the controls such that the ratio of cases to controls is the same as in the full data, thereby preserving the observed OR in the full data.

Since we use the bootstrap method for computing the standard errors, the complex matching does not present any analytical problem. We note that there are two ways to bootstrap the data: before or after the ascertainment step. In the former, we boostrap the full data and include the optimal matching step in the bootstrap. However, if we treat the ascertained data as a stratified sample, then it should be possible also to apply the bootstrap to the ascertained data, so the optimal matching is performed only once (to generate the ascertained data). We show later (Sect. 5) that these two methods in fact produce similar results.

The purpose of our analysis of the SGA data is to extend the results in Svensson et al. (2006) by including potential risk factors, such as birth order, maternal smoking and maternal body-mass index (BMI). We fit model (1), which includes 4 random components: maternal, fetal, couple environment and sibling environment effects. To assess the confounding between these risk factors and the genetic and environmental effects, we first fit a simple model that includes only birth order (first = 0, subsequent = 1). In the second model we include information on preeclampcia (yes = 1, no = 0), smoking (yes = 1, no = 0) and BMI (low, medium and high).

For the purpose of matching, the data are categorized by the type of sibling pairs and the number of offspring. The sib-pair types are sister–sister, brother–sister and brother–brother, but the last two can be combined since they have the same covariance structure (Pawitan et al. 2004). From each category we compute within-sib and between-sib ORs. Irrespective of the sib-pair type (sister–sister, sister–brother, brother–brother), within-sib ORs capture the maternal and couple environment effects. In the sister–sister pairs, between-sib ORs capture the maternal and sibling environment effects. In the brother–sister and brother–brother pairs, between-sib ORs capture fetal and sibling environment effects. Table 3 shows the descriptive statistics and the ORs from the full data and the optimally-matched sample. All ORs are very well-matched between the two datasets.

Counting statistics in various categories defined by sib-pair type, total number of offspring and number of offspring of the second sibling

To illustrate that matching is indeed necessary, the boxplots in Fig. 1 show a substantial variation between the ORs from the 1,000 randomly ascertained samples. Many samples produce ORs that are far from the corresponding full-data ORs (8.51 and 1.33 from the last category in Table 3). Without any matching, this large variability will lead to larger uncertainty in the estimates obtained from a single ascertained sample.

The results of various analysis are presented in Table 4. There is a higher risk of SGA on first or preeclamptic pregnancies, for smokers, or for women who have low BMI. Because of the large size of the data, all fixed regression parameters are very precisely estimated with very small standard errors. However, the estimation of the variance components, particularly the fetal component, is much less precise. With the simple model, there are only 185 observed family configurations, so there is no need for case–control sampling.

The result of the simple model is similar to Svensson et al. (2006), with a substantial fetal genetic component for SGA, accounting for a much larger contribution than the other three effects. (The result is not exactly the same, because of the different followup periods; see Sect. 2.1.) The genetic variance component parameters can be interpreted in terms of heritability (Pawitan et al. 2004), for example

is the heritability of the maternal genetic effect. The contributions to total liability from the fetal, common family and sibling environments are 38.5, 11.0, and 0.3%, respectively. The total genetic effect (maternal + fetal) explains 55.4 (16.9 + 38.5) per cent of the liability to SGA.

When we introduce more covariates, the number of family configurations increases to 11,151, of which 802 are case-family configurations. We ascertain three-times as many control configurations as case configurations, so we achieve substantial data reduction by this sampling, while obtaining results that are very close to those of the full data. (We also tried the same number of control and case configurations, and the estimates were also quite close, except for the fetal variance.) The standard errors (SEs) of the regression estimates from the full-data bootstrap are generally larger than the SEs from the full likelihood. However, for the variance parameters the SEs are comparable. Furthermore, the SEs obtained from bootstrapping the ascertained data are comparable to the full-data bootstrap, suggesting that there is no need to bootstrap the optimal matching step. The ‘naive SEs’, computed from the weighted likelihood, appear too optimistic for the regression estimates, but quite comparable to the bootstrap SEs for the variance-component parameters. In practice, since the naive SEs are more readily available, we might consider using them as a first approximation, particularly for the variance-component parameters.

While still significant, in the more complex model the maternal and fetal genetic variance components drop substantially. This reflects some confounding between these components and the risk factors, which is not surprising. For example, preeclampsia is associated with both maternal and fetal genetic effects (Pawitan et al. 2004). We also expect maternal BMI to have some genetic component. The result here indicates that there are further maternal and fetal genetic effects beyond those already explained by the risk factors.

Finally, we add two more covariates: (1) maternal country of birth (Nordic = 1, other = 0), and (2) maternal age at delivery (< 26, 26–32 and > 32). The number of configurations is now 67,997, of which 1,082 are case-family configurations. Now a full-data analysis is no longer practical, particularly when numerous exploratory analysis are needed. As before, we ascertain three times as many control configurations as case configuration, and obtain the following estimates for the four variance components: 0.33(SE = 0.03), 0.80(0.20), 0.25(0.04) and 0.01(0.01). These estimates are close to those found in Table 4, which means that the two extra covariates are not confounding the genetic and environmental effects.

Our work on population-based family data has been motivated by questions in genetic epidemiology, particularly in estimating the relative contribution of genetic and environmental components to human diseases. Previous works in this area have been hampered by the inability to include general covariates, mainly due to computational problems in dealing with the integration of marginal likelihood. In this paper we investigated a novel approach to sampling informative families with at least two affected members, together with control families. We showed that inclusion of controls is important to preserve the robustness of the full cohort data against model mis-specification. We also showed that the pseudo-likelihood approach leads to efficient computations and the statistical properties compared well to those of the exact likelihood approach.

In the application to SGA data, the more complex model reveals more insight into the contribution of genetic factors to this condition. For example, comparing one covariate (birth order) model with the more complex model (birth order, preeclampcia, maternal BMI, smoking) we found the total genetic contribution of liability to SGA drop from 55.4 to 45.5 per cent. This means that the genetic contribution to SGA is mostly independent of the known covariates. Similar comparison is very useful, if genotyped data are available. Then comparison between models with and without known (or candidate) risk associated single-nucleotide polymorphism (SNPs) will give us insight on how much of the total genetic effect was explained by those SNPs.

It is worth noting from the SGA analysis that, while fixed-effect regression parameters can be well estimated in this large dataset, the fetal variance component has a large standard error. This highlights the need for large population-based family data for precise estimation of genetic effects, and hence practical methods for dealing with such large datasets.

The case–control study design, commonly used in medical studies to reduce cost, collects information on cases and a subsample of controls. It is well known that a case–control study has a high efficiency compared to a full cohort study. We have a similar goal here: to sample a dataset that gives parameter estimates close to those obtained from the full cohort. Existing family-based case–control methods (e.g., Lu and Wang 2002; Moger et al. 2008) are focused on estimation of the fixed regression parameters rather than the variance components, and they usually involve only simple family structures that allow only a single variance component. For fitting the complex genetic and environmental component models, existing family-based case–control methods are still not practical enough for routine use.

While our motivation has been to reduce the computation in dealing with binary traits, it is clear that the issues and methods that we investigated here can be applied more generally to other questions, for example for quantitative-trait linkage analysis (e.g. Amos 1994; Blangero et al. 2001), where both segregation and linkage is required.

One weakness in our approach is that we can only deal with categorical covariates; continuous covariates will generate too many family configurations that the procedure becomes too slow. This is also a problem with other methods that rely on computing the likelihood for each configuration. Another weakness is typical with ascertainment methods, where, because of the subsampling, there is a potential loss of efficiency compared to the full data. However, it is worth noting that our approach is particular useful during model building stage, where speed is important but full precision less so. Once we arrive at the final stage, if feasible, we can of course use the full data for analysis.

In general, our optimal matching approach is applicable to situations where some population statistics are available. Our approach is akin to balancing considerations in two-stage sampling methodology (e.g., Reilly 1996), but the simulation approach to sample selection allows much more complex stratification.

Edited by Stacey Cherny.

- Amos CI. Robust variance-components approach for assessing genetic linkage in pedigrees. Am J Hum Genet. 1994;54:535–543. [PubMed]
- Blangero J, Williams JT, Almasy L. Variance component methods for detecting complex trait loci. In: Rao DC, Province MA, editors. Genetic dissection of complex traits. London: Academic Press; 2001. pp. 151–182.
- Breslow NE, Clayton D. Approximate inference in generalized linear mixed models. J Am Stat Assoc. 1993;88:9–25. doi: 10.2307/2290687. [Cross Ref]
- Burton PR. Correcting for nonrandom ascertainment in generalized linear mixed models (GLMMs), fitted using Gibbs sampling. Genet Epidemiol. 2003;24:24–35. doi: 10.1002/gepi.10206. [PubMed] [Cross Ref]
- Burton PR, Tiller KJ, Gurrin LC, Cookson WOCM, Musk AW, Palmer LJ. Genetic variance components analysis for binary phenotypes using generalized linear mixed models (GLMMs) and Gibbs sampling. Genet Epidemiol. 1999;17:118–140. doi: 10.1002/(SICI)1098-2272(1999)17:2<118::AID-GEPI3>3.0.CO;2-V. [PubMed] [Cross Ref]
- Andrade M, Amos CI. Ascertainment issues in variance component models. Genet Epidemiol. 2000;19:333–344. doi: 10.1002/1098-2272(200012)19:4<333::AID-GEPI5>3.0.CO;2-#. [PubMed] [Cross Ref]
- Elston RC, Sobel E. Sampling considerations in the gathering and analysis of pedigree data. Am J Hum Genet. 1979;31:62–69. [PubMed]
- Epstein MP, Lin X, Boehnke M. Ascertainment-adjusted parameter estimates revisited. Am J Hum Genet. 2002;70:886–895. doi: 10.1086/339517. [PubMed] [Cross Ref]
- Falconer DS. The inheritance of liability to certain diseases estimated from the incidence among relatives. Ann Hum Genet. 1965;29:51–76. doi: 10.1111/j.1469-1809.1965.tb00500.x. [Cross Ref]
- Genz A. Numerical computation of multivariate normal probabilities. J Comput Graph Stat. 1992;1:141–149. doi: 10.2307/1390838. [Cross Ref]
- Glidden D, Liang KY. Ascertainment adjustment in complex diseases. Genet Epidemiol. 2002;23:201–208. doi: 10.1002/gepi.10204. [PubMed] [Cross Ref]
- Kalbfleisch JD, Lawless JF. Likelihood analysis of multistate models for disease incidence and mortality. Stat Med. 1988;7:147–160. doi: 10.1002/sim.4780070116. [PubMed] [Cross Ref]
- Lee Y, Nelder JA. Hierarchical generalized linear models (with discussion) J R Stat Soc B. 1996;58:619–678.
- Lichtenstein P, Yip BH, Björk C, Pawitan Y, Cannon TD, Sullivan PF, Hultman CM (2009) Common genetic determinants of schizophrenia and bipolar disorder in Swedish families: a population-based study. Lancet 373:234–39 [PMC free article] [PubMed]
- Lu SE, Wang MC. Cohort case–control design and analysis for clustered failure-time data. Biometrics. 2002;58:764–772. doi: 10.1111/j.0006-341X.2002.00764.x. [PubMed] [Cross Ref]
- Mather K, Jinks JL. Introduction to biometrical genetics. Ithaca: Cornell University Press; 1977.
- Moger TA, Pawitan Y, Borgan O. Case-cohort methods for survival data on families from routine registers. Stat Med. 2008;27:1062–1074. doi: 10.1002/sim.3004. [PubMed] [Cross Ref]
- Neale MC, Cardon LR. Methodology for genetic studies of twins and families. Dordrecht: Kluwer Academic; 1992.
- Nelder JA, Mead R. A simplex method for function minimization. Comput J. 1965;7:308–313.
- Noh M, Lee Y, Pawitan Y. Robust ascertainment-adjusted parameter estimation. Genet Epidemiol. 2005;29:68–75. doi: 10.1002/gepi.20078. [PubMed] [Cross Ref]
- Noh M, Yip B, Lee Y, Pawitan Y. Multicomponent variance estimation for binary traits in family-based studies. Genet Epidemiol. 2006;30:37–47. doi: 10.1002/gepi.20099. [PubMed] [Cross Ref]
- Pawitan Y, Reilly M, Nilson E, Cnattingius S, Lichtenstein P. Estimation of genetic and environmental factors for binary traits using family data. Stat Med. 2004;23:449–465. doi: 10.1002/sim.1603. [PubMed] [Cross Ref]
- Reilly M. Optimal sampling strategies for two-stage studies. Am J Epidemiol. 1996;143:92–100. [PubMed]
- Sham PC. Statistics in human genetics. London: Arnold; 1998.
- Svensson A, Pawitan Y, Cnattingius S, Reilly M, Lichtenstein P. Familial aggregation of small-for-gestational-age births: the importance of fetal genetic effects. Am J Obstet Gynecol. 2006;194:475–9. doi: 10.1016/j.ajog.2005.08.019. [PubMed] [Cross Ref]
- Yip B, Björk C, Lichtenstein P, Hultman C, Pawitan Y. Covariance components models for multivariate binary-traits in family data analysis. Stat Med. 2008;27:1086–1105. doi: 10.1002/sim.2996. [PubMed] [Cross Ref]
- Zeger SL, Karim MR. Generalized linear models with random effects: a Gibbs sampling approach. J Am Stat Assoc. 1991;86:79–86. doi: 10.2307/2289717. [Cross Ref]

Articles from Springer Open Choice are provided here courtesy of **Springer**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |