|Home | About | Journals | Submit | Contact Us | Français|
Variance-component methods are popular and flexible analytic tools for elucidating the genetic mechanisms of complex quantitative traits from pedigree data. However, variance-component methods typically assume that the trait of interest follows a multivariate normal distribution within a pedigree. Studies have shown that violation of this normality assumption can lead to biased parameter estimates and inflations in type-I error. This limits the application of variance-component methods to more general trait outcomes, whether continuous or categorical in nature. In this paper, we develop and apply a general variance-component framework for pedigree analysis of continuous and categorical outcomes. We develop appropriate models using generalized-linear mixed model theory and fit such models using approximate maximum-likelihood procedures. Using our proposed method, we demonstrate that one can perform variance-component pedigree analysis on outcomes that follow any exponential-family distribution. Additionally, we also show how one can modify the method to perform pedigree analysis of ordinal outcomes. We also discuss extensions of our variance-component framework to accommodate pedigrees ascertained based on trait outcome. We demonstrate the feasibility of our method using both simulated data and data from a genetic study of ovarian insufficiency.
Mixed-model procedures (see McCulloch and Searle 2000 for an overview) have an impressive history in the statistical analysis of clustered, hierarchical, and spatial data. Researchers also have applied mixed models extensively to perform genetic analyses of correlated trait data from relatives within pedigrees. Initially, studies applied such mixed models to familial trait data to assess whether a quantitative trait of interest contained a significant genetic component and was heritable (Lange et al. 1976; Hopper and Mathews 1982). Such analyses consist of partitioning the trait variance within a family into estimated components due to separate genetic and environmental effects and subsequently examining whether the estimated component of variance due to the genetic effect has a significant impact on the trait of interest. Amos (1994) and Almasy and Blangero (1998) later extended this variance-component (VC) method to perform linkage analysis of quantitative trait data by modeling a separate component of variation due to a putative major-gene locus of interest. Additional studies demonstrated further extensions of the VC framework to allow for interaction testing (Mitchell et al. 1997) as well as outcome data from longitudinal studies (de Andrade et al. 2000)
The VC method has many appealing practical features for pedigree analysis of quantitative trait data. Compared to relative-pair based methods (e.g. Haseman and Elston 1972; Kruglyak and Lander 1995), VC methods utilize a framework that is more flexible and permits easier modeling of covariates and interaction effects. More importantly, many studies have shown that variance-component methods often have more power than relative-pair methods for detecting linkage (Williams and Blangero 1999; Sham and Purcell 2001; Yu et al. 2004). Nevertheless, VC methods require strong assumptions for valid inference. In particular, traditional VC methods typically rely on the assumption that the familial trait data follow a multivariate normal distribution in order to facilitate analyses. One can test this assumption using diagnostic tools such as those described in Hopper and Mathews (1982) and de Andrade et al. (2003). If the distributional assumption is violated, the traditional variance-component method may yield biased parameter estimates and elevated type I error rates for testing different effects. For example, Allison et al. (1999), Blangero et al. (2000), and Epstein et al. (2003) demonstrated these problems for quantitative data derived from Laplace, χ2, and censored normal distributions, respectively. Moreover, the normality assumption severely hinders the potential application of variance-component pedigree analysis to categorical outcomes, such as presence/absence of disease (binary outcome) or a disease-severity scale (ordinal outcome).
Even if one can transform the trait data to approximate normality, Allison et al. (1999) and Blangero et al. (2001) demonstrated the traditional variance-component method still yields inference problems when the transformed data have a standardized kurtosis greater than zero, which indicates more probability in the tails of the distribution than for a normal distribution. This finding impedes our interpretation of results for data that naturally follow a gamma distribution (Figure 1). Typically, variance-component analysis of gamma-distributed data proceeds by first applying an appropriate normality transformation (such as a logarithmic transformation) and then applying the traditional variance-component method to the transformed data. However, the transformed data may still have a positive kurtosis (as shown in Figure 2). In such a situation, one could perhaps find a better transformation for the data that reduces the kurtosis; however, such a transformation could be difficult to obtain.
If the assumptions of the traditional variance-component method are violated, one can apply robust variance-component methods that allow for trait distribution misspecification (Amos 1994; Blangero et al. 2001; Diao and Lin 2005) although it is unclear whether such methods are applicable to categorical outcomes. Moreover, if the trait distribution is approximately known, modeling the distribution could lead to increased power and more efficient parameter estimates relative to the robust methods. In an earlier paper (Epstein et al. 2003), we extended the variance-component method to test for linkage of a major gene that influences censored normal data using generalized linear mixed model (GLMM) theory (Breslow and Clayton 1993). Results indicated that linkage analyses of censored normal data using our tobit variance-component method had improved efficiency and more appropriate type-I error rates compared to application of traditional variance-component methods that explicitly assumed trait normality.
Given this success, we now develop a general variance-component framework for pedigree analysis of non-normal data that follow an exponential-family distribution, which includes Bernoulli, Poisson, and gamma distributions as special cases. Additionally we also develop a VC framework for pedigree analysis of ordinal categorical data. Like the tobit VC method, we base this framework on the GLMM framework proposed by Breslow and Clayton (1993). Our proposed method generalizes the work of Duggirala et al. (1997; 1999) and Hasstedt (1993), who analyzed binary and polychotomous traits, respectively, using VC methods, but assumed trait values were determined by liability values being above or below specific thresholds. Our method is also related to the work of Burton et al. (1999), who applied GLMM theory to analyze binary trait data but did not test for linkage. Further, the authors employed Gibbs-Sampling procedures for inference whereas we employ approximate maximum-likelihood procedures for this purpose.
We organize the rest of this article as follows. We first describe the general VC framework for pedigree analysis of an arbitrary trait with an exponential-family distribution and subsequently extend the framework to accommodate ordinal data. We next discuss approximate maximum-likelihood procedures for fitting our VC models followed by discussion of appropriate hypothesis testing of heritability and linkage for an outcome of interest. We use simulated pedigree data to evaluate our VC framework and further illustrate our approach with an application to a genetic study of ovarian insufficiency. We then make some concluding remarks about our approach and discuss potential extensions.
Consider a family of n relatives. Let yj be the trait value of the jth relative and y = (y1,y2,…,yn)be the trait data for the family. We assume yj originates from the sum of independent effects due to observed and unobserved factors both genetic and environmental in nature. Observed factors consist of fixed covariates such as age and gender. We denote Xj as a vector of such fixed factors (covariates) for the jth relative. We assume the unobserved random factors that influence yj consist of a major-gene locus of interest and a number of independent genes of small effect (polygenes). While we will not do so here, we easily could assume other random effects such as two or more major genes, gene-gene interaction, or unobserved shared environment. We assume the alleles of the major gene and polygenes act additively on the trait; the additive-alleles assumption is easily relaxed. We define MGj and PGj, to be the additive allelic effects of the unobserved major gene and polygenes for the jth relative. We assume MGj and PGj, are independent and normally distributed with means zero and variances and , respectively. Finally, we denote Uj = MGj + PGj as the total unobserved genetic effects for the jth relative and U = (U1, U2,…,Un) as the set of unobserved genetic effects for the family. Conditional on U, the familial trait values y = (y1, y2,…,yn) are independent.
We construct the likelihood of the family’s trait data L(y1,y2,…yn) as a function of the fixed and unobserved effects. We first integrate this likelihood across the unobserved genetic effects U such that
We assume a subject’s trait value conditional on the unobserved genetic effects follows an exponential-family distribution such that L(yj|U) = f(μj,) where f(•) is an exponential-family density function, μj = E[yj|U] is the conditional trait mean, and denotes nuisance parameters. The exponential-family distributions include gamma, binomial, Poisson, and normal distributions (McCullagh and Nelder 1983).
To model the relationship between yj|U and both the fixed and random effects, we employ a link function g that models μj on Xj and Uj such that
Here, β denotes a vector of regression coefficients for the covariates Xj. For simplicity, we assume Xj contains an intercept. Once we specify the link function g in (2), we can reparameterize the likelihood in (1) as a function of β and Uj.
The specification of both f(μj,) and g(μj) depends on the trait distribution. For a normally-distributed trait, f(μj,) is a probability density function for a normal distribution with mean μj and variance , while the link function is the identity link g(μj) = μj. For a binary trait, f(μj,) is a probability mass function for a Bernoulli distribution with mean μj and the link function is the logit link g(μj) = logit(μj). For a gamma-distributed trait, f(μj,) is a probability density function for a gamma distribution with mean μj and scale parameter ν, while g(μj) is either a log link function g(μj) = log(μj) or a reciprocal link function . One typically applies the log link function, since the reciprocal link function can lead to unstable estimation procedures. We note that examples of f(μj,) and g(μj) for other distributions are found in McCullagh and Nelder (1983).
The final step in constructing likelihood (1) is specification of the distribution of the unobserved genetic effects U . As the genetic effects will induce similarity among relatives, there will be covariance among the Uj for different relatives in a family. As shown by Jacquard (1974) and Hopper and Mathews (1982), we can write the covariance for two non-inbred relatives j and k as
Here, πjk denotes the proportion of alleles shared identical by descent (IBD) at the major gene by j and k. A relative pair shares two alleles IBD at a locus if the alleles are physical copies of the same ancestral allele. For autosomal loci, the proportion of alleles shared IBD by a relative pair at a particular locus is equivalent to the number of alleles shared IBD by the pair divided by 2. Generally, we cannot observe πjk but can estimate it using a multipoint algorithm based on available genetic marker data (for example, the algorithms of Lander and Green (1987) and Fulker et al. (1995)). 2ϕjk is the expected proportion of genes shared IBD by the relative pair, where ϕjk is called the kinship coefficient (Jacquard 1974). For any relative pair, 2ϕjk is a known function of the relationship of the pair. For example, 2ϕjk = 1 for monozygotic-twin pairs, 0.5 for full-sib pairs and parent-offspring pairs, 0.25 for half-sib pairs and avuncular pairs, and 0.125 for full-cousin pairs.
Using the covariance structure in (3), we assume that the random genetic effects U follow a multivariate normal distribution with mean vector 0 and covariance matrix
Here, Π and 2Φ are n x n matrices with (j,k)th elements πjk and 2ϕjk, respectively. VC studies typically assume a multivariate normality assumption for U and this assumption certainly will hold if the underlying genes act additively and independently on the trait (Lange 1978; Lange and Boehnke 1983).
While ordinal data do not follow one of the traditional exponential-family distributions described in the previous section, we can still implement a variance-component procedure for pedigree analysis of such categorical data. Suppose yj denotes an ordinal outcome that takes one of M possible values. We assume there is a clear ordering of these values such that, for example, larger outcome values would denote increased severity compared to smaller outcome values. We model the ordinal outcome using a variation of the proportion odds model as described in McCullagh (1980). Let denote the probability that relative j has ordinal outcome m (m=1,…,M) conditional on the random genetic effects and define as the corresponding cumulative probability that relative j has an ordinal outcome in the range between 1 and m. Using these definitions, we can model the ordinal data by fitting M−1 proportional odds models
where θm denotes a specific intercept for category m. Here, we assume the same slope β for each of the M−1 proportional odds models. If interest exists, we could also generalize the model in (5) to model different slopes β(m) for each category, which corresponds to fitting M−1 cumulative logit models.
which is a function of parameters (β,θ) in (5). We complete specification of the likelihood (1) for ordinal data by assuming (as done previously) that the random genetic effects U follow a multivariate normal distribution with the covariance structure shown in (4).
For a sample of I independent families, we construct the full likelihood as where L denotes the likelihood in (1) and i indexes family. We use LF to obtain estimates of fixed and random effects that maximize the likelihood. If the trait follows a normal distribution, then the integral in likelihood (1) has a closed-form solution and maximum-likelihood inference (Lange et al. 1976) is straightforward. However, if the trait follows a non-normal distribution, the integral does not have a closed-form solution, which complicates inference.
To resolve the intractability of the integral for non-normal data, statisticians have developed inference methods that maximize an approximate version of LF. Such procedures include the Solomon-Cox (1992) approximation, penalized quasi-likelihood (Breslow and Clayton 1993), and Gibbs sampling (Zeger and Karim 1991; Burton et al. 1999). Here, we apply an approximate maximum-likelihood approach for inference called adaptive Gaussian quadrature (Pinheiro and Bates 1995) that is implemented in the SAS (1999) procedure NLMIXED. This method approximates the integral in LF. using a weighted sum over predefined weights for Ui (Gauss-Hermite integration). Adaptive Gaussian quadrature then maximizes the approximate likelihood using one of several optimization algorithms. We use a quasi-Newton maximization procedure, which is the default algorithm implemented in NLMIXED.
Accurate likelihood approximation in adaptive Gaussian quadrature requires a suitable set of evaluation points (known as quadrature points) and their corresponding weights. In NLMIXED, one can either directly choose the number of quadrature points for analysis or let the program adaptively select the appropriate number. Adaptive Gaussian quadrature with one quadrature point corresponds to a Laplace approximation (Breslow and Clayton 1993), which is computationally fast but may not accurately approximate the likelihood. As the number of quadrature points increases, the likelihood approximation becomes more accurate, but the complexity of the maximization algorithm also increases, leading to longer computer run times.
To test whether a trait outcome is heritable within the family sample, we perform the hypothesis test . We test this hypothesis by calculating the likelihood ratio statistic, 2loge(A/0) where A and 0 are the maxima of LF fit under the alternative and null hypotheses. Under the null hypothesis, is set to 0 while, under the alternative hypothesis, is estimated together with the other unknown parameters ( is not modeled under either hypothesis). As the value of under the null hypothesis is on the boundary of the parameter space, the likelihood ratio statistic is asymptotically distributed as a 1/2:1/2 mixture of and a point mass of zero under the null hypothesis (Self and Liang 1987). As an alternative to a likelihood-ratio test, we also can employ a score statistic (Lin 1997) that has the appealing feature of robustness in the presence of random-effect distribution misspecification.
The hypothesis test for linkage at the major gene is . We test this hypothesis by calculating a likelihood ratio statistic, which is also asymptotically distributed as a 1/2:1/2 mixture of and a point mass of zero under the null hypothesis. Under the null hypothesis, is set to 0 while, under the alternative hypothesis, is estimated together with the other unknown parameters.
One often employs non-random sampling in genetic studies to increase the information content of the sample. For a family-based analysis of a rare genetic disease, a common ascertainment-sampling scheme is to collect families with at least one or at least two affected relatives. For a quantitative trait, a common sampling scheme is to collect families with a proband whose trait value is found in one or the other tail of the population distribution.
If one applies the random-sampling LF to ascertained data, results often will be biased (Burton et al. 2000). Therefore, given non-random sampling, we adjust LF to account for the ascertainment criterion by dividing the unconditional likelihood by the probability of the ascertainment event. Let ASCi denote the ascertainment criterion for family i. The proper ascertainment-adjusted likelihood for this family takes the form:
and the ascertained-adjusted likelihood for I sampled families takes the form , where Li,ASC denotes the ascertained-adjusted likelihood for the ith family shown in (7). Using this ascertainment-adjusted likelihood, we expect to obtain unbiased population-based estimates of (β̠, , , )(Epstein et al. 2002; Glidden and Liang 2002), although certain situations may occur where the estimates are not identifiable; an example of this for binary data can be found in Epstein (2002).
We used simulated pedigree data to examine bias, type-I error, and power of our general VC framework for genetic analysis of both continuous and categorical outcomes. We assumed variable number of simulated pedigrees consisting of sibships of various sizes. For a given sibship, we simulated marker data by inserting a major gene at 55 cM on a 110 cM chromosome. We simulated a 10 cM map of 12 genetic markers each with four equally-frequent alleles. For each locus, we randomly assigned alleles to the parents of the sibship, after which we created offspring genotypes using the Haldane mapping function. We then removed the parental genotypes from the dataset.
We examined the performance of our VC framework for pedigree analysis of both continuous and categorical outcomes. For continuous outcomes, we simulated gamma-distributed data using the model in (2) where we assumed g(μ) = log(μ) and the covariance matrix Σ in (4). For fixed effects, we assumed an intercept with β = log(2) , a categorical covariate with two equally-frequent outcomes that has an effect of βcat = 0.5 , and a continuous covariate derived from a standard normal distribution that has an effect of βcon = 0.1. We assumed the scale parameter of the gamma distribution to be ν=2.0 and varied (, ) among different combination of values to investigate the performance of our gamma VC model to detect heritability and linkage.
We analyzed the gamma-distributed data using two different procedures. First, we analyzed the untransformed data using the gamma variance-component method. Then, we applied a logarithmic transformation to the data to obtain approximate normality and applied the traditional variance-component method for analysis. For linkage analyses, we estimated IBD sharing at the major-gene locus using the Lander-Green (1987) algorithm as implemented in Genehunter (Kruglyak et al. 1996). We performed pedigree analysis for each procedure by maximizing the appropriate likelihood in the SAS procedure NLMIXED.
For categorical pedigree analysis, we simulated a three-level ordinal outcome (with possible values 0, 1, and 2) using the proportional-odds model in (5) and the likelihood in (6). Within (5), we set θ0 = −1 and θ1 = 1, such that we would expect to observe levels of 0, 1, and 2 in approximately 25%, 50%, and 25% of our sample, respectively. We assumed a categorical covariate with two equally-frequent outcomes that has an effect of βcat = 0.2, and a continuous covariate derived from a standard normal distribution that has an effect of βcon = 0.05. We then varied (, ) among different combination of values. For linkage analysis, we again used the Lander-Green algorithm to estimate IBD sharing. We implemented the procedure in SAS procedure NLMIXED.
Ovarian insufficiency is a complex disorder that encompasses a variety of conditions related to reproductive dysfunction within women. Such reproductive issues include premature ovarian failure (cessation of menses prior to age 40), altered menstrual-cycle characteristics, and infertility. Many researchers have shown that a premutation within the fragile X mental retardation 1 gene (FMR1) leads to an increased risk for ovarian insufficiency (Murray et al. 2000; Sullivan et al. 2005; Allen et al. 2007). The FMR1 premutation is clinically defined as having 55–199 unmethylated CGG repeats in the 5' untranslated region of the gene (Sherman et al. 2005). When the premutation further expands to over 200 repeats during transmission from mother to child, it becomes a full mutation that hypermethylates the FMR1 gene and leads to the separate disorder of fragile X syndrome. Interestingly, studies have shown that the FMR1 full mutation does not associate with ovarian insufficiency; hence, only premutation carriers have increased risk for this condition (Sherman et al. 2007). This disorder is now commonly referred to as fragile X-associated primary ovarian insufficiency (FXPOI).
The FMR1 premutation plays a substantial genetic role in the development of ovarian insufficiency. For example, the prevalence of premature ovarian failure in FMR1 premutation carriers is 20% whereas it is only 1% in the general population (Sherman 2000). However, once adjusted for FMR1 premutation status, it is unclear how much of the remaining residual variation in ovarian-insufficiency outcomes are explained by additional genetic factors. If significant residual heritability of these outcomes exists, this would motivate the formation of additional studies for further gene mapping of these outcomes.
To investigate whether these reproductive outcomes have significant residual heritability after adjusting for FMR1 premutation status, we applied our VC framework to a reproductive-history dataset consisting of 680 women from 225 families who have a history of fragile X syndrome (and, hence, are enriched for being premutation carriers) and 321 women from 219 families from the general population. We focused on the outcome of menstrual-cycle length in the last year of natural cycling. The distribution of this outcome in the dataset is shown in Figure 3. The estimated skewness of the outcome was 8.48 and the estimated kurtosis was 93.31. We obtained a p-value of p<0.0001 for a test of normality using a Shapiro-Wilk test.
We were unsuccessful in our attempts to transform menstrual-cycle length to approximate normality using log, square-root, or reciprocal transformations. Therefore, as done previously in Allen et al. (2007), we formed a 3-level ordinal variable for menstrual-cycle length based on the 25th and 75th percentiles of the outcome distribution. We defined a short cycle as a menstrual length less than or equal to 25 days, an average cycle to be a length between 25 and 30 days, and a long cycle to be a length greater or equal to 30 days. Using this ordinal variable, we examined whether menstrual-cycle length had significant residual heritability after adjusting for FMR1 premutation status by applying the ordinal VC model in (5) to this outcome. Using the likelihood in (1) and the model in (5), we tested the hypothesis after adjusting for the fixed effects of FMR1 repeat size, age when last cycled, smoking, and ethnicity (Caucasian or non-Caucasian). As done previously (Allen et al. 2007), we modeled FMR1 repeat size as a categorical variable with four levels: <59 repeats (normal), 59–79 repeats (low premutation group), 80–100 repeats (medium premutation group), and >100 repeats (high premutation group).
Table 1 and Table 2 summarize results of heritability analyses of gamma-distributed data for 200 sibpairs using both our proposed gamma VC model and the traditional VC model (after a log transformation). Within Table 1, we show the power of the two models to detect a genetic effect under different levels of . We chose values of corresponding to heritability values of 0, 0.10, 0.20, and 0.30 for the log-transformed data. Based on these results, we see that both approaches have appropriate type-I error when the null hypothesis of holds. However, under alternative models, we observe that the gamma VC method is substantially more powerful for testing for heritability compared to traditional VC analysis of the log-transformed data. This result is further supported by Table 2, which shows parameter estimates and estimated standard errors for model parameters. Both VC approaches yielded unbiased parameter estimates and estimated standard errors that closely mirrored the empirical standard errors. However, we observe that the estimated standard errors for under the gamma VC model are substantially smaller than those from the traditional VC model, which further demonstrates the improved efficiency of the gamma VC model for testing heritability.
While the gamma VC yields improved performance for testing for an overall genetic effect compared to traditional VC analysis of log-transformed data, we interestingly observe that both procedures yield similar inference for testing linkage. We simulated datasets comprised of 400 sibpairs and generated data under the gamma VC model using simulated values of (, ) that, for log-transformed data, corresponded to major-gene heritability values (defined as the proportion of trait variance explained by the major gene) of 0 and 0.05 and overall heritability values (defined as the proportion of trait variance explained by major gene and polygenes) of 0.30 and 0.50. As observed in Table 3, we see that both VC methods have appropriate type-I error when the major-gene heritability is 0 and have similar power for testing a major-gene locus whose major-gene heritability is 0.05. We observed these results for overall heritability values of both 0.30 and 0.50.
Table 4 shows parameter estimates derived from the gamma and traditional VC models for simulation models where the major-gene heritability values are 0.05. For each VC model, we observe that mean parameter estimates are unbiased and the mean standard errors generally match the empirical standard errors (although the mean model-based standard errors for the variance parameters slightly overestimate the empirical values). As in Table 2, we again observe that the estimated standard errors for are smaller for the gamma VC model than for the traditional VC model. For , we observe that the gamma VC model yields smaller standard errors for estimates of this parameter compared to the traditional VC model when the overall heritability of 0.30. For overall heritability of 0.50, the two VC models yield similar standard-error estimates for .
Table 5 shows parameter estimates from application of the ordinal VC model to 3-level ordinal outcomes from simulated datasets comprised of 300 sibtrios that were generated under one of two models. Consistent with our results from the gamma VC simulations, we found that mean parameter estimates are unbiased. For fixed effects, we also found that model-based standard errors corresponded well with empirical values. However, for variance parameters, we found that model-based standard errors for the variance parameters differed somewhat from empirical values. This result suggests that one should use score or likelihood-ratio tests, rather than Wald tests, for testing heritability and linkage under the ordinal VC model.
Using likelihood-ratio statistics, we found that our ordinal VC model had excellent power for testing heritability in datasets consisting of 3-level ordinal data on 300 sibtrios. In choosing simulation values for , we note that corresponds roughly to the average spread of risk of being in level 1 (compared to level 0) or being in level 2 (compared to levels 0–1) due to within-family polygenic effects (Pankratz et al. 2005). We assumed values of of 0.5, 1.0, and 2.0, which corresponded to ~2, ~2.7, and ~4 fold increase or decrease in per-family risk compared to the overall risk. For of 0.5, 1.0, and 2.0, we found that we had power of 0.57, 0.90, and 0.99, respectively, at α=0.05 to detect the overall genetic effect. We also found our approach had appropriate type-I error under the null hypothesis (empirical size of 0.047 at α=0.05). For testing linkage , we found that our ordinal VC model had appropriate size at α=0.05 and α=0.01 for datasets consisting of 300 sibtrios assuming values of ranging between 0.5 and 2.0 (results not shown). Power to detect linkage under these simulated models increased with increasing values of . For example, assuming a total genetic variance , we had power of 0.25 to detect linkage when and power of 0.53 to detect linkage when at α=0.05.
Table 6 shows results of the application of the proportional-odds ordinal VC model to the reproductive-history dataset. Supporting the work in the previous studies of Sullivan et al. (2005) and Allen et al. (2007), we found that FMR1 premutation status was significantly associated with menstrual-cycle length. In particular, compared to FMR1 normal carriers, we observe that low- and medium-premutation carriers showed decreased risk for medium cycles (compared to short cycles) and long cycles (compared to medium and short cycles) (p=0.012 for low premutation carriers, p=0.005 for medium premutation carriers). Interestingly, high-premutation carriers showed no change in risk compared to normal carriers (p=0.501). Using a likelihood-ratio statistic, we tested for the existence of a residual polygenic effect by considering and found significant evidence of additional genetic factors that influence the outcome (p=0.027). Our ordinal VC model yields a variance estimate of , which corresponds to a ~2.6 fold change in per-family risk compared to the average risk. This finding supports previous work by Hunter et al. (2008), who found evidence for additional genetic factors influencing age of menopause (another indicator of ovarian function) after adjusting for FMR1 premutation status in the same set of samples.
In this paper, we show a general VC framework for pedigree analysis of continuous and categorical outcomes that do not follow the multivariate normal distribution assumed by traditional variance-component procedures. Using simulated data, we demonstrate the procedure returns unbiased estimates of both fixed and random effects and has appropriate size for testing heritability and linkage of a trait of interest. For continuous non-normal outcomes, such as those that follow a gamma distribution, we also show that our proposed approach can lead to improved power for testing certain genetic hypotheses relative to traditional VC methods. We also demonstrate that our approach is applicable to real genetic datasets, as shown by our heritability analysis of ordinal menstrual-cycle data from a genetic study of ovarian insufficiency.
We developed our general VC framework for the purpose of heritability or linkage analysis of a complex trait of interest. We could also extend our framework to test whether a specific genetic marker is in linkage disequilibrium (LD) with the trait of interest; that is, associated with the trait in the presence of linkage. We would base this LD extension on the work of Abecasis et al. (2000), who extended the traditional VC framework to test for marker association at a linked gene locus for nuclear families. The authors assumed the trait data conditional on offspring and parental genotypes at the marker of interest follow a multivariate normal distribution with a given mean structure and covariance matrix. They modeled the parentally-transmitted alleles of the offspring as a covariate in the mean structure, while modeling the effects of the linked gene locus and polygenes in the covariance structure (as shown in equations 3 and 4). The authors partitioned the offspring genotype effects into between-family and within-family components and showed that a test of the within-family genotype effect was a valid test of LD in the presence of potential confounders such as population stratification. Using these ideas, we could modify the general VC framework in similar fashion to develop a procedure for LD mapping of general continuous and categorical outcomes. We will explore this idea in a subsequent paper.
For non-normally distributed outcomes, our general VC framework yields a likelihood that is intractable due to the fact that the integration of the likelihood over the random genetic effects does not yield a simple closed-form solution. To resolve such intractability in analysis, we used approximate maximum-likelihood procedures as implemented in SAS PROC NLMIXED. However, we can use other statistical algorithms to circumvent this issue. We could maximize the likelihood using the Gibbs sampler (Zeger and Karim 1991; Burton et al. 1999) as implemented in the computer program WinBUGS (Spiegelhalter et al. 2000). However, the procedure is computationally slower than maximum-likelihood procedures and cannot, under ascertainment sampling, maximize the correct ascertainment-adjusted likelihood (6) (Epstein et al. 2002). One could employ the method of penalized quasi-likelihood (Breslow and Clayton 1993). For this method, one approximates the integral using Laplace’s method (Tierney and Kadane 1986) to obtain an approximate quasi-likelihood that can be modified into a penalized quasi-likelihood for maximization (Green 1987). While this method works well in many situations, it tends to underestimate variance parameters when the trait of interest is categorical (Breslow and Clayton 1993). Also, one likely will have difficulty constructing the appropriate ascertainment-adjusted penalized quasi-likelihood for non-random samples.
For non-normal data, our proposed general variance-component framework maximizes the appropriate likelihood using adaptive Gaussian quadrature. As this maximization procedure requires numerical integration, it is computationally more intensive than the typical maximization procedures (Fisher Scoring or Newton Raphson) used for the traditional variance-component method. The degree of the computational complexity for our method depends on both the number of quadrature points used for likelihood approximation and the dimension of the integral in the likelihood. As both of these quantities increase, so does the amount of computer time required for successful likelihood maximization. The number of quadrature points needed for accurate likelihood approximation depends on the type of data analyzed as fewer points are needed for continuous data relative to discrete data. The dimension of the likelihood integral will increase with the size of the family. Therefore, our proposed framework may have prohibitive computer-run times for linkage analysis of discrete datasets from large families given current computational resources.
To reduce the amount of computer time for linkage analysis using our general variance-component method, one could perform the analyses using a Laplace approximation, which corresponds to 1 quadrature point. For a Laplace approximation, maximization of the likelihood for our general variance-component method typically only requires a few seconds. For continuous non-normal data such as gamma data or censored normal data, the resulting variance-component analyses typically have the same linkage power and type I error as compared with the same variance-component analyses using seven quadrature points, although we observe bias in the variance estimates (Epstein et al. 2003). Therefore, we can use a Laplace approximation to perform an efficient genome scan for identifying regions linked to these continuous traits. We would then recommend repeating the analyses on these linkage regions using more quadrature points to obtain unbiased estimates of these variance parameters. While the Laplace approximation is suitable for linkage tests for continuous non-normal data, the same cannot be said for discrete binary traits. Simulation results (not shown) reveal the Laplace approximation yields tests for binary traits that have little or no power to detect linkage. Therefore, a Laplace approximation appears to be useful for variance-component linkage analysis only when one is studying continuous data.
This research was supported by the University Research Committee of Emory University (to M.P.E.) and National Institutes of Health Grants R01 HG03618 (to M.P.E.), R01 HD29909 (to J.E.H., E.G.A. and S.L.S.), R29 CA76404 (to X.L.), and R01 HG00376 (to M.B.).
The original publication is available at www.springerlink.com