Non-random missing data can adversely affect family-based linkage detection through loss of power and possible introduction of bias depending on how censoring is modeled. We examined the statistical properties of a previously proposed quantitative trait threshold (QTT) model developed for when censored data can be reasonably inferred to be beyond an unknown threshold.
The QTT model is a Bayesian model integration approach implemented in the PPL framework that requires neither specification of the threshold nor imputation of the missing data. This model was evaluated under a range of simulated datasets and compared to other methods with missing data imputed.
Across the simulated conditions, the addition of a threshold parameter did not change the PPL's properties relative to quantitative trait analysis on non-censored data, except for a slight reduction in the average PPL reflecting the lowered information content due to censoring. This remained the case for non-normally distributed data and extreme sampling of pedigrees.
Overall, the QTT model showed the smallest loss of linkage information relative to alternative approaches and therefore provides a unique analysis tool that obviates the need for ad hoc imputation of censored data in gene mapping studies.
Many human clinical disorders have non-random missing data in quantitative phenotypic measurements used in gene mapping studies. Systematically missing data may arise, for example, from treatment effects: effective blood pressure medication renders measurements of the underlying pathological blood pressure state unreliable. Sometimes a measurement may be confounded with disease status, such as hearing acuity tests not being applicable to individuals unable to hear. In all such cases, the missing data censor the observed trait distribution. While it is always possible to ignore censoring by, at one end of the spectrum, continuing to use quantitative values contaminated by treatment effects, or, at the other end, removing those subjects from the analysis, these options are expected to reduce gene mapping power in families just as they reduce power and produce biased parameter estimates in population-based epidemiology cohorts of unrelated individuals.
Many formalized statistical methods to account for censored observations in population-based data exist; for family data, fewer methods are available [2–8], and some methods for censoring assume a survival analysis context, making it unclear how they may be applied in other settings [2,5]. Despite the availability of methods in the literature, family-based studies underutilize these options since they are not yet implemented in widely available software, thus promoting the use of ad hoc fixes. Since the most commonly used linkage software packages do not account for censoring using an a priori statistical model, the phenotypic data are sometimes manipulated, most commonly by converting censored observations to a fixed value chosen by the investigators, what Tobin and colleagues call a “sensible constant” (see also the following for examples [9,10]). The choice of sensible constant may be motivated by exogenous data (i.e., the mean phenotypic value for untreated subjects if treatment effects are the reason for censoring) or by one of many methods for single or multiple imputation not specifically designed for family-based analysis [11,12].
Here, we evaluate an approach to the analysis of censored data that avoids the need for post hoc sensitivity analysis, since the method does not require a sensible constant to be defined. The model was defined with the application of two principles. First, for gene mapping studies in pedigrees, phenotyped relatives may mitigate power loss, since shared genetics and shared environment predict missing phenotypes in proportion to the magnitude of those shared effects. Second, if the direction of the censoring is known or reasonably inferable, as with treatment effects, that information may also be incorporated into the analysis, since it effectively limits the range of possible values for the missing data. Here, we examine the properties of a quantitative trait threshold (QTT) model applied only to the censored observations. Similar to the concept of the liability threshold model, where affected individuals are assumed to have genetic liability beyond an undefined threshold, here we assume that censored individuals have trait values beyond an unspecified threshold, with uncertainty in this threshold parameter handled in the model. We then proceed with the calculation using that threshold model for censored individuals, while using a standard quantitative trait analysis model for the remaining observations, when building the likelihood of the pedigree for linkage analysis. These principles were incorporated into the PPL framework for computation. Originally formulated as a Bayesian linkage method (PPL stood for the posterior probability of linkage), it is currently best characterized as a family of related methods that differ only in the addition or subtraction of trait and/or genetic model parameters from a common pedigree likelihood engine that can accumulate evidence for linkage and/or association. For this reason, “PPL” is not fully appropriate as an acronym, but it is still useful as a unique identifier for the framework.
The PPL framework is adaptable for modeling complex disease genetics including, as dictated by the user, linkage, association, joint linkage-association, epistasis, imprinting, and sex-specific recombination, applied to either categorical or quantitative traits. It has been successfully applied to a variety of complex gene-mapping efforts, including diseases such as autism, schizophrenia and specific language impairment [16–19]. By design, the PPL framework directly measures the probability of a trait gene linked to the marker (or location) being evaluated. It is on the probability scale (0 to 1) and can be readily interpreted as a probability. Characteristics of the PPL, as originally derived and initially applied, along with further computational enhancements, are described elsewhere [20–24].
When analyzing quantitative traits, the PPL is parameterized the same as the quantitative trait likelihood in the traditional LOD score, with trait parameters including one (biallelic) trait allele frequency, three genotypic means and corresponding variances. Transmission parameters are the same as in the traditional LOD score. An additional admixture parameter is included to form a heterogeneity LOD score. These nuisance trait parameters are then integrated out of the likelihood. We have shown that this model is quite robust for oligogenic traits, and a further benefit of this form of likelihood is that it is also essentially ‘model-free’ with respect to the distribution of the trait in the population. This issue is of concern when using some alternative approaches such as variance components (VC) methods, which generally show increased false positive rates (compared to nominal significance levels) when the underlying normality assumption is violated. Additionally, because all forms of the PPL are inherently ascertainment corrected (see Methods for details), we do not need to correct for ascertainment as in VC analysis, and we do not need to specify or estimate population parameters such as the population mean and variance as in pedigree-based regression.
The QTT model, as presented in the literature review and specified more fully in the Methods section, uses the same principles as the PPL quantitative trait model but with the addition of a threshold parameter to model the censored data. The PPL was previously extended to include a threshold parameter to handle specific forms of censoring and has been applied in several contexts as part of larger hypothesis-driven research questions [13,18,29]. Using a series of simulations, we examined the properties of the PPL when analyzing censored data where the missing data point can be reasonably inferred, based on prior knowledge, to have a value above or below the unknown censoring threshold. Using the QTT model does not require estimation or imputation of the missing phenotypes or the censoring threshold. We also analyzed the same simulated data using a sensible constant approach to empirically evaluate how this common ad hoc method performs compared to the QTT model.
In this section, we 1) show the modification of the quantitative trait LOD score likelihood that underlies the PPL computation, 2) discuss how the PPL is calculated from the LOD scores and 3) describe the simulation methods for assessing model performance.
The pedigree likelihood used as the basis for PPL computation is described in full detail elsewhere. Briefly, starting with the pedigree linkage likelihood that forms the basis of the familiar parametric LOD score, the trait model parameters include a recombination fraction (θ), disease allele frequency, and a vector of penetrances, P(xi|gj), the probability of disease (xi) for person i, given the 3 possible trait genotypes (gj) indexed by j. The penetrance function, P(xi|gj), is the link between genotype and phenotype that is reparameterized to allow the likelihood to handle quantitative trait data (the transmission parameters from the linkage likelihood are not changed). For quantitative traits, P(xi|gj) is replaced with ϕ(X=x|μj, σ2j), where j indexes the three genotypic distributions for trait genotypes AA, Aa, and aa, and ϕ is the normal density. Note that the quantitative trait (QT) model does not assume normality of the overall observed trait data, since the QT model is a mixture of 3 normal distributions, one per genotype, each weighted by the genotypic frequency. Therefore, this model may imply markedly non-normal (even tri-modal) trait data. In practice, however, for reasonable effect sizes generally encountered in complex disease human genetics and simulated here, this mixture model implies overall trait data that do not grossly depart from normality.
For the QTT model, the underlying QT likelihood is modified according to each person’s censoring status. If the person has an observed quantitative trait value, then P(xi|gj) = ϕ(X=xi|μj, σ2j), the same as the QT model (previous paragraph). However, if the person is coded categorically as “censored,” then we replace the density ϕ(X=xi|μj, σ2j) with the tail probability P(X<t|μj, σ2j) if the value is assumed to be below the threshold t, and P(X>t|μj, σ2j) if it is assumed to be above.
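As a minimal sketch of this per-person likelihood term (hypothetical function and parameter names, not Kelvin's internals), the observed case uses the genotype-specific normal density, while the censored case uses a tail probability with respect to the threshold t:

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x; one such component per trait genotype."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def normal_cdf(x, mu, var):
    """P(X < x) for X ~ N(mu, var), computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2 * var)))

def penetrance(obs, mu_j, var_j, t=None, above=True):
    """Per-person likelihood contribution under genotype j.

    obs: observed trait value, or None if the phenotype is censored.
    t: threshold parameter (integrated out in the full QTT model).
    above: direction of censoring (True => value inferred to exceed t).
    """
    if obs is not None:                 # QT case: density at the observed value
        return normal_pdf(obs, mu_j, var_j)
    if above:                           # censored, inferred above the threshold
        return 1.0 - normal_cdf(t, mu_j, var_j)
    return normal_cdf(t, mu_j, var_j)   # censored, inferred below the threshold
```

Note that the censored case contributes a probability mass (a unitless tail area) rather than a density, which is what makes the threshold direction, rather than the exact missing value, the only information required.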
All QTT calculations were performed with Kelvin, a package for linkage and association analysis of data sets that can include mixes of pedigrees, trios and case-control data (for an overview of Kelvin and its capabilities see ). In this study, we only employed linkage routines on pedigrees. Briefly, the pedigree likelihood is used to calculate heterogeneity LOD scores at many combinations of trait parameter values, and these scores are the fundamental quantities used for numerical integration over the trait parameters. Numerical integration of the model parameters results in an integrated likelihood ratio (often called a Bayes Ratio, or BR, in the PPL literature) that is input into Bayes theorem to quantify the strength of the linkage evidence on the probability scale. As more data are collected, new results are combined with existing results through a procedure called sequential updating. Sequential updating as more data are added is accomplished by multiplication of the BRs across data sets and recomputation of the PPL. In this way, nuisance parameters are integrated out of the BR independently for each data set, and they are, therefore, allowed to vary across pedigrees, which is advantageous for genetically heterogeneous disorders.
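A schematic of this updating rule, with illustrative names: per-dataset BRs are multiplied, and the product is passed through Bayes theorem to yield a PPL. The default prior of 0.02 is a value used in the PPL literature and is shown here only as an illustrative assumption:

```python
def ppl_from_br(br, prior=0.02):
    """Posterior probability of linkage from an integrated Bayes Ratio.

    prior: prior probability of linkage (0.02 is an illustrative value
    drawn from the PPL literature, not a setting required by the method).
    """
    return prior * br / (prior * br + (1.0 - prior))

def sequential_update(bayes_ratios, prior=0.02):
    """Combine independent datasets: multiply BRs, then recompute the PPL."""
    product = 1.0
    for br in bayes_ratios:
        product *= br
    return ppl_from_br(product, prior)
```

Because only the BRs are multiplied, each dataset's nuisance parameters remain integrated out locally, which is why heterogeneity across datasets is tolerated.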
In calculating the PPL using the QTT model, the parameter t represents the threshold value beyond which an observation is considered censored. Analytically, integration over the parameter t has the limits 0 → ∞ if t is assumed to be above the population mean, and −∞ → 0 otherwise. However, due to the restrictions of the numerical integration required for calculating the PPL, limits are imposed based on the accuracy of the desired result. For approximately normally distributed data, numerical integration from 0 → 3 provides the same answer, to within 6 decimal places, as numerical integration from 0 → 6 (similarly, integration from −3 → 0 gives the same answer as −6 → 0). The prior distribution on t is uniform, the same as for the other model parameters, discussed in more detail for the QT model elsewhere. Therefore, t is numerically integrated out of the likelihood with the other nuisance model parameters in the same way. Hence the potential strength of the QTT model is that the value of t needs to be neither specified in the analysis nor estimated from the data.
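As an illustrative sketch of this step (Kelvin's actual quadrature scheme is more sophisticated), averaging the likelihood over a uniform prior on t with a simple midpoint rule on the truncated range 0 → 3:

```python
def integrate_threshold(likelihood_given_t, lo=0.0, hi=3.0, steps=300):
    """Integrate the likelihood over t ~ Uniform(lo, hi) by the midpoint rule.

    likelihood_given_t: callable returning the pedigree likelihood with the
    threshold fixed at t (all other nuisance parameters handled elsewhere).
    With a uniform prior, the integral reduces to a simple average.
    """
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        t = lo + (k + 0.5) * h       # midpoint of subinterval k
        total += likelihood_given_t(t)
    return total / steps
```

Because t is averaged out alongside the other nuisance parameters, no single threshold value is ever reported or required.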
Simulated pedigree data (phenotypes and genotypes) were generated using the same methods, and an extended set of model parameters, employed to evaluate the original quantitative trait model (no threshold) for the PPL. Briefly, we used SIMNUCLEAR (H. H. Göring) to generate data sets of 5-person nuclear families, where the number of families, N, was fixed at a different value for each ascertainment scheme. The four ascertainment schemes were: families ascertained randomly, or based upon the presence of at least one proband child whose quantitative score exceeded 1 SD above the population mean (ASC 1), 2 SD (ASC 2) or 3 SD (ASC 3). The sample size N for each ascertainment scheme was chosen to ensure the average PPL at a linked marker was not near the probability scale boundaries (0 and 1), which could have obscured comparisons. For the four ascertainment conditions, random, ASC 1, ASC 2, and ASC 3, the sample sizes were N=500, 200, 100 and 50, respectively.
Marker data was simulated for two independent chromosomes so that it was possible to simulate either a single-locus trait and a null chromosome (to assess performance of the QTT model when there is no linkage) or a two-locus trait model. Linked markers were always θ = 0.01 from the disease locus, and in the case of two-locus models, the trait loci were on separate chromosomes, and, therefore, unlinked (θ = 0.5).
The trait genotypic parameter values for the loci were either 1) “Additive”: P(A) = 0.7, μAA = −0.41, μAa = 0.28, μaa = 0.97, σ2AA = σ2Aa = σ2aa = 0.33; 2) “Dominant”: P(A) = 0.5, μAA = −0.26, μAa = −0.26, μaa = 0.77, σ2AA = σ2Aa = σ2aa = 0.33; or 3) “Null”: μAA = μAa = μaa (no linkage). These additive and dominant models have an effect size of 0.2 (i.e., 20% of the trait variance is attributable to the locus). Additionally, in the simulated datasets, the residual familial correlation (RFC) from polygenes was varied from 0 to 0.4 in increments of 0.1. For the remaining variance, a non-shared environment component was simulated. Effects from each locus, the polygenic variance component and the non-shared environmental component were summed to create the quantitative trait (i.e., no interaction effects). In total, there were 100 trait models simulated: five single- and two-trait-locus models (Additive-Null, Dominant-Null, Additive-Additive, Additive-Dominant, Dominant-Dominant) by five levels of residual sibling correlation by four levels of ascertainment.
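A rough sketch of how one individual's trait could be composed under the “Additive” model: genotype drawn from Hardy-Weinberg proportions, then genotype mean plus polygenic and non-shared environment components summed with no interaction. The split of the residual variance into polygenic and environmental parts is simplified here; the exact mapping to the reported RFC values is not reproduced:

```python
import math
import random

ADDITIVE = {"p_a": 0.7, "means": (-0.41, 0.28, 0.97)}  # AA, Aa, aa

def simulate_trait(model=ADDITIVE, var_poly=0.1, var_resid=0.33):
    """One individual's trait value: genotype mean + polygenic + environment.

    var_poly is an illustrative polygenic share of the residual variance
    var_resid (the paper's sigma^2 = 0.33 within each genotype).
    """
    p = model["p_a"]
    u = random.random()
    if u < p * p:
        g = 0                                    # genotype AA
    elif u < p * p + 2 * p * (1 - p):
        g = 1                                    # genotype Aa
    else:
        g = 2                                    # genotype aa
    poly = random.gauss(0.0, math.sqrt(var_poly))            # polygenic part
    env = random.gauss(0.0, math.sqrt(var_resid - var_poly)) # non-shared env.
    return model["means"][g] + poly + env
```

In a full family simulation the polygenic component would be correlated among relatives, which is what generates the RFC; here it is drawn independently purely to show the additive composition of the trait.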
Simulations of non-normally distributed data were generated using the method of Bartlett and Vieland. Briefly, a random deviate for the genetic component is simulated assuming normality and then converted to a kurtotic distribution: the cumulative distribution function (CDF) of the simulated component is computed, the inverse CDF of, in this case, a t-distribution with 12 degrees of freedom is applied, and the result is standardized back to the μ=0, σ=1 scale prior to analysis with the PPL. This procedure has the advantage that the non-normal data can be compared to the normal data, since the two are monotonically related for each individual in the dataset, making correlation of results from analyses of the normal and non-normal datasets directly interpretable.
The censoring scenario we created was based on the assumption that “affected” individuals would have a treatment effect, thus necessitating removal of those quantitative trait values from the analysis. Assuming the ascertainment threshold in the simulations is the same as the diagnosis threshold of a treatable disease, the presence of extreme trait values in the simulated data indicates which family members will be censored. Any trait value greater than the ascertainment criterion was set to unknown (missing data). This procedure also ensures that every pedigree will have at least one censored data point, since every pedigree was required to have at least one child with a trait value above the ascertainment threshold.
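The censoring step itself is simple to sketch (an illustrative helper, not the authors' pipeline): values beyond the threshold are replaced by a missing-data code, while the direction of censoring remains known to the downstream analysis:

```python
def censor(trait_values, threshold):
    """Replace trait values above the ascertainment/diagnosis threshold
    with None (missing), mimicking removal of treated individuals' values.

    The QTT analysis only needs to know that each None lies above the
    threshold, not the value itself.
    """
    return [None if x > threshold else x for x in trait_values]
```

Applied to a family ascertained for one extreme child, this guarantees at least one censored observation per pedigree, matching the >20% missingness reported later.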
To assess performance of the QTT model, we compared the PPL using the QTT model on the censored data to the QT model using the original simulated data (no censoring). This comparison provides information on how well the QTT model recovers the missing information. We additionally compare the QT model on the original simulated data and on data where the censored observations were fixed to a “sensible constant.” For variance components analysis, we used SOLAR version 4.3.1, correcting for ascertainment by fixing the trait mean and standard deviation to their corresponding simulated population values. The MERLIN-REGRESS statistic was calculated with MERLIN v1.1.2. To quantify the change in a given test statistic, the absolute percent change from the baseline analysis was computed as |xbaseline − ximputed| / xbaseline.
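Written out unambiguously (the absolute value applies to the numerator), the comparison metric is:

```python
def abs_pct_change(baseline, imputed):
    """Absolute percent change of a statistic relative to its baseline:
    |baseline - imputed| / baseline, expressed as a percentage."""
    return abs(baseline - imputed) / baseline * 100.0
```

For example, a baseline PPL of 0.50 that drops to 0.45 under imputation is a 10% absolute percent change.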
We will discuss the performance of the QTT model for censored data in this section, always comparing results of a mathematical or simulated data manipulation to a baseline or control condition. The key metrics of performance are changes in average PPL across simulated generating models but within a testing condition, and the correlation in the average PPLs.
On collections of randomly ascertained families, the PPL performs as expected with all phenotypes known in the analysis (Table 1, QT model on the left), recapitulating previous PPL results. As in that study, there is a clear dependence of the average PPL on residual familial correlation, similar to what is documented in VC methods [33,34], where models with greater residual familial correlation are consistently associated with higher average PPLs. However, this dependence is reversed as ascertainment criteria become more extreme (Table 1, top to bottom): under the ASC 3 models, the average PPL decreases with increasing familial correlation. Application of the QTT model (Table 1, QTT on the right) does not alter any of the aforementioned observations. Though the average PPL is lower in all cases, reflecting the loss of phenotypic information due to censoring (Figure 1), the reduction in average PPL is slight.
We contrast the QTT model with the ad hoc “sensible constant” approach, whereby data missing due to censoring are imputed with a constant selected to be representative of all missing data based on a priori assumptions alone. We applied this procedure, choosing a constant of 3 SD above the population mean. Pedigree-based regression shows deflation of the test statistic with censored data (Figure 2A). Depending upon the ascertainment condition, variance components analysis (Figure 2B) shows either deflation or inflation of the test statistic: deflation occurs at ASC 3, but inflation occurs with ASC 1 and ASC 2.
For the PPL, the sensible constant procedure shows a high correlation between the QT model without censoring and the sensible constant model (Figure 2C); however, Figure 2D indicates that the QTT model performs better on average. The absolute percent change from the baseline analysis to analysis with the sensible constant was 8.7% for the PPL statistic, while regression had an absolute percent change from baseline of 12.5%, and for variance components, the average change was 58.9%. By way of contrast, the QTT model had an absolute percent change from the baseline analysis of only 6.8%.
The QT model within the PPL framework was previously shown to be robust to deviations from normality when analyzing quantitative traits, specifically when testing excess kurtosis. Similar testing of the QTT model indicates that it inherits this property (Figure 3A). Analysis of matched normal/non-normal data (see Methods) shows essentially no change in average PPL; as before, there is a decrease in the average PPL when using the QTT model, consistent with the reduction in phenotypic information from censoring, but non-normality does not appear to have a systematic effect on the robustness of the PPL using the QTT model. As a control condition, to show that the simulated non-normal data were suitably extreme to be a valid test of the PPL’s sensitivity to non-normality, the same data were analyzed with VC and pedigree-based regression. VC shows marked inflation (Figure 3B), indicating the simulated data were suitably non-normal to be a fair test of the PPL QTT model. Pedigree-based regression also shows inflation of the test statistic (data not shown), though it is known to be more robust to violations of the normality assumption.
The signature feature of the PPL framework is the flexibility to add in new data and accumulate evidence through sequential updating across data sets. This procedure increases sample size without assuming homogeneity, within or between datasets, since the PPL is based on the admixture likelihood and model integration occurs locally within each dataset. Only the posterior distribution of the parameter θ (i.e., the linkage information) is updated across datasets. The PPL has been proven to converge to 1 as more data are collected when linkage exists and to converge to 0 when linkage does not. While this convergence property is not expected to be affected by the inclusion of additional trait parameters, we assessed whether convergence was adversely affected by the addition of the threshold parameter. Within each generating condition, one replicate dataset was randomly chosen to be the “initial” dataset, and additional replicate datasets were sampled without replacement for sequential updating with the PPL from the last round of updating. The outcome variable is the number of datasets required to achieve a PPL of 1 when the data were generated assuming linkage. Figure 4 compares the rate of PPL convergence to 1 for the QT model on the full data versus the QTT model on the censored data. For all simulating conditions, the rate of convergence is virtually identical for the QT and QTT models, but analysis of censored data using the QTT model is slightly less efficient, as expected, since the PPL under the QTT model is slightly lower than under the QT model on average.
We have evaluated the QTT model that handles a fundamental complexity of real-world data; specifically, we examined missing phenotypes due to censoring when the missing data can be reasonably inferred to be above or below an unknown threshold. As hypothesized, the addition of a threshold parameter to the PPL framework appears to have little effect on the desirable properties of the PPL while adding a capability for analysis of censored data that was shown to be advantageous compared to alternative methods. The data demonstrate that the pattern of average QTT PPL values is highly similar to that of QT PPL values across widely disparate generating conditions, including dominance and a range of residual familial correlations, and also when analyzing non-normally distributed data. Additionally, including the threshold in the PPL shows the expected convergence to 1 when linkage is present, and not when linkage is absent, thus empirically demonstrating that the sequential updating feature of the PPL framework remains available under the QTT model. While not explicitly tested, the data also show that when pedigrees have been selected on the basis of an extreme trait, the PPL does not require additional information in the form of an ascertainment correction to capitalize on the sampling method, and this remains true when the analysis includes a threshold parameter.
The models we tested allow us to compare directly the QTT model applied to a dataset with >20% missing data (at least one censored observation per five-person nuclear family) to one where the missing data point is known (where no model of censoring is needed). Overall, the average PPL decreases slightly when using the QTT model, a change of 6.8% from the value of the QT PPL, on average, reflecting the reduced phenotypic information in the analysis of the censored data (i.e., data points that are missing in the QTT model data sets are known in the control data set). Preventing steep reductions in average PPL due to a >20% reduction in the sample’s information rests on the assumption that missing data can be inferred to be above or below the threshold. Inaccurately inferring the direction of censoring, i.e., assuming the value should be above the threshold parameter t when this is not true, would be expected to reduce the performance of the model. However, for the types of systematic censoring discussed in this paper, involving treatment effects, our directionality assumption is quite reasonable since, for example, treatment effects are the result of medical interventions designed specifically for one tail of the phenotypic distribution.
We also compared the PPL threshold analysis model with an ad hoc procedure where missing data are imputed with a sensible constant that is ideally located in the truncated distribution to minimize error in the linkage statistic, though the ideal value for the sensible constant can never be known in practice. While it is always possible that a better choice for the sensible constant exists, and that such a choice would perform better than the analyses we presented, the QTT model does not require choosing a sensible constant and thus circumvents the issue entirely. Our study shows an additional complication of the sensible constant procedure: its performance is not uniform across the statistical methods applied in our study, with variance components being more greatly affected than regression. Therefore, the choice of analysis method when using a sensible constant is a factor that must be carefully weighed to limit reductions in statistical evidence.
It was previously shown for variance components and score statistics that, under random ascertainment, increasing sibling correlation increases power, as also shown here for all simulated generating models (Figure 5A). Additional analytical work indicated the same increase in power holds for single ascertainment, where the probability of more than one proband is zero. However, under our ascertainment scheme, which requires at least one affected child and does permit multiple probands, higher residual sibling correlation seems to reduce power (Figure 5B). We postulate that this is due to parents being indirectly ascertained for a tendency to be homozygous at the trait locus, and therefore uninformative for linkage. This indirect selection occurs since mating types with one or more homozygous parents are more likely to produce multiple homozygous children with extreme phenotypes. Power loss is larger under the additive models than under the dominant models, since dominant models exert less impact on the parents’ genotypes. Importantly, the QT and QTT models are always highly positively correlated, and incorporation of the threshold parameter, the main focus of our study, is not related to this phenomenon. We also observed this pattern using both censored and uncensored data, so this is not a property of censoring the observed trait distribution per se.
This study is limited both by the scope of the work and the assumptions herein. To utilize the proposed method on censored data, the investigator must know something about the missing data point. The key development here is that it is not necessary to know the approximate average value of the censored data points (to create a sensible constant); the researchers only need to know on which side of the threshold the data points will reside. In practice, the integration boundaries of t may be chosen conservatively to span a greater range than is even biologically permissible, provided it does not extend over all possible values of the quantitative trait. Further, while the PPL is robust to violations of normality in the trait distribution, we have only examined symmetric non-normal distributions, since excess kurtosis is a simple departure that is of considerable concern due to its association with increased false positive rates in both VC and regression-based analysis. Other, non-symmetric distributional forms may require additional deliberation when applying and interpreting the PPL with a threshold.
The QTT model for handling censored data is currently limited to univariate applications. However, it is reasonable to pursue more advanced phenotypic models where correlated multivariate traits have some observations censored and others not. Many multivariate methods require fully observed data for each subject to be included in the calculations, and application of the QTT model to a multivariate context would remove this restriction.
The authors would like to thank Dr. Veronica Vieland for helpful comments on an earlier version of this manuscript. We appreciate our anonymous reviewers’ enthusiasm for our work, and especially the meticulous efforts of one reviewer who provided numerous suggestions that materially improved this manuscript. We gratefully acknowledge NIH funding from R01DC009453 (C.W.B.) and RC1MH088288 (L.H. and C.W.B.). This work was also supported in part by an allocation of computing time from the Ohio Supercomputer Center Grant PCCR0001-2 (C.W.B.).