We explore a Bayesian approach to selecting the variables that represent fixed and random effects in models of longitudinal binary outcomes with missing data caused by dropout. We show via analytic results for a simple example that nonignorable missing data lead to biased parameter estimates and, asymptotically, to selection of the wrong effects; simulations confirm this for more complex settings. By jointly modeling the longitudinal binary data with the dropout process that may produce nonignorable missing data, we are able to correct the bias in estimation and selection. Mixture priors with a point mass at zero are used to facilitate variable selection. We illustrate the proposed approach using a clinical trial for acute ischemic stroke.
Bayesian variable selection; Bias; Dropout; Missing data; Model selection
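To make the mixture-prior mechanism concrete, the sketch below works through a toy conjugate version, not the paper's model: a single coefficient estimate b ~ N(beta, se^2) under the prior beta ~ pi*delta_0 + (1 - pi)*N(0, tau^2), for which the posterior inclusion probability has a closed form. All numeric values are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def posterior_inclusion_prob(b, se, pi0=0.5, tau=1.0):
    """P(beta != 0 | b) under the mixture prior pi0*delta_0 + (1-pi0)*N(0, tau^2),
    with sampling model b ~ N(beta, se^2)."""
    m_spike = norm.pdf(b, loc=0.0, scale=se)                      # marginal under beta = 0
    m_slab = norm.pdf(b, loc=0.0, scale=np.sqrt(se**2 + tau**2))  # marginal under the slab
    return (1 - pi0) * m_slab / (pi0 * m_spike + (1 - pi0) * m_slab)

print(posterior_inclusion_prob(b=0.0, se=0.2))   # unbiased null estimate -> low
print(posterior_inclusion_prob(b=0.5, se=0.2))   # biased estimate of a null -> high
```

The second call illustrates the paper's point: a dropout-induced bias in the estimate of a truly null coefficient can drive the inclusion probability high, i.e., select a wrong effect.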
Longitudinal studies of aging often gather repeated observations of cognitive status to describe the development of dementia and to assess the influence of risk factors. Clinical progression to dementia is often conceptualized by a multistage model of several transitions that synthesizes time-varying effects. In this study, we assess the influence of risk factors on transitions among three cognitive states: cognitive stability (normal cognition for age), memory impairment, and clinical dementia. We have developed a shared random effects model that not only links the propensity of transitions to the probability of informative missingness due to death, but also incorporates heterogeneous transitions between subjects. We evaluate four approaches using generalized logit models and four using proportional odds models for the first-order Markov transition probabilities as a function of covariates. Random effects were incorporated into these models to account for within-subject correlations. Data from the Einstein Aging Study are used to evaluate the goodness-of-fit of these models using the Akaike information criterion. The best-fitting model of each type (generalized logit and proportional odds) is recommended and their results are discussed in more detail.
Generalized Logit; Multi-Stage; Proportional Odds; Random Effects; Transition
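For readers unfamiliar with the generalized logit parameterization, the sketch below computes first-order Markov transition probabilities among three states as a softmax function of covariates. The coefficients and covariate values are hypothetical, not fitted values from the Einstein Aging Study.

```python
import numpy as np

def transition_probs(x, B):
    """Generalized-logit transition probabilities for one subject.

    x : covariate vector (including intercept), shape (p,)
    B : coefficients for each non-reference destination state, shape (K-1, p);
        the last state serves as the reference category.
    Returns probabilities over the K destination states.
    """
    eta = B @ x                           # linear predictors, shape (K-1,)
    expeta = np.exp(np.append(eta, 0.0))  # reference state has eta = 0
    return expeta / expeta.sum()

# Hypothetical example: transitions out of "memory impairment" for a
# 75-year-old (age centered at 75) APOE-e4 carrier.
x = np.array([1.0, 0.0, 1.0])             # intercept, age - 75, APOE-e4
B = np.array([[0.8, 0.05, -0.3],          # -> cognitive stability
              [-1.2, 0.10, 0.6]])         # -> dementia (reference: stay impaired)
print(transition_probs(x, B))             # sums to 1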
Evaluating the impact of potential uncontrolled confounding is an important component of causal inference based on observational studies. In this article, we introduce a general framework for sensitivity analysis based on inverse probability weighting. We propose a general methodology that allows both non-parametric and parametric analyses, driven by two parameters that govern the magnitude of the variation of the multiplicative errors of the propensity score and their correlations with the potential outcomes. We also introduce a specific parametric model that offers a mechanistic view of how uncontrolled confounding may bias the inference through these parameters. Our method can be readily applied to both binary and continuous outcomes and depends on the covariates only through the propensity score, which can be estimated by any parametric or non-parametric method. We illustrate our method with two medical data sets.
Causal inference; Inverse probability weighting; Propensity score; Sensitivity analysis; Uncontrolled confounding
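A minimal sketch of the inverse-probability-weighting mechanics behind such a sensitivity analysis, not the authors' specific parametric model: the fitted propensity score is perturbed by a multiplicative error tied to the outcome, and the weighted estimate is traced over a grid of sensitivity parameters. The perturbation form chosen here (scaled centered outcome) is a deliberately simple, adversarial stand-in, purely for illustration.

```python
import numpy as np

def ipw_ate(y, a, ps):
    """Horvitz-Thompson IPW estimate of the average treatment effect."""
    return np.mean(a * y / ps) - np.mean((1 - a) * y / (1 - ps))

def sensitivity_curve(y, a, ps, lambdas):
    """Recompute the IPW estimate assuming the true propensity score is
    ps * exp(lambda * u) for an outcome-linked perturbation u."""
    u = (y - y.mean()) / y.std()
    est = []
    for lam in lambdas:
        ps_true = np.clip(ps * np.exp(lam * u), 1e-3, 1 - 1e-3)
        est.append(ipw_ate(y, a, ps_true))
    return np.array(est)

# usage with simulated data (true treatment effect = 1):
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-x))
a = rng.binomial(1, ps)
y = x + a + rng.normal(size=n)
print(sensitivity_curve(y, a, ps, lambdas=np.linspace(-0.3, 0.3, 7)))
```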
Deletion diagnostics are introduced for the regression analysis of clustered binary outcomes estimated with alternating logistic regressions, an implementation of generalized estimating equations (GEE) that estimates regression coefficients in a marginal mean model and in a model for the intracluster association given by the log odds ratio. The diagnostics are developed within an estimating equations framework that recasts the estimating functions for association parameters based upon conditional residuals into equivalent functions based upon marginal residuals. Extensions of earlier work on GEE diagnostics follow directly, including computational formulae for one-step deletion diagnostics that measure the influence of a cluster of observations on the estimated regression parameters and on the overall marginal mean or association model fit. The diagnostic formulae are evaluated with simulation studies and with an application concerning an assessment of factors associated with health maintenance visits in primary care medical practices. The application and the simulations demonstrate that the proposed cluster-deletion diagnostics for alternating logistic regressions are good approximations of their exact fully iterated counterparts.
clustered data; generalized estimating equations; influence; logistic regression; orthogonalized residuals
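The "exact fully iterated counterparts" can be computed by brute force, which is a useful benchmark even though it is exactly the refitting the one-step formulae are designed to avoid. A sketch using statsmodels, with an ordinary exchangeable-association GEE standing in for alternating logistic regressions (not available in statsmodels) and a hypothetical data frame df with columns y, x, and clinic:

```python
import statsmodels.api as sm

def exact_cluster_deletion(df, formula, group_col):
    """DFBETA-type influence: change in GEE coefficients when each cluster is dropped."""
    fit_full = sm.GEE.from_formula(
        formula, groups=group_col, data=df,
        family=sm.families.Binomial(),
        cov_struct=sm.cov_struct.Exchangeable()).fit()
    influence = {}
    for g in df[group_col].unique():
        sub = df[df[group_col] != g]
        fit_g = sm.GEE.from_formula(
            formula, groups=group_col, data=sub,
            family=sm.families.Binomial(),
            cov_struct=sm.cov_struct.Exchangeable()).fit()
        influence[g] = fit_full.params - fit_g.params  # shift from dropping cluster g
    return fit_full, influence

# usage: fit, dfbetas = exact_cluster_deletion(df, "y ~ x", "clinic")
```

One refit per cluster is expensive for large data, which is what motivates the paper's closed-form one-step approximations.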
We consider the problem of jointly modeling survival time and longitudinal data subject to measurement error. The survival times are modeled through the proportional hazards model, and a random effects model is assumed for the longitudinal covariate process. Under this framework, we propose an approximate nonparametric corrected-score estimator for the parameter describing the association between the time-to-event and the longitudinal covariate. The term nonparametric refers to the fact that assumptions regarding the distribution of the random effects and that of the measurement error are unnecessary. The finite-sample performance of the approximate nonparametric corrected-score estimator is examined through simulation studies, and its asymptotic properties are also developed. Furthermore, the proposed estimator and some existing estimators are applied to real data from an AIDS clinical trial.
Corrected score; Cumulant generating function; Measurement error; Proportional hazards; Random effects
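The corrected-score idea is easiest to see in a much simpler setting than the paper's joint model: linear regression through the origin with additive measurement error W = X + U. The naive score treats W as if it were X; the corrected score subtracts the known inflation E[W^2] - X^2 = sigma_u^2. A hypothetical simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta, sig_u = 2000, 1.5, 0.8
x = rng.normal(size=n)
w = x + rng.normal(scale=sig_u, size=n)     # error-prone covariate
y = beta * x + rng.normal(size=n)

beta_naive = np.sum(w * y) / np.sum(w**2)   # attenuated toward zero
# corrected score: E[W^2] = X^2 + sig_u^2, so subtract n*sig_u^2 from the denominator
beta_corr = np.sum(w * y) / (np.sum(w**2) - n * sig_u**2)
print(beta_naive, beta_corr)                # ~0.9 (attenuated) vs ~1.5
```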
Online risk prediction tools for common cancers are now easily accessible and widely used by patients and doctors for informed decision-making concerning screening and diagnosis. A practical problem is that, as cancer research moves forward and new biomarkers and risk factors are discovered, the risk algorithms need to be updated to include them. Typically the new markers and risk factors cannot be retrospectively measured on the same study participants used to develop the original prediction tool, necessitating the merging of a separate study of different participants, which may be much smaller in sample size and of a different design. Validation of the updated tool on a third independent data set is warranted before the updated tool can go online. This article reports on the application of Bayes rule for updating risk prediction tools to include a set of biomarkers measured in a study external to the one used to develop the original risk prediction tool. The procedure is illustrated in the context of updating the online Prostate Cancer Prevention Trial Risk Calculator to incorporate the new markers %freePSA and [−2]proPSA measured in an external case-control study performed in Texas, U.S. Recent state-of-the-art methods for validation of risk prediction tools and evaluation of the improvement of updated over original tools are implemented using an external validation set provided by the U.S. Early Detection Research Network.
Calibration; Discrimination; Net benefit; Risk prediction; Validation; Prostate Cancer Prevention Trial
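The updating rule itself is compact: the original calculator supplies the prior odds, and the new markers enter through a likelihood ratio estimated from the external study. A minimal sketch with a hypothetical likelihood-ratio value rather than the fitted PCPT marker model:

```python
def updated_risk(prior_risk, lr_new_markers):
    """Bayes-rule update: posterior odds = prior odds * likelihood ratio of the
    new markers given disease status (estimated from the external study)."""
    prior_odds = prior_risk / (1 - prior_risk)
    post_odds = prior_odds * lr_new_markers
    return post_odds / (1 + post_odds)

# e.g., a 20% calculator risk combined with marker values whose density is
# twice as likely under cancer as under no cancer:
print(updated_risk(0.20, lr_new_markers=2.0))   # 0.333...
```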
Often in biomedical studies, the routine use of linear mixed-effects models (based on Gaussian assumptions) can be questionable when the longitudinal responses are skewed in nature. Skew-normal/elliptical models are widely used in such situations. These skewed responses may also be subject to upper and lower quantification limits (e.g., longitudinal viral load measures in HIV studies), beyond which they are not measurable. In this paper, we develop a Bayesian analysis of censored linear mixed models, replacing the Gaussian assumptions with skew-normal/independent (SNI) distributions. The SNI is an attractive class of asymmetric heavy-tailed distributions that includes the skew-normal, skew-t, skew-slash, and skew-contaminated normal distributions as special cases. The proposed model provides flexibility in capturing the effects of skewness and heavy tails for responses that are either left- or right-censored. For our analysis, we adopt a Bayesian framework and develop an MCMC algorithm to carry out the posterior analyses. The marginal likelihood is tractable and is utilized to compute not only Bayesian model selection measures but also case-deletion influence diagnostics based on the Kullback-Leibler divergence. The newly developed procedures are illustrated with a simulation study as well as an HIV case study involving the analysis of longitudinal viral loads.
Bayesian inference; Detection limit; HIV viral load; Linear mixed models; Skew-normal/independent distribution
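SNI distributions are scale mixtures of skew-normals, Y = mu + sigma * U^(-1/2) * Z with Z skew-normal and U a positive mixing variable; taking U ~ Gamma(nu/2, nu/2) gives the skew-t member of the class. A minimal sketch for generating such draws, with hypothetical parameter values:

```python
import numpy as np
from scipy.stats import skewnorm

def rsni_skew_t(n, mu=0.0, sigma=1.0, shape=3.0, nu=4.0, rng=None):
    """Draws from a skew-t via its skew-normal/independent representation:
    Y = mu + sigma * Z / sqrt(U), Z ~ skew-normal(shape), U ~ Gamma(nu/2, nu/2)."""
    rng = rng or np.random.default_rng()
    z = skewnorm.rvs(a=shape, size=n, random_state=rng)
    u = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)  # rate nu/2 -> scale 2/nu
    return mu + sigma * z / np.sqrt(u)

y = rsni_skew_t(10000)   # right-skewed, heavy-tailed sample
```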
Identifying risk factors for transition rates among normal cognition, mild cognitive impairment, dementia, and death in an Alzheimer’s disease study is very important. It is known that transition rates among these states are strongly time dependent. While Markov process models are often used to describe such disease progression, the literature mainly focuses on time-homogeneous processes, and limited tools are available for dealing with non-homogeneity. Further, patients may choose when to visit the clinic, which creates informative observation times. In this paper, we develop methods for non-homogeneous Markov processes through time-scale transformation when observation times are pre-planned with some observations missing. Maximum likelihood estimation via the EM algorithm is derived for parameter estimation. Simulation studies demonstrate that the proposed method works well under a variety of situations. An application to the Alzheimer’s disease study identifies a significant increase in transition rates as a function of time. Furthermore, our models suggest that a non-ignorable missing-data mechanism is plausible.
Markov; Missing data; Non-homogeneous; Transformation
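In the time-scale transformation approach, a process with intensities Q·g′(t) is homogeneous on the operational time u = g(t), so P(s, t) = exp{Q(g(t) − g(s))}. A numeric sketch with a hypothetical three-state intensity matrix and a power-law transformation, not the fitted Alzheimer’s model:

```python
import numpy as np
from scipy.linalg import expm

# States: 1 = normal, 2 = mild cognitive impairment, 3 = dementia (rows sum to 0)
Q = np.array([[-0.15, 0.12, 0.03],
              [0.05, -0.25, 0.20],
              [0.00, 0.00, 0.00]])   # dementia treated as absorbing here

g = lambda t, alpha=1.6: t**alpha    # time transformation; alpha > 1 => rates rise with time

def P(s, t):
    """Transition probability matrix between chronological times s < t."""
    return expm(Q * (g(t) - g(s)))

print(P(0.0, 2.0))   # two-year transition probabilities from baseline
```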
This paper is motivated by the analysis of neuroscience data in a study of the neural and muscular mechanisms of muscle fatigue. Multidimensional outcomes of different natures were obtained simultaneously from multiple modalities, including handgrip force, electromyography (EMG), and functional magnetic resonance imaging (fMRI). We first study individual modeling of each univariate response according to its nature. A mixed-effects beta model and a mixed-effects simplex model are compared for modeling the force/EMG percentages, and a mixed-effects negative-binomial model is proposed for modeling the fMRI counts. We then present a joint modeling approach for the multidimensional outcomes, which allows us not only to estimate the covariate effects but also to evaluate the strength of association among the multiple responses from different modalities. A simulation study is conducted to quantify the possible benefits of the new approaches in finite-sample situations. Finally, the analysis of the fatigue data is illustrated with the use of the proposed methods.
Dispersion; Generalized linear mixed models; Joint modeling; Multivariate responses; Pseudo-likelihood
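As a flavor of the first modeling step, here is a minimal fixed-effects beta regression fitted by direct likelihood maximization; the random effects and the simplex and negative-binomial alternatives are omitted, and the data are stand-ins.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln

def beta_reg_negloglik(theta, X, y):
    """Beta regression with logit mean link and log precision phi:
    mu = expit(X @ b), y ~ Beta(mu*phi, (1-mu)*phi)."""
    b, logphi = theta[:-1], theta[-1]
    mu, phi = expit(X @ b), np.exp(logphi)
    a1, a2 = mu * phi, (1 - mu) * phi
    return -np.sum(gammaln(phi) - gammaln(a1) - gammaln(a2)
                   + (a1 - 1) * np.log(y) + (a2 - 1) * np.log1p(-y))

# stand-in data: percentages rescaled to the open interval (0, 1)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(300), rng.normal(size=300)])
y = np.clip(rng.beta(2, 5, size=300), 1e-4, 1 - 1e-4)

fit = minimize(beta_reg_negloglik, np.zeros(X.shape[1] + 1), args=(X, y), method="BFGS")
print(fit.x)   # regression coefficients and log precision
```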
Multiple diagnostic tests and risk factors are commonly available for many diseases. This information can be either redundant or complementary. Combinations of these tests and risk factors may improve the diagnostic/predictive accuracy but may also unnecessarily increase complexity, risks, and/or costs. The improved accuracy gained by including additional variables can be evaluated by the increment in the area under the receiver operating characteristic (ROC) curve (AUC) with and without the new variable(s). In this paper, we derive a new test statistic to accurately and efficiently determine the statistical significance of this incremental AUC under a multivariate normality assumption. Our test links the difference in AUC to a quadratic form of a standardized mean shift in the metric of the inverse covariance matrix through a suitable linear transformation of all diagnostic variables. The distribution of the estimator of the quadratic form is related to the multivariate Behrens-Fisher problem. We provide explicit mathematical solutions for the estimator and its approximate noncentral F-distribution, type I error rate, and sample size formula. Simulation studies show that our new test maintains prespecified type I error rates as well as reasonable statistical power under practical sample sizes. We use data from the Study of Osteoporotic Fractures (SOF) as an application example to illustrate our method.
Area under ROC curves; Behrens-Fisher problem; Noncentral F distribution; Receiver operating characteristic curve
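Under multivariate normality with a common covariance, the AUC of the best linear combination is Φ(√(Δ²/2)), where Δ² = δ′Σ⁻¹δ is the Mahalanobis distance between the diseased and non-diseased mean vectors, so the incremental AUC is a difference of two such terms. A numeric sketch with hypothetical means and covariance:

```python
import numpy as np
from scipy.stats import norm

def best_auc(delta, Sigma):
    """AUC of the optimal linear combination under equal-covariance normality:
    Phi(sqrt(delta' Sigma^{-1} delta / 2))."""
    m2 = delta @ np.linalg.solve(Sigma, delta)  # squared Mahalanobis distance
    return norm.cdf(np.sqrt(m2 / 2.0))

delta = np.array([0.8, 0.3, 0.4])        # mean shifts: two old markers + one new
Sigma = np.array([[1.0, 0.3, 0.2],
                  [0.3, 1.0, 0.1],
                  [0.2, 0.1, 1.0]])
auc_old = best_auc(delta[:2], Sigma[:2, :2])
auc_full = best_auc(delta, Sigma)
print(auc_full - auc_old)                # incremental AUC from the new marker
```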
Studies of HIV dynamics in AIDS research are very important for understanding the pathogenesis of HIV-1 infection and for assessing the effectiveness of antiretroviral (ARV) treatment. Viral dynamic models can be formulated through a system of nonlinear ordinary differential equations (ODEs), but there has been only limited development of statistical methodologies for inference. This paper, motivated by an AIDS clinical study, discusses a hierarchical Bayesian nonlinear mixed-effects modeling approach to dynamic ODE models without a closed-form solution. In this model we fully integrate viral load, medication adherence, drug resistance, pharmacokinetics, baseline covariates, and time-dependent drug efficacy into the data analysis for characterizing long-term virologic responses. We apply our method to a data set from an AIDS clinical study. The results suggest that modeling HIV dynamics and virologic responses with consideration of time-varying clinical factors as well as baseline characteristics may be important for HIV/AIDS studies, providing quantitative guidance for better understanding virologic responses to ARV treatment and for helping to evaluate clinical trial designs in response to existing therapies.
Baseline characteristics; Bayesian nonlinear mixed-effects models; Long-term HIV dynamics; Longitudinal data; Time-varying drug efficacy; Treatment factors
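To give a feel for the ODE component, the sketch below solves the standard target-cell-limited model with a time-varying drug efficacy; this is a generic textbook system with hypothetical parameter values, not the paper's hierarchical Bayesian model.

```python
import numpy as np
from scipy.integrate import solve_ivp

def hiv_ode(t, y, lam=1e4, d=0.01, k=2.4e-8, delta=1.0, N=2000, c=23.0):
    """Target-cell-limited model: T (target cells), I (infected cells), V (virus).
    The infection rate is reduced by a time-varying drug efficacy gamma(t)."""
    T, I, V = y
    gamma = 0.9 * np.exp(-0.01 * t)      # efficacy waning with imperfect adherence
    dT = lam - d * T - (1 - gamma) * k * T * V
    dI = (1 - gamma) * k * T * V - delta * I
    dV = N * delta * I - c * V
    return [dT, dI, dV]

sol = solve_ivp(hiv_ode, t_span=(0, 100), y0=[1e6, 1e4, 1e5], dense_output=True)
viral_load = np.log10(sol.sol(np.arange(0, 100, 7))[2])   # weekly log10 viral load
```

In the paper's framework, subject-level random effects and covariates enter through the ODE parameters, and the posterior is explored by MCMC with a numerical solver in the likelihood.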
The ROC (Receiver Operating Characteristic) curve is the most commonly used statistical tool for describing the discriminatory accuracy of a diagnostic test. Classical estimation of the ROC curve relies on data from a simple random sample from the target population. In practice, estimation is often complicated because not all subjects undergo a definitive assessment of disease status (verification). Estimation of the ROC curve based only on data from subjects with verified disease status may be badly biased. In this work we investigate the properties of the doubly robust (DR) method for estimating the ROC curve under verification bias, originally developed by Rotnitzky et al. (2006) for estimating the area under the ROC curve. The DR method can be applied to continuous-scale tests and allows for a nonignorable process of selection to verification. We derive the estimator's asymptotic distribution and examine its finite-sample properties via a simulation study. We exemplify the DR procedure for estimation of ROC curves with data collected on patients undergoing electron beam computed tomography, a diagnostic test for calcification of the arteries.
Diagnostic test; Nonignorable; Semiparametric model; Sensitivity analysis; Sensitivity; Specificity
In disease screening and prognosis studies, an important task is to determine useful markers for identifying high-risk subgroups. Once such markers are established, they can be incorporated into public health practice to provide appropriate strategies for treatment or disease monitoring based on each individual’s predicted risk. In recent years, genetic and biological markers have been examined extensively for their potential to signal progression or risk of disease. In addition to these markers, it has often been argued that short-term outcomes may be helpful in making a better prediction of disease outcomes in clinical practice. In this paper we propose model-free non-parametric procedures to incorporate short-term event information to improve the prediction of a long-term terminal event. We include the optional availability of a single discrete marker measurement and assess the additional information gained by including the short-term outcome. We focus on the semi-competing risk setting where the short-term event is an intermediate event that may be censored by the terminal event while the terminal event is only subject to administrative censoring. Simulation studies suggest that the proposed procedures perform well in finite samples. Our procedures are illustrated using a data set of post-dialysis patients with end-stage renal disease.
Biomarkers; Disease prognosis; Risk prediction; Survival analysis
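The semi-competing-risk structure can be conveyed with a crude landmark estimator, far simpler than the paper's procedures: with administrative censoring only, the terminal-event status at t0 + τ is fully observed for subjects with enough follow-up, so conditional risks given short-term event status at t0 reduce to proportions. Hypothetical arrays throughout.

```python
import numpy as np

def landmark_risk(t_short, t_term, c_admin, t0, tau):
    """P(terminal event by t0 + tau | alive at t0), split by whether the
    short-term event has occurred by t0. Restricts to subjects whose
    administrative censoring time is at least t0 + tau, so terminal status
    at t0 + tau is fully observed."""
    full_fu = c_admin >= t0 + tau
    at_risk = (t_term > t0) & full_fu
    had_short = at_risk & (t_short <= t0)
    no_short = at_risk & (t_short > t0)
    risk = lambda idx: np.mean(t_term[idx] <= t0 + tau)
    return risk(had_short), risk(no_short)
```

A large gap between the two returned risks is what makes the short-term event informative for long-term prediction.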
In biomedical research, the logistic regression model is the most commonly used method for predicting the probability of a binary outcome. While many clinical researchers have expressed enthusiasm for regression trees, this method may have limited accuracy for predicting health outcomes. We aimed to evaluate the improvement that is achieved by using ensemble-based methods, including bootstrap aggregation (bagging) of regression trees, random forests, and boosted regression trees. We analyzed 30-day mortality in two large cohorts of patients hospitalized with either acute myocardial infarction (N = 16,230) or congestive heart failure (N = 15,848) in two distinct eras (1999–2001 and 2004–2005). We found that both the in-sample and out-of-sample prediction of ensemble methods offered substantial improvement in predicting cardiovascular mortality compared to conventional regression trees. However, conventional logistic regression models that incorporated restricted cubic smoothing splines had even better performance. We conclude that ensemble methods from the data mining and machine learning literature increase the predictive performance of regression trees, but may not lead to clear advantages over conventional logistic regression models for predicting short-term mortality in population-based samples of subjects with cardiovascular disease.
Acute myocardial infarction; Bagging; Boosting; Data mining; Heart failure
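The modeling comparison is straightforward to reproduce in outline with scikit-learn; spline-expanded logistic regression stands in for the restricted cubic splines used in the paper, and the data are a synthetic stand-in for the cardiovascular cohorts.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)  # stand-in data

models = {
    "single tree": DecisionTreeClassifier(max_depth=5),
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=200),
    "random forest": RandomForestClassifier(n_estimators=500),
    "boosted trees": GradientBoostingClassifier(),
    "logistic + splines": make_pipeline(SplineTransformer(degree=3, n_knots=5),
                                        LogisticRegression(max_iter=5000)),
}
for name, m in models.items():
    auc = cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: cross-validated AUC = {auc:.3f}")
```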
Characterizing associations between measures of disease progression or disease status and multiple single-nucleotide polymorphisms (SNPs), within and across genes, will potentially offer new insight into disease etiology and disease progression. However, this presents a significant analytic challenge due to the existence of multiple potentially informative genetic loci, as well as environmental and demographic factors, and the generally uncharacterized and complex relationships among them. Latent variable modeling approaches offer a natural, yet underutilized, framework for analyzing data arising from these population-based genetic association investigations of complex diseases, as they are well suited to uncover simultaneous effects of multiple markers. In this manuscript we describe the application and performance of two such latent variable methods, namely structural equation models (SEMs) and mixed effects models (MEMs), and highlight their theoretical overlap. The relative advantages of each paradigm are investigated through simulation studies, and, finally, an application to data arising from a study of antiretroviral-associated dyslipidemia in HIV-infected individuals is provided for illustration.
Mixed Effects Models; Single-nucleotide polymorphisms (SNPs); Structural equation model (SEM)
To study family-based association in the presence of linkage, we extend a generalized linear mixed model proposed for genetic linkage analysis (Lebrec and van Houwelingen, 2007) by adding a genotypic effect to the mean. The corresponding score test is a weighted FBAT statistic, where the weight depends on the linkage effect and on other genetic and shared environmental effects. For testing genetic association in the presence of gene-covariate interaction, we propose a linear regression method where the family-specific score statistic is regressed on family-specific covariates. Both statistics are straightforward to compute. Simulation results show that adjusting the weight for the within-family variance structure may be a powerful approach in the presence of environmental effects. The test statistic for genetic association in the presence of gene-covariate interaction improved the power for detecting association. For illustration, we analyze the rheumatoid arthritis data from GAW15. Adjusting for smoking and anti-CCP increased the significance of the association with the DR locus.
family-based studies; generalized linear mixed model; FBAT; Linkage; Linkage disequilibrium; Score test
Concerns have been raised about the use of traditional measures of model fit in evaluating risk prediction models for clinical use, and reclassification tables have been suggested as an alternative means of assessing the clinical utility of a model. Several measures based on the table have been proposed, including the reclassification calibration (RC) statistic, the net reclassification improvement (NRI), and the integrated discrimination improvement (IDI), but the performance of these in practical settings has not been fully examined. We used simulations to estimate the type I error and power for these statistics in a number of scenarios, as well as the impact of the number and type of categories, when adding a new marker to an established or reference model. The type I error was found to be reasonable in most settings, and power was highest for the IDI, which was similar to the test of association. The relative power of the RC statistic, a test of calibration, and the NRI, a test of discrimination, varied depending on the model assumptions. These tools provide unique but complementary information.
Calibration; Discrimination; Model accuracy; Prediction; Reclassification
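The two discrimination-based measures reduce to simple sample averages of predicted risks from the reference and updated models; the NRI is shown here in its category-free form for brevity, while the categorical version studied in the paper replaces the up/down indicators with movements across risk categories. Risk vectors p_old, p_new and the event indicator d are hypothetical.

```python
import numpy as np

def continuous_nri(p_old, p_new, d):
    """Category-free NRI: net proportion of events reclassified upward plus
    net proportion of non-events reclassified downward."""
    up = p_new > p_old
    down = p_new < p_old
    ev, ne = d == 1, d == 0
    return (up[ev].mean() - down[ev].mean()) + (down[ne].mean() - up[ne].mean())

def idi(p_old, p_new, d):
    """IDI: improvement in mean predicted-risk separation between events and non-events."""
    ev, ne = d == 1, d == 0
    return (p_new[ev].mean() - p_old[ev].mean()) - (p_new[ne].mean() - p_old[ne].mean())
```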
This research is motivated by a pilot colorectal adenoma study, where the outcome of interest is the presence of colorectal adenoma representing risk for colorectal cancer, and the predictors of interest are protein biomarkers that are repeatedly measured with errors along the length of a microscopic structure in the human colon, the colon crypt. Biomarkers of this type are referred to as functional biomarkers. The investigators are interested in identifying features of functional biomarkers that are associated with risk for colorectal cancer. In this paper, we investigate a joint modeling approach, where the binary clinical outcome is modeled using a logistic regression model with the unobserved true functional biomarkers as predictors. Most existing methods are developed either for linear models or for functional biomarkers measured without error and cannot be directly applied to our data. The applicable methods include a two-step method and a maximum likelihood method, which have some limitations. We propose a robust semiparametric method to overcome the limitations of the existing methods. We study the properties of the proposed method, and show in simulations that it compares favorably with other methods and also offers significant savings in CPU time. We analyze the pilot colorectal adenoma data and show that expression levels of APC, a tumor suppressor gene, in the transitional area from the proliferation zone to the differentiation zone of colon crypts are likely associated with risk for colorectal cancer. Given the relatively small sample size in the pilot study, our results need to be validated in future full-scale studies.
Functional biomarker; Functional logistic models; Joint modeling; Measurement errors; Semiparametric estimation; Sufficient score
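The basic two-step flavor of functional logistic regression can be sketched compactly: expand the coefficient function β(t) in a basis, reduce each observed biomarker curve to its integrated basis scores, and fit an ordinary logistic regression on the scores. Measurement error in the curves, the focus of the paper, is ignored in this illustration, and the data are synthetic stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def functional_features(curves, grid, degree=3):
    """Approximate integral of x_i(t) * phi_k(t) dt for a polynomial basis phi_k,
    turning each curve into a small feature vector."""
    basis = np.vander(grid, N=degree + 1, increasing=True)  # (len(grid), degree+1)
    dt = grid[1] - grid[0]
    return curves @ basis * dt                               # (n, degree+1)

rng = np.random.default_rng(0)
n = 200
grid = np.linspace(0.0, 1.0, 50)                             # position along the crypt axis
curves = rng.normal(size=(n, grid.size)).cumsum(axis=1) / 7.0   # stand-in biomarker curves
# stand-in outcome driven by expression in the upper part of the crypt:
y = (curves[:, 30:].mean(axis=1) + rng.normal(scale=0.5, size=n) > 0).astype(int)

Z = functional_features(curves, grid)
fit = LogisticRegression(max_iter=1000).fit(Z, y)
beta_t = np.vander(grid, N=4, increasing=True) @ fit.coef_.ravel()  # estimated beta(t)
```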
Group testing, also known as pooled testing, and inverse sampling are both widely used methods of data collection when the goal is to estimate a small proportion. Taking a Bayesian approach, we consider the new problem of estimating disease prevalence from group testing when inverse (negative binomial) sampling is used. Using different distributions to incorporate prior knowledge of disease incidence and different loss functions, we derive closed form expressions for posterior distributions and resulting point and credible interval estimators. We then evaluate our new estimators, on Bayesian and classical grounds, and apply our methods to a West Nile Virus data set.
Bayesian estimation; Inverse sampling; Maximum likelihood; Pooled testing; West Nile Virus
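Even without the closed-form posteriors derived in the paper, the model is easy to evaluate numerically. With pools of size k, a pool tests positive with probability 1 − (1 − p)^k; stopping at the r-th positive pool after observing m negative pools gives likelihood [1 − (1 − p)^k]^r (1 − p)^{km}. A grid sketch under a Beta prior, with hypothetical counts:

```python
import numpy as np
from scipy.stats import beta

def grid_posterior(r, m, k, a=1.0, b=9.0, ngrid=20000):
    """Posterior of prevalence p under inverse (negative binomial) group-test sampling:
    sampling stopped at the r-th positive pool after m negative pools of size k,
    with a Beta(a, b) prior on p."""
    p = np.linspace(1e-6, 1 - 1e-6, ngrid)
    loglik = r * np.log1p(-(1 - p) ** k) + k * m * np.log1p(-p)
    post = np.exp(loglik - loglik.max()) * beta.pdf(p, a, b)
    post /= post.sum() * (p[1] - p[0])                 # normalize on the grid
    return p, post

p, post = grid_posterior(r=2, m=40, k=10)
print(np.sum(p * post) * (p[1] - p[0]))                # posterior mean prevalence
```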
In cluster randomized trials (CRTs), identifiable clusters rather than individuals are randomized to study groups. The resulting data often consist of a small number of clusters with correlated observations within a treatment group. Missing data often present a problem in the analysis of such trials, and multiple imputation (MI) has been used to create complete data sets, enabling subsequent analysis with well-established methods for CRTs. We discuss strategies for accounting for clustering when multiply imputing a missing continuous outcome, focusing on estimation of the variance of group means as used in an adjusted t-test or ANOVA. These analysis procedures are congenial to (can be derived from) a mixed-effects imputation model; however, this imputation procedure is not yet available in commercial statistical software. An alternative approach that is readily available and has been used in recent studies is to include fixed effects for cluster, but the impact of using this convenient method has not been studied. We show that under this imputation model the MI variance estimator is positively biased and that smaller intraclass correlation coefficients (ICCs) lead to larger overestimation of the MI variance. Analytical expressions for the bias of the variance estimator are derived in the case of data missing completely at random (MCAR), and cases in which data are missing at random (MAR) are illustrated through simulation. Finally, various imputation methods are applied to data from the Detroit Middle School Asthma Project, a recent school-based CRT, and differences in inference are compared.
Cluster randomized; Missing Data; Multiple Imputation
Dementia, Alzheimer’s disease in particular, is one of the major causes of disability and decreased quality of life among the elderly and a leading obstacle to successful aging. Given the profound impact on public health, much research has focused on the age-specific risk of developing dementia and the impact on survival. Early work has discussed various methods of estimating age-specific incidence of dementia, among which the illness-death model is popular for modeling disease progression. In this article we use multiple imputation to fit multi-state models for survival data with interval censoring and left truncation. This approach allows semi-Markov models in which survival after dementia may depend on onset age. Such models can be used to estimate the cumulative risk of developing dementia in the presence of the competing risk of dementia-free death. Simulations are carried out to examine the performance of the proposed method. We analyze data from the Honolulu Asia Aging Study to estimate the age-specific and cumulative risks of dementia and to examine the effect of major risk factors on dementia onset and death.
Competing risk; Dementia; Illness-death model; Interval censoring; Multiple imputation
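The imputation step can be sketched schematically: for each of M completed data sets, draw an exact onset age inside the interval bounded by the last dementia-free visit and the first visit with dementia, run the complete-data analysis, and pool with Rubin's rules. Below is a deliberately simplified version: uniform within-interval draws rather than the paper's model-based imputation, complete mortality follow-up, and a simple proportion as the complete-data estimator.

```python
import numpy as np

def mi_dementia_risk(left, right, age_death, t, M=20, seed=0):
    """Multiple-imputation estimate of P(dementia by age t), with dementia-free
    death as a competing risk. For subjects who developed dementia, onset lies in
    (left, right]; right is np.inf otherwise."""
    rng = np.random.default_rng(seed)
    dem = np.isfinite(right)
    n = len(left)
    est, wvar = np.empty(M), np.empty(M)
    for m in range(M):
        draw = rng.uniform(left, np.where(dem, right, left + 1.0))
        onset = np.where(dem, draw, np.inf)     # impute exact onset in the interval
        p = np.mean((onset <= t) & (onset < age_death))  # onset before death and before t
        est[m], wvar[m] = p, p * (1 - p) / n
    # Rubin's rules: total variance = mean within-variance + (1 + 1/M) * between-variance
    total_var = wvar.mean() + (1 + 1 / M) * est.var(ddof=1)
    return est.mean(), total_var
```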
Resampling-based multiple testing methods that control the Familywise Error Rate in the strong sense are presented. It is shown that no assumptions whatsoever on the data-generating process are required to obtain a reasonably powerful and flexible class of multiple testing procedures. Improvements are obtained with mild assumptions. The methods are applicable to gene expression data in particular, but more generally to any multivariate, multiple group data that may be character or numeric. The role of the disputed “subset pivotality” condition is clarified.
Bootstrap; Exchangeability; Permutation; Resampling; Subset pivotality
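One widely used member of this class is the single-step permutation maxT procedure, which adjusts each observed statistic against the permutation distribution of the maximum statistic across all features. A compact sketch for a two-group gene-expression comparison, with a hypothetical matrix X of features by samples:

```python
import numpy as np

def maxt_adjusted_pvalues(X, labels, n_perm=5000, seed=0):
    """Single-step maxT adjusted p-values for two-sample mean differences.
    X: (n_features, n_samples); labels: binary group indicator per sample."""
    rng = np.random.default_rng(seed)

    def stats(lab):
        a, b = X[:, lab == 1], X[:, lab == 0]
        se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] +
                     b.var(axis=1, ddof=1) / b.shape[1])
        return np.abs(a.mean(axis=1) - b.mean(axis=1)) / se  # Welch-type t statistics

    t_obs = stats(labels)
    max_null = np.array([stats(rng.permutation(labels)).max() for _ in range(n_perm)])
    return np.array([(max_null >= t).mean() for t in t_obs])  # adjusted p-values
```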
Simulation-based assessment is a popular and frequently necessary approach to the evaluation of statistical procedures. Sometimes overlooked is the ability to exploit underlying mathematical relations, and we focus on this aspect. We show how to take advantage of large-sample theory when conducting a simulation, using the analysis of genomic data as a motivating example. The approach uses convergence results to provide an approximation to smaller-sample results that are otherwise available only by simulation. We consider evaluating and comparing a variety of ranking-based methods for identifying the most highly associated SNPs in a genome-wide association study, derive integral equation representations of the pre-posterior distribution of percentiles produced by three ranking methods, and provide examples comparing performance. These results are of interest in their own right and set the framework for a more extensive set of comparisons.
Efficient simulation; ranking procedures; SNP identification
Spatial cluster detection has become an important methodology in quantifying the effect of hazardous exposures. Previous methods have focused on cross-sectional outcomes that are binary or continuous, and virtually no spatial cluster detection methods have been proposed for longitudinal outcomes. This paper proposes a new spatial cluster detection method for repeated outcomes using cumulative geographic residuals. A major advantage of this method is its ability to readily incorporate information on study participants' relocation, which most cluster detection statistics cannot. Application of these methods is illustrated with the Home Allergens and Asthma prospective cohort study, analyzing the relationship between environmental exposures and a repeatedly measured outcome, occurrence of wheeze in the previous 6 months, while taking mobile locations into account.
Asthma; Cumulative residuals; Repeated measures; Spatial cluster detection; Wheeze
Models for longitudinal data are employed in a wide range of behavioral, biomedical, psychosocial, and health-care-related research. One popular model for a continuous response is the linear mixed-effects model (LMM). Although simulations in recent studies show that the LMM provides reliable estimates under departures from its normality assumption for complete data, the invariable occurrence of missing data in practical studies renders such robustness results less useful when applied to real study data. In this paper, we show by simulation studies that, in the presence of missing data, estimates of the fixed effects of the LMM are biased under departures from normality. We discuss two robust alternatives, the weighted generalized estimating equations (WGEE) and the augmented WGEE (AWGEE), and compare their performance with the LMM using real as well as simulated data. Our simulation results show that both WGEE and AWGEE provide valid inference for skewed non-normal data when missing data follow the missing at random (MAR) mechanism, the most common missing-data mechanism in real studies.
Augmented weighted generalized estimating equations; double robust estimate; missing at random; surrogacy assumption; weighted generalized estimating equations
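The WGEE mechanics can be sketched as a two-step procedure: model the probability that a visit is observed given the observed history to form inverse-probability weights, then solve weighted estimating equations. With an independence working correlation and a continuous outcome, the second step reduces to weighted least squares with cluster-robust standard errors, which keeps the sketch within standard statsmodels calls. The data frame df and its columns (id, time, y, x, y_lag, observed; monotone dropout; sorted by id and time) are hypothetical, and the AWGEE augmentation term is omitted.

```python
import statsmodels.api as sm

# Step 1: model P(observed at this visit | observed history) to build IPW weights.
design = sm.add_constant(df[["time", "y_lag"]])
df["p_obs"] = sm.Logit(df["observed"], design).fit().predict(design)
df["w"] = 1.0 / df.groupby("id")["p_obs"].cumprod()   # inverse cumulative probability

# Step 2: with an independence working correlation, WGEE for a continuous outcome
# reduces to weighted least squares with cluster-robust (sandwich) standard errors.
obs = df[df["observed"] == 1]
X = sm.add_constant(obs[["x", "time"]])
wgee = sm.WLS(obs["y"], X, weights=obs["w"]).fit(
    cov_type="cluster", cov_kwds={"groups": obs["id"]})
print(wgee.summary())
```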