Pre- and post-intervention experiments are widely used in medical and social behavioral studies, where each subject is supposed to contribute a pair of observations. In this paper we investigate the sample size requirement for a scenario frequently encountered by practitioners: all enrolled subjects participate in the pre-intervention phase of the study, but some of them drop out for various reasons, resulting in missing values in the post-intervention measurements. Traditional sample size calculation based on McNemar’s test cannot accommodate missing data. Through the GEE approach, we derive a closed-form sample size formula that properly accounts for the impact of partial observations. We demonstrate that when there are no missing data, the proposed sample size estimate under the GEE approach is very close to that under McNemar’s test. When there are missing data, the proposed method can lead to substantial savings in sample size. Simulation studies and an example are presented.
doi:10.1016/j.csda.2013.07.037
PMCID: PMC3842849
PMID: 24293779
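To make concrete why McNemar's test cannot use partial observations, the following minimal Python sketch computes the classical McNemar statistic from paired binary outcomes. It is not the GEE-based formula of the abstract above. The function name `mcnemar_complete_pairs` and the convention that a missing post-intervention value is coded as `None` are assumptions made for illustration only.

```python
import math

def mcnemar_complete_pairs(pairs):
    """McNemar chi-square test (1 df, no continuity correction).

    pairs: iterable of (pre, post) with values 0/1; post may be None
    when the subject dropped out (hypothetical coding).
    Returns (statistic, p_value, number_of_complete_pairs).
    """
    # Subjects without a post-intervention value are discarded entirely.
    complete = [(pre, post) for pre, post in pairs if post is not None]
    # Only discordant pairs carry information about a change.
    b = sum(1 for pre, post in complete if pre == 1 and post == 0)
    c = sum(1 for pre, post in complete if pre == 0 and post == 1)
    if b + c == 0:
        raise ValueError("no discordant pairs; statistic is undefined")
    stat = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square with 1 df: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value, len(complete)
```

In this sketch every subject with a missing post-intervention value is dropped, so the pre-intervention information those subjects contributed is lost; that loss is what the abstract's GEE approach is meant to avoid.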
Disease-modifying (DM) trials on chronic diseases such as Alzheimer’s disease (AD) require a randomized start or withdrawal design. The analysis and optimization of such trials remain poorly understood, even for the simplest scenario in which only three repeated efficacy assessments are planned for each subject: one at the baseline, one at the end of the trial, and the other at the time when the treatments are switched. Under the assumption that the repeated measures across subjects follow a trivariate distribution whose mean and covariance matrix exist, the DM efficacy hypothesis is formulated by comparing the change of efficacy outcome between treatment arms with and without a treatment switch. Using a minimax criterion, a methodology is developed to optimally determine the sample size allocations to individual treatment arms as well as the optimum time when treatments are switched. The sensitivity of the optimum designs with respect to various model parameters is further assessed. An intersection-union test (IUT) is proposed to test the DM hypothesis, and its asymptotic size and power are determined. Finally, the proposed methodology is demonstrated by using reported statistics on the placebo arms from several recently published symptomatic trials on AD to estimate necessary parameters and then deriving the optimum sample sizes and the time of treatment switch for future DM trials on AD.
doi:10.1016/j.csda.2013.07.013
PMCID: PMC3804275
PMID: 24159249
Alzheimer’s disease; Disease-modifying trials; Intersection-union test; Minimax criterion; Random intercept and slope models; Randomized start design
Models for survival data generally assume that covariates are fully observed. However, in medical studies it is not uncommon for biomarkers to be censored at known detection limits. A computationally-efficient multiple imputation procedure for modeling survival data with covariates subject to detection limits is proposed. This procedure is developed in the context of an accelerated failure time model with a flexible seminonparametric error distribution. The consistency and asymptotic normality of the multiple imputation estimator are established and a consistent variance estimator is provided. An iterative version of the proposed multiple imputation algorithm that approximates the EM algorithm for maximum likelihood is also suggested. Simulation studies demonstrate that the proposed multiple imputation methods work well while alternative methods lead to estimates that are either biased or more variable. The proposed methods are applied to analyze the dataset from a recently-conducted GenIMS study.
doi:10.1016/j.csda.2013.07.027
PMCID: PMC3816712
PMID: 24204085
Accelerated failure time model; Censored predictor; Complete case; Detection limit; Multiple imputation; Seminonparametric distribution
Examination of multiple conditional quantile functions provides a comprehensive view of the relationship between the response and covariates. In situations where quantile slope coefficients share some common features, estimation efficiency and model interpretability can be improved by utilizing such commonality across quantiles. Furthermore, elimination of irrelevant predictors will also aid in estimation and interpretation. These motivations lead to the development of two penalization methods, which can identify the interquantile commonality and nonzero quantile coefficients simultaneously. The developed methods are based on a fused penalty that encourages sparsity of both quantile coefficients and interquantile slope differences. The oracle properties of the proposed penalization methods are established. Through numerical investigations, it is demonstrated that the proposed methods lead to simpler model structure and higher estimation efficiency than the traditional quantile regression estimation.
doi:10.1016/j.csda.2013.08.006
PMCID: PMC3956083
PMID: 24653545
Fused adaptive lasso; Fused adaptive sup-norm; Oracle; Quantile regression; Smoothing; Variable selection
Motivated by recent developments on dimension reduction (DR) techniques for time series data, the association of a general deterrent effect with South Carolina (SC)’s registration and notification (SORN) policy for preventing sex crimes was examined. Using adult sex crime arrestee data from 1990 to 2005, the idea of Central Mean Subspace (CMS) is extended to intervention time series analysis (CMS-ITS) to model the sequential intervention effects of 1995 (the year SC’s SORN policy was initially implemented) and 1999 (the year the policy was revised to include online notification) on the time series spectrum. The CMS-ITS model estimation was achieved via kernel smoothing techniques, and compared to interrupted autoregressive integrated moving average (ARIMA) time series models. Simulation studies and application to the real data underscore our model’s ability to achieve parsimony and to detect intervention effects not previously detected via traditional ARIMA models. From a public health perspective, findings from this study draw attention to the potential general deterrent effects of SC’s SORN policy. These findings are considered in light of the overall body of research on sex crime arrestee registration and notification policies, which remain controversial.
doi:10.1016/j.csda.2013.08.004
PMCID: PMC4002981
PMID: 24795489
Central mean subspace; Nadaraya-Watson kernel smoother; Nonlinear time series; Sex crime arrestee
Racial differences in prostate cancer incidence and mortality have been reported. Several authors hypothesize that African Americans have a more rapid growth rate of prostate cancer compared to Caucasians, that manifests in higher recurrence and lower survival rates in the former group. In this paper we propose a Bayesian piecewise mixture model to characterize PSA progression over time in African Americans and Caucasians, using follow-up serial PSA measurements after surgery. Each individual's PSA trajectory is hypothesized to have a latent phase immediately following surgery followed by a rapid increase in PSA indicating regrowth of the tumor. The true time of transition from the latent phase to the rapid growth phase is unknown, and can vary across individuals, suggesting a random change point across individuals. Furthermore, some patients may not experience the latent phase due to the cancer having already spread outside the prostate before undergoing surgery. We propose a two-component mixture model to accommodate patients both with and without a latent phase. Within the framework of this mixture model, patients who do not have a latent phase are allowed to have different rates of PSA rise; patients who have a latent phase are allowed to have different PSA trajectories, represented by subject-specific change points and rates of PSA rise before and after the change point. The proposed Bayesian methodology is implemented using Markov Chain Monte Carlo techniques. Model selection is performed using deviance information criteria based on the observed and complete likelihoods. Finally, we illustrate the methods using a prostate cancer dataset.
doi:10.1016/j.csda.2011.07.011
PMCID: PMC4273308
PMID: 25540470
random change point; mixture distribution; PSA profiles; MCMC; DIC
The cross-validation deletion–substitution–addition (cvDSA) algorithm is based on data-adaptive estimation methodology to select and estimate marginal structural models (MSMs) for point treatment studies as well as models for conditional means where the outcome is continuous or binary. The algorithm builds and selects models based on user-defined criteria for model selection, and utilizes a loss function-based estimation procedure to distinguish between different model fits. In addition, the algorithm selects models based on cross-validation methodology to avoid “over-fitting” data. The cvDSA routine is an R software package available for download. An alternative R-package (DSA) based on the same principles as the cvDSA routine (i.e., cross-validation, loss function), but one that is faster and with additional refinements for selection and estimation of conditional means, is also available for download. Analyses of real and simulated data were conducted to demonstrate the use of these algorithms, and to compare MSMs where the causal effects were assumed (i.e., investigator-defined), with MSMs selected by the cvDSA. The package was used also to select models for the nuisance parameter (treatment) model to estimate the MSM parameters with inverse-probability of treatment weight (IPTW) estimation. Other estimation procedures (i.e., G-computation and double robust IPTW) are available also with the package.
doi:10.1016/j.csda.2010.02.002
PMCID: PMC4259156
PMID: 25505354
Cross-validation; Machine learning; Marginal structural models; Lung function; Cardiovascular mortality
The continuum regression technique provides an appealing regression framework connecting ordinary least squares, partial least squares and principal component regression in one family. It offers some insight on the underlying regression model for a given application. Moreover, it helps to provide deep understanding of various regression techniques. Despite the useful framework, however, the current development on continuum regression is only for linear regression. In many applications, nonlinear regression is necessary. The extension of continuum regression from linear models to nonlinear models using kernel learning is considered. The proposed kernel continuum regression technique is quite general and can handle very flexible regression model estimation. An efficient algorithm is developed for fast implementation. Numerical examples have demonstrated the usefulness of the proposed technique.
doi:10.1016/j.csda.2013.06.016
PMCID: PMC3777709
PMID: 24058224
Continuum Regression; Kernel regression; Ordinary Least Squares; Principal Component Regression; Partial Least Squares
With three ordinal diagnostic categories, the most commonly used measures for the overall diagnostic accuracy are the volume under the ROC surface (VUS) and partial volume under the ROC surface (PVUS), which are the extensions of the area under the ROC curve (AUC) and partial area under the ROC curve (PAUC), respectively. A gold standard (GS) test on the true disease status is required to estimate the VUS and PVUS. However, oftentimes it may be difficult, inappropriate, or impossible to have a GS because of misclassification error, risk to the subjects or ethical concerns. Therefore, in many medical research studies, the true disease status may remain unobservable. Under the normality assumption, a maximum likelihood (ML) based approach using the expectation–maximization (EM) algorithm for parameter estimation is proposed. Three methods using the concepts of generalized pivot and parametric/nonparametric bootstrap for confidence interval estimation of the difference in paired VUSs and PVUSs without a GS are compared. The coverage probabilities of the investigated approaches are numerically studied. The proposed approaches are then applied to a real data set of 118 subjects from a cohort study in early stage Alzheimer’s disease (AD) from the Washington University Knight Alzheimer’s Disease Research Center to compare the overall diagnostic accuracy of early stage AD between two different pairs of neuropsychological tests.
doi:10.1016/j.csda.2013.07.007
PMCID: PMC3883051
PMID: 24415817
EM algorithm; Generalized pivot; Gold standard; Parametric bootstrap; Volume under the ROC surface
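As a point of reference for the VUS abstract above, the quantity itself has a simple empirical form when a gold standard is available: it is the proportion of triples, one test score from each of the three ordered diagnostic classes, that are correctly ordered. The Python sketch below illustrates only that definition. It is not the EM-based estimator without a gold standard that the abstract proposes, and the function name `empirical_vus` and the choice to count ties as incorrect orderings are assumptions.

```python
from itertools import product

def empirical_vus(class1, class2, class3):
    """Empirical volume under the ROC surface for three ordered classes.

    class1, class2, class3: test scores for subjects whose true class is
    known (i.e. a gold standard exists), ordered from least to most severe.
    Returns the fraction of triples (x1, x2, x3) with x1 < x2 < x3.
    Ties are counted as incorrectly ordered (an illustrative choice).
    """
    total = len(class1) * len(class2) * len(class3)
    if total == 0:
        raise ValueError("each class needs at least one score")
    ordered = sum(
        1 for x1, x2, x3 in product(class1, class2, class3) if x1 < x2 < x3
    )
    return ordered / total
```

A test that separates the three classes perfectly gives a value of 1, while an uninformative continuous test gives about 1/6, the probability that three independent draws from one distribution fall in a given order.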
Correlated or clustered failure time data often occur in medical studies, among other fields (Cai and Prentice, 1995; Kalbfleisch and Prentice, 2002), and sometimes such data arise together with interval censoring (Wang et al., 2006). Furthermore, the failure time of interest may be related to the cluster size. For example, Williamson et al. (2008) discussed such an example arising from a lymphatic filariasis study. A simple and common approach to the analysis of these data is to simplify or convert interval-censored data to right-censored data due to the lack of proper inference procedures for direct analysis of these data. In this paper, two procedures are presented for regression analysis of clustered failure time data that allow both interval censoring and informative cluster size. Simulation studies are conducted to evaluate the presented approaches and they are applied to a motivating example.
doi:10.1016/j.csda.2010.01.035
PMCID: PMC4240509
PMID: 25419023
Interval censoring; Informative cluster size; Weibull model; Within-cluster resampling
This paper discusses regression analysis of interval-censored failure time data, which occur in many fields including demographical, epidemiological, financial, medical, and sociological studies. For the problem, we focus on the situation where the survival time of interest can be described by the additive hazards model and a multiple imputation approach is presented for inference. A major advantage of the approach is its simplicity and it can be easily implemented by using the existing software packages for right-censored failure time data. Extensive simulation studies are conducted which indicate that the approach performs well for practical situations and is comparable to the existing methods. The methodology is applied to a set of interval-censored failure time data arising from an AIDS clinical trial.
doi:10.1016/j.csda.2009.10.022
PMCID: PMC4240511
PMID: 25419022
The Weibull family is widely used to model failure data, or lifetime data, although the classical two-parameter Weibull distribution is limited to positive data and monotone failure rate. The parameters of the Weibull model are commonly obtained by maximum likelihood estimation; however, it is well-known that this estimator is not robust when dealing with contaminated data. A new robust procedure is introduced to fit a Weibull model by using L2 distance, i.e. integrated square distance, of the Weibull probability density function. The Weibull model is augmented with a weight parameter to robustly deal with contaminated data. Results comparing a maximum likelihood estimator with an L2 estimator are given in this article, based on both simulated and real data sets. It is shown that this new L2 parametric estimation method is more robust and does a better job than maximum likelihood in the newly proposed Weibull model when data are contaminated. The same preference for L2 distance criterion and the new Weibull model also happens for right-censored data with contamination.
doi:10.1016/j.csda.2013.05.009
PMCID: PMC3718081
PMID: 23888090
Weibull distribution; L2 distance; Robust estimator; Maximum likelihood; Right-censored data; Contamination
In clinical studies, covariates are often measured with error due to biological fluctuations, device error and other sources. Summary statistics and regression models that are based on mismeasured data will differ from the corresponding analysis based on the “true” covariate. Statistical analysis can be adjusted for measurement error; however, various methods exhibit a tradeoff between convenience and performance. Moment Adjusted Imputation (MAI) is a method for measurement error in a scalar latent variable that is easy to implement and performs well in a variety of settings. In practice, multiple covariates may be similarly influenced by biological fluctuations, inducing correlated multivariate measurement error. The extension of MAI to the setting of multivariate latent variables involves unique challenges. Alternative strategies are described, including a computationally feasible option that is shown to perform well.
doi:10.1016/j.csda.2013.04.017
PMCID: PMC3780432
PMID: 24072947
Moment adjusted imputation; Multivariate measurement error; Logistic Regression; Regression calibration
The two main algorithms that have been considered for fitting constrained marginal models to discrete data, one based on Lagrange multipliers and the other on a regression model, are studied in detail. It is shown that the updates produced by the two methods are identical, but that the Lagrangian method is more efficient in the case of identically distributed observations. A generalization is given of the regression algorithm for modelling the effect of exogenous individual-level covariates, a context in which the use of the Lagrangian algorithm would be infeasible for even moderate sample sizes. An extension of the method to likelihood-based estimation under L1-penalties is also considered.
doi:10.1016/j.csda.2013.02.001
PMCID: PMC3686142
PMID: 23794772
categorical data; L1-penalty; marginal log-linear model; maximum likelihood; non-linear constraint
Three recent nonparametric methodologies for estimating a monotone regression function F and its inverse F^{-1} are (1) the inverse kernel method DNP (Dette et al. (2005), Dette and Scheder (2010)), (2) the monotone spline (Kong and Eubank (2006)) and (3) the data adaptive method NAM (Bhattacharya and Lin (2010), (2011)), with roots in isotonic regression (Ayer et al. (1955), Bhattacharya and Kong (2007)). All three have asymptotically optimal error rates. In this article their finite sample performances are compared using extensive simulation from diverse models of interest, and by analysis of real data. Let there be m distinct values of the independent variable x among N observations y. The results show that if m is relatively small compared to N then generally the NAM performs best, while the DNP outperforms the other methods when m is O(N) unless there is a substantial clustering of the values of the independent variable x.
doi:10.1016/j.csda.2013.01.023
PMCID: PMC3756697
PMID: 23997381
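The monotone regression abstract above notes that one of the compared methods has roots in isotonic regression. As background, classical isotonic regression can be computed with the pool-adjacent-violators algorithm, sketched below in Python. This is not an implementation of DNP, the monotone spline, or NAM; the function name `pava` and the optional weight argument are illustrative assumptions. The responses are assumed to be already sorted by the independent variable x.

```python
def pava(y, w=None):
    """Nondecreasing weighted least-squares fit (pool adjacent violators).

    y: responses ordered by increasing x.
    w: optional positive weights, e.g. replicate counts at each distinct x.
    Returns the fitted values, one per element of y.
    """
    if w is None:
        w = [1.0] * len(y)
    blocks = []  # each block holds [weighted mean, total weight, length]
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit
```

With m distinct x values among N observations, one natural use is to pass the m group means as `y` and the group sizes as `w`.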
A fully automated smoothing procedure for uniformly-sampled datasets is described. The algorithm, based on a penalized least squares method, allows fast smoothing of data in one and higher dimensions by means of the discrete cosine transform. Automatic choice of the amount of smoothing is carried out by minimizing the generalized cross-validation score. An iteratively weighted robust version of the algorithm is proposed to deal with occurrences of missing and outlying values. Simplified Matlab codes with typical examples in one to three dimensions are provided. A complete user-friendly Matlab program is also supplied. The proposed algorithm, which is very fast, automatic and robust and requires little storage, provides an efficient smoother for numerous applications in the area of data analysis.
doi:10.1016/j.csda.2009.09.020
PMCID: PMC4008475
PMID: 24795488
CAMSID: cams3650
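The core idea of the smoothing abstract above can be sketched in one dimension. For equally spaced data, a least-squares fit penalized by squared second differences (with reflective boundaries) is diagonalized by the discrete cosine transform, so smoothing reduces to shrinking each DCT coefficient. The Python sketch below assumes that standard form and a user-supplied smoothing parameter `s`. It omits the automatic choice of `s` by generalized cross-validation, the robust reweighting, and the higher-dimensional case, and it uses a slow O(n²) transform for clarity. All function names are hypothetical and unrelated to the authors' Matlab code.

```python
import math

def _scale(k, n):
    # Orthonormal DCT-II scaling factor.
    return math.sqrt((1.0 if k == 0 else 2.0) / n)

def dct(x):
    """Orthonormal DCT-II of a sequence (direct O(n^2) evaluation)."""
    n = len(x)
    return [
        _scale(k, n)
        * sum(x[j] * math.cos(math.pi * (2 * j + 1) * k / (2 * n)) for j in range(n))
        for k in range(n)
    ]

def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(
            _scale(k, n) * c[k] * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
            for k in range(n)
        )
        for j in range(n)
    ]

def dct_smooth(y, s):
    """Penalized least-squares smoother for equally spaced data.

    s >= 0 is the smoothing parameter (assumed given here, not chosen by GCV).
    Each DCT coefficient k is multiplied by 1 / (1 + s * lambda_k**2), where
    lambda_k = 2 - 2*cos(k*pi/n) is the magnitude of the k-th eigenvalue of
    the second-difference operator with reflective boundaries.
    """
    n = len(y)
    coeffs = dct(y)
    gains = [
        1.0 / (1.0 + s * (2.0 - 2.0 * math.cos(k * math.pi / n)) ** 2)
        for k in range(n)
    ]
    return idct([g * c for g, c in zip(gains, coeffs)])
```

Because the gain at k = 0 equals 1, the smoother preserves the mean of the data; setting `s = 0` returns the data unchanged, and larger `s` pulls the fit toward a constant.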
The L1 norm has been applied in numerous variations of principal component analysis (PCA). L1-norm PCA is an attractive alternative to traditional L2-based PCA because it can impart robustness in the presence of outliers and is indicated for models where standard Gaussian assumptions about the noise may not apply. Of all the previously-proposed PCA schemes that recast PCA as an optimization problem involving the L1 norm, none provide globally optimal solutions in polynomial time. This paper proposes an L1-norm PCA procedure based on the efficient calculation of the optimal solution of the L1-norm best-fit hyperplane problem. We present a procedure called L1-PCA* based on the application of this idea that fits data to subspaces of successively smaller dimension. The procedure is implemented and tested on a diverse problem suite. Our tests show that L1-PCA* is the indicated procedure in the presence of unbalanced outlier contamination.
doi:10.1016/j.csda.2012.11.007
PMCID: PMC3746759
PMID: 23976807
principal component analysis; linear programming; L1 regression
In longitudinal cluster randomized clinical trials (cluster-RCT), subjects are nested within a higher level unit such as clinics and are evaluated for outcome repeatedly over the study period. This study design results in a three level hierarchical data structure. When the primary goal is to test the hypothesis that an intervention has an effect on the rate of change in the outcome over time and the between-subject variation in slopes is substantial, the subject-specific slopes are often modeled as random coefficients in a mixed-effects linear model. In this paper, we propose approaches for determining the sample size for each level of a 3-level hierarchical trial design based on ordinary least squares (OLS) estimates for detecting a difference in mean slopes between two intervention groups when the slopes are modeled as random. Notably, the sample size is not a function of the variances of either the second or the third level random intercepts and depends on the number of second and third level data units only through their product. Simulation results indicate that the OLS-based power and sample sizes are virtually identical to the empirical maximum likelihood based estimates even with varying cluster sizes. Sample sizes for random versus fixed slope models are also compared. The effects of the variance of the random slope on the sample size determinations are shown to be enormous. Therefore, when between-subject variations in outcome trends are anticipated to be significant, sample size determinations based on a fixed slope model can result in a seriously underpowered study.
doi:10.1016/j.csda.2012.11.016
PMCID: PMC3580878
PMID: 23459110
longitudinal cluster RCT; three level data; power; sample size; random slope; effect size
The semiparametric accelerated hazards mixture cure model provides a useful alternative to analyze survival data with a cure fraction if covariates of interest have a gradual effect on the hazard of uncured patients. However, the application of the model may be hindered by the computational intractability of its estimation method due to non-smooth estimating equations involved. We propose a new semiparametric estimation method based on a smooth estimating equation for the model and demonstrate that the new method makes the parameter estimation more tractable without loss of efficiency. The proposed method is used to fit the model to a SEER breast cancer data set.
doi:10.1016/j.csda.2012.09.017
PMCID: PMC3535878
PMID: 23293406
EM algorithm; Kernel-smoothed approximation; Non-smooth estimating equation; Profile likelihood
Data processing and source identification using lower dimensional hidden structure play an essential role in many fields of applications, including image processing, neural networks, genome studies, signal processing and other areas where large datasets are often encountered. One of the common methods for source separation using lower dimensional structure involves the use of Independent Component Analysis, which is based on a linear representation of the observed data in terms of independent hidden sources. The problem thus involves the estimation of the linear mixing matrix and the densities of the independent hidden sources. However, the solution to the problem depends on the identifiability of the sources. This paper first presents a set of sufficient conditions to establish the identifiability of the sources and the mixing matrix using moment restrictions of the hidden source variables. Under such sufficient conditions a semi-parametric maximum likelihood estimate of the mixing matrix is obtained using a class of mixture distributions. The consistency of our proposed estimate is established under additional regularity conditions. The proposed method is illustrated and compared with existing methods using simulated and real data sets.
doi:10.1016/j.csda.2012.09.012
PMCID: PMC3921001
PMID: 24526802
Constrained EM-algorithm; Mixture Density Estimation; Source Identification
Many clinical trials compare the efficacy of K (≥3) treatments in repeated measurement studies. However, the design of such trials has received relatively little attention from researchers. Zhang & Ahn (2012) derived a closed-form sample size formula for two-sample comparisons of time-averaged responses using the generalized estimating equation (GEE) approach, which takes into account different correlation structures and missing data patterns. In this paper, we extend the sample size formula to scenarios where K (≥3) treatments are compared simultaneously to detect time-averaged differences in treatment effect. A closed-form sample size formula based on the noncentral χ² test statistic is derived. We conduct simulation studies to assess the performance of the proposed sample size formula under various correlation structures from a damped exponential family, random and monotone missing patterns, and different observation probabilities. Simulation studies show that empirical powers and type I errors are close to their nominal levels. The proposed sample size formula is illustrated using a real clinical trial example.
doi:10.1016/j.csda.2012.08.013
PMCID: PMC3505113
PMID: 23183937
Based on the Bayes modal estimate of factor scores in binary latent variable models, this paper proposes two new limited information estimators for the factor analysis model with a logistic link function for binary data based on Bernoulli distributions up to the second and the third order with maximum likelihood estimation and Laplace approximations to required integrals. These estimators and two existing limited information weighted least squares estimators are studied empirically. The limited information estimators compare favorably to full information estimators based on marginal maximum likelihood, MCMC, and multinomial distribution with a Laplace approximation methodology. Among the various estimators, Maydeu-Olivares and Joe’s (2005) weighted least squares limited information estimators implemented with Laplace approximations for probabilities are shown in a simulation to have the best root mean square errors.
doi:10.1016/j.csda.2012.06.022
PMCID: PMC3418349
PMID: 22904587
Limited Information; Laplace Approximation; Binary Response; Marginal Likelihood; Factor Scores
In parametric hierarchical models, it is standard practice to place mean and variance constraints on the latent variable distributions for the sake of identifiability and interpretability. Because incorporation of such constraints is challenging in semiparametric models that allow latent variable distributions to be unknown, previous methods either constrain the median or avoid constraints. In this article, we propose a centered stick-breaking process (CSBP), which induces mean and variance constraints on an unknown distribution in a hierarchical model. This is accomplished by viewing an unconstrained stick-breaking process as a parameter-expanded version of a CSBP. An efficient blocked Gibbs sampler is developed for approximate posterior computation. The methods are illustrated through a simulated example and an epidemiologic application.
PMCID: PMC3869464
PMID: 24363478
Dirichlet process; Latent variables; Moment constraints; Nonparametric Bayes; Parameter expansion; Random effects
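The stick-breaking abstract above rests on two ingredients that are easy to illustrate: a truncated stick-breaking draw of mixture weights, and a rescaling of the atoms so that the resulting discrete distribution has mean 0 and variance 1. The Python sketch below shows one simple way to do this by centering and scaling an unconstrained draw. It is an assumption-laden illustration, not necessarily the authors' exact CSBP construction or their blocked Gibbs sampler; the function names, the truncation level `k`, and the standard normal base measure are all hypothetical choices.

```python
import math
import random

def stick_breaking_weights(alpha, k, rng):
    """Weights of a stick-breaking process truncated at k atoms.

    Breaks v_h ~ Beta(1, alpha); weight_h = v_h * prod_{l<h} (1 - v_l).
    The last atom takes the leftover stick so the weights sum to 1.
    """
    weights, remaining = [], 1.0
    for _ in range(k - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights

def center_and_scale(atoms, weights):
    """Shift and rescale atoms so the weighted mean is 0 and variance is 1."""
    mean = sum(w * a for w, a in zip(weights, atoms))
    var = sum(w * (a - mean) ** 2 for w, a in zip(weights, atoms))
    if var <= 0.0:
        raise ValueError("degenerate draw: variance is zero")
    sd = math.sqrt(var)
    return [(a - mean) / sd for a in atoms]

# Example draw: unconstrained atoms from N(0, 1), then moment-constrained.
rng = random.Random(0)
weights = stick_breaking_weights(alpha=1.0, k=20, rng=rng)
atoms = [rng.gauss(0.0, 1.0) for _ in weights]
constrained_atoms = center_and_scale(atoms, weights)
```

The unconstrained draw plays the role of an expanded parameterization here: the location and scale removed by `center_and_scale` are exactly the quantities that the mean and variance constraints make redundant.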
A primary challenge in unsupervised clustering using mixture models is the selection of a family of basis distributions flexible enough to succinctly represent the distributions of the target subpopulations. In this paper we introduce a new family of Gaussian Well distributions (GWDs) for clustering applications where the target subpopulations are characterized by hollow [hyper-]elliptical structures. We develop the primary theory pertaining to the GWD, including mixtures of GWDs, selection of prior distributions, and computationally efficient inference strategies using Markov chain Monte Carlo. We demonstrate the utility of our approach, as compared to standard Gaussian mixture methods on a synthetic dataset, and exemplify its applicability on an example from immunofluorescence imaging, emphasizing the improved interpretability and parsimony of the GWD-based model.
doi:10.1016/j.csda.2012.03.027
PMCID: PMC3384503
PMID: 22754052
Gaussian mixtures; Poisson point processes; subtractive mixtures; histology
This paper considers model-based methods for estimation of the adjusted attributable risk (AR) in both case-control and cohort studies. An earlier review discussed approaches for both types of studies, using the standard logistic regression model for case-control studies, and for cohort studies proposing the equivalent Poisson model in order to account for the additional variability in estimating the distribution of exposures and covariates from the data. In this paper we revisit case-control studies, arguing for the equivalent Poisson model in this case as well. Using the delta method with the Poisson model, we provide general expressions for the asymptotic variance of the AR for both types of studies. This includes the generalized AR, which extends the original idea of attributable risk to the case where the exposure is not completely eliminated. These variance expressions can be easily programmed in any statistical package that includes Poisson regression and has capabilities for simple matrix algebra. In addition, we discuss computation of standard errors and confidence limits using bootstrap resampling. For cohort studies, use of the bootstrap allows binary regression models with link functions other than the logit.
doi:10.1016/j.csda.2012.04.017
PMCID: PMC3462467
PMID: 23049150
adjusted attributable risk; case-control study; cohort study; Poisson regression; delta method; model-based estimate; bootstrap methods
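For orientation on the attributable risk abstract above, the unadjusted AR in a cohort study is the proportion of disease risk that would be removed if exposure were eliminated: AR = (P(D) − P(D | unexposed)) / P(D). The Python sketch below computes this crude quantity from a 2×2 table. It ignores covariate adjustment, the Poisson model, and the delta-method and bootstrap variance calculations that are the subject of the abstract, and the function name and argument layout are illustrative assumptions.

```python
def crude_attributable_risk(a, b, c, d):
    """Unadjusted attributable risk from cohort counts.

    a: exposed with disease      b: exposed without disease
    c: unexposed with disease    d: unexposed without disease
    Returns (P(D) - P(D | unexposed)) / P(D).
    """
    n = a + b + c + d
    if a + c == 0 or c + d == 0:
        raise ValueError("need at least one case and one unexposed subject")
    p_disease = (a + c) / n
    p_disease_unexposed = c / (c + d)
    return (p_disease - p_disease_unexposed) / p_disease
```

For example, with 30 cases among 100 exposed subjects and 10 cases among 100 unexposed subjects, the overall risk is 0.2 and the risk in the unexposed is 0.1, so the crude AR is 0.5.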