Linear mixed-effects models involve fixed effects, random effects and covariance structure, which require model selection to simplify a model and to enhance its interpretability and predictability. In this article, we develop, in the context of linear mixed-effects models, the generalized degrees of freedom and an adaptive model selection procedure defined by a data-driven model complexity penalty. Numerically, the procedure performs well against its competitors not only in selecting fixed effects but in selecting random effects and covariance structure as well. Theoretically, asymptotic optimality of the proposed methodology is established over a class of information criteria. The proposed methodology is applied to the BioCycle study, to determine predictors of hormone levels among premenopausal women and to assess variation in hormone levels both between and within women across the menstrual cycle.
Adaptive penalty; linear mixed-effects models; loss estimation; generalized degrees of freedom
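As a point of reference for the construction above, one standard definition of generalized degrees of freedom, in the spirit of Ye (1998), measures the total sensitivity of the fitted values to the observations; the paper's mixed-model version and its data-driven penalty may differ in detail.

\[
\mathrm{gdf} \;=\; \sum_{i=1}^{n} \frac{\operatorname{cov}(\hat{\mu}_i, y_i)}{\sigma^2},
\qquad
\text{criterion: } \|y - \hat{\mu}\|^2 + \lambda(y)\,\widehat{\mathrm{gdf}},
\]

where \(\hat{\mu}\) denotes the fitted values, \(\sigma^2\) the error variance, and \(\lambda(y)\) a data-driven complexity penalty chosen adaptively rather than fixed at 2 (AIC-like) or \(\log n\) (BIC-like).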
Canonical correlation analysis (CCA) is a widely used multivariate method for assessing the association between two sets of variables. However, when the number of variables far exceeds the number of subjects, as in large-scale genomic studies, the traditional CCA method is not appropriate. In addition, when the variables are highly correlated, the sample covariance matrices become unstable or undefined. To overcome these two issues, sparse canonical correlation analysis (SCCA) for multiple data sets has been proposed using a Lasso-type penalty. However, these methods do not have direct control over the sparsity of the solution. An additional step that uses the Bayesian Information Criterion (BIC) has also been suggested to further filter out unimportant features. In this paper, a comparison of four penalty functions (Lasso, Elastic-net, SCAD and Hard-threshold) for SCCA, with and without the BIC filtering step, is carried out using both real and simulated genotypic and mRNA expression data. This study indicates that the SCAD penalty with the BIC filter is a preferable penalty function for applying SCCA to genomic data.
SCCA; Lasso; Elastic-net; SCAD; BIC; penalty; SNP; mRNA expression
We evaluate the performance of the Dirichlet process mixture (DPM) and the latent class model (LCM) in identifying autism phenotype subgroups based on categorical autism spectrum disorder (ASD) diagnostic features from the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition Text Revision. A simulation study is designed to mimic the diagnostic features in the ASD dataset in order to evaluate the LCM and DPM methods in this context. Likelihood-based information criteria and DPM partitioning are used to identify the best fitting models. The Rand statistic is used to compare the performance of the methods in recovering simulated phenotype subgroups. Our results indicate excellent recovery of the simulated subgroup structure for both methods. The LCM performs slightly better than the DPM when the correct number of latent subgroups is selected a priori. The DPM method utilizes a maximum a posteriori (MAP) criterion to estimate the number of classes and yields results in fair agreement with the LCM method. Comparison of model fit indices in identifying the best fitting LCM shows that the adjusted Bayesian information criterion (ABIC) picks the correct number of classes over 90% of the time. Thus, when diagnostic features are categorical and there is some prior information regarding the number of latent classes, LCM in conjunction with ABIC is preferred.
autism; Dirichlet process mixture; latent class model
Random-effects change point models are formulated for longitudinal data obtained from cognitive tests. The conditional distribution of the response variable in a change point model is often assumed to be normal even if the response variable is discrete and shows ceiling effects. For the sum score of a cognitive test, the binomial and the beta-binomial distributions are presented as alternatives to the normal distribution. Smooth shapes for the change point models are imposed. Estimation is by marginal maximum likelihood, where a parametric population distribution for the random change point is combined with a non-parametric mixing distribution for the other random effects. An extension to latent class modelling is possible for the case in which some individuals do not experience a change in cognitive ability. The approach is illustrated using data from a longitudinal study of Swedish octogenarians and nonagenarians that began in 1991. Change point models are applied to investigate cognitive change in the years before death.
Beta-binomial distribution; Latent class model; Mini-mental state examination; Random-effects model
A new comprehensive procedure for statistical analysis of two-dimensional polyacrylamide gel electrophoresis (2D PAGE) images is proposed, including protein region quantification, normalization and statistical analysis. Protein regions are defined by the master watershed map that is obtained from the mean gel. By working with these protein regions, the approach bypasses the current bottleneck in the analysis of 2D PAGE images: it does not require spot matching. Background correction is implemented in each protein region by local segmentation. Two-dimensional locally weighted smoothing (LOESS) is proposed to remove any systematic bias after quantification of protein regions. Proteins are separated into mutually independent sets based on detected correlations, and a multivariate analysis is used on each set to detect the group effect. A strategy for multiple hypothesis testing based on this multivariate approach combined with the usual Benjamini-Hochberg FDR procedure is formulated and applied to the differential analysis of 2D PAGE images. Each step in the analytical protocol is illustrated using an actual dataset. The effectiveness of the proposed methodology is demonstrated on simulated gels in comparison with the commercial software packages PDQuest and Dymension. We also introduce a new procedure for simulating gel images.
proteomics; 2D PAGE; watershed region; spot matching; statistical image analysis; high dimension
We consider the estimation of the parameters indexing a parametric model for the conditional distribution of a diagnostic marker given covariates and disease status. Such models are useful for evaluating whether, and to what extent, a marker's ability to accurately detect or discard disease depends on patient characteristics. A frequent problem that complicates the estimation of the model parameters is that estimation must be conducted from observational studies. Often, in such studies not all patients undergo the gold standard assessment of disease. Furthermore, the decision as to whether a patient undergoes verification is not controlled by study design. In such scenarios, maximum likelihood estimators based on subjects with observed disease status are generally biased. In this paper, we propose estimators of the model parameters that adjust for selection to verification that may depend on measured patient characteristics, and that additionally adjust for an assumed degree of residual association. Such estimators may be used as part of a sensitivity analysis for plausible degrees of residual association. We describe a doubly robust estimator that has the attractive feature of being consistent if either a model for the probability of selection to verification or a model for the probability of disease among the verified subjects (but not necessarily both) is correct.
Missing at Random; Nonignorable; Missing Covariate; Sensitivity Analysis; Semiparametric; Diagnosis
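For orientation, doubly robust estimators of this kind typically take an augmented inverse-probability-weighted form; the exact estimating function used in the paper, and its extension to a specified degree of residual (nonignorable) association, may differ. With V the verification indicator, X the marker and covariates, D the disease status (observed only when V = 1), \(\pi(X)\) a working model for selection to verification, and \(U(\theta; D, X)\) the complete-data score, a generic estimating function is

\[
\frac{V}{\pi(X)}\,U(\theta; D, X)\;-\;\Bigl(\frac{V}{\pi(X)} - 1\Bigr)\,E\bigl\{U(\theta; D, X)\mid X\bigr\},
\]

where the inner expectation is evaluated under a working model for disease given X fitted to the verified subjects; under a missing-at-random selection mechanism this has mean zero if either working model is correctly specified.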
We present a Bayesian variable selection method for the setting in which the number of independent variables or predictors in a particular dataset is much larger than the available sample size. Most existing methods allow some degree of correlation among predictors but do not exploit these correlations in variable selection; our method explicitly accounts for correlations among the predictors when selecting variables. Our correlation-based stochastic search (CBS) method, the hybrid-CBS algorithm, extends a popular search algorithm for high-dimensional data, the stochastic search variable selection (SSVS) method. Similar to SSVS, we search the space of all possible models using variable addition, deletion or swap moves. However, our moves through the model space are designed to accommodate correlations among the variables. We describe our approach for continuous, binary, ordinal, and count outcome data. The impact of choices of prior distributions and hyper-parameters is assessed in simulation studies. We also examine the performance of variable selection and prediction as the correlation structure of the predictors varies. We find that the hybrid-CBS yields lower prediction errors and better identifies the true outcome-associated predictors than SSVS when predictors are moderately to highly correlated. We illustrate the method on data from a proteomic profiling study of melanoma, a skin cancer.
Correlated predictors; correlation-based search; proteomic data
Consider clustered matched-pair studies for non-inferiority where clusters are independent but units in a cluster are correlated. An inexpensive new procedure and the expensive standard one are applied to each unit and outcomes are binary responses. Appropriate statistics for testing non-inferiority of a new procedure have recently been developed by several investigators. In this note, we investigate the power and sample size requirements of the clustered matched-pair study for non-inferiority. The power of a test is related primarily to the number of clusters; the effect of cluster size on power is secondary. The efficiency of a clustered matched-pair design is inversely related to the intra-class correlation coefficient within a cluster. We present explicit formulae for obtaining the number of clusters for a given cluster size, and the cluster size for a given number of clusters, to achieve a specified power. We also provide alternative sample size calculations when available information regarding the parameters is limited. The formulae can be useful in designing a clustered matched-pair study for non-inferiority. An example of determining the sample size needed to establish non-inferiority in a clustered matched-pair study is provided.
binary outcomes; clustered matched-pair; intra-class correlation coefficient; non-inferiority; power; sample size
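As a rough illustration of how the intra-class correlation and cluster size enter such calculations, the sketch below inflates an independent-pairs sample size by the usual design effect 1 + (m - 1)ρ and converts it to a number of clusters; the paper's explicit formulae for the non-inferiority test are more detailed.

```python
import math

def clusters_needed(n_pairs_independent, cluster_size, icc):
    """Approximate number of clusters for a clustered matched-pair design.

    n_pairs_independent: matched pairs required if all pairs were independent
    cluster_size:        matched pairs per cluster (m)
    icc:                 intra-class correlation within a cluster (rho)

    Uses the standard design effect 1 + (m - 1) * rho; this is an
    illustrative approximation, not the paper's formula.
    """
    design_effect = 1 + (cluster_size - 1) * icc
    return math.ceil(n_pairs_independent * design_effect / cluster_size)

# Example: 200 independent pairs needed, clusters of 5 pairs, rho = 0.1
print(clusters_needed(200, 5, 0.1))  # 56 clusters
```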
Most variable selection techniques focus on first-order linear regression models. Often, interaction and quadratic terms are also of interest, but the number of candidate predictors grows very fast with the number of original predictors, making variable selection more difficult. Forward selection algorithms are thus developed that enforce natural hierarchies in second-order models to control the entry rate of uninformative effects and to equalize the false selection rates from first-order and second-order terms. Method performance is compared through Monte Carlo simulation and illustrated with data from a Cox regression and from a response surface experiment.
Bagging; False selection rate; Model selection; Response optimization; Variable selection
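A minimal sketch of the core idea: forward selection in which a quadratic or interaction term becomes eligible only after its parent main effects have entered (strong hierarchy). The actual algorithms also calibrate entry thresholds to equalize false selection rates, which is omitted here.

```python
import itertools
import numpy as np

def rss(y, X):
    # residual sum of squares of an OLS fit with intercept
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

def forward_hierarchical(y, X, max_terms=10):
    """Greedy forward selection over main effects, squares, and two-way
    interactions, enforcing strong hierarchy (illustrative only)."""
    p = X.shape[1]
    mains = [(j,) for j in range(p)]
    seconds = [(j, j) for j in range(p)] + list(itertools.combinations(range(p), 2))

    def design(terms):
        cols = [X[:, t[0]] if len(t) == 1 else X[:, t[0]] * X[:, t[1]] for t in terms]
        return np.column_stack(cols)

    selected = []
    while len(selected) < max_terms:
        in_mains = {t[0] for t in selected if len(t) == 1}
        eligible = [t for t in mains if t not in selected]
        eligible += [t for t in seconds
                     if t not in selected and set(t) <= in_mains]
        if not eligible:
            break
        best = min(eligible, key=lambda t: rss(y, design(selected + [t])))
        # a real algorithm would also test whether the improvement is
        # large enough before admitting the term
        selected.append(best)
    return selected
```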
We consider the problem of dichotomizing a continuous covariate when performing a regression analysis based on a generalized estimation approach. The problem involves estimation of the cutpoint for the covariate and testing the hypothesis that the binary covariate constructed from the continuous covariate has a significant impact on the outcome. Because multiple testing is used to find the optimal cutpoint, we need to adjust the usual significance test to preserve the type-I error rate. We illustrate the techniques on a data set of patients who received unrelated-donor hematopoietic stem cell transplantation. Here the questions are whether the CD34 cell dose given to the patient affects the outcome of the transplant and what the smallest cell dose needed for good outcomes is.
Dichotomized outcomes; Generalized estimating equations; Generalized linear model; Pseudo-values; Survival analysis
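The flavor of the required adjustment can be conveyed with a maximally selected statistic: scan candidate cutpoints, keep the largest test statistic, and calibrate it against its permutation distribution. The sketch below uses a simple two-sample statistic on an uncensored outcome; the paper's adjustment for pseudo-value/GEE regression with censored outcomes is more involved.

```python
import numpy as np

def max_stat(x, y, cutpoints, min_group=5):
    # largest two-sample z-type statistic over candidate cutpoints
    stats = []
    for c in cutpoints:
        hi = x > c
        if hi.sum() < min_group or (~hi).sum() < min_group:
            continue
        diff = y[hi].mean() - y[~hi].mean()
        se = np.sqrt(y[hi].var(ddof=1) / hi.sum() + y[~hi].var(ddof=1) / (~hi).sum())
        stats.append(abs(diff) / se)
    return max(stats)

def adjusted_pvalue(x, y, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    cuts = np.quantile(x, np.linspace(0.1, 0.9, 17))   # candidate cutpoints
    obs = max_stat(x, y, cuts)
    perm = [max_stat(x, rng.permutation(y), cuts) for _ in range(n_perm)]
    # p-value that accounts for searching over many cutpoints
    return (1 + sum(p >= obs for p in perm)) / (n_perm + 1)
```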
Exact calculations of model posterior probabilities or related quantities are often infeasible due to the analytical intractability of predictive densities. Here new approximations to obtain predictive densities are proposed and contrasted with those based on the Laplace method. Our theory and a numerical study indicate that the proposed methods are easy to implement, computationally efficient, and accurate over a wide range of hyperparameters. In the context of generalized linear models (GLMs), we show that they can be employed to facilitate the posterior computation under three general classes of informative priors on regression coefficients. A real example is provided to demonstrate the feasibility and usefulness of the proposed methods in a fully Bayes variable selection procedure.
Laplace Approximation; GLM; Normal Prior; Power Prior; Conjugate Prior; Asymptotic Normality; Logistic Regression
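For reference, the Laplace method against which the new approximations are contrasted replaces the integrand by a Gaussian centred at the posterior mode:

\[
\int e^{h(\theta)}\,d\theta \;\approx\; (2\pi)^{d/2}\,\bigl|-\nabla^2 h(\hat{\theta})\bigr|^{-1/2}\,e^{h(\hat{\theta})},
\]

where \(h(\theta)=\log\{f(y\mid\theta)\,\pi(\theta)\}\), \(\hat{\theta}\) is the mode of \(h\), and \(d=\dim(\theta)\); applied to the predictive density, the relative approximation error is typically of order \(n^{-1}\).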
The Gibbs sampler has been used exclusively for compatible conditionals, for which it converges to a unique invariant joint distribution. However, conditional models are not always compatible. In this paper, a Gibbs sampling-based approach, the Gibbs ensemble, is proposed to search for a joint distribution that deviates least from a prescribed set of conditional distributions. The algorithm scales easily, so it can handle large data sets of high dimensionality. Using simulated data, we show that the proposed approach provides joint distributions that are less discrepant from the incompatible conditionals than those obtained by other methods discussed in the literature. The ensemble approach is also applied to a data set on geno-polymorphism and response to chemotherapy in patients with metastatic colorectal cancer.
Gibbs sampler; Conditionally specified distribution; Linear programming; Ensemble method; Odds ratio
Breast cancer is the most common non-skin cancer in women and the second most common cause of cancer-related death in U.S. women. It is well known that breast cancer survival varies by age at diagnosis. For most cancers, relative survival decreases with age, but breast cancer may show an unusual age pattern. To reveal the pattern of stage and age effects, we propose a semiparametric accelerated failure time partial linear model and develop an estimation method based on penalized splines and the rank estimation approach. Simulation studies demonstrate that the proposed method is comparable to the parametric approach when the data are not contaminated, and more stable than the parametric methods when the data are contaminated. By applying the proposed model and method to the breast cancer data set of Atlantic County, New Jersey from the SEER program, we reveal significant stage effects and show that women diagnosed around age 38 have consistently higher survival rates than either younger or older women.
Accelerated failure time model; Partial linear model; Penalized spline; Rank estimation; Robustness
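One natural way to write the model described above (the paper's exact specification may differ slightly) is

\[
\log T_i \;=\; Z_i^{\top}\beta \;+\; g(A_i) \;+\; \varepsilon_i,
\]

where \(T_i\) is the survival time, \(Z_i\) contains stage and other linear covariates, \(A_i\) is age at diagnosis, \(g\) is an unknown smooth function approximated by a penalized spline, and \(\beta\) together with the spline coefficients is estimated by minimizing a rank-based (e.g., Gehan-type) loss on the censored residuals, which is what confers robustness to contamination.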
The family of weighted likelihood estimators largely overlaps with minimum divergence estimators. Compared with maximum likelihood estimators, they are robust to data contamination. We define the class of generalized weighted likelihood estimators (GWLE), provide its influence function, and discuss efficiency requirements. We introduce a new truncated cubic-inverse weight, which is both first- and second-order efficient and more robust than previously reported weights. We also discuss new ways of selecting the smoothing bandwidth and weighted starting values for the iterative algorithm. The advantage of the truncated cubic-inverse weight is illustrated in a simulation study of three-component normal mixture models with large overlaps and heavy contamination. A real data example is also provided.
finite normal mixture; generalized weighted likelihood estimator; influence function; smoothing bandwidth; truncated cubic-inverse weight; weighted starting value
Generalized additive models (GAMs) have distinct advantages over generalized linear models as they allow investigators to make inferences about associations between outcomes and predictors without placing parametric restrictions on the associations. The variable of interest is often smoothed using a locally weighted regression (LOESS) and the optimal span (degree of smoothing) can be determined by minimizing the Akaike Information Criterion (AIC). A natural hypothesis when using GAMs is to test whether the smoothing term is necessary or if a simpler model would suffice. The statistic of interest is the difference in deviances between models including and excluding the smoothed term. As approximate chi-square tests of this hypothesis are known to be biased, permutation tests are a reasonable alternative. We compare the type I error rates of the chi-square test and of three permutation test methods using synthetic data generated under the null hypothesis. In each permutation method a distribution of differences in deviances is obtained from 999 permuted datasets and the null hypothesis is rejected if the observed statistic falls in the upper 5% of the distribution. One test is a conditional permutation test using the optimal span size for the observed data; this span size is held constant for all permutations. This test is shown to have an inflated type I error rate. Alternatively, the span size can be fixed a priori such that the span selection technique is not reliant on the observed data. This test is shown to be unbiased; however, the choice of span size is not clear. A third method is an unconditional permutation test where the optimal span size is selected for observed and permuted datasets. This test is unbiased though computationally intensive.
Generalized Additive Models; Type I Error; Permutation Test; Span Size Selection
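A compact sketch of the unconditional test, the key point being that the flexibility of the smoother is re-selected for every permuted dataset; here an intercept-only null model and AIC-selected polynomial degree stand in for the GLM-versus-GAM comparison and LOESS span selection.

```python
import numpy as np

def fit_rss(y, x, degrees=(1, 2, 3, 4, 5)):
    # RSS of the polynomial whose degree minimizes a Gaussian AIC;
    # degree selection stands in for LOESS span selection by AIC
    n = len(y)
    best_aic, best_rss = np.inf, None
    for d in degrees:
        X = np.vander(x, d + 1)               # includes an intercept column
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        aic = n * np.log(rss / n) + 2 * (d + 1)
        if aic < best_aic:
            best_aic, best_rss = aic, rss
    return best_rss

def unconditional_perm_test(y, x, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    rss_null = np.sum((y - y.mean()) ** 2)    # model without the smoothed term
    obs = rss_null - fit_rss(y, x)            # observed drop in deviance
    # flexibility is re-selected for every permuted dataset ("unconditional")
    null = [rss_null - fit_rss(y, rng.permutation(x)) for _ in range(n_perm)]
    return (1 + sum(d >= obs for d in null)) / (n_perm + 1)
```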
The assumption of proportional hazards (PH) fundamental to the Cox PH model sometimes may not hold in practice. In this paper, we propose a generalization of the Cox PH model in terms of the cumulative hazard function, taking a form similar to the Cox PH model but with the extension that the baseline cumulative hazard function is raised to a power function. Our model allows for interaction between covariates and the baseline hazard, and for the two-sample problem it includes the case of two Weibull distributions and of two extreme value distributions differing in both scale and shape parameters. The partial likelihood approach cannot be applied here to estimate the model parameters. We use the full likelihood approach via a cubic B-spline approximation for the baseline hazard to estimate the model parameters. A semi-automatic procedure for knot selection based on Akaike's Information Criterion is developed. We illustrate the applicability of our approach using real-life data.
censored survival data analysis; crossing hazards; Frailty model; maximum likelihood; regression; spline function; Akaike information criterion; Weibull distribution; extreme value distribution
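One parameterization consistent with this description (the paper's exact form may differ) raises the baseline cumulative hazard to a covariate-dependent power:

\[
\Lambda(t \mid Z) \;=\; \{\Lambda_0(t)\}^{\exp(\gamma^{\top} Z)}\,\exp(\beta^{\top} Z),
\]

where \(\Lambda_0\) is the baseline cumulative hazard; \(\gamma = 0\) recovers the Cox PH model, while nonzero \(\gamma\) lets covariates modify the shape of the baseline hazard, permitting crossing hazards and, with a Weibull baseline, two-sample differences in both scale and shape.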
Importance sampling is an efficient strategy for reducing the variance of certain bootstrap estimates. It has found wide applications in bootstrap quantile estimation, proportional hazards regression, bootstrap confidence interval estimation, and other problems. Although estimation of the optimal sampling weights is a special case of convex programming, generic optimization methods are frustratingly slow on problems with large numbers of observations. For instance, interior point and adaptive barrier methods must cope with forming, storing, and inverting the Hessian of the objective function. In this paper, we present an efficient procedure for calculating the optimal importance weights and compare its performance to standard optimization methods on a representative data set. The procedure combines several potent ideas for large scale optimization.
importance resampling; bootstrap; majorization; quasi-Newton acceleration
According to the American Cancer Society report (1999), cancer surpasses heart disease as the leading cause of death in the United States of America (USA) for people under age 85, making cancer research an important public health priority. Understanding how medical improvements are affecting cancer incidence, mortality and survival is critical for effective cancer control. In this paper, we study cancer survival trends using population-level cancer data. In particular, we develop a parametric Bayesian joinpoint regression model based on a Poisson distribution for the relative survival. To avoid having to identify the cause of death, the analysis is based solely on relative survival. The method is further extended to semiparametric Bayesian joinpoint regression models, wherein the parametric distributional assumptions of the joinpoint regression models are relaxed by modeling the distribution of regression slopes using Dirichlet process mixtures. We also consider the effect of adding covariates of interest to the joinpoint model. Three model selection criteria, namely the conditional predictive ordinate (CPO), the expected predictive deviance (EPD), and the deviance information criterion (DIC), are used to select the number of joinpoints. We analyze the grouped survival data for distant testicular cancer from the Surveillance, Epidemiology, and End Results (SEER) Program using these Bayesian models.
The success of human pancreatic islet transplantation in a subset of type 1 diabetic patients has led to an increased demand for this tissue in both clinical and basic research, yet the availability of such preparations is limited and the quality highly variable. Under the current process of islet distribution for basic science experimentation nationwide, specialized laboratories attempt to distribute islets to one or more scientists based on a list of known investigators. This Local Decision Making (LDM) process has been found to be ineffective and suboptimal. To alleviate these problems, a computerized Matching Algorithm for Islet Distribution (MAID) was developed to better match the functional, morphological, and quality characteristics of islet preparations to the criteria desired by basic research laboratories, i.e., requesters. The algorithm searches for an optimal combination of requesters using detailed screening, sorting, and search procedures. When applied to a data set of 68 human islet preparations distributed by the Islet Cell Resource (ICR) Center Consortium, MAID reduced the number of requesters that (a) did not receive any islets or (b) received mismatched shipments. These results suggest that MAID is an improved, more efficient approach to the centralized distribution of human islets within a consortium setting.
Islet distribution; matching algorithm; exhaustive search; space reduction; importance sampling
Fluorescence spectroscopy has emerged in recent years as an effective way to detect cervical cancer. Investigation of the data preprocessing stage uncovered a need for robust smoothing to extract the signal from the noise. Various robust smoothing methods for estimating fluorescence emission spectra are compared, and data-driven methods for the selection of the smoothing parameter are suggested. The methods currently implemented in R for smoothing parameter selection proved to be unsatisfactory, so a computationally efficient procedure that approximates robust leave-one-out cross validation is presented.
Robust smoothing; Smoothing parameter selection; Robust cross validation; Leave out schemes; Fluorescence spectroscopy
We describe a Dirichlet multivariable regression method useful for modeling data representing components as a percentage of a total. This model is motivated by the unmet need in psychiatry and other areas to simultaneously assess the effects of covariates on the relative contributions of different components of a measure. The model is illustrated using the Positive and Negative Syndrome Scale (PANSS) for assessment of schizophrenia symptoms, which, like many other metrics in psychiatry, is composed of a sum of scores on several components, each in turn made up of sums of evaluations on several questions. We simultaneously examine the effects of baseline socio-demographic and co-morbid correlates on all of the components of the total PANSS score of patients from a schizophrenia clinical trial and identify variables associated with increasing or decreasing relative contributions of each component. Several definitions of residuals are provided. Diagnostics include measures of overdispersion, Cook's distance, and a local jackknife influence metric.
Multivariable regression; overdispersion; Cook’s distance; local influence; PANSS
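A common way to set up such a model (the paper's parameterization may differ) links each Dirichlet concentration parameter to covariates on the log scale:

\[
(Y_{i1},\dots,Y_{iK}) \sim \mathrm{Dirichlet}(\alpha_{i1},\dots,\alpha_{iK}),
\qquad
\log \alpha_{ik} = x_i^{\top}\beta_k,
\]

so a covariate shifts the expected relative contribution \(E(Y_{ik}) = \alpha_{ik}/\sum_{j} \alpha_{ij}\) of every component simultaneously, matching the components-as-a-percentage-of-a-total structure of the PANSS.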
Doubly-censored data refer to time-to-event data for which both the originating and failure times are censored. In studies involving AIDS incubation time or survival after dementia onset, for example, data are frequently doubly-censored because the date of the originating event is interval-censored and the date of the failure event is usually right-censored. The primary interest is in the distribution of elapsed times between the originating and failure events and its relationship to exposures and risk factors. The estimating equation approach [Sun, et al. 1999. Regression analysis of doubly censored failure time data with applications to AIDS studies. Biometrics 55, 909-914] and its extensions assume the same distribution of originating event times for all subjects. This paper demonstrates the importance of utilizing additional covariates to impute originating event times, i.e., more accurate estimation of originating event times may lead to less biased parameter estimates for the elapsed time. The Bayesian MCMC method is shown to be a suitable approach for analyzing doubly-censored data and allows a rich class of survival models. The performance of the proposed estimation method is compared to that of other conventional methods through simulations. Two examples, an AIDS cohort study and a population-based dementia study, are used for illustration. Sample code is shown in the appendix.
AIDS; dementia; doubly censored data; incubation period; MCMC; midpoint imputation
Majorization methods solve minimization problems by replacing a complicated problem by a sequence of simpler problems. Solving the sequence of simple optimization problems guarantees convergence to a solution of the complicated original problem. Convergence is guaranteed by requiring that the approximating functions majorize the original function at the current solution. The leading examples of majorization are the EM algorithm and the SMACOF algorithm used in Multidimensional Scaling. The simplest possible majorizing subproblems are quadratic, because minimizing a quadratic is easy to do. In this paper quadratic majorizations for real-valued functions of a real variable are analyzed, and the concept of sharp majorization is introduced and studied. Applications to logit, probit, and robust loss functions are discussed.
Successive approximation; iterative majorization; convexity
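The basic quadratic construction, for a smooth function whose second derivative is bounded above, is:

\[
g(x \mid y) \;=\; f(y) + f'(y)(x - y) + \tfrac{c}{2}(x - y)^2, \qquad c \ge \sup_x f''(x),
\]

which satisfies \(g(x \mid y) \ge f(x)\) with equality at \(x = y\), so the update \(x^{+} = y - f'(y)/c\) never increases \(f\). For the logit loss \(f(x) = \log(1 + e^{x})\), \(f''(x) = e^{x}/(1 + e^{x})^{2} \le 1/4\), so \(c = 1/4\) works; sharp majorization, roughly, asks for the smallest curvature that still majorizes at the current point.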
This paper is concerned with developing rules for assignment of tooth prognosis based on actual tooth loss in the VA Dental Longitudinal Study. It is also of interest to rank the relative importance of various clinical factors for tooth loss. A multivariate survival tree procedure is proposed. The procedure is built on a parametric exponential frailty model, which leads to greater computational efficiency. We adopt the goodness-of-split pruning algorithm of LeBlanc and Crowley (1993) to determine the best tree size. In addition, the variable importance method is extended to trees grown by goodness-of-fit using an algorithm similar to the random forest procedure of Breiman (2001). Simulation studies for assessing the proposed tree and variable importance methods are presented. To limit the final number of meaningful prognostic groups, an amalgamation algorithm is employed to merge terminal nodes that are homogeneous in tooth survival. The resulting prognosis rules and variable importance rankings seem to offer simple yet clear and insightful interpretations.
CART; Censoring; Frailty; Multivariate Survival Time; Random Forest; Tooth Loss; Variable Importance
Bounded data with excess observations at the boundary are common in many areas of application. Various individual cases of inflated mixture models have been studied in the literature for bound-inflated data, yet the computational methods have been developed separately for each type of model. In this article we use a common framework for computing these models, and expand the range of models for both discrete and semi-continuous data with point inflation at the lower boundary. The quasi-Newton and EM algorithms are adapted and compared for estimation of model parameters. The numerical Hessian and generalized Louis method are investigated as means for computing standard errors after optimization. Correlated data are included in this framework via generalized estimating equations. The estimation of parameters and effectiveness of standard errors are demonstrated through simulation and in the analysis of data from an ultrasound bioeffect study. The unified approach enables reliable computation for a wide class of inflated mixture models and comparison of competing models.
EM; Generalized estimating equation; Louis method; Mixture model; Quasi-Newton; Tobit model; Two-part model
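The common structure exploited by the unified framework can be written schematically as a mixture with a point mass at the boundary:

\[
f(y) \;=\; \pi\,\delta_{c}(y) \;+\; (1 - \pi)\,g(y \mid \theta),
\]

where \(c\) is the lower boundary, \(\delta_c\) a point mass at \(c\), \(\pi\) the inflation probability, and \(g\) a discrete (e.g., Poisson) or semi-continuous (e.g., Tobit-type) component; both \(\pi\) and \(\theta\) may depend on covariates, and generalized estimating equations accommodate within-cluster correlation.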