In longitudinal cluster randomized clinical trials (cluster-RCT), subjects are nested within a higher level unit such as clinics and are evaluated for outcome repeatedly over the study period. This study design results in a three level hierarchical data structure. When the primary goal is to test the hypothesis that an intervention has an effect on the rate of change in the outcome over time and the between-subject variation in slopes is substantial, the subject-specific slopes are often modeled as random coefficients in a mixed-effects linear model. In this paper, we propose approaches for determining the samples size for each level of a 3-level hierarchical trial design based on ordinary least squares (OLS) estimates for detecting a difference in mean slopes between two intervention groups when the slopes are modeled as random. Notably, the sample size is not a function of the variances of either the second or the third level random intercepts and depends on the number of second and third level data units only through their product. Simulation results indicate that the OLS-based power and sample sizes are virtually identical to the empirical maximum likelihood based estimates even with varying cluster sizes. Sample sizes for random versus fixed slope models are also compared. The effects of the variance of the random slope on the sample size determinations are shown to be enormous. Therefore, when between-subject variations in outcome trends are anticipated to be significant, sample size determinations based on a fixed slope model can result in a seriously underpowered study.
doi:10.1016/j.csda.2012.11.016
PMCID: PMC3580878
PMID: 23459110
longitudinal cluster RCT; three level data; power; sample size; random slope; effect size
The semiparametric accelerated hazards mixture cure model provides a useful alternative to analyze survival data with a cure fraction if covariates of interest have a gradual effect on the hazard of uncured patients. However, the application of the model may be hindered by the computational intractability of its estimation method due to non-smooth estimating equations involved. We propose a new semiparametric estimation method based on a smooth estimating equation for the model and demonstrate that the new method makes the parameter estimation more tractable without loss of efficiency. The proposed method is used to fit the model to a SEER breast cancer data set.
doi:10.1016/j.csda.2012.09.017
PMCID: PMC3535878
PMID: 23293406
EM algorithm; Kernel-smoothed approximation; Non-smooth estimating equation; Profile likelihood
Data processing and source identification using lower dimensional hidden structure plays an essential role in many fields of applications, including image processing, neural networks, genome studies, signal processing and other areas where large datasets are often encountered. One of the common methods for source separation using lower dimensional structure involves the use of Independent Component Analysis, which is based on a linear representation of the observed data in terms of independent hidden sources. The problem thus involves the estimation of the linear mixing matrix and the densities of the independent hidden sources. However, the solution to the problem depends on the identifiability of the sources. This paper first presents a set of sufficient conditions to establish the identifiability of the sources and the mixing matrix using moment restrictions of the hidden source variables. Under such sufficient conditions a semi-parametric maximum likelihood estimate of the mixing matrix is obtained using a class of mixture distributions. The consistency of our proposed estimate is established under additional regularity conditions. The proposed method is illustrated and compared with existing methods using simulated and real data sets.
doi:10.1016/j.csda.2012.09.012
PMCID: PMC3921001
PMID: 24526802
Constrained EM-algorithm; Mixture Density Estimation; Source Identification
Many clinical trials compare the efficacy of K (≥3) treatments in repeated measurement studies. However, the design of such trials have received relatively less attention from researchers. Zhang & Ahn (2012) derived a closed-form sample size formula for two-sample comparisons of time-averaged responses using the generalized estimating equation (GEE) approach, which takes into account different correlation structures and missing data patterns. In this paper, we extend the sample size formula to scenarios where K (≥3) treatments are compared simultaneously to detect time-averaged differences in treatment effect. A closed-form sample size formula based on the noncentral χ2 test statistic is derived. We conduct simulation studies to assess the performance of the proposed sample size formula under various correlation structures from a damped exponential family, random and monotone missing patterns, and different observation probabilities. Simulation studies show that empirical powers and type I errors are close to their nominal levels. The proposed sample size formula is illustrated using a real clinical trial example.
doi:10.1016/j.csda.2012.08.013
PMCID: PMC3505113
PMID: 23183937
Based on the Bayes modal estimate of factor scores in binary latent variable models, this paper proposes two new limited information estimators for the factor analysis model with a logistic link function for binary data based on Bernoulli distributions up to the second and the third order with maximum likelihood estimation and Laplace approximations to required integrals. These estimators and two existing limited information weighted least squares estimators are studied empirically. The limited information estimators compare favorably to full information estimators based on marginal maximum likelihood, MCMC, and multinomial distribution with a Laplace approximation methodology. Among the various estimators, Maydeu-Olivares and Joe’s (2005) weighted least squares limited information estimators implemented with Laplace approximations for probabilities are shown in a simulation to have the best root mean square errors.
doi:10.1016/j.csda.2012.06.022
PMCID: PMC3418349
PMID: 22904587
Limited Information; Laplace Approximation; Binary Response; Marginal Likelihood; Factor Scores
In parametric hierarchical models, it is standard practice to place mean and variance constraints on the latent variable distributions for the sake of identifiability and interpretability. Because incorporation of such constraints is challenging in semiparametric models that allow latent variable distributions to be unknown, previous methods either constrain the median or avoid constraints. In this article, we propose a centered stick-breaking process (CSBP), which induces mean and variance constraints on an unknown distribution in a hierarchical model. This is accomplished by viewing an unconstrained stick-breaking process as a parameter-expanded version of a CSBP. An efficient blocked Gibbs sampler is developed for approximate posterior computation. The methods are illustrated through a simulated example and an epidemiologic application.
PMCID: PMC3869464
PMID: 24363478
Dirichlet process; Latent variables; Moment constraints; Nonparametric Bayes; Parameter expansion; Random effects
A primary challenge in unsupervised clustering using mixture models is the selection of a family of basis distributions flexible enough to succinctly represent the distributions of the target subpopulations. In this paper we introduce a new family of Gaussian Well distributions (GWDs) for clustering applications where the target subpopulations are characterized by hollow [hyper-]elliptical structures. We develop the primary theory pertaining to the GWD, including mixtures of GWDs, selection of prior distributions, and computationally efficient inference strategies using Markov chain Monte Carlo. We demonstrate the utility of our approach, as compared to standard Gaussian mixture methods on a synthetic dataset, and exemplify its applicability on an example from immunofluorescence imaging, emphasizing the improved interpretability and parsimony of the GWD-based model.
doi:10.1016/j.csda.2012.03.027
PMCID: PMC3384503
PMID: 22754052
Gaussian mixtures; Poisson point processes; subtractive mixtures; histology
This paper considers model-based methods for estimation of the adjusted attributable risk (AR) in both case-control and cohort studies. An earlier review discussed approaches for both types of studies, using the standard logistic regression model for case-control studies, and for cohort studies proposing the equivalent Poisson model in order to account for the additional variability in estimating the distribution of exposures and covariates from the data. In this paper we revisit case-control studies, arguing for the equivalent Poisson model in this case as well. Using the delta method with the Poisson model, we provide general expressions for the asymptotic variance of the AR for both types of studies. This includes the generalized AR, which extends the original idea of attributable risk to the case where the exposure is not completely eliminated. These variance expressions can be easily programmed in any statistical package that includes Poisson regression and has capabilities for simple matrix algebra. In addition, we discuss computation of standard errors and confidence limits using bootstrap resampling. For cohort studies, use of the bootstrap allows binary regression models with link functions other than the logit.
doi:10.1016/j.csda.2012.04.017
PMCID: PMC3462467
PMID: 23049150
adjusted attributable risk; case-control study; cohort study; Poisson regression; delta method; model-based estimate; bootstrap methods
Compared to the proportional hazards model and accelerated failure time model, the accelerated hazards model has a unique property in its application, in that it can allow gradual effects of the treatment. However, its application is still very limited, partly due to the complexity of existing semiparametric estimation methods. We propose a new semiparametric estimation method based on the induced smoothing and rank type estimates. The parameter estimates and their variances can be easily obtained from the smoothed estimating equation; thus it is easy to use in practice. Our numerical study shows that the new method is more efficient than the existing methods with respect to its variance estimation and coverage probability. The proposed method is employed to reanalyze a data set from a brain tumor treatment study.
doi:10.1016/j.csda.2012.04.001
PMCID: PMC3462890
PMID: 23049151
Accelerated Hazards Model; Rank Estimation; Induced Smoothing
The celebrated expectation-maximization (EM) algorithm is one of the most widely used optimization methods in statistics. In recent years it has been realized that EM algorithm is a special case of the more general minorization-maximization (MM) principle. Both algorithms creates a surrogate function in the first (E or M) step that is maximized in the second M step. This two step process always drives the objective function uphill and is iterated until the parameters converge. The two algorithms differ in the way the surrogate function is constructed. The expectation step of the EM algorithm relies on calculating conditional expectations, while the minorization step of the MM algorithm builds on crafty use of inequalities. For many problems, EM and MM derivations yield the same algorithm. This expository note walks through the construction of both algorithms for estimating the parameters of the Dirichlet-Multinomial distribution. This particular case is of interest because EM and MM derivations lead to two different algorithms with completely distinct operating characteristics. The EM algorithm converges fast but involves solving a nontrivial maximization problem in the M step. In contrast the MM updates are extremely simple but converge slowly. An EM-MM hybrid algorithm is derived which shows faster convergence than the MM algorithm in certain parameter regimes. The local convergence rates of the three algorithms are studied theoretically from the unifying MM point of view and also compared on numerical examples.
doi:10.1016/j.csda.2012.05.018
PMCID: PMC3755471
PMID: 23997380
Convergence rate; Dirichlet-Multinomial distribution; EM algorithm; MM algorithm
The aim of this paper is to provide a composite likelihood approach to handle spatially correlated survival data using pairwise joint distributions. With e-commerce data, a recent question of interest in marketing research has been to describe spatially clustered purchasing behavior and to assess whether geographic distance is the appropriate metric to describe purchasing dependence. We present a model for the dependence structure of time-to-event data subject to spatial dependence to characterize purchasing behavior from the motivating example from e-commerce data. We assume the Farlie-Gumbel-Morgenstern (FGM) distribution and then model the dependence parameter as a function of geographic and demographic pairwise distances. For estimation of the dependence parameters, we present pairwise composite likelihood equations. We prove that the resulting estimators exhibit key properties of consistency and asymptotic normality under certain regularity conditions in the increasing-domain framework of spatial asymptotic theory.
doi:10.1016/j.csda.2011.07.004
PMCID: PMC3819148
PMID: 24223450
Spatial dependence; Pairwise joint likelihood; Marginal likelihood; Event times; Consistency; Asymptotic normality; Censoring
Fourier Amplitude Sensitivity Test (FAST) is one of the most popular uncertainty and sensitivity analysis techniques. It uses a periodic sampling approach and a Fourier transformation to decompose the variance of a model output into partial variances contributed by different model parameters. Until now, the FAST analysis is mainly confined to the estimation of partial variances contributed by the main effects of model parameters, but does not allow for those contributed by specific interactions among parameters. In this paper, we theoretically show that FAST analysis can be used to estimate partial variances contributed by both main effects and interaction effects of model parameters using different sampling approaches (i.e., traditional search-curve based sampling, simple random sampling and random balance design sampling). We also analytically calculate the potential errors and biases in the estimation of partial variances. Hypothesis tests are constructed to reduce the effect of sampling errors on the estimation of partial variances. Our results show that compared to simple random sampling and random balance design sampling, sensitivity indices (ratios of partial variances to variance of a specific model output) estimated by search-curve based sampling generally have higher precision but larger underestimations. Compared to simple random sampling, random balance design sampling generally provides higher estimation precision for partial variances contributed by the main effects of parameters. The theoretical derivation of partial variances contributed by higher-order interactions and the calculation of their corresponding estimation errors in different sampling schemes can help us better understand the FAST method and provide a fundamental basis for FAST applications and further improvements.
doi:10.1016/j.csda.2010.06.028
PMCID: PMC3798038
PMID: 24143037
Uncertainty analysis; Sensitivity analysis; Fourier Amplitude Sensitivity Test; Simple random sampling; Random balance design; Interactions
We consider an extension of the temporal epidemic-type aftershock sequence (ETAS) model with random effects as a special case of a well-known doubly stochastic self-exciting point process. The new model arises from a deterministic function that is randomly scaled by a nonnegative random variable, which is unobservable but assumed to follow either positive stable or one-parameter gamma distribution with unit mean. Both random effects models are of interest although the one-parameter gamma random effects model is more popular when modeling associated survival times. Our estimation is based on the maximum likelihood approach with marginalized intensity. The methods are shown to perform well in simulation experiments. When applied to an earthquake sequence on the east coast of Taiwan, the extended model with positive stable random effects provides a better model fit, compared to the original ETAS model and the extended model with one-parameter gamma random effects.
doi:10.1016/j.csda.2010.09.031
PMCID: PMC3770286
PMID: 24039322
Doubly stochastic self-exciting process; ETAS model; Earthquake sequence; Marginalized intensity; Random effects
Summary
In breast cancer research, it is of great interest to identify genomic markers associated with prognosis. Multiple gene profiling studies have been conducted for such a purpose. Genomic markers identified from the analysis of single datasets often do not have satisfactory reproducibility. Among the multiple possible reasons, the most important one is the small sample sizes of individual studies. A cost-effective solution is to pool data from multiple comparable studies and conduct integrative analysis. In this study, we collect four breast cancer prognosis studies with gene expression measurements. We describe the relationship between prognosis and gene expressions using the accelerated failure time (AFT) models. We adopt a 2-norm group bridge penalization approach for marker identification. This integrative analysis approach can effectively identify markers with consistent effects across multiple datasets and naturally accommodate the heterogeneity among studies. Statistical and simulation studies demonstrate satisfactory performance of this approach. Breast cancer prognosis markers identified using this approach have sound biological implications and satisfactory prediction performance.
doi:10.1016/j.csda.2012.02.017
PMCID: PMC3389801
PMID: 22773869
Breast cancer prognosis; Gene expression; Marker identification; Integrative analysis; 2-norm group bridge
The paper addresses a common problem in the analysis of high-dimensional high-throughput “omics” data, which is parameter estimation across multiple variables in a set of data where the number of variables is much larger than the sample size. Among the problems posed by this type of data are that variable-specific estimators of variances are not reliable and variable-wise tests statistics have low power, both due to a lack of degrees of freedom. In addition, it has been observed in this type of data that the variance increases as a function of the mean. We introduce a non-parametric adaptive regularization procedure that is innovative in that : (i) it employs a novel “similarity statistic”-based clustering technique to generate local-pooled or regularized shrinkage estimators of population parameters, (ii) the regularization is done jointly on population moments, benefiting from C. Stein's result on inadmissibility, which implies that usual sample variance estimator is improved by a shrinkage estimator using information contained in the sample mean. From these joint regularized shrinkage estimators, we derived regularized t-like statistics and show in simulation studies that they offer more statistical power in hypothesis testing than their standard sample counterparts, or regular common value-shrinkage estimators, or when the information contained in the sample mean is simply ignored. Finally, we show that these estimators feature interesting properties of variance stabilization and normalization that can be used for preprocessing high-dimensional multivariate data. The method is available as an R package, called ‘MVR’ (‘Mean-Variance Regularization’), downloadable from the CRAN website.
doi:10.1016/j.csda.2012.01.012
PMCID: PMC3375876
PMID: 22711950
Bioinformatics; Inadmissibility; Regularization; Shrinkage Estimators; Normalization; Variance Stabilization
When analyzing high-throughput genomic data, the multiple comparison problem is most often addressed through estimation of the false discovery rate (FDR), using methods such as the Benjamini & Hochberg, Benjamini & Yekutieli, the q-value method, or in controlling the family-wise error rate (FWER) using Holm’s step down method. To date, research studies that have compared various FDR/FWER methodologies have made use of limited simulation studies and/or have applied the methods to one or more microarray gene expression dataset(s). However, for microarray datasets the veracity of each null hypothesis tested is unknown so that an objective evaluation of performance cannot be rendered for application data. Due to the role of methylation in X-chromosome inactivation, we postulate that high-throughput methylation datasets may provide an appropriate forum for assessing the performance of commonly used FDR methodologies. These datasets preserve the complex correlation structure between probes, offering an advantage over simulated datasets. Using several methylation datasets, commonly used FDR methods including the q-value, Benjamini & Hochberg, and Benjamini & Yekutieli procedures as well as Holm’s step down method were applied to identify CpG sites that are differentially methylated when comparing healthy males to healthy females. The methods were compared with respect to their ability to identify CpG sites located on sex chromosomes as significant, by reporting the sensitivity, specificity, and observed FDR. These datasets are useful for characterizing the performance of multiple comparison procedures, and may find further utility in other tasks such as comparing variable selection capabilities of classification methods and evaluating the performance of meta-analytic methods for microarray data.
doi:10.1016/j.csda.2011.10.020
PMCID: PMC3352593
PMID: 22611297
false discovery rate; family wise error rate; methylation; sensitivity; specificity; multiple comparisons
Studies of ocular disease and analyses of time to disease onset are complicated by the correlation expected between the two eyes from a single patient. We overcome these statistical modeling challenges through a nonparametric Bayesian frailty model. While this model suggests itself as a natural one for such complex data structures, model fitting routines become overwhelmingly complicated and computationally intensive given the nonparametric form assumed for the frailty distribution and baseline hazard function. We consider empirical Bayesian methods to alleviate these difficulties through a routine that iterates between frequentist, data-driven estimation of the cumulative baseline hazard and Markov chain Monte Carlo estimation of the frailty and regression coefficients. We show both in theory and through simulation that this approach yields consistent estimators of the parameters of interest. We then apply the method to the short-wave automated perimetry (SWAP) data set to study risk factors of glaucomatous visual field deficits.
doi:10.1016/j.csda.2011.09.004
PMCID: PMC3359094
PMID: 22639479
multivariate survival analysis; nonparametric Pólya tree prior; Gibbs sampler; Metropolis-Hastings sampler; goodness of fit; glaucoma and ophthalmology data
Statistical inference in censored quantile regression is challenging, partly due to the unsmoothness of the quantile score function. A new procedure is developed to estimate the variance of Bang and Tsiatis’s inverse-censoring-probability weighted estimator for censored quantile regression by employing the idea of induced smoothing. The proposed variance estimator is shown to be asymptotically consistent. In addition, numerical study suggests that the proposed procedure performs well in finite samples, and it is computationally more efficient than the commonly used bootstrap method.
doi:10.1016/j.csda.2010.10.018
PMCID: PMC3338150
PMID: 22547899
censored quantile regression; smoothing; survival analysis; variance estimation
Linear mixed-effects models involve fixed effects, random effects and covariance structure, which require model selection to simplify a model and to enhance its interpretability and predictability. In this article, we develop, in the context of linear mixed-effects models, the generalized degrees of freedom and an adaptive model selection procedure defined by a data-driven model complexity penalty. Numerically, the procedure performs well against its competitors not only in selecting fixed effects but in selecting random effects and covariance structure as well. Theoretically, asymptotic optimality of the proposed methodology is established over a class of information criteria. The proposed methodology is applied to the BioCycle study, to determine predictors of hormone levels among premenopausal women and to assess variation in hormone levels both between and within women across the menstrual cycle.
doi:10.1016/j.csda.2011.09.001
PMCID: PMC3214646
PMID: 22090665
Adaptive penalty; linear mixed-effects models; loss estimation; generalized degrees of freedom
Canonical correlation analysis (CCA) is a widely used multivariate method for assessing the association between two sets of variables. However, when the number of variables far exceeds the number of subjects, such in the case of large-scale genomic studies, the traditional CCA method is not appropriate. In addition, when the variables are highly correlated the sample covariance matrices become unstable or undefined. To overcome these two issues, sparse canonical correlation analysis (SCCA) for multiple data sets has been proposed using a Lasso type of penalty. However, these methods do not have direct control over sparsity of solution. An additional step that uses Bayesian Information Criterion (BIC) has also been suggested to further filter out unimportant features. In this paper, a comparison of four penalty functions (Lasso, Elastic-net, SCAD and Hard-threshold) for SCCA with and without the BIC filtering step have been carried out using both real and simulated genotypic and mRNA expression data. This study indicates that the SCAD penalty with BIC filter would be a preferable penalty function for application of SCCA to genomic data.
doi:10.1016/j.csda.2011.07.012
PMCID: PMC3185379
PMID: 21984855
SCCA; Lasso; Elastic-net; SCAD; BIC; penalty; SNP; mRNA expression
We evaluate the performance of the Dirichlet process mixture (DPM) and the latent class model (LCM) in identifying autism phenotype subgroups based on categorical autism spectrum disorder (ASD) diagnostic features from the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition Text Revision. A simulation study is designed to mimic the diagnostic features in the ASD dataset in order to evaluate the LCM and DPM methods in this context. Likelihood based information criteria and DPM partitioning are used to identify the best fitting models. The Rand statistic is used to compare the performance of the methods in recovering simulated phenotype subgroups. Our results indicate excellent recovery of the simulated subgroup structure for both methods. The LCM performs slightly better than DPM when the correct number of latent subgroups is selected a priori. The DPM method utilizes a maximum a posteriori (MAP) criterion to estimate the number of classes, and yielded results in fair agreement with the LCM method. Comparison of model fit indices in identifying the best fitting LCM showed that adjusted Bayesian information criteria (ABIC) picks the correct number of classes over 90% of the time. Thus, when diagnostic features are categorical and there is some prior information regarding the number of latent classes, LCM in conjunction with ABIC is preferred.
doi:10.1016/j.csda.2011.06.014
PMCID: PMC3171740
PMID: 21927523
autism; Dirichlet process mixture; latent class model
Random-effects change point models are formulated for longitudinal data obtained from cognitive tests. The conditional distribution of the response variable in a change point model is often assumed to be normal even if the response variable is discrete and shows ceiling effects. For the sum score of a cognitive test, the binomial and the beta-binomial distributions are presented as alternatives to the normal distribution. Smooth shapes for the change point models are imposed. Estimation is by marginal maximum likelihood where a parametric population distribution for the random change point is combined with a non-parametric mixing distribution for other random effects. An extension to latent class modelling is possible in case some individuals do not experience a change in cognitive ability. The approach is illustrated using data from a longitudinal study of Swedish octogenarians and nonagenarians that began in 1991. Change point models are applied to investigate cognitive change in the years before death.
doi:10.1016/j.csda.2012.07.024
PMCID: PMC3587404
PMID: 23471297
Beta-binomial distribution; Latent class model; Mini-mental state examination; Random-effects model
A new comprehensive procedure for statistical analysis of two-dimensional polyacrylamide gel electrophoresis (2D PAGE) images is proposed, including protein region quantification, normalization and statistical analysis. Protein regions are defined by the master watershed map that is obtained from the mean gel. By working with these protein regions, the approach bypasses the current bottleneck in the analysis of 2D PAGE images: it does not require spot matching. Background correction is implemented in each protein region by local segmentation. Two-dimensional locally weighted smoothing (LOESS) is proposed to remove any systematic bias after quantification of protein regions. Proteins are separated into mutually independent sets based on detected correlations, and a multivariate analysis is used on each set to detect the group effect. A strategy for multiple hypothesis testing based on this multivariate approach combined with the usual Benjamini-Hochberg FDR procedure is formulated and applied to the differential analysis of 2D PAGE images. Each step in the analytical protocol is shown by using an actual dataset. The effectiveness of the proposed methodology is shown using simulated gels in comparison with the commercial software packages PDQuest and Dymension. We also introduce a new procedure for simulating gel images.
doi:10.1016/j.csda.2011.05.013
PMCID: PMC3155775
PMID: 21850152
proteomics; 2D PAGE; watershed region; spot matching; statistical image analysis; high dimension
We consider the estimation of the parameters indexing a parametric model for the conditional distribution of a diagnostic marker given covariates and disease status. Such models are useful for the evaluation of whether and to what extent a marker’s ability to accurately detect or discard disease depends on patient characteristics. A frequent problem that complicates the estimation of the model parameters is that estimation must be conducted from observational studies. Often, in such studies not all patients undergo the gold standard assessment of disease. Furthermore, the decision as to whether a patient undergoes verification is not controlled by study design. In such scenarios, maximum likelihood estimators based on subjects with observed disease status are generally biased. In this paper, we propose estimators for the model parameters that adjust for selection to verification that may depend on measured patient characteristics and additonally adjust for an assumed degree of residual association. Such estimators may be used as part of a sensitivity analysis for plausible degrees of residual association. We describe a doubly robust estimator that has the attractive feature of being consistent if either a model for the probability of selection to verification or a model for the probability of disease among the verified subjects (but not necessarily both) is correct.
doi:10.1016/j.csda.2008.06.021
PMCID: PMC3475507
PMID: 23087495
Missing at Random; Nonignorable; Missing Covariate; Sensitivity Analysis; Semiparametric; Diagnosis
We present a Bayesian variable selection method for the setting in which the number of independent variables or predictors in a particular dataset is much larger than the available sample size. While most existing methods allow some degree of correlations among predictors but do not consider these correlations for variable selection, our method accounts for correlations among the predictors in variable selection. Our correlation-based stochastic search (CBS) method, the hybrid-CBS algorithm, extends a popular search algorithm for high-dimensional data, the stochastic search variable selection (SSVS) method. Similar to SSVS, we search the space of all possible models using variable addition, deletion or swap moves. However, our moves through the model space are designed to accommodate correlations among the variables. We describe our approach for continuous, binary, ordinal, and count outcome data. The impact of choices of prior distributions and hyper-parameters is assessed in simulation studies. We also examined performance of variable selection and prediction as the correlation structure of the predictors varies. We found that the hybrid-CBS resulted in lower prediction errors and better identified the true outcome associated predictors than SSVS when predictors were moderately to highly correlated. We illustrate the method on data from a proteomic profiling study of melanoma, a skin cancer.
doi:10.1016/j.csda.2011.04.019
PMCID: PMC3113479
PMID: 21686315
Correlated predictors; correlation-based search; proteomic data