A primary challenge in unsupervised clustering using mixture models is the selection of a family of basis distributions flexible enough to succinctly represent the distributions of the target subpopulations. In this paper we introduce a new family of Gaussian Well distributions (GWDs) for clustering applications where the target subpopulations are characterized by hollow [hyper-]elliptical structures. We develop the primary theory pertaining to the GWD, including mixtures of GWDs, selection of prior distributions, and computationally efficient inference strategies using Markov chain Monte Carlo. We demonstrate the utility of our approach on a synthetic dataset, in comparison with standard Gaussian mixture methods, and exemplify its applicability on an example from immunofluorescence imaging, emphasizing the improved interpretability and parsimony of the GWD-based model.
Gaussian mixtures; Poisson point processes; subtractive mixtures; histology
This paper considers model-based methods for estimation of the adjusted attributable risk (AR) in both case-control and cohort studies. An earlier review discussed approaches for both types of studies, using the standard logistic regression model for case-control studies, and for cohort studies proposing the equivalent Poisson model in order to account for the additional variability in estimating the distribution of exposures and covariates from the data. In this paper we revisit case-control studies, arguing for the equivalent Poisson model in this case as well. Using the delta method with the Poisson model, we provide general expressions for the asymptotic variance of the AR for both types of studies. This includes the generalized AR, which extends the original idea of attributable risk to the case where the exposure is not completely eliminated. These variance expressions can be easily programmed in any statistical package that includes Poisson regression and has capabilities for simple matrix algebra. In addition, we discuss computation of standard errors and confidence limits using bootstrap resampling. For cohort studies, use of the bootstrap allows binary regression models with link functions other than the logit.
adjusted attributable risk; case-control study; cohort study; Poisson regression; delta method; model-based estimate; bootstrap methods
Compared to the proportional hazards model and the accelerated failure time model, the accelerated hazards model has a unique property in its application, in that it can allow for gradual effects of the treatment. However, its application is still very limited, partly due to the complexity of existing semiparametric estimation methods. We propose a new semiparametric estimation method based on induced smoothing and rank-type estimates. The parameter estimates and their variances can be easily obtained from the smoothed estimating equation; thus the method is easy to use in practice. Our numerical study shows that the new method is more efficient than the existing methods with respect to its variance estimation and coverage probability. The proposed method is employed to reanalyze a data set from a brain tumor treatment study.
Accelerated Hazards Model; Rank Estimation; Induced Smoothing
The celebrated expectation-maximization (EM) algorithm is one of the most widely used optimization methods in statistics. In recent years it has been realized that the EM algorithm is a special case of the more general minorization-maximization (MM) principle. Both algorithms create a surrogate function in the first (E or minorization) step that is maximized in the second step. This two-step process always drives the objective function uphill and is iterated until the parameters converge. The two algorithms differ in the way the surrogate function is constructed. The expectation step of the EM algorithm relies on calculating conditional expectations, while the minorization step of the MM algorithm builds on crafty use of inequalities. For many problems, EM and MM derivations yield the same algorithm. This expository note walks through the construction of both algorithms for estimating the parameters of the Dirichlet-Multinomial distribution. This particular case is of interest because the EM and MM derivations lead to two different algorithms with completely distinct operating characteristics. The EM algorithm converges quickly but involves solving a nontrivial maximization problem in the M step. In contrast, the MM updates are extremely simple but converge slowly. An EM-MM hybrid algorithm is derived which shows faster convergence than the MM algorithm in certain parameter regimes. The local convergence rates of the three algorithms are studied theoretically from the unifying MM point of view and also compared on numerical examples.
Convergence rate; Dirichlet-Multinomial distribution; EM algorithm; MM algorithm
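To make the contrast concrete, the MM update for the Dirichlet-Multinomial has a closed multiplicative form. The sketch below is a hypothetical NumPy implementation with toy count data, not code from the note itself; it implements one such fixed-point update together with the log-likelihood it monotonically increases.

```python
import numpy as np
from scipy.special import digamma, gammaln

def dm_loglik(alpha, X):
    """Dirichlet-Multinomial log-likelihood for an (n, K) count matrix X."""
    n = X.sum(axis=1)
    a0 = alpha.sum()
    return (gammaln(a0) - gammaln(n + a0)
            + (gammaln(X + alpha) - gammaln(alpha)).sum(axis=1)).sum()

def mm_update(alpha, X):
    """One MM step: a multiplicative update that keeps alpha positive."""
    n = X.sum(axis=1)
    a0 = alpha.sum()
    num = (digamma(X + alpha) - digamma(alpha)).sum(axis=0)
    den = (digamma(n + a0) - digamma(a0)).sum()
    return alpha * num / den

# toy data: three multinomial count vectors over K = 3 categories
X = np.array([[3.0, 2.0, 5.0], [1.0, 4.0, 2.0], [2.0, 2.0, 3.0]])
alpha = np.ones(3)
for _ in range(200):
    alpha = mm_update(alpha, X)  # each step increases dm_loglik
```

The multiplicative form shows why the MM iterations are simple yet slow: each component of alpha is rescaled by a ratio of digamma sums, with no matrix solve required.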
The aim of this paper is to provide a composite likelihood approach to handle spatially correlated survival data using pairwise joint distributions. With e-commerce data, a recent question of interest in marketing research has been to describe spatially clustered purchasing behavior and to assess whether geographic distance is the appropriate metric to describe purchasing dependence. We present a model for the dependence structure of time-to-event data subject to spatial dependence to characterize purchasing behavior in the motivating e-commerce example. We assume the Farlie-Gumbel-Morgenstern (FGM) distribution and then model the dependence parameter as a function of geographic and demographic pairwise distances. For estimation of the dependence parameters, we present pairwise composite likelihood equations. We prove that the resulting estimators exhibit key properties of consistency and asymptotic normality under certain regularity conditions in the increasing-domain framework of spatial asymptotic theory.
Spatial dependence; Pairwise joint likelihood; Marginal likelihood; Event times; Consistency; Asymptotic normality; Censoring
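For readers unfamiliar with the FGM family: it attaches a single dependence parameter θ ∈ [−1, 1] to arbitrary margins, which is what makes the pairwise composite likelihood tractable. A minimal sketch of the FGM copula and its density follows (illustrative only, not the paper's estimation code):

```python
import numpy as np

def fgm_cdf(u, v, theta):
    """FGM copula C(u, v) = u v [1 + theta (1 - u)(1 - v)], |theta| <= 1."""
    return u * v * (1.0 + theta * (1.0 - u) * (1.0 - v))

def fgm_density(u, v, theta):
    """Copula density c(u, v) = 1 + theta (1 - 2u)(1 - 2v)."""
    return 1.0 + theta * (1.0 - 2.0 * u) * (1.0 - 2.0 * v)

# midpoint-grid check: the density integrates to one over the unit square
u = np.linspace(0.005, 0.995, 100)
uu, vv = np.meshgrid(u, u)
total = fgm_density(uu, vv, 0.7).mean()
```

A well-known limitation worth noting: the FGM family only accommodates weak dependence (Kendall's tau equals 2θ/9, so |tau| ≤ 2/9), which is part of why modeling θ as a function of pairwise distances is attractive.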
The Fourier Amplitude Sensitivity Test (FAST) is one of the most popular uncertainty and sensitivity analysis techniques. It uses a periodic sampling approach and a Fourier transformation to decompose the variance of a model output into partial variances contributed by different model parameters. Until now, FAST analysis has mainly been confined to the estimation of partial variances contributed by the main effects of model parameters, and does not allow for those contributed by specific interactions among parameters. In this paper, we theoretically show that FAST analysis can be used to estimate partial variances contributed by both main effects and interaction effects of model parameters using different sampling approaches (i.e., traditional search-curve based sampling, simple random sampling and random balance design sampling). We also analytically calculate the potential errors and biases in the estimation of partial variances. Hypothesis tests are constructed to reduce the effect of sampling errors on the estimation of partial variances. Our results show that, compared to simple random sampling and random balance design sampling, sensitivity indices (ratios of partial variances to the variance of a specific model output) estimated by search-curve based sampling generally have higher precision but larger underestimation. Compared to simple random sampling, random balance design sampling generally provides higher estimation precision for partial variances contributed by the main effects of parameters. The theoretical derivation of partial variances contributed by higher-order interactions and the calculation of their corresponding estimation errors under different sampling schemes can help us better understand the FAST method and provide a fundamental basis for FAST applications and further improvements.
Uncertainty analysis; Sensitivity analysis; Fourier Amplitude Sensitivity Test; Simple random sampling; Random balance design; Interactions
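The variance decomposition underlying FAST can be sketched in a few lines for classical search-curve sampling and main effects only. The frequencies, harmonic count M, and the additive test function below are illustrative choices, not those of the paper:

```python
import numpy as np

def fast_main_effects(f, omegas, N=2001, M=4):
    """Classical FAST: first-order sensitivity indices via the search curve.
    f maps an (N, d) sample to N outputs; omegas are integer frequencies
    chosen so that harmonics up to order M do not collide."""
    s = np.pi * (2.0 * np.arange(N) + 1.0 - N) / N            # N points in (-pi, pi)
    X = 0.5 + np.arcsin(np.sin(np.outer(omegas, s))) / np.pi  # search curve, (d, N)
    y = f(X.T)
    V = y.var()                                               # total output variance
    S = np.empty(len(omegas))
    for i, w in enumerate(omegas):
        D = 0.0
        for h in range(1, M + 1):                             # first M harmonics of w
            A = 2.0 * np.mean(y * np.cos(h * w * s))
            B = 2.0 * np.mean(y * np.sin(h * w * s))
            D += (A * A + B * B) / 2.0                        # partial variance at h*w
        S[i] = D / V
    return S

# additive test function: the exact first-order indices are 0.8 and 0.2
S = fast_main_effects(lambda X: X[:, 0] + 0.5 * X[:, 1], np.array([11, 21]))
```

The small residual bias visible here (the estimates fall slightly below the exact values because only M harmonics are summed) is one instance of the underestimation the paper quantifies analytically.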
We consider an extension of the temporal epidemic-type aftershock sequence (ETAS) model with random effects as a special case of a well-known doubly stochastic self-exciting point process. The new model arises from a deterministic function that is randomly scaled by a nonnegative random variable, which is unobservable but assumed to follow either a positive stable or a one-parameter gamma distribution with unit mean. Both random effects models are of interest, although the one-parameter gamma random effects model is more popular when modeling associated survival times. Our estimation is based on the maximum likelihood approach with marginalized intensity. The methods are shown to perform well in simulation experiments. When applied to an earthquake sequence on the east coast of Taiwan, the extended model with positive stable random effects provides a better model fit, compared to the original ETAS model and the extended model with one-parameter gamma random effects.
Doubly stochastic self-exciting process; ETAS model; Earthquake sequence; Marginalized intensity; Random effects
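For orientation, the deterministic building block being randomly scaled is the standard temporal ETAS conditional intensity. A minimal sketch with illustrative parameter values follows (the paper's random-effects extension multiplies this intensity by an unobserved frailty):

```python
import numpy as np

def etas_intensity(t, times, mags, mu, K, alpha, c, p, m0):
    """Standard temporal ETAS conditional intensity:
    lambda(t) = mu + sum_{t_i < t} K exp(alpha (m_i - m0)) / (t - t_i + c)^p
    where mu is the background rate and the sum is over past events."""
    times = np.asarray(times, dtype=float)
    mags = np.asarray(mags, dtype=float)
    past = times < t
    trig = K * np.exp(alpha * (mags[past] - m0)) / (t - times[past] + c) ** p
    return mu + trig.sum()

# with no past events the intensity is just the background rate mu;
# one magnitude-m0 event at time 0 adds a power-law (Omori) decay term
lam_bg = etas_intensity(1.0, times=[], mags=[], mu=0.1, K=0.5,
                        alpha=1.0, c=0.01, p=1.1, m0=4.0)
lam_after = etas_intensity(1.0, times=[0.0], mags=[4.0], mu=0.1, K=0.5,
                           alpha=1.0, c=0.01, p=1.1, m0=4.0)
```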
In breast cancer research, it is of great interest to identify genomic markers associated with prognosis. Multiple gene profiling studies have been conducted for such a purpose. Genomic markers identified from the analysis of single datasets often do not have satisfactory reproducibility. Among the multiple possible reasons, the most important one is the small sample sizes of individual studies. A cost-effective solution is to pool data from multiple comparable studies and conduct integrative analysis. In this study, we collect four breast cancer prognosis studies with gene expression measurements. We describe the relationship between prognosis and gene expressions using the accelerated failure time (AFT) models. We adopt a 2-norm group bridge penalization approach for marker identification. This integrative analysis approach can effectively identify markers with consistent effects across multiple datasets and naturally accommodate the heterogeneity among studies. Statistical and simulation studies demonstrate satisfactory performance of this approach. Breast cancer prognosis markers identified using this approach have sound biological implications and satisfactory prediction performance.
Breast cancer prognosis; Gene expression; Marker identification; Integrative analysis; 2-norm group bridge
The paper addresses a common problem in the analysis of high-dimensional high-throughput “omics” data, which is parameter estimation across multiple variables in a set of data where the number of variables is much larger than the sample size. Among the problems posed by this type of data are that variable-specific estimators of variances are not reliable and variable-wise test statistics have low power, both due to a lack of degrees of freedom. In addition, it has been observed in this type of data that the variance increases as a function of the mean. We introduce a non-parametric adaptive regularization procedure that is innovative in that: (i) it employs a novel “similarity statistic”-based clustering technique to generate local-pooled or regularized shrinkage estimators of population parameters, (ii) the regularization is done jointly on population moments, benefiting from C. Stein's result on inadmissibility, which implies that the usual sample variance estimator is improved by a shrinkage estimator using information contained in the sample mean. From these joint regularized shrinkage estimators, we derive regularized t-like statistics and show in simulation studies that they offer more statistical power in hypothesis testing than their standard sample counterparts, or regular common value-shrinkage estimators, or when the information contained in the sample mean is simply ignored. Finally, we show that these estimators feature interesting properties of variance stabilization and normalization that can be used for preprocessing high-dimensional multivariate data. The method is available as an R package, called ‘MVR’ (‘Mean-Variance Regularization’), downloadable from the CRAN website.
Bioinformatics; Inadmissibility; Regularization; Shrinkage Estimators; Normalization; Variance Stabilization
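The Stein result invoked above is easy to demonstrate: for three or more means, shrinking the raw estimates jointly lowers total squared error. A generic positive-part James-Stein sketch (not the MVR estimator itself) makes the point:

```python
import numpy as np

def james_stein(x, sigma2=1.0):
    """Shrink a d-dimensional observation toward zero (requires d >= 3).
    Positive-part variant: the shrinkage factor is truncated at zero."""
    d = x.size
    factor = 1.0 - (d - 2) * sigma2 / np.dot(x, x)
    return max(factor, 0.0) * x

# compare total squared error of raw vs shrunken estimates over replications
rng = np.random.default_rng(0)
theta = np.full(20, 0.5)                      # true means (d = 20)
mse_raw, mse_js = 0.0, 0.0
for _ in range(500):
    x = theta + rng.standard_normal(20)       # one noisy observation per mean
    mse_raw += np.sum((x - theta) ** 2)
    mse_js += np.sum((james_stein(x) - theta) ** 2)
```

Here the shrinkage estimator dominates the raw sample means in total squared error, which is the inadmissibility phenomenon the joint mean-variance regularization exploits.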
When analyzing high-throughput genomic data, the multiple comparison problem is most often addressed through estimation of the false discovery rate (FDR), using methods such as the Benjamini & Hochberg, Benjamini & Yekutieli, and q-value procedures, or through control of the family-wise error rate (FWER) using Holm’s step-down method. To date, research studies that have compared various FDR/FWER methodologies have made use of limited simulation studies and/or have applied the methods to one or more microarray gene expression dataset(s). However, for microarray datasets the veracity of each null hypothesis tested is unknown, so that an objective evaluation of performance cannot be rendered for application data. Due to the role of methylation in X-chromosome inactivation, we postulate that high-throughput methylation datasets may provide an appropriate forum for assessing the performance of commonly used FDR methodologies. These datasets preserve the complex correlation structure between probes, offering an advantage over simulated datasets. Using several methylation datasets, commonly used FDR methods including the q-value, Benjamini & Hochberg, and Benjamini & Yekutieli procedures, as well as Holm’s step-down method, were applied to identify CpG sites that are differentially methylated when comparing healthy males to healthy females. The methods were compared with respect to their ability to identify CpG sites located on sex chromosomes as significant, by reporting the sensitivity, specificity, and observed FDR. These datasets are useful for characterizing the performance of multiple comparison procedures, and may find further utility in other tasks such as comparing variable selection capabilities of classification methods and evaluating the performance of meta-analytic methods for microarray data.
false discovery rate; family wise error rate; methylation; sensitivity; specificity; multiple comparisons
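As a reference point for the comparison, the Benjamini & Hochberg step-up rule can be sketched directly (a generic implementation, not the code used in the study):

```python
import numpy as np

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure at FDR level q.
    Returns a boolean mask of rejections in the original order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m   # compare sorted p to the BH line
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])              # largest index under the line
        reject[order[:k + 1]] = True                  # reject that and all smaller p
    return reject
```

The step-up character is what the snippet makes visible: once any sorted p-value falls under the line q·i/m, all smaller p-values are rejected as well, even those individually above their own thresholds.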
Studies of ocular disease and analyses of time to disease onset are complicated by the correlation expected between the two eyes from a single patient. We overcome these statistical modeling challenges through a nonparametric Bayesian frailty model. While this model suggests itself as a natural one for such complex data structures, model fitting routines become overwhelmingly complicated and computationally intensive given the nonparametric form assumed for the frailty distribution and baseline hazard function. We consider empirical Bayesian methods to alleviate these difficulties through a routine that iterates between frequentist, data-driven estimation of the cumulative baseline hazard and Markov chain Monte Carlo estimation of the frailty and regression coefficients. We show both in theory and through simulation that this approach yields consistent estimators of the parameters of interest. We then apply the method to the short-wave automated perimetry (SWAP) data set to study risk factors of glaucomatous visual field deficits.
multivariate survival analysis; nonparametric Pólya tree prior; Gibbs sampler; Metropolis-Hastings sampler; goodness of fit; glaucoma and ophthalmology data
Statistical inference in censored quantile regression is challenging, partly due to the unsmoothness of the quantile score function. A new procedure is developed to estimate the variance of Bang and Tsiatis’s inverse-censoring-probability weighted estimator for censored quantile regression by employing the idea of induced smoothing. The proposed variance estimator is shown to be asymptotically consistent. In addition, numerical study suggests that the proposed procedure performs well in finite samples, and it is computationally more efficient than the commonly used bootstrap method.
censored quantile regression; smoothing; survival analysis; variance estimation
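The induced-smoothing device replaces the discontinuous indicator in the score with its expectation under a small Gaussian perturbation, I(u ≥ 0) → Φ(u/r), so that standard Newton iterations and plug-in variance estimates apply. A toy illustration on the uncensored median, the simplest quantile score (the censored, covariate-adjusted case follows the same pattern):

```python
import numpy as np
from scipy.stats import norm

def smoothed_median(y, r=None, tol=1e-10, max_iter=100):
    """Solve sum_i [Phi((b - y_i)/r) - 0.5] = 0 by Newton's method.
    The smoothed score is differentiable, unlike the exact quantile score."""
    y = np.asarray(y, dtype=float)
    n = y.size
    if r is None:
        r = y.std() / np.sqrt(n)           # typical O(n^{-1/2}) bandwidth
    b = y.mean()                           # starting value
    for _ in range(max_iter):
        u = (b - y) / r
        score = norm.cdf(u).sum() - 0.5 * n
        slope = norm.pdf(u).sum() / r      # derivative of the smoothed score
        step = score / slope
        b -= step
        if abs(step) < tol:
            break
    return b

b_hat = smoothed_median(np.arange(1.0, 10.0))   # data symmetric about 5
```

Because the smoothed score has an explicit derivative, the same machinery yields a sandwich-type variance estimate directly, which is the computational advantage over bootstrap resampling noted above.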
Linear mixed-effects models involve fixed effects, random effects and covariance structure, which require model selection to simplify a model and to enhance its interpretability and predictability. In this article, we develop, in the context of linear mixed-effects models, the generalized degrees of freedom and an adaptive model selection procedure defined by a data-driven model complexity penalty. Numerically, the procedure performs well against its competitors not only in selecting fixed effects but in selecting random effects and covariance structure as well. Theoretically, asymptotic optimality of the proposed methodology is established over a class of information criteria. The proposed methodology is applied to the BioCycle study, to determine predictors of hormone levels among premenopausal women and to assess variation in hormone levels both between and within women across the menstrual cycle.
Adaptive penalty; linear mixed-effects models; loss estimation; generalized degrees of freedom
Canonical correlation analysis (CCA) is a widely used multivariate method for assessing the association between two sets of variables. However, when the number of variables far exceeds the number of subjects, such as in the case of large-scale genomic studies, the traditional CCA method is not appropriate. In addition, when the variables are highly correlated, the sample covariance matrices become unstable or undefined. To overcome these two issues, sparse canonical correlation analysis (SCCA) for multiple data sets has been proposed using a Lasso type of penalty. However, these methods do not have direct control over the sparsity of the solution. An additional step that uses the Bayesian Information Criterion (BIC) has also been suggested to further filter out unimportant features. In this paper, a comparison of four penalty functions (Lasso, Elastic-net, SCAD and Hard-threshold) for SCCA with and without the BIC filtering step has been carried out using both real and simulated genotypic and mRNA expression data. This study indicates that the SCAD penalty with the BIC filter would be a preferable penalty function for application of SCCA to genomic data.
SCCA; Lasso; Elastic-net; SCAD; BIC; penalty; SNP; mRNA expression
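The four penalties differ most visibly in the univariate thresholding rules they induce, which is where the sparsity control comes from. A sketch of the closed-form rules (SCAD with the conventional a = 3.7; the Elastic-net rule shown here is the naive soft-threshold-then-rescale form):

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso rule: shrink toward zero by lam, then truncate at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def hard_threshold(z, lam):
    """Hard-threshold rule: keep or kill, no shrinkage of survivors."""
    return np.where(np.abs(z) > lam, z, 0.0)

def enet_threshold(z, lam1, lam2):
    """Naive Elastic-net rule: soft-threshold, then ridge-style rescale."""
    return soft_threshold(z, lam1) / (1.0 + lam2)

def scad_threshold(z, lam, a=3.7):
    """SCAD rule: soft-thresholding near zero, unbiased for large |z|."""
    z = np.asarray(z, dtype=float)
    return np.where(
        np.abs(z) <= 2 * lam,
        soft_threshold(z, lam),
        np.where(np.abs(z) <= a * lam,
                 ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),
                 z),
    )
```

The SCAD rule leaves large coefficients untouched while still zeroing out small ones, which is consistent with the study's finding that it retains signal features without the bias the Lasso imposes on strong effects.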
We evaluate the performance of the Dirichlet process mixture (DPM) and the latent class model (LCM) in identifying autism phenotype subgroups based on categorical autism spectrum disorder (ASD) diagnostic features from the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition Text Revision. A simulation study is designed to mimic the diagnostic features in the ASD dataset in order to evaluate the LCM and DPM methods in this context. Likelihood-based information criteria and DPM partitioning are used to identify the best fitting models. The Rand statistic is used to compare the performance of the methods in recovering simulated phenotype subgroups. Our results indicate excellent recovery of the simulated subgroup structure for both methods. The LCM performs slightly better than the DPM when the correct number of latent subgroups is selected a priori. The DPM method utilizes a maximum a posteriori (MAP) criterion to estimate the number of classes, and yields results in fair agreement with the LCM method. Comparison of model fit indices in identifying the best fitting LCM shows that the adjusted Bayesian information criterion (ABIC) selects the correct number of classes over 90% of the time. Thus, when diagnostic features are categorical and there is some prior information regarding the number of latent classes, LCM in conjunction with ABIC is preferred.
autism; Dirichlet process mixture; latent class model
Random-effects change point models are formulated for longitudinal data obtained from cognitive tests. The conditional distribution of the response variable in a change point model is often assumed to be normal even if the response variable is discrete and shows ceiling effects. For the sum score of a cognitive test, the binomial and the beta-binomial distributions are presented as alternatives to the normal distribution. Smooth shapes for the change point models are imposed. Estimation is by marginal maximum likelihood where a parametric population distribution for the random change point is combined with a non-parametric mixing distribution for other random effects. An extension to latent class modelling is possible in case some individuals do not experience a change in cognitive ability. The approach is illustrated using data from a longitudinal study of Swedish octogenarians and nonagenarians that began in 1991. Change point models are applied to investigate cognitive change in the years before death.
Beta-binomial distribution; Latent class model; Mini-mental state examination; Random-effects model
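The beta-binomial alternative mentioned above handles both the bounded support and the overdispersion of a sum score. A minimal sketch of its pmf (the score range and shape parameters are illustrative, loosely evoking an MMSE-style 0-30 scale with a ceiling effect):

```python
import numpy as np
from scipy.special import betaln, gammaln

def betabinom_pmf(k, n, a, b):
    """Beta-binomial pmf: C(n, k) * B(k + a, n - k + b) / B(a, b),
    i.e. a binomial whose success probability has a Beta(a, b) prior."""
    logcoef = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
    return np.exp(logcoef + betaln(k + a, n - k + b) - betaln(a, b))

# a bounded 0..30 sum score skewed toward the ceiling (a > b)
k = np.arange(31)
p = betabinom_pmf(k, 30, a=8.0, b=2.0)   # mean = n a/(a+b) = 24
```

Unlike the normal, this distribution respects the score's bounds exactly and allows variance beyond the binomial's n p (1 − p), which is what makes it suitable for ceiling-affected cognitive test scores.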
A new comprehensive procedure for statistical analysis of two-dimensional polyacrylamide gel electrophoresis (2D PAGE) images is proposed, including protein region quantification, normalization and statistical analysis. Protein regions are defined by the master watershed map that is obtained from the mean gel. By working with these protein regions, the approach bypasses the current bottleneck in the analysis of 2D PAGE images: it does not require spot matching. Background correction is implemented in each protein region by local segmentation. Two-dimensional locally weighted smoothing (LOESS) is proposed to remove any systematic bias after quantification of protein regions. Proteins are separated into mutually independent sets based on detected correlations, and a multivariate analysis is used on each set to detect the group effect. A strategy for multiple hypothesis testing based on this multivariate approach combined with the usual Benjamini-Hochberg FDR procedure is formulated and applied to the differential analysis of 2D PAGE images. Each step in the analytical protocol is illustrated using an actual dataset. The effectiveness of the proposed methodology is demonstrated on simulated gels, in comparison with the commercial software packages PDQuest and Dymension. We also introduce a new procedure for simulating gel images.
proteomics; 2D PAGE; watershed region; spot matching; statistical image analysis; high dimension
We consider the estimation of the parameters indexing a parametric model for the conditional distribution of a diagnostic marker given covariates and disease status. Such models are useful for the evaluation of whether and to what extent a marker’s ability to accurately detect or discard disease depends on patient characteristics. A frequent problem that complicates the estimation of the model parameters is that estimation must be conducted from observational studies. Often, in such studies not all patients undergo the gold standard assessment of disease. Furthermore, the decision as to whether a patient undergoes verification is not controlled by study design. In such scenarios, maximum likelihood estimators based on subjects with observed disease status are generally biased. In this paper, we propose estimators for the model parameters that adjust for selection to verification that may depend on measured patient characteristics and additionally adjust for an assumed degree of residual association. Such estimators may be used as part of a sensitivity analysis for plausible degrees of residual association. We describe a doubly robust estimator that has the attractive feature of being consistent if either a model for the probability of selection to verification or a model for the probability of disease among the verified subjects (but not necessarily both) is correct.
Missing at Random; Nonignorable; Missing Covariate; Sensitivity Analysis; Semiparametric; Diagnosis
We present a Bayesian variable selection method for the setting in which the number of independent variables or predictors in a particular dataset is much larger than the available sample size. Most existing methods allow some degree of correlation among predictors but do not consider these correlations for variable selection; our method accounts for correlations among the predictors in variable selection. Our correlation-based stochastic search (CBS) method, the hybrid-CBS algorithm, extends a popular search algorithm for high-dimensional data, the stochastic search variable selection (SSVS) method. Similar to SSVS, we search the space of all possible models using variable addition, deletion or swap moves. However, our moves through the model space are designed to accommodate correlations among the variables. We describe our approach for continuous, binary, ordinal, and count outcome data. The impact of choices of prior distributions and hyper-parameters is assessed in simulation studies. We also examine performance of variable selection and prediction as the correlation structure of the predictors varies. We find that the hybrid-CBS results in lower prediction errors and better identifies the true outcome-associated predictors than SSVS when predictors are moderately to highly correlated. We illustrate the method on data from a proteomic profiling study of melanoma, a skin cancer.
Correlated predictors; correlation-based search; proteomic data
Consider clustered matched-pair studies for non-inferiority where clusters are independent but units in a cluster are correlated. An inexpensive new procedure and the expensive standard one are applied to each unit and outcomes are binary responses. Appropriate statistics for testing non-inferiority of a new procedure have been developed recently by several investigators. In this note, we investigate the power and sample size requirements of the clustered matched-pair study for non-inferiority. Power of a test is related primarily to the number of clusters; the effect of cluster size on power is secondary. The efficiency of a clustered matched-pair design is inversely related to the intra-class correlation coefficient within a cluster. We present an explicit formula for obtaining the number of clusters for a given cluster size, and the cluster size for a given number of clusters, for a specific power. We also provide alternative sample size calculations when available information regarding parameters is limited. The formulae can be useful in designing a clustered matched-pair study for non-inferiority. An example of determining the sample size to establish non-inferiority for a clustered matched-pair study is illustrated.
binary outcomes; clustered matched-pair; intra-class correlation coefficient; non-inferiority; power; sample size
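The paper's explicit formula is not reproduced here, but the qualitative claims can be illustrated with the standard design-effect heuristic, in which an independent-pairs sample size is inflated by 1 + (m − 1)ρ for cluster size m and intra-class correlation ρ (a generic sketch, not the authors' formula):

```python
import math

def clusters_needed(n_pairs_independent, m, icc):
    """Convert a sample size computed for independent matched pairs into a
    number of clusters of size m, inflating by the design effect
    1 + (m - 1) * icc to account for within-cluster correlation."""
    deff = 1.0 + (m - 1.0) * icc
    return math.ceil(n_pairs_independent * deff / m)
```

The heuristic reflects both qualitative findings above: the required number of clusters grows with ρ (efficiency falls as within-cluster correlation rises), while increasing m buys relatively little once ρ is nontrivial.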
Most variable selection techniques focus on first-order linear regression models. Often, interaction and quadratic terms are also of interest, but the number of candidate predictors grows very fast with the number of original predictors, making variable selection more difficult. Forward selection algorithms are thus developed that enforce natural hierarchies in second-order models to control the entry rate of uninformative effects and to equalize the false selection rates from first-order and second-order terms. Method performance is compared through Monte Carlo simulation and illustrated with data from a Cox regression and from a response surface experiment.
Bagging; False selection rate; Model selection; Response optimization; Variable selection
We consider the problem of dichotomizing a continuous covariate when performing a regression analysis based on a generalized estimation approach. The problem involves estimation of the cutpoint for the covariate and testing the hypothesis that the binary covariate constructed from the continuous covariate has a significant impact on the outcome. Due to the multiple testing used to find the optimal cutpoint, we need to make an adjustment to the usual significance test to preserve the type-I error rate. We illustrate the techniques on a data set of patients given unrelated hematopoietic stem cell transplantation. Here the questions are whether the CD34 cell dose given to the patient affects the outcome of the transplant and what the smallest cell dose needed for good outcomes is.
Dichotomized outcomes; Generalized estimating equations; Generalized linear model; Pseudo-values; Survival analysis
Exact calculations of model posterior probabilities or related quantities are often infeasible due to the analytical intractability of predictive densities. Here new approximations to obtain predictive densities are proposed and contrasted with those based on the Laplace method. Our theory and a numerical study indicate that the proposed methods are easy to implement, computationally efficient, and accurate over a wide range of hyperparameters. In the context of GLMs, we show that they can be employed to facilitate the posterior computation under three general classes of informative priors on regression coefficients. A real example is provided to demonstrate the feasibility and usefulness of the proposed methods in a fully Bayes variable selection procedure.
Laplace Approximation; GLM; Normal Prior; Power Prior; Conjugate Prior; Asymptotic Normality; Logistic Regression
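As a baseline for the comparison above, the Laplace method approximates log m(y) = log ∫ p(y | θ) p(θ) dθ by a Gaussian integral at the posterior mode. On a conjugate normal model the approximation is exact, which makes it easy to sanity-check (a generic sketch, not the paper's proposed approximations):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def laplace_log_marginal(y, tau2):
    """Laplace approximation to log m(y) for y_i ~ N(theta, 1), theta ~ N(0, tau2):
    log p(y | theta_hat) + log p(theta_hat) + (1/2) log(2 pi) - (1/2) log H."""
    n = y.size
    H = n + 1.0 / tau2                        # negative Hessian of the log joint
    theta_hat = y.sum() / H                   # posterior mode
    log_joint = (norm.logpdf(y, theta_hat, 1.0).sum()
                 + norm.logpdf(theta_hat, 0.0, np.sqrt(tau2)))
    return log_joint + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(H)

# exact log marginal: y is multivariate normal with covariance I + tau2 * J
y = np.array([0.3, -1.2, 0.8, 2.1])
tau2 = 4.0
approx = laplace_log_marginal(y, tau2)
exact = multivariate_normal.logpdf(y, np.zeros(4), np.eye(4) + tau2 * np.ones((4, 4)))
```

For non-Gaussian likelihoods such as GLMs the two quantities no longer coincide, and the gap is exactly what alternative predictive-density approximations like those proposed here aim to shrink at lower cost.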
The Gibbs sampler has been used exclusively for compatible conditionals that converge to a unique invariant joint distribution. However, conditional models are not always compatible. In this paper, a Gibbs sampling-based approach, the Gibbs ensemble, is proposed to search for a joint distribution that deviates least from a prescribed set of conditional distributions. The algorithm scales easily and can handle large data sets of high dimensionality. Using simulated data, we show that the proposed approach provides joint distributions that are less discrepant from the incompatible conditionals than those obtained by other methods discussed in the literature. The ensemble approach is also applied to a data set regarding geno-polymorphism and response to chemotherapy in patients with metastatic colorectal cancer.
Gibbs sampler; Conditionally specified distribution; Linear programming; Ensemble method; Odds ratio
Breast cancer is the most common non-skin cancer in women and the second most common cause of cancer-related death in U.S. women. It is well known that breast cancer survival varies by age at diagnosis; for most cancers the relative survival decreases with age, but breast cancer may show an unusual age pattern. In order to reveal the pattern of stage risk and age effects, we propose a semiparametric accelerated failure time partial linear model and develop its estimation method based on the P-spline and the rank estimation approach. The simulation studies demonstrate that the proposed method is comparable to the parametric approach when the data are not contaminated, and more stable than the parametric methods when the data are contaminated. By applying the proposed model and method to the breast cancer data set of Atlantic County, New Jersey from the SEER program, we successfully reveal the significant effects of stage, and show that women diagnosed around age 38 have consistently higher survival rates than either younger or older women.
Accelerated failure time model; Partial linear model; Penalized spline; Rank estimation; Robustness