Summary
In modern statistical applications, the dimension of covariates can be much larger than the sample size. In the context of linear models, correlation screening (Fan and Lv, 2008) has been shown to reduce the dimension of such data effectively while achieving the sure screening property, i.e., all of the active variables can be retained with high probability. However, screening based on the Pearson correlation does not perform well when applied to contaminated covariates and/or censored outcomes. In this paper, we study censored rank independence screening of high-dimensional survival data. The proposed method is robust to predictors that contain outliers, works for a general class of survival models, and enjoys the sure screening property. Simulations and an analysis of real data demonstrate that the proposed method performs competitively on survival data sets of moderate sample size with high-dimensional predictors, even when the predictors are contaminated.
doi:10.1093/biomet/asu047
PMCID: PMC4318124
PMID: 25663709
High-dimensional survival data; Rank independence screening; Sure screening property
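The screening idea above can be sketched with a marginal Kendall's tau utility, which is insensitive to gross outliers in the covariates. This is an illustration of rank-based screening only, not the authors' censored-data estimator (which additionally corrects for censoring); the data-generating model, the cutoff d, and all parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kendalltau

def rank_screen(X, y, d):
    """Keep the d covariates with the largest marginal |Kendall tau| with y."""
    tau = np.array([abs(kendalltau(X[:, j], y)[0]) for j in range(X.shape[1])])
    return np.argsort(tau)[::-1][:d]

rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
y = 2 * X[:, 3] - 1.5 * X[:, 7] + rng.standard_normal(n)   # active set {3, 7}
X[rng.choice(n, 5, replace=False), 3] = 50.0               # gross outliers in X_3
kept = rank_screen(X, y, d=20)
print({3, 7} <= set(kept.tolist()))
```

Because Kendall's tau depends only on pairwise orderings, the injected outliers perturb at most a small fraction of the concordance pairs, so the contaminated active covariate still ranks near the top, whereas a Pearson-based utility would be degraded.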
Summary
As usually formulated, the nonparametric likelihood for the bivariate survivor function is over-parameterized, resulting in uniqueness problems for the corresponding nonparametric maximum likelihood estimator. Here the estimation problem is redefined to include parameters for marginal hazard rates, and for double failure hazard rates only at informative uncensored failure time grid points where there is pertinent empirical information. Double failure hazard rates at other grid points in the risk region are specified rather than estimated. With this approach the nonparametric maximum likelihood estimator is unique and can be calculated using a two-step procedure. The first step involves setting aside all doubly censored observations that are interior to the risk region. The nonparametric maximum likelihood estimator from the remaining data turns out to be the Dabrowska (1988) estimator. The omitted doubly censored observations are included in the procedure in the second step using self-consistency, resulting in a non-iterative nonparametric maximum likelihood estimator for the bivariate survivor function. Simulation evaluation and asymptotic distributional results are provided. Moderate sample size efficiency for the survivor function nonparametric maximum likelihood estimator is similar to that for the Dabrowska estimator as applied to the entire dataset, while some useful efficiency improvement arises for the corresponding distribution function estimator, presumably due to the avoidance of negative mass assignments.
doi:10.1093/biomet/asu010
PMCID: PMC4306565
PMID: 25632162
Bivariate survivor function; Censored data; Dabrowska estimator; Kaplan–Meier estimator; Non-parametric maximum likelihood; Self-consistency
Summary
Evidence-based rules for optimal treatment allocation are key components in the quest for efficient, effective health care delivery. Q-learning, an approximate dynamic programming algorithm, is a popular method for estimating optimal sequential decision rules from data. Q-learning requires the modeling of nonsmooth, nonmonotone transformations of the data, complicating the search for adequately expressive, yet parsimonious, statistical models. The default Q-learning working model is multiple linear regression, which is not only provably misspecified under most data-generating models, but also results in nonregular regression estimators, complicating inference. We propose an alternative strategy for estimating optimal sequential decision rules for which the requisite statistical modeling does not depend on nonsmooth, nonmonotone transformed data, does not result in nonregular regression estimators, is consistent under a broader array of data-generation models than Q-learning, results in estimated sequential decision rules that have better sampling properties, and is amenable to established statistical approaches for exploratory data analysis, model building, and validation. We derive the new method, IQ-learning, via an interchange in the order of certain steps in Q-learning. In simulated experiments IQ-learning improves on Q-learning in terms of integrated mean squared error and power. The method is illustrated using data from a study of major depressive disorder.
doi:10.1093/biomet/asu043
PMCID: PMC4274394
PMID: 25541562
Dynamic treatment regime; Personalized medicine; Treatment selection
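The backward-induction structure of Q-learning, including the nonsmooth max step whose ordering IQ-learning interchanges, can be sketched for two stages with linear working models. The generative model, coefficients, and variable names are illustrative assumptions; IQ-learning itself additionally models the distribution of the stage-two contrast inputs, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x1 = rng.standard_normal(n)               # baseline covariate
a1 = rng.choice([-1.0, 1.0], n)           # randomized stage-1 treatment
x2 = 0.5 * x1 + rng.standard_normal(n)    # intermediate covariate
a2 = rng.choice([-1.0, 1.0], n)           # randomized stage-2 treatment
y = x1 + a1 * (0.5 - x1) + a2 * (1.0 + 2.0 * x2) + rng.standard_normal(n)

# Stage 2: linear Q-function in the full history; the treatment contrast
# is the part multiplying a2.
Z2 = np.column_stack([np.ones(n), x1, a1, a1 * x1, x2, a2, a2 * x2])
b = np.linalg.lstsq(Z2, y, rcond=None)[0]
# Pseudo-outcome: predicted outcome under the optimal stage-2 action,
# i.e. main effect plus |contrast| -- the nonsmooth, nonmonotone max step.
ytilde = Z2[:, :5] @ b[:5] + np.abs(b[5] + b[6] * x2)
# Stage 1: regress the pseudo-outcome on stage-1 history.
Z1 = np.column_stack([np.ones(n), x1, a1, a1 * x1])
g = np.linalg.lstsq(Z1, ytilde, rcond=None)[0]
# Estimated rules: treat at stage 2 when b[5] + b[6]*x2 > 0,
# and at stage 1 when g[2] + g[3]*x1 > 0.
print(np.round([b[5], b[6], g[2], g[3]], 2))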
Summary
The development of high-throughput biomedical technologies has led to increased interest in the analysis of high-dimensional data where the number of features is much larger than the sample size. In this paper, we investigate principal component analysis under the ultra-high dimensional regime, where both the number of features and the sample size increase as the ratio of the two quantities also increases. We bridge the existing results from the finite and the high-dimension low sample size regimes, embedding the two regimes in a more general framework. We also numerically demonstrate the universal application of the results from the finite regime.
doi:10.1093/biomet/ast064
PMCID: PMC4135472
PMID: 25143634
High-Dimension Low Sample Size Data; Principal Component Analysis; Random Matrix
Summary
We propose an adaptive nuclear norm penalization approach for low-rank matrix approximation, and use it to develop a new reduced rank estimation method for high-dimensional multivariate regression. The adaptive nuclear norm is defined as the weighted sum of the singular values of the matrix, and it is generally non-convex under the natural restriction that the weight decreases with the singular value. However, we show that the proposed non-convex penalized regression method has a global optimal solution obtained from an adaptively soft-thresholded singular value decomposition. The method is computationally efficient, and the resulting solution path is continuous. The rank consistency of and prediction/estimation performance bounds for the estimator are established for a high-dimensional asymptotic regime. Simulation studies and an application in genetics demonstrate its efficacy.
doi:10.1093/biomet/ast036
PMCID: PMC4101086
PMID: 25045172
Low-rank approximation; Nuclear norm penalization; Reduced rank regression; Singular value decomposition
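The adaptively soft-thresholded singular value decomposition described above admits a few-line sketch. The weights are taken here as w_i = d_i^(-gamma) computed from the observed singular values, in the spirit of adaptive-lasso weighting; gamma and lam are illustrative tuning choices rather than the paper's data-driven ones.

```python
import numpy as np

def adaptive_svt(C, lam, gamma=2.0):
    """Global minimizer of ||C - B||_F^2 / 2 + lam * sum_i w_i * sigma_i(B)."""
    U, d, Vt = np.linalg.svd(C, full_matrices=False)
    w = d ** (-gamma)                       # heavier penalty on smaller singular values
    d_new = np.maximum(d - lam * w, 0.0)    # adaptive soft thresholding
    return U @ (d_new[:, None] * Vt)

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))  # rank-3 signal
C = A + 0.1 * rng.standard_normal((50, 40))                      # noisy observation
est = adaptive_svt(C, lam=8.0)
print(np.linalg.matrix_rank(est, tol=1e-8),
      np.linalg.norm(est - A) / np.linalg.norm(A))
```

Because the weights decrease with the singular values, small (noise-level) singular values are thresholded to zero while large (signal) singular values are left nearly unshrunk, which is why the solution path can achieve rank recovery despite the non-convex penalty.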
Summary
In longitudinal data analysis, statistical inference for sparse data and dense data can be substantially different. For the kernel smoothing estimate of the mean function, the convergence rates and limiting variance functions differ between the two scenarios. The latter phenomenon poses challenges for statistical inference, as a subjective choice between the sparse and dense cases may lead to wrong conclusions. We develop self-normalization based methods that can adapt to the sparse and dense cases in a unified framework. Simulations show that the proposed methods outperform some existing methods.
doi:10.1093/biomet/ass050
PMCID: PMC4066936
PMID: 24966413
Dense longitudinal data; Kernel smoothing; Mixed-effects model; Nonparametric estimation; Self-normalization; Sparse longitudinal data
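For reference, the kernel smoothing estimate of the mean function whose sparse- and dense-case limits differ can be sketched with a Nadaraya-Watson smoother pooling all subjects' observations. The design, bandwidth, and noise level below are illustrative assumptions, and the self-normalized inference itself is not shown.

```python
import numpy as np

def nw_mean(t_grid, t_obs, y_obs, h):
    """Nadaraya-Watson estimate of the mean function with a Gaussian kernel."""
    K = np.exp(-0.5 * ((t_grid[:, None] - t_obs[None, :]) / h) ** 2)
    return (K @ y_obs) / K.sum(axis=1)

rng = np.random.default_rng(3)
n_subj, m = 200, 5                          # few observations per subject (sparse case)
t = rng.uniform(0, 1, (n_subj, m))          # subject-specific observation times
y = np.sin(2 * np.pi * t) + 0.5 * rng.standard_normal((n_subj, m))
grid = np.linspace(0.1, 0.9, 9)
muhat = nw_mean(grid, t.ravel(), y.ravel(), h=0.05)
print(np.max(np.abs(muhat - np.sin(2 * np.pi * grid))))
```

Whether the per-subject count m stays bounded (sparse) or grows with the number of subjects (dense) is exactly what changes the convergence rate and limiting variance of this estimator, motivating the adaptive inference above.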
Summary
We propose a multiple imputation estimator for parameter estimation in a quantile regression model when some covariates are missing at random. The estimation procedure fully utilizes the entire dataset to achieve increased efficiency, and the resulting coefficient estimators are root-n consistent and asymptotically normal. To protect against possible model misspecification, we further propose a shrinkage estimator, which automatically adjusts for possible bias. The finite sample performance of our estimator is investigated in a simulation study. Finally, we apply our methodology to part of the Eating at America's Table Study data, investigating the association between two measures of dietary intake.
doi:10.1093/biomet/ass007
PMCID: PMC4059083
PMID: 24944347
Missing data; Multiple imputation; Quantile regression; Regression quantile; Shrinkage estimation
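A minimal sketch of the multiple-imputation idea for median regression with a covariate missing at random follows. The imputation model (a normal linear model for x given the fully observed z and y, fitted on complete cases), the check-loss optimizer, and all parameter values are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 800
z = rng.standard_normal(n)                       # fully observed covariate
x = 0.7 * z + rng.standard_normal(n)
y = 1.0 + 2.0 * x + rng.standard_normal(n)
miss = rng.random(n) < 1 / (1 + np.exp(-z))      # missingness depends only on z (MAR)
x_obs = np.where(miss, np.nan, x)

def median_reg(X, y):
    """Fit y ~ X by minimizing the tau = 0.5 check loss."""
    loss = lambda b: np.sum(np.abs(y - X @ b))
    return minimize(loss, np.zeros(X.shape[1]), method="Nelder-Mead").x

cc = ~miss
# imputation model for x given (z, y), fitted on complete cases
D = np.column_stack([np.ones(cc.sum()), z[cc], y[cc]])
coef, *_ = np.linalg.lstsq(D, x_obs[cc], rcond=None)
resid_sd = np.std(x_obs[cc] - D @ coef)

M, betas = 10, []
for _ in range(M):                               # M imputed datasets
    x_imp = x_obs.copy()
    Dm = np.column_stack([np.ones(miss.sum()), z[miss], y[miss]])
    x_imp[miss] = Dm @ coef + resid_sd * rng.standard_normal(miss.sum())
    betas.append(median_reg(np.column_stack([np.ones(n), x_imp]), y))
beta_mi = np.mean(betas, axis=0)                 # combine across imputations
print(np.round(beta_mi, 1))                      # close to the true (1, 2)
```

Drawing imputations from the conditional distribution of x given the observed (z, y), rather than plugging in a single prediction, is what preserves the joint distribution and hence the quantile-regression coefficients.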
Summary
The case-cohort study design, used to reduce costs in large cohort studies, is a random sample of the entire cohort, named the subcohort, augmented with subjects having the disease of interest but not in the subcohort sample. When several diseases are of interest, several case-cohort studies may be conducted using the same subcohort, with each disease analyzed separately, ignoring the additional exposure measurements collected on subjects with the other diseases. This is not an efficient use of the data, and in this paper, we propose more efficient estimators. We consider both joint and separate analyses for the multiple diseases. We propose an estimating equation approach with a new weight function, and we establish the consistency and asymptotic normality of the resulting estimator. Simulation studies show that the proposed methods using all available information gain efficiency. We apply our proposed method to the data from the Busselton Health Study.
doi:10.1093/biomet/ast018
PMCID: PMC3950393
PMID: 24634519
Case-cohort study; Multiple disease outcomes; Multivariate failure time; Proportional hazards; Survival analysis
Summary
A dynamic treatment regime is a list of sequential decision rules for assigning treatment based on a patient's history. Q- and A-learning are two main approaches for estimating the optimal regime, i.e., that yielding the most beneficial outcome in the patient population, using data from a clinical trial or observational study. Q-learning requires postulated regression models for the outcome, while A-learning involves models for that part of the outcome regression representing treatment contrasts and for treatment assignment. We propose an alternative to Q- and A-learning that maximizes a doubly robust augmented inverse probability weighted estimator for the population mean outcome over a restricted class of regimes. Simulations demonstrate the method's performance and its robustness to model misspecification, a key practical concern.
doi:10.1093/biomet/ast014
PMCID: PMC3843953
PMID: 24302771
A-learning; Double robustness; Outcome regression; Propensity score; Q-learning
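The value criterion being maximized can be sketched in a single-stage version: a doubly robust augmented inverse probability weighted estimate of the mean outcome under a candidate rule, maximized over a one-parameter class of threshold rules. The randomized design, the working outcome model, and the rule class below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 4000
x = rng.standard_normal(n)
a = rng.integers(0, 2, n).astype(float)               # randomized treatment
y = x + a * 2.0 * (x - 0.3) + rng.standard_normal(n)  # optimal rule: treat iff x > 0.3
pi = np.full(n, 0.5)                                  # known propensity score

# working outcome model m(x, a), fitted by least squares
Z = np.column_stack([np.ones(n), x, a, a * x])
b = np.linalg.lstsq(Z, y, rcond=None)[0]
m = lambda xx, aa: b[0] + b[1] * xx + aa * (b[2] + b[3] * xx)

def aipw_value(c):
    """AIPW estimate of the mean outcome if treatment follows d(x) = 1{x > c}."""
    d = (x > c).astype(float)
    cons = (a == d).astype(float)          # indicator of rule-consistent treatment
    return np.mean(cons * y / pi + (1.0 - cons / pi) * m(x, d))

cs = np.linspace(-1.5, 1.5, 61)
c_hat = cs[int(np.argmax([aipw_value(c) for c in cs]))]
print(c_hat)
```

The augmentation term makes each value estimate consistent if either the propensity or the outcome model is correct, which is the double robustness the simulations above examine; maximizing over richer rule classes replaces this grid search in practice.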
Summary
Inverse probability-weighted estimators are widely used in applications where data are missing due to nonresponse or censoring, and in the estimation of causal effects from observational studies. Current estimators rely on ignorability assumptions for response indicators or treatment assignment, conditional on observed covariates that are assumed to be measured without error. However, measurement error is common for the variables collected in many applications. For example, in studies of educational interventions, student achievement as measured by standardized tests is almost always used as the key covariate for removing hidden biases, but standardized test scores may have substantial measurement errors. We provide several expressions for a weighting function that can yield a consistent estimator for population means using incomplete data and covariates measured with error. We propose a method to estimate the weighting function from data. A simulation study shows that the resulting estimator is consistent and approximately unbiased, with small variance.
doi:10.1093/biomet/ast022
PMCID: PMC4006991
PMID: 24795484
Causal inference; Measurement error; Missing observation; Propensity score
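With error-free covariates, the standard inverse probability weighting that the paper generalizes can be sketched as follows. The logistic response model is fitted by a few Newton-Raphson steps, and the data-generating design is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
x = rng.standard_normal(n)
y = 2.0 + x + rng.standard_normal(n)         # true population mean of y is 2
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))       # response probability given x
r = rng.random(n) < p                        # response indicator (MAR given x)

def fit_logistic(X, r, iters=25):
    """Logistic regression of r on X via Newton-Raphson."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ b))
        W = mu * (1 - mu)
        b += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (r - mu))
    return b

X = np.column_stack([np.ones(n), x])
phat = 1 / (1 + np.exp(-X @ fit_logistic(X, r)))
ipw_mean = np.mean(r * y / phat)             # Horvitz-Thompson-type weighted mean
cc_mean = y[r].mean()                        # complete-case mean, biased under MAR
print(round(ipw_mean, 2), round(cc_mean, 2))
```

When x is replaced by a mismeasured surrogate, weights built this way no longer remove the bias, which is the problem the weighting functions proposed above are designed to solve.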
Summary
In single hypothesis testing, power is a non-decreasing function of type I error rate; hence it is desirable to test at the nominal level exactly to achieve optimal power. The puzzle lies in the fact that for multiple testing, under the false discovery rate paradigm, such a monotonic relationship may not hold. In particular, exact false discovery rate control may lead to a less powerful testing procedure if a test statistic fails to fulfil the monotone likelihood ratio condition. In this article, we identify different scenarios wherein the condition fails and give caveats for conducting multiple testing in practical settings.
doi:10.1093/biomet/ast001
PMCID: PMC3984571
PMID: 24733954
False discovery rate; Heteroscedasticity; Monotone likelihood ratio; Multiple testing dependence
Summary
We show that relative mean survival parameters of a semiparametric log-linear model can be estimated using covariate data from an incident sample and a prevalent sample, even when there is no prospective follow-up to collect any survival data. Estimation is based on an induced semiparametric density ratio model for covariates from the two samples, and it shares the same structure as for a logistic regression model for case-control data. Likelihood inference coincides with well-established methods for case-control data. We show two further related results. First, estimation of interaction parameters in a survival model can be performed using covariate information only from a prevalent sample, analogous to a case-only analysis. Furthermore, propensity score and conditional exposure effect parameters on survival can be estimated using only covariate data collected from incident and prevalent samples.
doi:10.1093/biomet/ast008
PMCID: PMC3879155
PMID: 24391222
Accelerated failure time model; Biased sampling; Empirical likelihood; Prevalent cohort; Propensity score; Proportional mean residual life model
Summary
The aim of this paper is to develop a general framework of Bayesian influence analysis for assessing various perturbation schemes to the data, the prior and the sampling distribution for a class of statistical models. We introduce a perturbation model to characterize these various perturbation schemes. We develop a geometric framework, called the Bayesian perturbation manifold, and use its associated geometric quantities including the metric tensor and geodesic to characterize the intrinsic structure of the perturbation model. We develop intrinsic influence measures and local influence measures based on the Bayesian perturbation manifold to quantify the effect of various perturbations to statistical models. Theoretical and numerical examples are examined to highlight the broad spectrum of applications of this local influence method in a formal Bayesian analysis.
doi:10.1093/biomet/asr009
PMCID: PMC3897258
PMID: 24453379
Bayesian; Influence measure; Perturbation manifold; Perturbation model; Prior
Summary
A rate model is proposed for a modulated renewal process comprising a single long sequence, where the covariate process may not capture the dependencies in the sequence as in standard intensity models. We consider partial likelihood-based inferences under a semiparametric multiplicative rate model, which has been widely studied in the context of independent and identical data. Under an intensity model, gap times in a single long sequence may be used naively in the partial likelihood with variance estimation utilizing the observed information matrix. Under a rate model, the gap times cannot be treated as independent and studying the partial likelihood is much more challenging. We employ a mixing condition in the application of limit theory for stationary sequences to obtain consistency and asymptotic normality. The estimator's variance is quite complicated owing to the unknown dependence structure of the gap times. We adapt block bootstrapping and cluster variance estimators to the partial likelihood. Simulation studies and an analysis of a semiparametric extension of a popular model for neural spike train data demonstrate the practical utility of the rate approach in comparison with the intensity approach.
doi:10.1093/biomet/ast011
PMCID: PMC3925684
PMID: 24550568
Block bootstrap; Mixing condition; Neurophysiology; Partial likelihood; Single sequence; Stationary limit theory
Summary
In the analysis of multivariate event times, frailty models assuming time-independent regression coefficients are often considered, mainly due to their mathematical convenience. In practice, regression coefficients are often time dependent and the temporal effects are of clinical interest. Motivated by a phase III clinical trial in multiple sclerosis, we develop a semiparametric frailty modelling approach to estimate time-varying effects for overdispersed recurrent events data with treatment switching. The proposed model incorporates the treatment switching time in the time-varying coefficients. Theoretical properties of the proposed model are established and an efficient expectation-maximization algorithm is derived to obtain the maximum likelihood estimates. Simulation studies evaluate the numerical performance of the proposed model under various temporal treatment effect curves. The ideas in this paper can also be used for time-varying coefficient frailty models without treatment switching as well as for alternative models when the proportional hazard assumption is violated. A multiple sclerosis dataset is analysed to illustrate our methodology.
doi:10.1093/biomet/ass091
PMCID: PMC3899844
PMID: 24465031
B-spline; Expectation-maximization algorithm; Maximum likelihood estimate; Recurrent event; Time-varying coefficient; Treatment switching
Summary
Clustered survival data frequently arise in biomedical applications, where event times of interest are clustered into groups such as families. In this article we consider an accelerated failure time frailty model for clustered survival data and develop nonparametric maximum likelihood estimation for it via a kernel smoother aided EM algorithm. We show that the proposed estimator for the regression coefficients is consistent, asymptotically normal and semiparametric efficient when the kernel bandwidth is properly chosen. An EM-aided numerical differentiation method is derived for estimating its variance. Simulation studies evaluate the finite sample performance of the estimator, and it is applied to the Diabetic Retinopathy data set.
doi:10.1093/biomet/ast012
PMCID: PMC3893096
PMID: 24443587
Accelerated failure time model; Clustered survival data; EM algorithm; Kernel smoothing; Profile likelihood estimation
Summary
The proportional likelihood ratio model introduced in Luo & Tsai (2011) is adapted to explicitly model the means of observations. This is useful for the estimation of and inference on treatment effects, particularly in designed experiments, and allows the data analyst greater control over model specification and parameter interpretation.
doi:10.1093/biomet/asr075
PMCID: PMC3888642
PMID: 24421412
Empirical likelihood; Exponential tilting; Generalized linear models; Multi-way layout; Proportional likelihood ratio model; Quasi-likelihood; Semiparametric model
Summary
Recurrent event data frequently arise in longitudinal studies when study subjects possibly experience more than one event during the observation period. Often, such recurrent events can be categorized. However, part of the categorization may be missing due to technical difficulties. If the event types are missing completely at random, then a complete case analysis may provide consistent estimates of regression parameters in certain regression models, but estimates of the baseline event rates are generally biased. Previous work on nonparametric estimation of these rates has utilized parametric missingness models. In this paper, we develop fully nonparametric methods in which the missingness mechanism is completely unspecified. Consistency and asymptotic normality of the nonparametric estimators of the mean event functions accommodate nonparametric estimators of the event category probabilities, which converge more slowly than the parametric rate. Plug-in variance estimators are provided and perform well in simulation studies, where complete case estimators may exhibit large biases and parametric estimators generally have a larger mean squared error when the model is misspecified. The proposed methods are applied to data from a cystic fibrosis registry.
doi:10.1093/biomet/ast016
PMCID: PMC3887139
PMID: 24415792
Cystic fibrosis; Local polynomial regression; Nelson–Aalen estimation; Pseudomonas aeruginosa infection; Rate proportion
Summary
Copy number variants are an important type of genetic structural variation appearing in germline DNA, ranging from common to rare in a population. Both rare and common copy number variants have been reported to be associated with complex diseases, so it is important to simultaneously identify both based on a large set of population samples. We develop a proportion adaptive segment selection procedure that automatically adjusts to the unknown proportions of the carriers of the segment variants. We characterize the detection boundary that separates the region where a segment variant is detectable by some method from the region where it cannot be detected. Although the detection boundaries are very different for the rare and common segment variants, it is shown that the proposed procedure can reliably identify both whenever they are detectable. Compared with methods for single sample analysis, this procedure gains power by pooling information from multiple samples. The method is applied to analyze neuroblastoma samples and identifies a large number of copy number variants that are missed by single-sample methods.
doi:10.1093/biomet/ass059
PMCID: PMC3696347
PMID: 23825436
DNA copy number variant; Information pooling; Population structural variant
Summary
We show that the proportional likelihood ratio model proposed recently by Luo & Tsai (2012) enjoys model-invariant properties under certain forms of nonignorable missing mechanisms and randomly double-truncated data, so that target parameters in the population can be estimated consistently from those biased samples. We also construct an alternative estimator for the target parameters by maximizing a pseudo-likelihood that eliminates a functional nuisance parameter in the model. The corresponding estimating equation has a U-statistic structure. As an added advantage of the proposed method, a simple score-type test is developed to test a null hypothesis on the regression coefficients. Simulations show that the proposed estimator has a small-sample efficiency similar to that of the nonparametric likelihood estimator and performs well for certain nonignorable missing data problems.
doi:10.1093/biomet/ass056
PMCID: PMC3809024
PMID: 24179236
Double truncation; Nonignorable missingness; Pairwise pseudolikelihood; U-statistic
Summary
In the modeling of longitudinal data from several groups, appropriate handling of the dependence structure is of central importance. Standard methods include specifying a single covariance matrix for all groups or independently estimating the covariance matrix for each group without regard to the others, but when these model assumptions are incorrect, these techniques can lead to biased mean effects or loss of efficiency, respectively. Thus, it is desirable to develop methods to simultaneously estimate the covariance matrix for each group that will borrow strength across groups in a way that is ultimately informed by the data. In addition, for several groups with covariance matrices of even medium dimension, it is difficult to manually select a single best parametric model among the huge number of possibilities given by incorporating structural zeros and/or commonality of individual parameters across groups. In this paper we develop a family of nonparametric priors using the matrix stick-breaking process of Dunson et al. (2008) that seeks to accomplish this task by parameterizing the covariance matrices in terms of the parameters of their modified Cholesky decomposition (Pourahmadi, 1999). We establish some theoretical properties of these priors, examine their effectiveness via a simulation study, and illustrate the priors using data from a longitudinal clinical trial.
doi:10.1093/biomet/ass060
PMCID: PMC3852937
PMID: 24324281
Bayesian nonparametric inference; Cholesky decomposition; Matrix stick-breaking process; Simultaneous covariance estimation; Sparsity
Summary
We construct an asymptotic confidence interval for the mean of a class of nonstationary processes with constant mean and time-varying variances. Due to the large number of unknown parameters, traditional approaches based on consistent estimation of the limiting variance of sample mean through moving block or non-overlapping block methods are not applicable. Under a block-wise asymptotically equal cumulative variance assumption, we propose a self-normalized confidence interval that is robust against the nonstationarity and dependence structure of the data. We also apply the same idea to construct an asymptotic confidence interval for the mean difference of nonstationary processes with piecewise constant means. The proposed methods are illustrated through simulations and an application to global temperature series.
doi:10.1093/biomet/asq076
PMCID: PMC3852676
PMID: 24319293
Confidence interval; Global temperature; Invariance principle; Nonstationary process; Self-normalization; Time-varying variance
Summary
We consider statistical inference for additive partial linear models when the linear covariate is measured with error. We propose correction-for-attenuation and SIMEX estimators of the parameter of interest. The first estimator is shown to be asymptotically normal and to require no undersmoothing. This is an advantage over existing backfitting-based estimators for semiparametric additive models, which require undersmoothing of the nonparametric component in order for the estimator of the parametric component to be root-n consistent. This feature stems from a reduction in the bias of the resulting estimator, derived using a profile procedure. A similar characteristic in semiparametric partially linear models was obtained by Wang et al. (2005). We also discuss the asymptotics of the proposed SIMEX approach. Finite-sample performance of the proposed estimators is assessed by simulation experiments. The proposed methods are applied to a dataset from a semen study.
doi:10.1093/biomet/asn024
PMCID: PMC3816788
PMID: 24198435
Backfitting; Correction-for-attenuation; Error-prone; Local linear regression; Semen quality study; Semiparametric estimation; SIMEX; Undersmoothing
Summary
Suppose we are interested in the effect of a binary treatment on an outcome, where that relationship is confounded by an ordinal confounder. We assume that the true confounder is not observed; rather, we observe a nondifferentially mismeasured version of it. We show that, under certain monotonicity assumptions about its effects on the treatment and on the outcome, an effect measure controlling for the mismeasured confounder will fall between the corresponding crude and true effect measures. We present results for coarsened confounders and, under further assumptions, for multiple misclassified confounders.
doi:10.1093/biomet/ass054
PMCID: PMC3761876
PMID: 24014285
Bias; Confounding; Measurement error; Misclassification
Summary
Gaussian processes are widely used in nonparametric regression, classification and spatiotemporal modelling, facilitated in part by a rich literature on their theoretical properties. However, one of their practical limitations is expensive computation, typically on the order of n³ where n is the number of data points, in performing the necessary matrix inversions. For large datasets, storage and processing also lead to computational bottlenecks, and numerical stability of the estimates and predicted values degrades with increasing n. Various methods have been proposed to address these problems, including predictive processes in spatial data analysis and the subset-of-regressors technique in machine learning. The idea underlying these approaches is to use a subset of the data, but this raises questions concerning sensitivity to the choice of subset and limitations in estimating fine-scale structure in regions that are not well covered by the subset. Motivated by the literature on compressive sensing, we propose an alternative approach that involves linear projection of all the data points onto a lower-dimensional subspace. We demonstrate the superiority of this approach from a theoretical perspective and through simulated and real data examples.
doi:10.1093/biomet/ass068
PMCID: PMC3712798
PMID: 23869109
Bayesian regression; Compressive sensing; Dimensionality reduction; Gaussian process; Random projection
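One natural way to realize the projection idea is to compress the n x n kernel matrix K through a random m x n matrix Phi, yielding the low-rank surrogate K Phi' (Phi K Phi')^{-1} Phi K. The kernel, dimensions, jitter, and noise level below are illustrative assumptions, and the comparison is restricted to the posterior mean at the training points.

```python
import numpy as np

def rbf(a, b, ls=0.1):
    """Squared-exponential kernel on one-dimensional inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

rng = np.random.default_rng(7)
n, m, s2 = 300, 60, 0.01
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(4 * np.pi * x) + np.sqrt(s2) * rng.standard_normal(n)

K = rbf(x, x)
mean_full = K @ np.linalg.solve(K + s2 * np.eye(n), y)   # exact posterior mean, O(n^3)

Phi = rng.standard_normal((m, n)) / np.sqrt(m)           # random projection
C = Phi @ K                                              # m x n compressed kernel
W = C @ Phi.T                                            # m x m core matrix
Klr = C.T @ np.linalg.solve(W + 1e-8 * np.eye(m), C)     # low-rank surrogate for K
mean_proj = Klr @ np.linalg.solve(Klr + s2 * np.eye(n), y)

err = np.linalg.norm(mean_proj - mean_full) / np.linalg.norm(mean_full)
print(err)
```

In practice the surrogate would be combined with the Woodbury identity so that only m x m systems are solved; both means are formed directly here to keep the comparison transparent. Unlike subset-based schemes, every data point contributes to the projection, so fine-scale structure is not lost in poorly covered regions.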