Summary
Treatment switching is a frequent occurrence in clinical trials, where, during the course of the trial, patients who fail on the control treatment may change to the experimental treatment. Analysing the data without accounting for switching yields highly biased and inefficient estimates of the treatment effect. In this paper, we propose a novel class of semiparametric semicompeting risks transition survival models to accommodate treatment switches. Theoretical properties of the proposed model are examined and an efficient expectation-maximization algorithm is derived for obtaining the maximum likelihood estimates. Simulation studies are conducted to demonstrate the superiority of the model compared with the intent-to-treat analysis and other methods proposed in the literature. The proposed method is applied to data from a colorectal cancer clinical trial.
doi:10.1093/biomet/asr062
PMCID: PMC3412606
PMID: 23049136
Expectation-maximization algorithm; Maximum likelihood estimate; Noncompliance; Panitumumab; Partial switching; Transition model; Treatment switching
Summary
We give a definition of a bounded edge within the causal directed acyclic graph framework. A bounded edge generalizes the notion of a signed edge and is defined in terms of bounds on a ratio of survivor probabilities. We derive rules concerning the propagation of bounds. Bounds on causal effects in the presence of unmeasured confounding are also derived using bounds related to specific edges on a graph. We illustrate the theory developed by an example concerning estimating the effect of antihistamine treatment on asthma in the presence of unmeasured confounding.
doi:10.1093/biomet/asr059
PMCID: PMC3412607
PMID: 23049135
Bayesian network; Bound; Causal inference; Confounding; Directed acyclic graph
Summary
Importance sampling is a common technique for Monte Carlo approximation, including that of p-values. Here it is shown that a simple correction of the usual importance sampling p-values provides valid p-values, meaning that a hypothesis test created by rejecting the null hypothesis when the p-value is at most α will also have a Type I error rate of at most α. This correction uses the importance weight of the original observation, which gives valuable diagnostic information under the null hypothesis. Using the corrected p-values can be crucial for multiple testing and also in problems where evaluating the accuracy of importance sampling approximations is difficult. Inverting the corrected p-values provides a useful way to create Monte Carlo confidence intervals that maintain the nominal significance level and use only a single Monte Carlo sample.
doi:10.1093/biomet/asr079
PMCID: PMC3412608
PMID: 23049134
Exact inference; Monte Carlo simulation; Multiple testing; p-value; Rasch model
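The correction described in the summary above can be sketched as follows. This is a minimal illustration, not the paper's stated formula: the abstract does not give the expression, so the form used here (adding the observed point's importance weight to the weighted tail sum and dividing by n + 1) is an assumption, and the function and argument names are hypothetical.

```python
import numpy as np

def corrected_is_pvalue(t_obs, w_obs, t_sim, w_sim):
    """Importance sampling p-value that includes the observed point's weight.

    Assumed form (not given in the abstract):
        (w_obs + sum_i w_i * 1{t_i >= t_obs}) / (n + 1)
    where t_i are test statistics of draws from the proposal and
    w_i = null density / proposal density at each draw.
    """
    t_sim = np.asarray(t_sim, dtype=float)
    w_sim = np.asarray(w_sim, dtype=float)
    n = t_sim.size
    # Weighted count of simulated statistics at least as extreme as observed.
    tail = np.sum(w_sim * (t_sim >= t_obs))
    # Cap at 1 so the result is always a valid p-value.
    return min(1.0, float(w_obs + tail) / (n + 1))

# Illustration: null N(0, 1), proposal N(2, 1), statistic t(y) = y.
# The density ratio N(0,1)/N(2,1) at y is exp(2 - 2y).
rng = np.random.default_rng(0)
draws = rng.normal(loc=2.0, scale=1.0, size=1000)
weights = np.exp(2.0 - 2.0 * draws)
x_obs = 2.5
p_value = corrected_is_pvalue(x_obs, np.exp(2.0 - 2.0 * x_obs), draws, weights)
```

When every weight equals one, the expression reduces to the familiar Monte Carlo p-value (1 + number of exceedances) / (n + 1).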
Summary
Several optimality properties of Dorfman’s (1943) group testing procedure are derived for estimation of the prevalence of a rare disease whose status is classified with error. Exact ranges of disease prevalence are obtained for which group testing provides more efficient estimation when group size increases.
doi:10.1093/biomet/asr064
PMCID: PMC3412609
PMID: 23049137
Binary outcome; Maximum likelihood estimation; Pooling; Prevalence; Sensitivity; Specificity
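As a rough illustration of prevalence estimation from pooled tests with misclassification, the sketch below uses the standard relationship between the pool-positive probability and the prevalence. It assumes independent individuals, equal pool sizes, known sensitivity and specificity with their sum greater than one, and hypothetical function names; it is not taken from the paper itself.

```python
def pool_positive_prob(p, k, se, sp):
    """Probability that a pool of k independent samples tests positive."""
    none_infected = (1.0 - p) ** k
    # True positives from infected pools plus false positives from clean pools.
    return se * (1.0 - none_infected) + (1.0 - sp) * none_infected

def prevalence_mle(n_pos, n_pools, k, se, sp):
    """Estimate prevalence from the observed fraction of positive pools."""
    pi_hat = n_pos / n_pools
    # Keep the fraction inside the range the model can produce: [1 - sp, se].
    pi_hat = min(max(pi_hat, 1.0 - sp), se)
    # Invert pi = se - (se + sp - 1) * (1 - p)^k for p.
    none_infected = (se - pi_hat) / (se + sp - 1.0)
    return 1.0 - none_infected ** (1.0 / k)
```

For example, `prevalence_mle(30, 200, 5, 0.95, 0.98)` returns the estimate implied by 30 positive pools out of 200 pools of size 5.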
Summary
We suggest a method for estimating a covariance matrix on the basis of a sample of vectors drawn from a multivariate normal distribution. In particular, we penalize the likelihood with a lasso penalty on the entries of the covariance matrix. This penalty plays two important roles: it reduces the effective number of parameters, which is important even when the dimension of the vectors is smaller than the sample size since the number of parameters grows quadratically in the number of variables, and it produces an estimate which is sparse. In contrast to sparse inverse covariance estimation, our method’s close relative, the sparsity attained here is in the covariance matrix itself rather than in the inverse matrix. Zeros in the covariance matrix correspond to marginal independencies; thus, our method performs model selection while providing a positive definite estimate of the covariance. The proposed penalized maximum likelihood problem is not convex, so we use a majorize-minimize approach in which we iteratively solve convex approximations to the original nonconvex problem. We discuss tuning parameter selection and demonstrate on a flow-cytometry dataset how our method produces an interpretable graphical display of the relationship between variables. We perform simulations that suggest that simple elementwise thresholding of the empirical covariance matrix is competitive with our method for identifying the sparsity structure. Additionally, we show how our method can be used to solve a previously studied special case in which a desired sparsity pattern is prespecified.
doi:10.1093/biomet/asr054
PMCID: PMC3413177
PMID: 23049130
Concave-convex procedure; Covariance graph; Covariance matrix; Generalized gradient descent; Lasso; Majorization-minimization; Regularization; Sparsity
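The elementwise thresholding baseline mentioned in the summary above can be sketched in a few lines. This is only the simple comparator, not the penalized likelihood method; the choice of soft rather than hard thresholding, the decision to leave the diagonal untouched, and the function name are assumptions made for illustration.

```python
import numpy as np

def soft_threshold_covariance(X, lam):
    """Soft-threshold the off-diagonal entries of the sample covariance.

    X has one observation per row. Unlike the penalized likelihood
    estimator, the result is not guaranteed to be positive definite.
    """
    S = np.cov(X, rowvar=False)
    # Shrink every entry towards zero by lam, setting small entries to zero.
    T = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
    # Restore the variances, which are left unpenalized in this sketch.
    np.fill_diagonal(T, np.diag(S).copy())
    return T
```

Zeros in the returned matrix are read as estimated marginal independencies between the corresponding pairs of variables.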
Summary
The existing theory of the wild bootstrap has focused on linear estimators. In this note, we broaden its validity by providing a class of weight distributions that is asymptotically valid for quantile regression estimators. As most weight distributions in the literature lead to biased variance estimates for nonlinear estimators of linear regression, we propose a modification of the wild bootstrap that admits a broader class of weight distributions for quantile regression. A simulation study on median regression is carried out to compare various bootstrap methods. With a simple finite-sample correction, the wild bootstrap is shown to account for general forms of heteroscedasticity in a regression model with fixed design points.
doi:10.1093/biomet/asr052
PMCID: PMC3413178
PMID: 23049133
Bahadur representation; Heteroscedastic error; Quantile regression; Wild bootstrap
Summary
We use p-values to identify the threshold level at which a regression function leaves its baseline value, a problem motivated by applications in toxicological and pharmacological dose-response studies and environmental statistics. We study the problem in two sampling settings: one where multiple responses can be obtained at a number of different covariate levels, and the other the standard regression setting involving a limited number of response values at each covariate. Our procedure involves testing the hypothesis that the regression function is at its baseline at each covariate value and then computing the potentially approximate p-value of the test. An estimate of the threshold is obtained by fitting a piecewise constant function with a single jump discontinuity, known as a stump, to these observed p-values, as they behave in markedly different ways on the two sides of the threshold. The estimate is shown to be consistent and its finite sample properties are studied through simulations. Our approach is computationally simple and extends to estimation of the baseline value of the regression function, to heteroscedastic errors and to time series. It is illustrated on some real data applications.
doi:10.1093/biomet/asr051
PMCID: PMC3413179
PMID: 23049132
Baseline value; Changepoint; Consistent estimate; Misspecified model; Stump function
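The stump-fitting step described in the summary above can be illustrated with a plain least-squares search over split points. This sketch lets both levels of the stump be estimated freely, which may differ from the paper's exact criterion; the function name and the convention of reporting the midpoint between neighbouring covariate values are assumptions.

```python
import numpy as np

def fit_stump(x, p):
    """Least-squares fit of a one-jump step function to points (x, p).

    Returns (jump_location, left_level, right_level).
    Requires at least two points.
    """
    order = np.argsort(x)
    x = np.asarray(x, dtype=float)[order]
    p = np.asarray(p, dtype=float)[order]
    best_sse, best_j = np.inf, None
    for j in range(1, p.size):  # first j points go left, the rest go right
        left, right = p[:j], p[j:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_sse, best_j = sse, j
    # Place the jump halfway between the two neighbouring covariate values.
    jump = 0.5 * (x[best_j - 1] + x[best_j])
    return jump, p[:best_j].mean(), p[best_j:].mean()
```

Applied to p-values computed at each covariate level, the returned jump location serves as the threshold estimate.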
Summary
It is a challenge to evaluate experimental treatments where it is suspected that the treatment effect may only be strong for certain subpopulations, such as those having a high initial severity of disease, or those having a particular gene variant. Standard randomized controlled trials can have low power in such situations. They also are not optimized to distinguish which subpopulations benefit from a treatment. With the goal of overcoming these limitations, we consider randomized trial designs in which the criteria for patient enrollment may be changed, in a preplanned manner, based on interim analyses. Since such designs allow data-dependent changes to the population enrolled, care must be taken to ensure strong control of the familywise Type I error rate. Our main contribution is a general method for constructing randomized trial designs that allow changes to the population enrolled based on interim data using a prespecified decision rule, for which the asymptotic, familywise Type I error rate is strongly controlled at a specified level α. As a demonstration of our method, we prove new, sharp results for a simple, two-stage enrichment design. We then compare this design to fixed designs, focusing on each design’s ability to determine the overall and subpopulation-specific treatment effects.
doi:10.1093/biomet/asr055
PMCID: PMC3413180
PMID: 23049131
Adaptive design; Enrichment design; Group sequential design; Optimization; Patient-oriented research; Randomized trial; Subpopulation
Two-stage randomized trials are growing in importance in developing adaptive treatment strategies, i.e. treatment policies or dynamic treatment regimes. Usually, the first stage involves randomization to one of several initial treatments. The second stage of treatment begins when an early nonresponse criterion or response criterion is met. In the second stage, nonresponding subjects are re-randomized among second-stage treatments. Sample size calculations for planning these two-stage randomized trials with failure time outcomes are challenging because the variances of common test statistics depend in a complex manner on the joint distribution of time to the early nonresponse criterion or response criterion and the primary failure time outcome. We produce simple, albeit conservative, sample size formulae by using upper bounds on the variances. The resulting formulae only require the working assumptions needed to size a standard single-stage randomized trial and, in common settings, are only mildly conservative. These sample size formulae are based on either a weighted Kaplan–Meier estimator of survival probabilities at a fixed time-point or a weighted version of the log-rank test.
doi:10.1093/biomet/asr019
PMCID: PMC3254237
PMID: 22363091
Dynamic treatment regime; Sample size calculation; Sequential multiple assignment randomized trial; Weighted Kaplan–Meier estimator; Weighted log-rank test
We study model selection for clustered data, when the focus is on cluster-specific inference. Such data are often modelled using random effects, and conditional Akaike information was proposed in Vaida & Blanchard (2005) and used to derive an information criterion under linear mixed models. Here we extend the approach to generalized linear and proportional hazards mixed models. Outside the normal linear mixed models, exact calculations are not available and we resort to asymptotic approximations. In the presence of nuisance parameters, a profile conditional Akaike information is proposed. Bootstrap methods are considered for their potential advantage in finite samples. Simulations show that the performance of the bootstrap and the analytic criteria is comparable, with the bootstrap demonstrating some advantages for larger cluster sizes. The proposed criteria are applied to two cancer datasets to select models when cluster-specific inference is of interest.
doi:10.1093/biomet/asr023
PMCID: PMC3384357
PMID: 22822261
Akaike information; Conditional likelihood; Effective degrees of freedom
We describe an estimator of the parameter indexing a model for the conditional odds ratio between a binary exposure and a binary outcome given a high-dimensional vector of confounders, when the exposure and a subset of the confounders are missing, not necessarily simultaneously, in a subsample. We argue that a recently proposed estimator restricted to complete cases confers more protection against model misspecification than existing ones, in the sense that the set of data laws under which it is consistent strictly contains each set of data laws under which each of the previous estimators is consistent.
doi:10.1093/biomet/asr027
PMCID: PMC3384358
PMID: 22822262
Inverse probability weighted; Logistic regression; Missing at random; Model misspecification
Density regression models allow the conditional distribution of the response given predictors to change flexibly over the predictor space. Such models are much more flexible than nonparametric mean regression models with nonparametric residual distributions, and are well supported in many applications. A rich variety of Bayesian methods have been proposed for density regression, but it is not clear whether such priors have full support so that any true data-generating model can be accurately approximated. This article develops a new class of density regression models that incorporate stochastic-ordering constraints which are natural when a response tends to increase or decrease monotonely with a predictor. Theory is developed showing large support. Methods are developed for hypothesis testing, with posterior computation relying on a simple Gibbs sampler. Frequentist properties are illustrated in a simulation study, and an epidemiology application is considered.
doi:10.1093/biomet/asr025
PMCID: PMC3384359
PMID: 22822259
Conditional density estimation; Dependent Dirichlet process; Hypothesis test; Isotonic regression; Nonparametric Bayes; Quantile regression; Stochastic ordering
We propose a class of dependent processes in which density shape is regressed on one or more predictors through conditional tail-free probabilities by using transformed Gaussian processes. A particular linear version of the process is developed in detail. The resulting process is flexible and easy to fit using standard algorithms for generalized linear models. The method is applied to growth curve analysis, evolving univariate random effects distributions in generalized linear mixed models, and median survival modelling with censored data and covariate-dependent errors.
doi:10.1093/biomet/asq082
PMCID: PMC3398659
PMID: 22822260
Bayesian nonparametrics; Median regression; Partial exchangeability; Polya tree; Related probability distribution
New methods and theory have recently been developed to nonparametrically estimate cumulative incidence functions for competing risks survival data subject to current status censoring. In particular, the limiting distribution of the nonparametric maximum likelihood estimator and a simplified naive estimator have been established under certain smoothness conditions. In this paper, we establish the large-sample behaviour of these estimators in two additional models, namely when the observation time distribution has discrete support and when the observation times are grouped. These asymptotic results are applied to the construction of confidence intervals in the three different models. The methods are illustrated on two datasets regarding the cumulative incidence of different types of menopause from a cross-sectional sample of women in the United States and subtype-specific HIV infection from a sero-prevalence study in injecting drug users in Thailand.
doi:10.1093/biomet/asq083
PMCID: PMC3372275
PMID: 22822257
Competing risk; Confidence interval; Current status data; Interval censoring; Nonparametric maximum likelihood estimator; Survival analysis
In the analysis of bivariate correlated failure time data, it is important to measure the strength of association among the correlated failure times. One commonly used measure is the cross ratio. Motivated by Cox’s partial likelihood idea, we propose a novel parametric cross ratio estimator that is a flexible continuous function of both components of the bivariate survival times. We show that the proposed estimator is consistent and asymptotically normal. Its finite sample performance is examined using simulation studies, and it is applied to the Australian twin data.
doi:10.1093/biomet/asr005
PMCID: PMC3376771
PMID: 22822258
Correlated survival times; Empirical process theory; Local dependency measure; Pseudo-partial likelihood
Genome-wide association studies have successfully identified hundreds of novel genetic variants associated with many complex human diseases. However, there is a lack of rigorous work on evaluating the statistical power for identifying these variants. In this paper, we consider sparse signal identification in genome-wide association studies and present two analytical frameworks for detailed analysis of the statistical power for detecting and identifying the disease-associated variants. We present an explicit sample size formula for achieving a given false non-discovery rate while controlling the false discovery rate based on an optimal procedure. Sparse genetic variant recovery is also considered and a boundary condition is established in terms of sparsity and signal strength for almost exact recovery of both disease-associated variants and nondisease-associated variants. A data-adaptive procedure is proposed to achieve this bound. The analytical results are illustrated with a genome-wide association study of neuroblastoma.
doi:10.1093/biomet/asr003
PMCID: PMC3419390
PMID: 23049128
False discovery rate; False non-discovery rate; High-dimensional data; Multiple testing; Oracle exact recovery
We focus on sparse modelling of high-dimensional covariance matrices using Bayesian latent factor models. We propose a multiplicative gamma process shrinkage prior on the factor loadings which allows introduction of infinitely many factors, with the loadings increasingly shrunk towards zero as the column index increases. We use our prior on a parameter-expanded loading matrix to avoid the order dependence typical in factor analysis models and develop an efficient Gibbs sampler that scales well as data dimensionality increases. The gain in efficiency is achieved by the joint conjugacy property of the proposed prior, which allows block updating of the loadings matrix. We propose an adaptive Gibbs sampler for automatically truncating the infinite loading matrix through selection of the number of important factors. Theoretical results are provided on the support of the prior and truncation approximation bounds. A fast algorithm is proposed to produce approximate Bayes estimates. Latent factor regression methods are developed for prediction and variable selection in applications with high-dimensional correlated predictors. Operating characteristics are assessed through simulation studies, and the approach is applied to predict survival times from gene expression data.
doi:10.1093/biomet/asr013
PMCID: PMC3419391
PMID: 23049129
Adaptive Gibbs sampling; Factor analysis; High-dimensional data; Multiplicative gamma process; Parameter expansion; Regularization; Shrinkage
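To make the shrinkage behaviour of such a prior concrete, the following sketch draws a truncated loadings matrix from one common parametrization of a multiplicative gamma process. The abstract does not state the parametrization, so the hyperparameter names (`a1`, `a2`, `nu`), their default values and the gamma shape and rate choices are all assumptions made for illustration.

```python
import numpy as np

def sample_mgp_loadings(p, H, a1=2.0, a2=3.0, nu=3.0, seed=0):
    """Draw a p x H loadings matrix from an assumed multiplicative gamma prior.

    Assumed structure:
        delta_1 ~ Gamma(a1, 1), delta_l ~ Gamma(a2, 1) for l >= 2,
        tau_h = prod_{l <= h} delta_l,
        phi_jh ~ Gamma(nu / 2, rate = nu / 2),
        lambda_jh ~ Normal(0, 1 / (phi_jh * tau_h)).
    """
    rng = np.random.default_rng(seed)
    delta = np.empty(H)
    delta[0] = rng.gamma(shape=a1, scale=1.0)
    delta[1:] = rng.gamma(shape=a2, scale=1.0, size=H - 1)
    # Column-wise global precisions: a cumulative product of the deltas.
    tau = np.cumprod(delta)
    # Local precisions; NumPy uses shape/scale, so rate nu/2 is scale 2/nu.
    phi = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=(p, H))
    loadings = rng.normal(size=(p, H)) / np.sqrt(phi * tau)
    return loadings, tau
```

With `a2` greater than one, the precisions `tau` tend to grow with the column index, so later columns of the loadings matrix are shrunk more strongly towards zero.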
Summary
In standard regression analyses of clustered data, one typically assumes that the expected value of the response is independent of cluster size. However, this is often false. For example, in studies of surgical interventions, investigators have frequently found surgery volume and outcomes to be related to the skill level of the surgeons. This paper examines the effect of ignoring response-dependent, informative, cluster sizes on standard analytical methods such as mixed-effects models and conditional likelihood methods using analytic calculations, simulation studies and an example from a study of periodontal disease. We consider the case in which cluster sizes and responses share random effects which we assume to be independent of the covariates. Our focus is on maximum likelihood methods that ignore informative cluster sizes, and we show that they exhibit little bias in estimating covariate effects that are uncorrelated with the random effects associated with cluster sizes. However, estimation of covariate effects that are associated with the random effects can be biased. In particular, for models with random intercepts only, ignoring informative cluster sizes can yield biased estimators of the intercept but little bias in estimation of all covariate effects.
doi:10.1093/biomet/asq066
PMCID: PMC3412602
PMID: 23049125
Conditional likelihood; Generalized linear mixed model; Misspecified mixing distribution; Random slope
Summary
The objective of this paper is to quantify the effect of correlation in false discovery rate analysis. Specifically, we derive approximations for the mean, variance, distribution and quantiles of the standard false discovery rate estimator for arbitrarily correlated data. This is achieved using a negative binomial model for the number of false discoveries, where the parameters are found empirically from the data. We show that correlation may increase the bias and variance of the estimator substantially with respect to the independent case, and that in some cases, such as an exchangeable correlation structure, the estimator fails to be consistent as the number of tests becomes large.
doi:10.1093/biomet/asq075
PMCID: PMC3412603
PMID: 23049127
High-dimensional data; Microarray data; Multiple testing; Negative binomial
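For orientation, a commonly used plug-in false discovery rate estimator at a fixed p-value threshold can be written as below. Whether this is exactly the estimator analysed in the paper is an assumption, as are the function name and the optional null-proportion argument `pi0`.

```python
import numpy as np

def fdr_estimate(pvalues, t, pi0=1.0):
    """Plug-in FDR estimate at threshold t: pi0 * m * t / max(R(t), 1).

    m is the number of tests and R(t) the number of p-values at or below t.
    """
    pvalues = np.asarray(pvalues, dtype=float)
    m = pvalues.size
    rejections = int(np.sum(pvalues <= t))
    # Expected false discoveries under the null divided by observed discoveries.
    return min(1.0, pi0 * m * t / max(rejections, 1))
```

The numerator is the expected number of null p-values below `t`, which does not depend on the correlation between tests; the dependence enters through the variability of the denominator.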
Summary
Gaussian graphical models explore dependence relationships between random variables, through the estimation of the corresponding inverse covariance matrices. In this paper we develop an estimator for such models appropriate for data from several graphical models that share the same variables and some of the dependence structure. In this setting, estimating a single graphical model would mask the underlying heterogeneity, while estimating separate models for each category does not take advantage of the common structure. We propose a method that jointly estimates the graphical models corresponding to the different categories present in the data, aiming to preserve the common structure, while allowing for differences between the categories. This is achieved through a hierarchical penalty that targets the removal of common zeros in the inverse covariance matrices across categories. We establish the asymptotic consistency and sparsity of the proposed estimator in the high-dimensional case, and illustrate its performance on a number of simulated networks. An application to learning semantic connections between terms from webpages collected from computer science departments is included.
doi:10.1093/biomet/asq060
PMCID: PMC3412604
PMID: 23049124
Covariance matrix; Graphical model; Hierarchical penalty; High-dimensional data; Network
Summary
This paper considers survival data arising from length-biased sampling, where the survival times are left truncated by uniformly distributed random truncation times. We propose a nonparametric estimator that incorporates the information about the length-biased sampling scheme. The new estimator retains the simplicity of the truncation product-limit estimator with a closed-form expression, and has a small efficiency loss compared with the nonparametric maximum likelihood estimator, which requires an iterative algorithm. Moreover, the asymptotic variance of the proposed estimator has a closed form, and a variance estimator is easily obtained by plug-in methods. Numerical simulation studies with practical sample sizes are conducted to compare the performance of the proposed method with its competitors. A data analysis of the Canadian Study of Health and Aging is conducted to illustrate the methods and theory.
doi:10.1093/biomet/asq069
PMCID: PMC3412605
PMID: 23049126
Backward and forward recurrence time; Cross-sectional sampling; Partial likelihood; Random truncation; Renewal process
Summary
Standardized means, commonly used in observational studies in epidemiology to adjust for potential confounders, are equal to inverse probability weighted means with inverse weights equal to the empirical propensity scores. More refined standardization corresponds with empirical propensity scores computed under more flexible models. Unnecessary standardization induces efficiency loss. However, according to the theory of inverse probability weighted estimation, propensity scores estimated under more flexible models induce improvement in the precision of inverse probability weighted means. This apparent contradiction is clarified by explicitly stating the assumptions under which the improvement in precision is attained.
doi:10.1093/biomet/asq049
PMCID: PMC3371719
PMID: 22822256
Causal inference; Propensity score; Standardized mean
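The equality between standardized means and inverse probability weighted means stated in the summary above can be checked numerically. The sketch assumes a single discrete confounder, a binary exposure, NumPy arrays as inputs and at least one exposed subject in every stratum; the function names are hypothetical.

```python
import numpy as np

def standardized_mean(y, a, l):
    """Sum over strata of P(L = l) times the mean of Y among exposed in l."""
    total = 0.0
    for level in np.unique(l):
        in_stratum = (l == level)
        exposed = in_stratum & (a == 1)
        total += in_stratum.mean() * y[exposed].mean()
    return total

def ipw_mean(y, a, l):
    """Mean of A * Y / e_hat(L), with e_hat the empirical P(A = 1 | L)."""
    e_hat = np.empty(len(y), dtype=float)
    for level in np.unique(l):
        in_stratum = (l == level)
        # Empirical propensity score: exposed fraction within the stratum.
        e_hat[in_stratum] = a[in_stratum].mean()
    return np.mean(a * y / e_hat)

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
a = np.array([1, 0, 1, 1, 0, 1])
l = np.array([0, 0, 0, 1, 1, 1])
same = np.isclose(standardized_mean(y, a, l), ipw_mean(y, a, l))
```

Both functions weight each exposed subject by the stratum size divided by the number exposed in that stratum, which is why the two results agree.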
Summary
Statistical analysis on landmark-based shape spaces has diverse applications in morphometrics, medical diagnostics, machine vision and other areas. These shape spaces are non-Euclidean quotient manifolds. To conduct nonparametric inferences, one may define notions of centre and spread on this manifold and work with their estimates. However, it is useful to consider full likelihood-based methods, which allow nonparametric estimation of the probability density. This article proposes a broad class of mixture models constructed using suitable kernels on a general compact metric space and then on the planar shape space in particular. Following a Bayesian approach with a nonparametric prior on the mixing distribution, conditions are obtained under which the Kullback–Leibler property holds, implying large support and weak posterior consistency. Gibbs sampling methods are developed for posterior computation, and the methods are applied to problems in density estimation and classification with shape-based predictors. Simulation studies show improved estimation performance relative to existing approaches.
doi:10.1093/biomet/asq044
PMCID: PMC3371720
PMID: 22822255
Dirichlet process mixture; Discriminant analysis; Kullback–Leibler property; Metric space; Nonparametric Bayes; Planar shape space; Posterior consistency; Riemannian manifold
Summary
Since quantile regression curves are estimated individually, the quantile curves can cross, leading to an invalid distribution for the response. A simple constrained version of quantile regression is proposed to avoid the crossing problem for both linear and nonparametric quantile curves. A simulation study and a reanalysis of tropical cyclone intensity data show the usefulness of the procedure. Asymptotic properties of the estimator are equivalent to those of the typical approach under standard conditions, and the proposed estimator reduces to the classical one if there is no crossing. The constrained estimator shows significant improvement in performance by adding smoothing and stability across the quantile levels.
doi:10.1093/biomet/asq048
PMCID: PMC3371721
PMID: 22822254
Crossing quantile curve; Heteroscedastic error; Quantile regression; Robustness; Smoothing spline; Tropical cyclone
Summary
Directed acyclic graphs are commonly used to represent causal relationships among random variables in graphical models. Applications of these models arise in the study of physical and biological systems where directed edges between nodes represent the influence of components of the system on each other. Estimation of directed graphs from observational data is computationally NP-hard. In addition, directed graphs with the same structure may be indistinguishable based on observations alone. When the nodes exhibit a natural ordering, the problem of estimating directed graphs reduces to the problem of estimating the structure of the network. In this paper, we propose an efficient penalized likelihood method for estimation of the adjacency matrix of directed acyclic graphs, when variables inherit a natural ordering. We study variable selection consistency of lasso and adaptive lasso penalties in high-dimensional sparse settings, and propose an error-based choice for selecting the tuning parameter. We show that although the lasso is only variable selection consistent under stringent conditions, the adaptive lasso can consistently estimate the true graph under the usual regularity assumptions.
doi:10.1093/biomet/asq038
PMCID: PMC3254233
PMID: 22434937
Adaptive lasso; Directed acyclic graph; High-dimensional sparse graphs; Lasso; Penalized likelihood estimation; Small n large p asymptotics