Summary
In this article, we propose a regression method for simultaneous supervised clustering and feature selection over a given undirected graph, where homogeneous groups or clusters are estimated as well as informative predictors, with each predictor corresponding to one node in the graph and a connecting path indicating a priori possible grouping among the corresponding predictors. The method seeks a parsimonious model with high predictive power through identifying and collapsing homogeneous groups of regression coefficients. To address computational challenges, we present an efficient algorithm integrating the augmented Lagrange multipliers, coordinate descent and difference convex methods. We prove that the proposed method not only identifies the true homogeneous groups and informative features consistently but also leads to accurate parameter estimation. A gene network dataset is analysed to demonstrate that the method can make a difference by exploring dependency structures among the genes.
doi:10.1093/biomet/ass038
PMCID: PMC3629856
PMID: 23843673
Expression quantitative trait loci data; High-dimensional data; Nonconvex minimization; Prediction
Summary
Resampling-based methods for multiple hypothesis testing often lead to long run times when the number of tests is large. This paper presents a simple rule that substantially reduces computation by allowing resampling to terminate early on a subset of tests. We prove that the method has a low probability of obtaining a set of rejected hypotheses different from those rejected without early stopping, and obtain error bounds for multiple hypothesis testing. Simulation shows that our approach saves more computation than other available procedures.
doi:10.1093/biomet/ass051
PMCID: PMC3629857
PMID: 23843675
Bootstrap; Early stopping; False discovery rate control; Multiple hypothesis testing; Resampling
Summary
Linear classifiers are very popular, but can have limitations when classes have distinct subpopulations. General nonlinear kernel classifiers are very flexible, but do not give clear interpretations and may not be efficient in high dimensions. We propose the bidirectional discrimination classification method, which generalizes linear classifiers to two or more hyperplanes. This new family of classification methods gives much of the flexibility of a general nonlinear classifier while maintaining the interpretability, and much of the parsimony, of linear classifiers. They provide a new visualization tool for high-dimensional, low-sample-size data. Although the idea is generally applicable, we focus on the generalization of the support vector machine and distance-weighted discrimination methods. The performance and usefulness of the proposed method are assessed using asymptotics and demonstrated through analysis of simulated and real data. Our method leads to better classification performance in high-dimensional situations where subclusters are present in the data.
doi:10.1093/biomet/ass029
PMCID: PMC3629858
PMID: 23843672
Asymptotics; Classification; High-dimensional data; Initial value; Iteration; Optimization; Visualization
Summary
Several two-stage multiple testing procedures have been proposed to detect gene-environment interaction in genome-wide association studies. In this article, we elucidate general conditions that are required for validity and power of these procedures, and we propose extensions of two-stage procedures using the case-only estimator of gene-treatment interaction in randomized clinical trials. We develop a unified estimating equation approach to proving asymptotic independence between a filtering statistic and an interaction test statistic in a range of situations, including marginal association and interaction in a generalized linear model with a canonical link. We assess the performance of various two-stage procedures in simulations and in genetic studies from Women’s Health Initiative clinical trials.
doi:10.1093/biomet/ass044
PMCID: PMC3629859
PMID: 23843674
Case-only estimator; Filtering; Gene-treatment interaction; Multiple testing; Pharmacogenetics; Randomization
To study disease association with risk factors in epidemiologic studies, cross-sectional sampling is often more focused and less costly for recruiting study subjects who have already experienced initiating events. For time-to-event outcome, however, such a sampling strategy may be length biased. Coupled with censoring, analysis of length-biased data can be quite challenging, due to induced informative censoring in which the survival time and censoring time are correlated through a common backward recurrence time. We propose to use the proportional mean residual life model of Oakes & Dasu (Biometrika 77, 409–10, 1990) for analysis of censored length-biased survival data. Several nonstandard data structures, including censoring of onset time and cross-sectional data without follow-up, can also be handled by the proposed methodology.
doi:10.1093/biomet/ass049
PMCID: PMC3635658
PMID: 23843676
Biased sampling; Bivariate survival data; Proportional hazards model; Renewal process
Summary
Incidence is an important epidemiological concept most suitably studied using an incident cohort study. However, data are often collected from the more feasible prevalent cohort study, whereby diseased individuals are recruited through a cross-sectional survey and followed in time. In the absence of temporal trends in survival, we derive an efficient nonparametric estimator of the cumulative incidence based on such data and study its asymptotic properties. Arbitrary calendar time variations in disease incidence are allowed. Age-specific incidence and adjustments for both stratified sampling and temporal variations in survival are also discussed. Simulation results are presented and data from the Canadian Study of Health and Aging are analysed to infer the incidence of dementia in the Canadian elderly population.
doi:10.1093/biomet/ass017
PMCID: PMC3635701
PMID: 23843670
Age-specific incidence; Cross-sectional sampling; Left-truncation; Point process; Stratification
Summary
We propose a graphical measure, the generalized negative predictive function, to quantify the predictive accuracy of covariates for survival time or recurrent event times. This new measure characterizes the event-free probabilities over time conditional on a thresholded linear combination of covariates and has direct clinical utility. We show that this function is maximized at the set of covariates truly related to event times and thus can be used to compare the predictive accuracy of different sets of covariates. We construct nonparametric estimators for this function under right censoring and prove that the proposed estimators, upon proper normalization, converge weakly to zero-mean Gaussian processes. To bypass the estimation of complex density functions involved in the asymptotic variances, we adopt the bootstrap approach and establish its validity. Simulation studies demonstrate that the proposed methods perform well in practical situations. Two clinical studies are presented.
doi:10.1093/biomet/ass018
PMCID: PMC3635702
PMID: 23843671
Censoring; Negative predictive value; Positive predictive value; Prognostic accuracy; Receiver operating characteristic curve; Recurrent event; Survival data; Transformation model
Summary
A general framework for a novel non-geodesic decomposition of high-dimensional spheres or high-dimensional shape spaces for planar landmarks is discussed. The decomposition, principal nested spheres, leads to a sequence of submanifolds with decreasing intrinsic dimensions, which can be interpreted as an analogue of principal component analysis. In a number of real datasets, an apparent one-dimensional mode of variation curving through more than one geodesic component is captured in the one-dimensional component of principal nested spheres. While analysis of principal nested spheres provides an intuitive and flexible decomposition of the high-dimensional sphere, an interesting special case of the analysis results in finding principal geodesics, similar to those from previous approaches to manifold principal component analysis. An adaptation of our method to Kendall’s shape space is discussed, and a computational algorithm for fitting principal nested spheres is proposed. The result provides a coordinate system to visualize the data structure and an intuitive summary of principal modes of variation, as exemplified by several datasets.
doi:10.1093/biomet/ass022
PMCID: PMC3635703
PMID: 23843669
Dimension reduction; Kendall’s shape space; Manifold; Principal arc; Principal component analysis; Spherical data
Summary
Penalization methods have been shown to yield both consistent variable selection and oracle parameter estimation under correct model specification. In this article, we study such methods under model misspecification, where the assumed form of the regression function is incorrect, including generalized linear models for uncensored outcomes and the proportional hazards model for censored responses. Estimation with the adaptive least absolute shrinkage and selection operator, lasso, penalty is proven to achieve sparse estimation of regression coefficients under misspecification. The resulting estimators are selection consistent, asymptotically normal and oracle, where the selection is based on the limiting values of the parameter estimators obtained using the misspecified model without penalization. We further derive conditions under which the penalized estimators from the misspecified model may yield selection consistency under the true model. The robustness is explored numerically via simulation and an application to the Wisconsin Epidemiological Study of Diabetic Retinopathy.
doi:10.1093/biomet/ass027
PMCID: PMC4188068
PMID: 25294946
Least false parameter; Model misspecification; Oracle property; Penalization; Selection consistency; Shrinkage estimation; Variable selection
We propose a new residual for regression models of ordinal outcomes, defined as E{sign(y,Y)}, where y is the observed outcome and Y is a random variable from the fitted distribution. This new residual is a single value per subject irrespective of the number of categories of the ordinal outcome, contains directional information between the observed value and the fitted distribution, and does not require the assignment of arbitrary numbers to categories. We study its properties, describe its connections with other residuals, ranks and ridits, and demonstrate its use in model diagnostics.
doi:10.1093/biomet/asr073
PMCID: PMC3635659
PMID: 23843667
Model diagnostics; Ordinal outcome; Ordinal regression; Residual
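The residual E{sign(y,Y)} described in the preceding summary can be sketched for a single subject as follows. The summary does not spell out sign(y,Y); the sketch assumes it equals -1, 0 or 1 according to whether y is below, equal to or above Y, so that the residual reduces to P(Y < y) - P(Y > y). The function name and the 0-based category index are illustrative choices, not taken from the article.

```python
import numpy as np

def sign_residual(y_index, fitted_probs):
    """Residual E{sign(y, Y)} for one subject (illustrative sketch).

    y_index      : 0-based position of the observed category (assumed coding)
    fitted_probs : fitted probabilities of the ordered categories, summing to 1

    Assumes sign(y, Y) is -1, 0, 1 for y < Y, y = Y, y > Y, so the
    expectation equals P(Y < y) - P(Y > y).
    """
    p = np.asarray(fitted_probs, dtype=float)
    prob_below = p[:y_index].sum()       # P(Y < y)
    prob_above = p[y_index + 1:].sum()   # P(Y > y)
    return prob_below - prob_above

# Three ordered categories with fitted probabilities 0.2, 0.3, 0.5.
probs = [0.2, 0.3, 0.5]
residuals = [sign_residual(j, probs) for j in range(3)]
```

Under this convention the residual lies between -1 and 1 and needs only the ordering of the categories, not numeric scores assigned to them.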
To evaluate the biological efficacy of a treatment in a randomized clinical trial, one needs to compare patients in the treatment arm who actually received treatment with the subgroup of patients in the control arm who would have received treatment had they been randomized into the treatment arm. In practice, subgroup membership in the control arm is usually unobservable. This paper develops a nonparametric inference procedure to compare subgroup probabilities with right-censored time-to-event data and unobservable subgroup membership in the control arm. We also present a procedure to estimate the onset and duration of treatment effect. The performance of our method is evaluated by simulation. An illustration is given using a randomized clinical trial for melanoma.
doi:10.1093/biomet/ass004
PMCID: PMC3635705
PMID: 23843664
Biological efficacy; Censoring; Counting process; Martingale; Noncompliance; Survival probability
In this paper, we consider estimation of survivor functions from groups of observations with right-censored data when the groups are subject to a stochastic ordering constraint. Many methods and algorithms have been proposed to estimate distribution functions under such restrictions, but none have completely satisfactory properties when the observations are censored. We propose a pointwise constrained nonparametric maximum likelihood estimator, which is defined at each time t by the estimates of the survivor functions subject to constraints applied at time t only. We also propose an efficient method to obtain the estimator. The estimator of each constrained survivor function is shown to be nonincreasing in t, and its consistency and asymptotic distribution are established. A simulation study suggests better small and large sample properties than for alternative estimators. An example using prostate cancer data illustrates the method.
doi:10.1093/biomet/ass006
PMCID: PMC3635706
PMID: 23843661
Censored data; Constrained nonparametric maximum likelihood estimator; Kaplan–Meier estimator; Maximum likelihood estimator; Order restriction
We study estimation in quantile regression when covariates are measured with errors. Existing methods require stringent assumptions, such as spherically symmetric joint distribution of the regression and measurement error variables, or linearity of all quantile functions, which restrict model flexibility and complicate computation. In this paper, we develop a new estimation approach based on corrected scores to account for a class of covariate measurement errors in quantile regression. The proposed method is simple to implement. Its validity requires only linearity of the particular quantile function of interest, and it requires no parametric assumptions on the regression error distributions. Finite-sample results demonstrate that the proposed estimators are more efficient than the existing methods in various models considered.
doi:10.1093/biomet/ass005
PMCID: PMC3635707
PMID: 23843665
Corrected loss function; Laplace distribution; Measurement error; Normal distribution; Quantile regression; Smoothing
We present asymptotic and finite-sample results on the use of stochastic blockmodels for the analysis of network data. We show that the fraction of misclassified network nodes converges in probability to zero under maximum likelihood fitting when the number of classes is allowed to grow as the root of the network size and the average network degree grows at least poly-logarithmically in this size. We also establish finite-sample confidence bounds on maximum-likelihood blockmodel parameter estimates from data comprising independent Bernoulli random variates; these results hold uniformly over class assignment. We provide simulations verifying the conditions sufficient for our results, and conclude by fitting a logit parameterization of a stochastic blockmodel with covariates to a network data example comprising self-reported school friendships, resulting in block estimates that reveal residual structure.
doi:10.1093/biomet/asr053
PMCID: PMC3635708
PMID: 23843660
Likelihood-based inference; Social network analysis; Sparse random graph; Stochastic blockmodel
Recently proposed double-robust estimators for a population mean from incomplete data and for a finite number of counterfactual means can have much higher efficiency than the usual double-robust estimators under misspecification of the outcome model. In this paper, we derive a new class of double-robust estimators for the parameters of regression models with incomplete cross-sectional or longitudinal data, and of marginal structural mean models for cross-sectional data with similar efficiency properties. Unlike the recent proposals, our estimators solve outcome regression estimating equations. In a simulation study, the new estimator shows improvements in variance relative to the standard double-robust estimator that are in agreement with those suggested by asymptotic theory.
doi:10.1093/biomet/ass013
PMCID: PMC3635709
PMID: 23843666
Drop-out; Marginal structural model; Missing at random
The full likelihood approach in statistical analysis is regarded as the most efficient means for estimation and inference. For complex length-biased failure time data, computational algorithms and theoretical properties are not readily available, especially when a likelihood function involves infinite-dimensional parameters. Relying on the invariance property of length-biased failure time data under the semiparametric density ratio model, we present two likelihood approaches for the estimation and assessment of the difference between two survival distributions. The most efficient maximum likelihood estimators are obtained by the EM algorithm and profile likelihood. We also provide a simple numerical method for estimation and inference based on conditional likelihood, which can be generalized to k-arm settings. Unlike conventional survival data, the mean of the population failure times can be consistently estimated given right-censored length-biased data under mild regularity conditions. To check the semiparametric density ratio model assumption, we use a test statistic based on the area between two survival distributions. Simulation studies confirm that the full likelihood estimators are more efficient than the conditional likelihood estimators. We analyse an epidemiological study to illustrate the proposed methods.
doi:10.1093/biomet/ass008
PMCID: PMC3635710
PMID: 23843663
Conditional likelihood; Density ratio model; EM algorithm; Length-biased sampling; Maximum likelihood approach
Results are given concerning inferences that can be drawn about interaction when binary exposures are subject to certain forms of independent nondifferential misclassification. Tests for interaction, using the misclassified exposures, are valid provided the probability of misclassification satisfies certain bounds. Results are given for additive statistical interactions, for causal interactions corresponding to synergism in the sufficient cause framework and for so-called compositional epistasis. Both two-way and three-way interactions are considered. The results require only that the probability of misclassification be no larger than 1/2 or 1/4, depending on the test. For additive statistical interaction, a method to correct estimates and confidence intervals for misclassification is described. The consequences for power of interaction tests under exposure misclassification are explored through simulations.
doi:10.1093/biomet/ass012
PMCID: PMC3635711
PMID: 23843668
Causal inference; Epistasis; Interaction; Misclassification; Sufficient cause; Synergism
In biomedical studies, ordered bivariate survival data are frequently encountered when bivariate failure events are used as outcomes to identify the progression of a disease. In cancer studies, interest could be focused on bivariate failure times, for example, time from birth to cancer onset and time from cancer onset to death. This paper considers a sampling scheme, termed interval sampling, in which the first failure event is identified within a calendar time interval, the time of the initiating event can be retrospectively confirmed and the occurrence of the second failure event is observed subject to right censoring. In a cancer data application, the initiating, first and second events could correspond to birth, cancer onset and death. The fact that the data are collected conditional on the first failure event occurring within a time interval induces bias. Interval sampling is widely used for collection of disease registry data by governments and medical institutions, though the interval sampling bias is frequently overlooked by researchers. This paper develops statistical methods for analysing such data. Semiparametric methods are proposed under semi-stationarity and stationarity. Numerical studies demonstrate that the proposed estimation approaches perform well with moderate sample sizes. We apply the proposed methods to ovarian cancer registry data.
doi:10.1093/biomet/ass009
PMCID: PMC3635712
PMID: 23843662
Bivariate survival distribution; Copula; Interval sampling; Semiparametric model; Semi-stationarity; Stationarity
Summary
Treatment switching is a frequent occurrence in clinical trials, where, during the course of the trial, patients who fail on the control treatment may change to the experimental treatment. Analysing the data without accounting for switching yields highly biased and inefficient estimates of the treatment effect. In this paper, we propose a novel class of semiparametric semicompeting risks transition survival models to accommodate treatment switches. Theoretical properties of the proposed model are examined and an efficient expectation-maximization algorithm is derived for obtaining the maximum likelihood estimates. Simulation studies are conducted to demonstrate the superiority of the model compared with the intent-to-treat analysis and other methods proposed in the literature. The proposed method is applied to data from a colorectal cancer clinical trial.
doi:10.1093/biomet/asr062
PMCID: PMC3412606
PMID: 23049136
Expectation-maximization algorithm; Maximum likelihood estimate; Noncompliance; Panitumumab; Partial switching; Transition model; Treatment switching
Summary
We give a definition of a bounded edge within the causal directed acyclic graph framework. A bounded edge generalizes the notion of a signed edge and is defined in terms of bounds on a ratio of survivor probabilities. We derive rules concerning the propagation of bounds. Bounds on causal effects in the presence of unmeasured confounding are also derived using bounds related to specific edges on a graph. We illustrate the theory developed by an example concerning estimating the effect of antihistamine treatment on asthma in the presence of unmeasured confounding.
doi:10.1093/biomet/asr059
PMCID: PMC3412607
PMID: 23049135
Bayesian network; Bound; Causal inference; Confounding; Directed acyclic graph
Summary
Importance sampling is a common technique for Monte Carlo approximation, including that of p-values. Here it is shown that a simple correction of the usual importance sampling p-values provides valid p-values, meaning that a hypothesis test created by rejecting the null hypothesis when the p-value is at most α will also have a Type I error rate of at most α. This correction uses the importance weight of the original observation, which gives valuable diagnostic information under the null hypothesis. Using the corrected p-values can be crucial for multiple testing and also in problems where evaluating the accuracy of importance sampling approximations is difficult. Inverting the corrected p-values provides a useful way to create Monte Carlo confidence intervals that maintain the nominal significance level and use only a single Monte Carlo sample.
doi:10.1093/biomet/asr079
PMCID: PMC3412608
PMID: 23049134
Exact inference; Monte Carlo simulation; Multiple testing; p-value; Rasch model
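The importance sampling summary above states that the correction uses the importance weight of the original observation but does not give the formula. The sketch below shows one correction of that kind, assumed here for illustration: the observed point is added to the Monte Carlo sample with its own weight, and the denominator becomes n + 1. The function and argument names are hypothetical.

```python
import numpy as np

def importance_sampling_p_values(t_obs, w_obs, t_sim, w_sim):
    """Return (usual, corrected) importance sampling p-values.

    t_obs : test statistic of the observed data
    w_obs : importance weight of the observed data
    t_sim : test statistics of n draws from the proposal distribution
    w_sim : importance weights (null density / proposal density) of the draws

    The corrected form is an assumed illustration, not quoted from the article.
    """
    t_sim = np.asarray(t_sim, dtype=float)
    w_sim = np.asarray(w_sim, dtype=float)
    n = t_sim.size
    # Weighted count of simulated statistics at least as extreme as observed.
    tail = np.sum(w_sim * (t_sim >= t_obs))
    usual = tail / n
    # Include the observed point with its own weight; cap at 1.
    corrected = min((w_obs + tail) / (n + 1), 1.0)
    return usual, corrected

usual, corrected = importance_sampling_p_values(
    t_obs=2.0, w_obs=0.5,
    t_sim=[1.0, 2.0, 3.0, 0.0], w_sim=[1.0, 0.5, 2.0, 1.0],
)
```

In this sketch a very large observed weight pushes the corrected p-value towards 1, which is one way the weight of the original observation can act as a diagnostic.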
Summary
Several optimality properties of Dorfman’s (1943) group testing procedure are derived for estimation of the prevalence of a rare disease whose status is classified with error. Exact ranges of disease prevalence are obtained for which group testing provides more efficient estimation when group size increases.
doi:10.1093/biomet/asr064
PMCID: PMC3412609
PMID: 23049137
Binary outcome; Maximum likelihood estimation; Pooling; Prevalence; Sensitivity; Specificity
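To make the setting of the group testing summary concrete, the following sketch estimates prevalence from pooled test results alone when the test has known sensitivity and specificity. It assumes groups of equal size k, error rates that do not depend on group size, and sensitivity + specificity > 1; these assumptions and all names are illustrative and are not taken from the article.

```python
def pool_positive_prob(p, k, sensitivity, specificity):
    """Probability that a pool of k individuals tests positive."""
    all_negative = (1.0 - p) ** k  # no diseased individual in the pool
    return sensitivity * (1.0 - all_negative) + (1.0 - specificity) * all_negative

def prevalence_estimate(n_positive, n_pools, k, sensitivity, specificity):
    """Maximum likelihood estimate of prevalence from pooled results.

    Inverts pool_positive_prob at the observed positive fraction, after
    clipping that fraction to the attainable range [1 - specificity, sensitivity].
    """
    frac = n_positive / n_pools
    frac = min(max(frac, 1.0 - specificity), sensitivity)
    all_negative = (sensitivity - frac) / (sensitivity + specificity - 1.0)
    return 1.0 - all_negative ** (1.0 / k)

# Round trip: plug the exact pool-positive probability back in.
pi = pool_positive_prob(0.05, 5, 0.95, 0.98)
p_hat = prevalence_estimate(pi * 1000, 1000, 5, 0.95, 0.98)
```

The efficiency comparison across group sizes studied in the article would then amount to comparing the variance of such estimators for different k, which this sketch does not attempt.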
This paper considers semiparametric estimation of the Cox proportional hazards model for right-censored and length-biased data arising from prevalent sampling. To exploit the special structure of length-biased sampling, we propose a maximum pseudo-profile likelihood estimator, which can handle time-dependent covariates and is consistent under covariate-dependent censoring. Simulation studies show that the proposed estimator is more efficient than its competitors. A data analysis illustrates the methods and theory.
doi:10.1093/biomet/asr072
PMCID: PMC3667656
PMID: 23843659
Approximate likelihood; Cross-sectional sampling; Product-limit estimator; Random truncation; Screening trials
Summary
We suggest a method for estimating a covariance matrix on the basis of a sample of vectors drawn from a multivariate normal distribution. In particular, we penalize the likelihood with a lasso penalty on the entries of the covariance matrix. This penalty plays two important roles: it reduces the effective number of parameters, which is important even when the dimension of the vectors is smaller than the sample size since the number of parameters grows quadratically in the number of variables, and it produces an estimate which is sparse. In contrast to sparse inverse covariance estimation, our method’s close relative, the sparsity attained here is in the covariance matrix itself rather than in the inverse matrix. Zeros in the covariance matrix correspond to marginal independencies; thus, our method performs model selection while providing a positive definite estimate of the covariance. The proposed penalized maximum likelihood problem is not convex, so we use a majorize-minimize approach in which we iteratively solve convex approximations to the original nonconvex problem. We discuss tuning parameter selection and demonstrate on a flow-cytometry dataset how our method produces an interpretable graphical display of the relationship between variables. We perform simulations that suggest that simple elementwise thresholding of the empirical covariance matrix is competitive with our method for identifying the sparsity structure. Additionally, we show how our method can be used to solve a previously studied special case in which a desired sparsity pattern is prespecified.
doi:10.1093/biomet/asr054
PMCID: PMC3413177
PMID: 23049130
Concave-convex procedure; Covariance graph; Covariance matrix; Generalized gradient descent; Lasso; Majorization-minimization; Regularization; Sparsity
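The covariance summary above mentions simple elementwise thresholding of the empirical covariance matrix as a comparator for identifying sparsity structure. A minimal sketch of that comparator, not of the proposed penalized likelihood method, is given below; hard thresholding with a user-chosen cutoff `tau` is an assumption, as are the function and variable names.

```python
import numpy as np

def threshold_covariance(X, tau):
    """Hard-threshold the off-diagonal entries of the empirical covariance.

    X   : (n_samples, n_variables) data matrix
    tau : cutoff; off-diagonal entries with absolute value below tau become 0
    """
    S = np.cov(X, rowvar=False)               # empirical covariance matrix
    T = np.where(np.abs(S) >= tau, S, 0.0)    # zero out small entries
    np.fill_diagonal(T, np.diag(S))           # variances are never thresholded
    return T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += X[:, 0]                            # make variables 0 and 1 dependent
T = threshold_covariance(X, tau=0.3)
```

Unlike the penalized likelihood estimate described in the summary, a thresholded matrix of this kind is not guaranteed to be positive definite.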
Summary
The existing theory of the wild bootstrap has focused on linear estimators. In this note, we broaden its validity by providing a class of weight distributions that is asymptotically valid for quantile regression estimators. As most weight distributions in the literature lead to biased variance estimates for nonlinear estimators of linear regression, we propose a modification of the wild bootstrap that admits a broader class of weight distributions for quantile regression. A simulation study on median regression is carried out to compare various bootstrap methods. With a simple finite-sample correction, the wild bootstrap is shown to account for general forms of heteroscedasticity in a regression model with fixed design points.
doi:10.1093/biomet/asr052
PMCID: PMC3413178
PMID: 23049133
Bahadur representation; Heteroscedastic error; Quantile regression; Wild bootstrap