The support vector machine (SVM) is a powerful binary classification tool with high accuracy and great flexibility. It has achieved great success, but its performance can be seriously impaired if many redundant covariates are included. Some efforts have been devoted to studying variable selection for SVMs, but asymptotic properties, such as variable selection consistency, are largely unknown when the number of predictors diverges to infinity. In this work, we establish a unified theory for a general class of nonconvex penalized SVMs. We first prove that in ultra-high dimensions, there exists one local minimizer to the objective function of nonconvex penalized SVMs possessing the desired oracle property. We further address the problem of nonunique local minimizers by showing that the local linear approximation algorithm is guaranteed to converge to the oracle estimator even in the ultra-high dimensional setting if an appropriate initial estimator is available. This condition on the initial estimator is verified to be automatically valid as long as the dimensions are moderately high. Numerical examples provide supportive evidence.
Local linear approximation; nonconvex penalty; oracle property; support vector machines; ultra-high dimensions; variable selection
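As a rough illustration of the local linear approximation idea mentioned above, the sketch below computes the weights that one LLA step would place on an L1 penalty, taking SCAD as one example of a nonconvex penalty. The function names, the initial estimate and the tuning values are hypothetical and chosen only for illustration; the weighted L1-penalized SVM solve itself is not shown.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """Derivative of the SCAD penalty evaluated at |t| (requires a > 2)."""
    t = np.abs(np.asarray(t, dtype=float))
    # Equal to lam on [0, lam], then decays linearly to 0 at a * lam.
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def lla_weights(beta_init, lam, a=3.7):
    """Per-coefficient L1 weights for one LLA step, given an initial estimate."""
    return scad_derivative(beta_init, lam, a)

# Hypothetical initial estimate: small coefficients keep the full penalty,
# large coefficients are left (almost) unpenalized in the next step.
beta_init = np.array([0.0, 0.05, 0.5, 3.0])
weights = lla_weights(beta_init, lam=0.2)
```

Under these assumptions, coefficients whose initial estimate is at least `a * lam` in magnitude receive zero weight, which is how a good initial estimator can steer the iteration toward the oracle fit.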
We propose a framework for general Bayesian inference. We argue that a valid update of a prior belief distribution to a posterior can be made for parameters which are connected to observations through a loss function rather than the traditional likelihood function, which is recovered as a special case. Modern application areas make it increasingly challenging for Bayesians to attempt to model the true data‐generating mechanism. For instance, when the object of interest is low dimensional, such as a mean or median, it is cumbersome to have to achieve this via a complete model for the whole data distribution. More importantly, there are settings where the parameter of interest does not directly index a family of density functions and thus the Bayesian approach to learning about such parameters is currently regarded as problematic. Our framework uses loss functions to connect information in the data to functionals of interest. The updating of beliefs then follows from a decision theoretic approach involving cumulative loss functions. Importantly, the procedure coincides with Bayesian updating when a true likelihood is known yet provides coherent subjective inference in much more general settings. Connections to other inference frameworks are highlighted.
Decision theory; General Bayesian updating; Generalized estimating equations; Gibbs posteriors; Information; Loss function; Maximum entropy; Provably approximately correct Bayes methods; Self‐information loss function
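The loss-based update described above can be sketched numerically on a grid: the posterior is taken proportional to the prior times the exponential of the negative cumulative loss. The example below targets a median through absolute loss; the data, the prior, the grid and the loss weight `w` are all assumptions made for illustration.

```python
import numpy as np

def general_bayes_posterior(grid, data, loss, log_prior, w=1.0):
    """Posterior density on a uniform grid: prior * exp(-w * cumulative loss)."""
    cum_loss = np.array([loss(theta, data).sum() for theta in grid])
    log_post = log_prior(grid) - w * cum_loss
    log_post -= log_post.max()              # stabilise before exponentiating
    dens = np.exp(log_post)
    step = grid[1] - grid[0]
    return dens / (dens.sum() * step)       # normalise to integrate to one

# Hypothetical data with one outlier; absolute loss targets the median.
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
grid = np.linspace(-10.0, 110.0, 1201)
abs_loss = lambda theta, x: np.abs(x - theta)
log_prior = lambda theta: -0.5 * (theta / 50.0) ** 2   # assumed N(0, 50^2) prior
post = general_bayes_posterior(grid, data, abs_loss, log_prior)
post_mode = grid[np.argmax(post)]
```

Replacing `abs_loss` with a negative log-likelihood and setting `w = 1` gives the usual Bayesian update on the same grid, which is the special case the abstract refers to.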
Fitting regression models for intensity functions of spatial point processes is of great interest in ecological and epidemiological studies of association between spatially referenced events and geographical or environmental covariates. When Cox or cluster process models are used to accommodate clustering not accounted for by the available covariates, likelihood based inference becomes computationally cumbersome due to the complicated nature of the likelihood function and the associated score function. It is therefore of interest to consider alternative more easily computable estimating functions. We derive the optimal estimating function in a class of first-order estimating functions. The optimal estimating function depends on the solution of a certain Fredholm integral equation which in practice is solved numerically. The derivation of the optimal estimating function has close similarities to the derivation of quasi-likelihood for standard data sets. The approximate solution is further equivalent to a quasi-likelihood score for binary spatial data. We therefore use the term quasi-likelihood for our optimal estimating function approach. We demonstrate in a simulation study and a data example that our quasi-likelihood method for spatial point processes is both statistically and computationally efficient.
Estimating function; Fredholm integral equation; Godambe information; Intensity function; Regression model; Spatial point process
In this manuscript we consider the problem of jointly estimating multiple graphical models in high dimensions. We assume that the data are collected from n subjects, each of which consists of T possibly dependent observations. The graphical models of subjects vary, but are assumed to change smoothly corresponding to a measure of closeness between subjects. We propose a kernel based method for jointly estimating all graphical models. Theoretically, under a double asymptotic framework, where both (T, n) and the dimension d can increase, we provide the explicit rate of convergence in parameter estimation. It characterizes the strength one can borrow across different individuals and the impact of data dependence on parameter estimation. Empirically, experiments on both synthetic and real resting state functional magnetic resonance imaging (rs-fMRI) data illustrate the effectiveness of the proposed method.
Graphical model; Conditional independence; High dimensional data; Time series; Rate of convergence
We consider causal inference in randomized survival studies with right censored outcomes and all-or-nothing compliance, using semiparametric transformation models to estimate the distribution of survival times in treatment and control groups, conditional on covariates and latent compliance type. Estimands depending on these distributions, for example, the complier average causal effect (CACE), the complier effect on survival beyond time t, and the complier quantile effect are then considered. Maximum likelihood is used to estimate the parameters of the transformation models, using a specially designed expectation-maximization (EM) algorithm to overcome the computational difficulties created by the mixture structure of the problem and the infinite dimensional parameter in the transformation models. The estimators are shown to be consistent, asymptotically normal, and semiparametrically efficient. Inferential procedures for the causal parameters are developed. A simulation study is conducted to evaluate the finite sample performance of the estimated causal parameters. We also apply our methodology to a randomized study conducted by the Health Insurance Plan of Greater New York to assess the reduction in breast cancer mortality due to screening.
All-or-nothing compliance; Complier average causal effect; Instrumental variable; Randomized trials; Survival analysis; Semiparametric transformation models
We study the regression relationship among covariates in case-control data, an area known as the secondary analysis of case-control studies. The context is such that only the form of the regression mean is specified, so that we allow an arbitrary regression error distribution, which can depend on the covariates and thus can be heteroscedastic. Under mild regularity conditions we establish the theoretical identifiability of such models. Previous work in this context has either (a) specified a fully parametric distribution for the regression errors, (b) specified a homoscedastic distribution for the regression errors, (c) specified the rate of disease in the population (we refer to this as the true population), or (d) made a rare disease approximation. We construct a class of semiparametric estimation procedures that rely on none of these. The estimators differ from the usual semiparametric ones in that they draw conclusions about the true population, while technically operating in a hypothetical superpopulation. We also construct estimators with a unique feature, in that they are robust against the misspecification of the regression error distribution in terms of variance structure, while all other nonparametric effects are estimated despite the biased samples. We establish the asymptotic properties of the estimators and illustrate their finite sample performance through simulation studies, as well as through an empirical example on the relation between red meat consumption and heterocyclic amines. Our analysis verified the positive relationship between red meat consumption and two forms of HCA, indicating that increased red meat consumption leads to increased levels of MeIQA and PhiP, both being risk factors for colorectal cancer. Computer software as well as data to illustrate the methodology are available at http://wileyonlinelibrary.com/journal/rss-datasets.
Biased samples; Case-control study; Heteroscedastic regression; Secondary analysis; Semiparametric estimation
In this work, we study quantile regression when the response is an event time subject to potentially dependent censoring. We consider the semi-competing risks setting, where time to censoring remains observable after the occurrence of the event of interest. While such a scenario frequently arises in biomedical studies, most current quantile regression methods for censored data are not applicable because they generally require the censoring time and the event time to be independent. By imposing rather mild assumptions on the association structure between the time-to-event response and the censoring time variable, we propose quantile regression procedures, which allow us to garner a comprehensive view of the covariate effects on the event time outcome as well as to examine the informativeness of censoring. An efficient and stable algorithm is provided for implementing the new method. We establish the asymptotic properties of the resulting estimators including uniform consistency and weak convergence. The theoretical development may serve as a useful template for addressing estimating settings that involve stochastic integrals. Extensive simulation studies suggest that the proposed method performs well with moderate sample sizes. We illustrate the practical utility of our proposals through an application to a bone marrow transplant trial.
Copula; Dependent censoring; Quantile regression; Semi-competing risks; Stochastic integral equation
This article develops a unified theoretical and computational framework for false discovery control in multiple testing of spatial signals. We consider both point-wise and cluster-wise spatial analyses, and derive oracle procedures which optimally control the false discovery rate, false discovery exceedance and false cluster rate, respectively. A data-driven finite approximation strategy is developed to mimic the oracle procedures on a continuous spatial domain. Our multiple testing procedures are asymptotically valid and can be effectively implemented using Bayesian computational algorithms for analysis of large spatial data sets. Numerical results show that the proposed procedures lead to more accurate error control and better power performance than conventional methods. We demonstrate our methods for analyzing the time trends in tropospheric ozone in the eastern US.
Compound decision theory; false cluster rate; false discovery exceedance; false discovery rate; large-scale multiple testing; spatial dependency
We consider heteroscedastic regression models where the mean function is a partially linear single index model and the variance function depends upon a generalized partially linear single index model. We do not insist that the variance function depend only upon the mean function, as happens in the classical generalized partially linear single index model. We develop efficient and practical estimation methods for the variance function and for the mean function. Asymptotic theory for the parametric and nonparametric parts of the model is developed. Simulations illustrate the results. An empirical example involving ozone levels is used to further illustrate the results, and is shown to be a case where the variance function does not depend upon the mean function.
Asymptotic theory; Estimating equation; Identifiability; Kernel regression; Modeling ozone levels; Partially linear single index model; Semiparametric efficiency; Single-index model; Variance function estimation
Prior specification for non-parametric Bayesian inference involves the difficult task of quantifying prior knowledge about a parameter of high, often infinite, dimension. A statistician is unlikely to have informed opinions about all aspects of such a parameter but will have real information about functionals of the parameter, such as the population mean or variance. The paper proposes a new framework for non-parametric Bayes inference in which the prior distribution for a possibly infinite dimensional parameter is decomposed into two parts: an informative prior on a finite set of functionals, and a non-parametric conditional prior for the parameter given the functionals. Such priors can be easily constructed from standard non-parametric prior distributions in common use and inherit the large support of the standard priors on which they are based. Additionally, posterior approximations under these informative priors can generally be made via minor adjustments to existing Markov chain approximation algorithms for standard non-parametric prior distributions. We illustrate the use of such priors in the context of multivariate density estimation using Dirichlet process mixture models, and in the modelling of high dimensional sparse contingency tables.
Contingency tables; Density estimation; Dirichlet process mixture model; Multivariate unordered categorical data; Non-informative prior; Prior elicitation; Sparse data
Many high dimensional classification techniques have been proposed in the literature based on sparse linear discriminant analysis (LDA). To efficiently use them, sparsity of linear classifiers is a prerequisite. However, this might not be readily available in many applications, and rotations of data are required to create the needed sparsity. In this paper, we propose a family of rotations to create the required sparsity. The basic idea is to use the principal components of the sample covariance matrix of the pooled samples and its variants to rotate the data first and to then apply an existing high dimensional classifier. This rotate-and-solve procedure can be combined with any existing classifiers, and is robust against the sparsity level of the true model. We show that these rotations do create the sparsity needed for high dimensional classifications and provide a theoretical understanding of why such a rotation works empirically. The effectiveness of the proposed method is demonstrated by a number of simulated and real data examples, and the improvements of our method over some popular high dimensional classification rules are clearly shown.
Classification; Equivariance; Principal Components; High Dimensional Data; Linear Discriminant Analysis; Rotate-and-Solve
We consider estimation of regression models for sparse asynchronous longitudinal observations, where time-dependent responses and covariates are observed intermittently within subjects. Unlike with synchronous data, where the response and covariates are observed at the same time point, with asynchronous data, the observation times are mismatched. Simple kernel-weighted estimating equations are proposed for generalized linear models with either time invariant or time-dependent coefficients under smoothness assumptions for the covariate processes which are similar to those for synchronous data. For models with either time invariant or time-dependent coefficients, the estimators are consistent and asymptotically normal but converge at slower rates than those achieved with synchronous data. Simulation studies provide evidence that the methods perform well with realistic sample sizes and may be superior to a naive application of methods for synchronous data based on an ad hoc last value carried forward approach. The practical utility of the methods is illustrated on data from a study on human immunodeficiency virus.
Asynchronous longitudinal data; Convergence rates; Generalized linear regression; Kernel-weighted estimation; Temporal smoothness
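One plausible reading of a kernel-weighted estimating equation for asynchronous data, sketched here for the identity link with time-invariant coefficients, pairs every response time with every covariate time within a subject and downweights pairs that are far apart. The Epanechnikov kernel, the bandwidth, the function names and the toy data below are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * np.maximum(1.0 - u ** 2, 0.0)

def async_kernel_ls(resp_times, resp_vals, cov_times, cov_vals, h):
    """Solve sum_i sum_{j,k} K_h(T_ij - S_ik) X_ik (Y_ij - X_ik' beta) = 0 for beta."""
    p = cov_vals[0].shape[1]
    A = np.zeros((p, p))
    b = np.zeros(p)
    for T, Y, S, X in zip(resp_times, resp_vals, cov_times, cov_vals):
        # W[j, k] weights the pair (response time j, covariate time k).
        W = epanechnikov((T[:, None] - S[None, :]) / h) / h
        A += X.T @ (W.sum(axis=0)[:, None] * X)
        b += X.T @ (W.T @ Y)
    return np.linalg.solve(A, b)

# Toy data: 10 subjects, mismatched response and covariate times,
# covariate constant within subject so the fit is exact.
x = np.arange(10) / 10.0
resp_times = [np.array([0.2, 0.6]) for _ in x]
cov_times = [np.array([0.25, 0.7]) for _ in x]
cov_vals = [np.column_stack([np.ones(2), np.full(2, xi)]) for xi in x]
resp_vals = [np.full(2, 1.0 + 2.0 * xi) for xi in x]
beta_hat = async_kernel_ls(resp_times, resp_vals, cov_times, cov_vals, h=0.5)
```

In this toy setting the covariate does not change over time, so the mismatch in observation times introduces no bias and the coefficients (1, 2) are recovered; with time-varying covariates the bandwidth `h` would govern the trade-off between bias and variance.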
In the absence of relevant prior experience, popular Bayesian estimation techniques usually begin with some form of “uninformative” prior distribution intended to have minimal inferential influence. Bayes rule will still produce nice-looking estimates and credible intervals, but these lack the logical force attached to experience-based priors and require further justification. This paper concerns the frequentist assessment of Bayes estimates. A simple formula is shown to give the frequentist standard deviation of a Bayesian point estimate. The same simulations required for the point estimate also produce the standard deviation. Exponential family models make the calculations particularly simple, and bring in a connection to the parametric bootstrap.
general accuracy formula; parametric bootstrap; abc intervals; hierarchical and empirical Bayes; MCMC
Functional additive models (FAMs) provide a flexible yet simple framework for regressions involving functional predictors. The utilization of data-driven basis in an additive rather than linear structure naturally extends the classical functional linear model. However, the critical issue of selecting nonlinear additive components has been less studied. In this work, we propose a new regularization framework for the structure estimation in the context of Reproducing Kernel Hilbert Spaces. The proposed approach takes advantage of the functional principal components, which greatly facilitate the implementation and the theoretical analysis. The selection and estimation are achieved by penalized least squares using a penalty which encourages the sparse structure of the additive components. Theoretical properties such as the rate of convergence are investigated. The empirical performance is demonstrated through simulation studies and a real data application.
Component selection; Additive models; Functional data analysis; Smoothing spline; Principal components; Reproducing kernel Hilbert space
Random effects or shared parameter models are commonly advocated for the analysis of combined repeated measurement and event history data, including dropout from longitudinal trials. Their use in practical applications has generally been limited by computational cost and complexity, meaning that only simple special cases can be fitted by using readily available software. We propose a new approach that exploits recent distributional results for the extended skew normal family to allow exact likelihood inference for a flexible class of random-effects models. The method uses a discretization of the timescale for the time-to-event outcome, which is often unavoidable in any case when events correspond to dropout. We place no restriction on the times at which repeated measurements are made. An analysis of repeated lung function measurements in a cystic fibrosis cohort is used to illustrate the method.
Cystic fibrosis; Dropout; Joint modelling; Repeated measurements; Skew normal distribution; Survival analysis
We consider the problem of estimating multiple related Gaussian graphical models from a high-dimensional data set with observations belonging to distinct classes. We propose the joint graphical lasso, which borrows strength across the classes in order to estimate multiple graphical models that share certain characteristics, such as the locations or weights of nonzero edges. Our approach is based upon maximizing a penalized log likelihood. We employ generalized fused lasso or group lasso penalties, and implement a fast ADMM algorithm to solve the corresponding convex optimization problems. The performance of the proposed method is illustrated through simulated and real data examples.
alternating directions method of multipliers; generalized fused lasso; group lasso; graphical lasso; network estimation; Gaussian graphical model; high-dimensional
Models of dynamic networks — networks that evolve over time — have manifold applications. We develop a discrete-time generative model for social network evolution that inherits the richness and flexibility of the class of exponential-family random graph models. The model — a Separable Temporal ERGM (STERGM) — facilitates separable modeling of the tie duration distributions and the structural dynamics of tie formation. We develop likelihood-based inference for the model, and provide computational algorithms for maximum likelihood estimation. We illustrate the interpretability of the model in analyzing a longitudinal network of friendship ties within a school.
Social networks; Longitudinal; Exponential random graph model; Markov chain Monte Carlo; Maximum likelihood estimation
Multi-phased designs and biased sampling designs are two of the well recognized approaches to enhance study efficiency. In this paper, we propose a new and cost-effective sampling design, the two-phase probability dependent sampling design (PDS), for studies with a continuous outcome. This design will enable investigators to make efficient use of resources by targeting more informative subjects for sampling. We develop a new semiparametric empirical likelihood inference method to take advantage of data obtained through a PDS design. Simulation study results indicate that the proposed sampling scheme, coupled with the proposed estimator, is more efficient and more powerful than the existing outcome dependent sampling design and the simple random sampling design with the same sample size. We illustrate the proposed method with a real data set from an environmental epidemiologic study.
Empirical likelihood; Missing data; Semiparametric; Probability sample
A sufficient cause interaction between two exposures signals the presence of individuals for whom the outcome would occur only under certain values of the two exposures. When the outcome is dichotomous and all exposures are categorical, then under certain no confounding assumptions, empirical conditions for sufficient cause interactions can be constructed based on the sign of linear contrasts of conditional outcome probabilities between differently exposed subgroups, given confounders. It is argued that logistic regression models are unsatisfactory for evaluating such contrasts, and that Bernoulli regression models with linear link are prone to misspecification. We therefore develop semiparametric tests for sufficient cause interactions under models which postulate probability contrasts in terms of a finite-dimensional parameter, but which are otherwise unspecified. Estimation is often not feasible in these models because it would require nonparametric estimation of auxiliary conditional expectations given high-dimensional variables. We therefore develop ‘multiply robust tests’ under a union model that assumes at least one of several working submodels holds. In the special case of a randomized experiment or a family-based genetic study in which the joint exposure distribution is known by design or Mendelian inheritance, the procedure leads to asymptotically distribution-free tests of the null hypothesis of no sufficient cause interaction.
Double robustness; Effect modification; Gene-environment interaction; Gene-gene interaction; Semiparametric inference; Sufficient cause; Synergism
We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui and his colleagues have proposed ‘SAFE’ rules, based on univariate inner products between each predictor and the outcome, which guarantee that a coefficient will be 0 in the solution vector. This provides a reduction in the number of variables that need to be entered into the optimization. We propose strong rules that are very simple and yet screen out far more predictors than the SAFE rules. This great practical improvement comes at a price: the strong rules are not foolproof and can mistakenly discard active predictors, i.e. predictors that have non-zero coefficients in the solution. We therefore combine them with simple checks of the Karush–Kuhn–Tucker conditions to ensure that the exact solution to the convex problem is delivered. Of course, any (approximate) screening method can be combined with the Karush–Kuhn–Tucker conditions to ensure the exact solution; the strength of the strong rules lies in the fact that, in practice, they discard a very large number of the inactive predictors and almost never commit mistakes. We also derive conditions under which they are foolproof. Strong rules provide substantial savings in computational time for a variety of statistical optimization problems.
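The screen-then-check strategy can be sketched as follows for the lasso objective (1/2)||y - Xb||^2 + lam * ||b||_1. The sketch uses the basic strong rule, which discards predictor j when |x_j'y| < 2 * lam - lam_max, and then restores any discarded predictor that violates the Karush–Kuhn–Tucker condition |x_j'r| <= lam. The plain coordinate descent solver, the function names and the simulated data are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Plain coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)      # assumes no all-zero columns
    r = y.astype(float).copy()         # residual y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]     # remove predictor j's contribution
            beta[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
            r -= X[:, j] * beta[j]
    return beta

def strong_rule_lasso(X, y, lam, tol=1e-8):
    """Basic strong rule screening followed by KKT checks on discarded predictors."""
    p = X.shape[1]
    c = np.abs(X.T @ y)
    keep = c >= 2.0 * lam - c.max()    # c.max() is lam_max
    while True:
        beta = np.zeros(p)
        if keep.any():
            beta[keep] = lasso_cd(X[:, keep], y, lam)
        resid = y - X @ beta
        # A discarded predictor must satisfy |x_j' r| <= lam at the solution.
        violations = (~keep) & (np.abs(X.T @ resid) > lam + tol)
        if not violations.any():
            return beta, keep
        keep |= violations             # add violators back and re-solve

# Simulated example (hypothetical data).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(50)
lam = 0.8 * np.abs(X.T @ y).max()
beta_screened, kept = strong_rule_lasso(X, y, lam)
beta_full = lasso_cd(X, y, lam)
```

Because the loop only terminates once no discarded predictor violates its KKT condition, the screened fit should agree with the unscreened one up to solver tolerance, while the inner solver only ever touches the retained columns.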
Formal rules governing signed edges on causal directed acyclic graphs are described in this paper and it is shown how these rules can be useful in reasoning about causality. Specifically, the notions of a monotonic effect, a weak monotonic effect and a signed edge are introduced. Results are developed relating these monotonic effects and signed edges to the sign of the causal effect of an intervention in the presence of intermediate variables. The incorporation of signed edges into the directed acyclic graph causal framework furthermore allows for the development of rules governing the relationship between monotonic effects and the sign of the covariance between two variables. It is shown that when certain assumptions about monotonic effects can be made then these results can be used to draw conclusions about the presence of causal effects even when data are missing on confounding variables.
Bias; Causal inference; Confounding; Directed acyclic graphs; Structural equations
Marginal log-linear (MLL) models provide a flexible approach to multivariate discrete data. MLL parametrizations under linear constraints induce a wide variety of models, including models defined by conditional independences. We introduce a subclass of MLL models which correspond to Acyclic Directed Mixed Graphs (ADMGs) under the usual global Markov property. We characterize for precisely which graphs the resulting parametrization is variation independent. The MLL approach provides the first description of ADMG models in terms of a minimal list of constraints. The parametrization is also easily adapted to sparse modelling techniques, which we illustrate using several examples of real data.
acyclic directed mixed graph; discrete graphical model; marginal log-linear parameter; parsimonious modelling; variation independence
Estimation of high-dimensional covariance matrices is known to be a difficult problem, has many applications, and is of current interest to the larger statistics community. In many applications including the so-called “large p small n” setting, the estimate of the covariance matrix is required to be not only invertible, but also well-conditioned. Although many regularization schemes attempt to do this, none of them address the ill-conditioning problem directly. In this paper, we propose a maximum likelihood approach, with the direct goal of obtaining a well-conditioned estimator. No sparsity assumption on either the covariance matrix or its inverse is imposed, thus making our procedure more widely applicable. We demonstrate that the proposed regularization scheme is computationally efficient, yields a type of Steinian shrinkage estimator, and has a natural Bayesian interpretation. We investigate the theoretical properties of the regularized covariance estimator comprehensively, including its regularization path, and proceed to develop an approach that adaptively determines the level of regularization that is required. Finally, we demonstrate the performance of the regularized estimator in decision-theoretic comparisons and in the financial portfolio optimization setting. The proposed approach has desirable properties, and can serve as a competitive procedure, especially when the sample size is small and when a well-conditioned estimator is required.
covariance estimation; regularization; convex optimization; condition number; eigenvalue; shrinkage; cross-validation; risk comparisons; portfolio optimization
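To make the notion of a well-conditioned estimator concrete, the sketch below clips the eigenvalues of a sample covariance matrix to an interval [tau, kappa_max * tau], which bounds the condition number by kappa_max. Both `tau` and `kappa_max` are treated as given inputs here; choosing them from the data, as a likelihood-based procedure would, is not attempted, and the simulated data are hypothetical.

```python
import numpy as np

def clip_condition_number(S, tau, kappa_max):
    """Clip the eigenvalues of a symmetric matrix S to [tau, kappa_max * tau]."""
    eigvals, eigvecs = np.linalg.eigh(S)
    clipped = np.clip(eigvals, tau, kappa_max * tau)
    # Rebuild the matrix with the original eigenvectors and clipped eigenvalues.
    return (eigvecs * clipped) @ eigvecs.T

# "Large p small n" example: n = 10 observations on p = 30 variables.
rng = np.random.default_rng(1)
data = rng.standard_normal((10, 30))
S = data.T @ data / data.shape[0]      # singular sample covariance (mean assumed zero)
Sigma_hat = clip_condition_number(S, tau=0.5, kappa_max=20.0)
```

The sample covariance `S` has rank at most 10 and is therefore not invertible, whereas `Sigma_hat` keeps the same eigenvectors and has all eigenvalues between 0.5 and 10.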
Modern technologies are producing a wealth of data with complex structures. For instance, in two-dimensional digital imaging, flow cytometry and electroencephalography, matrix-type covariates frequently arise when measurements are obtained for each combination of two underlying variables. To address scientific questions arising from those data, new regression methods that take matrices as covariates are needed, and sparsity or other forms of regularization are crucial owing to the ultrahigh dimensionality and complex structure of the matrix data. The popular lasso and related regularization methods hinge on the sparsity of the true signal in terms of the number of its non-zero coefficients. However, for the matrix data, the true signal is often of, or can be well approximated by, a low rank structure. As such, the sparsity is frequently in the form of low rank of the matrix parameters, which may seriously violate the assumption of the classical lasso. We propose a class of regularized matrix regression methods based on spectral regularization. A highly efficient and scalable estimation algorithm is developed, and a degrees-of-freedom formula is derived to facilitate model selection along the regularization path. Superior performance of the method proposed is demonstrated on both synthetic and real examples.
Electroencephalography; Multi-dimensional array; Nesterov method; Nuclear norm; Spectral regularization; Tensor regression
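Spectral regularization with the nuclear norm (listed in the keywords) acts on singular values in the way the lasso acts on coefficients. A minimal sketch of the corresponding proximal step, soft-thresholding the singular values of a matrix, is given below; the function name, the example matrix and the threshold are assumptions for illustration, and the full regression algorithm is not reproduced.

```python
import numpy as np

def singular_value_threshold(B, lam):
    """Proximal operator of lam * (nuclear norm): soft-threshold singular values."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)
    return (U * s_shrunk) @ Vt

# A rank-one signal plus a small full-rank perturbation (hypothetical example).
B = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, -1.0, 2.0]) + 0.01 * np.eye(3, 4)
B_low = singular_value_threshold(B, lam=0.1)
```

Here the small singular values introduced by the perturbation fall below the threshold and are set to zero, so `B_low` has rank one, while the leading singular value is reduced by `lam`.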
This paper develops nonparametric methods based on contact intervals for the analysis of infectious disease data. The contact interval from person i to person j is the time between the onset of infectiousness in i and infectious contact from i to j, where we define infectious contact as a contact sufficient to infect a susceptible individual. The hazard function of the contact interval distribution equals the hazard of infectious contact from i to j, so it provides a summary of the evolution of infectiousness over time. When who-infects-whom is observed, the Nelson-Aalen estimator produces an unbiased estimate of the cumulative hazard function of the contact interval distribution. When who-infects-whom is not observed, we use an EM algorithm to average the Nelson-Aalen estimates from all possible combinations of who-infected-whom consistent with the observed data. This converges to a nonparametric maximum likelihood estimate of the cumulative hazard function that we call the marginal Nelson-Aalen estimate. We study the behavior of these methods in simulations and use them to analyze household surveillance data from the 2009 influenza A(H1N1) pandemic.
Chain-binomial models; Contact intervals; Generation intervals; Infectious disease; Nonparametric methods; Survival analysis
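For reference, the Nelson-Aalen estimator for generic right-censored data sums, over the distinct event times, the number of events divided by the number still at risk. The sketch below illustrates that standard formula on a small made-up data set; it treats contact intervals as ordinary right-censored durations and does not implement the EM averaging used when who-infects-whom is unobserved.

```python
import numpy as np

def nelson_aalen(times, events):
    """Nelson-Aalen cumulative hazard: sum of d_i / n_i over distinct event times."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)    # True = event observed, False = censored
    event_times = np.unique(times[events])
    increments = np.array([
        np.sum((times == t) & events) / np.sum(times >= t)   # events / number at risk
        for t in event_times
    ])
    return event_times, np.cumsum(increments)

# Made-up durations: events at 1, 2 and 4; censoring at 2 and 3.
event_times, cum_hazard = nelson_aalen([1.0, 2.0, 2.0, 3.0, 4.0], [1, 1, 0, 0, 1])
```

With these inputs the increments are 1/5, 1/4 and 1/1 at times 1, 2 and 4, so the cumulative hazard takes the values 0.2, 0.45 and 1.45.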