Local polynomial regression has received extensive attention for the nonparametric estimation of regression functions when both the response and the covariate are in Euclidean space. However, little has been done when the response is in a Riemannian manifold. We develop an intrinsic local polynomial regression estimate for the analysis of symmetric positive definite (SPD) matrices as responses that lie in a Riemannian manifold with covariate in Euclidean space. The primary motivation and application of the proposed methodology is in computer vision and medical imaging. We examine two commonly used metrics, including the trace metric and the Log-Euclidean metric on the space of SPD matrices. For each metric, we develop a cross-validation bandwidth selection method, derive the asymptotic bias, variance, and normality of the intrinsic local constant and local linear estimators, and compare their asymptotic mean square errors. Simulation studies are further used to compare the estimators under the two metrics and to examine their finite sample performance. We use our method to detect diagnostic differences between diffusion tensors along fiber tracts in a study of human immunodeficiency virus.
For high-dimensional classification, it is well known that naively performing the Fisher discriminant rule leads to poor results due to diverging spectra and noise accumulation. Therefore, researchers proposed independence rules to circumvent the diverging spectra, and sparse independence rules to mitigate the issue of noise accumulation. However, in biological applications, there are often a group of correlated genes responsible for clinical outcomes, and the use of the covariance information can significantly reduce misclassification rates. In theory the extent of such error rate reductions is unveiled by comparing the misclassification rates of the Fisher discriminant rule and the independence rule. To materialize the gain based on finite samples, a Regularized Optimal Affine Discriminant (ROAD) is proposed. ROAD selects an increasing number of features as the regularization relaxes. Further benefits can be achieved when a screening method is employed to narrow the feature pool before hitting the ROAD. An efficient Constrained Coordinate Descent algorithm (CCD) is also developed to solve the associated optimization problems. Sampling properties of oracle type are established. Simulation studies and real data analysis support our theoretical results and demonstrate the advantages of the new classification procedure under a variety of correlation structures. A delicate result on continuous piecewise linear solution path for the ROAD optimization problem at the population level justifies the linear interpolation of the CCD algorithm.
High Dimensional Classification; LDA; Regularized Optimal Affine Discriminant; Fisher Discriminant; Independence Rule
We study the heteroscedastic partially linear single-index model with an unspecified error variance function, which allows for high dimensional covariates in both the linear and the single-index components of the mean function. We propose a class of consistent estimators of the parameters by using a proper weighting strategy. An interesting finding is that the linearity condition which is widely assumed in the dimension reduction literature is not necessary for methodological or theoretical development: it contributes only to the simplification of non-optimal consistent estimation. We also find that the performance of the usual weighted least square type of estimators deteriorates when the non-parametric component is badly estimated. However, estimators in our family automatically provide protection against such deterioration, in that the consistency can be achieved even if the baseline non-parametric function is completely misspecified. We further show that the most efficient estimator is a member of this family and can be easily obtained by using non-parametric estimation. Properties of the estimators proposed are presented through theoretical illustration and numerical simulations. An example on gender discrimination is used to demonstrate and to compare the practical performance of the estimators.
Dimension reduction; Double robustness; Linearity condition; Semiparametric efficiency bound; Single index model
It is now common to survey microbial communities by sequencing nucleic acid material extracted in bulk from a given environment. Comparative methods are needed that indicate the extent to which two communities differ given data sets of this type. UniFrac, which gives a somewhat ad hoc phylogenetics-based distance between two communities, is one of the most commonly used tools for these analyses. We provide a foundation for such methods by establishing that, if we equate a metagenomic sample with its empirical distribution on a reference phylogenetic tree, then the weighted UniFrac distance between two samples is just the classical Kantorovich–Rubinstein, or earth mover’s, distance between the corresponding empirical distributions. We demonstrate that this Kantorovich–Rubinstein distance and extensions incorporating uncertainty in the sample locations can be written as a readily computable integral over the tree, we develop Lp Zolotarev-type generalizations of the metric, and we show how the p-value of the resulting natural permutation test of the null hypothesis ‘no difference between two communities’ can be approximated by using a Gaussian process functional. We relate the L2-case to an analysis-of-variance type of decomposition, finding that the distribution of its associated Gaussian functional is that of a computable linear combination of independent
χ12 random variables.
Barycentre; CAT(κ)space; Gaussian process; Hadamard space; Metagenomics; Monte Carlo methods; Negative curvature; Optimal transport; Permutation test; Phylogenetics; Randomization test; Reproducing kernel Hilbert space; Wasserstein metric
Association models, like frailty and copula models, are frequently used to analyze clustered survival data and evaluate within-cluster associations. The assumption of noninformative censoring is commonly applied to these models, though it may not be true in many situations. In this paper, we consider bivariate competing risk data and focus on association models specified for the bivariate cumulative incidence function (CIF), a nonparametrically identifiable quantity. Copula models are proposed which relate the bivariate CIF to its corresponding univariate CIFs, similarly to independently right censored data, and accommodate frailty models for the bivariate CIF. Two estimating equations are developed to estimate the association parameter, permitting the univariate CIFs to be estimated either parametrically or nonparametrically. Goodness-of-fit tests are presented for formally evaluating the parametric models. Both estimators perform well with moderate sample sizes in simulation studies. The practical use of the methodology is illustrated in an analysis of dementia associations.
Variance estimation is a fundamental problem in statistical modelling. In ultrahigh dimensional linear regression where the dimensionality is much larger than the sample size, traditional variance estimation techniques are not applicable. Recent advances in variable selection in ultrahigh dimensional linear regression make this problem accessible. One of the major problems in ultrahigh dimensional regression is the high spurious correlation between the unobserved realized noise and some of the predictors. As a result, the realized noises are actually predicted when extra irrelevant variables are selected, leading to serious underestimate of the level of noise. We propose a two-stage refitted procedure via a data splitting technique, called refitted cross-validation, to attenuate the influence of irrelevant variables with high spurious correlations. Our asymptotic results show that the resulting procedure performs as well as the oracle estimator, which knows in advance the mean regression function. The simulation studies lend further support to our theoretical claims. The naive two-stage estimator and the plug-in one-stage estimators using the lasso and smoothly clipped absolute deviation are also studied and compared. Their performances can be improved by the reffitted cross-validation method proposed.
Data splitting; Dimension reduction; High dimensionality; Refitted cross-validation; Sure screening; Variance estimation; Variable selection
We consider the supervised classification setting, in which the data consist of p features measured on n observations, each of which belongs to one of K classes. Linear discriminant analysis (LDA) is a classical method for this problem. However, in the high-dimensional setting where p ≫ n, LDA is not appropriate for two reasons. First, the standard estimate for the within-class covariance matrix is singular, and so the usual discriminant rule cannot be applied. Second, when p is large, it is difficult to interpret the classification rule obtained from LDA, since it involves all p features. We propose penalized LDA, a general approach for penalizing the discriminant vectors in Fisher’s discriminant problem in a way that leads to greater interpretability. The discriminant problem is not convex, so we use a minorization-maximization approach in order to efficiently optimize it when convex penalties are applied to the discriminant vectors. In particular, we consider the use of L1 and fused lasso penalties. Our proposal is equivalent to recasting Fisher’s discriminant problem as a biconvex problem. We evaluate the performances of the resulting methods on a simulation study, and on three gene expression data sets. We also survey past methods for extending LDA to the high-dimensional setting, and explore their relationships with our proposal.
classification; feature selection; high dimensional; lasso; linear discriminant analysis; supervised learning
In many situations information from a sample of individuals can be supplemented by population level information on the relationship between a dependent variable and explanatory variables. Inclusion of the population level information can reduce bias and increase the efficiency of the parameter estimates.
Population level information can be incorporated via constraints on functions of the model parameters. In general the constraints are nonlinear making the task of maximum likelihood estimation harder. In this paper we develop an alternative approach exploiting the notion of an empirical likelihood. It is shown that within the framework of generalised linear models, the population level information corresponds to linear constraints, which are comparatively easy to handle. We provide a two-step algorithm that produces parameter estimates using only unconstrained estimation. We also provide computable expressions for the standard errors. We give an application to demographic hazard modelling by combining panel survey data with birth registration data to estimate annual birth probabilities by parity.
Empirical Likelihood; Constrained Optimisation; Generalised Linear Models
In high-dimensional model selection problems, penalized least-square approaches have been extensively used. This paper addresses the question of both robustness and efficiency of penalized model selection methods, and proposes a data-driven weighted linear combination of convex loss functions, together with weighted L1-penalty. It is completely data-adaptive and does not require prior knowledge of the error distribution. The weighted L1-penalty is used both to ensure the convexity of the penalty term and to ameliorate the bias caused by the L1-penalty. In the setting with dimensionality much larger than the sample size, we establish a strong oracle property of the proposed method that possesses both the model selection consistency and estimation efficiency for the true non-zero coefficients. As specific examples, we introduce a robust method of composite L1-L2, and optimal composite quantile method and evaluate their performance in both simulated and real data examples.
Composite QMLE; LASSO; Model Selection; NP Dimensionality; Oracle Property; Robust statistics; SCAD
We consider the problem of optimal design of experiments for random effects models, especially population models, where a small number of correlated observations can be taken on each individual, while the observations corresponding to different individuals are assumed to be uncorrelated. We focus on c-optimal design problems and show that the classical equivalence theorem and the famous geometric characterization of Elfving (1952) from the case of uncorrelated data can be adapted to the problem of selecting optimal sets of observations for the n individual patients. The theory is demonstrated by finding optimal designs for a linear model with correlated observations and a nonlinear random effects population model, which is commonly used in pharmacokinetics.
c-optimal design; correlated observations; Elfving's theorem; pharmacokinetic models; random effects; locally optimal design; geometric characterization
Neuroimaging studies aim to analyze imaging data with complex spatial patterns in a large number of locations (called voxels) on a two-dimensional (2D) surface or in a 3D volume. Conventional analyses of imaging data include two sequential steps: spatially smoothing imaging data and then independently fitting a statistical model at each voxel. However, conventional analyses suffer from the same amount of smoothing throughout the whole image, the arbitrary choice of smoothing extent, and low statistical power in detecting spatial patterns. We propose a multiscale adaptive regression model (MARM) to integrate the propagation–separation (PS) approach (Polzehl and Spokoiny, 2000, 2006) with statistical modeling at each voxel for spatial and adaptive analysis of neuroimaging data from multiple subjects. MARM has three features: being spatial, being hierarchical, and being adaptive. We use a multiscale adaptive estimation and testing procedure (MAET) to utilize imaging observations from the neighboring voxels of the current voxel to adaptively calculate parameter estimates and test statistics. Theoretically, we establish consistency and asymptotic normality of the adaptive parameter estimates and the asymptotic distribution of the adaptive test statistics. Our simulation studies and real data analysis confirm that MARM significantly outperforms conventional analyses of imaging data.
Kernel; Multiscale adaptive regression; Neuroimaging data; Propagation separation; Smoothing; Sphere; Test statistics
Testing the equality of two survival distributions can be difficult in a prevalent cohort study when non random sampling of subjects is involved. Due to the biased sampling scheme, independent censoring assumption is often violated. Although the issues about biased inference caused by length-biased sampling have been widely recognized in statistical, epidemiological and economical literature, there is no satisfactory solution for efficient two-sample testing. We propose an asymptotic most efficient nonparametric test by properly adjusting for length-biased sampling. The test statistic is derived from a full likelihood function, and can be generalized from the two-sample test to a k-sample test. The asymptotic properties of the test statistic under the null hypothesis are derived using its asymptotic independent and identically distributed representation. We conduct extensive Monte Carlo simulations to evaluate the performance of the proposed test statistics and compare them with the conditional test and the standard logrank test for different biased sampling schemes and right-censoring mechanisms. For length-biased data, empirical studies demonstrated that the proposed test is substantially more powerful than the existing methods. For general left-truncated data, the proposed test is robust, still maintains accurate control of type I error rate, and is also more powerful than the existing methods, if the truncation patterns and right-censoring patterns are the same between the groups. We illustrate the methods using two real data examples.
We consider the development of Bayesian Nonparametric methods for product partition models such as Hidden Markov Models and change point models. Our approach uses a Mixture of Dirichlet Process (MDP) model for the unknown sampling distribution (likelihood) for the observations arising in each state and a computationally efficient data augmentation scheme to aid inference. The method uses novel MCMC methodology which combines recent retrospective sampling methods with the use of slice sampler variables. The methodology is computationally efficient, both in terms of MCMC mixing properties, and robustness to the length of the time series being investigated. Moreover, the method is easy to implement requiring little or no user-interaction. We apply our methodology to the analysis of genomic copy number variation.
Retrospective sampling; block Gibbs sampler; local/global clustering; partition models; partial exchangeability
We consider functional measurement error models, i.e. models where covariates are measured with error and yet no distributional assumptions are made about the mismeasured variable. We propose and study a score-type local test and an orthogonal series-based, omnibus goodness-of-fit test in this context, where no likelihood function is available or calculated—i.e. all the tests are proposed in the semiparametric model framework. We demonstrate that our tests have optimality properties and computational advantages that are similar to those of the classical score tests in the parametric model framework. The test procedures are applicable to several semiparametric extensions of measurement error models, including when the measurement error distribution is estimated non-parametrically as well as for generalized partially linear models. The performance of the local score-type and omnibus goodness-of-fit tests is demonstrated through simulation studies and analysis of a nutrition data set.
Efficient estimation; Efficient testing; Errors in variables; Goodness-of-fit tests; Local alternatives; Measurement error; Score testing; Semiparametric models
Local polynomial regression is a useful nonparametric regression tool to explore fine data structures and has been widely used in practice. In this paper, we propose a new nonparametric regression technique called local composite-quantile-regression (CQR) smoothing in order to further improve local polynomial regression. Sampling properties of the proposed estimation procedure are studied. We derive the asymptotic bias, variance and normality of the proposed estimate. Asymptotic relative efficiency of the proposed estimate with respect to the local polynomial regression is investigated. It is shown that the proposed estimate can be much more efficient than the local polynomial regression estimate for various non-normal errors, while being almost as efficient as the local polynomial regression estimate for normal errors. Simulation is conducted to examine the performance of the proposed estimates. The simulation results are consistent with our theoretical findings. A real data example is used to illustrate the proposed method.
Asymptotic efficiency; CQR estimator; Kernel function; Local polynomial regression; Nonparametric regression
In semiparametric models, the dimension d of the maximum likelihood problem is potentially unlimited. Conventional estimation methods generally behave like O(d3). A new O(d) estimation procedure is proposed for a large class of semiparametric models. Potentially unlimited dimension is handled in a numerically efficient way through a Nelson–Aalen-like estimator. Discussion of the new method is put in the context of recently developed minorization–maximization algorithms based on surrogate objective functions. The procedure for semiparametric models is used to demonstrate three methods to construct a surrogate objective function: using the difference of two concave functions, the EM way and the new quasi-EM (QEM) approach. The QEM approach is based on a generalization of the EM-like construction of the surrogate objective function so it does not depend on the missing data representation of the model. Like the EM algorithm, the QEM method has a dual interpretation, a result of merging the idea of surrogate maximization with the idea of imputation and self-consistency. The new approach is compared with other possible approaches by using simulations and analysis of real data. The proportional odds model is used as an example throughout the paper.
EM algorithm; Frailty; Nonparametric maximum likelihood estimation; Profile likelihood; Semiparametric models
We discuss a Bayesian discovery procedure for multiple comparison problems. We show that under a coherent decision theoretic framework, a loss function combining true positive and false positive counts leads to a decision rule based on a threshold of the posterior probability of the alternative. Under a semi-parametric model for the data, we show that the Bayes rule can be approximated by the optimal discovery procedure (ODP), recently introduced by Storey (2007a). Improving the approximation leads us to a Bayesian discovery procedure (BDP), which exploits the multiple shrinkage in clusters implied by the assumed nonparametric model. We compare the BDP and the ODP estimates in a simple simulation study and in an assessment of differential gene expression based on microarray data from tumor samples. We extend the setting of the ODP by discussing modifications of the loss function that lead to different single thresholding statistics. Finally, we provide an application of the previous arguments to dependent (spatial) data.
Multiple Comparison; False Discovery Rate; Loss Function; Bayes Optimal Rule
With the ready availability of spatial databases and geographical information system software, statisticians are increasingly encountering multivariate modelling settings featuring associations of more than one type: spatial associations between data locations and associations between the variables within the locations. Although flexible modelling of multivariate point-referenced data has recently been addressed by using a linear model of co-regionalization, existing methods for multivariate areal data typically suffer from unnecessary restrictions on the covariance structure or undesirable dependence on the conditioning order of the variables. We propose a class of Bayesian hierarchical models for multivariate areal data that avoids these restrictions, permitting flexible and order-free modelling of correlations both between variables and across areal units. Our framework encompasses a rich class of multivariate conditionally autoregressive models that are computationally feasible via modern Markov chain Monte Carlo methods. We illustrate the strengths of our approach over existing models by using simulation studies and also offer a real data application involving annual lung, larynx and oesophageal cancer death-rates in Minnesota counties between 1990 and 2000.
Lattice data; Linear model of co-regionalization; Markov chain Monte Carlo methods; Multivariate conditionally autoregressive model; Spatial statistics
We consider estimation, from a double-blind randomized trial, of treatment effect within levels of base-line covariates on an outcome that is measured after a post-treatment event E has occurred in the subpopulation 𝒫E,E that would experience event E regardless of treatment. Specifically, we consider estimation of the parameters γ indexing models for the outcome mean conditional on treatment and base-line covariates in the subpopulation 𝒫E,E. Such parameters are not identified from randomized trial data but become identified if additionally it is assumed that the subpopulation 𝒫Ē,E of subjects that would experience event E under the second treatment but not under the first is empty and a parametric model for the conditional probability that a subject experiences event E if assigned to the first treatment given that the subject would experience the event if assigned to the second treatment, his or her outcome under the second treatment and his or her pretreatment covariates. We develop a class of estimating equations whose solutions comprise, up to asymptotic equivalence, all consistent and asymptotically normal estimators of γ under these two assumptions. In addition, we derive a locally semiparametric efficient estimator of γ. We apply our methods to estimate the effect on mean viral load of vaccine versus placebo after infection with human immunodeficiency virus (the event E) in a placebo-controlled randomized acquired immune deficiency syndrome vaccine trial.
Counterfactuals; Missing data; Potential outcomes; Principal stratification; Structural model; Vaccine trials
In recent years, many methods have been developed for regression in high-dimensional settings. We propose covariance-regularized regression, a family of methods that use a shrunken estimate of the inverse covariance matrix of the features in order to achieve superior prediction. An estimate of the inverse covariance matrix is obtained by maximizing its log likelihood, under a multivariate normal model, subject to a constraint on its elements; this estimate is then used to estimate coefficients for the regression of the response onto the features. We show that ridge regression, the lasso, and the elastic net are special cases of covariance-regularized regression, and we demonstrate that certain previously unexplored forms of covariance-regularized regression can outperform existing methods in a range of situations. The covariance-regularized regression framework is extended to generalized linear models and linear discriminant analysis, and is used to analyze gene expression data sets with multiple class and survival outcomes.
regression; classification; n ≪ p; covariance regularization
Ecological studies, in which data are available at the level of the group, rather than at the level of the individual, are susceptible to a range of biases due to their inability to characterize within-group variability in exposures and confounders. In order to overcome these biases, we propose a hybrid design in which ecological data are supplemented with a sample of individual-level case-control data. We develop the likelihood for this design and illustrate its benefits via simulation, both in bias reduction when compared to an ecological study, and in efficiency gains relative to a conventional case-control study. An interesting special case of the proposed design is the situation where ecological data are supplemented with case-only data. The design is illustrated using a dataset of county-specific lung cancer mortality rates in the state of Ohio from 1988.
Ecological bias; Efficiency; Outcome-dependent sampling; Two-phase sampling; Within-area confounding
Motivated from the problem of testing for genetic effects on complex traits in the presence of gene-environment interaction, we develop score tests in general semiparametric regression problems that involves Tukey style 1 degree-of-freedom form of interaction between parametrically and non-parametrically modelled covariates. We find that the score test in this type of model, as recently developed by Chatterjee and co-workers in the fully parametric setting, is biased and requires undersmoothing to be valid in the presence of non-parametric components. Moreover, in the presence of repeated outcomes, the asymptotic distribution of the score test depends on the estimation of functions which are defined as solutions of integral equations, making implementation difficult and computationally taxing. We develop profiled score statistics which are unbiased and asymptotically efficient and can be performed by using standard bandwidth selection methods. In addition, to overcome the difficulty of solving functional equations, we give easy interpretations of the target functions, which in turn allow us to develop estimation procedures that can be easily implemented by using standard computational methods. We present simulation studies to evaluate type I error and power of the method proposed compared with a naive test that does not consider interaction. Finally, we illustrate our methodology by analysing data from a case-control study of colorectal adenoma that was designed to investigate the association between colorectal adenoma and the candidate gene NAT2 in relation to smoking history.
Additive models; Diplotypes; Function estimation; Non-parametric regression; Omnibus hypothesis testing; Partially linear model; Repeated measures; Score test; Semiparametric models; Smooth backfitting; Tukey’s 1 degree-of-freedom model
Existing Bayesian model selection procedures require the specification of prior distributions on the parameters appearing in every model in the selection set. In practice, this requirement limits the application of Bayesian model selection methodology. To overcome this limitation, we propose a new approach towards Bayesian model selection that uses classical test statistics to compute Bayes factors between possible models. In several test cases, our approach produces results that are similar to previously proposed Bayesian model selection and model averaging techniques in which prior distributions were carefully chosen. In addition to eliminating the requirement to specify complicated prior distributions, this method offers important computational and algorithmic advantages over existing simulation-based methods. Because it is easy to evaluate the operating characteristics of this procedure for a given sample size and specified number of covariates, our method facilitates the selection of hyperparameter values through prior-predictive simulation.
Bayes factor; Coherency; False discovery rate; F-statistic; Likelihood ratio statistic; Model selection
Increasingly, scientific studies yield functional data, in which the ideal units of observation are curves and the observed data consist of sets of curves that are sampled on a fine grid. We present new methodology that generalizes the linear mixed model to the functional mixed model framework, with model fitting done by using a Bayesian wavelet-based approach. This method is flexible, allowing functions of arbitrary form and the full range of fixed effects structures and between-curve covariance structures that are available in the mixed model framework. It yields nonparametric estimates of the fixed and random-effects functions as well as the various between-curve and within-curve covariance matrices. The functional fixed effects are adaptively regularized as a result of the non-linear shrinkage prior that is imposed on the fixed effects’ wavelet coefficients, and the random-effect functions experience a form of adaptive regularization because of the separately estimated variance components for each wavelet coefficient. Because we have posterior samples for all model quantities, we can perform pointwise or joint Bayesian inference or prediction on the quantities of the model. The adaptiveness of the method makes it especially appropriate for modelling irregular functional data that are characterized by numerous local features like peaks.
Bayesian methods; Functional data analysis; Mixed models; Model averaging; Nonparametric regression; Proteomics; Wavelets
With scientific data available at geocoded locations, investigators are increasingly turning to spatial process models for carrying out statistical inference. Over the last decade, hierarchical models implemented through Markov chain Monte Carlo methods have become especially popular for spatial modelling, given their flexibility and power to fit models that would be infeasible with classical methods as well as their avoidance of possibly inappropriate asymptotics. However, fitting hierarchical spatial models often involves expensive matrix decompositions whose computational complexity increases in cubic order with the number of spatial locations, rendering such models infeasible for large spatial data sets. This computational burden is exacerbated in multivariate settings with several spatially dependent response variables. It is also aggravated when data are collected at frequent time points and spatiotemporal process models are used. With regard to this challenge, our contribution is to work with what we call predictive process models for spatial and spatiotemporal data. Every spatial (or spatiotemporal) process induces a predictive process model (in fact, arbitrarily many of them). The latter models project process realizations of the former to a lower dimensional subspace, thereby reducing the computational burden. Hence, we achieve the flexibility to accommodate non-stationary, non-Gaussian, possibly multivariate, possibly spatiotemporal processes in the context of large data sets. We discuss attractive theoretical properties of these predictive processes. We also provide a computational template encompassing these diverse settings. Finally, we illustrate the approach with simulated and real data sets.
Co-regionalization; Gaussian processes; Hierarchical modelling; Kriging; Markov chain Monte Carlo methods; Multivariate spatial processes; Space-time processes