Ultra-high dimensional data often display heterogeneity due to either heteroscedastic variance or other forms of non-location-scale covariate effects. To accommodate heterogeneity, we advocate a more general interpretation of sparsity which assumes that only a small number of covariates influence the conditional distribution of the response variable given all candidate covariates; however, the sets of relevant covariates may differ when we consider different segments of the conditional distribution. In this framework, we investigate the methodology and theory of nonconvex penalized quantile regression in ultra-high dimension. The proposed approach has two distinctive features: (1) it enables us to explore the entire conditional distribution of the response variable given the ultra-high dimensional covariates and provides a more realistic picture of the sparsity pattern; (2) it requires substantially weaker conditions compared with alternative methods in the literature; thus, it greatly alleviates the difficulty of model checking in the ultra-high dimension. In theoretic development, it is challenging to deal with both the nonsmooth loss function and the nonconvex penalty function in ultra-high dimensional parameter space. We introduce a novel sufficient optimality condition which relies on a convex differencing representation of the penalized loss function and the subdifferential calculus. Exploring this optimality condition enables us to establish the oracle property for sparse quantile regression in the ultra-high dimension under relaxed conditions. The proposed method greatly enhances existing tools for ultra-high dimensional data analysis. Monte Carlo simulations demonstrate the usefulness of the proposed procedure. The real data example we analyzed demonstrates that the new approach reveals substantially more information compared with alternative methods.
doi:10.1080/01621459.2012.656014
PMCID: PMC3471246
PMID: 23082036
Penalized Quantile Regression; SCAD; Sparsity; Ultra-high dimensional data
Tropospheric ozone is one of the six criteria pollutants regulated by the United States Environmental Protection Agency under the Clean Air Act and has been linked with several adverse health effects, including mortality. Due to the strong dependence on weather conditions, ozone may be sensitive to climate change and there is great interest in studying the potential effect of climate change on ozone, and how this change may affect public health. In this paper we develop a Bayesian spatial model to predict ozone under different meteorological conditions, and use this model to study spatial and temporal trends and to forecast ozone concentrations under different climate scenarios. We develop a spatial quantile regression model that does not assume normality and allows the covariates to affect the entire conditional distribution, rather than just the mean. The conditional distribution is allowed to vary from site-to-site and is smoothed with a spatial prior. For extremely large datasets our model is computationally infeasible, and we develop an approximate method. We apply the approximate version of our model to summer ozone from 1997–2005 in the Eastern U.S., and use deterministic climate models to project ozone under future climate conditions. Our analysis suggests that holding all other factors fixed, an increase in daily average temperature will lead to the largest increase in ozone in the Industrial Midwest and Northeast.
doi:10.1198/jasa.2010.ap09237
PMCID: PMC3583387
PMID: 23459794
Climate change; Ozone; Semiparametric Bayesian methods; Spatial data
Conventional agreement studies have been confined to addressing the sense of reproducibility, and therefore are limited to assessing measurements on the same scale. In this work, we propose a new concept, called “broad sense agreement,” which extends the classical framework of agreement to evaluate the capability of interpreting a continuous measurement in an ordinal scale. We present a natural measure for broad sense agreement. Nonparametric estimation and inference procedures are developed for the proposed measure along with theoretical justifications. We also consider longitudinal settings which involve agreement assessments at multiple time points. Simulation studies have demonstrated good performance of the proposed method with small sample sizes. We illustrate our methods via an application to a mental health study.
doi:10.1198/jasa.2011.tm10483
PMCID: PMC3565435
PMID: 23399559
Asymptotic normality; Consistency; Hypothesis testing; Jackknife; Nonparametric statistics; Order consistency
The current goal of initial antiretroviral (ARV) therapy is suppression of plasma human immunodeficiency virus (HIV)-1 RNA levels to below 200 copies per milliliter. A proportion of HIV-infected patients who initiate antiretroviral therapy in clinical practice or antiretroviral clinical trials either fail to suppress HIV-1 RNA or have HIV-1 RNA levels rebound on therapy. Frequently, these patients have sustained CD4 cell counts responses and limited or no clinical symptoms and, therefore, have potentially limited indications for altering therapy which they may be tolerating well despite increased viral replication. On the other hand, increased viral replication on therapy leads to selection of resistance mutations to the antiretroviral agents comprising their therapy and potentially cross-resistance to other agents in the same class decreasing the likelihood of response to subsequent antiretroviral therapy. The optimal time to switch antiretroviral therapy to ensure sustained virologic suppression and prevent clinical events in patients who have rebound in their HIV-1 RNA, yet are stable, is not known. Randomized clinical trials to compare early versus delayed switching have been difficult to design and more difficult to enroll. In some clinical trials, such as the AIDS Clinical Trials Group (ACTG) Study A5095, patients randomized to initial antiretroviral treatment combinations, who fail to suppress HIV-1 RNA or have a rebound of HIV-1 RNA on therapy are allowed to switch from the initial ARV regimen to a new regimen, based on clinician and patient decisions. We delineate a statistical framework to estimate the effect of early versus late regimen change using data from ACTG A5095 in the context of two-stage designs.
In causal inference, a large class of doubly robust estimators are derived through semiparametric theory with applications to missing data problems. This class of estimators is motivated through geometric arguments and relies on large samples for good performance. By now, several authors have noted that a doubly robust estimator may be suboptimal when the outcome model is misspecified even if it is semiparametric efficient when the outcome regression model is correctly specified. Through auxiliary variables, two-stage designs, and within the contextual backdrop of our scientific problem and clinical study, we propose improved doubly robust, locally efficient estimators of a population mean and average causal effect for early versus delayed switching to second-line ARV treatment regimens. Our analysis of the ACTG A5095 data further demonstrates how methods that use auxiliary variables can improve over methods that ignore them. Using the methods developed here, we conclude that patients who switch within 8 weeks of virologic failure have better clinical outcomes, on average, than patients who delay switching to a new second-line ARV regimen after failing on the initial regimen. Ordinary statistical methods fail to find such differences. This article has online supplementary material.
doi:10.1080/01621459.2011.646932
PMCID: PMC3545451
PMID: 23329858
Causal inference; Double robustness; Longitudinal data analysis; Missing data; Rubin causal model; Semiparametric efficient estimation
We introduce the large portfolio selection using gross-exposure constraints. We show that with gross-exposure constraint the empirically selected optimal portfolios based on estimated covariance matrices have similar performance to the theoretical optimal ones and there is no error accumulation effect from estimation of vast covariance matrices. This gives theoretical justification to the empirical results in Jagannathan and Ma (2003). We also show that the no-short-sale portfolio can be improved by allowing some short positions. The applications to portfolio selection, tracking, and improvements are also addressed. The utility of our new approach is illustrated by simulation and empirical studies on the 100 Fama-French industrial portfolios and the 600 stocks randomly selected from Russell 3000.
doi:10.1080/01621459.2012.682825
PMCID: PMC3535429
PMID: 23293404
Short-sale constraint; mean-variance efficiency; portfolio selection; risk assessment; risk optimization; portfolio improvement
Summary
In high-dimensional data analysis, feature selection becomes one means for dimension reduction, which proceeds with parameter estimation. Concerning accuracy of selection and estimation, we study nonconvex constrained and regularized likelihoods in the presence of nuisance parameters. Theoretically, we show that constrained L0-likelihood and its computational surrogate are optimal in that they achieve feature selection consistency and sharp parameter estimation, under one necessary condition required for any method to be selection consistent and to achieve sharp parameter estimation. It permits up to exponentially many candidate features. Computationally, we develop difference convex methods to implement the computational surrogate through prime and dual subproblems. These results establish a central role of L0-constrained and regularized likelihoods in feature selection and parameter estimation involving selection. As applications of the general method and theory, we perform feature selection in linear regression and logistic regression, and estimate a precision matrix in Gaussian graphical models. In these situations, we gain a new theoretical insight and obtain favorable numerical results. Finally, we discuss an application to predict the metastasis status of breast cancer patients with their gene expression profiles.
doi:10.1080/01621459.2011.645783
PMCID: PMC3378256
PMID: 22736876
Coordinate decent; continuous but non-smooth minimization; general likelihood; graphical models; nonconvex; (p, n)-asymptotics
Identifying the risk factors for comorbidity is important in psychiatric research. Empirically, studies have shown that testing multiple, correlated traits simultaneously is more powerful than testing a single trait at a time in association analysis. Furthermore, for complex diseases, especially mental illnesses and behavioral disorders, the traits are often recorded in different scales such as dichotomous, ordinal and quantitative. In the absence of covariates, nonparametric association tests have been developed for multiple complex traits to study comorbidity. However, genetic studies generally contain measurements of some covariates that may affect the relationship between the risk factors of major interest (such as genes) and the outcomes. While it is relatively easy to adjust these covariates in a parametric model for quantitative traits, it is challenging for multiple complex traits with possibly different scales. In this article, we propose a nonparametric test for multiple complex traits that can adjust for covariate effects. The test aims to achieve an optimal scheme of adjustment by using a maximum statistic calculated from multiple adjusted test statistics. We derive the asymptotic null distribution of the maximum test statistic, and also propose a resampling approach, both of which can be used to assess the significance of our test. Simulations are conducted to compare the type I error and power of the nonparametric adjusted test to the unadjusted test and other existing adjusted tests. The empirical results suggest that our proposed test increases the power through adjustment for covariates when there exist environmental effects, and is more robust to model misspecifications than some existing parametric adjusted tests. We further demonstrate the advantage of our test by analyzing a data set on genetics of alcoholism.
doi:10.1080/01621459.2011.643707
PMCID: PMC3381868
PMID: 22745516
Comorbidity; Environmental factor; Family-based association test; Maximum test statistic; Multiple traits; Ordinal traits
We propose recursively imputed survival tree (RIST) regression for right-censored data. This new nonparametric regression procedure uses a novel recursive imputation approach combined with extremely randomized trees that allows significantly better use of censored data than previous tree based methods, yielding improved model fit and reduced prediction error. The proposed method can also be viewed as a type of Monte Carlo EM algorithm which generates extra diversity in the tree-based fitting process. Simulation studies and data analyses demonstrate the superior performance of RIST compared to previous methods.
doi:10.1080/01621459.2011.637468
PMCID: PMC3486435
PMID: 23125470
Trees; Ensemble; Random Forests; Censored data; Imputation; Survival Analysis
Portfolio allocation with gross-exposure constraint is an effective method to increase the efficiency and stability of portfolios selection among a vast pool of assets, as demonstrated in Fan et al. (2011). The required high-dimensional volatility matrix can be estimated by using high frequency financial data. This enables us to better adapt to the local volatilities and local correlations among vast number of assets and to increase significantly the sample size for estimating the volatility matrix. This paper studies the volatility matrix estimation using high-dimensional high-frequency data from the perspective of portfolio selection. Specifically, we propose the use of “pairwise-refresh time” and “all-refresh time” methods based on the concept of “refresh time” proposed by Barndorff-Nielsen et al. (2008) for estimation of vast covariance matrix and compare their merits in the portfolio selection. We establish the concentration inequalities of the estimates, which guarantee desirable properties of the estimated volatility matrix in vast asset allocation with gross exposure constraints. Extensive numerical studies are made via carefully designed simulations. Comparing with the methods based on low frequency daily data, our methods can capture the most recent trend of the time varying volatility and correlation, hence provide more accurate guidance for the portfolio allocation in the next time period. The advantage of using high-frequency data is significant in our simulation and empirical studies, which consist of 50 simulated assets and 30 constituent stocks of Dow Jones Industrial Average index.
doi:10.1080/01621459.2012.656041
PMCID: PMC3526073
PMID: 23264708
Volatility matrix estimation; high frequency data; concentration inequalities; portfolio allocation; risk assessment; refresh time
In longitudinal biomedical studies, there is often interest in the rate functions, which describe the functional rates of change of biomarker profiles. This paper proposes a semiparametric approach to model these functions as the realizations of stochastic processes defined by stochastic differential equations. These processes are dependent on the covariates of interest and vary around a specified parametric function. An efficient Markov chain Monte Carlo algorithm is developed for inference. The proposed method is compared with several existing methods in terms of goodness-of-fit and more importantly the ability to forecast future functional data in a simulation study. The proposed methodology is applied to prostate-specific antigen profiles for illustration. Supplementary materials for this paper are available online.
doi:10.1198/jasa.2011.tm09294
PMCID: PMC3298426
PMID: 22423170
Euler approximation; Functional data analysis; Gaussian process; Rate function; Stochastic differential equation; Semiparametric stochastic velocity model
In applications to dependent data, first and foremost relational data, a number of discrete exponential family models has turned out to be near-degenerate and problematic in terms of Markov chain Monte Carlo simulation and statistical inference. We introduce the notion of instability with an eye to characterize, detect, and penalize discrete exponential family models that are near-degenerate and problematic in terms of Markov chain Monte Carlo simulation and statistical inference. We show that unstable discrete exponential family models are characterized by excessive sensitivity and near-degeneracy. In special cases, the subset of the natural parameter space corresponding to non-degenerate distributions and mean-value parameters far from the boundary of the mean-value parameter space turns out to be a lower-dimensional subspace of the natural parameter space. These characteristics of unstable discrete exponential family models tend to obstruct Markov chain Monte Carlo simulation and statistical inference. In applications to relational data, we show that discrete exponential family models with Markov dependence tend to be unstable and that the parameter space of some curved exponential families contains unstable subsets.
doi:10.1198/jasa.2011.tm10747
PMCID: PMC3405854
PMID: 22844170
social networks; statistical exponential families; curved exponential families; undirected graphical models; Markov chain Monte Carlo
Gene regulation is a complicated process. The interaction of many genes and their products forms an intricate biological network. Identification of this dynamic network will help us understand the biological process in a systematic way. However, the construction of such a dynamic network is very challenging for a high-dimensional system. In this article we propose to use a set of ordinary differential equations (ODE), coupled with dimensional reduction by clustering and mixed-effects modeling techniques, to model the dynamic gene regulatory network (GRN). The ODE models allow us to quantify both positive and negative gene regulations as well as feedback effects of one set of genes in a functional module on the dynamic expression changes of the genes in another functional module, which results in a directed graph network. A five-step procedure, Clustering, Smoothing, regulation Identification, parameter Estimates refining and Function enrichment analysis (CSIEF) is developed to identify the ODE-based dynamic GRN. In the proposed CSIEF procedure, a series of cutting-edge statistical methods and techniques are employed, that include non-parametric mixed-effects models with a mixture distribution for clustering, nonparametric mixed-effects smoothing-based methods for ODE models, the smoothly clipped absolute deviation (SCAD)-based variable selection, and stochastic approximation EM (SAEM) approach for mixed-effects ODE model parameter estimation. The key step, the SCAD-based variable selection of the proposed procedure is justified by investigating its asymptotic properties and validated by Monte Carlo simulations. We apply the proposed method to identify the dynamic GRN for yeast cell cycle progression data. We are able to annotate the identified modules through function enrichment analyses. Some interesting biological findings are discussed. The proposed procedure is a promising tool for constructing a general dynamic GRN and more complicated dynamic networks.
doi:10.1198/jasa.2011.ap10194
PMCID: PMC3509540
PMID: 23204614
Differential equations; Network graph; NLME; Nonparametric mixed-effects model; Saccharomyces cerevisiae; SAEM; SCAD; Time course microarray data; Two-stage smoothing-based method; Yeast cell cycles
We consider nonparametric regression of a scalar outcome on a covariate when the outcome is missing at random (MAR) given the covariate and other observed auxiliary variables. We propose a class of augmented inverse probability weighted (AIPW) kernel estimating equations for nonparametric regression under MAR. We show that AIPW kernel estimators are consistent when the probability that the outcome is observed, that is, the selection probability, is either known by design or estimated under a correctly specified model. In addition, we show that a specific AIPW kernel estimator in our class that employs the fitted values from a model for the conditional mean of the outcome given covariates and auxiliaries is double-robust, that is, it remains consistent if this model is correctly specified even if the selection probabilities are modeled or specified incorrectly. Furthermore, when both models happen to be right, this double-robust estimator attains the smallest possible asymptotic variance of all AIPW kernel estimators and maximally extracts the information in the auxiliary variables. We also describe a simple correction to the AIPW kernel estimating equations that while preserving double-robustness it ensures efficiency improvement over nonaugmented IPW estimation when the selection model is correctly specified regardless of the validity of the second model used in the augmentation term. We perform simulations to evaluate the finite sample performance of the proposed estimators, and apply the methods to the analysis of the AIDS Costs and Services Utilization Survey data. Technical proofs are available online.
doi:10.1198/jasa.2010.tm08463
PMCID: PMC3491912
PMID: 23144520
Asymptotics; Augmented kernel estimating equations; Double robustness; Efficiency; Inverse probability weighted kernel estimating equations; Kernel smoothing
In many applications researchers collect multivariate binary response data under two or more, naturally ordered, experimental conditions. In such situations one is often interested in using all binary outcomes simultaneously to detect an ordering among the experimental conditions. To make such comparisons we develop a general methodology for testing for the multivariate stochastic order between K ≥ 2 multivariate binary distributions. The proposed test uses order restricted estimators which, according to our simulation study, are more efficient than the unrestricted estimators in terms of mean squared error. The power of the proposed test was compared with several alternative tests. These included procedures which combine individual univariate tests for order, such as union intersection tests and a Bonferroni based test. We also compared the proposed test with unrestricted Hotelling's T2 type test. Our simulations suggest that the proposed method competes well with these alternatives. The gain in power is often substantial. The proposed methodology is illustrated by applying it to a two–year rodent cancer bioassay data obtained from the US National Toxicology Program (NTP). Supplemental materials are available online.
doi:10.1198/jasa.2011.tm10322
PMCID: PMC3437998
PMID: 22973069
dose response studies; multivariate binary data; order restricted statistical inference; stochastic order relations
We present new statistical analyses of data arising from a clinical trial designed to compare two-stage dynamic treatment regimes (DTRs) for advanced prostate cancer. The trial protocol mandated that patients were to be initially randomized among four chemotherapies, and that those who responded poorly were to be rerandomized to one of the remaining candidate therapies. The primary aim was to compare the DTRs’ overall success rates, with success defined by the occurrence of successful responses in each of two consecutive courses of the patient’s therapy. Of the one hundred and fifty study participants, forty seven did not complete their therapy per the algorithm. However, thirty five of them did so for reasons that precluded further chemotherapy; i.e. toxicity and/or progressive disease. Consequently, rather than comparing the overall success rates of the DTRs in the unrealistic event that these patients had remained on their assigned chemotherapies, we conducted an analysis that compared viable switch rules defined by the per-protocol rules but with the additional provision that patients who developed toxicity or progressive disease switch to a non-prespecified therapeutic or palliative strategy. This modification involved consideration of bivariate per-course outcomes encoding both efficacy and toxicity. We used numerical scores elicited from the trial’s Principal Investigator to quantify the clinical desirability of each bivariate per-course outcome, and defined one endpoint as their average over all courses of treatment. Two other simpler sets of scores as well as log survival time also were used as endpoints. Estimation of each DTR-specific mean score was conducted using inverse probability weighted methods that assumed that missingness in the twelve remaining drop-outs was informative but explainable in that it only depended on past recorded data. We conducted additional worst-best case analyses to evaluate sensitivity of our findings to extreme departures from the explainable drop-out assumption.
doi:10.1080/01621459.2011.641416
PMCID: PMC3433243
PMID: 22956855
Causal inference; Efficiency; Informative dropout; Inverse probability weighting; Marginal structural models; Optimal regime; Simultaneous confidence intervals
Genomewide association studies have become the primary tool for discovering the genetic basis of complex human diseases. Such studies are susceptible to the confounding effects of population stratification, in that the combination of allele-frequency heterogeneity with disease-risk heterogeneity among different ancestral subpopulations can induce spurious associations between genetic variants and disease. This article provides a statistically rigorous and computationally feasible solution to this challenging problem of unmeasured confounders. We show that the odds ratio of disease with a genetic variant is identifiable if and only if the genotype is independent of the unknown population substructure conditional on a set of observed ancestry-informative markers in the disease-free population. Under this condition, the odds ratio of interest can be estimated by fitting a semiparametric logistic regression model with an arbitrary function of a propensity score relating the genotype probability to ancestry-informative markers. Approximating the unknown function of the propensity score by B-splines, we derive a consistent and asymptotically normal estimator for the odds ratio of interest with a consistent variance estimator. Simulation studies demonstrate that the proposed inference procedures perform well in realistic settings. An application to the well-known Wellcome Trust Case-Control Study is presented. Supplemental materials are available online.
doi:10.1198/jasa.2011.tm10294
PMCID: PMC3314247
PMID: 22467997
B-spline; Case-control study; Principal components; Propensity score; Semiparametric logistic regression; Single nucleotide polymorphism
The proportional odds model may serve as a useful alternative to the Cox proportional hazards model to study association between covariates and their survival functions in medical studies. In this article, we study an extended proportional odds model that incorporates the so-called “external” time-varying covariates. In the extended model, regression parameters have a direct interpretation of comparing survival functions, without specifying the baseline survival odds function. Semiparametric and maximum likelihood estimation procedures are proposed to estimate the extended model. Our methods are demonstrated by Monte-Carlo simulations, and applied to a landmark randomized clinical trial of a short course Nevirapine (NVP) for mother-to-child transmission (MTCT) of human immunodeficiency virus type-1 (HIV-1). Additional application includes analysis of the well-known Veterans Administration (VA) Lung Cancer Trial.
doi:10.1080/01621459.2012.656021
PMCID: PMC3420072
PMID: 22904583
Counting process; Estimating function; HIV/AIDS; Maximum likelihood estimation; Semiparametric model; Time-varying covariate
A sensitivity analysis displays the increase in uncertainty that attends an inference when a key assumption is relaxed. In matched observational studies of treatment effects, a key assumption in some analyses is that subjects matched for observed covariates are comparable, and this assumption is relaxed by positing a relevant covariate that was not observed and not controlled by matching. What properties would such an unobserved covariate need to have to materially alter the inference about treatment effects? For ease of calculation and reporting, it is convenient that the sensitivity analysis be of low dimension, perhaps indexed by a scalar sensitivity parameter, but for interpretation in specific contexts, a higher dimensional analysis may be of greater relevance. An amplification of a sensitivity analysis is defined as a map from each point in a low dimensional sensitivity analysis to a set of points, perhaps a ‘curve,’ in a higher dimensional sensitivity analysis such that the possible inferences are the same for all points in the set. Possessing an amplification, an investigator may calculate and report the low dimensional analysis, yet have available the interpretations of the higher dimensional analysis.
doi:10.1198/jasa.2009.tm08470
PMCID: PMC3416023
PMID: 22888178
Amplification; causal effects; observational study; sensitivity analysis
We present a mixture cure model with the survival time of the “uncured” group coming from a class of linear transformation models, which is an extension of the proportional odds model. This class of model, first proposed by Dabrowska and Doksum (1988), which we term “generalized proportional odds model,” is well suited for the mixture cure model setting due to a clear separation between long-term and short-term effects. A standard expectation–maximization algorithm can be employed to locate the nonparametric maximum likelihood estimators, which are shown to be consistent and semiparametric efficient. However, there are difficulties in the M-step due to the nonparametric component. We overcome these difficulties by proposing two different algorithms. The first is to employ an majorize-minimize (MM) algorithm in the M-step instead of the usual Newton–Raphson method, and the other is based on an alternative form to express the model as a proportional hazards frailty model. The two new algorithms are compared in a simulation study with an existing estimating equation approach by Lu and Ying (2004). The MM algorithm provides both computational stability and efficiency. A case study of leukemia data is conducted to illustrate the proposed procedures.
doi:10.1198/jasa.2009.tm08459
PMCID: PMC3410987
PMID: 22865944
Expectation–maximization algorithm; Logistic regression; Majorize-minimize algorithm; Newton–Raphson algorithm; Nonparametric maximum likelihood estimator; Transformation model
As an alternative to the local partial likelihood method of Tibshirani and Hastie and Fan, Gijbels, and King, a global partial likelihood method is proposed to estimate the covariate effect in a nonparametric proportional hazards model, λ(t|x) = exp{ψ(x)}λ0(t). The estimator, ψ̂(x), reduces to the Cox partial likelihood estimator if the covariate is discrete. The estimator is shown to be consistent and semiparametrically efficient for linear functionals of ψ(x). Moreover, Breslow-type estimation of the cumulative baseline hazard function, using the proposed estimator ψ̂(x), is proved to be efficient. The asymptotic bias and variance are derived under regularity conditions. Computation of the estimator involves an iterative but simple algorithm. Extensive simulation studies provide evidence supporting the theory. The method is illustrated with the Stanford heart transplant data set. The proposed global approach is also extended to a partially linear proportional hazards model and found to provide efficient estimation of the slope parameter. This article has the supplementary materials online.
doi:10.1198/jasa.2010.tm08636
PMCID: PMC3404854
PMID: 22844168
Cox model; Local linear smoothing; Local partial likelihood; Semiparametric efficiency
Analysis of high dimensional data often seeks to identify a subset of important features and assess their effects on the outcome. Traditional statistical inference procedures based on standard regression methods often fail in the presence of high-dimensional features. In recent years, regularization methods have emerged as promising tools for analyzing high dimensional data. These methods simultaneously select important features and provide stable estimation of their effects. Adaptive LASSO and SCAD for instance, give consistent and asymptotically normal estimates with oracle properties. However, in finite samples, it remains difficult to obtain interval estimators for the regression parameters. In this paper, we propose perturbation resampling based procedures to approximate the distribution of a general class of penalized parameter estimates. Our proposal, justified by asymptotic theory, provides a simple way to estimate the covariance matrix and confidence regions. Through finite sample simulations, we verify the ability of this method to give accurate inference and compare it to other widely used standard deviation and confidence interval estimates. We also illustrate our proposals with a data set used to study the association of HIV drug resistance and a large number of genetic mutations.
doi:10.1198/jasa.2011.tm10382
PMCID: PMC3404855
PMID: 22844171
High dimensional regression; Interval estimation; Oracle property; Regularized estimation; Resampling methods
Summary
To evaluate the clinical utility of new risk markers, a crucial step is to measure their predictive accuracy with prospective studies. However, it is often infeasible to obtain marker values for all study participants. The nested case-control (NCC) design is a useful cost-effective strategy for such settings. Under the NCC design, markers are only ascertained for cases and a fraction of controls sampled randomly from the risk sets. The outcome dependent sampling generates a complex data structure and therefore a challenge for analysis. Existing methods for analyzing NCC studies focus primarily on association measures. Here, we propose a class of non-parametric estimators for commonly used accuracy measures. We derived asymptotic expansions for accuracy estimators based on both finite population and Bernoulli sampling and established asymptotic equivalence between the two. Simulation results suggest that the proposed procedures perform well in finite samples. The new procedures were illustrated with data from the Framingham Offspring study.
doi:10.1198/jasa.2011.tm09807
PMCID: PMC3404857
PMID: 22844169
Biomarker study; Classification Accuracy; Conditional Kaplan Meier; Inverse Probability Weighting; Nested case-control study; Predictive Value; ROC Curve; Time-dependent Accuracy
We propose a novel alternative to case-control sampling for the estimation of individual-level risk in spatial epidemiology. Our approach uses weighted estimating equations to estimate regression parameters in the intensity function of an inhomogeneous spatial point process, when information on risk-factors is available at the individual level for cases, but only at a spatially aggregated level for the population at risk. We develop data-driven methods to select the weights used in the estimating equations and show through simulation that the choice of weights can have a major impact on efficiency of estimation. We develop a formal test to detect non-Poisson behavior in the underlying point process and assess the performance of the test using simulations of Poisson and Poisson cluster point processes. We apply our methods to data on the spatial distribution of childhood meningococcal disease cases in Merseyside, U.K. between 1981 and 2007.
doi:10.1198/jasa.2010.ap09323
PMCID: PMC3395722
PMID: 22798701
Estimating equations; Inhomogeneous spatial-point processes; Meningococcal disease
With the recent explosion of scientific data of unprecedented size and complexity, feature ranking and screening are playing an increasingly important role in many scientific studies. In this article, we propose a novel feature screening procedure under a unified model framework, which covers a wide variety of commonly used parametric and semiparametric models. The new method does not require imposing a specific model structure on regression functions, and thus is particularly appealing to ultrahigh-dimensional regressions, where there are a huge number of candidate predictors but little information about the actual model forms. We demonstrate that, with the number of predictors growing at an exponential rate of the sample size, the proposed procedure possesses consistency in ranking, which is both useful in its own right and can lead to consistency in selection. The new procedure is computationally efficient and simple, and exhibits a competent empirical performance in our intensive simulations and real data analysis.
doi:10.1198/jasa.2011.tm10563
PMCID: PMC3384506
PMID: 22754050
Feature ranking; feature screening; ultrahigh-dimensional regression; variable selection
In randomized studies, treatment comparisons conditional on intermediate post-randomization outcomes using standard analytic methods do not have a causal interpretation. An alternate approach entails treatment comparisons within principal strata defined by the potential outcomes for the intermediate outcome that would be observed under each treatment assignment. In this paper, we develop methods for randomization-based inference within principal strata. The proposed methods are compared with existing large-sample methods as well as traditional intent-to-treat approaches. This research is motivated by HIV prevention studies where few infections are expected and inference is desired within the always-infected principal stratum, i.e., all individuals who would become infected regardless of randomization assignment.
doi:10.1198/jasa.2011.tm10356
PMCID: PMC3188760
PMID: 21987597
causal inference; covariate adjustment; exact test; randomization