PMCC

Results 1-25 (1267600)
1.  A Partial Linear Model in the Outcome Dependent Sampling Setting to Evaluate the Effect of Prenatal PCB Exposure on Cognitive Function in Children 
Biometrics  2010;67(3):876-885.
Summary
Outcome-dependent sampling (ODS) has been widely used in biomedical studies because it is a cost-effective way to improve study efficiency. However, in the setting of a continuous outcome, the representation of the exposure variable has been limited to the framework of linear models, owing to both theoretical and computational challenges. Partial linear models (PLM) are a powerful inference tool for nonparametrically modeling the relation between an outcome and the exposure variable. In this article, we consider a partial linear model for data from an ODS design. We propose a semiparametric maximum likelihood method for inference with a PLM, develop its asymptotic properties, and conduct simulation studies to show that the proposed ODS estimator is more efficient than an estimator from a traditional simple random sampling design with the same sample size. Using this newly developed method, we explore an open question in epidemiology: whether in utero exposure to background levels of PCBs is associated with children’s intellectual impairment. Our model provides further insights into the relation between low-level PCB exposure and children’s cognitive function, and the results shed new light on a body of inconsistent epidemiologic findings.
doi:10.1111/j.1541-0420.2010.01500.x
PMCID: PMC3182522  PMID: 21039397
Cost-effective designs; Empirical likelihood; Outcome dependent sampling; Partial linear model; Polychlorinated biphenyls; P-spline
2.  Semiparametric inference for a 2-stage outcome-auxiliary-dependent sampling design with continuous outcome 
Biostatistics (Oxford, England)  2011;12(3):521-534.
The two-stage design has long been recognized as a cost-effective way to conduct biomedical studies. In many trials, auxiliary covariate information may also be available, and it is of interest to exploit these auxiliary data to improve the efficiency of inference. In this paper, we propose a 2-stage design with a continuous outcome in which the second-stage data are sampled under an “outcome-auxiliary-dependent sampling” (OADS) scheme. We propose an estimator that maximizes an estimated likelihood function and show that it is consistent and asymptotically normally distributed. Simulation studies indicate that, by utilizing the auxiliary covariate information, greater efficiency gains can be achieved under the proposed 2-stage OADS design than under alternative sampling schemes. We illustrate the proposed method by analyzing a data set from an environmental epidemiologic study.
doi:10.1093/biostatistics/kxq080
PMCID: PMC3114654  PMID: 21252082
Auxiliary covariate; Kernel smoothing; Outcome-auxiliary-dependent sampling; 2-stage sampling design
3.  Mixed effect regression analysis for a cluster-based two-stage outcome-auxiliary-dependent sampling design with a continuous outcome 
Biostatistics (Oxford, England)  2012;13(4):650-664.
The two-stage design is a well-known cost-effective way to conduct biomedical studies when the exposure variable is expensive or difficult to measure. Recent research developments further allow one or both stages of the two-stage design to depend on a continuous outcome variable. This outcome-dependent sampling feature enables further efficiency gains in parameter estimation and overall cost reduction of the study (e.g. Wang, X. and Zhou, H., 2010. Design and inference for cancer biomarker study with an outcome and auxiliary-dependent subsampling. Biometrics 66, 502–511; Zhou, H., Song, R., Wu, Y. and Qin, J., 2011. Statistical inference for a two-stage outcome-dependent sampling design with a continuous outcome. Biometrics 67, 194–202). In this paper, we develop a semiparametric mixed effect regression model for data from a two-stage design in which the second-stage data are sampled under an outcome-auxiliary-dependent sampling (OADS) scheme. Our method allows the cluster- or center-effects of the study subjects to be accounted for. We propose an estimated likelihood function to estimate the regression parameters. A simulation study indicates that greater efficiency gains can be achieved under the proposed two-stage OADS design with center-effects than under alternative sampling schemes. We illustrate the proposed method by analyzing a dataset from the Collaborative Perinatal Project.
doi:10.1093/biostatistics/kxs013
PMCID: PMC3440236  PMID: 22723503
Center effect; Mixed model; Outcome-auxiliary-dependent sampling; Validation sample
4.  Statistical Inference for a Two-Stage Outcome-Dependent Sampling Design with a Continuous Outcome 
Biometrics  2011;67(1):194-202.
Summary
The two-stage case-control design has been widely used in epidemiologic studies for its cost-effectiveness and improvement of study efficiency (White, 1982; Breslow and Cain, 1988). The evolution of modern biomedical studies has called for cost-effective designs with a continuous outcome and exposure variables. In this paper, we propose a new two-stage outcome-dependent sampling scheme with a continuous outcome variable, in which both the first-stage and the second-stage data are drawn by outcome-dependent sampling. We develop a semiparametric empirical likelihood method for inference about the regression parameters under the proposed design. Simulation studies were conducted to investigate the small-sample behavior of the proposed estimator. We demonstrate that, for a given statistical power, the proposed design requires a substantially smaller sample size than the alternative designs. The proposed method is illustrated with an environmental health study conducted at the National Institutes of Health.
doi:10.1111/j.1541-0420.2010.01446.x
PMCID: PMC4106685  PMID: 20560938
Biased sampling; Empirical likelihood; Outcome dependent; Sample size; Two-stage design
5.  A Partially Linear Regression Model for Data from an Outcome-Dependent Sampling Design 
Summary
The outcome-dependent sampling scheme has been gaining attention in both the statistical literature and applied fields. Epidemiological and environmental researchers have used it to select observations for more powerful and cost-effective studies. Motivated by a study of the effect of in utero exposure to polychlorinated biphenyls on children’s IQ at age 7, in which the effect of an important confounding variable is nonlinear, we consider a semiparametric regression model for data from an outcome-dependent sampling scheme in which the relationship between the response and covariates is only partially parameterized. We propose a penalized spline maximum likelihood estimator (PSMLE) for inference on both the parametric and nonparametric components and develop its asymptotic properties. Through simulation studies and an analysis of the IQ study, we compare the proposed estimator with several competing estimators. Practical considerations in implementing these estimators are discussed.
doi:10.1111/j.1467-9876.2010.00756.x
PMCID: PMC3181132  PMID: 21966030
Outcome dependent sampling; Estimated likelihood; Semiparametric method; Penalized spline
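As a rough illustration of the penalized-spline idea underlying the PSMLE, here is a minimal P-spline fit in Python using only NumPy: a ridge-penalized regression on a truncated-line basis. This is a generic smoother under simple random sampling, not the paper's outcome-dependent-sampling likelihood; the function name, knot count, and penalty value are illustrative.

```python
import numpy as np

def pspline_fit(x, y, n_knots=10, lam=1.0):
    """Generic P-spline: ridge-penalized least squares on a
    truncated-line basis (a sketch, not the paper's PSMLE)."""
    # Interior knots at equally spaced quantiles of x
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    # Design matrix: intercept, linear term, then (x - k)_+ terms
    B = np.column_stack(
        [np.ones_like(x), x] + [np.clip(x - k, 0.0, None) for k in knots]
    )
    # Penalize only the truncated-basis coefficients, not intercept/slope
    D = np.diag([0.0, 0.0] + [1.0] * n_knots)
    beta = np.linalg.solve(B.T @ B + lam * D, B.T @ y)
    return B @ beta

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-2.0, 2.0, 200))
y = np.sin(np.pi * x / 2) + rng.normal(0.0, 0.1, 200)
fit = pspline_fit(x, y)
```

Increasing `lam` pulls the fit toward the unpenalized straight line, which is the usual bias-variance dial in P-spline smoothing.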
6.  Semiparametric Inference for Data with a Continuous Outcome from a Two-Phase Probability Dependent Sampling Scheme 
Multi-phase designs and biased sampling designs are two well-recognized approaches to enhancing study efficiency. In this paper, we propose a new and cost-effective sampling design, the two-phase probability-dependent sampling (PDS) design, for studies with a continuous outcome. This design enables investigators to make efficient use of resources by targeting more informative subjects for sampling. We develop a new semiparametric empirical likelihood inference method to take advantage of data obtained through a PDS design. Simulation results indicate that the proposed sampling scheme, coupled with the proposed estimator, is more efficient and more powerful than the existing outcome-dependent sampling design and the simple random sampling design with the same sample size. We illustrate the proposed method with a real data set from an environmental epidemiologic study.
doi:10.1111/rssb.12029
PMCID: PMC3984585  PMID: 24737947
Empirical likelihood; Missing data; Semiparametric; Probability sample
7.  A marginal approach to reduced-rank penalized spline smoothing with application to multilevel functional data 
Multilevel functional data are collected in many biomedical studies. For example, in a study of the effect of Nimodipine on patients with subarachnoid hemorrhage (SAH), patients underwent multiple 4-hour treatment cycles, and within each cycle their vital signs were recorded every 10 minutes. These data have a natural multilevel structure, with treatment cycles nested within subjects and measurements nested within cycles. Most literature on nonparametric analysis of such multilevel functional data focuses on conditional approaches using functional mixed effects models. However, parameters obtained from conditional models do not have direct interpretations as population average effects. When population effects are of interest, we may employ marginal regression models. In this work, we propose marginal approaches to fitting multilevel functional data through penalized spline generalized estimating equations (penalized spline GEE). The procedure is effective for modeling multilevel correlated generalized outcomes as well as continuous outcomes without suffering from numerical difficulties. We provide a variance estimator robust to misspecification of the correlation structure. We investigate the large-sample properties of the penalized spline GEE estimator with multilevel continuous data and show that the asymptotics fall into two categories. In the small-knots scenario, the estimated mean function is asymptotically efficient when the true correlation function is used, and the asymptotic bias does not depend on the working correlation matrix. In the large-knots scenario, both the asymptotic bias and variance depend on the working correlation. We propose a new method to select the smoothing parameter for penalized spline GEE based on an estimate of the asymptotic mean squared error (MSE). We conduct extensive simulation studies to examine the properties of the proposed estimator under different correlation structures and the sensitivity of the variance estimation to the choice of smoothing parameter. Finally, we apply the methods to the SAH study to evaluate a recent debate on discontinuing the use of Nimodipine in the clinical community.
doi:10.1080/01621459.2013.826134
PMCID: PMC3909538  PMID: 24497670
Penalized spline; GEE; Semiparametric models; Longitudinal data; Functional data
8.  Penalized Bregman divergence for large-dimensional regression and classification 
Biometrika  2010;97(3):551-566.
Summary
Regularization methods are characterized by loss functions measuring data fits and penalty terms constraining model parameters. The commonly used quadratic loss is not suitable for classification with binary responses, whereas the loglikelihood function is not readily applicable to models where the exact distribution of observations is unknown or not fully specified. We introduce the penalized Bregman divergence by replacing the negative loglikelihood in the conventional penalized likelihood with Bregman divergence, which encompasses many commonly used loss functions in the regression analysis, classification procedures and machine learning literature. We investigate new statistical properties of the resulting class of estimators with the number p_n of parameters either diverging with the sample size n or even nearly comparable with n, and develop statistical inference tools. It is shown that the resulting penalized estimator, combined with appropriate penalties, achieves the same oracle property as the penalized likelihood estimator, but asymptotically does not rely on the complete specification of the underlying distribution. Furthermore, the choice of loss function in the penalized classifiers has an asymptotically relatively negligible impact on classification performance. We illustrate the proposed method for quasilikelihood regression and binary classification with simulation evaluation and real-data application.
doi:10.1093/biomet/asq033
PMCID: PMC3372245  PMID: 22822248
Consistency; Divergence minimization; Exponential family; Loss function; Optimal Bayes rule; Oracle property; Quasilikelihood
9.  NEW EFFICIENT ESTIMATION AND VARIABLE SELECTION METHODS FOR SEMIPARAMETRIC VARYING-COEFFICIENT PARTIALLY LINEAR MODELS 
Annals of statistics  2011;39(1):305-332.
The complexity of semiparametric models poses new challenges to statistical inference and model selection that frequently arise from real applications. In this work, we propose new estimation and variable selection procedures for the semiparametric varying-coefficient partially linear model. We first study quantile regression estimates for the nonparametric varying-coefficient functions and the parametric regression coefficients. To achieve nice efficiency properties, we further develop a semiparametric composite quantile regression procedure. We establish the asymptotic normality of proposed estimators for both the parametric and nonparametric parts and show that the estimators achieve the best convergence rate. Moreover, we show that the proposed method is much more efficient than the least-squares-based method for many non-normal errors and that it only loses a small amount of efficiency for normal errors. In addition, it is shown that the loss in efficiency is at most 11.1% for estimating varying coefficient functions and is no greater than 13.6% for estimating parametric components. To achieve sparsity with high-dimensional covariates, we propose adaptive penalization methods for variable selection in the semiparametric varying-coefficient partially linear model and prove that the methods possess the oracle property. Extensive Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed procedures. Finally, we apply the new methods to analyze the plasma beta-carotene level data.
doi:10.1214/10-AOS842
PMCID: PMC3109949  PMID: 21666869
Asymptotic relative efficiency; composite quantile regression; semiparametric varying-coefficient partially linear model; oracle properties; variable selection
10.  A Perturbation Method for Inference on Regularized Regression Estimates 
Analysis of high-dimensional data often seeks to identify a subset of important features and assess their effects on the outcome. Traditional statistical inference procedures based on standard regression methods often fail in the presence of high-dimensional features. In recent years, regularization methods have emerged as promising tools for analyzing high-dimensional data. These methods simultaneously select important features and provide stable estimation of their effects. The adaptive LASSO and SCAD, for instance, give consistent and asymptotically normal estimates with oracle properties. However, in finite samples, it remains difficult to obtain interval estimators for the regression parameters. In this paper, we propose perturbation-resampling-based procedures to approximate the distribution of a general class of penalized parameter estimates. Our proposal, justified by asymptotic theory, provides a simple way to estimate the covariance matrix and confidence regions. Through finite-sample simulations, we verify the ability of this method to give accurate inference and compare it to other widely used standard deviation and confidence interval estimates. We also illustrate our proposal with a data set used to study the association between HIV drug resistance and a large number of genetic mutations.
doi:10.1198/jasa.2011.tm10382
PMCID: PMC3404855  PMID: 22844171
High dimensional regression; Interval estimation; Oracle property; Regularized estimation; Resampling methods
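The perturbation idea can be sketched as follows: refit the penalized estimator many times under independent Exp(1) observation weights and read interval estimates off the spread of the perturbed estimates. This sketch substitutes a weighted ridge estimator for the adaptive LASSO/SCAD estimators analyzed in the paper; the function name, penalty value, and simulated data are illustrative.

```python
import numpy as np

def ridge_weighted(X, y, w, lam=0.1):
    """Observation-weighted ridge estimate; a stand-in here for the
    penalized (adaptive LASSO / SCAD) estimators treated in the paper."""
    W = np.diag(w)
    p = X.shape[1]
    return np.linalg.solve(X.T @ W @ X + lam * np.eye(p), X.T @ W @ y)

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.0])
y = X @ beta_true + rng.normal(0.0, 0.5, n)

beta_hat = ridge_weighted(X, y, np.ones(n))  # unperturbed fit
# Perturbation resampling: refit under i.i.d. Exp(1) observation weights;
# the empirical spread of the refits approximates the sampling distribution.
draws = np.array([
    ridge_weighted(X, y, rng.exponential(1.0, n)) for _ in range(500)
])
lo, hi = np.percentile(draws, [2.5, 97.5], axis=0)  # percentile intervals
```

The appeal over a nonparametric bootstrap is that no observation is ever dropped; each refit just reweights the same data.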
11.  ROC curve estimation under test-result-dependent sampling 
Biostatistics (Oxford, England)  2012;14(1):160-172.
The receiver operating characteristic (ROC) curve is often used to evaluate the performance of a biomarker measured on a continuous scale in predicting disease status or a clinical condition. Motivated by the need for novel study designs with better estimation efficiency and reduced study cost, we consider a biased sampling scheme that consists of a simple random component (SRC) and a supplemental test-result-dependent component (TDC). Using this approach, investigators can oversample or undersample subjects falling into certain regions of the biomarker measure, yielding improved precision for the estimation of the ROC curve with a fixed sample size. Test-result-dependent sampling will introduce bias into estimates of the predictive accuracy of the biomarker if standard ROC estimation methods are used. In this article, we discuss three approaches for analyzing data with a test-result-dependent structure, with a special focus on the empirical likelihood method. We establish asymptotic properties of the empirical likelihood estimators for covariate-specific and covariate-independent ROC curves and give their corresponding variance estimators. Simulation studies show that the empirical likelihood method has good properties and is more efficient than alternative methods. Recommendations on the number of regions, cutoff points, and subject allocation are made based on the simulation results. The proposed methods are illustrated with a data example based on an ongoing lung cancer clinical trial.
doi:10.1093/biostatistics/kxs020
PMCID: PMC3577107  PMID: 22723502
Binormal model; Covariate-independent ROC curve; Covariate-specific ROC curve; Empirical likelihood method; Test-result-dependent sampling
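For contrast with the paper's empirical likelihood approach, the standard empirical ROC estimate under simple random sampling takes only a few lines; this is the naive estimator that becomes biased when the sample is test-result dependent. The simulated biomarker distributions are illustrative.

```python
import numpy as np

def empirical_roc(cases, controls, grid):
    """Empirical ROC: ROC(t) = P(case marker > c_t), where c_t is the
    control-distribution threshold giving false-positive rate t."""
    thresholds = np.quantile(controls, 1.0 - grid)
    return np.array([np.mean(cases > c) for c in thresholds])

rng = np.random.default_rng(2)
controls = rng.normal(0.0, 1.0, 500)  # biomarker among non-diseased
cases = rng.normal(1.0, 1.0, 500)     # biomarker among diseased (shifted up)
t = np.linspace(0.01, 0.99, 99)
roc = empirical_roc(cases, controls, t)
auc = np.mean(roc)  # crude area approximation on the uniform grid
```

Under the shifted-normal setup above, the population AUC is Phi(1/sqrt(2)), roughly 0.76, so the grid average should land near that value.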
12.  Design and Inference for Cancer Biomarker Study with an Outcome and Auxiliary-Dependent Subsampling 
Biometrics  2009;66(2):502-511.
Summary
In cancer research, it is important to evaluate the performance of a biomarker (e.g. molecular, genetic, or imaging) that correlates with patients’ prognosis or predicts patients’ response to a treatment in a large prospective study. Due to overall budget constraints and the high cost associated with bioassays, investigators often have to select a subset of all registered patients for biomarker assessment. To detect a potentially moderate association between the biomarker and the outcome, investigators need to decide how to select a subset of fixed size so that study efficiency is enhanced. We show that, instead of drawing a simple random sample from the study cohort, greater efficiency can be achieved by allowing the selection probability to depend on the outcome and an auxiliary variable; we refer to such a sampling scheme as outcome and auxiliary-dependent subsampling (OADS). This paper is motivated by the need to analyze data from a lung cancer biomarker study that adopts the OADS design to assess EGFR mutations as a predictive biomarker for the extent to which a subject responds to EGFR inhibitor drugs. We propose an estimated maximum likelihood method that accommodates the OADS design and utilizes all observed information, especially the likelihood score for EGFR mutations (an auxiliary variable for EGFR mutations) that is available for all patients. We derive the asymptotic properties of the proposed estimator, evaluate its finite-sample properties via simulation, and illustrate the method with a data example.
doi:10.1111/j.1541-0420.2009.01280.x
PMCID: PMC2891224  PMID: 19508239
Auxiliary Variable; Biomarker; Estimated Likelihood Method; Kernel Smoother; Outcome and Auxiliary-Dependent Subsampling
13.  Penalized spline estimation for functional coefficient regression models 
Functional coefficient regression models assume that the regression coefficients vary with some “threshold” variable, providing appreciable flexibility in capturing the underlying dynamics in data and avoiding the so-called “curse of dimensionality” in multivariate nonparametric estimation. We first investigate estimation, inference, and forecasting for functional coefficient regression models with dependent observations via penalized splines. The P-spline approach, as a direct ridge-regression shrinkage-type global smoothing method, is computationally efficient and stable. With established fixed-knot asymptotics, inference is readily available, and exact inference can be obtained for a fixed smoothing parameter λ, which is most appealing for finite samples. Our penalized spline approach gives an explicit model expression, which also enables multi-step-ahead forecasting via simulation. Furthermore, we examine different methods of choosing the important smoothing parameter λ: modified multi-fold cross-validation (MCV), generalized cross-validation (GCV), and an extension of empirical bias bandwidth selection (EBBS) to P-splines. In addition, we implement smoothing parameter selection within the mixed model framework through restricted maximum likelihood (REML) for P-spline functional coefficient regression models with independent observations. The P-spline approach also easily allows different degrees of smoothness for different functional coefficients by assigning each its own penalty λ. We demonstrate the proposed approach with both simulation examples and a real data application.
doi:10.1016/j.csda.2009.09.036
PMCID: PMC3080050  PMID: 21516260
14.  Haplotype-Based Regression Analysis and Inference of Case–Control Studies with Unphased Genotypes and Measurement Errors in Environmental Exposures 
Biometrics  2007;64(3):673-684.
Summary
It is widely believed that risks of many complex diseases are determined by genetic susceptibilities, environmental exposures, and their interaction. Chatterjee and Carroll (2005, Biometrika 92, 399–418) developed an efficient retrospective maximum-likelihood method for the analysis of case–control studies that exploits an assumption of gene–environment independence and leaves the distribution of the environmental covariates completely nonparametric. Spinka, Carroll, and Chatterjee (2005, Genetic Epidemiology 29, 108–127) extended this approach to studies where certain types of genetic information, such as haplotype phases, may be missing on some subjects. We further extend this approach to situations where some of the environmental exposures are measured with error. Using a polychotomous logistic regression model, we allow disease status to have K + 1 levels. We propose the use of a pseudolikelihood and a related EM algorithm for parameter estimation. We prove consistency and derive the resulting asymptotic covariance matrix of the parameter estimates when the variance of the measurement error is known and when it is estimated using replications. Inference with measurement error corrections is complicated by the fact that the Wald test often behaves poorly in the presence of large amounts of measurement error. Likelihood-ratio (LR) techniques are known to be a good alternative; however, LR tests are not technically correct in this setting because the likelihood function is based on an incorrect model, i.e., a prospective model under a retrospective sampling scheme. We correct standard asymptotic results to account for the fact that the LR test is based on a likelihood-type function. The performance of the proposed method is illustrated using simulation studies emphasizing the case where genetic information is in the form of haplotypes and missing data arise from haplotype-phase ambiguity.
An application of our method is illustrated using a population-based case–control study of the association between calcium intake and the risk of colorectal adenoma.
doi:10.1111/j.1541-0420.2007.00930.x
PMCID: PMC2672569  PMID: 18047538
EM algorithm; Errors in variables; Gene-environment independence; Gene-environment interactions; Likelihood-ratio tests in misspecified models; Inferences in measurement error models; Profile likelihood; Semiparametric methods
15.  Cox Regression in Nested Case-Control Studies with Auxiliary Covariates 
Biometrics  2009;66(2):374-381.
SUMMARY
The nested case-control (NCC) design is a popular sampling method in large epidemiologic studies because of its cost-effectiveness in investigating the temporal relationship of diseases with environmental exposures or biological precursors. Thomas’ maximum partial likelihood estimator is commonly used to estimate the regression parameters in Cox’s model for NCC data. In this paper, we consider a situation in which failure/censoring information and some crude covariates are available for the entire cohort in addition to the NCC data, and we propose an improved estimator that is asymptotically more efficient than Thomas’ estimator. We adopt a projection approach that, heretofore, has only been employed in situations of random validation sampling, and show that it can be well adapted to NCC designs, where the sampling scheme is a dynamic process and is not independent for controls. Under certain conditions, consistency and asymptotic normality of the proposed estimator are established, and a consistent variance estimator is also developed. Furthermore, a simplified approximate estimator is proposed when the disease is rare. Extensive simulations are conducted to evaluate the finite-sample performance of our proposed estimators and to compare their efficiency with Thomas’ estimator and other competing estimators. Moreover, sensitivity analyses are conducted to demonstrate the behavior of the proposed estimator when model assumptions are violated, and we find that the biases are reasonably small in realistic situations. We further demonstrate the proposed method with data from studies on Wilms’ tumor.
doi:10.1111/j.1541-0420.2009.01277.x
PMCID: PMC2889133  PMID: 19508242
Counting process; Cox proportional hazards model; Martingale; Risk set sampling; Survival analysis
16.  One-step Sparse Estimates in Nonconcave Penalized Likelihood Models 
Annals of statistics  2008;36(4):1509-1533.
Fan & Li (2001) propose a family of variable selection methods via penalized likelihood using concave penalty functions. The nonconcave penalized likelihood estimators enjoy the oracle properties, but maximizing the penalized likelihood function is computationally challenging because the objective function is nondifferentiable and nonconcave. In this article we propose a new unified algorithm based on the local linear approximation (LLA) for maximizing the penalized likelihood for a broad class of concave penalty functions. Convergence and other theoretical properties of the LLA algorithm are established. A distinguishing feature of the LLA algorithm is that at each step the LLA estimator naturally adopts a sparse representation; we therefore suggest using the one-step LLA estimator as the final estimate. Statistically, we show that if the regularization parameter is appropriately chosen, the one-step LLA estimate enjoys the oracle properties given a good initial estimator. Computationally, one-step LLA estimation dramatically reduces the cost of maximizing the nonconcave penalized likelihood. Monte Carlo simulations assessing the finite-sample performance of the one-step sparse estimation methods give very encouraging results.
doi:10.1214/009053607000000802
PMCID: PMC2759727  PMID: 19823597
AIC; BIC; Lasso; One-step estimator; Oracle Properties; SCAD
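A minimal sketch of the one-step LLA idea: compute SCAD-derivative weights at an initial estimate (OLS here), then take a single weighted-lasso step, solved below by coordinate descent. Function names, the choice of λ, and the simulated data are illustrative, not the authors' implementation.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty (Fan & Li, 2001)."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def weighted_lasso(X, y, w, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + sum_j w_j |b_j|."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]          # partial residual
            z = X[:, j] @ r / n
            b[j] = np.sign(z) * max(abs(z) - w[j], 0.0) / col_ss[j]
    return b

rng = np.random.default_rng(3)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0])
y = X @ beta_true + rng.normal(0.0, 0.5, n)

b0 = np.linalg.lstsq(X, y, rcond=None)[0]           # initial (OLS) estimator
b1 = weighted_lasso(X, y, scad_deriv(b0, lam=0.5))  # one LLA step
```

Because SCAD's derivative vanishes for large coefficients, strong signals are left nearly unpenalized while small initial estimates receive the full lasso weight and are thresholded to exact zero, which is how the one step inherits sparsity.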
17.  PROFILE-KERNEL LIKELIHOOD INFERENCE WITH DIVERGING NUMBER OF PARAMETERS 
Annals of statistics  2008;36(5):2232-2260.
The generalized varying coefficient partially linear model with a growing number of predictors arises in many contemporary scientific endeavors. In this paper we address both the theoretical and practical sides of profile likelihood estimation and inference. When the number of parameters grows with the sample size, the existence and asymptotic normality of the profile likelihood estimator are established under some regularity conditions. Profile likelihood ratio inference for the growing number of parameters is proposed and the Wilks phenomenon is demonstrated. A new algorithm, called the accelerated profile-kernel algorithm, for computing the profile-kernel estimator is proposed and investigated. Simulation studies show that the resulting estimates are as efficient as the fully iterative profile-kernel estimates; for moderate sample sizes, our proposed procedure saves much computational time over the fully iterative one and gives more stable estimates. A set of real data is analyzed using the proposed algorithm.
doi:10.1214/07-AOS544
PMCID: PMC2630533  PMID: 19173010
Generalized linear models; varying coefficients; high dimensionality; asymptotic normality; profile likelihood; generalized likelihood ratio tests
18.  Variable Selection for Partially Linear Models with Measurement Errors 
This article focuses on variable selection for partially linear models when the covariates are measured with additive errors. We propose two classes of variable selection procedures, penalized least squares and penalized quantile regression, using the nonconvex penalized principle. The first procedure corrects the bias in the loss function caused by the measurement error by applying the so-called correction-for-attenuation approach, whereas the second procedure corrects the bias by using orthogonal regression. The sampling properties for the two procedures are investigated. The rate of convergence and the asymptotic normality of the resulting estimates are established. We further demonstrate that, with proper choices of the penalty functions and the regularization parameter, the resulting estimates perform asymptotically as well as an oracle procedure (Fan and Li 2001). Choice of smoothing parameters is also discussed. Finite sample performance of the proposed variable selection procedures is assessed by Monte Carlo simulation studies. We further illustrate the proposed procedures by an application.
doi:10.1198/jasa.2009.0127
PMCID: PMC2697854  PMID: 20046976
Errors-in-variable; Error-free; Error-prone; Local linear regression; Quantile regression; SCAD
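The correction-for-attenuation idea is easiest to see in the scalar linear model: with additive measurement error of known variance, the naive slope is attenuated by the reliability ratio var(x)/(var(x)+var(u)) and can be rescaled. This toy example illustrates the bias and its correction, not the paper's penalized variable-selection procedures; all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
x = rng.normal(0.0, 1.0, n)        # true (unobserved) covariate
w = x + rng.normal(0.0, 0.5, n)    # observed covariate with additive error u
y = 2.0 * x + rng.normal(0.0, 0.2, n)

# Regressing y on the error-prone w attenuates the slope by the
# reliability ratio var(x) / (var(x) + var(u)) = 1 / 1.25 = 0.8.
var_w = np.var(w, ddof=1)
beta_naive = np.cov(w, y)[0, 1] / var_w

# With the measurement-error variance known (0.5^2 here), rescale:
sigma_u2 = 0.25
beta_corr = beta_naive * var_w / (var_w - sigma_u2)
```

With true slope 2, the naive estimate sits near 1.6 while the corrected one recovers roughly 2; when sigma_u2 is unknown it is typically estimated from replicate measurements.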
19.  Graph Estimation From Multi-Attribute Data 
Undirected graphical models are important in a number of modern applications that involve exploring or exploiting dependency structures underlying the data. For example, they are often used to explore complex systems where connections between entities are not well understood, such as in functional brain networks or genetic networks. Existing methods for estimating structure of undirected graphical models focus on scenarios where each node represents a scalar random variable, such as a binary neural activation state or a continuous mRNA abundance measurement, even though in many real world problems, nodes can represent multivariate variables with much richer meanings, such as whole images, text documents, or multi-view feature vectors. In this paper, we propose a new principled framework for estimating the structure of undirected graphical models from such multivariate (or multi-attribute) nodal data. The structure of a graph is inferred through estimation of non-zero partial canonical correlation between nodes. Under a Gaussian model, this strategy is equivalent to estimating conditional independencies between random vectors represented by the nodes and it generalizes the classical problem of covariance selection (Dempster, 1972). We relate the problem of estimating non-zero partial canonical correlations to maximizing a penalized Gaussian likelihood objective and develop a method that efficiently maximizes this objective. Extensive simulation studies demonstrate the effectiveness of the method under various conditions. We provide illustrative applications to uncovering gene regulatory networks from gene and protein profiles, and uncovering brain connectivity graph from positron emission tomography data. Finally, we provide sufficient conditions under which the true graphical structure can be recovered correctly.
PMCID: PMC4303188  PMID: 25620892
graphical model selection; multi-attribute data; network analysis; partial canonical correlation
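As a rough illustration of the covariance selection special case (Dempster, 1972) that this multi-attribute method generalizes: with scalar nodes, edges correspond to non-zero entries of the precision (inverse covariance) matrix, reported here as partial correlations. This is a sketch on an assumed toy covariance matrix, not the paper's estimator.

```python
# Covariance selection sketch: recover graph edges from non-zero
# partial correlations, rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj),
# where Omega is the precision matrix. Toy 3-node example only.

def inverse_3x3(m):
    """Invert a 3x3 matrix via the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a*(e*i - f*h) - b*(d*i - f*g) + c*(d*h - e*g)
    adj = [
        [ (e*i - f*h), -(b*i - c*h),  (b*f - c*e)],
        [-(d*i - f*g),  (a*i - c*g), -(a*f - c*d)],
        [ (d*h - e*g), -(a*h - b*g),  (a*e - b*d)],
    ]
    return [[x / det for x in row] for row in adj]

def partial_correlation_edges(cov, tol=1e-8):
    """Return node pairs (i, j) with non-negligible partial correlation."""
    omega = inverse_3x3(cov)
    edges = []
    for i in range(3):
        for j in range(i + 1, 3):
            rho = -omega[i][j] / (omega[i][i] * omega[j][j]) ** 0.5
            if abs(rho) > tol:
                edges.append((i, j))
    return edges

# Markov-chain covariance: X0-X1 and X1-X2 are directly linked, while
# X0 and X2 are conditionally independent given X1.
cov = [[1.0, 0.5, 0.25],
       [0.5, 1.0, 0.5],
       [0.25, 0.5, 1.0]]
print(partial_correlation_edges(cov))  # [(0, 1), (1, 2)]
```

Note how the marginal correlation between X0 and X2 is non-zero (0.25), yet no edge is drawn: conditioning on X1 explains it away, which is exactly the conditional-independence reading of the graph.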
20.  VARIABLE SELECTION IN LINEAR MIXED EFFECTS MODELS 
Annals of statistics  2012;40(4):2043-2068.
This paper is concerned with the selection and estimation of fixed and random effects in linear mixed effects models. We propose a class of nonconcave penalized profile likelihood methods for selecting and estimating important fixed effects. To overcome the difficulty of unknown covariance matrix of random effects, we propose to use a proxy matrix in the penalized profile likelihood. We establish conditions on the choice of the proxy matrix and show that the proposed procedure enjoys the model selection consistency where the number of fixed effects is allowed to grow exponentially with the sample size. We further propose a group variable selection strategy to simultaneously select and estimate important random effects, where the unknown covariance matrix of random effects is replaced with a proxy matrix. We prove that, with the proxy matrix appropriately chosen, the proposed procedure can identify all true random effects with asymptotic probability one, where the dimension of random effects vector is allowed to increase exponentially with the sample size. Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed procedures. We further illustrate the proposed procedures via a real data example.
doi:10.1214/12-AOS1028
PMCID: PMC4026175  PMID: 24850975
Adaptive Lasso; linear mixed effects models; group variable selection; oracle property; SCAD
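For concreteness, a sketch of the SCAD penalty of Fan and Li (2001), the nonconcave penalty named in the keywords, which would be applied entrywise to fixed-effect coefficients inside the penalized profile likelihood. Tuning values below are illustrative (a = 3.7 is the conventional choice).

```python
# SCAD penalty p_lambda(t): linear near zero (lasso-like shrinkage),
# then a quadratic transition, then constant, so large coefficients
# are left essentially unpenalized (the "oracle" behavior).

def scad_penalty(t, lam, a=3.7):
    """Smoothly clipped absolute deviation penalty (Fan & Li, 2001)."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2

# Small coefficients are shrunk linearly; beyond |t| = a * lam the
# penalty is flat, so large effects incur no extra bias.
print(scad_penalty(0.5, lam=1.0))   # linear region: 0.5
print(scad_penalty(10.0, lam=1.0))  # flat region: 2.35
```

The flat tail is what distinguishes SCAD from the lasso and underlies the model selection consistency claimed in the abstract.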
21.  Inference for Seemingly Unrelated Varying-Coefficient Nonparametric Regression Models 
This paper is concerned with inference for seemingly unrelated (SU) varying-coefficient nonparametric regression models. We propose an estimation procedure for the unknown coefficient functions that extends the two-stage procedure of Linton et al. (2004), developed in the longitudinal data framework for purely nonparametric regression. We show that the resulting estimators are asymptotically normal and more efficient than those based only on the individual regression equations, even when the error covariance matrix is homogeneous. Another focus of this paper is to extend the generalized likelihood ratio technique developed by Fan, Zhang and Zhang (2001) for testing the goodness of fit of models to the SU regression setting. A wild block bootstrap method is used to compute the p-value of the test. Simulation studies are given in support of the asymptotics. A real data set from an ongoing environmental epidemiologic study is used to illustrate the proposed procedures.
PMCID: PMC3893667  PMID: 24453433
Seemingly unrelated regression; Varying-coefficient model; Two-stage estimation; Asymptotic normality
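A minimal sketch of the wild bootstrap idea behind the p-value computation above: residuals are multiplied by random signs (Rademacher multipliers), which preserves heteroscedasticity, and the test statistic is recomputed on each resample. The statistic and data here are illustrative stand-ins, not the paper's generalized likelihood ratio test or its block variant.

```python
import random

# Wild bootstrap p-value sketch: the fraction of sign-flipped
# resamples whose statistic is at least as large as the observed one.

def wild_bootstrap_pvalue(residuals, statistic, n_boot=2000, seed=0):
    """Resample residuals with Rademacher (+/-1) multipliers and
    return the bootstrap p-value for the given statistic."""
    rng = random.Random(seed)
    observed = statistic(residuals)
    count = 0
    for _ in range(n_boot):
        starred = [e * rng.choice((-1.0, 1.0)) for e in residuals]
        if statistic(starred) >= observed:
            count += 1
    return count / n_boot

# Toy statistic: absolute mean of residuals (large under lack of fit).
def abs_mean(e):
    return abs(sum(e) / len(e))

resid = [0.3, -0.2, 0.1, -0.4, 0.25, -0.1, 0.3, -0.2]
p = wild_bootstrap_pvalue(resid, abs_mean)
print(p)  # p-value in [0, 1]; large here, as the residuals are nearly centered
```

The paper's block version would flip signs for whole blocks of within-subject residuals at once, preserving the longitudinal dependence structure.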
22.  Analysis of Longitudinal Data with Semiparametric Estimation of Covariance Function 
Improving efficiency for regression coefficients and predicting trajectories of individuals are two important aspects in analysis of longitudinal data. Both involve estimation of the covariance function. Yet, challenges arise in estimating the covariance function of longitudinal data collected at irregular time points. A class of semiparametric models for the covariance function is proposed by imposing a parametric correlation structure while allowing a nonparametric variance function. A kernel estimator is developed for the estimation of the nonparametric variance function. Two methods, a quasi-likelihood approach and a minimum generalized variance method, are proposed for estimating parameters in the correlation structure. We introduce a semiparametric varying coefficient partially linear model for longitudinal data and propose an estimation procedure for model coefficients by using a profile weighted least squares approach. Sampling properties of the proposed estimation procedures are studied and asymptotic normality of the resulting estimators is established. Finite sample performance of the proposed procedures is assessed by Monte Carlo simulation studies. The proposed methodology is illustrated by an analysis of a real data example.
doi:10.1198/016214507000000095
PMCID: PMC2730591  PMID: 19707537
Kernel regression; local linear regression; profile weighted least squares; semiparametric varying coefficient model
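A sketch of the kernel idea behind the nonparametric variance function above: smooth squared residuals against observation time with a kernel-weighted average, which handles irregular time points naturally. The smoother, bandwidth, and data below are illustrative assumptions, not the paper's estimator.

```python
import math

# Kernel estimate of the variance function sigma^2(t): a
# Nadaraya-Watson-style weighted average of squared residuals
# observed at irregular times.

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def variance_at(t0, times, residuals, h=0.5):
    """Estimate sigma^2(t0) by kernel-weighting squared residuals
    according to how close their observation times are to t0."""
    weights = [gaussian_kernel((t - t0) / h) for t in times]
    total = sum(weights)
    return sum(w * e * e for w, e in zip(weights, residuals)) / total

# Toy irregular design with residual spread growing over time.
times = [0.1, 0.4, 0.9, 1.3, 1.8, 2.2]
residuals = [0.2, -0.3, 0.5, -0.6, 0.9, -1.1]
print(variance_at(0.5, times, residuals) < variance_at(2.0, times, residuals))  # True
```

In the paper this nonparametric variance estimate is combined with a parametric correlation structure to build the full covariance function used in the profile weighted least squares step.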
23.  Global Partial Likelihood for Nonparametric Proportional Hazards Models 
As an alternative to the local partial likelihood methods of Tibshirani and Hastie and of Fan, Gijbels, and King, a global partial likelihood method is proposed to estimate the covariate effect in a nonparametric proportional hazards model, λ(t|x) = exp{ψ(x)}λ0(t). The estimator, ψ̂(x), reduces to the Cox partial likelihood estimator if the covariate is discrete. The estimator is shown to be consistent and semiparametrically efficient for linear functionals of ψ(x). Moreover, Breslow-type estimation of the cumulative baseline hazard function, using the proposed estimator ψ̂(x), is proved to be efficient. The asymptotic bias and variance are derived under regularity conditions. Computation of the estimator involves an iterative but simple algorithm. Extensive simulation studies provide evidence supporting the theory. The method is illustrated with the Stanford heart transplant data set. The proposed global approach is also extended to a partially linear proportional hazards model and found to provide efficient estimation of the slope parameter. Supplementary materials for this article are available online.
doi:10.1198/jasa.2010.tm08636
PMCID: PMC3404854  PMID: 22844168
Cox model; Local linear smoothing; Local partial likelihood; Semiparametric efficiency
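To make the Breslow-type step concrete: with an estimate ψ̂(x) in hand, the cumulative baseline hazard is a sum over event times of one over the total exp{ψ̂(x)} among subjects still at risk. A sketch on assumed toy data, with ψ̂ supplied as precomputed per-subject values:

```python
import math

# Breslow estimator of Lambda_0(t) under lambda(t|x) = exp{psi(x)} lambda_0(t):
# at each event time, add 1 / sum of exp{psi_hat} over the risk set.

def breslow_cumulative_hazard(times, events, psi_hat, t):
    """Cumulative baseline hazard at time t given observed times,
    event indicators (1 = event, 0 = censored), and psi_hat values."""
    total = 0.0
    for ti, di in zip(times, events):
        if di == 1 and ti <= t:
            risk = sum(math.exp(psi_hat[j])
                       for j in range(len(times)) if times[j] >= ti)
            total += 1.0 / risk
    return total

times = [1.0, 2.0, 3.0, 4.0]
events = [1, 0, 1, 1]
psi_hat = [0.0, 0.0, 0.0, 0.0]  # no covariate effect
# With psi_hat = 0 this reduces to Nelson-Aalen: 1/4 + 1/2 + 1/1
print(breslow_cumulative_hazard(times, events, psi_hat, 4.0))  # 1.75
```

In the paper, ψ̂ would itself come from the iterative global partial likelihood algorithm; here it is a placeholder input.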
24.  Gamma frailty transformation models for multivariate survival times 
Biometrika  2009;96(2):277-291.
SUMMARY
We propose a class of transformation models for multivariate failure times. This class generalizes the usual gamma frailty model and yields a marginally linear transformation model for each failure time. Nonparametric maximum likelihood estimation is used for inference. The maximum likelihood estimators for the regression coefficients are shown to be consistent and asymptotically normal, and their asymptotic variances attain the semiparametric efficiency bound. Simulation studies show that the proposed estimation procedure provides asymptotically efficient estimates and yields good inferential properties for small sample sizes. The method is illustrated using data from a cardiovascular study.
doi:10.1093/biomet/asp008
PMCID: PMC4063334  PMID: 24948838
Gamma frailty model; Linear transformation model; Proportional hazards model; Semiparametric efficiency
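For orientation, a sketch of the usual gamma frailty construction that this class generalizes: a shared frailty Z with mean one and variance θ multiplies each subject's hazard, and integrating Z out gives the marginal survival S(t) = (1 + θΛ(t))^(-1/θ). Values below are illustrative, not from the cardiovascular data.

```python
import math

# Marginal survival under a mean-one gamma frailty with variance theta.
# As theta -> 0 the frailty degenerates at 1 and the no-frailty limit
# exp(-Lambda(t)) is recovered.

def marginal_survival(cum_hazard, theta):
    """S(t) = (1 + theta * Lambda(t)) ** (-1 / theta), with the
    theta = 0 case handled as the proportional hazards limit."""
    if theta == 0:
        return math.exp(-cum_hazard)
    return (1 + theta * cum_hazard) ** (-1 / theta)

# Small theta approaches the no-frailty limit exp(-1) at Lambda = 1;
# larger theta thickens the survival tail (unobserved heterogeneity).
print(round(marginal_survival(1.0, 1e-8), 6))  # ~0.367879
print(round(marginal_survival(1.0, 0.0), 6))   # 0.367879
print(marginal_survival(1.0, 1.0))             # 0.5
```

The transformation-model class in the abstract replaces this specific gamma-Laplace-transform structure with a more general family while keeping a linear transformation model for each marginal.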
25.  Additive hazards regression and partial likelihood estimation for ecological monitoring data across space 
Statistics and its interface  2012;5(2):10.4310/SII.2012.v5.n2.a5.
We develop continuous-time models for the analysis of environmental or ecological monitoring data in which subjects are observed at multiple monitoring time points across space. Of particular interest are additive hazards regression models where the baseline hazard function can take on flexible forms. We consider time-varying covariates and take into account spatial dependence via autoregression in space and time. We develop statistical inference for the regression coefficients via partial likelihood. Asymptotic properties, including consistency and asymptotic normality, are established for the parameter estimates under suitable regularity conditions. Feasible algorithms utilizing existing statistical software packages are developed for computation. We also consider a simpler additive hazards model with a homogeneous baseline hazard and develop hypothesis testing for homogeneity. A simulation study demonstrates that the statistical inference using partial likelihood has sound finite-sample properties and offers a viable alternative to maximum likelihood estimation. For illustration, we analyze data from an ecological study that monitors bark beetle colonization of red pines in a Wisconsin plantation.
doi:10.4310/SII.2012.v5.n2.a5
PMCID: PMC3849836  PMID: 24319528
Current status data; Grouped survival data; Maximum likelihood; Multiple monitoring times; Spatial autoregression; Spatial lattice
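A sketch of the additive hazards structure the abstract builds on: λ(t|x) = λ0(t) + βx(t), in contrast to the multiplicative Cox form. With a constant baseline and a time-constant covariate the cumulative hazard is linear in t, so survival has a simple closed form. The parameter values are illustrative, not estimates from the ecological data.

```python
import math

# Additive hazards model with constant baseline lambda0 and a fixed
# covariate x: a unit increase in x shifts the hazard by beta,
# additively rather than multiplicatively.

def additive_hazard(x, lambda0=0.1, beta=0.05):
    """Hazard for covariate value x; constant in t for this sketch."""
    return lambda0 + beta * x

def survival(t, x, lambda0=0.1, beta=0.05):
    """Survival when baseline and covariate are time-constant, so the
    cumulative hazard is (lambda0 + beta * x) * t."""
    return math.exp(-(lambda0 + beta * x) * t)

print(additive_hazard(2.0))            # 0.1 + 0.05 * 2 = 0.2
print(round(survival(5.0, 2.0), 4))    # exp(-1) ~ 0.3679
```

The paper's models let λ0(t) be flexible and x vary over time, and add spatial autoregression across monitoring sites; this sketch only fixes the algebraic form being fitted.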
