PMCID: PMC2985393
PMID: 21260154
Summary
We consider variable selection in the Cox regression model (Cox, 1975, Biometrika 62, 269–276) with covariates missing at random. We investigate the smoothly clipped absolute deviation penalty and adaptive least absolute shrinkage and selection operator (LASSO) penalty, and propose a unified model selection and estimation procedure. A computationally attractive algorithm is developed, which simultaneously optimizes the penalized likelihood function and penalty parameters. We also optimize a model selection criterion, called the ICQ statistic (Ibrahim, Zhu, and Tang, 2008, Journal of the American Statistical Association 103, 1648–1658), to estimate the penalty parameters and show that it consistently selects all important covariates. Simulations are performed to evaluate the finite sample performance of the penalty estimates. Also, two lung cancer data sets are analyzed to demonstrate the proposed methodology.
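As a rough illustration of the two penalties named in this summary, the sketch below evaluates the smoothly clipped absolute deviation (SCAD) penalty and an adaptive LASSO penalty for a single coefficient. It is not the authors' algorithm: the function names, the default `a=3.7` (the value commonly used with SCAD), and the use of an initial estimate to build the adaptive LASSO weight are assumptions made for illustration.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty for one coefficient (illustrative helper, not from the paper).

    Linear near zero, quadratic in the middle, and constant for large
    |theta|, so large coefficients are not shrunk further.
    """
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def adaptive_lasso_penalty(theta, lam, initial_estimate, gamma=1.0):
    """Adaptive LASSO penalty: an L1 penalty weighted by 1/|initial estimate|^gamma.

    `initial_estimate` is assumed to be a nonzero preliminary estimate of
    the same coefficient (hypothetical input for this sketch).
    """
    weight = 1.0 / abs(initial_estimate) ** gamma
    return lam * weight * abs(theta)


if __name__ == "__main__":
    for value in (0.5, 2.0, 10.0):
        print(value, scad_penalty(value, lam=1.0),
              adaptive_lasso_penalty(value, lam=1.0, initial_estimate=0.5))
```

The SCAD value stops growing once the coefficient exceeds `a * lam`, whereas the adaptive LASSO penalty keeps growing linearly but with a weight that is small for coefficients whose initial estimate is large.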
doi:10.1111/j.1541-0420.2009.01274.x
PMCID: PMC3303197
PMID: 19459831
ALASSO; Missing data; Partial likelihood; Penalized likelihood; Proportional hazards model; SCAD; Variable selection
Bandeen-Roche and Liang (2002, Modelling multivariate failure time associations in the presence of a competing risk. Biometrika 89, 299–314.) tailored the conditional hazard ratio of Oakes (1989, Bivariate survival models induced by frailties. Journal of the American Statistical Association 84, 487–493.) to evaluate cause-specific associations in bivariate competing risks data. In many population-based family studies, one observes complex multivariate competing risks data, where the family sizes may be > 2, certain marginals may be exchangeable, and there may be multiple correlated relative pairs having a given pairwise association. Methods for bivariate competing risks data are inadequate in these settings. We show that the rank correlation estimator of Bandeen-Roche and Liang (2002) extends naturally to general clustered family structures. Consistency, asymptotic normality, and variance estimation are easily obtained with U-statistic theory. A natural by-product is an easily implemented test for constancy of the association over different time regions. In the Cache County Study on Memory in Aging, familial associations in dementia onset are of interest, accounting for death prior to dementia. The proposed methods using all available data suggest attenuation in dementia associations at later ages, which had been somewhat obscured in earlier analyses.
doi:10.1093/biostatistics/kxp039
PMCID: PMC2800162
PMID: 19826137
Cause-specific hazard ratio; Concordance estimator; Dependent censoring; Exchangeable clustered data; Time-varying association
Background
Models for complex biological systems may involve a large number of parameters. It may well be that some of these parameters cannot be derived from observed data via regression techniques. Such parameters are said to be unidentifiable, the remaining parameters being identifiable. Closely related to this idea is that of redundancy, that a set of parameters can be expressed in terms of some smaller set. Before data is analysed it is critical to determine which model parameters are identifiable or redundant to avoid ill-defined and poorly convergent regression.
Methodology/Principal Findings
In this paper we outline general considerations on parameter identifiability, and introduce the notion of weak local identifiability and gradient weak local identifiability. These are based on local properties of the likelihood, in particular the rank of the Hessian matrix. We relate these to the notions of parameter identifiability and redundancy previously introduced by Rothenberg (Econometrica 39 (1971) 577–591) and Catchpole and Morgan (Biometrika 84 (1997) 187–196). Within the widely used exponential family, parameter irredundancy, local identifiability, gradient weak local identifiability and weak local identifiability are shown to be largely equivalent. We consider applications to a recently developed class of cancer models of Little and Wright (Math Biosciences 183 (2003) 111–134) and Little et al. (J Theoret Biol 254 (2008) 229–238) that generalize a large number of other recently used quasi-biological cancer models.
Conclusions/Significance
We have shown that the previously developed concepts of parameter local identifiability and redundancy are closely related to the apparently weaker properties of weak local identifiability and gradient weak local identifiability—within the widely used exponential family these concepts largely coincide.
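The link between identifiability and the rank of the Hessian of the log-likelihood can be seen in a toy example. The sketch below is not taken from the paper: the two Gaussian models, the finite-difference Hessian, and all function names are assumptions chosen for illustration. In the first model only the sum `a + b` enters the likelihood, so the Hessian is rank-deficient; in the second, a covariate separates intercept and slope and the Hessian has full rank.

```python
def numerical_hessian(f, params, h=1e-3):
    """Central finite-difference Hessian of f at params (illustrative helper)."""
    n = len(params)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                p = list(params)
                p[i] += di
                p[j] += dj
                return f(p)
            hess[i][j] = (shifted(h, h) - shifted(h, -h)
                          - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)
    return hess


def matrix_rank(matrix, tol=1e-6):
    """Rank by Gaussian elimination with partial pivoting and a tolerance."""
    rows = [list(r) for r in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        pivot = max(range(rank, n_rows),
                    key=lambda r: abs(rows[r][col]), default=None)
        if pivot is None or abs(rows[pivot][col]) < tol:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank


y = [1.0, 2.0, 3.0]
x = [0.0, 1.0, 2.0]


def loglik_redundant(params):
    # y_i ~ Normal(a + b, 1): only the sum a + b affects the likelihood.
    a, b = params
    return -0.5 * sum((yi - (a + b)) ** 2 for yi in y)


def loglik_identifiable(params):
    # y_i ~ Normal(a + b * x_i, 1): the covariate separates a and b.
    a, b = params
    return -0.5 * sum((yi - (a + b * xi)) ** 2 for yi, xi in zip(y, x))


rank_redundant = matrix_rank(numerical_hessian(loglik_redundant, [0.5, 0.5]))
rank_identifiable = matrix_rank(numerical_hessian(loglik_identifiable, [0.5, 0.5]))
print(rank_redundant, rank_identifiable)
```

A Hessian rank below the number of parameters signals that some parameter combination cannot be estimated from the data, which is the situation the abstract describes as leading to ill-defined and poorly convergent regression.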
doi:10.1371/journal.pone.0008915
PMCID: PMC2811744
PMID: 20111720
Epidemiologic studies often aim to estimate the odds ratio for the association between a binary exposure and a binary disease outcome. Because confounding bias is of serious concern in observational studies, investigators typically estimate the adjusted odds ratio in a multivariate logistic regression which conditions on a large number of potential confounders. It is well known that modeling error in specification of the confounders can lead to substantial bias in the adjusted odds ratio for exposure. As a remedy, Tchetgen Tchetgen et al. (Biometrika. 2010;97(1):171–180) recently developed so-called doubly robust estimators of an adjusted odds ratio by carefully combining standard logistic regression with reverse regression analysis, in which exposure is the dependent variable and both the outcome and the confounders are the independent variables. Double robustness implies that only one of the 2 modeling strategies needs to be correct in order to make valid inferences about the odds ratio parameter. In this paper, I aim to introduce this recent methodology into the epidemiologic literature by presenting a simple closed-form doubly robust estimator of the adjusted odds ratio for a binary exposure. A SAS macro (SAS Institute Inc., Cary, North Carolina) is given in an online appendix to facilitate use of the approach in routine epidemiologic practice, and a simulated data example is also provided for the purpose of illustration.
doi:10.1093/aje/kws377
PMCID: PMC3664333
PMID: 23558352
case-control sampling; doubly robust estimator; logistic regression; odds ratio; SAS macro
Treatment noncompliance and missing outcomes at posttreatment assessments are common problems in field experiments in naturalistic settings. Although the two complications often occur simultaneously, statistical methods that address both complications have not been routinely considered in data analysis practice in the prevention research field. This paper shows that identification and estimation of causal treatment effects considering both noncompliance and missing outcomes can be relatively easily conducted under various missing data assumptions. We review a few assumptions on missing data in the presence of noncompliance, including the latent ignorability proposed by Frangakis and Rubin (Biometrika 86:365–379, 1999), and show how these assumptions can be used in the parametric complier average causal effect (CACE) estimation framework. As a simple form of sensitivity analysis, we propose the use of alternative missing data assumptions, which will provide a range of causal effect estimates. In this way, we are less likely to settle for a possibly biased causal effect estimate based on a single assumption. We demonstrate how alternative missing data assumptions affect identification of causal effects, focusing on the CACE. The data from the Johns Hopkins School Intervention Study (Ialongo et al., Am J Community Psychol 27:599–642, 1999) will be used as an example.
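For orientation, the CACE can be illustrated with the simple moment-based (instrumental-variable) ratio estimator, which divides the intention-to-treat effect on the outcome by the intention-to-treat effect on treatment receipt. This is a swapped-in technique, not the parametric estimation framework discussed in the paper, and it assumes randomized assignment, the exclusion restriction, monotonicity, and fully observed outcomes. The function name and the toy data are hypothetical.

```python
from statistics import mean


def cace_ratio_estimate(assigned, received, outcome):
    """Moment-based CACE: ITT effect on outcome / ITT effect on treatment receipt.

    assigned, received: sequences of 0/1 indicators; outcome: numeric sequence.
    Assumes complete outcome data (no missingness), unlike the paper's setting.
    """
    treated_y = [y for z, y in zip(assigned, outcome) if z == 1]
    control_y = [y for z, y in zip(assigned, outcome) if z == 0]
    treated_d = [d for z, d in zip(assigned, received) if z == 1]
    control_d = [d for z, d in zip(assigned, received) if z == 0]

    itt_outcome = mean(treated_y) - mean(control_y)
    itt_receipt = mean(treated_d) - mean(control_d)
    if itt_receipt == 0:
        raise ValueError("No difference in treatment receipt; CACE is not identified.")
    return itt_outcome / itt_receipt


# Toy data: half of those assigned to treatment actually comply.
assigned = [1, 1, 1, 1, 0, 0, 0, 0]
received = [1, 1, 0, 0, 0, 0, 0, 0]
outcome = [5, 5, 2, 2, 2, 2, 2, 2]
cace = cace_ratio_estimate(assigned, received, outcome)
print(cace)
```

With missing outcomes, the quantities in the numerator are no longer directly estimable, which is where the alternative missing data assumptions reviewed in the abstract come in.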
doi:10.1007/s11121-010-0175-4
PMCID: PMC2912956
PMID: 20379779
Causal inference; Complier average causal effect; Latent ignorability; Missing at random; Missing data; Noncompliance
In dementia screening tests, item selection for shortening an existing screening test can be achieved using multiple logistic regression. However, maximum likelihood estimates for such logistic regression models are often seriously biased, or fail to exist, because of separation and multicollinearity problems resulting from a large number of highly correlated items. Firth (1993, Biometrika, 80(1), 27–38) proposed a penalized likelihood estimator for generalized linear models, which was shown to reduce bias and the non-existence problems. Ridge regression has been used in logistic regression to stabilize the estimates in cases of multicollinearity. However, neither method solves the problem addressed by the other. In this paper, we propose a double penalized maximum likelihood estimator combining Firth’s penalized likelihood equation with a ridge parameter. We present a simulation study evaluating the empirical performance of the double penalized likelihood estimator in small to moderate sample sizes. We demonstrate the proposed approach using current screening data from a community-based dementia study.
PMCID: PMC2849171
PMID: 20376286
Logistic regression; maximum likelihood; penalized maximum likelihood; ridge regression; item selection
The censored linear regression model, also referred to as the accelerated failure time (AFT) model when the logarithm of the survival time is used as the response variable, is widely seen as an alternative to the popular Cox model when the assumption of proportional hazards is questionable. Buckley and James [Linear regression with censored data, Biometrika 66 (1979) 429−436] extended the least squares estimator to the semiparametric censored linear regression model in which the error distribution is completely unspecified. The Buckley–James estimator performs well in many simulation studies and examples. As Cox has pointed out, the direct interpretation of the AFT model is also more attractive in practical situations than that of the Cox model. However, the application of the Buckley–James estimator has been limited in practice, mainly because its variance is elusive. In this paper, we use the empirical likelihood method to derive a new test and confidence interval based on the Buckley–James estimator of the regression coefficient. A standard chi-square distribution is used to calculate the P-value and the confidence interval. The proposed empirical likelihood method does not involve variance estimation. It also shows much better small sample performance than some existing methods in our simulation studies.
doi:10.1016/j.jmva.2007.02.007
PMCID: PMC2583435
PMID: 19018294
Censored data; Wilks theorem; Accelerated failure time model; Linear regression model
Survival data can contain an unknown fraction of subjects who are “cured” in the sense of not being at risk of failure. We describe such data with cure-mixture models, which separately model cure status and the hazard of failure among non-cured subjects. No diagnostic currently exists for evaluating the fit of such models; the popular Schoenfeld residual (Schoenfeld, 1982. Partial residuals for the proportional hazards regression-model. Biometrika 69, 239–241) is not applicable to data with cures. In this article, we propose a pseudo-residual, modeled on Schoenfeld's, to assess the fit of the survival regression in the non-cured fraction. Unlike Schoenfeld's approach, which tests the validity of the proportional hazards (PH) assumption, our method uses the full hazard and is thus also applicable to non-PH models. We derive the asymptotic distribution of the residuals and evaluate their performance by simulation in a range of parametric models. We apply our approach to data from a smoking cessation drug trial.
doi:10.1093/biostatistics/kxs043
PMCID: PMC3695652
PMID: 23197383
Accelerated failure time; Long-term survivors; Proportional hazards; Residual analysis
Frailty models are useful for measuring unobserved heterogeneity in risk of failures across clusters, providing cluster-specific risk prediction. In a frailty model, the latent frailties shared by members within a cluster are assumed to act multiplicatively on the hazard function. In order to obtain parameter and frailty variate estimates, we consider the hierarchical likelihood (H-likelihood) approach (Ha, Lee and Song, 2001. Hierarchical-likelihood approach for frailty models. Biometrika 88, 233–243) in which the latent frailties are treated as “parameters” and estimated jointly with other parameters of interest. We find that the H-likelihood estimators perform well when the censoring rate is low; however, they are substantially biased when the censoring rate is moderate to high. In this paper, we propose a simple and easy-to-implement bias correction method for the H-likelihood estimators under a shared frailty model. We also extend the method to a multivariate frailty model, which incorporates complex dependence structure within clusters. We conduct an extensive simulation study and show that the proposed approach performs very well for censoring rates as high as 80%. We also illustrate the method with a breast cancer data set. Since the H-likelihood is the same as the penalized likelihood function, the proposed bias correction method is also applicable to the penalized likelihood estimators.
doi:10.1093/biostatistics/kxr040
PMCID: PMC3577105
PMID: 22088962
Frailty model; Hierarchical likelihood; Multivariate survival; NPMLE; Penalized likelihood; Semiparametric
PMCID: PMC197859
PMID: 14455908
PMCID: PMC2054504
PMID: 20786971
Summary
Meta-analysis is widely used to synthesize the results of multiple studies. Although meta-analysis is traditionally carried out by combining the summary statistics of relevant studies, advances in technologies and communications have made it increasingly feasible to access the original data on individual participants. In the present paper, we investigate the relative efficiency of analyzing original data versus combining summary statistics. We show that, for all commonly used parametric and semiparametric models, there is no asymptotic efficiency gain by analyzing original data if the parameter of main interest has a common value across studies, the nuisance parameters have distinct values among studies, and the summary statistics are based on maximum likelihood. We also assess the relative efficiency of the two methods when the parameter of main interest has different values among studies or when there are common nuisance parameters across studies. We conduct simulation studies to confirm the theoretical results and provide empirical comparisons from a genetic association study.
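Combining summary statistics is often done by inverse-variance weighting of the study-specific estimates. The sketch below shows that standard fixed-effect pooling under the assumption of a common parameter value across studies; whether it matches the exact estimator analyzed in the paper is an assumption, and the function name and example numbers are hypothetical.

```python
import math


def fixed_effect_pool(estimates, standard_errors):
    """Inverse-variance weighted pooled estimate and its standard error.

    Each study contributes its estimate with weight 1 / SE^2, so more
    precise studies count more. Assumes a common true effect across studies.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical log odds ratios and standard errors from two studies.
pooled_estimate, pooled_se = fixed_effect_pool([0.2, 0.4], [0.1, 0.2])
print(pooled_estimate, pooled_se)
```

The pooled standard error is never larger than the smallest study-level standard error, reflecting the information gained by combining studies.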
doi:10.1093/biomet/asq006
PMCID: PMC3412575
PMID: 23049122
Cox regression; Evidence-based medicine; Genetic association; Individual patient data; Information matrix; Linear regression; Logistic regression; Maximum likelihood; Profile likelihood; Research synthesis
PMCID: PMC1468981
PMID: 4938462
PMCID: PMC2217790
PMID: 13212043
Two forms of half-diallel analysis, Griffing's model I methods 2 and 4, were compared. The two methods were found to be interrelated. However, Griffing's model I, method 4 partitioned heterosis into different components and also gave information about combining ability, so it had some advantages over method 2. The results further indicated that using parental generations in Griffing's method 2 may bias the estimates of the GCA and SCA variances. Thus, Griffing's method 4 is more suitable than the other methods in terms of time, cost, and facilities, and it is recommended as an applicable method.
doi:10.1100/2012/524873
PMCID: PMC3349209
PMID: 22593691
PMCID: PMC2091903
PMID: 18881275
PMCID: PMC2972887
PMID: 18120759