There is an active debate in the literature on censored data about the relative performance of model-based maximum likelihood estimators, IPCW estimators, and a variety of double robust semiparametric efficient estimators. Kang and Schafer (2007) demonstrate the fragility of double robust and IPCW estimators in a simulation study with positivity violations. They focus on a simple missing data problem with covariates in which one wishes to estimate the mean of an outcome that is subject to missingness. Responses by Robins et al. (2007), Tsiatis and Davidian (2007), Tan (2007), and Ridgeway and McCaffrey (2007) further explore the challenges faced by double robust estimators and offer suggestions for improving their stability. In this article, we join the debate by presenting targeted maximum likelihood estimators (TMLEs). We demonstrate that TMLEs in which the parametric submodel employed by the TMLE procedure respects the global bounds on the continuous outcome are especially suitable for dealing with positivity violations because, in addition to being double robust and semiparametric efficient, they are substitution estimators. We demonstrate the practical performance of TMLEs relative to other estimators in the simulations designed by Kang and Schafer (2007) and in modified simulations posing even greater estimation challenges.
censored data; collaborative double robustness; collaborative targeted maximum likelihood estimation; double robust; estimator selection; inverse probability of censoring weighting; locally efficient estimation; maximum likelihood estimation; semiparametric model; targeted maximum likelihood estimation; targeted minimum loss based estimation; targeted nuisance parameter estimator selection
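To make the estimators in this comparison concrete, the sketch below implements an IPCW estimator and a TMLE with a logistic fluctuation that keeps the updated regression within the global bounds of the outcome, for the mean of an outcome missing at random. The simulated data, model choices, and numeric constants are illustrative assumptions, not the authors' simulation design or implementation.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 5000
W = rng.normal(size=(n, 4))                                   # baseline covariates
g0 = 1 / (1 + np.exp(-(0.5 + W @ np.array([1.0, -0.5, 0.25, 0.1]))))
Delta = rng.binomial(1, g0)                                   # 1 = outcome observed
Y = 210 + W @ np.array([27.4, 13.7, 13.7, 13.7]) + rng.normal(scale=5.0, size=n)

# IPCW: weight observed outcomes by the inverse estimated probability of observation
g_hat = LogisticRegression(max_iter=1000).fit(W, Delta).predict_proba(W)[:, 1]
psi_ipcw = np.mean(Delta * Y / g_hat)

# TMLE: initial regression, rescale to (0,1), logistic fluctuation with the
# "clever covariate" 1/g(W), then take the mean of the updated predictions.
Q_hat = LinearRegression().fit(W[Delta == 1], Y[Delta == 1]).predict(W)
a, b = Y[Delta == 1].min(), Y[Delta == 1].max()               # global bounds on Y
Ys = (Y - a) / (b - a)
Qs = np.clip((Q_hat - a) / (b - a), 1e-4, 1 - 1e-4)
H = 1.0 / g_hat
obs = Delta == 1
flu = sm.GLM(Ys[obs], H[obs].reshape(-1, 1),
             offset=np.log(Qs / (1 - Qs))[obs],
             family=sm.families.Binomial()).fit()
eps = flu.params[0]
Q_star = 1 / (1 + np.exp(-(np.log(Qs / (1 - Qs)) + eps * H)))
psi_tmle = a + (b - a) * np.mean(Q_star)
print(round(psi_ipcw, 2), round(psi_tmle, 2))                 # both target E(Y) = 210
```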
A concrete example of the collaborative double robust targeted maximum likelihood estimator (C-TMLE) introduced in a companion article in this issue is presented and applied to the estimation of causal effects and variable importance parameters in genomic data. The focus is on non-parametric estimation in a point treatment data structure. Simulations illustrate the performance of C-TMLE relative to current competitors, such as the augmented inverse probability of treatment weighted estimator that relies on an external non-collaborative estimator of the treatment mechanism, and inefficient estimation procedures including propensity score matching and standard inverse probability of treatment weighting. C-TMLE is also applied to the estimation of the covariate-adjusted marginal effect of individual HIV mutations on resistance to the anti-retroviral drug lopinavir. The influence curve of the C-TMLE is used to establish asymptotically valid statistical inference. The list of mutations found to have a statistically significant association with resistance is in excellent agreement with the mutation scores provided by the Stanford HIVdb mutation scores database.
causal effect; cross-validation; collaborative double robust; double robust; efficient influence curve; penalized likelihood; penalization; estimator selection; locally efficient; maximum likelihood estimation; model selection; super efficiency; super learning; targeted maximum likelihood estimation; targeted nuisance parameter estimator selection; variable importance
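Influence-curve-based inference of the kind referred to above reduces to a Wald-type construction once per-subject influence curve values are available. A minimal sketch follows; the function name and interface are illustrative, not the authors' software.

```python
import numpy as np
from scipy import stats

def ic_inference(estimate, ic, alpha=0.05):
    """Wald-type inference from an estimated influence curve (one value per subject)."""
    n = len(ic)
    se = np.std(ic, ddof=1) / np.sqrt(n)        # sd of the IC over sqrt(n)
    z = stats.norm.ppf(1 - alpha / 2)
    ci = (estimate - z * se, estimate + z * se)
    p = 2 * stats.norm.sf(abs(estimate / se))   # test of H0: parameter = 0
    return se, ci, p
```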
A two-stage procedure for estimating sensitivity and specificity is described. The procedure is developed in the context of a validation study for self-reported atypical nevi, a potentially useful measure in the study of risk factors for malignant melanoma. The first stage consists of a sample of N individuals classified only by the test measure. The second stage is a subsample of size m, stratified according to the information collected in the first stage, in which the presence of atypical nevi is determined by clinical examination. Using missing data methods for contingency tables, maximum likelihood estimators for the joint distribution of the test measure and the "gold standard" clinical evaluation are presented, along with efficient estimators for the sensitivity and specificity. Asymptotic coefficients of variation are computed to compare alternative sampling strategies for the second stage.
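A minimal numerical illustration of the two-stage idea, assuming the validation subsample is stratified on the first-stage test result (all counts are invented): the first stage estimates P(T), the second stage estimates P(D | T) within each stratum, and Bayes' rule then yields sensitivity and specificity. This is the standard two-phase calculation, not necessarily the paper's exact formulation.

```python
# Stage 1: N individuals classified by the self-report test only (illustrative counts)
n_test_pos, n_test_neg = 300, 700
p_tpos = n_test_pos / (n_test_pos + n_test_neg)

# Stage 2: clinically validated subsample, stratified on the stage-1 test result
#           (D+ = atypical nevi present, D- = absent)
val = {"T+": (45, 15),   # among validated test-positives
       "T-": (10, 90)}   # among validated test-negatives
p_dpos_given_tpos = val["T+"][0] / sum(val["T+"])
p_dpos_given_tneg = val["T-"][0] / sum(val["T-"])

# Joint distribution of (T, D), then Bayes inversion to sensitivity and specificity
p_dpos = p_dpos_given_tpos * p_tpos + p_dpos_given_tneg * (1 - p_tpos)
sensitivity = p_dpos_given_tpos * p_tpos / p_dpos                    # P(T+ | D+)
specificity = (1 - p_dpos_given_tneg) * (1 - p_tpos) / (1 - p_dpos)  # P(T- | D-)
print(round(sensitivity, 3), round(specificity, 3))
```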
Collaborative double robust targeted maximum likelihood estimators represent a fundamental further advance over the standard targeted maximum likelihood estimators of a pathwise differentiable parameter of a data-generating distribution in a semiparametric model introduced in van der Laan and Rubin (2006). The targeted maximum likelihood approach involves fluctuating an initial estimate of a relevant factor (Q) of the density of the observed data in order to make a bias/variance tradeoff targeted towards the parameter of interest. The fluctuation involves estimation of a nuisance parameter portion of the likelihood, g. TMLE has been shown to be consistent and asymptotically normally distributed (CAN) under regularity conditions when either one of these two factors of the likelihood of the data is correctly specified, and it is semiparametric efficient if both are correctly specified.
In this article we provide a template for applying collaborative targeted maximum likelihood estimation (C-TMLE) to the estimation of pathwise differentiable parameters in semi-parametric models. The procedure creates a sequence of candidate targeted maximum likelihood estimators based on an initial estimate for Q coupled with a succession of increasingly non-parametric estimates for g. In a departure from current state-of-the-art nuisance parameter estimation, C-TMLE estimates of g are constructed based on a loss function for the targeted maximum likelihood estimator of the relevant factor Q that uses the nuisance parameter to carry out the fluctuation, instead of a loss function for the nuisance parameter itself. Likelihood-based cross-validation is used to select the best estimator among all candidate TMLEs of Q0 in this sequence. A penalized-likelihood loss function for Q is suggested when the parameter of interest is borderline-identifiable.
We present theoretical results for “collaborative double robustness,” demonstrating that the collaborative targeted maximum likelihood estimator is CAN even when Q and g are both mis-specified, provided that g solves a specified score equation implied by the difference between Q and the true Q0. This marks an improvement over the current definition of double robustness in the estimating equation literature.
We also establish an asymptotic linearity theorem for the C-DR-TMLE of the target parameter, showing that the C-DR-TMLE is more adaptive to the truth and, as a consequence, can even be super efficient if the first-stage density estimator itself does an excellent job with respect to the target parameter.
This research provides a template for targeted, efficient, and robust loss-based learning of a particular target feature of the probability distribution of the data within large (infinite dimensional) semi-parametric models, while still providing statistical inference in terms of confidence intervals and p-values. This research also breaks with a taboo (e.g., in the propensity score literature in the field of causal inference) on using the relevant part of the likelihood to fine-tune the fitting of the nuisance parameter/censoring mechanism/treatment mechanism.
asymptotic linearity; coarsening at random; causal effect; censored data; cross-validation; collaborative double robust; double robust; efficient influence curve; estimating function; estimator selection; influence curve; G-computation; locally efficient; loss function; marginal structural model; maximum likelihood estimation; model selection; pathwise derivative; semiparametric model; sieve; super efficiency; super learning; targeted maximum likelihood estimation; targeted nuisance parameter estimator selection; variable importance
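For the simple missing-outcome mean (the running example in this issue), the fluctuation submodel and the efficient influence curve equation it solves can be written explicitly. The display below is a standard illustration of the template under these identifications of Q and g, not a new result.

```latex
% Target: \psi_0 = E_0\,\bar{Q}_0(W), with \bar{Q}_0(W) = E_0(Y \mid \Delta = 1, W)
% and missingness mechanism g_0(W) = P_0(\Delta = 1 \mid W).
\operatorname{logit}\bar{Q}^0_\varepsilon(W)
  = \operatorname{logit}\bar{Q}^0(W) + \varepsilon\,H_g(W),
\qquad H_g(W) = \frac{1}{g(W)}.
% The TMLE \bar{Q}^* solves the efficient influence curve (score) equation
0 = \sum_{i=1}^{n}\left\{\frac{\Delta_i}{g(W_i)}\bigl(Y_i - \bar{Q}^*(W_i)\bigr)
    + \bar{Q}^*(W_i) - \psi_n\right\}.
```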
Outcome-dependent sampling designs have been shown to be a cost-effective way to enhance study efficiency. We show that the outcome-dependent sampling design with a continuous outcome can be viewed as an extension of two-stage case-control designs to the continuous-outcome case. We further show that two-stage outcome-dependent sampling has a natural link with the missing-data and biased-sampling frameworks. Through the use of semiparametric inference and missing-data techniques, we show that a certain semiparametric maximum likelihood estimator is computationally convenient and attains the semiparametric efficiency bound. We demonstrate this both theoretically and through simulation.
Biased sampling; Empirical process; Maximum likelihood estimation; Missing data; Outcome-dependent; Profile likelihood; Two-stage
The two-stage design is popular in epidemiologic studies and clinical trials due to its cost-effectiveness. Typically, the first-stage sample contains cheaper and possibly biased information, while the second-stage validation sample consists of a subset of subjects with accurate and complete information. In this paper, we study estimation of a survival function from right-censored survival data collected under a two-stage design. A non-parametric estimator is derived by combining data from both stages. We also study its large-sample properties and derive pointwise and simultaneous confidence intervals for the survival function. The proposed estimator effectively reduces the variance and finite-sample bias of the Kaplan–Meier estimator based solely on the second-stage validation sample. Finally, we apply our method to a real data set from a medical device post-marketing surveillance study.
censoring; Kaplan–Meier estimator; martingale; Nelson–Aalen estimator; truncation
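For reference, the benchmark mentioned in the abstract, the Kaplan–Meier estimator computed from the second-stage validation sample alone, can be obtained as below. The data are simulated and this is the comparator, not the proposed combined two-stage estimator.

```python
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(1)
t_event = rng.exponential(scale=10.0, size=200)   # second-stage validation sample only
t_cens = rng.exponential(scale=15.0, size=200)
durations = np.minimum(t_event, t_cens)
observed = (t_event <= t_cens).astype(int)

km = KaplanMeierFitter()
km.fit(durations, event_observed=observed)
print(km.survival_function_.head())               # estimated S(t)
print(km.confidence_interval_.head())             # pointwise confidence intervals
```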
When a large number of candidate variables are present, a dimension reduction procedure is usually conducted to reduce the variable space before the subsequent analysis is carried out. The goal of dimension reduction is to find a list of candidate genes of a more manageable length that ideally includes all the relevant genes. Leaving many uninformative genes in the analysis can lead to biased estimates and reduced power. Therefore, dimension reduction is often considered a necessary precursor to the analysis because it can not only reduce the cost of handling numerous variables, but also has the potential to improve the performance of the downstream analysis algorithms.
We propose TMLE-VIM, a dimension reduction procedure based on variable importance measures (VIM) in the framework of targeted maximum likelihood estimation (TMLE). TMLE is an extension of maximum likelihood estimation that targets the parameter of interest. TMLE-VIM is a two-stage procedure: the first stage resorts to a machine learning algorithm, and the second stage improves the first-stage estimate with respect to the parameter of interest.
We demonstrate with simulations and data analyses that our approach not only enjoys the prediction power of machine learning algorithms, but also accounts for the correlation structures among variables and therefore produces better variable rankings. When utilized in dimension reduction, TMLE-VIM can help to obtain the shortest possible list with the most truly associated variables.
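The ranking-and-shortlisting use of TMLE-VIM can be sketched generically as below; `vim_fn` (and `my_tmle_vim` in the usage comment) are hypothetical placeholders for a routine returning an adjusted importance estimate and p-value for one variable, such as a targeted update of an initial machine learning fit.

```python
import pandas as pd

def rank_variables(X, y, vim_fn):
    """Rank candidate variables by an adjusted variable importance measure (VIM).

    `vim_fn(a, X_other, y)` is a user-supplied (hypothetical) routine returning a
    point estimate and p-value for the importance of variable `a` adjusted for the
    remaining variables.
    """
    rows = []
    for j in X.columns:
        est, pval = vim_fn(X[j].to_numpy(), X.drop(columns=j).to_numpy(), y)
        rows.append({"variable": j, "vim": est, "p_value": pval})
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)

# Dimension reduction: keep the top-k variables for the downstream analysis, e.g.
# shortlist = rank_variables(X, y, vim_fn=my_tmle_vim).head(k)["variable"].tolist()
```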
For longitudinal binary data with non-monotone non-ignorably missing outcomes over time, a full likelihood approach is complicated algebraically, and with many follow-up times, maximum likelihood estimation can be computationally prohibitive. As alternatives, two pseudo-likelihood approaches have been proposed that use minimal parametric assumptions. One formulation requires specification of the marginal distributions of the outcome and missing data mechanism at each time point, but uses an “independence working assumption,” i.e., an assumption that observations are independent over time. Another method avoids having to estimate the missing data mechanism by formulating a “protective estimator.” In simulations, these two estimators can be very inefficient, both for estimating time trends in the first case and for estimating both time-varying and time-stationary effects in the second. In this paper, we propose use of the optimal weighted combination of these two estimators, and in simulations we show that the optimal weighted combination can be much more efficient than either estimator alone. Finally, the proposed method is used to analyze data from two longitudinal clinical trials of HIV-infected patients.
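In the scalar case, the optimal weighted combination of two correlated estimators has a familiar closed form; the display below illustrates the idea (the paper's setting involves vectors of regression parameters and estimated covariances).

```latex
% Combining two correlated estimators \hat\theta_1, \hat\theta_2 of the same scalar
% parameter, with variances \sigma_1^2, \sigma_2^2 and covariance \sigma_{12}:
\hat\theta_w = w\,\hat\theta_1 + (1 - w)\,\hat\theta_2,
\qquad
w_{\mathrm{opt}} = \frac{\sigma_2^2 - \sigma_{12}}{\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}},
\qquad
\operatorname{Var}\bigl(\hat\theta_{w_{\mathrm{opt}}}\bigr)
  = \frac{\sigma_1^2\sigma_2^2 - \sigma_{12}^2}{\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}}.
```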
The Cox proportional hazards model and its discrete-time analogue, the logistic failure time model, posit highly restrictive parametric models and attempt to estimate parameters that are specific to the model proposed. These methods are typically implemented when assessing effect modification in survival analyses despite their flaws. The targeted maximum likelihood estimation (TMLE) methodology is more robust than the methods typically implemented and allows practitioners to estimate parameters that directly answer the question of interest. TMLE is used in this paper to estimate two newly proposed parameters of interest that quantify effect modification in the time-to-event setting. These methods are then applied to the Tshepo study to assess whether gender or baseline CD4 level modifies the effect of two cART therapies of interest, efavirenz (EFV) and nevirapine (NVP), on the progression of HIV. The results show that women tend to have more favorable outcomes with EFV, while men tend to have more favorable outcomes with NVP. Furthermore, EFV tends to be favorable compared to NVP for individuals with high CD4 levels.
causal effect; semi-parametric; censored longitudinal data; double robust; efficient influence curve; influence curve; G-computation; targeted maximum likelihood estimation; Cox proportional hazards; survival analysis
We develop nonparametric estimation procedures for the marginal mean function of a counting process based on periodic observations, using two types of self-consistent estimating equations. The first is derived from the likelihood studied in Wellner & Zhang (2000), assuming a Poisson counting process, and gives a nondecreasing estimator, which is the same as the nonparametric maximum likelihood estimator of Wellner & Zhang and thus is consistent without the Poisson assumption. Motivated by the construction of parametric generalized estimating equations, the second type is a set of data-adaptive quasi-score functions, which are likelihood estimating functions under a mixed-Poisson assumption. We evaluate the procedures via simulation, and illustrate them with the data from a bladder cancer study.
Counting process; Interval censoring; Marginal mean function; Nonparametric estimation; Quasi-score function
The analysis of longitudinal data to study changes in variables measured repeatedly over time has received considerable attention in many fields. This paper proposes a two-level structural equation model for analyzing multivariate longitudinal responses that are mixed continuous and ordered categorical variables. The first-level model is defined for measures taken at each time point nested within individuals for investigating their characteristics that are changed with time. The second level is defined for individuals to assess their characteristics that are invariant with time. The proposed model accommodates fixed covariates, nonlinear terms of the latent variables, and missing data. A maximum likelihood (ML) approach is developed for the estimation of parameters and model comparison. Results of a simulation study indicate that the performance of the ML estimation is satisfactory. The proposed methodology is applied to a longitudinal study concerning cocaine use.
latent variables; longitudinal study on cocaine use; maximum likelihood; MCEM algorithm; model comparison; ordered categorical variables
Sequential Multiple Assignment Randomized (SMAR) designs are used to evaluate treatment policies, also known as adaptive treatment strategies (ATS). The determination of SMAR sample sizes is challenging because of the sequential and adaptive nature of ATS, and the multi-stage randomized assignment used to evaluate them.
We derive sample size formulae appropriate for the nested structure of successive SMAR randomizations. This nesting gives rise to ATS that have overlapping data, and hence between-strategy covariance. We focus on the case when covariance is substantial enough to reduce sample size through improved inferential efficiency.
Our design calculations draw upon two distinct methodologies for SMAR trials, using the equality of the optimal semi-parametric and Bayesian predictive estimators of standard error. This ‘hybrid’ approach produces a generalization of the t-test power calculation that is carried out in terms of effect size and regression quantities familiar to the trialist.
Simulation studies support the reasonableness of underlying assumptions, as well as the adequacy of the approximation to between-strategy covariance when it is substantial. Investigation of the sensitivity of formulae to misspecification shows that the greatest influence is due to changes in effect size, which is an a priori clinical judgment on the part of the trialist.
We have restricted simulation investigation to SMAR studies of two and three stages, although the methods are fully general in that they apply to ‘K-stage’ trials.
Practical guidance is needed to allow the trialist to size a SMAR design using the derived methods. To this end, we define ATS to be ‘distinct’ when they differ by at least the (minimal) size of effect deemed to be clinically relevant. Simulation results suggest that the number of subjects needed to distinguish distinct strategies will be significantly reduced by adjustment for covariance only when small effects are of interest.
Treatment policies; Adaptive treatment strategies; Multi-stage designs; sample size formulae
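As a rough illustration of how between-strategy covariance enters such a calculation, the sketch below applies the standard two-sided z-test sample-size formula with the contrast variance reduced by the covariance term. The function name and defaults are assumptions, and the hybrid formula derived in the paper is more refined than this simplification.

```python
from scipy.stats import norm

def smar_pairwise_n(delta, sd1, sd2, cov12, alpha=0.05, power=0.8):
    """Approximate n to compare two adaptive treatment strategies.

    Two-sided z-test sample-size formula with the variance of the between-strategy
    contrast reduced by the covariance induced by subjects whose data contribute to
    both strategies (illustrative simplification only).
    """
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    var_contrast = sd1**2 + sd2**2 - 2 * cov12
    return (z_a + z_b) ** 2 * var_contrast / delta**2

# Ignoring vs. exploiting a between-strategy covariance of 0.3 (sd units of 1):
print(smar_pairwise_n(delta=0.4, sd1=1.0, sd2=1.0, cov12=0.0))
print(smar_pairwise_n(delta=0.4, sd1=1.0, sd2=1.0, cov12=0.3))
```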
Two-stage designs have long been recognized as a cost-effective way to conduct biomedical studies. In many trials, auxiliary covariate information may also be available, and it is of interest to exploit these auxiliary data to improve the efficiency of inferences. In this paper, we propose a two-stage design with a continuous outcome in which the second-stage data are sampled under an “outcome-auxiliary-dependent sampling” (OADS) scheme. We propose an estimator that maximizes an estimated likelihood function. We show that the proposed estimator is consistent and asymptotically normally distributed. A simulation study indicates that greater study efficiency gains can be achieved under the proposed two-stage OADS design by utilizing the auxiliary covariate information, compared with alternative sampling schemes. We illustrate the proposed method by analyzing a data set from an environmental epidemiologic study.
Auxiliary covariate; Kernel smoothing; Outcome-auxiliary-dependent sampling; 2-stage sampling design
In this work, we develop a modeling and estimation approach for the analysis of cross-sectional clustered data with multimodal conditional distributions, where the main interest is in the analysis of subpopulations. We propose to model such data hierarchically, with the conditional distributions viewed as finite mixtures of normal components. With a large number of observations in the lowest-level clusters, a two-stage estimation approach is used. In the first stage, the normal mixture parameters in each lowest-level cluster are estimated using robust methods. Robust alternatives to maximum likelihood estimation are used to provide stable results even when the components of the conditional distributions do not quite meet normality assumptions. The lowest-level cluster-specific means and standard deviations are then modeled in a mixed effects model in the second stage. A small simulation study was conducted to compare the performance of finite normal mixture population parameter estimates based on robust and maximum likelihood estimation in stage 1. The proposed modeling approach is illustrated through the analysis of mouse tendon fibril diameter data. The analysis results address genotype differences between corresponding components in the mixtures and demonstrate the advantages of robust estimation in stage 1.
Robust finite normal mixture; Weighted likelihood estimator; Hierarchical models; Mixed effects models; Two-stage estimation
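The two-stage flow can be sketched with off-the-shelf tools: maximum likelihood normal mixtures within each lowest-level cluster (the paper uses robust weighted-likelihood fits instead), followed by a mixed-effects model for the cluster-specific component means. The data, variable names, and model form below are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Simulate bimodal fibril-diameter measurements per mouse, for two genotype groups.
clusters = []
for mouse in range(20):
    genotype = mouse % 2
    shift = 5.0 if genotype else 0.0
    d = np.concatenate([rng.normal(60 + shift, 5, 400),
                        rng.normal(110 + shift, 8, 200)])
    clusters.append({"mouse": mouse, "genotype": genotype, "diameters": d})

# Stage 1: two-component normal mixture within each cluster; keep component means.
stage1 = []
for c in clusters:
    gm = GaussianMixture(n_components=2, random_state=0).fit(c["diameters"].reshape(-1, 1))
    for k, mu in enumerate(np.sort(gm.means_.ravel())):
        stage1.append({"mouse": c["mouse"], "genotype": c["genotype"],
                       "component": k, "mu": mu})
df = pd.DataFrame(stage1)

# Stage 2: mixed-effects model for the cluster-specific component means,
# with a random intercept per mouse.
fit = smf.mixedlm("mu ~ genotype * C(component)", df, groups=df["mouse"]).fit()
print(fit.summary())
```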
Significant progress has been made in developing subsampling techniques to process large samples of aquatic invertebrates. However, limited information is available regarding subsampling techniques for terrestrial invertebrate samples. Therefore a novel subsampling procedure was evaluated for processing samples of terrestrial invertebrates collected using two common field techniques: pitfall and pan traps. A three-phase sorting protocol was developed for estimating abundance and taxa richness of invertebrates. First, large invertebrates and plant material were removed from the sample using a sieve with a 4 mm mesh size. Second, the sample was poured into a specially designed, gridded sampling tray, and 16 cells, comprising 25% of the sampling tray, were randomly subsampled and processed. Third, the remainder of the sample was scanned for 4–7 min to record rare taxa missed in the second phase. To compare estimated abundance and taxa richness with the true values of these variables for the samples, the remainder of each sample was processed completely. The results were analyzed relative to three sample size categories: samples with less than 250 invertebrates (low abundance samples), samples with 250–500 invertebrates (moderate abundance samples), and samples with more than 500 invertebrates (high abundance samples). The number of invertebrates estimated after subsampling eight or more cells was highly precise for all sizes and types of samples. High accuracy for moderate and high abundance samples was achieved after even as few as six subsamples. However, estimates of the number of invertebrates for low abundance samples were less reliable. The subsampling technique also adequately estimated taxa richness; on average, subsampling detected 89% of taxa found in samples. Thus, the subsampling technique provided accurate data on both the abundance and taxa richness of terrestrial invertebrate samples. Importantly, subsampling greatly decreased the time required to process samples, cutting the time per sample by up to 80%. Based on these data, this subsampling technique is recommended to minimize the time and cost of processing moderate to large samples without compromising the integrity of the data and to maximize the information extracted from large terrestrial invertebrate samples. For samples with a relatively low number of invertebrates, complete counting is preferred.
pitfall traps; laboratory sampling techniques
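A plausible reading of the scale-up arithmetic (16 cells being 25% of the tray implies 64 cells in total) is sketched below with invented counts; the paper's exact estimator may differ.

```python
# Scale-up estimate of abundance from the gridded-tray subsample (illustrative numbers)
total_cells = 64             # 16 cells = 25% of the tray implies 64 cells in total
cells_counted = 16
count_in_subsample = 112     # invertebrates counted in the processed cells

estimated_abundance = count_in_subsample * total_cells / cells_counted
print(estimated_abundance)   # 448: subsample count divided by the fraction of cells counted

# Taxa richness combines the taxa seen in the subsampled cells with rare taxa
# recorded during the final scan phase.
taxa_subsample = {"Carabidae", "Formicidae", "Lycosidae"}
taxa_scan = {"Staphylinidae"}
print(len(taxa_subsample | taxa_scan))
```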
To compare two samples of censored data, we propose a unified semiparametric inference for the parameter of interest when the model for one sample is parametric and that for the other is nonparametric. The parameter of interest may represent, for example, a comparison of means, or survival probabilities. The confidence interval derived from the semiparametric inference, which is based on the empirical likelihood principle, improves its counterpart constructed from the common estimating equation. The empirical likelihood ratio is shown to be asymptotically chi-squared. Simulation experiments illustrate that the method based on the empirical likelihood substantially outperforms the method based on the estimating equation. A real dataset is analysed.
Estimating equation; Confidence interval; Coverage; Kaplan-Meier estimation; Empirical likelihood ratio; Empirical likelihood function
Length-biased sampling is well recognized in economics, industrial reliability, etiology, epidemiology, genetics, and cancer screening studies. Length-biased right-censored data have a unique structure different from that of traditional survival data, and the nonparametric and semiparametric estimation and inference methods for traditional survival data are not directly applicable to them. We propose new expectation-maximization algorithms for estimation based on full likelihoods involving infinite-dimensional parameters under three settings for length-biased data: estimating the nonparametric distribution function, estimating the nonparametric hazard function under an increasing failure rate constraint, and jointly estimating the baseline hazard function and the covariate coefficients under the Cox proportional hazards model. Extensive simulation studies show that the maximum likelihood estimators perform well with moderate sample sizes and are more efficient than estimating-equation approaches. The proposed estimators are also more robust to various right-censoring mechanisms. We prove the strong consistency of the estimators and establish the asymptotic normality of the semiparametric maximum likelihood estimators under the Cox model using modern empirical process theory. We apply the proposed methods to a prevalent cohort medical study. Supplemental materials are available online.
Cox regression model; EM algorithm; Increasing failure rate; Non-parametric likelihood; Profile likelihood; Right-censored data
In this article we study a semiparametric mixture model for the two-sample problem with right censored data. The model implies that the densities for the continuous outcomes are related by a parametric tilt but are otherwise unspecified. It provides a useful alternative to the Cox (1972) proportional hazards model for the comparison of treatments based on right censored survival data. We propose an iterative algorithm for the semiparametric maximum likelihood estimates of the parametric and nonparametric components of the model. The performance of the proposed method is studied using simulation. We illustrate our method in an application to melanoma.
Biased sampling; EM algorithm; maximum likelihood estimation; mixture model; semiparametric model
In longitudinal and repeated measures data analysis, the goal is often to determine the effect of a treatment or exposure on a particular outcome (e.g., disease progression). We consider a semiparametric repeated measures regression model in which the parametric component models the effect of the variable of interest and any modification of that effect by other covariates. The expectation of this parametric component over the other covariates is a measure of variable importance. Here, we present a targeted maximum likelihood estimator of the finite-dimensional regression parameter, which is easily computed using standard software for generalized estimating equations.
The targeted maximum likelihood method provides double robust and locally efficient estimates of the variable importance parameters, with inference based on the influence curve. We demonstrate these properties through simulation under correct and incorrect model specification, and apply our method to estimating the activity of transcription factors (TFs) over the cell cycle in yeast. We specifically target the importance of SWI4, SWI6, MBP1, MCM1, ACE2, FKH2, NDD1, and SWI5.
The semiparametric model allows us to determine the importance of a TF at specific time points by specifying time indicators as potential effect modifiers of the TF. Our results are promising, showing significant importance trends during the expected time periods. This methodology can also be used as a variable importance analysis tool to assess the effect of a large number of variables such as gene expressions or single nucleotide polymorphisms.
targeted maximum likelihood; semiparametric; repeated measures; longitudinal; transcription factors
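Since the abstract notes that the estimator can be computed with standard GEE software, the sketch below shows the kind of GEE call involved, with time indicators as effect modifiers of the TF variable. The simulated data, variable names, and working correlation are illustrative assumptions, and the targeted step described in the paper is not reproduced here.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Simulated repeated-measures data: outcome y, a TF binding score `tf`,
# time-point indicators, and a unit (gene) identifier.
n_units, n_times = 200, 5
df = pd.DataFrame({
    "unit": np.repeat(np.arange(n_units), n_times),
    "time": np.tile(np.arange(n_times), n_units),
    "tf":   np.repeat(rng.normal(size=n_units), n_times),
})
df["y"] = 0.5 * df["tf"] * (df["time"] == 2) + rng.normal(size=len(df))

# Parametric component: TF effect allowed to differ by time point, fit via GEE
# with an exchangeable working correlation within units.
fit = smf.gee("y ~ tf : C(time)", groups="unit", data=df,
              cov_struct=sm.cov_struct.Exchangeable(),
              family=sm.families.Gaussian()).fit()
print(fit.summary())
```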
The nested case-control (NCC) design is a popular sampling method in large epidemiologic studies because of its cost-effectiveness for investigating the temporal relationship of disease with environmental exposures or biological precursors. Thomas’ maximum partial likelihood estimator is commonly used to estimate the regression parameters in Cox’s model for NCC data. In this paper, we consider the situation in which failure/censoring information and some crude covariates are available for the entire cohort in addition to the NCC data, and we propose an improved estimator that is asymptotically more efficient than Thomas’ estimator. We adopt a projection approach that, heretofore, has only been employed in situations of random validation sampling, and we show that it can be adapted to NCC designs, in which the sampling scheme is a dynamic process and controls are not sampled independently. Under certain conditions, consistency and asymptotic normality of the proposed estimator are established and a consistent variance estimator is also developed. Furthermore, a simplified approximate estimator is proposed when the disease is rare. Extensive simulations are conducted to evaluate the finite-sample performance of the proposed estimators and to compare their efficiency with that of Thomas’ estimator and other competing estimators. Moreover, sensitivity analyses are conducted to demonstrate the behavior of the proposed estimator when model assumptions are violated, and we find that the biases are reasonably small in realistic situations. We further demonstrate the proposed method with data from studies of Wilms’ tumor.
Counting process; Cox proportional hazards model; Martingale; Risk set sampling; Survival analysis
We consider methods for estimating the effect of a covariate on a disease onset distribution when the observed data structure consists of right-censored data on diagnosis times and current status data on onset times amongst individuals who have not yet been diagnosed. Dunson and Baird (2001, Biometrics 57, 306–403) approached this problem using maximum likelihood, under the assumption that the ratio of the diagnosis and onset distributions is monotonic nondecreasing. As an alternative, we propose a two-step estimator, an extension of the approach of van der Laan, Jewell, and Petersen (1997, Biometrika 84, 539–554) in the single sample setting, which is computationally much simpler and requires no assumptions on this ratio. A simulation study is performed comparing estimates obtained from these two approaches, as well as that from a standard current status analysis that ignores diagnosis data. Results indicate that the Dunson and Baird estimator outperforms the two-step estimator when the monotonicity assumption holds, but the reverse is true when the assumption fails. The simple current status estimator loses only a small amount of precision in comparison to the two-step procedure but requires monitoring time information for all individuals. In the data that motivated this work, a study of uterine fibroids and chemical exposure to dioxin, the monotonicity assumption is seen to fail. Here, the two-step and current status estimators both show no significant association between the level of dioxin exposure and the hazard for onset of uterine fibroids; the two-step estimator of the relative hazard associated with increasing levels of exposure has the least estimated variance amongst the three estimators considered.
Current status data; Proportional hazards; Uterine fibroids
Motivated by medical studies in which patients can be cured of disease but the disease event time may be subject to interval censoring, we present a semiparametric non-mixture cure model for the regression analysis of interval-censored time-to-event data. We develop semiparametric maximum likelihood estimation for the model using the expectation-maximization method for interval-censored data. The maximization step for the baseline function is nonparametric and numerically challenging. We develop an efficient and numerically stable algorithm via modern convex optimization techniques, yielding a self-consistency algorithm for the maximization step. We prove the strong consistency of the maximum likelihood estimators under the Hellinger distance, which is an appropriate metric for the asymptotic properties of estimators with interval-censored data. We assess the performance of the estimators in a simulation study with small to moderate sample sizes. To illustrate the method, we also analyze a real data set from a medical study of biochemical recurrence of prostate cancer among patients who have undergone radical prostatectomy. Supplemental materials for the computational algorithm are available online.
Convex optimization; Hellinger consistency; maximum likelihood estimation; primal-dual interior-point method; prostate cancer
Two Monte Carlo simulations were performed to compare methods for estimating and testing hypotheses about quadratic effects in latent variable regression models. The methods considered in the current study were (a) a two-stage moderated regression approach using latent variable scores, (b) an unconstrained product indicator approach, (c) a latent moderated structural equation method, (d) a fully Bayesian approach, and (e) marginal maximum likelihood estimation. Of the five estimation methods, the methods based on maximum likelihood estimation and the Bayesian approach performed best overall in terms of bias, root-mean-square error, standard error ratios, power, and Type I error control, although key differences were observed. Similarities as well as disparities among methods are highlighted and general recommendations articulated. As a point of comparison, all five approaches were fit to a reparameterized version of the latent quadratic model using educational reading data.
structural equation modeling; nonlinear models; quadratic; maximum likelihood; Bayesian
With the advent of new technologies, animal locations are being collected at ever finer spatio-temporal scales. We review analytical methods for dealing with correlated data in the context of resource selection, including post hoc variance inflation techniques, ‘two-stage’ approaches based on models fit to each individual, generalized estimating equations and hierarchical mixed-effects models. These methods are applicable to a wide range of correlated data problems, but can be difficult to apply and remain especially challenging for use–availability sampling designs because the correlation structure for combinations of used and available points is not likely to follow common parametric forms. We also review emerging approaches to studying habitat selection that use fine-scale temporal data to arrive at biologically based definitions of available habitat, while naturally accounting for autocorrelation by modelling animal movement between telemetry locations. Sophisticated analyses that explicitly model correlation rather than consider it a nuisance, like mixed effects and state-space models, offer potentially novel insights into the process of resource selection, but additional work is needed to make them more generally applicable to large datasets based on use–availability designs. Until then, variance inflation techniques and two-stage approaches offer pragmatic and flexible approaches to modelling correlated data.
generalized estimating equation; generalized linear mixed model; hierarchical model; resource-selection function; telemetry; use–availability
We propose a general strategy for variable selection in semiparametric regression models by penalizing appropriate estimating functions. Important applications include semiparametric linear regression with censored responses and semiparametric regression with missing predictors. Unlike the existing penalized maximum likelihood estimators, the proposed penalized estimating functions may not pertain to the derivatives of any objective functions and may be discrete in the regression coefficients. We establish a general asymptotic theory for penalized estimating functions and present suitable numerical algorithms to implement the proposed estimators. In addition, we develop a resampling technique to estimate the variances of the estimated regression coefficients when the asymptotic variances cannot be evaluated directly. Simulation studies demonstrate that the proposed methods perform well in variable selection and variance estimation. We illustrate our methods using data from the Paul Coverdell Stroke Registry.
Accelerated failure time model; Buckley-James estimator; Censoring; Least absolute shrinkage and selection operator; Least squares; Linear regression; Missing data; Smoothly clipped absolute deviation
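To illustrate what penalizing an estimating function (rather than an objective function) looks like computationally, the sketch below applies a lasso-type penalty to the ordinary least-squares estimating function via a local quadratic approximation. The censored-response and missing-predictor estimating functions of the paper would replace the least-squares score; all names and tuning values are assumptions.

```python
import numpy as np

def penalized_ls_estimating_fn(X, y, lam, n_iter=100, tol=1e-8):
    """Lasso-type penalization of the least-squares estimating function
    U(beta) = X'(y - X beta), solved by local quadratic approximation (LQA):
    repeatedly solve (X'X + n * diag(lam / |beta_j|)) beta = X'y.
    Illustrative least-squares case only."""
    n, _ = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # unpenalized starting value
    for _ in range(n_iter):
        d = lam / np.maximum(np.abs(beta), 1e-6)       # LQA weights for the penalty
        beta_new = np.linalg.solve(X.T @ X + n * np.diag(d), X.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-4] = 0.0                    # threshold tiny coefficients
    return beta

# Example: only the first two of ten coefficients are truly nonzero.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10))
y = X @ np.r_[2.0, -1.5, np.zeros(8)] + rng.normal(size=300)
print(penalized_ls_estimating_fn(X, y, lam=0.1))
```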