Peer influence and social interactions can give rise to spillover effects in which the exposure of one individual may affect outcomes of other individuals. Even if the intervention under study occurs at the group or cluster level as in group-randomized trials, spillover effects can occur when the mediator of interest is measured at a lower level than the treatment. Evaluators who choose groups rather than individuals as experimental units in a randomized trial often anticipate that the desirable changes in targeted social behaviors will be reinforced through interference among individuals in a group exposed to the same treatment. In an empirical evaluation of the effect of a school-wide intervention on reducing individual students’ depressive symptoms, schools in matched pairs were randomly assigned to the 4Rs intervention or the control condition. Class quality was hypothesized as an important mediator assessed at the classroom level. We reason that the quality of one classroom may affect outcomes of children in another classroom because children interact not simply with their classmates but also with those from other classes in the hallways or on the playground. In investigating the role of class quality as a mediator, failure to account for such spillover effects of one classroom on the outcomes of children in other classrooms can potentially result in bias and problems with interpretation. Using a counterfactual conceptualization of direct, indirect and spillover effects, we provide a framework that can accommodate issues of mediation and spillover effects in group randomized trials. We show that the total effect can be decomposed into a natural direct effect, a within-classroom mediated effect and a spillover mediated effect. We give identification conditions for each of the causal effects of interest and provide results on the consequences of ignoring “interference” or “spillover effects” when they are in fact present. 
Our modeling approach disentangles these effects. The analysis examines whether the 4Rs intervention has an effect on children's depressive symptoms through changing the quality of other classes as well as through changing the quality of a child's own class.
doi:10.1080/01621459.2013.779832
PMCID: PMC3753117
PMID: 23997375
Direct/indirect effects; interference; multilevel models; social interactions
Constructing classification rules for accurate diagnosis of a disorder is an important goal in medical practice. In many clinical applications, no clinically significant anatomical or physiological deviation exists to identify the gold standard disease status and inform the development of classification algorithms. Despite the absence of perfect disease class identifiers, there are usually one or more disease-informative auxiliary markers along with feature variables comprising known symptoms. Existing statistical learning approaches do not effectively draw information from auxiliary prognostic markers. We propose a large margin classification method, with particular emphasis on the support vector machine (SVM), assisted by available informative markers in order to classify disease without knowing a subject’s true disease status. We view this task as statistical learning in the presence of missing data, and introduce a pseudo-EM algorithm to the classification. A major distinction from a regular EM algorithm is that we do not model the distribution of missing data given the observed feature variables either parametrically or semiparametrically. We also propose a sparse variable selection method embedded in the pseudo-EM algorithm. Theoretical examination shows that the proposed classification rule is Fisher consistent, and that under a linear rule, the proposed selection has an oracle variable selection property and the estimated coefficients are asymptotically normal. We apply the methods to build decision rules for including subjects in clinical trials of a new psychiatric disorder and present four applications to data available at the UCI Machine Learning Repository.
doi:10.1080/01621459.2013.775949
PMCID: PMC3770489
PMID: 24039320
Large margin classification; Support vector machine; Statistical learning; Classification rules; Missing data; Diagnostic and Statistical Manual of Mental Disorders
At both the individual and societal levels, the health and economic burden of disability in older adults is enormous in developed countries, including the U.S. Recent studies have revealed that the disablement process in older adults often comprises episodic periods of impaired functioning and periods that are relatively free of disability, amid a secular and natural trend of decline in functioning. Rather than an irreversible, progressive event that is analogous to a chronic disease, disability is better conceptualized and mathematically modeled as states that do not necessarily follow a strict linear order of good-to-bad. Statistical tools, including Markov models, which allow bidirectional transition between states, and random effects models, which allow individual-specific rate of secular decline, are pertinent. In this paper, we propose a mixed effects, multivariate, hidden Markov model to handle partially ordered disability states. The model generalizes the continuation ratio model for ordinal data in the generalized linear model literature and provides a formal framework for testing the effects of risk factors and/or an intervention on the transitions between different disability states. Under a generalization of the proportional odds ratio assumption, the proposed model circumvents the problem of a potentially large number of parameters when the number of states and the number of covariates are substantial. We describe a maximum likelihood method for estimating the partially ordered, mixed effects model and show how the model can be applied to a longitudinal data set that consists of N = 2,903 older adults followed for 10 years in the Health Aging and Body Composition Study. We further statistically test the effects of various risk factors upon the probabilities of transition into various severe disability states. The result can be used to inform geriatric and public health science researchers who study the disablement process.
doi:10.1080/01621459.2013.770307
PMCID: PMC3777389
PMID: 24058222
Latent Markov model; continuation ratio model; EM algorithm; generalized linear model; Health ABC study
Classical regression methods treat covariates as a vector and estimate a corresponding vector of regression coefficients. Modern applications in medical imaging generate covariates of more complex form such as multidimensional arrays (tensors). Traditional statistical and computational methods are proving insufficient for analysis of these high-throughput data due to their ultrahigh dimensionality as well as complex structure. In this article, we propose a new family of tensor regression models that efficiently exploit the special structure of tensor covariates. Under this framework, ultrahigh dimensionality is reduced to a manageable level, resulting in efficient estimation and prediction. A fast and highly scalable estimation algorithm is proposed for maximum likelihood estimation and its associated asymptotic properties are studied. Effectiveness of the new methods is demonstrated on both synthetic and real MRI imaging data.
doi:10.1080/01621459.2013.776499
PMCID: PMC4004091
PMID: 24791032
Brain imaging; dimension reduction; generalized linear model (GLM); magnetic resonance imaging (MRI); multidimensional array; tensor regression
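To make the dimension reduction in the tensor regression abstract above concrete, the following minimal sketch compares the number of coefficients in an unstructured tensor of regression coefficients with the number needed under a low-rank CANDECOMP/PARAFAC (CP) structure. The CP structure, the image size, and all function names are assumptions made for illustration; the abstract itself does not spell out the form of the low-dimensional structure.

```python
from math import prod


def full_parameter_count(dims):
    """Coefficients needed when every tensor entry gets its own coefficient."""
    return prod(dims)


def cp_parameter_count(dims, rank):
    """Entries in the factor matrices of a rank-`rank` CP decomposition."""
    return rank * sum(dims)


def cp_coefficient(factors, index):
    """Entry B[index] of B = sum_r (column r of factor 1) o (column r of factor 2) o ...

    `factors[d]` is a (dims[d] x rank) matrix stored as a list of rows.
    """
    rank = len(factors[0][0])
    total = 0.0
    for r in range(rank):
        term = 1.0
        for d, i in enumerate(index):
            term *= factors[d][i][r]
        total += term
    return total


# Hypothetical image size, chosen only for illustration.
dims = (256, 256, 256)
print(full_parameter_count(dims))        # 16777216
print(cp_parameter_count(dims, rank=3))  # 2304

# Tiny rank-1 example: B = [1, 2] outer [3, 4], so B[1][0] = 2 * 3.
tiny_factors = [[[1.0], [2.0]], [[3.0], [4.0]]]
print(cp_coefficient(tiny_factors, (1, 0)))  # 6.0
```

The count for the CP form ignores identifiability constraints, so it is an upper bound on the number of free parameters rather than an exact figure.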
Partial differential equation (PDE) models are commonly used to model complex dynamic systems in applied sciences such as biology and finance. The forms of these PDE models are usually proposed by experts based on their prior knowledge and understanding of the dynamic system. Parameters in PDE models often have interesting scientific interpretations, but their values are often unknown, and need to be estimated from the measurements of the dynamic system in the presence of measurement errors. Most PDEs used in practice have no analytic solutions, and can only be solved with numerical methods. Currently, methods for estimating PDE parameters require repeatedly solving PDEs numerically under thousands of candidate parameter values, and thus the computational load is high. In this article, we propose two methods to estimate parameters in PDE models: a parameter cascading method and a Bayesian approach. In both methods, the underlying dynamic process modeled with the PDE model is represented via basis function expansion. For the parameter cascading method, we develop two nested levels of optimization to estimate the PDE parameters. For the Bayesian method, we develop a joint model for data and the PDE, and develop a novel hierarchical model allowing us to employ Markov chain Monte Carlo (MCMC) techniques to make posterior inference. Simulation studies show that the Bayesian method and parameter cascading method are comparable, and both outperform other available methods in terms of estimation accuracy. The two methods are demonstrated by estimating parameters in a PDE model from LIDAR data.
doi:10.1080/01621459.2013.794730
PMCID: PMC3867159
PMID: 24363476
Asymptotic theory; Basis function expansion; Bayesian method; Differential equations; Measurement error; Parameter cascading
Recently developed methods for power analysis expand the options available for study design. We demonstrate how easily the methods can be applied by (1) reviewing their formulation and (2) describing their application in the preparation of a particular grant proposal. The focus is a complex but ubiquitous setting: repeated measures in a longitudinal study. Describing the development of the research proposal allows demonstrating the steps needed to conduct an effective power analysis. Discussion of the example also highlights issues that typically must be considered in designing a study. First, we discuss the motivation for using detailed power calculations, focusing on multivariate methods in particular. Second, we survey available methods for the general linear multivariate model (GLMM) with Gaussian errors and recommend those based on F approximations. The treatment includes coverage of the multivariate and univariate approaches to repeated measures, MANOVA, ANOVA, multivariate regression, and univariate regression. Third, we describe the design of the power analysis for the example, a longitudinal study of a child’s intellectual performance as a function of mother’s estimated verbal intelligence. Fourth, we present the results of the power calculations. Fifth, we evaluate the tradeoffs in using reduced designs and tests to simplify power calculations. Finally, we discuss the benefits and costs of power analysis in the practice of statistics. We make three recommendations:
(1) Align the design and hypothesis of the power analysis with the planned data analysis, as best as practical. (2) Embed any power analysis in a defensible sensitivity analysis. (3) Have the extent of the power analysis reflect the ethical, scientific, and monetary costs.
We conclude that power analysis catalyzes the interaction of statisticians and subject matter specialists. Using the recent advances for power analysis in linear models can further invigorate the interaction.
doi:10.1080/01621459.1992.10476281
PMCID: PMC4002049
PMID: 24790282
Analysis of variance; Multivariate linear models; Noncentral distribution; Repeated measures; Sample size determination
Multiple hypothesis testing is a fundamental problem in high dimensional inference, with wide applications in many scientific fields. In genome-wide association studies, tens of thousands of tests are performed simultaneously to find if any SNPs are associated with some traits and those tests are correlated. When test statistics are correlated, false discovery control becomes very challenging under arbitrary dependence. In the current paper, we propose a novel method based on principal factor approximation, which successfully subtracts the common dependence and weakens significantly the correlation structure, to deal with an arbitrary dependence structure. We derive an approximate expression for false discovery proportion (FDP) in large scale multiple testing when a common threshold is used and provide a consistent estimate of realized FDP. This result has important applications in controlling FDR and FDP. Our estimate of realized FDP compares favorably with Efron (2007)’s approach, as demonstrated in the simulated examples. Our approach is further illustrated by some real data applications. We also propose a dependence-adjusted procedure, which is more powerful than the fixed threshold procedure.
doi:10.1080/01621459.2012.720478
PMCID: PMC3983872
PMID: 24729644
Multiple hypothesis testing; high dimensional inference; false discovery rate; arbitrary dependence structure; genome-wide association studies
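As a rough illustration of the quantity studied in the multiple-testing abstract above, the sketch below computes the realized false discovery proportion (FDP) at a common p-value threshold for simulated, correlated test statistics. The one-factor dependence model, the effect sizes, and all names are assumptions made for this example; it does not implement the principal factor approximation itself.

```python
import random
from statistics import NormalDist


def false_discovery_proportion(p_values, is_null, threshold):
    """Realized FDP = (false rejections) / (all rejections) at a common threshold."""
    rejected = [p <= threshold for p in p_values]
    n_rejected = sum(rejected)
    n_false = sum(1 for rej, null in zip(rejected, is_null) if rej and null)
    return n_false / n_rejected if n_rejected > 0 else 0.0


def simulate_p_values(n_tests, n_signals, effect, loading, rng):
    """Two-sided p-values from z-statistics sharing one common factor (assumed model)."""
    normal = NormalDist()
    common = rng.gauss(0.0, 1.0)           # shared factor that induces correlation
    scale = (1.0 - loading ** 2) ** 0.5    # keeps each statistic at unit variance
    p_values, is_null = [], []
    for i in range(n_tests):
        mean = effect if i < n_signals else 0.0
        z = mean + loading * common + scale * rng.gauss(0.0, 1.0)
        p_values.append(2.0 * (1.0 - normal.cdf(abs(z))))
        is_null.append(i >= n_signals)
    return p_values, is_null


rng = random.Random(0)
for _ in range(3):
    p_values, is_null = simulate_p_values(2000, 50, 3.0, 0.6, rng)
    print(round(false_discovery_proportion(p_values, is_null, 0.001), 3))
```

Because the common factor changes from one simulated data set to the next, the realized FDP at a fixed threshold varies across runs, which is why an estimate of the realized FDP is of interest under dependence.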
In this article, we study the power properties of quadratic-distance-based goodness-of-fit tests. First, we introduce the concept of a root kernel and discuss the considerations that enter the selection of this kernel. We derive an easy to use normal approximation to the power of quadratic distance goodness-of-fit tests and base the construction of a noncentrality index, an analogue of the traditional noncentrality parameter, on it. This leads to a method akin to the Neyman-Pearson lemma for constructing optimal kernels for specific alternatives. We then introduce a midpower analysis as a device for choosing optimal degrees of freedom for a family of alternatives of interest. Finally, we introduce a new diffusion kernel, called the Pearson-normal kernel, and study the extent to which the normal approximation to the power of tests based on this kernel is valid. Supplementary materials for this article are available online.
doi:10.1080/01621459.2013.836972
PMCID: PMC3979448
PMID: 24764609
Big data; High-dimensional testing; Midpower analysis; Optimal kernel construction; Pearson-normal kernel; Power lemma
Robust variable selection procedures through penalized regression have been gaining increased attention in the literature. They can be used to perform variable selection and are expected to yield robust estimates. However, to the best of our knowledge, the robustness of those penalized regression procedures has not been well characterized. In this paper, we propose a class of penalized robust regression estimators based on exponential squared loss. The motivation for this new procedure is that it enables us to characterize its robustness, which has not been done for the existing procedures, while its performance is near optimal and superior to some recently developed methods. Specifically, under defined regularity conditions, our estimators are √n-consistent and possess the oracle property. Importantly, we show that our estimators can achieve the highest asymptotic breakdown point of 1/2 and that their influence functions are bounded with respect to the outliers in either the response or the covariate domain. We performed simulation studies to compare our proposed method with some recent methods, using the oracle method as the benchmark. We consider common sources of influential points. Our simulation studies reveal that our proposed method performs similarly to the oracle method in terms of the model error and the positive selection rate even in the presence of influential points. In contrast, other existing procedures have a much lower non-causal selection rate. Furthermore, we re-analyze the Boston Housing Price Dataset and the Plasma Beta-Carotene Level Dataset that are commonly used examples for regression diagnostics of influential points. Our analysis unravels the discrepancies of using our robust method versus the other penalized regression method, underscoring the importance of developing and applying robust penalized regression methods.
doi:10.1080/01621459.2013.766613
PMCID: PMC3727454
PMID: 23913996
Robust regression; Variable selection; Breakdown point; Influence function
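The robustness claim in the abstract above rests on the loss being bounded. The short sketch below contrasts ordinary squared loss with an exponential squared loss, assuming the form 1 - exp(-r^2 / gamma) with a tuning parameter gamma; the abstract does not state the formula, so both the form and the default value of gamma are assumptions made for illustration.

```python
from math import exp


def squared_loss(residual):
    """Ordinary least-squares loss; grows without bound as the residual grows."""
    return residual ** 2


def exponential_squared_loss(residual, gamma=1.0):
    """Assumed form 1 - exp(-r^2 / gamma); never exceeds 1, so outliers have limited pull."""
    return 1.0 - exp(-(residual ** 2) / gamma)


for r in (0.1, 1.0, 10.0, 100.0):
    print(r, squared_loss(r), exponential_squared_loss(r))
```

For small residuals the two losses behave similarly up to the scale factor 1/gamma, while for large residuals the exponential squared loss flattens out, which is the intuition behind a bounded influence function.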
The World Health Organization (WHO) guidelines for monitoring the effectiveness of HIV treatment in resource-limited settings (RLS) are mostly based on clinical and immunological markers (e.g., CD4 cell counts). Recent research indicates that the guidelines are inadequate and can result in high error rates. Viral load (VL) is considered the “gold standard”, yet its widespread use is limited by cost and infrastructure. In this paper, we propose a diagnostic algorithm that uses information from routinely-collected clinical and immunological markers to guide a selective use of VL testing for diagnosing HIV treatment failure, under the assumption that VL testing is available only at a certain portion of patient visits. Our algorithm identifies the patient sub-population, such that the use of limited VL testing on them minimizes a pre-defined risk (e.g., misdiagnosis error rate). Diagnostic properties of our proposed algorithm are assessed by simulations. For illustration, data from the Miriam Hospital Immunology Clinic (RI, USA) are analyzed.
doi:10.1080/01621459.2013.810149
PMCID: PMC3963362
PMID: 24672142
Antiretroviral failure; constrained optimization; HIV/AIDS; resource limited; ROC; tripartite classification
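One way to picture the selective use of VL testing described in the abstract above is a three-way rule on a risk score built from routine markers: classify clear cases directly and reserve the VL test for the uncertain middle band. The sketch below is a hypothetical illustration only; the score, the cutoffs, and the function names are assumptions and are not taken from the paper.

```python
def tripartite_rule(risk_score, lower, upper):
    """Classify directly at the extremes; send the middle band for VL testing."""
    if risk_score <= lower:
        return "no failure"
    if risk_score >= upper:
        return "failure"
    return "order VL test"


def fraction_tested(risk_scores, lower, upper):
    """Share of visits that would consume a VL test under the given cutoffs."""
    tested = sum(1 for s in risk_scores if lower < s < upper)
    return tested / len(risk_scores)


# Hypothetical risk scores on a 0-1 scale.
scores = [0.05, 0.20, 0.45, 0.55, 0.80, 0.95]
print([tripartite_rule(s, 0.3, 0.7) for s in scores])
print(fraction_tested(scores, 0.3, 0.7))
```

Under a testing budget, the cutoffs would be chosen so that the fraction tested does not exceed the available share of visits while a chosen risk, such as the misdiagnosis rate, is kept as small as possible.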
It has become common for data sets to contain large numbers of variables in studies conducted in areas such as genetics, machine vision, image analysis and many others. When analyzing such data, parametric models are often too inflexible while nonparametric procedures tend to be non-robust because of insufficient data on these high dimensional spaces. This is particularly true when interest lies in building efficient classifiers in the presence of many predictor variables. When dealing with these types of data, it is often the case that most of the variability tends to lie along a few directions, or more generally along a much smaller dimensional submanifold of the data space. In this article, we propose a class of models that flexibly learn about this submanifold while simultaneously performing dimension reduction in classification. This methodology allows the cell probabilities to vary nonparametrically based on a few coordinates expressed as linear combinations of the predictors. Also, as opposed to many black-box methods for dimensionality reduction, the proposed model is appealing in having clearly interpretable and identifiable parameters which provide insight into which predictors are important in determining accurate classification boundaries. Gibbs sampling methods are developed for posterior computation, and the methods are illustrated using simulated and real data applications.
doi:10.1080/01621459.2013.763566
PMCID: PMC3607648
PMID: 23539471
Classifier; Dimension reduction; Variable selection; Nonparametric Bayes
In many applications the graph structure in a network arises from two sources: intrinsic connections and connections due to external effects. We introduce a sparse estimation procedure for graphical models that is capable of isolating the intrinsic connections by removing the external effects. Technically, this is formulated as a conditional graphical model, in which the external effects are modeled as predictors, and the graph is determined by the conditional precision matrix. We introduce two sparse estimators of this matrix using the reproducing kernel Hilbert space combined with lasso and adaptive lasso. We establish the sparsity, variable selection consistency, oracle property, and the asymptotic distributions of the proposed estimators. We also develop their convergence rate when the dimension of the conditional precision matrix goes to infinity. The methods are compared with sparse estimators for unconditional graphical models, and with the constrained maximum likelihood estimate that assumes a known graph structure. The methods are applied to a genetic data set to construct a gene network conditioning on single-nucleotide polymorphisms.
doi:10.1080/01621459.2011.644498
PMCID: PMC3932550
PMID: 24574574
Conditional random field; Gaussian graphical models; Lasso and adaptive lasso; Oracle property; Reproducing kernel Hilbert space; Sparsity; Sparsistency; von Mises expansion
We develop methodology which combines statistical learning methods with generalized Markov models, thereby enhancing the former to account for time series dependence. Our methodology can accommodate very general and very long-term time dependence structures in an easily estimable and computationally tractable fashion. We apply our methodology to the scoring of sleep behavior in mice. As currently used methods are expensive, invasive, and labor intensive, there is considerable interest in high-throughput automated systems which would allow many mice to be scored cheaply and quickly. Previous efforts have been able to differentiate sleep from wakefulness, but they are unable to differentiate the rare and important state of REM sleep from non-REM sleep. Key difficulties in detecting REM are that (i) REM is much rarer than non-REM and wakefulness, (ii) REM looks similar to non-REM in terms of the observed covariates, (iii) the data are noisy, and (iv) the data contain strong time dependence structures crucial for differentiating REM from non-REM. Our new approach (i) shows improved differentiation of REM from non-REM sleep and (ii) accurately estimates aggregate quantities of sleep in our application to video-based sleep scoring of mice.
doi:10.1080/01621459.2013.779838
PMCID: PMC3913289
PMID: 24504359
sleep; REM; classification; Markov; time series
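To illustrate how time dependence can sharpen per-epoch classifications, as in the sleep-scoring abstract above, the sketch below runs a plain first-order Viterbi decoder over classifier probabilities. This is a deliberately simpler stand-in, not the generalized Markov models of the abstract: the first-order chain, the use of classifier probabilities as emission scores, and all numbers are assumptions made for illustration, and every probability is assumed to be strictly positive.

```python
from math import log


def viterbi(class_probs, transition, initial):
    """Most likely state path, scoring each epoch by its classifier probability."""
    n_states = len(initial)
    score = [log(initial[s]) + log(class_probs[0][s]) for s in range(n_states)]
    back_pointers = []
    for probs in class_probs[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            # Best previous state for arriving in state s at this epoch.
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log(transition[r][s]))
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log(transition[best_prev][s])
                             + log(probs[s]))
        score = new_score
        back_pointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Two hypothetical states (0 = non-REM, 1 = REM) with "sticky" transitions.
transition = [[0.95, 0.05], [0.05, 0.95]]
initial = [0.5, 0.5]
class_probs = [[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1], [0.9, 0.1]]
print(viterbi(class_probs, transition, initial))  # [0, 0, 0, 0, 0]
```

In this toy input the third epoch is weakly classified as state 1 on its own, but the decoder keeps it in state 0 because a one-epoch excursion is unlikely under the sticky transition matrix.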
A major aim of longitudinal analyses of life course data is to describe the within- and between-individual variability in a behavioral outcome, such as crime. Statistical analyses of such data typically draw on mixture and mixed-effects growth models. In this work, we present a functional analytic point of view and develop an alternative method that models individual crime trajectories as departures from a population age-crime curve. Drawing on empirical and theoretical claims in criminology, we assume a unimodal population age-crime curve and allow individual expected crime trajectories to differ by their levels of offending and patterns of temporal misalignment. We extend Bayesian hierarchical curve registration methods to accommodate count data and to incorporate influence of baseline covariates on individual behavioral trajectories. Analyzing self-reported counts of yearly marijuana use from the Denver Youth Survey, we examine the influence of race and gender categories on differences in levels and timing of marijuana smoking. We find that our approach offers a flexible model for longitudinal crime trajectories and allows for a rich array of inferences of interest to criminologists and drug abuse researchers.
doi:10.1080/01621459.2012.716328
PMCID: PMC3913486
PMID: 24504416
Curve Registration; Drug Use; Functional Data; Generalized Linear Models; Individual Trajectories; Longitudinal Data; MCMC; Unimodal Smoothing
doi:10.1080/01621459.2012.665198
PMCID: PMC3908914
PMID: 24489418
This work presents methods for estimating genotype-specific distributions from genetic epidemiology studies where the event times are subject to right censoring, the genotypes are not directly observed, and the data arise from a mixture of scientifically meaningful subpopulations. Examples of such studies include kin-cohort studies and quantitative trait locus (QTL) studies. Current methods for analyzing censored mixture data include two types of nonparametric maximum likelihood estimators (NPMLEs) which do not make parametric assumptions on the genotype-specific density functions. Although both NPMLEs are commonly used, we show that one is inefficient and the other inconsistent. To overcome these deficiencies, we propose three classes of consistent nonparametric estimators which do not assume parametric density models and are easy to implement. They are based on the inverse probability weighting (IPW), augmented IPW (AIPW), and nonparametric imputation (IMP). The AIPW achieves the efficiency bound without additional modeling assumptions. Extensive simulation experiments demonstrate satisfactory performance of these estimators even when the data are heavily censored. We apply these estimators to the Cooperative Huntington’s Observational Research Trial (COHORT), and provide age-specific estimates of the effect of mutation in the Huntington gene on mortality using a sample of family members. The close approximation of the estimated non-carrier survival rates to that of the U.S. population indicates small ascertainment bias in the COHORT family sample. Our analyses underscore an elevated risk of death in Huntington gene mutation carriers compared to non-carriers for a wide age range, and suggest that the mutation equally affects survival rates in both genders. 
The estimated survival rates are useful in genetic counseling for providing guidelines on interpreting the risk of death associated with a positive genetic test, and in helping future at-risk subjects make informed decisions on whether to undergo genetic mutation testing.
doi:10.1080/01621459.2012.699353
PMCID: PMC3905630
PMID: 24489419
Censored data; Finite mixture model; Huntington’s disease; Kin-cohort design; Quantitative trait locus
In a case-referent study, cases of disease are compared to non-cases with respect to their antecedent exposure to a treatment in an effort to determine whether exposure causes some cases of the disease. Because exposure is not randomly assigned in the population, as it would be if the population were a vast randomized trial, exposed and unexposed subjects may differ prior to exposure with respect to covariates that may or may not have been measured. After controlling for measured pre-exposure differences, for instance by matching, a sensitivity analysis asks about the magnitude of bias from unmeasured covariates that would need to be present to alter the conclusions of a study that presumed matching for observed covariates removes all bias. The definition of a case of disease affects sensitivity to unmeasured bias. We explore this issue using: (i) an asymptotic tool, the design sensitivity, (ii) a simulation for finite samples, and (iii) an example. Under favorable circumstances, a narrower case definition can yield an increase in the design sensitivity, and hence an increase in the power of a sensitivity analysis. Also, we discuss an adaptive method that seeks to discover the best case definition from the data at hand while controlling for multiple testing. An implementation in R is available as SensitivityCaseControl.
doi:10.1080/01621459.2013.820660
PMCID: PMC3904399
PMID: 24482549
Case-control study; matching; observational study; sensitivity analysis
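The kind of sensitivity analysis described in the abstract above can be sketched for the simplest setting: matched case-referent pairs with a binary exposure, analyzed with a McNemar-type test. Under the standard sensitivity model with parameter Γ ≥ 1, the chance that the case is the exposed member of a discordant pair is at most Γ/(1+Γ), which yields an upper bound on the one-sided p-value. The counts and function names below are hypothetical, and this sketch is independent of the SensitivityCaseControl package.

```python
from math import comb


def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1.0 - p) ** (n - j) for j in range(k, n + 1))


def worst_case_p_value(discordant_pairs, case_exposed_pairs, gamma):
    """Upper bound on the one-sided p-value when unmeasured bias is at most gamma."""
    p_max = gamma / (1.0 + gamma)
    return binomial_upper_tail(discordant_pairs, case_exposed_pairs, p_max)


# Hypothetical data: 100 discordant pairs, the case is the exposed member in 70.
for gamma in (1.0, 1.5, 2.0):
    print(gamma, worst_case_p_value(100, 70, gamma))
```

At Γ = 1 the bound is the usual randomization p-value; it increases with Γ, and the value of Γ at which it crosses the significance level summarizes how much unmeasured bias the conclusion can tolerate.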
In many applications, it is of interest to study trends over time in relationships among categorical variables, such as age group, ethnicity, religious affiliation, political party and preference for particular policies. At each time point, a sample of individuals provides responses to a set of questions, with different individuals sampled at each time. In such settings, there tends to be abundant missing data and the variables being measured may change over time. At each time point, one obtains a large sparse contingency table, with the number of cells often much larger than the number of individuals being surveyed. To borrow information across time in modeling large sparse contingency tables, we propose a Bayesian autoregressive tensor factorization approach. The proposed model relies on a probabilistic Parafac factorization of the joint pmf characterizing the categorical data distribution at each time point, with autocorrelation included across times. Efficient computational methods are developed relying on MCMC. The methods are evaluated through simulation examples and applied to social survey data.
doi:10.1080/01621459.2013.823866
PMCID: PMC3904485
PMID: 24482548
Dynamic model; Multivariate categorical data; Nonparametric Bayes; Panel data; Parafac; Probabilistic tensor factorization; Stick-breaking
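The probabilistic Parafac factorization mentioned in the abstract above writes the joint pmf of several categorical variables as a mixture of product distributions: P(y_1, ..., y_p) = sum_h w_h * prod_j psi_h^(j)[y_j]. The sketch below evaluates such a pmf at a single time point; the weights, the component probabilities, and the names are made up for illustration, and the autoregressive part of the model is not shown.

```python
from itertools import product


def parafac_pmf(weights, component_probs, cell):
    """Joint probability of `cell` under a mixture of product-multinomial components.

    component_probs[h][j] is the probability vector of variable j in component h.
    """
    total = 0.0
    for w, probs in zip(weights, component_probs):
        term = w
        for j, level in enumerate(cell):
            term *= probs[j][level]
        total += term
    return total


# Two components, three variables with 2, 2 and 3 levels (hypothetical values).
weights = [0.6, 0.4]
component_probs = [
    [[0.9, 0.1], [0.8, 0.2], [0.5, 0.3, 0.2]],
    [[0.2, 0.8], [0.3, 0.7], [0.5, 0.25, 0.25]],
]

print(parafac_pmf(weights, component_probs, (0, 0, 0)))

# Summing over every cell of the 2 x 2 x 3 table should give 1 up to rounding.
cells = product(range(2), range(2), range(3))
print(sum(parafac_pmf(weights, component_probs, c) for c in cells))
```

The factorized form needs far fewer numbers than one probability per cell once the table has many variables, which is what makes it usable for large sparse contingency tables.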
In this paper we propose a Bayesian natural history model for disease progression based on the joint modeling of longitudinal biomarker levels, age at clinical detection of disease and disease status at diagnosis. We establish a link between the longitudinal responses and the natural history of the disease by using an underlying latent disease process which describes the onset of the disease and models the transition to an advanced stage of the disease as dependent on the biomarker levels. We apply our model to the data from the Baltimore Longitudinal Study of Aging on prostate specific antigen (PSA) to investigate the natural history of prostate cancer.
doi:10.1198/016214507000000356
PMCID: PMC3896511
PMID: 24453387
Natural history model; disease progression; latent variables; longitudinal response; Markov Chain Monte Carlo methods; prostate specific antigen
The nested case-control (NCC) design has been widely adopted as a cost-effective solution in many large cohort studies for risk assessment with expensive markers, such as the emerging biologic and genetic markers. To analyze data from NCC studies, conditional logistic regression (Goldstein and Langholz, 1992; Borgan et al., 1995) and maximum likelihood (Scheike and Juul, 2004; Zeng et al., 2006) based methods have been proposed. However, most of these methods either cannot be easily extended beyond the Cox model (Cox, 1972) or require additional modeling assumptions. More generally applicable approaches based on inverse probability weighting (IPW) have been proposed as useful alternatives (Samuelsen, 1997; Chen, 2001; Samuelsen et al., 2007). However, due to the complex correlation structure induced by repeated finite risk set sampling, interval estimation for such IPW estimators remains challenging, especially when the estimation involves non-smooth objective functions or when making simultaneous inferences about functions. Standard resampling procedures such as the bootstrap cannot accommodate the correlation and thus are not directly applicable. In this paper, we propose a resampling procedure that can provide valid estimates for the distribution of a broad class of IPW estimators. Simulation results suggest that the proposed procedures perform well in settings where an analytical variance estimator is infeasible to derive or gives suboptimal performance. The new procedures are illustrated with data from the Framingham Offspring Study to characterize individual level cardiovascular risks over time based on the Framingham risk score, C-reactive protein (CRP) and a genetic risk score.
doi:10.1080/01621459.2013.856715
PMCID: PMC3891801
PMID: 24436503
Biomarker study; Interval Estimation; Inverse Probability Weighting; Nested case-control study; Resampling methods; Risk Prediction; Simultaneous Confidence Band; Survival Model
Images, often stored in multidimensional arrays, are fast becoming ubiquitous in medical and public health research. Analyzing populations of images is a statistical problem that raises a host of daunting challenges. The most significant challenge is the massive size of the datasets incorporating images recorded for hundreds or thousands of subjects at multiple visits. We introduce the population value decomposition (PVD), a general method for simultaneous dimensionality reduction of large populations of massive images. We show how PVD can be seamlessly incorporated into statistical modeling, leading to a new, transparent, and rapid inferential framework. Our PVD methodology was motivated by and applied to the Sleep Heart Health Study, the largest community-based cohort study of sleep containing more than 85 billion observations on thousands of subjects at two visits. This article has supplementary material online.
doi:10.1198/jasa.2011.ap10089
PMCID: PMC3886284
PMID: 24415813
Electroencephalography; Signal extraction
End-of-life medical expenses are a significant proportion of all health care expenditures. These costs were studied using costs of services from Medicare claims and cause of death (CoD) from death certificates. In the absence of a unique identifier linking the two datasets, common variables identified unique matches for only 33% of deaths. The remaining cases formed cells with multiple cases (32% in cells with an equal number of cases from each file and 35% in cells with an unequal number). We sampled from the joint posterior distribution of model parameters and the permutations that link cases from the two files within each cell. The linking models included the regression of location of death on CoD and other parameters, and the regression of cost measures with a monotone missing data pattern on CoD and other demographic characteristics. Permutations were sampled by enumerating the exact distribution for small cells and by the Metropolis algorithm for large cells. Sparse matrix data structures enabled efficient calculations despite the large dataset (≈1.7 million cases). The procedure generates m datasets in which the matches between the two files are imputed. The m datasets can be analyzed independently and results combined using Rubin's multiple imputation rules. Our approach can be applied in other file linking applications.
doi:10.1080/01621459.2012.726889
PMCID: PMC3640583
PMID: 23645944
Statistical Matching; Record Linkage; Administrative Data; Missing Data; Bayesian Analysis
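The file-linking abstract above combines results from the m imputed datasets using Rubin's multiple imputation rules. A minimal sketch of those rules for a scalar estimate follows; the function name and the inputs are hypothetical, and the per-dataset estimates and variances are assumed to come from whatever analysis is run on each imputed dataset.

```python
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Combine m completed-data estimates of a scalar quantity.

    estimates : point estimates, one per imputed dataset
    variances : their estimated sampling variances, in the same order
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)          # pooled point estimate
    u_bar = mean(variances)          # within-imputation variance
    b = variance(estimates)          # between-imputation variance (divisor m - 1)
    total = u_bar + (1.0 + 1.0 / m) * b
    return q_bar, total
```

The factor `1 + 1/m` inflates the between-imputation component to account for using a finite number of imputations.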
We propose a unified estimation method for semiparametric linear transformation models under general biased sampling schemes. The new estimator is obtained from a set of counting process-based unbiased estimating equations, developed through introducing a general weighting scheme that offsets the sampling bias. The usual asymptotic properties, including consistency and asymptotic normality, are established under suitable regularity conditions. A closed-form formula is derived for the limiting variance and the plug-in estimator is shown to be consistent. We demonstrate the unified approach through the special cases of left truncation, length-bias, the case-cohort design and variants thereof. Simulation studies and applications to real data sets are presented.
doi:10.1080/01621459.2012.746073
PMCID: PMC3649773
PMID: 23667280
Case-cohort design; Counting process; Cox model; Estimating equations; Importance sampling; Length-bias; Proportional odds model; Regression; Truncation; Survival data
Large- and finite-sample efficiency and resistance to outliers are the key goals of robust statistics. Although these goals are often not simultaneously attainable, we develop and study a linear regression estimator that comes close. Efficiency obtains from the estimator’s close connection to generalized empirical likelihood, and its favorable robustness properties are obtained by constraining the associated sum of (weighted) squared residuals. We prove maximum attainable finite-sample replacement breakdown point, and full asymptotic efficiency for normal errors. Simulation evidence shows that compared to existing robust regression estimators, the new estimator has relatively high efficiency for small sample sizes, and comparable outlier resistance. The estimator is further illustrated and compared to existing methods via application to a real data set with purported outliers.
doi:10.1080/01621459.2013.779847
PMCID: PMC3747015
PMID: 23976805
Asymptotic efficiency; Breakdown point; Constrained optimization; Efficient estimation; Empirical likelihood; Exponential tilting; Least trimmed squares; Robust regression; Weighted least squares
When comparing a new treatment with a control in a randomized clinical study, the treatment effect is generally assessed by evaluating a summary measure over a specific study population. The success of the trial heavily depends on the choice of such a population. In this paper, we show a systematic, effective way to identify a promising population, for which the new treatment is expected to have a desired benefit, utilizing the data from a current study involving similar comparator treatments. Specifically, using the existing data, we first create a parametric scoring system as a function of multiple baseline covariates to estimate subject-specific treatment differences. Based on this scoring system, we specify a desired level of treatment difference and obtain a subgroup of patients, defined as those whose estimated scores exceed this threshold. An empirically calibrated threshold-specific treatment difference curve across a range of score values is constructed. The subpopulation of patients satisfying any given level of treatment benefit can then be identified accordingly. To avoid bias due to overoptimism, we utilize a cross-training-evaluation method for implementing the above two-step procedure. We then show how to select the best scoring system among all competing models. Furthermore, for cases in which only a single pre-specified working model is involved, inference procedures are proposed for the average treatment difference over a range of score values using the entire data set, and are justified theoretically and numerically. Lastly, the proposals are illustrated with the data from two clinical trials in treating HIV and cardiovascular diseases.
Note that if we are not interested in designing a new study for comparing similar treatments, the new procedure can also be quite useful for the management of future patients, so that treatment may be targeted towards those who would receive nontrivial benefits to compensate for the risk or cost of the new treatment.
doi:10.1080/01621459.2013.770705
PMCID: PMC3775385
PMID: 24058223
Cross-training-evaluation; Lasso procedure; Personalized medicine; Prediction; Ridge regression; Stratified medicine; Subgroup analysis; Variable selection
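The second step of the subgroup abstract above, computing the treatment difference among patients whose score exceeds a threshold, can be sketched as follows. This is an illustrative simplification: the scores are assumed to have been produced already by some scoring model fitted on separate training data, the treatment difference is taken as a plain difference in arm means, and all names are hypothetical. It does not implement the paper's cross-training-evaluation or its inference procedures.

```python
from statistics import mean


def subgroup_difference(scores, treated, outcomes, threshold):
    """Difference in mean outcome (treated minus control) among
    subjects whose score is at least `threshold`.

    Returns None when either arm is empty in the subgroup.
    """
    rows = list(zip(scores, treated, outcomes))
    trt = [y for s, a, y in rows if s >= threshold and a]
    ctl = [y for s, a, y in rows if s >= threshold and not a]
    if not trt or not ctl:
        return None
    return mean(trt) - mean(ctl)


def difference_curve(scores, treated, outcomes, thresholds):
    """Threshold-specific treatment differences over a grid of thresholds."""
    return [
        (c, subgroup_difference(scores, treated, outcomes, c))
        for c in thresholds
    ]
```

Evaluating `difference_curve` on held-out data rather than on the data used to build the scores is what guards against the overoptimism the abstract mentions.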