Standard assumptions incorporated into Bayesian model selection procedures result in procedures that are not competitive with commonly used penalized likelihood methods. We propose modifications of these methods by imposing nonlocal prior densities on model parameters. We show that the resulting model selection procedures are consistent in linear model settings when the number of possible covariates p is bounded by the number of observations n, a property that has not been extended to other model selection procedures. In addition to consistently identifying the true model, the proposed procedures provide accurate estimates of the posterior probability that each identified model is correct. Through simulation studies, we demonstrate that these model selection procedures perform as well as or better than commonly used penalized likelihood methods in a range of settings. Proofs of the primary theorems are provided in the online Supplementary Material.
Adaptive LASSO; Dantzig selector; Elastic net; g-prior; Intrinsic Bayes factor; Intrinsic prior; Nonlocal prior; Nonnegative garrote; Oracle
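As one illustration (not taken from the article itself), a first-order product-moment nonlocal prior on a single regression coefficient β can be written as
\[ \pi(\beta \mid \tau, \sigma^{2}) \;=\; \frac{\beta^{2}}{\tau\sigma^{2}}\, N(\beta \mid 0, \tau\sigma^{2}), \]
which, unlike conventional local priors, satisfies π(0) = 0 and so places vanishing mass near the null value of the coefficient; this vanishing-at-zero property is what distinguishes nonlocal priors from the standard local prior specifications.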
In the presence of time-varying confounders affected by prior treatment, standard statistical methods for failure time analysis may be biased. Methods that correctly adjust for this type of covariate include the parametric g-formula, inverse probability weighted estimation of marginal structural Cox proportional hazards models, and g-estimation of structural nested accelerated failure time models. In this article, we propose a novel method to estimate the causal effect of a time-dependent treatment on failure in the presence of informative right-censoring and time-dependent confounders that may be affected by past treatment: g-estimation of structural nested cumulative failure time models (SNCFTMs). An SNCFTM considers the conditional effect of a final treatment at time m on the outcome at each later time k by modeling the ratio of two counterfactual cumulative risks at time k under treatment regimes that differ only at time m. Inverse probability weights are used to adjust for informative censoring. We also present a procedure that, under certain “no-interaction” conditions, uses the g-estimates of the model parameters to calculate unconditional cumulative risks under nondynamic (static) treatment regimes. The procedure is illustrated with an example using data from a longitudinal cohort study, in which the “treatments” are healthy behaviors and the outcome is coronary heart disease.
Causal inference; Coronary heart disease; Epidemiology; G-estimation; Inverse probability weighting
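A hedged rendering of the ratio described above (the notation is illustrative, not the authors'): writing Y_k for the failure indicator at time k and a_m for treatment at time m, an SNCFTM relates two counterfactual cumulative risks that differ only in the treatment received at time m,
\[ \frac{\Pr\big(Y_k^{\bar a_{m-1},\, a_m,\, \underline{0}} = 1\big)}{\Pr\big(Y_k^{\bar a_{m-1},\, 0,\, \underline{0}} = 1\big)} \;=\; \exp\{\gamma(k, m, a_m; \psi)\}, \qquad k > m, \]
with both risks conditioned on the observed history through time m and treatment set to zero after m, so that γ captures the effect of a final "blip" of treatment at m on risk at each later time k.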
Many longitudinal studies involve relating an outcome process to a set of possibly time-varying covariates, giving rise to the usual regression models for longitudinal data. When the purpose of the study is to investigate covariate effects when the experimental environment undergoes abrupt changes, or to locate periods with different levels of covariate effects, a simple and easy-to-interpret approach is to introduce change-points in the regression coefficients. To this end, we propose a semiparametric change-point regression model in which the error process (stochastic component) is nonparametric, the baseline mean function (functional part) is completely unspecified, the observation times are allowed to be subject-specific, and the number, locations, and magnitudes of the change-points are unknown and must be estimated. We further develop an estimation procedure that combines recent advances in semiparametric analysis based on counting process arguments with multiple change-point inference, and we discuss its large-sample properties, including consistency and asymptotic normality, under suitable regularity conditions. Simulation results show that the proposed methods work well under a variety of scenarios. An application to a real data set is also given.
Change-points; Counting process; Time-varying coefficient
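A hedged sketch of the kind of model the abstract describes (symbols are illustrative): with subject-specific observation times t and covariates X_i(t), the mean structure may be written as
\[ E\{Y_i(t) \mid X_i(t)\} \;=\; \mu_0(t) + \beta(t)^{\top} X_i(t), \qquad \beta(t) = \beta_j \ \text{for } t \in (\tau_{j-1}, \tau_j], \]
where μ_0(·) is the completely unspecified baseline mean function, the error process is left nonparametric, and the number, locations τ_1 < ⋯ < τ_K, and magnitudes of the coefficient change-points are all estimated from the data.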
Real-world networks exhibit a complex set of phenomena such as underlying hierarchical organization, multiscale interaction, and varying topologies of communities. Most existing methods do not adequately capture the intrinsic interplay among such phenomena. We propose a nonparametric Multiscale Community Blockmodel (MSCB) to model the generation of hierarchies in social communities, the selective membership of actors in subsets of these communities, and the networks that result from within- and cross-community interactions. By using the nested Chinese Restaurant Process, our model automatically infers the hierarchy structure from the data. We develop a collapsed Gibbs sampling algorithm for posterior inference, conduct extensive validation using synthetic networks, and demonstrate the utility of our model on real-world datasets such as predator-prey networks and citation networks.
Hierarchical network analysis; Latent space model; Bayesian nonparametrics; Gibbs sampler
Recurrent event data are frequently encountered in studies with longitudinal designs. Let the recurrence time be the time between two successive recurrent events. Recurrence times can be treated as a type of correlated survival data in statistical analysis. In general, because of the ordinal nature of recurrence times, statistical methods that are appropriate for standard correlated survival data in marginal models may not be applicable to recurrence time data. Specifically, for estimating the marginal survival function, the Kaplan-Meier estimator derived from the pooled recurrence times serves as a consistent estimator for standard correlated survival data but not for recurrence time data. In this article we consider the problem of how to estimate the marginal survival function in nonparametric models. A class of nonparametric estimators is introduced, and their appropriateness is confirmed by statistical theory and simulations. Simulations and an analysis of schizophrenia data are presented to illustrate the estimators' performance.
Correlated survival data; Frailty; Kaplan-Meier estimate; Longitudinal designs; Recurrent event
Recurrent event data are frequently encountered in longitudinal follow-up studies. In the statistical literature, noninformative censoring is typically assumed when statistical methods and theory are developed for analyzing recurrent event data. In many applications, however, the observation of recurrent events could be terminated by informative dropouts or failure events, and it is unrealistic to assume that the censoring mechanism is independent of the recurrent event process. In this article we consider recurrent events of the same type and allow the censoring mechanism to be possibly informative. The occurrence of recurrent events is modeled by a subject-specific nonstationary Poisson process via a latent variable. A multiplicative intensity model is used as the underlying model for nonparametric estimation of the cumulative rate function, and it is extended to a regression model by taking covariate information into account. Statistical methods and theory are developed for estimation of the cumulative rate function and regression parameters. As a major feature of this article, we treat the distributions of both the censoring and latent variables as nuisance parameters and avoid modeling and estimating them through appropriately constructed procedures. An analysis of the AIDS Link to Intravenous Experiences cohort data is presented to illustrate the proposed methods.
Frailty; Intensity function; Latent variable; Proportional rate model; Rate function
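A hedged reading of the modeling structure described above (illustrative notation): conditional on a nonnegative latent variable z_i, the recurrent event process of subject i is a nonstationary Poisson process with multiplicative intensity
\[ \lambda_i(t) \;=\; z_i\, \lambda_0(t), \]
with the regression extension taking the proportional rate form λ_i(t) = z_i λ_0(t) exp(X_i^⊤ β). Because z_i may also be linked to the censoring time, censoring is allowed to be informative, while the distributions of z_i and the censoring time are left unspecified as nuisance parameters.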
Recurrent event data are commonly encountered in longitudinal follow-up studies related to biomedical science, econometrics, reliability, and demography. In many studies, recurrent events serve as important measurements for evaluating disease progression, health deterioration, or insurance risk. When analyzing recurrent event data, an independent censoring condition is typically required for the construction of statistical methods. In some situations, however, the terminating time for observing recurrent events could be correlated with the recurrent event process, thus violating the assumption of independent censoring. In this article, we consider joint modeling of a recurrent event process and a failure time in which a common subject-specific latent variable is used to model the association between the intensity of the recurrent event process and the hazard of the failure time. The proposed joint model is flexible in that no parametric assumptions on the distributions of censoring times and latent variables are made, and under the model, informative censoring is allowed for observing both the recurrent events and failure times. We propose a “borrow-strength estimation procedure” by first estimating the value of the latent variable from recurrent event data, then using the estimated value in the failure time model. Some interesting implications and trajectories of the proposed model are presented. Properties of the regression parameter estimates and the estimated baseline cumulative hazard functions are also studied.
Borrow-strength method; Frailty; Informative censoring; Joint model; Nonstationary Poisson process
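One hedged way to write the joint structure sketched above (illustrative notation): a shared nonnegative latent variable z_i links the two processes, with recurrent event intensity and failure hazard
\[ \lambda_i(t) = z_i\, \lambda_0(t)\, \exp(X_i^{\top}\alpha), \qquad h_i(t) = z_i\, h_0(t)\, \exp(X_i^{\top}\beta), \]
so that large values of z_i produce both frequent recurrences and early failure. The borrow-strength procedure first estimates z_i from the recurrent event data and then plugs that estimate into the failure time model.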
The multiplicative regression model, or accelerated failure time model, which becomes a linear regression model after logarithmic transformation, is useful for analyzing data with positive responses, such as stock prices or lifetimes, that are particularly common in economic, financial, or biomedical studies. Least squares and least absolute deviation are among the most widely used criteria in statistical estimation for the linear regression model. However, in many practical applications, especially when treating, for example, stock price data, the size of the relative error, rather than that of the error itself, is the central concern of practitioners. This paper offers an alternative to the traditional estimation methods by minimizing the sum of absolute relative errors for multiplicative regression models. We prove consistency and asymptotic normality and provide an inference approach via random weighting. We also identify the error distribution under which the proposed least absolute relative error estimation is efficient. Supportive evidence is shown in simulation studies, and an application is illustrated in an analysis of stock returns on the Hong Kong Stock Exchange.
Multiplicative regression model; Logarithm transformation; Relative error; Random weighting
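The following is a minimal, hypothetical sketch (not the authors' code) of the least absolute relative errors idea for the multiplicative model Y = exp(Xβ)·ε: the criterion sums the two natural ways of measuring relative error, |Y − exp(Xβ)|/Y and |Y − exp(Xβ)|/exp(Xβ), and minimizes that sum.

```python
import numpy as np
from scipy.optimize import minimize

def lare_loss(beta, X, y):
    # Sum of the two absolute relative errors under the multiplicative model y = exp(X @ beta) * error
    pred = np.exp(X @ beta)
    resid = y - pred
    return np.sum(np.abs(resid) / y + np.abs(resid) / pred)

# Hypothetical positive-response data generated from a multiplicative model
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.0])
y = np.exp(X @ beta_true) * np.exp(rng.normal(scale=0.2, size=n))

fit = minimize(lare_loss, x0=np.zeros(2), args=(X, y), method="Nelder-Mead")
print(fit.x)  # should land close to beta_true
```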
There is increasing interest in discovering individualized treatment rules for patients who have heterogeneous responses to treatment. In particular, one aims to find an optimal individualized treatment rule, which is a deterministic function of patient-specific characteristics that maximizes the expected clinical outcome. In this paper, we first show that estimating such an optimal treatment rule is equivalent to a classification problem in which each subject is weighted proportionally to his or her clinical outcome. We then propose an outcome-weighted learning approach based on the support vector machine framework. We show that the resulting estimator of the treatment rule is consistent. We further obtain a finite sample bound for the difference between the expected outcome under the estimated individualized treatment rule and that under the optimal treatment rule. The performance of the proposed approach is demonstrated via simulation studies and an analysis of chronic depression data.
Dynamic treatment regime; Individualized treatment rule; Weighted support vector machine; RKHS; Risk bound; Bayes classifier; Cross-validation
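A hedged, self-contained sketch of the outcome-weighted learning idea described above, using a generic weighted SVM rather than the authors' implementation; the data, names, and propensity value are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical randomized-trial data: covariates X, treatment A in {-1, +1}
# assigned with probability 0.5, and a clinical outcome R (larger is better).
rng = np.random.default_rng(1)
n, p = 500, 3
X = rng.normal(size=(n, p))
A = rng.choice([-1, 1], size=n)
R = 2.0 + X[:, 0] * A + rng.normal(size=n)
R = R - R.min() + 0.1                 # shift so all weights are positive

propensity = 0.5                      # P(A = a | X) under randomization
weights = R / propensity              # each subject weighted proportional to outcome

# Outcome-weighted learning: treat the problem as weighted classification of A from X;
# the fitted decision rule serves as the estimated individualized treatment rule.
owl = SVC(kernel="rbf", C=1.0)
owl.fit(X, A, sample_weight=weights)
estimated_rule = owl.predict(X)
```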
The Canadian Study of Health and Aging (CSHA) employed a prevalent cohort design to study survival after onset of dementia, where patients with dementia were sampled and the onset time of dementia was determined retrospectively. The prevalent cohort sampling scheme favors individuals who survive longer. Thus, the observed survival times are subject to length bias. In recent years, there has been a rising interest in developing estimation procedures for prevalent cohort survival data that not only account for length bias but also actually exploit the incidence distribution of the disease to improve efficiency. This article considers semiparametric estimation of the Cox model for the time from dementia onset to death under a stationarity assumption with respect to the disease incidence. Under the stationarity condition, the semiparametric maximum likelihood estimation is expected to be fully efficient yet difficult to perform for statistical practitioners, as the likelihood depends on the baseline hazard function in a complicated way. Moreover, the asymptotic properties of the semiparametric maximum likelihood estimator are not well-studied. Motivated by the composite likelihood method (Besag 1974), we develop a composite partial likelihood method that retains the simplicity of the popular partial likelihood estimator and can be easily performed using standard statistical software. When applied to the CSHA data, the proposed method estimates a significant difference in survival between the vascular dementia group and the possible Alzheimer’s disease group, while the partial likelihood method for left-truncated and right-censored data yields a greater standard error and a 95% confidence interval covering 0, thus highlighting the practical value of employing a more efficient methodology. To check the assumption of stable disease for the CSHA data, we also present new graphical and numerical tests in the article. The R code used to obtain the maximum composite partial likelihood estimator for the CSHA data is available in the online Supplementary Material, posted on the journal web site.
Backward and forward recurrence time; Cross-sectional sampling; Random truncation; Renewal processes
Recent proteomic studies have identified proteins related to specific phenotypes. In addition to marginal association analysis for individual proteins, analyzing pathways (functionally related sets of proteins) may yield additional valuable insights. Identifying pathways that differ between phenotypes can be conceptualized as a multivariate hypothesis testing problem: whether the mean vector μ of a p-dimensional random vector X is μ0. Proteins within the same biological pathway may correlate with one another in a complicated way, and type I error rates can be inflated if such correlations are incorrectly assumed to be absent. The inflation tends to be more pronounced when the sample size is very small or there is a large amount of missingness in the data, as is frequently the case in proteomic discovery studies. To tackle these challenges, we propose a regularized Hotelling's T² (RHT) statistic together with a non-parametric testing procedure, which effectively controls the type I error rate and maintains good power in the presence of complex correlation structures and missing data patterns. We investigate asymptotic properties of the RHT statistic under pertinent assumptions and compare the test performance with four existing methods through simulation examples. We apply the RHT test to a hormone therapy proteomics data set, and identify several interesting biological pathways for which blood serum concentrations changed following hormone therapy initiation.
Proteomics; Pathway analysis; Regularization; Hotelling's T²
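A minimal sketch, under the assumption that the regularized statistic ridge-stabilizes the sample covariance (a common construction; the article's exact definition and choice of tuning parameter may differ):

```python
import numpy as np

def regularized_hotelling_t2(X, mu0, lam):
    """Regularized Hotelling-type statistic: the sample covariance is stabilized
    with lam * I so the quadratic form is defined even when n <= p."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    diff = xbar - mu0
    return n * diff @ np.linalg.solve(S + lam * np.eye(p), diff)

# Hypothetical pathway with 40 proteins measured on only 15 samples
rng = np.random.default_rng(2)
X = rng.normal(size=(15, 40))
print(regularized_hotelling_t2(X, mu0=np.zeros(40), lam=0.5))
```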
We describe a new approach to analyzing chirp syllables of free-tailed bats from two regions of Texas in which they are predominant: Austin and College Station. Our goal is to characterize any systematic regional differences in the mating chirps and assess whether individual bats have signature chirps. The data are analyzed by modeling spectrograms of the chirps as responses in a Bayesian functional mixed model. Given the variable chirp lengths, we compute the spectrograms on a relative time scale interpretable as the relative chirp position, using a variable window overlap based on chirp length. We use 2D wavelet transforms to capture correlation within the spectrogram in our modeling and obtain adaptive regularization of the estimates and inference for the region-specific spectrograms. Our model includes random-effect spectrograms at the bat level to account for correlation among chirps from the same bat, and to assess relative variability in chirp spectrograms within and between bats. The modeling of spectrograms using functional mixed models is a general approach for the analysis of replicated nonstationary time series, such as our acoustical signals, to relate aspects of the signals to various predictors, while accounting for between-signal structure. This can be done on raw spectrograms when all signals are of the same length, and using spectrograms defined on a relative time scale for signals of variable length in settings where defining correspondence across signals based on relative position is sensible.
Bat syllable; Bayesian analysis; Chirp; Functional data analysis; Functional mixed model; Isomorphic transformation; Nonstationary time series; Software; Spectrogram; Variable overlap
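As a hypothetical illustration of the variable-overlap idea (not the authors' code), the sketch below computes spectrograms with an approximately fixed number of time frames, so that chirps of different lengths are indexed by relative position rather than absolute time:

```python
import numpy as np
from scipy.signal import spectrogram

def relative_time_spectrogram(x, fs, nperseg=256, n_frames=100):
    """Choose the window overlap from the signal length so that roughly n_frames
    windows cover the chirp; the time axis is then rescaled to [0, 1]."""
    step = max(1, (len(x) - nperseg) // (n_frames - 1))
    noverlap = nperseg - step
    f, t, Sxx = spectrogram(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return f, t / t[-1], Sxx

# Two synthetic chirps of different durations land on comparable relative-time grids
fs = 250_000
for dur in (0.010, 0.020):
    t = np.arange(int(fs * dur)) / fs
    chirp = np.sin(2 * np.pi * (30_000 + 2.0e6 * t) * t)
    f, rel_t, S = relative_time_spectrogram(chirp, fs)
    print(S.shape)
```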
Gaussian factor models have proven widely useful for parsimoniously characterizing dependence in multivariate data. There is a rich literature on their extension to mixed categorical and continuous variables, using latent Gaussian variables or through generalized latent trait models accommodating measurements in the exponential family. However, when generalizing to non-Gaussian measured variables, the latent variables typically influence both the dependence structure and the form of the marginal distributions, complicating interpretation and introducing artifacts. To address this problem we propose a novel class of Bayesian Gaussian copula factor models that decouple the latent factors from the marginal distributions. A semiparametric specification for the marginals based on the extended rank likelihood yields straightforward implementation and substantial computational gains. We provide new theoretical and empirical justifications for using this likelihood in Bayesian inference. We propose new default priors for the factor loadings and develop efficient parameter-expanded Gibbs sampling for posterior computation. The methods are evaluated through simulations and applied to a dataset in political science. The models in this paper are implemented in the R package bfa.
Factor analysis; Latent variables; Semiparametric; Extended rank likelihood; Parameter expansion; High dimensional
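A hedged way to write the decoupling described above (notation illustrative): latent Gaussian variables carry the factor structure, while arbitrary marginals F_j enter only through the copula,
\[ z_i = \Lambda \eta_i + \epsilon_i, \quad \eta_i \sim N_k(0, I), \quad \epsilon_i \sim N_p(0, \Sigma), \qquad y_{ij} = F_j^{-1}\{\Phi(z_{ij})\}, \]
with the diagonal of ΛΛ^⊤ + Σ fixed so the z_{ij} have unit variance. The loadings Λ then govern only the dependence, and the marginals F_j are handled semiparametrically through the extended rank likelihood rather than being modeled directly.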
In recent years, a wide range of markers have become available as potential tools to predict risk or progression of disease. In addition to such biological and genetic markers, short-term outcome information may be useful in predicting long-term disease outcomes. When such information is available, it would be desirable to combine it with predictive markers to improve the prediction of long-term survival. Most existing methods for incorporating censored short-term event information in predicting long-term survival focus on modeling the disease process and are derived under restrictive parametric models in a multi-state survival setting. When such model assumptions fail to hold, the resulting predictions of long-term outcomes may be invalid or inaccurate. When there is only a single discrete baseline covariate, a fully nonparametric estimation procedure to incorporate short-term event time information has been proposed previously. However, such an approach is not feasible for settings with one or more continuous covariates because of the curse of dimensionality. In this paper, we propose to incorporate short-term event time information along with multiple covariates collected up to a landmark point via a flexible varying-coefficient model. To evaluate and compare the prediction performance of the resulting landmark prediction rule, we use robust nonparametric procedures that do not require correct specification of the proposed varying-coefficient model. Simulation studies suggest that the proposed procedures perform well in finite samples. We illustrate them using a dataset of post-dialysis patients with end-stage renal disease.
Landmark prediction; Risk prediction; Survival time; Varying coefficient model
Gaussian latent factor models are routinely used for modeling of dependence in continuous, binary, and ordered categorical data. For unordered categorical variables, Gaussian latent factor models lead to challenging computation and complex modeling structures. As an alternative, we propose a novel class of simplex factor models. In the single-factor case, the model treats the different categorical outcomes as independent with unknown marginals. The model can characterize flexible dependence structures parsimoniously with few factors, and as factors are added, any multivariate categorical data distribution can be accurately approximated. Using a Bayesian approach for computation and inferences, a Markov chain Monte Carlo (MCMC) algorithm is proposed that scales well with increasing dimension, with the number of factors treated as unknown. We develop an efficient proposal for updating the base probability vector in hierarchical Dirichlet models. Theoretical properties are described, and we evaluate the approach through simulation examples. Applications are described for modeling dependence in nucleotide sequences and prediction from high-dimensional categorical features.
Classification; Contingency table; Factor analysis; Latent variable; Nonparametric Bayes; Nonnegative tensor factorization; Mutual information; Polytomous regression
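One hedged reading of the single- and multi-factor structure described above (illustrative notation): each subject i carries a simplex-valued factor score η_i = (η_{i1}, …, η_{ik}), and
\[ \Pr(y_{ij} = c \mid \eta_i) \;=\; \sum_{h=1}^{k} \eta_{ih}\, \lambda^{(h)}_{jc}, \]
where each λ^{(h)}_j is a category-probability vector for variable j. With k = 1 the outcomes are independent with unknown marginals, and adding factors builds an increasingly rich mixture that can approximate any multivariate categorical distribution.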
We provide a novel approach to dimension-reduction problems that is completely different from those in the existing literature. We cast the dimension-reduction problem in a semiparametric estimation framework and derive estimating equations. Viewing the problem from this new angle allows us to derive a rich class of estimators and to obtain the classical dimension-reduction techniques as special cases in this class. The semiparametric approach also reveals that, in the inverse regression context, the common assumptions of linearity and/or constant variance on the covariates can be removed while keeping the estimation structure intact, at the cost of performing additional nonparametric regression. The semiparametric estimators without these common assumptions are illustrated through simulation studies and a real data example. This article has online supplementary material.
Estimating equations; Nonparametric regression; Robustness; Semiparametric methods; Sliced inverse regression
The aim of this paper is to develop a semiparametric model for describing the variability of the medial representation of subcortical structures, which belongs to a Riemannian manifold, and to establish its association with covariates of interest, such as diagnostic status, age, and gender. We develop a two-stage estimation procedure to calculate the parameter estimates. The first stage calculates an intrinsic least squares estimator of the parameter vector using the annealing evolutionary stochastic approximation Monte Carlo algorithm; the second stage constructs a set of estimating equations to obtain a more efficient estimate, with the intrinsic least squares estimate as the starting point. We use Wald statistics to test linear hypotheses of the unknown parameters and establish their limiting distributions. Simulation studies are used to evaluate the accuracy of our parameter estimates and the finite sample performance of the Wald statistics. We apply our methods to the detection of differences in the morphological changes of the left and right hippocampi between schizophrenia patients and healthy controls using medial shape descriptions.
Intrinsic least squares estimator; Medial representation; Semiparametric model; Wald statistic
Inspired by the non-regular framework studied in Laber and Murphy (2011), we propose a family of adaptive classifiers. We discuss briefly their asymptotic properties and show that under the non-regular framework these classifiers have an “oracle property,” and consequently have smaller asymptotic variance and smaller asymptotic test error variance than those of the original classifier. We also show that confidence intervals for the test error of the adaptive classifiers, based on either normal approximation or centered percentile bootstrap, are consistent.
Infection and cardiovascular disease are leading causes of hospitalization and death in older patients on dialysis. Our recent work found an increase in the relative incidence of cardiovascular outcomes during the approximately 30 days after infection-related hospitalizations using the case series model, which adjusts for measured and unmeasured baseline confounders. However, a major challenge in modeling and assessing the infection-cardiovascular risk hypothesis is that the exact times of infection, or more generally “exposure,” onsets cannot be ascertained from hospitalization data; only imprecise markers of the timing of infection onset are available. Although there is a large literature on measurement error in predictors in regression modeling, to our knowledge there is no work to date on measurement error in the timing of a time-varying exposure. Thus, we propose a new method, the measurement error case series (MECS) models, to account for measurement error in time-varying exposure onsets. We characterize the general nature of the bias resulting from estimation that ignores measurement error and propose a bias-corrected estimation procedure for the MECS models. We examine in detail the accuracy of the proposed method for estimating the relative incidence. Hospitalization data from the United States Renal Data System, which captures nearly all (>99%) patients with end-stage renal disease in the U.S. over time, are used to illustrate the proposed method. The results suggest that the estimate of cardiovascular incidence during the 30 days after infection, a period when acute effects of infection on the vascular endothelium may be most pronounced, is substantially attenuated in the presence of measurement error in infection onset.
Cardiovascular outcomes; Case series models; End stage renal disease; Infection; Measurement error; Non-homogeneous Poisson process; Time-varying exposure onset; United States Renal Data System
We examine the use of fixed-effects and random-effects moment-based meta-analytic methods for the analysis of binary adverse event data. Special attention is paid to the case of rare adverse events, which are commonly encountered in routine practice. We study estimation of model parameters and between-study heterogeneity. In addition, we examine traditional approaches to hypothesis testing of the average treatment effect and detection of heterogeneity of treatment effect across studies. We derive three new methods: a simple (unweighted) average treatment effect estimator, a new heterogeneity estimator, and a parametric bootstrap test for heterogeneity. We then study the statistical properties of both the traditional and new methods via simulation. We find that, in general, moment-based estimators of combined treatment effects and heterogeneity are biased and that the degree of bias is proportional to the rarity of the event under study. The new methods eliminate much, but not all, of this bias. The various estimators and hypothesis testing methods are then compared and contrasted using an example dataset on treatment of stable coronary artery disease.
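For context, the following is a hedged sketch of the standard moment-based (DerSimonian–Laird type) machinery that such analyses build on; the study counts and the 0.5 continuity correction are purely illustrative:

```python
import numpy as np

def dl_random_effects(y, v):
    """Moment-based random-effects meta-analysis: y are study effect estimates
    (e.g., log odds ratios) and v their within-study variances."""
    w = 1.0 / v
    mu_fixed = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - mu_fixed) ** 2)                    # heterogeneity statistic
    k = len(y)
    tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_star = 1.0 / (v + tau2)
    return np.sum(w_star * y) / np.sum(w_star), tau2

# Hypothetical rare-event trials: events / sample size in treatment and control arms,
# with 0.5 added to every cell to handle zero counts
rt, nt = np.array([1, 0, 2, 1, 0]), np.array([100, 120, 90, 150, 80])
rc, nc = np.array([3, 2, 4, 2, 1]), np.array([100, 115, 95, 140, 85])
a, b = rt + 0.5, nt - rt + 0.5
c, d = rc + 0.5, nc - rc + 0.5
y = np.log((a * d) / (b * c))                              # log odds ratios
v = 1 / a + 1 / b + 1 / c + 1 / d
print(dl_random_effects(y, v))
```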
In this paper we develop a method to estimate both individual social network size (i.e., degree) and the distribution of network sizes in a population by asking respondents how many people they know in specific subpopulations (e.g., people named Michael). Building on the scale-up method of Killworth et al. (1998b) and other previous attempts to estimate individual network size, we propose a latent non-random mixing model that resolves three known problems with previous approaches. As a byproduct, our method also provides estimates of the rate of social mixing between population groups. We demonstrate the model using a sample of 1,370 adults originally collected by McCarty et al. (2001). Based on insights developed during the statistical modeling, we conclude by offering practical guidelines for the design of future surveys to estimate social network size. Most importantly, we show that if the first names to be asked about are chosen properly, the simple scale-up degree estimates can enjoy the same bias reduction as those from our more complex latent non-random mixing model.
Social networks; Survey design; Personal network size; Negative binomial distribution; Latent non-random mixing model
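For reference, the basic scale-up estimator of Killworth et al. that the latent non-random mixing model refines takes, for respondent i,
\[ \hat{d}_i \;=\; N \cdot \frac{\sum_{k} y_{ik}}{\sum_{k} N_k}, \]
where y_{ik} is the number of people respondent i reports knowing in subpopulation k, N_k is the known size of that subpopulation, and N is the total population size; the latent non-random mixing model corrects this estimator for the fact that acquaintance with the probe subpopulations is not random across respondents.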
The pretest–posttest study design is commonly used in medical and social science research to assess the effect of a treatment or an intervention. Recently, interest has been rising in developing inference procedures that improve efficiency while relaxing assumptions used in the pretest–posttest data analysis, especially when the posttest measurement might be missing. In this article we propose a semiparametric estimation procedure based on empirical likelihood (EL) that incorporates the common baseline covariate information to improve efficiency. The proposed method also yields an asymptotically unbiased estimate of the response distribution. Thus functions of the response distribution, such as the median, can be estimated straightforwardly, and the EL method can provide a more appealing estimate of the treatment effect for skewed data. We show that, compared with existing methods, the proposed EL estimator has appealing theoretical properties, especially when the working model for the underlying relationship between the pretest and posttest measurements is misspecified. A series of simulation studies demonstrates that the EL-based estimator outperforms its competitors when the working model is misspecified and the data are missing at random. We illustrate the methods by analyzing data from an AIDS clinical trial (ACTG 175).
Auxiliary information; Biased sampling; Causal inference; Observational study; Survey sampling
There is often interest in predicting an individual’s latent health status based on high-dimensional biomarkers that vary over time. Motivated by time-course gene expression array data that we have collected in two influenza challenge studies performed with healthy human volunteers, we develop a novel time-aligned Bayesian dynamic factor analysis methodology. The time course trajectories in the gene expressions are related to a relatively low-dimensional vector of latent factors, which vary dynamically starting at the latent initiation time of infection. Using a nonparametric cure rate model for the latent initiation times, we allow selection of the genes in the viral response pathway, variability among individuals in infection times, and a subset of individuals who are not infected. As we demonstrate using held-out data, this statistical framework allows accurate predictions of infected individuals in advance of the development of clinical symptoms, without labeled data and even when the number of biomarkers vastly exceeds the number of individuals under study. Biological interpretation of several of the inferred pathways (factors) is provided.
Bayesian nonparametrics; Dynamic factor analysis; High-dimensional; Infectious disease; Joint model; Multidimensional longitudinal data; Multivariate functional data; Predictive model
It is frequently of interest to estimate the intervention effect that adjusts for post-randomization variables in clinical trials. In the recently completed HPTN 035 trial, there is differential condom use between the three microbicide gel arms and the No Gel control arm, so that intention-to-treat (ITT) analyses assess only the net treatment effect, which includes the indirect treatment effect mediated through differential condom use. Various statistical methods in causal inference have been developed to adjust for post-randomization variables. We extend the principal stratification framework to time-varying behavioral variables in HIV prevention trials with a time-to-event endpoint, using a partially hidden Markov model (pHMM). We formulate the causal estimand of interest, establish assumptions that enable identifiability of the causal parameters, and develop maximum likelihood methods for estimation. Application of our model to the HPTN 035 trial reveals an interesting pattern of prevention effectiveness among different condom-use principal strata.
Microbicide; Causal inference; Posttreatment variables; Direct effect