Treating patients with novel biological agents is becoming a leading trend in oncology. Unlike cytotoxic agents, for which efficacy and toxicity monotonically increase with dose, biological agents may exhibit non-monotonic patterns in their dose-response relationships. Using a trial with two biological agents as an example, we propose a dose-finding design to identify the biologically optimal dose combination (BODC), which is defined as the dose combination of the two agents with the highest efficacy and tolerable toxicity. A change-point model is used to reflect the fact that the dose-toxicity surface of the combinational agents may plateau at higher dose levels, and a flexible logistic model is proposed to accommodate the possible non-monotonic pattern for the dose-efficacy relationship. During the trial, we continuously update the posterior estimates of toxicity and efficacy and assign patients to the most appropriate dose combination. We propose a novel dose-finding algorithm to encourage sufficient exploration of untried dose combinations in the two-dimensional space. Extensive simulation studies show that the proposed design has desirable operating characteristics in identifying the BODC under various patterns of dose-toxicity and dose-efficacy relationships.
Biologically optimal dose combination; Non-monotonic pattern; Drug combination; Dose finding; Change-point model; Adaptive design
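The BODC definition in the abstract above (highest efficacy subject to tolerable toxicity) can be illustrated with a minimal sketch. It assumes that posterior mean toxicity and efficacy probabilities are already available for every dose combination; the function name `select_bodc`, the matrix layout, and the toxicity cap of 0.3 are hypothetical choices for illustration, not the authors' algorithm, which also handles exploration of untried combinations.

```python
from typing import List, Optional, Tuple


def select_bodc(
    tox: List[List[float]],
    eff: List[List[float]],
    tox_cap: float = 0.3,  # hypothetical tolerability threshold
) -> Optional[Tuple[int, int]]:
    """Return the (row, column) index of the dose combination with the
    highest estimated efficacy among those whose estimated toxicity
    does not exceed tox_cap, or None if no combination is tolerable."""
    best = None
    best_eff = -1.0
    for j, (tox_row, eff_row) in enumerate(zip(tox, eff)):
        for k, (t, e) in enumerate(zip(tox_row, eff_row)):
            if t <= tox_cap and e > best_eff:
                best, best_eff = (j, k), e
    return best


# Illustrative (made-up) estimates for a 3 x 3 grid of combinations.
tox_hat = [[0.05, 0.10, 0.20], [0.10, 0.25, 0.40], [0.15, 0.35, 0.50]]
eff_hat = [[0.10, 0.30, 0.35], [0.20, 0.50, 0.60], [0.25, 0.55, 0.45]]
print(select_bodc(tox_hat, eff_hat))  # (1, 1)
```

Note that the efficacy surface in the made-up example is non-monotonic, so the selected combination is not simply the highest tolerable dose pair.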
There is growing interest in understanding the heterogeneity of treatment effects (HTE), which has important implications in treatment evaluation and selection. The standard approach to assessing HTE (i.e. subgroup analyses based on known effect modifiers) is informative about the heterogeneity between subpopulations but not within. It is arguably more informative to assess HTE in terms of individual treatment effects, which can be defined by using potential outcomes. However, estimation of HTE based on potential outcomes is challenged by the lack of complete identifiability. The paper proposes methods to deal with the identifiability problem by using relevant information in baseline covariates and repeated measurements. If a set of covariates is sufficient for explaining the dependence between potential outcomes, the joint distribution of potential outcomes and hence all measures of HTE will then be identified under a conditional independence assumption. Possible violations of this assumption can be addressed by including a random effect to account for residual dependence or by specifying the conditional dependence structure directly. The methods proposed are shown to reduce effectively the uncertainty about HTE in a trial of human immunodeficiency virus.
Causal inference; Conditional independence; Copula; Counterfactual; Random effect; Sensitivity analysis
In a unique longitudinal study of teen driving, risky driving behavior and the occurrence of crashes or near crashes are measured prospectively over the first 18 months of licensure. Of scientific interest is relating the two processes and developing a predictor of crashes from previous risky driving behavior. In this work, we propose two latent class models for relating risky driving behavior to the occurrence of a crash or near crash event. The first approach models the binary longitudinal crash/near crash outcome using a binary latent variable which depends on risky driving covariates and previous outcomes. A random effects model introduces heterogeneity among subjects in modeling the mean value of the latent state. The second approach extends the first model to the ordinal case where the latent state is composed of K ordinal classes. Additionally, we discuss an alternate hidden Markov model formulation. Estimation is performed using the expectation-maximization (EM) algorithm and Monte Carlo EM. We illustrate the importance of using these latent class modeling approaches through the analysis of the teen driving behavior.
driving study; latent class modeling; Monte Carlo EM
Motivated by the need to understand the dynamics of relationship formation and dissolution over time in real-world social networks, we develop a new longitudinal model for transitions in the relationship status of pairs of individuals (“dyads”). We first specify a model for the relationship status of a single dyad and then extend it to account for important inter-dyad dependencies (e.g., transitivity – “a friend of a friend is a friend”) and heterogeneity. Model parameters are estimated using Bayesian analysis implemented via Markov chain Monte Carlo. We use the model to perform novel analyses of two diverse longitudinal friendship networks: an excerpt of the Teenage Friends and Lifestyle Study (a moderately sized network) and the Framingham Heart Study (FHS) (a large network).
Bayesian; Dyadic independence; Latent variables; Longitudinal model; Social networks and health; Transitivity
Partial area under the ROC curve (PAUC) has been proposed for gene selection in Pepe et al. (2003) and thereafter applied in real data analysis. It was noticed from empirical studies that this measure has several key weaknesses, such as an inability to reflect nonuniform weighting of different decision thresholds, resulting in large numbers of ties. We propose the weighted area under the ROC curve (WAUC) in this paper to address the problems associated with PAUC. Our proposed measure enjoys a greater flexibility to describe the discrimination accuracy of genes. Nonparametric and parametric estimation methods are introduced, including PAUC as a special case, along with theoretical properties of the estimators. We also provide a simple variance formula, yielding a novel variance estimator for nonparametric estimation of PAUC, which has proven challenging in previous work. The proposed methods permit sensitivity analyses, whereby the impact of differing weight functions on gene rankings may be assessed and results may be synthesized across weights. Simulations and re-analysis of two well-known microarray datasets illustrate the practical utility of WAUC.
Gene selection; Empirical distribution; Location-scale model; Partial area under the curve; Random threshold; Weighted area under the curve
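To make the relationship between AUC, PAUC, and a weighted area concrete, the following is a minimal nonparametric sketch. It assumes the weighted area takes the form of the integral of ROC(t) w(t) dt over false-positive rates t in [0, 1], with PAUC recovered by the weight w(t) = 1 for t ≤ t0 and 0 otherwise. The function `weighted_auc` and its `weight_cdf` argument (the cumulative weight W(t), the integral of w from 0 to t) are hypothetical names, ties are ignored, and this is not claimed to be the estimator derived in the paper.

```python
from typing import Callable, Sequence


def weighted_auc(
    cases: Sequence[float],
    controls: Sequence[float],
    weight_cdf: Callable[[float], float],
) -> float:
    """Integrate the empirical ROC step function against a weight.

    The empirical ROC is constant on each false-positive-rate interval
    [(k-1)/n, k/n), where it equals the fraction of cases exceeding the
    k-th largest control value. weight_cdf(t) is the integral of the
    weight function from 0 to t.
    """
    n, m = len(controls), len(cases)
    total = 0.0
    for k, c in enumerate(sorted(controls, reverse=True), start=1):
        tpr = sum(x > c for x in cases) / m
        total += tpr * (weight_cdf(k / n) - weight_cdf((k - 1) / n))
    return total


cases = [3, 5, 7, 9]
controls = [1, 2, 4, 6, 8]
auc = weighted_auc(cases, controls, lambda t: t)             # uniform weight
pauc = weighted_auc(cases, controls, lambda t: min(t, 0.4))  # PAUC, t0 = 0.4
print(round(auc, 6), round(pauc, 6))  # 0.7 0.15
```

With the uniform weight the result coincides with the Mann-Whitney estimate of the AUC (14 of 20 case-control pairs are correctly ordered in this toy data set).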
Acute infectious diseases are transmitted over networks of social contacts. Epidemic models are used to predict the spread of emergent pathogens and compare intervention strategies. Many of these models assume equal probability of contact within mixing groups (homes, schools, etc.), but little work has inferred the actual contact network, which may influence epidemic estimates. We develop a penalized likelihood method to infer contact networks within households, a key area for disease transmission. Using egocentric surveys of contact behavior in Belgium, we estimate within-household contact networks for six different age compositions. Our estimates show dependency in contact behavior and vary substantively by age composition, with fewer contacts occurring in older households. Our results are relevant for epidemic models used to make policy recommendations.
Group (pooled) testing is often used to reduce the total number of tests that are needed to screen a large number of individuals for an infectious disease or some other binary characteristic. Traditionally, research in group testing has assumed that each individual is independent with the same risk of positivity. More recently, there has been a growing set of literature generalizing previous work in group testing to include heterogeneous populations so that each individual has a different risk of positivity. We investigate the effect of acknowledging population heterogeneity on a commonly used group testing procedure which is known as ‘halving’. For this procedure, positive groups are successively split into two equal-sized halves until all groups test negatively or until individual testing occurs. We show that heterogeneity does not affect the mean number of tests when individuals are randomly assigned to subgroups. However, when individuals are assigned to subgroups on the basis of their risk probabilities, we show that our proposed procedures reduce the number of tests by taking advantage of the heterogeneity. This is illustrated by using chlamydia and gonorrhoea screening data from the state of Nebraska.
Binary response; Classification; Identification; Pooled testing; Retesting; Screening
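The halving procedure described in the abstract above can be sketched as a short recursion. The sketch assumes a perfectly accurate assay, independent individuals, and no shortcut such as inferring that the second half is positive when the first half tests negative; the function names `halving_tests` and `expected_tests` are hypothetical. The expected number of tests is computed by brute-force enumeration, so it is only practical for small groups.

```python
from itertools import product
from typing import Sequence


def halving_tests(status: Sequence[int]) -> int:
    """Number of tests used by halving for known true statuses (1 = positive)."""
    if len(status) == 1 or not any(status):
        return 1  # one test settles a negative group or a single individual
    mid = len(status) // 2
    return 1 + halving_tests(status[:mid]) + halving_tests(status[mid:])


def expected_tests(probs: Sequence[float]) -> float:
    """Exact expected number of tests when individual i is positive with
    probability probs[i], independently, in the given group order."""
    total = 0.0
    for status in product((0, 1), repeat=len(probs)):
        weight = 1.0
        for s, p in zip(status, probs):
            weight *= p if s else 1.0 - p
        total += weight * halving_tests(status)
    return total


print(halving_tests([1, 0, 0, 0, 0, 0, 0, 0]))             # 7
print(round(expected_tests([0.01, 0.01, 0.30, 0.30]), 6))  # 3.099302
print(round(expected_tests([0.01, 0.30, 0.01, 0.30]), 6))  # 3.267502
```

In this made-up example, placing the two low-risk individuals in the same half gives a smaller expected number of tests than interleaving low- and high-risk individuals, which is the kind of gain from risk-based subgroup assignment that the abstract describes.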
A subjectively chosen sampling ratio between the case and the control groups is not always an efficient choice for maximizing the power or minimizing the total required sample size in comparative diagnostic trials. We derive explicit expressions for an optimal sampling ratio based on a common variance structure shared by several existing summary statistics of the receiver operating characteristic curve. We propose a two-stage procedure to adaptively estimate the optimal ratio without pilot data. We investigate the properties of the proposed method through theoretical proofs, extensive simulation studies and a real example in cancer diagnostic studies.
Area under the curve; Diagnostic accuracy; Partial area under the curve; Power; Receiver operating characteristic curve; Two-stage design
Ecological momentary assessment (EMA) is a method for collecting real-time data in subjects’ environments. It often uses electronic devices to obtain information on psychological state through administration of questionnaires at times selected from a probability-based sampling design. This information can be used to model the impact of momentary variation in psychological state on the lifetimes to events such as smoking lapse. Motivated by this, a probability-sampling framework is proposed for estimating the impact of time-varying covariates on the lifetimes to events. Presented as an alternative to joint modeling of the covariate process as well as event lifetimes, this framework calls for sampling covariates at the event lifetimes and at times selected according to a probability-based sampling design. A design-unbiased estimator for the cumulative hazard is substituted into the log likelihood, and the resulting objective function is maximized to obtain the proposed estimator. This estimator has two quantifiable sources of variation, that due to the survival model and that due to sampling the covariates. Data from a nicotine patch trial are used to illustrate the proposed approach.
Ecological momentary assessment; Estimating equations; Parametric hazard; Smoking
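The idea of substituting a design-unbiased estimator for the cumulative hazard can be sketched in the Horvitz-Thompson style: each sampled covariate value contributes its hazard divided by the sampling intensity at its assessment time. The sketch below assumes a log-linear parametric hazard exp(beta0 + beta1 * x) and known sampling intensities; the function name, the hazard form, and the parameter names are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def ht_cumulative_hazard(
    covariates: Sequence[float],
    intensities: Sequence[float],
    beta0: float,
    beta1: float,
) -> float:
    """Horvitz-Thompson-type estimate of the cumulative hazard.

    covariates[i] is the covariate value observed at the i-th sampled
    time, and intensities[i] is the sampling intensity (expected number
    of assessments per unit time) at that time. The hazard is assumed
    to be exp(beta0 + beta1 * x).
    """
    total = 0.0
    for x, pi in zip(covariates, intensities):
        total += math.exp(beta0 + beta1 * x) / pi
    return total


# Five assessments spread uniformly over 10 time units: intensity 0.5 each.
# With a constant covariate the estimate equals 10 * exp(beta0 + beta1 * x).
estimate = ht_cumulative_hazard([2.0] * 5, [0.5] * 5, beta0=-3.0, beta1=0.5)
print(math.isclose(estimate, 10 * math.exp(-2.0)))  # True
```

In the framework described above, an estimate of this kind would replace the exact cumulative hazard inside the log likelihood before maximizing over the regression parameters.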
Hierarchical models (HM) have been used extensively in multisite time series studies of air pollution and health to estimate health effects of a single pollutant adjusted for other pollutants and other time-varying factors. Recently, the Environmental Protection Agency (EPA) has called for research quantifying health effects of simultaneous exposure to many air pollutants. However, straightforward application of HM in this context is challenged by the need to specify a random-effect distribution on a high-dimensional vector of nuisance parameters. Here we introduce reduced HM as a general statistical approach for analyzing correlated data with many nuisance parameters. For reduced HM we first calculate the integrated likelihood of the parameter of interest (e.g. excess number of deaths attributed to simultaneous exposure to high levels of many pollutants), and we then specify a flexible random-effect distribution directly on this parameter. Simulation studies show that the reduced HM performs comparably to the full HM in many scenarios, and even performs better in some cases, particularly when the multivariate random-effect distribution of the full HM is misspecified. Methods are applied to estimate relative risks of cardiovascular hospital admissions associated with simultaneous exposure to elevated levels of particulate matter and ozone in 51 US counties during 1999–2005.
Air pollution; Multilevel models; Multisite time series data; Nuisance parameters; Random effects
The proportional odds logistic regression model is widely used for relating an ordinal outcome to a set of covariates. When the number of outcome categories is relatively large, the sample size is relatively small, and/or certain outcome categories are rare, maximum likelihood can yield biased estimates of the regression parameters. Firth (1993) and Kosmidis and Firth (2009) proposed a procedure to remove the leading term in the asymptotic bias of the maximum likelihood estimator. Their approach is most easily implemented for univariate outcomes. In this paper, we derive a bias correction that exploits the proportionality between Poisson and multinomial likelihoods for multinomial regression models. Specifically, we describe a bias correction for the proportional odds logistic regression model, based on the likelihood from a collection of independent Poisson random variables whose means are constrained to sum to 1, that is straightforward to implement. The proposed method is motivated by a study of predictors of post-operative complications in patients undergoing colon or rectal surgery (Gawande et al., 2007).
Discrete response; multinomial likelihood; multinomial logistic regression; penalized likelihood; Poisson likelihood
Typical oncology practice often includes not only an initial, frontline treatment, but also subsequent treatments given if the initial treatment fails. The physician chooses a treatment at each stage based on the patient’s baseline covariates and history of previous treatments and outcomes. Such sequentially adaptive medical decision-making processes are known as dynamic treatment regimes, treatment policies, or multi-stage adaptive treatment strategies. Conventional analyses in terms of frontline treatments that ignore subsequent treatments may be misleading, because they actually are an evaluation of more than front-line treatment effects on outcome. We are motivated by data from a randomized trial of four combination chemotherapies given as frontline treatments to patients with acute leukemia. Most patients in the trial also received a second-line treatment, chosen adaptively and subjectively rather than by randomization, either because the initial treatment was ineffective or the patient’s cancer later recurred. We evaluate effects on overall survival time of the 16 two-stage strategies that actually were used. Our methods include a likelihood-based regression approach in which the transition times of all possible multi-stage outcome paths are modeled, and estimating equations with inverse probability of treatment weighting to correct for bias. While the two approaches give different numerical estimates of mean survival time, they lead to the same substantive conclusions when comparing the two-stage regimes.
Causal inference; Clinical trial; Dynamic treatment regime; Treatment policy
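The inverse probability of treatment weighting mentioned in the abstract above can be sketched as a weighted mean over patients whose observed treatment sequence is consistent with a given two-stage regime. The sketch assumes the probability of each patient's observed treatment sequence is known or has already been estimated; the function name and inputs are hypothetical, and censoring of survival times, which a real analysis must handle, is ignored.

```python
from typing import Sequence


def iptw_mean(
    outcomes: Sequence[float],
    followed: Sequence[bool],
    propensities: Sequence[float],
) -> float:
    """Normalized IPTW estimate of the mean outcome under a regime.

    followed[i] is True when patient i's observed treatments are
    consistent with the regime; propensities[i] is the probability of
    receiving that observed treatment sequence given the patient's
    history.
    """
    num = 0.0
    den = 0.0
    for y, f, p in zip(outcomes, followed, propensities):
        if f:
            num += y / p
            den += 1.0 / p
    return num / den


# Made-up survival times (months) for four patients.
print(iptw_mean([10, 20, 30, 40], [True, True, False, True],
                [0.5, 0.25, 0.5, 0.5]))  # 22.5
```

Patients who were unlikely to receive the treatments they actually received get larger weights, which is how the weighting corrects for second-line treatments being chosen adaptively rather than by randomization.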
We analyse data from a study involving 173 pregnant women. The data are observed values of the β human chorionic gonadotropin hormone measured during the first 80 days of gestational age, including from one up to six longitudinal responses for each woman. The main objective in this study is to predict normal versus abnormal pregnancy outcomes from data that are available at the early stages of pregnancy. We achieve the desired classification with a semiparametric hierarchical model. Specifically, we consider a Dirichlet process mixture prior for the distribution of the random effects in each group. The unknown random-effects distributions are allowed to vary across groups but are made dependent by using a design vector to select different features of a single underlying random probability measure. The resulting model is an extension of the dependent Dirichlet process model, with an additional probability model for group classification. The model is shown to perform better than an alternative model which is based on independent Dirichlet processes for the groups. Relevant posterior distributions are summarized by using Markov chain Monte Carlo methods.
Dependent non-parametric model; Discriminant analysis; Longitudinal data; Markov chain Monte Carlo sampling; Non-parametric modelling; Random-effects models; Species sampling models
The paper describes a Bayesian spatial discrete time survival model to estimate the effect of air pollution on the risk of preterm birth. The standard approach treats prematurity as a binary outcome and cannot effectively examine time varying exposures during pregnancy. Time varying exposures can arise either in short-term lagged exposures due to seasonality in air pollution or long-term cumulative exposures due to changes in length of exposure. Our model addresses this challenge by viewing gestational age as time-to-event data where each pregnancy becomes at risk at a prespecified time (e.g. the 28th week). The pregnancy is then followed until either a birth occurs before the 37th week (preterm), or it reaches the 37th week, and a full-term birth is expected. The model also includes a flexible spatially varying baseline hazard function to control for unmeasured spatial confounders and to borrow information across areal units. The approach proposed is applied to geocoded birth records in Mecklenburg County, North Carolina, for the period 2001–2005. We examine the risk of preterm birth that is associated with total cumulative and 4-week lagged exposure to ambient fine particulate matter.
Air pollution; Fine particulate matter; Preterm birth; Reproductive epidemiology; Spatial survival data
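The discrete time survival view of gestational age described above can be illustrated by expanding each pregnancy into one record per week at risk. This is a minimal sketch assuming entry at week 28 and administrative censoring at week 37, as in the abstract; the function name `person_period_rows` and the tuple layout are hypothetical, and the spatial and exposure components of the model are omitted.

```python
from typing import List, Tuple


def person_period_rows(
    birth_week: int, entry_week: int = 28, term_week: int = 37
) -> List[Tuple[int, int]]:
    """Expand one pregnancy into (week, event) records.

    event is 1 only in the week of a preterm birth (birth_week <
    term_week). Pregnancies reaching term_week contribute event-free
    records for every week from entry_week to term_week - 1.
    """
    if birth_week < entry_week:
        raise ValueError("birth occurred before the pregnancy entered the risk set")
    last_week = min(birth_week, term_week - 1)
    rows = []
    for week in range(entry_week, last_week + 1):
        event = 1 if (week == birth_week and birth_week < term_week) else 0
        rows.append((week, event))
    return rows


print(person_period_rows(30))       # [(28, 0), (29, 0), (30, 1)]
print(len(person_period_rows(39)))  # 9
```

Each weekly record could then carry that week's lagged or cumulative exposure value, which is what allows time varying exposures to enter the hazard model.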
We propose a randomized phase II clinical trial design based on Bayesian adaptive randomization and predictive probability monitoring. Adaptive randomization assigns more patients to a more efficacious treatment arm by comparing the posterior probabilities of efficacy between different arms. We continuously monitor the trial by using the predictive probability. The trial is terminated early when it is shown that one treatment is overwhelmingly superior to others or that all the treatments are equivalent. We develop two methods to compute the predictive probability by considering the uncertainty of the sample size of the future data. We illustrate the proposed Bayesian adaptive randomization and predictive probability design by using a phase II lung cancer clinical trial, and we conduct extensive simulation studies to examine the operating characteristics of the design. By coupling adaptive randomization and predictive probability approaches, the trial can treat more patients with a more efficacious treatment and allow for early stopping whenever sufficient information is obtained to conclude treatment superiority or equivalence. The design proposed also controls both the type I and the type II errors and offers an alternative Bayesian approach to the frequentist group sequential design.
Adaptive randomization; Bayesian inference; Clinical trial ethics; Group sequential method; Posterior predictive distribution; Randomized trial; Type I error; Type II error
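The comparison of posterior probabilities of efficacy that drives adaptive randomization can be sketched for two arms with binary responses. The sketch assumes independent Beta(1, 1) priors and estimates the posterior probability that arm A has the higher response rate by Monte Carlo; the function names, the prior, and the power transformation used to temper the allocation probability are assumptions for illustration, and the predictive probability monitoring described in the abstract is not implemented here.

```python
import random


def prob_a_beats_b(
    succ_a: int, n_a: int, succ_b: int, n_b: int,
    draws: int = 20000, seed: int = 0,
) -> float:
    """Monte Carlo estimate of P(p_A > p_B | data) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_a = rng.betavariate(1 + succ_a, 1 + n_a - succ_a)
        p_b = rng.betavariate(1 + succ_b, 1 + n_b - succ_b)
        wins += p_a > p_b
    return wins / draws


def randomization_prob(post_prob: float, power: float = 0.5) -> float:
    """Probability of assigning the next patient to arm A.

    The power (a hypothetical tuning parameter) pulls the allocation
    toward 0.5 so that early, noisy data do not dominate.
    """
    a = post_prob ** power
    b = (1.0 - post_prob) ** power
    return a / (a + b)


post = prob_a_beats_b(succ_a=8, n_a=10, succ_b=2, n_b=10)
print(post > 0.95, randomization_prob(0.5))  # True 0.5
```

Setting `power` to 1 would allocate patients in direct proportion to the posterior probability, while values closer to 0 keep the allocation closer to equal randomization.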
Treatment of schizophrenia is notoriously difficult and typically requires personalized adaptation of treatment due to lack of efficacy of treatment, poor adherence, or intolerable side effects. The Clinical Antipsychotic Trials in Intervention Effectiveness (CATIE) Schizophrenia Study is a sequential multiple assignment randomized trial comparing the typical antipsychotic medication, perphenazine, to several newer atypical antipsychotics. This paper describes the marginal structural modeling method for estimating optimal dynamic treatment regimes and applies the approach to the CATIE Schizophrenia Study. Missing data and valid estimation of confidence intervals are also addressed.
Adaptive treatment strategies; causal effects; dynamic treatment regimes; inverse probability weighting; marginal structural models; personalized medicine; schizophrenia
In complex survey sampling, a fraction of a finite population is sampled. Often, the survey is conducted so that each subject in the population has a different probability of being selected into the sample. Further, many complex surveys involve stratification and clustering. For generalizability of the sample to the finite population, these features of the design are usually incorporated in the analysis. While the Wilcoxon rank sum test is commonly used to compare an ordinal variable in bivariate analyses, no simple extension of the Wilcoxon rank sum test has been proposed for complex survey data. With multinomial sampling of independent subjects, the Wilcoxon rank-sum test statistic equals the score test statistic for the group effect from a proportional odds cumulative logistic regression model for an ordinal outcome. Using this regression framework, for complex survey data, we formulate a similar proportional odds cumulative logistic regression model for the ordinal variable, and use an estimating equations score statistic for no group effect as an extension of the Wilcoxon test. The proposed method is applied to a complex survey designed to produce national estimates of the health care use, expenditures, sources of payment, and insurance coverage.
Cumulative logistic model; Medical Expenditure Panel Survey; Proportional odds model; Score statistic; Weighted estimating equations
Climate change may lead to changes in several aspects of the distribution of climate variables, including changes in the mean, increased variability, and severity of extreme events. In this paper, we propose using spatiotemporal quantile regression as a flexible and interpretable method for simultaneously detecting changes in several features of the distribution of climate variables. The spatiotemporal quantile regression model assumes that each quantile level changes linearly in time, permitting straight-forward inference on the time trend for each quantile level. Unlike classical quantile regression which uses model-free methods to analyze a single quantile or several quantiles separately, we take a model-based approach which jointly models all quantiles, and thus the entire response distribution. In the spatiotemporal quantile regression model, each spatial location has its own quantile function that evolves over time, and the quantile functions are smoothed spatially using Gaussian process priors. We propose a basis expansion for the quantile function that permits a closed-form for the likelihood, and allows for residual correlation modeling via a Gaussian spatial copula. We illustrate the methods using temperature data for the southeast US from the years 1931–2009. For these data, borrowing information across space identifies more significant time trends than classical non-spatial quantile regression. We find a decreasing time trend for much of the spatial domain for monthly mean and maximum temperatures. For the lower quantiles of monthly minimum temperature, we find a decrease in Georgia and Florida, and an increase in Virginia and the Carolinas.
Bayesian hierarchical model; climate change; non-Gaussian data; US temperature data; warming hole
We describe and analyze a longitudinal diffusion tensor imaging (DTI) study relating changes in the microstructure of intracranial white matter tracts to cognitive disability in multiple sclerosis patients. In this application the scalar outcome and the functional exposure are measured longitudinally. This data structure is new and raises challenges that cannot be addressed with current methods and software. To analyze the data, we introduce a penalized functional regression model and inferential tools designed specifically for these emerging types of data. Our proposed model extends the Generalized Linear Mixed Model by adding functional predictors; this method is computationally feasible and is applicable when the functional predictors are measured densely, sparsely or with error. An online appendix compares two implementations, one likelihood-based and the other Bayesian, and provides the software used in simulations; the likelihood-based implementation is included as the lpfr() function in the R package refund available on CRAN.
Bayesian Inference; Functional Regression; Mixed Models; Smoothing Splines
The prognosis for patients with high grade gliomas is poor, with a median survival of 1 year. Treatment efficacy assessment is typically unavailable until 5-6 months post diagnosis. Investigators hypothesize that quantitative magnetic resonance imaging can assess treatment efficacy 3 weeks after therapy starts, thereby allowing salvage treatments to begin earlier. The purpose of this work is to build a predictive model of treatment efficacy by using quantitative magnetic resonance imaging data and to assess its performance. The outcome is 1-year survival status. We propose a joint, two-stage Bayesian model. In stage I, we smooth the image data with a multivariate spatiotemporal pairwise difference prior. We propose four summary statistics that are functionals of posterior parameters from the first-stage model. In stage II, these statistics enter a generalized non-linear model as predictors of survival status. We use the probit link and a multivariate adaptive regression spline basis. Gibbs sampling and reversible jump Markov chain Monte Carlo methods are applied iteratively between the two stages to estimate the posterior distribution. Through both simulation studies and model performance comparisons we find that we can achieve higher overall correct classification rates by accounting for the spatiotemporal correlation in the images and by allowing for a more complex and flexible decision boundary provided by the generalized non-linear model.
Bayesian analysis; Image analysis; Multivariate adaptive regression splines; Multivariate pairwise difference prior; Quantitative magnetic resonance imaging; Spatiotemporal model
In family-based longitudinal genetic studies, investigators collect repeated measurements on a trait that changes with time along with genetic markers. Since repeated measurements are nested within subjects and subjects are nested within families, both the subject-level and measurement-level correlations must be taken into account in the statistical analysis to achieve more accurate estimation. In such studies, the primary interests include testing for a quantitative trait locus (QTL) effect and estimating the age-specific QTL effect and the residual polygenic heritability function. We propose flexible semiparametric models along with their statistical estimation and hypothesis testing procedures for longitudinal genetic designs. We employ penalized splines to estimate nonparametric functions in the models. We find that misspecifying the baseline function or the genetic effect function in a parametric analysis may lead to a substantially inflated or highly conservative type I error rate in testing and a large mean squared error in estimation. We apply the proposed approaches to examine age-specific effects of genetic variants reported in a recent genome-wide association study of blood pressure collected in the Framingham Heart Study.
Genome-wide association study; Penalized splines; Quantitative trait locus
Epidemiology studies increasingly examine multiple exposures in relation to disease by selecting the exposures of interest in a thematic manner. For example, sun exposure, sunburn, and sun protection behavior could be themes for an investigation of sun-related exposures. Several studies now use pre-defined linear combinations of the exposures pertaining to the themes to estimate the effects of the individual exposures. Such analyses may improve the precision of the exposure effects, but they can lead to inflated bias and type I errors when the linear combinations are inaccurate. We investigate preliminary test estimators and empirical Bayes type shrinkage estimators as alternative approaches when it is desirable to exploit the thematic choice of exposures, but the accuracy of the pre-defined linear combinations is unknown. We show that the two types of estimator are intimately related under certain assumptions. The shrinkage estimator derived under the assumption of an exchangeable prior distribution gives precise estimates and is robust to misspecifications of the user-defined linear combinations. The precision gains and robustness of the shrinkage estimation approach are illustrated using data from the SONIC study, where the exposures are the individual questionnaire items and the outcome is (log) total back nevus count.
Empirical Bayes; Minimum risk; Random effects; Exchangeability
We propose a hierarchical Bayesian model for analyzing gene expression data to identify pathways differentiating between two biological states (e.g., cancer vs. non-cancer and mutant vs. normal). Finding significant pathways can improve our understanding of biological processes. When the biological process of interest is related to a specific disease, eliciting a better understanding of the underlying pathways can lead to designing a more effective treatment. We apply our method to data obtained by interrogating the mutational status of p53 in 50 cancer cell lines (33 mutated and 17 normal). We identify several significant pathways with strong biological connections. We show that our approach provides a natural framework for incorporating prior biological information, and it has the best overall performance in terms of correctly identifying significant pathways compared to several alternative methods.
Biological pathways; Hierarchical Bayesian models; Mixture priors
The outcome dependent sampling scheme has been gaining attention in both the statistical literature and applied fields. Epidemiological and environmental researchers have been using it to select the observations for more powerful and cost-effective studies. Motivated by a study of the effect of in utero exposure to polychlorinated biphenyls on children’s IQ at age 7, in which the effect of an important confounding variable is nonlinear, we consider a semi-parametric regression model for data from an outcome-dependent sampling scheme where the relationship between the response and covariates is only partially parameterized. We propose a penalized spline maximum likelihood estimation (PSMLE) for inference on both the parametric and the nonparametric components and develop their asymptotic properties. Through simulation studies and an analysis of the IQ study, we compare the proposed estimator with several competing estimators. Practical considerations of implementing those estimators are discussed.
Outcome dependent sampling; Estimated likelihood; Semiparametric method; Penalized spline
Acute lung injury (ALI) is a condition characterized by acute onset of severe hypoxemia and bilateral pulmonary infiltrates. ALI patients typically require mechanical ventilation in an intensive care unit. Low tidal volume ventilation (LTVV), a time-varying dynamic treatment regime, has been recommended as an effective ventilation strategy. This recommendation was based on the results of the ARMA study, a randomized clinical trial designed to compare low vs. high tidal volume strategies (The Acute Respiratory Distress Syndrome Network, 2000). After publication of the trial, some critics focused on the high non-adherence rates in the LTVV arm, suggesting that non-adherence occurred because treating physicians felt that deviating from the prescribed regime would improve patient outcomes. In this paper, we seek to address this controversy by estimating the survival distribution in the counterfactual setting where all patients assigned to LTVV followed the regime. Inference is based on a fully Bayesian implementation of Robins’ (1986) G-computation formula. In addition to re-analyzing data from the ARMA trial, we also apply our methodology to data from a subsequent trial (ALVEOLI), which implemented the LTVV regime in both of its study arms and also suffered from non-adherence.
Bayesian inference; Causal inference; Dynamic treatment regime; G-computation formula