Group (pooled) testing is often used to reduce the total number of tests that are needed to screen a large number of individuals for an infectious disease or some other binary characteristic. Traditionally, research in group testing has assumed that each individual is independent with the same risk of positivity. More recently, there has been a growing set of literature generalizing previous work in group testing to include heterogeneous populations so that each individual has a different risk of positivity. We investigate the effect of acknowledging population heterogeneity on a commonly used group testing procedure which is known as ‘halving’. For this procedure, positive groups are successively split into two equal-sized halves until all groups test negatively or until individual testing occurs. We show that heterogeneity does not affect the mean number of tests when individuals are randomly assigned to subgroups. However, when individuals are assigned to subgroups on the basis of their risk probabilities, we show that our proposed procedures reduce the number of tests by taking advantage of the heterogeneity. This is illustrated by using chlamydia and gonorrhoea screening data from the state of Nebraska.
Binary response; Classification; Identification; Pooled testing; Retesting; Screening
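The halving procedure described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name `halving_tests` is hypothetical, the assay is assumed to be error-free, and both halves of a positive group are always retested (practical variants may skip a test when the result can be inferred).

```python
def halving_tests(statuses):
    """Count the tests used by halving on one group.

    `statuses` is a list of true 0/1 disease indicators.
    Assumptions (not from the source): the assay is perfect, and
    both halves of every positive group are tested.
    """
    tests = 1  # one test on the pooled group itself
    # Stop when the group tests negative or is a single individual.
    if not any(statuses) or len(statuses) == 1:
        return tests
    mid = len(statuses) // 2
    # Split the positive group into two halves and recurse on each.
    return tests + halving_tests(statuses[:mid]) + halving_tests(statuses[mid:])


# Positives placed in the same half versus spread across halves.
clustered = [1, 1, 0, 0, 0, 0, 0, 0]
spread = [1, 0, 0, 0, 1, 0, 0, 0]
print(halving_tests(clustered), halving_tests(spread))
```

In this toy example the group with both positives in the same half needs fewer tests than the group with positives split across halves, which is the intuition behind assigning individuals to subgroups by risk.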
A subjective sampling ratio between the case and the control groups is not always an efficient choice to maximize the power or to minimize the total required sample size in comparative diagnostic trials. We derive explicit expressions for an optimal sampling ratio based on a common variance structure shared by several existing summary statistics of the receiver operating characteristic curve. We propose a two-stage procedure to estimate adaptively the optimal ratio without pilot data. We investigate the properties of the proposed method through theoretical proofs, extensive simulation studies and a real example in cancer diagnostic studies.
Area under the curve; Diagnostic accuracy; Partial area under the curve; Power; Receiver operating characteristic curve; Two-stage design
Ecological momentary assessment (EMA) is a method for collecting real-time data in subjects’ environments. It often uses electronic devices to obtain information on psychological state through administration of questionnaires at times selected from a probability-based sampling design. This information can be used to model the impact of momentary variation in psychological state on the lifetimes to events such as smoking lapse. Motivated by this, a probability-sampling framework is proposed for estimating the impact of time-varying covariates on the lifetimes to events. Presented as an alternative to joint modeling of the covariate process as well as event lifetimes, this framework calls for sampling covariates at the event lifetimes and at times selected according to a probability-based sampling design. A design-unbiased estimator for the cumulative hazard is substituted into the log likelihood, and the resulting objective function is maximized to obtain the proposed estimator. This estimator has two quantifiable sources of variation, that due to the survival model and that due to sampling the covariates. Data from a nicotine patch trial are used to illustrate the proposed approach.
Ecological momentary assessment; Estimating equations; Parametric hazard; Smoking
Hierarchical models (HM) have been used extensively in multisite time series studies of air pollution and health to estimate health effects of a single pollutant adjusted for other pollutants and other time-varying factors. Recently, the Environmental Protection Agency (EPA) has called for research quantifying health effects of simultaneous exposure to many air pollutants. However, straightforward application of HM in this context is challenged by the need to specify a random-effect distribution on a high-dimensional vector of nuisance parameters. Here we introduce reduced HM as a general statistical approach for analyzing correlated data with many nuisance parameters. For reduced HM we first calculate the integrated likelihood of the parameter of interest (e.g. excess number of deaths attributed to simultaneous exposure to high levels of many pollutants), and we then specify a flexible random-effect distribution directly on this parameter. Simulation studies show that the reduced HM performs comparably to the full HM in many scenarios, and even performs better in some cases, particularly when the multivariate random-effect distribution of the full HM is misspecified. Methods are applied to estimate relative risks of cardiovascular hospital admissions associated with simultaneous exposure to elevated levels of particulate matter and ozone in 51 US counties during 1999–2005.
Air pollution; Multilevel models; Multisite time series data; Nuisance parameters; Random effects
The proportional odds logistic regression model is widely used for relating an ordinal outcome to a set of covariates. When the number of outcome categories is relatively large, the sample size is relatively small, and/or certain outcome categories are rare, maximum likelihood can yield biased estimates of the regression parameters. Firth (1993) and Kosmidis and Firth (2009) proposed a procedure to remove the leading term in the asymptotic bias of the maximum likelihood estimator. Their approach is most easily implemented for univariate outcomes. In this paper, we derive a bias correction that exploits the proportionality between Poisson and multinomial likelihoods for multinomial regression models. Specifically, we describe a bias correction for the proportional odds logistic regression model, based on the likelihood from a collection of independent Poisson random variables whose means are constrained to sum to 1, that is straightforward to implement. The proposed method is motivated by a study of predictors of post-operative complications in patients undergoing colon or rectal surgery (Gawande et al., 2007).
Discrete response; multinomial likelihood; multinomial logistic regression; penalized likelihood; Poisson likelihood
Typical oncology practice often includes not only an initial, frontline treatment, but also subsequent treatments given if the initial treatment fails. The physician chooses a treatment at each stage based on the patient’s baseline covariates and history of previous treatments and outcomes. Such sequentially adaptive medical decision-making processes are known as dynamic treatment regimes, treatment policies, or multi-stage adaptive treatment strategies. Conventional analyses in terms of frontline treatments that ignore subsequent treatments may be misleading, because they actually are an evaluation of more than front-line treatment effects on outcome. We are motivated by data from a randomized trial of four combination chemotherapies given as frontline treatments to patients with acute leukemia. Most patients in the trial also received a second-line treatment, chosen adaptively and subjectively rather than by randomization, either because the initial treatment was ineffective or the patient’s cancer later recurred. We evaluate effects on overall survival time of the 16 two-stage strategies that actually were used. Our methods include a likelihood-based regression approach in which the transition times of all possible multi-stage outcome paths are modeled, and estimating equations with inverse probability of treatment weighting to correct for bias. While the two approaches give different numerical estimates of mean survival time, they lead to the same substantive conclusions when comparing the two-stage regimes.
Causal inference; Clinical trial; Dynamic treatment regime; Treatment policy
We analyse data from a study involving 173 pregnant women. The data are observed values of the β human chorionic gonadotropin hormone measured during the first 80 days of gestational age, including from one up to six longitudinal responses for each woman. The main objective in this study is to predict normal versus abnormal pregnancy outcomes from data that are available at the early stages of pregnancy. We achieve the desired classification with a semiparametric hierarchical model. Specifically, we consider a Dirichlet process mixture prior for the distribution of the random effects in each group. The unknown random-effects distributions are allowed to vary across groups but are made dependent by using a design vector to select different features of a single underlying random probability measure. The resulting model is an extension of the dependent Dirichlet process model, with an additional probability model for group classification. The model is shown to perform better than an alternative model which is based on independent Dirichlet processes for the groups. Relevant posterior distributions are summarized by using Markov chain Monte Carlo methods.
Dependent non-parametric model; Discriminant analysis; Longitudinal data; Markov chain Monte Carlo sampling; Non-parametric modelling; Random-effects models; Species sampling models
The paper describes a Bayesian spatial discrete time survival model to estimate the effect of air pollution on the risk of preterm birth. The standard approach treats prematurity as a binary outcome and cannot effectively examine time-varying exposures during pregnancy. Time-varying exposures can arise either in short-term lagged exposures due to seasonality in air pollution or long-term cumulative exposures due to changes in length of exposure. Our model addresses this challenge by viewing gestational age as time-to-event data where each pregnancy becomes at risk at a prespecified time (e.g. the 28th week). The pregnancy is then followed until either a birth occurs before the 37th week (preterm), or it reaches the 37th week, and a full-term birth is expected. The model also includes a flexible spatially varying baseline hazard function to control for unmeasured spatial confounders and to borrow information across areal units. The approach proposed is applied to geocoded birth records in Mecklenburg County, North Carolina, for the period 2001–2005. We examine the risk of preterm birth that is associated with total cumulative and 4-week lagged exposure to ambient fine particulate matter.
Air pollution; Fine particulate matter; Preterm birth; Reproductive epidemiology; Spatial survival data
We propose a randomized phase II clinical trial design based on Bayesian adaptive randomization and predictive probability monitoring. Adaptive randomization assigns more patients to a more efficacious treatment arm by comparing the posterior probabilities of efficacy between different arms. We continuously monitor the trial by using the predictive probability. The trial is terminated early when it is shown that one treatment is overwhelmingly superior to others or that all the treatments are equivalent. We develop two methods to compute the predictive probability by considering the uncertainty of the sample size of the future data. We illustrate the proposed Bayesian adaptive randomization and predictive probability design by using a phase II lung cancer clinical trial, and we conduct extensive simulation studies to examine the operating characteristics of the design. By coupling adaptive randomization and predictive probability approaches, the trial can treat more patients with a more efficacious treatment and allow for early stopping whenever sufficient information is obtained to conclude treatment superiority or equivalence. The design proposed also controls both the type I and the type II errors and offers an alternative Bayesian approach to the frequentist group sequential design.
Adaptive randomization; Bayesian inference; Clinical trial ethics; Group sequential method; Posterior predictive distribution; Randomized trial; Type I error; Type II error
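As a rough illustration of comparing posterior probabilities of efficacy between arms, the sketch below estimates the posterior probability that one arm has a higher response rate than another. Everything specific here is an assumption rather than the design in the abstract: binary responses, independent Beta(1, 1) priors, a Monte Carlo approximation, and the hypothetical function name `prob_arm_a_better`. The predictive probability monitoring step is not covered.

```python
import random


def prob_arm_a_better(succ_a, n_a, succ_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(p_A > p_B | data).

    Assumes (for illustration only) binary responses and independent
    Beta(1, 1) priors, so each posterior is Beta(1 + s, 1 + n - s).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_a = rng.betavariate(1 + succ_a, 1 + n_a - succ_a)
        p_b = rng.betavariate(1 + succ_b, 1 + n_b - succ_b)
        wins += p_a > p_b  # True counts as 1
    return wins / draws


# 12/20 responders on arm A versus 6/20 on arm B (made-up numbers).
print(prob_arm_a_better(12, 20, 6, 20))
```

One simple allocation rule, again an assumption, would randomize the next patient to arm A with this probability, so that the arm that currently looks better receives more patients.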
Treatment of schizophrenia is notoriously difficult and typically requires personalized adaptation of treatment due to lack of efficacy of treatment, poor adherence, or intolerable side effects. The Clinical Antipsychotic Trials in Intervention Effectiveness (CATIE) Schizophrenia Study is a sequential multiple assignment randomized trial comparing the typical antipsychotic medication, perphenazine, to several newer atypical antipsychotics. This paper describes the marginal structural modeling method for estimating optimal dynamic treatment regimes and applies the approach to the CATIE Schizophrenia Study. Missing data and valid estimation of confidence intervals are also addressed.
Adaptive treatment strategies; causal effects; dynamic treatment regimes; inverse probability weighting; marginal structural models; personalized medicine; schizophrenia
In complex survey sampling, a fraction of a finite population is sampled. Often, the survey is conducted so that each subject in the population has a different probability of being selected into the sample. Further, many complex surveys involve stratification and clustering. For generalizability of the sample to the finite population, these features of the design are usually incorporated in the analysis. While the Wilcoxon rank sum test is commonly used to compare an ordinal variable in bivariate analyses, no simple extension of the Wilcoxon rank sum test has been proposed for complex survey data. With multinomial sampling of independent subjects, the Wilcoxon rank-sum test statistic equals the score test statistic for the group effect from a proportional odds cumulative logistic regression model for an ordinal outcome. Using this regression framework, for complex survey data, we formulate a similar proportional odds cumulative logistic regression model for the ordinal variable, and use an estimating equations score statistic for no group effect as an extension of the Wilcoxon test. The proposed method is applied to a complex survey designed to produce national estimates of the health care use, expenditures, sources of payment, and insurance coverage.
Cumulative logistic model; Medical Expenditure Panel Survey; Proportional odds model; Score statistic; Weighted estimating equations
Climate change may lead to changes in several aspects of the distribution of climate variables, including changes in the mean, increased variability, and severity of extreme events. In this paper, we propose using spatiotemporal quantile regression as a flexible and interpretable method for simultaneously detecting changes in several features of the distribution of climate variables. The spatiotemporal quantile regression model assumes that each quantile level changes linearly in time, permitting straight-forward inference on the time trend for each quantile level. Unlike classical quantile regression which uses model-free methods to analyze a single quantile or several quantiles separately, we take a model-based approach which jointly models all quantiles, and thus the entire response distribution. In the spatiotemporal quantile regression model, each spatial location has its own quantile function that evolves over time, and the quantile functions are smoothed spatially using Gaussian process priors. We propose a basis expansion for the quantile function that permits a closed-form for the likelihood, and allows for residual correlation modeling via a Gaussian spatial copula. We illustrate the methods using temperature data for the southeast US from the years 1931–2009. For these data, borrowing information across space identifies more significant time trends than classical non-spatial quantile regression. We find a decreasing time trend for much of the spatial domain for monthly mean and maximum temperatures. For the lower quantiles of monthly minimum temperature, we find a decrease in Georgia and Florida, and an increase in Virginia and the Carolinas.
Bayesian hierarchical model; climate change; non-Gaussian data; US temperature data; warming hole
We describe and analyze a longitudinal diffusion tensor imaging (DTI) study relating changes in the microstructure of intracranial white matter tracts to cognitive disability in multiple sclerosis patients. In this application the scalar outcome and the functional exposure are measured longitudinally. This data structure is new and raises challenges that cannot be addressed with current methods and software. To analyze the data, we introduce a penalized functional regression model and inferential tools designed specifically for these emerging types of data. Our proposed model extends the Generalized Linear Mixed Model by adding functional predictors; this method is computationally feasible and is applicable when the functional predictors are measured densely, sparsely or with error. An online appendix compares two implementations, one likelihood-based and the other Bayesian, and provides the software used in simulations; the likelihood-based implementation is included as the lpfr() function in the R package refund available on CRAN.
Bayesian Inference; Functional Regression; Mixed Models; Smoothing Splines
The prognosis for patients with high grade gliomas is poor, with a median survival of 1 year. Treatment efficacy assessment is typically unavailable until 5-6 months post diagnosis. Investigators hypothesize that quantitative magnetic resonance imaging can assess treatment efficacy 3 weeks after therapy starts, thereby allowing salvage treatments to begin earlier. The purpose of this work is to build a predictive model of treatment efficacy by using quantitative magnetic resonance imaging data and to assess its performance. The outcome is 1-year survival status. We propose a joint, two-stage Bayesian model. In stage I, we smooth the image data with a multivariate spatiotemporal pairwise difference prior. We propose four summary statistics that are functionals of posterior parameters from the first-stage model. In stage II, these statistics enter a generalized non-linear model as predictors of survival status. We use the probit link and a multivariate adaptive regression spline basis. Gibbs sampling and reversible jump Markov chain Monte Carlo methods are applied iteratively between the two stages to estimate the posterior distribution. Through both simulation studies and model performance comparisons we find that we can achieve higher overall correct classification rates by accounting for the spatiotemporal correlation in the images and by allowing for a more complex and flexible decision boundary provided by the generalized non-linear model.
Bayesian analysis; Image analysis; Multivariate adaptive regression splines; Multivariate pairwise difference prior; Quantitative magnetic resonance imaging; Spatiotemporal model
In family-based longitudinal genetic studies, investigators collect repeated measurements on a trait that changes with time along with genetic markers. Since repeated measurements are nested within subjects and subjects are nested within families, both the subject-level and measurement-level correlations must be taken into account in the statistical analysis to achieve more accurate estimation. In such studies, the primary interests include testing for a quantitative trait locus (QTL) effect and estimating the age-specific QTL effect and the residual polygenic heritability function. We propose flexible semiparametric models along with their statistical estimation and hypothesis testing procedures for longitudinal genetic designs. We employ penalized splines to estimate nonparametric functions in the models. We find that misspecifying the baseline function or the genetic effect function in a parametric analysis may lead to a substantially inflated or highly conservative type I error rate in testing and a large mean squared error in estimation. We apply the proposed approaches to examine age-specific effects of genetic variants reported in a recent genome-wide association study of blood pressure collected in the Framingham Heart Study.
Genome-wide association study; Penalized splines; Quantitative trait locus
Epidemiology studies increasingly examine multiple exposures in relation to disease by selecting the exposures of interest in a thematic manner. For example, sun exposure, sunburn, and sun protection behavior could be themes for an investigation of sun-related exposures. Several studies now use pre-defined linear combinations of the exposures pertaining to the themes to estimate the effects of the individual exposures. Such analyses may improve the precision of the exposure effects, but they can lead to inflated bias and type I errors when the linear combinations are inaccurate. We investigate preliminary test estimators and empirical Bayes type shrinkage estimators as alternative approaches when it is desirable to exploit the thematic choice of exposures, but the accuracy of the pre-defined linear combinations is unknown. We show that the two types of estimator are intimately related under certain assumptions. The shrinkage estimator derived under the assumption of an exchangeable prior distribution gives precise estimates and is robust to misspecifications of the user-defined linear combinations. The precision gains and robustness of the shrinkage estimation approach are illustrated using data from the SONIC study, where the exposures are the individual questionnaire items and the outcome is (log) total back nevus count.
Empirical Bayes; Minimum risk; Random effects; Exchangeability
We propose a hierarchical Bayesian model for analyzing gene expression data to identify pathways differentiating between two biological states (e.g., cancer vs. non-cancer and mutant vs. normal). Finding significant pathways can improve our understanding of biological processes. When the biological process of interest is related to a specific disease, eliciting a better understanding of the underlying pathways can lead to designing a more effective treatment. We apply our method to data obtained by interrogating the mutational status of p53 in 50 cancer cell lines (33 mutated and 17 normal). We identify several significant pathways with strong biological connections. We show that our approach provides a natural framework for incorporating prior biological information, and it has the best overall performance in terms of correctly identifying significant pathways compared to several alternative methods.
Biological pathways; Hierarchical Bayesian models; Mixture priors
The outcome dependent sampling scheme has been gaining attention in both the statistical literature and applied fields. Epidemiological and environmental researchers have been using it to select the observations for more powerful and cost-effective studies. Motivated by a study of the effect of in utero exposure to polychlorinated biphenyls on children’s IQ at age 7, in which the effect of an important confounding variable is nonlinear, we consider a semi-parametric regression model for data from an outcome-dependent sampling scheme where the relationship between the response and covariates is only partially parameterized. We propose a penalized spline maximum likelihood estimation (PSMLE) for inference on both the parametric and the nonparametric components and develop their asymptotic properties. Through simulation studies and an analysis of the IQ study, we compare the proposed estimator with several competing estimators. Practical considerations of implementing those estimators are discussed.
Outcome dependent sampling; Estimated likelihood; Semiparametric method; Penalized spline
Acute lung injury (ALI) is a condition characterized by acute onset of severe hypoxemia and bilateral pulmonary infiltrates. ALI patients typically require mechanical ventilation in an intensive care unit. Low tidal volume ventilation (LTVV), a time-varying dynamic treatment regime, has been recommended as an effective ventilation strategy. This recommendation was based on the results of the ARMA study, a randomized clinical trial designed to compare low vs. high tidal volume strategies (The Acute Respiratory Distress Syndrome Network, 2000). After publication of the trial, some critics focused on the high non-adherence rates in the LTVV arm, suggesting that non-adherence occurred because treating physicians felt that deviating from the prescribed regime would improve patient outcomes. In this paper, we seek to address this controversy by estimating the survival distribution in the counterfactual setting where all patients assigned to LTVV followed the regime. Inference is based on a fully Bayesian implementation of Robins’ (1986) G-computation formula. In addition to re-analyzing data from the ARMA trial, we also apply our methodology to data from a subsequent trial (ALVEOLI), which implemented the LTVV regime in both of its study arms and also suffered from non-adherence.
Bayesian inference; Causal inference; Dynamic treatment regime; G-computation formula
Continuous shape change is represented as curves in the shape space. A method for checking the closeness of these curves to a geodesic is presented. Three large databases of short human motions are considered and shown to be well approximated by geodesics. The motions are thus approximated by two shapes on the geodesic and the rate of progress along the path. An analysis of facial motion data taken from a study of subjects with cleft lip or cleft palate is presented that allows the motion to be considered independently from the static shape. Inferential methods for assessing the change in motion are presented. The construction of predicted animated motions is discussed.
Facial motion; Functional data analysis; Geodesics; Landmarks; Principal component analysis; Shape analysis
We propose a mixture modelling framework for both identifying and exploring the nature of genotype–trait associations. This framework extends the classical mixed effects modelling approach for this setting by incorporating a Gaussian mixture distribution for random genotype effects. The primary advantages of this paradigm over existing approaches include that the mixture modelling framework addresses the degrees-of-freedom challenge that is inherent in application of the usual fixed effects analysis of covariance, relaxes the restrictive single normal distribution assumption of the classical mixed effects models and offers an exploratory framework for discovery of underlying structure across multiple genetic loci. An application to data arising from a study of antiretroviral-associated dyslipidaemia in human immunodeficiency virus infection is presented. Extensive simulation studies are also implemented to investigate the performance of this approach.
Genetic associations; Latent class; Mixture models
Prophylaxis of contacts of infectious cases such as household members and treatment of infectious cases are methods to prevent spread of infectious diseases. We develop a method based on maximum likelihood to estimate the efficacy of such interventions and the transmission probabilities. We consider both the design with prospective follow-up of close contact groups and the design with ascertainment of close contact groups by an index case as well as randomization by groups and by individuals. We compare the designs using simulations. We estimate the efficacy of the influenza antiviral agent oseltamivir in reducing susceptibility and infectiousness in two case-ascertained household trials.
Antiviral agent; Community trial; Infectious disease; Intervention efficacy; Left truncation
In oncology, progression-free survival time, which is defined as the minimum of the times to disease progression or death, often is used to characterize treatment and covariate effects. We are motivated by the desire to estimate the progression time distribution on the basis of data from 780 paediatric patients with choroid plexus tumours, which are a rare brain cancer where disease progression always precedes death. In retrospective data on 674 patients, the times to death or censoring were recorded but progression times were missing. In a prospective study of 106 patients, both times were recorded but there were only 20 non-censored progression times and 10 non-censored survival times. Consequently, estimating the progression time distribution is complicated by the problems that, for most of the patients, either the survival time is known but the progression time is not known, or the survival time is right censored and it is not known whether the patient’s disease progressed before censoring. For data with these missingness structures, we formulate a family of Bayesian parametric likelihoods and present methods for estimating the progression time distribution. The underlying idea is that estimating the association between the time to progression and subsequent survival time from patients having complete data provides a basis for utilizing covariates and partial event time data of other patients to infer their missing progression times. We illustrate the methodology by analysing the brain tumour data, and we also present a simulation study.
Latent variables; Missingness at random; Missing values; Survival analysis
This work is motivated by a quantitative Magnetic Resonance Imaging study of the differential tumor/healthy tissue change in contrast uptake induced by radiation. The goal is to determine the time in which there is maximal contrast uptake (a surrogate for permeability) in the tumor relative to healthy tissue. A notable feature of the data is its spatial heterogeneity. Zhang, Johnson, Little, and Cao (2008a and 2008b) discuss two parallel approaches to “denoise” a single image of change in contrast uptake from baseline to one follow-up visit of interest. In this work we extend the image model to explore the longitudinal profile of the tumor/healthy tissue contrast uptake in multiple images over time. We fit a two-stage model. First, we propose a longitudinal image model for each subject. This model simultaneously accounts for the spatial and temporal correlation and denoises the observed images by borrowing strength both across neighboring pixels and over time. We propose to use the Mann-Whitney U statistic to summarize the tumor contrast uptake relative to healthy tissue. In the second stage, we fit a population model to the U statistic and estimate when it achieves its maximum. Our initial findings suggest that the maximal contrast uptake of the tumor core relative to healthy tissue peaks around three weeks after initiation of radiotherapy, though this warrants further investigation.
Mann-Whitney U statistic; Markov random field; population model; quantitative MRI; reversible jump MCMC; spatial-temporal model
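For readers unfamiliar with the Mann-Whitney U statistic used above as a summary of tumor contrast uptake relative to healthy tissue, a minimal sketch of the standard pairwise definition follows. The function name and the pixel values are hypothetical, and the spatial and temporal smoothing described in the abstract is not reproduced here.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x versus sample y.

    Counts the pairs (x_i, y_j) with x_i > y_j, with each tie
    contributing 1/2. Dividing by len(x) * len(y) gives an estimate
    of P(X > Y) + 0.5 * P(X = Y).
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Made-up contrast-uptake changes for tumor and healthy-tissue pixels.
tumor = [3.1, 2.4, 4.0]
healthy = [1.0, 2.4, 2.0, 0.5]
u = mann_whitney_u(tumor, healthy)
print(u, u / (len(tumor) * len(healthy)))
```

The scaled value lies between 0 and 1, with values near 1 indicating that tumor pixels tend to show larger uptake than healthy-tissue pixels.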
A marker's capacity to predict risk of a disease depends on disease prevalence in the target population and its classification accuracy, i.e. its ability to discriminate diseased subjects from non-diseased subjects. The latter is often considered an intrinsic property of the marker; it is independent of disease prevalence and hence more likely to be similar across populations than risk prediction measures. In this paper, we are interested in evaluating the population-specific performance of a risk prediction marker in terms of positive predictive value (PPV) and negative predictive value (NPV) at given thresholds, when samples are available from the target population as well as from another population. A default strategy is to estimate PPV and NPV using samples from the target population only. However, when the marker's classification accuracy as characterized by a specific point on the receiver operating characteristics (ROC) curve is similar across populations, borrowing information across populations allows increased efficiency in estimating PPV and NPV. We develop estimators that optimally combine information across populations. We apply this methodology to a cross-sectional study where we evaluate PCA3 as a risk prediction marker for prostate cancer among subjects with or without previous negative biopsy.
Biomarker; Classification; NPV; PPV; Sensitivity; Specificity
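The dependence of PPV and NPV on prevalence, for a fixed point on the ROC curve, follows from Bayes' theorem. The sketch below is a generic illustration, not the estimators developed in the paper; the function name and the numerical values are assumptions chosen for the example.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity and prevalence.

    Uses Bayes' theorem on the four joint probabilities of
    (test result, true disease status).
    """
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same classification accuracy, two populations with different prevalence.
for prev in (0.1, 0.3):
    ppv, npv = predictive_values(0.8, 0.9, prev)
    print(f"prevalence={prev}: PPV={ppv:.3f}, NPV={npv:.3f}")
```

With sensitivity and specificity held fixed, PPV rises and NPV falls as prevalence increases, which is why classification accuracy may be shared across populations while PPV and NPV remain population-specific.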