Covariate-specific ROC curves are often used to evaluate the classification accuracy of a medical diagnostic test or a biomarker when the accuracy of the test is associated with certain covariates. In many large-scale screening tests, the gold standard is subject to missingness due to high cost or harmfulness to the patient. In this paper, we propose a semiparametric estimator of the covariate-specific ROC curves with a partially missing gold standard. A location-scale model is constructed for the test result to model the covariates' effect, but the residual distributions are left unspecified. Thus the baseline and link functions of the ROC curve both have flexible shapes. Under the assumption that the gold standard is missing at random (MAR), we consider weighted estimating equations for the location-scale parameters, and weighted kernel estimating equations for the residual distributions. Three ROC curve estimators are proposed and compared, namely, the imputation-based, inverse probability weighted, and doubly robust estimators. We derive the asymptotic normality of the estimated ROC curve, as well as the analytical form of the standard error estimator. The proposed method is motivated by and applied to data from an Alzheimer's disease study.
Alzheimer's disease; covariate-specific ROC curve; ignorable missingness; verification bias; weighted estimating equations
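As a rough illustration of the inverse probability weighted estimator described above, the sketch below reweights verified subjects by the inverse of an estimated verification probability under MAR. All names are illustrative; the location-scale covariate adjustment and the doubly robust augmentation are omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_roc(y, d, v, x, grid):
    """IPW sketch of an ROC curve with a partially missing gold standard.
    y: test result; d: disease status (NaN when unverified); v: 1 if the
    gold standard was observed; x: covariates for the verification model."""
    Z = np.column_stack([y, x])
    pi = LogisticRegression().fit(Z, v).predict_proba(Z)[:, 1]  # P(V=1|Y,X) under MAR
    w = v / pi                                 # zero weight if unverified
    d0 = np.where(v == 1, d, 0.0)              # placeholder for missing D
    def surv(c, dis):                          # weighted P(Y > c | D = dis)
        mask = (d0 == dis)
        return np.sum(w * mask * (y > c)) / np.sum(w * mask)
    fpr = np.array([surv(c, 0) for c in grid])
    tpr = np.array([surv(c, 1) for c in grid])
    return fpr, tpr
```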
Given a randomized treatment Z, a clinical outcome Y, and a biomarker S measured some fixed time after Z is administered, we may be interested in addressing the surrogate endpoint problem by evaluating whether S can be used to reliably predict the effect of Z on Y. Several recent proposals for the statistical evaluation of surrogate value have been based on the framework of principal stratification. In this paper, we consider two principal stratification estimands: joint risks and marginal risks. Joint risks measure causal associations of treatment effects on S and Y, providing insight into the surrogate value of the biomarker, but are not statistically identifiable from vaccine trial data. While marginal risks do not measure causal associations of treatment effects, they nevertheless provide guidance for future research, and we describe a data collection scheme and assumptions under which the marginal risks are statistically identifiable. We show how different sets of assumptions affect the identifiability of these estimands; in particular, we depart from previous work by considering the consequences of relaxing the assumption of no individual treatment effects on Y before S is measured. Based on algebraic relationships between joint and marginal risks, we propose a sensitivity analysis approach for assessment of surrogate value, and show that in many cases the surrogate value of a biomarker may be hard to establish, even when the sample size is large.
Estimated likelihood; Identifiability; Principal stratification; Sensitivity analysis; Surrogate endpoint; Vaccine trials
Cross-sectional HIV incidence estimation based on a sensitive and less-sensitive test offers great advantages over the traditional cohort study. However, its use has been limited due to concerns about the false negative rate of the less-sensitive test, reflecting the phenomenon that some subjects may remain permanently negative on the less-sensitive test. Wang and Lagakos (2010) propose an augmented cross-sectional design which provides one way to estimate the size of the infected population who remain permanently negative and subsequently incorporate this information in the cross-sectional incidence estimator. In an augmented cross-sectional study, subjects who test negative on the less-sensitive test in the cross-sectional survey are followed forward for transition into the nonrecent state, at which time they would test positive on the less-sensitive test. However, considerable uncertainty exists regarding the appropriate length of follow-up and the size of the infected population who remain permanently negative on the less-sensitive test. In this paper, we assess the impact of varying follow-up time on the resulting incidence estimators from an augmented cross-sectional study, evaluate the robustness of cross-sectional estimators to assumptions about the existence and the size of the subpopulation who will remain permanently negative, and propose a new estimator based on abbreviated follow-up time (AF). Compared to the original estimator from an augmented cross-sectional study, the AF estimator allows shorter follow-up time and does not require estimation of the mean window period, defined as the average time between detectability of HIV infection with the sensitive and less-sensitive tests. It is shown to perform well in a wide range of settings. We discuss when the AF estimator would be expected to perform well and offer design considerations for an augmented cross-sectional study with abbreviated follow-up.
Augmented; cross-sectional studies; false negative; incidence estimators
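For orientation, the classical cross-sectional ("snapshot") estimator that the augmented design builds on divides the number of "recent" infections by the product of the number of susceptibles and the mean window period. A minimal version, with illustrative argument names:

```python
def snapshot_incidence(n_neg, n_recent, mu_years):
    """Classical cross-sectional HIV incidence estimator.
    n_neg:    subjects testing negative on the sensitive test
    n_recent: subjects positive on the sensitive test but still negative
              on the less-sensitive test (the "recent" state)
    mu_years: mean window period between the two tests, in years"""
    # in steady state, E[n_recent] ~ incidence * n_neg * mu_years
    return n_recent / (n_neg * mu_years)
```

The AF estimator described above avoids the mean window period mu_years entirely, which is one of its main practical attractions.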
Since the early 1940s, group testing (pooled testing) has been used to reduce costs in a variety of applications, including infectious disease screening, drug discovery, and genetics. In such applications, the goal is often to classify individuals as positive or negative using initial group testing results and the subsequent decoding of positive pools. Many decoding algorithms have been proposed, but most fail to acknowledge, let alone exploit, the heterogeneous nature of the individuals being screened. In this paper, we use individuals' risk probabilities to formulate new informative decoding algorithms which implement Dorfman retesting in a heterogeneous population. We introduce the concept of “thresholding” to classify individuals as “high risk” or “low risk,” so that separate, risk-specific algorithms may be used, while simultaneously identifying pool sizes that minimize the expected number of tests. When compared to competing algorithms which treat the population as homogeneous, we show that significant gains in testing efficiency can be realized with virtually no loss in screening accuracy. An important additional benefit is that our new procedures are easy to implement. We apply our methods to chlamydia and gonorrhea data collected recently in Nebraska as part of the Infertility Prevention Project.
Dorfman retesting; Group testing; Infertility Prevention Project; Pooled testing; Sensitivity; Specificity
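The baseline calculation behind all Dorfman-type algorithms is the expected number of tests per individual as a function of pool size. A minimal homogeneous version (perfect assay sensitivity and specificity assumed), which the informative algorithms generalize by replacing the common prevalence p with individual risk probabilities:

```python
import numpy as np

def expected_tests_per_person(p, k):
    """Dorfman retesting with pool size k and common prevalence p:
    one pooled test, plus k individual retests when the pool is positive."""
    return 1.0 / k + 1.0 - (1.0 - p) ** k

def best_pool_size(p, k_max=50):
    """Pool size minimizing the expected number of tests per individual."""
    ks = np.arange(2, k_max + 1)
    return ks[np.argmin([expected_tests_per_person(p, k) for k in ks])]
```

For example, best_pool_size(0.01) returns 11, with roughly 0.196 expected tests per person, an almost 80% saving over individual testing.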
The spatial scan statistic is an important and widely used tool for cluster detection. It is based on the simultaneous evaluation of the statistical significance of the maximum likelihood ratio test statistic over a large collection of potential clusters. In most cluster detection problems, there is variation in the extent of local multiplicity across the study region. For example, using a fixed maximum geographic radius for clusters, urban areas typically have many overlapping potential clusters, while rural areas have relatively few. The spatial scan statistic does not account for local multiplicity variation. We describe a previously proposed local multiplicity adjustment based on a nested Bonferroni correction and propose a novel adjustment based on a Gumbel distribution approximation to the distribution of a local scan statistic. We compare the performance of all three statistics in terms of power and a novel unbiased cluster detection criterion. These methods are then applied to the well-known New York leukemia dataset and a Wisconsin breast cancer incidence dataset.
Bonferroni adjustment; Gumbel distribution; Multiple comparisons; Permutation test; Poisson; Scan statistic
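For reference, the building block of the spatial scan statistic is Kulldorff's Poisson log-likelihood ratio, maximized over candidate zones and calibrated by Monte Carlo. A sketch for a single zone, assuming expected counts are scaled so the study-wide totals match:

```python
import numpy as np

def poisson_log_lr(c_in, e_in, c_tot, e_tot):
    """Kulldorff Poisson log-likelihood ratio for one candidate cluster.
    c_in/e_in: observed/expected counts inside the zone; c_tot/e_tot:
    study-wide totals. Returns 0 unless the zone is elevated."""
    c_out, e_out = c_tot - c_in, e_tot - e_in
    if c_in / e_in <= c_out / e_out:
        return 0.0
    ll = c_in * np.log(c_in / e_in) if c_in > 0 else 0.0
    ll += c_out * np.log(c_out / e_out) if c_out > 0 else 0.0
    return ll
```

The Gumbel-based adjustment proposed above calibrates each zone against an approximation to its local scan statistic's null distribution, rather than against a single global Monte Carlo reference.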
The evaluation of surrogate endpoints for primary use in future clinical trials is an increasingly important research area, due to demands for more efficient trials coupled with recent regulatory acceptance of some surrogates as ‘valid.’ However, little consideration has been given to how a trial that uses a newly validated surrogate endpoint as its primary endpoint might be appropriately designed. We propose a novel Bayesian adaptive trial design that allows the new surrogate endpoint to play a dominant role in assessing the effect of an intervention, while remaining realistically cautious about its use. By incorporating multi-trial historical information on the validated relationship between the surrogate and clinical endpoints, then evaluating accumulating data against this relationship as the new trial progresses, we adaptively guard against an erroneous assessment of treatment based upon a truly invalid surrogate. When the joint outcomes in the new trial seem plausible given similar historical trials, we proceed with the surrogate endpoint as the primary endpoint, and do so adaptively, perhaps stopping the trial early for success or inferiority of the experimental treatment, or for futility. Otherwise, we discard the surrogate and switch adaptive determinations to the original primary endpoint. We use simulation to test the operating characteristics of this new design compared to a standard O'Brien-Fleming approach, as well as the ability of our design to discriminate trustworthy from untrustworthy surrogates in hypothetical future trials. Furthermore, we investigate possible benefits using patient-level data from 18 adjuvant therapy trials in colon cancer, where disease-free survival is considered a newly validated surrogate endpoint for overall survival.
Bayesian adaptive design; Clinical trials; Surrogate endpoints; Survival analysis
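The plausibility check at the heart of the design can be caricatured as asking whether the new trial's joint (surrogate, clinical) effect estimate is consistent with the historical joint distribution. The sketch below is not the paper's Bayesian machinery, only a crude frequentist stand-in with hypothetical inputs:

```python
import numpy as np
from scipy import stats

def effects_consistent(theta_s, theta_t, hist_mean, hist_cov, level=0.95):
    """Crude consistency check of the new trial's (surrogate, clinical)
    effect pair against a bivariate summary of historical trials."""
    d = np.array([theta_s, theta_t]) - np.asarray(hist_mean, float)
    m2 = d @ np.linalg.solve(np.asarray(hist_cov, float), d)  # Mahalanobis^2
    return m2 <= stats.chi2.ppf(level, df=2)  # keep the surrogate if consistent
```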
RNA-seq may replace gene expression microarrays in the near future. Using RNA-seq, the expression of a gene can be estimated using the total number of sequence reads mapped to that gene, known as the Total Read Count (TReC). Traditional eQTL mapping methods, such as linear regression, can be applied to TReC measurements after they are properly normalized. In this paper, we show that eQTL mapping, by directly modeling TReC using discrete distributions, has higher statistical power than the two-step approach: data normalization followed by linear regression. In addition, RNA-seq provides information on allele-specific expression (ASE) that is not available from microarrays. By combining the information from TReC and ASE, we can computationally distinguish cis- and trans-eQTL and further improve the power of cis-eQTL mapping. Both simulation and real data studies confirm the improved power of our new methods. We also discuss the design issues of RNA-seq experiments. Specifically, we show that by combining TReC and ASE measurements, it is possible to minimize cost and retain the statistical power of cis-eQTL mapping by reducing sample size while increasing the number of sequence reads per sample. In addition to RNA-seq data, our method can also be employed to study the genetic basis of other types of sequencing data, such as ChIP-seq (chromatin immunoprecipitation followed by DNA sequencing) data. In this paper, we focus on eQTL mapping of a single gene using the association-based method. However, our method establishes a statistical framework for future developments of eQTL mapping methods using RNA-seq data (e.g. linkage-based eQTL mapping), and the joint study of multiple genetic markers and/or multiple genes.
Allele-specific Expression (ASE); eQTL; RNA-seq; Total Read Count (TReC)
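One way to realize "directly modeling TReC using discrete distributions" is a negative binomial regression of counts on genotype with library size as an offset. This sketch is illustrative: the dispersion is fixed at the statsmodels default here, and the paper's full likelihood also incorporates ASE.

```python
import numpy as np
import statsmodels.api as sm

def trec_eqtl(counts, genotype, log_depth):
    """Test one gene-SNP pair by modeling the Total Read Count directly
    with a negative binomial GLM, instead of normalize-then-regress."""
    X = sm.add_constant(np.asarray(genotype, float))   # 0/1/2 allele counts
    fit = sm.GLM(counts, X, family=sm.families.NegativeBinomial(),
                 offset=log_depth).fit()
    return fit.params[1], fit.pvalues[1]               # effect and p-value
```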
We consider the problem of high-dimensional regression under non-constant error variances. Despite being a common phenomenon in biological applications, heteroscedasticity has so far been largely ignored in high-dimensional analysis of genomic data sets. We propose a new methodology that allows non-constant error variances for high-dimensional estimation and model selection. Our method incorporates heteroscedasticity by simultaneously modeling both the mean and variance components via a novel doubly regularized approach. Extensive Monte Carlo simulations indicate that our proposed procedure can result in better estimation and variable selection than existing methods when heteroscedasticity arises from the presence of predictors explaining error variances and outliers. Further, we demonstrate the presence of heteroscedasticity in an expression quantitative trait loci (eQTL) study of 112 yeast segregants and apply our method to these data. The new procedure can automatically account for heteroscedasticity in identifying the eQTLs that are associated with gene expression variations and lead to smaller prediction errors. These results demonstrate the importance of considering heteroscedasticity in eQTL data analysis.
Generalized least squares; Heteroscedasticity; Large p small n; Model selection; Sparse regression; Variance estimation
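A toy version of the doubly regularized idea alternates a precision-weighted lasso for the mean with a lasso for the log-variance fitted to log squared residuals. The paper's actual penalties and algorithm may differ, and for simplicity the same predictors are used for both components here:

```python
import numpy as np
from sklearn.linear_model import Lasso

def doubly_regularized(X, y, lam_mean=0.1, lam_var=0.1, n_iter=5):
    """Alternating sparse estimation of mean and variance components."""
    var = np.full(len(y), y.var())
    for _ in range(n_iter):
        mean_fit = Lasso(alpha=lam_mean).fit(X, y, sample_weight=1.0 / var)
        r2 = (y - mean_fit.predict(X)) ** 2            # squared residuals
        var_fit = Lasso(alpha=lam_var).fit(X, np.log(r2 + 1e-8))
        var = np.clip(np.exp(var_fit.predict(X)), 1e-6, None)
    return mean_fit, var_fit
```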
Despite a recent flurry of proposals on variable selection, genome-wide multiple loci mapping remains challenging. The majority of existing variable selection methods impose a model, often the homoscedastic linear model, prior to selection. However, the true association between the phenotypic trait and the genetic markers is rarely known a priori, and the presence of epistatic interactions makes the association more complex than a linear relation. Model-free variable selection offers a useful alternative in this context, but the fact that the number of markers p often far exceeds the number of experimental units n renders all the existing model-free solutions that require n > p inapplicable. In this article, we examine a number of model-free variable selection methods for small-n-large-p regressions in the context of genome-wide multiple loci mapping. We propose and advocate a multivariate group-wise adaptive penalization (mGAP) solution, which requires no model pre-specification and thus accommodates complex trait-marker associations, and which handles one variable at a time and therefore works when n < p. The effectiveness of the new method is demonstrated through both intensive simulations and a comprehensive real data analysis across 6,100 gene expression traits.
Adaptive Lasso; Epistatic interaction; Grouped Lasso; Model-free variable selection; Multiple loci mapping; Sliced inverse regression; Iterative Adaptive Lasso; Multivariate group-wise adaptive penalization
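mGAP builds on sliced inverse regression (SIR), which estimates sufficient directions without specifying a model. To fix ideas about the model-free starting point that mGAP extends to p >> n, here is a minimal SIR sketch for the classical n > p case (the directions returned are on the standardized scale):

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dir=2):
    """Sliced inverse regression: leading eigenvectors of the weighted
    covariance of within-slice means of standardized predictors."""
    n, p = X.shape
    L = np.linalg.cholesky(np.cov(X.T))
    Z = (X - X.mean(0)) @ np.linalg.inv(L.T)       # whiten predictors
    order = np.argsort(y)
    M = np.zeros((p, p))
    for s in np.array_split(order, n_slices):      # slice on sorted y
        m = Z[s].mean(0)
        M += len(s) / n * np.outer(m, m)
    _, vecs = np.linalg.eigh(M)
    return vecs[:, ::-1][:, :n_dir]                # leading directions
```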
Using multiple historical trials with surrogate and true endpoints, we consider various models to predict the effect of treatment on a true endpoint in a target trial in which only a surrogate endpoint is observed. This predicted result is computed using (1) a prediction model (mixture, linear, or principal stratification) estimated from historical trials and the surrogate endpoint of the target trial and (2) a random extrapolation error estimated by successively leaving out each trial among the historical trials. The method applies to either binary outcomes or survival to a particular time point computed from censored survival data. We compute a 95% confidence interval for the predicted result and validate its coverage using simulation. To summarize the additional uncertainty from using a predicted rather than an observed result for the estimated treatment effect, we compute its standard error multiplier. Software is available for download.
Randomized trials; Reproducibility; Principal stratification
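The leave-one-trial-out idea can be illustrated with a simple linear prediction model (the paper also considers mixture and principal stratification models). Hypothetical inputs are the estimated per-trial treatment effects on the surrogate and true endpoints:

```python
import numpy as np

def loto_extrapolation_sd(surr_effects, true_effects):
    """For each historical trial, predict its true-endpoint effect from a
    line fitted to the remaining trials; the SD of the prediction errors
    estimates the random extrapolation error."""
    s = np.asarray(surr_effects, float)
    t = np.asarray(true_effects, float)
    errs = []
    for i in range(len(s)):
        keep = np.arange(len(s)) != i
        slope, intercept = np.polyfit(s[keep], t[keep], 1)
        errs.append(t[i] - (intercept + slope * s[i]))
    return np.std(errs, ddof=1)
```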
Many statistical tests have been proposed for case-control data to detect disease association with multiple single nucleotide polymorphisms (SNPs) in linkage disequilibrium (LD). The main reason for the existence of so many tests is that each test aims to detect one or two aspects of the many possible distributional differences between cases and controls, largely due to the lack of a general and yet simple model for discrete genotype data. Here we propose a latent variable model to represent SNP data: the observed SNP data are assumed to be obtained by discretizing a latent multivariate Gaussian variate. Since the latent variate is multivariate Gaussian, its distribution is completely characterized by its mean vector and covariance matrix, in contrast to the much more complex form of a general distribution for multivariate discrete SNP data. We propose a composite likelihood approach for parameter estimation. A direct application of this latent variable model is to association testing with multiple SNPs in a candidate gene or region. In contrast to many existing tests that aim to detect only one or two aspects of the many possible distributional differences of discrete SNP data, we can focus exclusively on testing the mean and covariance parameters of the latent Gaussian distributions for cases and controls. Our simulation results demonstrate potential power gains of the proposed approach over some existing methods.
Genome-wide association study; GWAS; latent model; logistic regression; multi-marker analysis; multivariate discrete distribution
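The generative side of the latent variable model is easy to write down: genotypes arise by cutting a multivariate Gaussian at fixed thresholds. The sketch below computes the implied joint genotype probabilities for a pair of SNPs, the building block of a pairwise composite likelihood; thresholds and parameterization are illustrative:

```python
import numpy as np
from scipy import stats

def genotype_pair_probs(mu, rho, cuts=(-0.5, 0.5)):
    """P(G1 = g1, G2 = g2), g in {0,1,2}, when genotypes come from
    discretizing a bivariate normal with means mu and correlation rho."""
    mvn = stats.multivariate_normal(mean=mu, cov=[[1, rho], [rho, 1]])
    edges = [-8.0, *cuts, 8.0]        # +/-8 is effectively +/-infinity
    P = np.zeros((3, 3))
    for i in range(3):
        for j in range(3):
            # rectangle probability via inclusion-exclusion on the CDF
            P[i, j] = (mvn.cdf([edges[i + 1], edges[j + 1]])
                       - mvn.cdf([edges[i], edges[j + 1]])
                       - mvn.cdf([edges[i + 1], edges[j]])
                       + mvn.cdf([edges[i], edges[j]]))
    return P
```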
This article proposes methodology for assessing goodness of fit in Bayesian hierarchical models. The methodology is based on comparing values of pivotal discrepancy measures, computed using parameter values drawn from the posterior distribution, with their known reference distributions. Because the resulting diagnostics can be calculated from the standard output of Markov chain Monte Carlo algorithms, their computational costs are minimal. Several simulation studies are provided, each of which suggests that diagnostics based on pivotal discrepancy measures have higher statistical power than comparable posterior-predictive diagnostic checks in detecting model departures. The proposed methodology is illustrated in a clinical application; an application to discrete data is described in supplementary material.
Model checking; model criticism; model hierarchy; discrepancy measures; Markov chain Monte Carlo; posterior-predictive density
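To make the idea concrete, consider a normal model where D = sum((y - mu)^2) / sigma^2 is pivotal with a chi-square reference distribution: evaluating D at posterior draws and testing the probability integral transforms for uniformity gives a cheap lack-of-fit diagnostic. This is a simplified sketch, not the paper's exact procedure:

```python
import numpy as np
from scipy import stats

def pivotal_check(y, mu_draws, sigma_draws):
    """Compare a pivotal discrepancy, evaluated at posterior draws,
    with its chi-square reference; small p-values flag misfit."""
    n = len(y)
    D = np.array([np.sum((y - m) ** 2) / s ** 2
                  for m, s in zip(mu_draws, sigma_draws)])
    u = stats.chi2.cdf(D, df=n)            # probability integral transform
    return stats.kstest(u, "uniform").pvalue
```

Because posterior draws are dependent, the p-value here is only a rough guide; proper calibration requires refinements along the lines developed in the paper.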
Double censoring often occurs in registry studies when left censoring is present in addition to right censoring. In this work, we propose a new analysis strategy for such doubly censored data by adopting a quantile regression model. We develop computationally simple estimation and inference procedures by appropriately using the embedded martingale structure. Asymptotic properties, including uniform consistency and weak convergence, are established for the resulting estimators. Moreover, we propose conditional inference to address the special identifiability issues attached to the double censoring setting. We further show that the proposed method can be readily adapted to handle left truncation. Simulation studies demonstrate good finite-sample performance of the new inferential procedures. The practical utility of our method is illustrated by an analysis of the onset of the most commonly investigated respiratory infection, Pseudomonas aeruginosa, in children with cystic fibrosis, using the US Cystic Fibrosis Registry.
Conditional inference; Double censoring; Empirical process; Martingale; Regression quantile; Truncation
This article develops semiparametric approaches for the estimation of propensity scores and causal survival functions from prevalent survival data. The analytical problem arises when prevalent sampling is adopted for collecting failure times and, as a result, the covariates are incompletely observed due to their association with the failure time. The proposed procedure for estimating propensity scores shares features with the likelihood formulation in case-control studies, but in our case requires additional care with the intercept term. The result shows that the corrected propensity scores in the logistic regression setting can be obtained through the standard estimation procedure with specific adjustments to the intercept term. For causal estimation, two different sources of missingness are encountered in our model: one can be explained by the potential outcome framework; the other is caused by the prevalent sampling scheme. Statistical analysis that does not adjust for bias from both sources of missingness will lead to biased causal inference. The proposed methods were partly motivated by and applied to the Surveillance, Epidemiology, and End Results (SEER)-Medicare linked data for women diagnosed with breast cancer.
Case-control study; Prevalent sampling; Propensity scores
High-density tiling arrays are designed to blanket an entire genomic region of interest using tiled oligonucleotides at very high resolution and are widely used in various biological applications. Experiments are usually conducted in multiple stages, in which unwanted technical variation may be introduced. As tiling arrays become more popular and are adopted by many research labs, it is pressing to develop quality control tools analogous to those developed for expression microarrays. We propose a set of statistical quality metrics, analogous to those used for expression microarrays, for tiling array data. We also develop a method to estimate the significance level of an observed quality measurement using randomization tests. These methods have been applied to multiple real data sets, including three independent ChIP-chip experiments and one transcriptome mapping study, and they have successfully identified good-quality chips as well as outliers in each study.
Quality control; Randomization test; Robust linear models; Tiling arrays
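The randomization test for a quality metric is straightforward to sketch: recompute the metric on permuted probe values to build a null distribution. Names here are generic; the actual metrics in the paper are analogues of expression-array quality measures:

```python
import numpy as np

def randomization_pvalue(metric, probe_values, n_perm=1000, seed=0):
    """One-sided randomization p-value for an observed quality metric,
    assuming the metric should show no signal on a well-behaved chip
    once probe values are randomly permuted."""
    rng = np.random.default_rng(seed)
    observed = metric(probe_values)
    null = np.array([metric(rng.permutation(probe_values))
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```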
In this paper, we develop a nonparametric method, called the adjusted exponentially tilted likelihood, and apply it to the analysis of morphometric measures. The adjusted exponential tilting estimator is shown to have the same first-order asymptotic properties as those of the original exponentially tilted likelihood. The adjusted exponentially tilted likelihood ratio statistic is applied to test linear hypotheses of unknown parameters, such as the associations of brain measures (e.g., cortical and subcortical surfaces) with covariates of interest, such as age, gender, and genotype. Simulation studies show that the adjusted exponentially tilted likelihood ratio statistic performs as well as the t-test when the imaging data are symmetrically distributed, while it is superior when the imaging data have a skewed distribution. We demonstrate the application of our new statistical methods to the detection of statistically significant differences in the morphology of the hippocampus between two schizophrenia groups and healthy subjects.
Adjusted exponential tilted likelihood; Hypothesis testing; M-rep; Morphometric measure
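For a flavor of exponential tilting in the simplest scalar case, the weights w_i ∝ exp(t (x_i - mu0)) are tilted so their weighted mean matches the null value, and the resulting likelihood ratio statistic is compared with chi-square(1). This is a bare-bones sketch, not the adjusted version proposed above:

```python
import numpy as np
from scipy.optimize import brentq

def et_lr_stat(x, mu0):
    """Exponentially tilted likelihood ratio statistic for H0: E[X] = mu0
    (assumes mu0 lies strictly inside the range of the data)."""
    g = np.asarray(x, float) - mu0
    tilt_mean = lambda t: np.average(g, weights=np.exp(t * g))
    t = brentq(tilt_mean, -50.0, 50.0)     # solve for the tilting parameter
    w = np.exp(t * g)
    w /= w.sum()                           # tilted probability weights
    return -2.0 * np.sum(np.log(len(g) * w))   # compare with chi2(1)
```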
Recent guidance from the Food and Drug Administration for the evaluation of new therapies in the treatment of type 2 diabetes (T2DM) calls for a program-wide meta-analysis of cardiovascular (CV) outcomes. In this context, we develop a new Bayesian meta-analysis approach using survival regression models to assess whether the size of a clinical development program is adequate to evaluate a particular safety endpoint. We propose a Bayesian sample size determination methodology for meta-analysis clinical trial design with a focus on controlling the type I error and power. We also propose the partial borrowing power prior to incorporate historical survival meta-analysis data into the statistical design. Various properties of the proposed methodology are examined, and an efficient Markov chain Monte Carlo sampling algorithm is developed to sample from the posterior distributions. In addition, we develop a simulation-based algorithm for computing various quantities, such as the power and the type I error, in the Bayesian meta-analysis trial design. The proposed methodology is applied to the design of a phase 2/3 development program, including a noninferiority clinical trial, for CV risk assessment in T2DM studies.
Fitting prior; Partial borrowing power prior; Sampling prior; Simulation; Survival data
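The simulation-based algorithm for design operating characteristics has a simple skeleton: repeatedly draw a truth from a sampling prior, simulate meta-analysis data, and record whether the posterior decision criterion is met. The callables and the 1.3 hazard ratio margin below are placeholders:

```python
def simulated_rejection_rate(n_sim, draw_dataset, posterior_prob,
                             threshold=0.975):
    """Monte Carlo estimate of power (under an alternative sampling
    prior) or type I error (under a null sampling prior).
    draw_dataset():       simulates a truth plus meta-analysis data
    posterior_prob(data): e.g., P(hazard ratio < 1.3 | data)"""
    hits = 0
    for _ in range(n_sim):
        data = draw_dataset()
        hits += posterior_prob(data) >= threshold
    return hits / n_sim
```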
A treatment regime is a rule that assigns a treatment, among a set of possible treatments, to a patient as a function of his/her observed characteristics, hence “personalizing” treatment to the patient. The goal is to identify the optimal treatment regime that, if followed by the entire population of patients, would lead to the best outcome on average. Given data from a clinical trial or observational study, for a single treatment decision, the optimal regime can be found by assuming a regression model for the expected outcome conditional on treatment and covariates, where, for a given set of covariates, the optimal treatment is the one that yields the most favorable expected outcome. However, treatment assignment via such a regime is suspect if the regression model is incorrectly specified. Recognizing that, even if misspecified, such a regression model defines a class of regimes, we instead consider finding the optimal regime within such a class by finding the regime that optimizes an estimator of the overall population mean outcome. To take into account possible confounding in an observational study and to increase precision, we use a doubly robust augmented inverse probability weighted estimator for this purpose. Simulations and application to data from a breast cancer clinical trial demonstrate the performance of the method.
Doubly robust estimator; Inverse probability weighting; Outcome regression; Personalized medicine; Potential outcomes; Propensity score
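The core of the approach is the doubly robust AIPW estimator of the mean outcome under a candidate regime. A compact sketch with simple stand-in propensity and outcome models (the paper allows general parametric models and restricted regime classes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_value(y, a, X, regime):
    """AIPW estimate of E[outcome] if everyone followed `regime`
    (a function mapping covariates to treatment 0/1)."""
    d = regime(X).astype(int)
    ps = LogisticRegression().fit(X, a).predict_proba(X)[:, 1]
    pi_d = np.where(d == 1, ps, 1 - ps)          # P(A = d(X) | X)
    c = (a == d).astype(float)                   # regime followed?
    m = np.empty(len(y))                         # outcome model at A = d(X)
    for t in (0, 1):
        fit = LinearRegression().fit(X[a == t], y[a == t])
        m[d == t] = fit.predict(X[d == t])
    return np.mean(c * y / pi_d - (c - pi_d) / pi_d * m)
```

The estimated optimal regime is then the maximizer of aipw_value over the regime class of interest, for example a grid of linear threshold rules.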
In the analysis of longitudinal data, it is not uncommon that the observation times of repeated measurements are subject-specific and correlated with the underlying longitudinal outcomes. Accounting for the dependence between observation times and longitudinal outcomes is critical in these situations to ensure the validity of statistical inference. In this article, we propose a flexible joint model for longitudinal data analysis in the presence of informative observation times. In particular, the new procedure considers a shared random-effect model and assumes a time-varying coefficient for the latent variable, allowing a flexible way of modeling longitudinal outcomes while adjusting for their association with observation times. Estimating equations are developed for parameter estimation. We show that the resulting estimators are consistent and asymptotically normal, with a variance-covariance matrix that has a closed form and can be consistently estimated by the usual plug-in method. One additional advantage of the procedure is that it provides a unified framework for testing whether the effect of the latent variable is zero, constant, or time-varying. Simulation studies show that the proposed approach is appropriate for practical use. An application to bladder cancer data is also given to illustrate the methodology.
Estimating equation method; Informative observation times; Longitudinal data analysis; Time-varying effect
In epidemics of infectious diseases such as influenza, an individual may have one of four possible final states: prior immune, escaped from infection, infected with symptoms, and infected asymptomatically. The exact state is often not observed. In addition, the unobserved transmission times of asymptomatic infections further complicate analysis. Under the assumption of missing at random, data-augmentation techniques can be used to integrate out such uncertainties. We adapt an importance-sampling-based Monte Carlo EM (MCEM) algorithm to the setting of an infectious disease transmitted in close contact groups. Assuming the independence between close contact groups, we propose a hybrid EM-MCEM algorithm that applies the MCEM or the traditional EM algorithms to each close contact group depending on the dimension of missing data in that group, and discuss the variance estimation for this practice. In addition, we propose a bootstrap approach to assess the total Monte Carlo error and factor that error into the variance estimation. The proposed methods are evaluated using simulation studies. We use the hybrid EM-MCEM algorithm to analyze two influenza epidemics in the late 1970s to assess the effects of age and pre-season antibody levels on the transmissibility and pathogenicity of the viruses.
Data augmentation; EM algorithm; Infectious disease; Missing data; Monte Carlo
Restricted mean lifetime is often of direct interest in epidemiologic studies involving censored survival times. Differences in this quantity can be used as a basis for comparing several groups. For example, transplant surgeons, nephrologists, and, of course, patients are interested in comparing post-transplant lifetimes among various types of kidney transplants in order to assist in clinical decision-making. As the factor of interest is not randomized, covariate adjustment is needed in order to account for imbalances in confounding factors. In this report, we use semiparametric theory to develop an estimator for differences in restricted mean lifetimes while accounting for confounding factors. The proposed method involves building working models for the time-to-event and coarsening mechanism (i.e., group assignment and censoring). We show that the proposed estimator possesses the double robust property; i.e., when either the time-to-event or the coarsening process is modeled correctly, the estimator is consistent and asymptotically normal. Simulation studies are conducted to assess its finite-sample performance, and the method is applied to national kidney transplant data.
Average causal effect; Cox regression; Cumulative treatment effect; Double robust estimator; Inverse weighting
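The estimand itself is simple: the restricted mean lifetime is the area under the survival curve up to a horizon L. A sketch for a step survival function such as Kaplan-Meier output (the paper's contribution is the confounder-adjusted, double robust comparison of this quantity across groups):

```python
import numpy as np

def restricted_mean(times, surv, L):
    """Area under a right-continuous step survival curve S(t) up to L.
    times: jump times (sorted); surv: S(t) just after each jump."""
    t = np.concatenate([[0.0], times[times < L], [L]])
    s = np.concatenate([[1.0], surv[times < L]])
    return np.sum(s * np.diff(t))      # sum of heights x interval widths
```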
A prevalent sample consists of individuals who have experienced disease incidence but not the failure event at the sampling time. We discuss methods for estimating the distribution function of a random vector defined at baseline for an incident disease population when data are collected by prevalent sampling. A prevalent sampling design is often more focused and economical than an incident study design for studying the survival distribution of a diseased population, but prevalent samples are biased by design. Subjects with longer survival times are more likely to be included in a prevalent cohort, and other baseline variables of interest that are correlated with survival time are also subject to sampling bias induced by the prevalent sampling scheme. Without recognition of the bias, applying the empirical distribution function to estimate the population distribution of baseline variables can lead to serious bias. In this article, nonparametric and semiparametric methods are developed for distribution estimation of baseline variables using prevalent data.
Accelerated failure time model; Cross-sectional sampling; Left truncation; Proportional hazards model
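Under the classical stationarity assumption, prevalent sampling is length-biased: a subject is sampled with probability proportional to the survival time T, so weighting observations by 1/T corrects the baseline-covariate distribution. A sketch ignoring censoring (which the paper's estimators handle):

```python
import numpy as np

def length_bias_corrected_cdf(x, t_surv, grid):
    """Weighted empirical CDF of a baseline variable x from a prevalent
    sample, with weights proportional to 1/T (T = observed survival)."""
    w = 1.0 / np.asarray(t_surv, float)
    w /= w.sum()
    x = np.asarray(x, float)
    return np.array([np.sum(w * (x <= g)) for g in grid])
```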
We propose in this paper a powerful testing procedure for detecting a gene effect on a continuous outcome in the presence of possible gene-gene interactions (epistasis) in a gene set, e.g., a genetic pathway or network. Traditional tests for this purpose require a large number of degrees of freedom, since they test the main effect and all the corresponding interactions under a parametric assumption, and hence suffer from low power. In this paper, we propose a powerful kernel machine-based test. Specifically, our test is based on a garrote kernel method and is constructed as a score test. Here, the term garrote refers to an extra nonnegative parameter that multiplies the covariate of interest, so that our score test can be formulated in terms of this nonnegative parameter. A key feature of the proposed test is that it is flexible and developed for both parametric and nonparametric models within a unified framework, and it is more powerful than the standard test because it accounts for the correlation among genes and hence often requires many fewer degrees of freedom. We investigate the theoretical properties of the proposed test, evaluate its finite-sample performance using simulation studies, and apply the method to the Michigan prostate cancer gene expression data.
Garrote; Gene-gene interaction; Kernel machine; Mixed models; Restricted maximum likelihood; Score test; Semiparametric regression
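For context, the standard (non-garrote) kernel machine score statistic measures residual structure along a gene-set kernel after fitting the null model; the garrote construction above extends this to isolate a single gene's effect. A minimal version (X should include an intercept column):

```python
import numpy as np

def kernel_score_stat(y, X, K):
    """Variance-component score statistic for a kernel machine effect.
    y: outcome; X: null-model design (with intercept); K: gene-set kernel.
    Calibrate by a chi-square mixture or permutation."""
    H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix of the null fit
    r = y - H @ y                             # null-model residuals
    sigma2 = r @ r / (len(y) - X.shape[1])    # residual variance
    return r @ K @ r / (2.0 * sigma2)
```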
Quantitative procedures for evaluating added value from new markers over a conventional risk scoring system for predicting event rates at specific time points have been extensively studied. However, a single summary statistic, for example, the area under the receiver operating characteristic curve or its derivatives, may not provide a clear picture of the relationship between the conventional and the new risk scoring systems. When there are no censored event time observations in the data, two simple scatterplots of individual conventional and new scores for “cases” and “controls” provide valuable information regarding the overall and subject-specific incremental values from the new markers. Unfortunately, in the presence of censoring, it is not clear how to construct such plots. In this paper, we propose a nonparametric estimation procedure for the distributions of the differences between the two risk scores conditional on the conventional score. The resulting quantile curves of these differences over the subject-specific conventional score provide extra information about the overall added value from the new marker. They also help us identify a subgroup of future subjects who need the new predictors, especially when there is no unified utility function available for cost-risk-benefit decision making. The procedure is illustrated with two data sets. The first is from the well-known Mayo Clinic PBC liver study. The second is from a recent breast cancer study evaluating the added value from a gene score, which is relatively expensive to measure compared with routinely used clinical biomarkers, for predicting the patient's survival after surgery.
Discriminant analysis; Nonparametric function estimation; Prediction; Receiver operating characteristic curve
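In the uncensored case, the proposed quantile curves can be approximated by kernel-weighted conditional quantiles of the score difference given the conventional score. A sketch with a Gaussian kernel (the paper's estimator additionally accommodates censored event times):

```python
import numpy as np

def local_quantile_curve(conv, diff, grid, tau=0.5, h=0.5):
    """Kernel-weighted conditional tau-quantile of diff (new score minus
    conventional score) given the conventional score conv."""
    conv, diff = np.asarray(conv, float), np.asarray(diff, float)
    order = np.argsort(diff)
    out = []
    for g in grid:
        w = np.exp(-0.5 * ((conv[order] - g) / h) ** 2)
        cw = np.cumsum(w) / np.sum(w)          # weighted CDF over sorted diff
        out.append(diff[order][np.searchsorted(cw, tau)])
    return np.array(out)
```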