It has recently been proposed that variation in DNA methylation at specific genomic locations may play an important role in the development of complex diseases such as cancer. Here, we develop 1- and 2-group multiple testing procedures for identifying and quantifying regions of DNA methylation variability. Our method is the first genome-wide statistical significance calculation for increased or differential variability, as opposed to the traditional approach of testing for mean changes. We apply these procedures to genome-wide methylation data obtained from biological and technical replicates and provide the first statistical proof that variably methylated regions exist and are due to interindividual variation. We also show that differentially variable regions in colon tumor and normal tissue show enrichment of genes regulating gene expression, cell morphogenesis, and development, supporting a biological role for DNA methylation variability in cancer.
Bump finding; Functional data analysis; Multiple testing; Preprocessing; Variably methylated regions (VMRs)
Mixed models are commonly used to represent longitudinal or repeated measures data. An additional complication arises when the response is censored, for example, due to limits of quantification of the assay used. While Gaussian random effects are routinely assumed, little work has characterized the consequences of misspecifying the random-effects distribution, nor has a more flexible distribution been studied for censored longitudinal data. We show that, in general, maximum likelihood estimators will not be consistent when the random-effects density is misspecified, and the effect of misspecification is likely to be greatest when the true random-effects density deviates substantially from normality and the number of noncensored observations on each subject is small. We develop a mixed model framework for censored longitudinal data in which the random effects are represented by the flexible seminonparametric density, and we show how to obtain estimates using the SAS procedure NLMIXED. Simulations show that this approach can lead to a reduction in bias and an increase in efficiency relative to assuming Gaussian random effects. The methods are demonstrated on data from a study of hepatitis C virus.
Censoring; HCV; HIV; Limit of quantification; Longitudinal data; Random effects
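As background on the flexible density used above, a sketch of the Gallant–Nychka seminonparametric (SNP) construction (our notation; the paper may use a multivariate or reparameterized version): for a scalar random effect $b$,
$$f(b) = \Big(\sum_{k=0}^{K} a_k b^k\Big)^2 \varphi(b),$$
where $\varphi$ is the standard normal density and the coefficients $a_0, \dots, a_K$ are constrained so that $f$ integrates to 1. Setting $K = 0$ recovers the Gaussian special case, while larger $K$ accommodates skewness and multimodality.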
The nested case–control (NCC) design is used frequently in epidemiological studies as a cost-effective subcohort sampling strategy for conducting biomarker research. The sampling strategy, on the other hand, creates challenges for data analysis because of outcome-dependent missingness in biomarker measurements. In this paper, we propose inverse probability weighted (IPW) methods for making inference about the prognostic accuracy of a novel biomarker for predicting future events with data from NCC studies. The consistency and asymptotic normality of these estimators are derived using empirical process theory and convergence theorems for sequences of weakly dependent random variables. Simulation and analysis using Framingham Offspring Study data suggest that the proposed methods perform well in finite samples.
Inverse probability weighting; Nested case–control study; Time-dependent accuracy
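As a hedged illustration of how such weights can be constructed (a Samuelsen-type form; the paper's exact weights may differ): in an NCC study that samples $m$ controls per case, the inclusion probability of subject $i$ can be estimated by
$$\hat{p}_i = 1 - \prod_{j:\, t_j \le X_i,\ \delta_j = 1} \Big(1 - \frac{m}{n(t_j) - 1}\Big),$$
where the product runs over case event times $t_j$ no later than subject $i$'s follow-up time $X_i$ and $n(t_j)$ is the risk-set size at $t_j$; cases have $\hat{p}_i = 1$, and each sampled subject enters the IPW estimating equations with weight $1/\hat{p}_i$.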
Sensitivity and specificity are common measures of the accuracy of a diagnostic test. The usual estimators of these quantities are unbiased if data on the diagnostic test result and the true disease status are obtained from all subjects in an appropriately selected sample. In some studies, verification of the true disease status is performed only for a subset of subjects, possibly depending on the result of the diagnostic test and other characteristics of the subjects. Estimators of sensitivity and specificity based on this subset of subjects are typically biased; this is known as verification bias. Methods have been proposed to correct verification bias under the assumption that the missing data on disease status are missing at random (MAR), that is, the probability of missingness depends on the true (missing) disease status only through the test result and observed covariate information. When some of the covariates are continuous, or the number of covariates is relatively large, the existing methods require parametric models for the probability of disease or the probability of verification (given the test result and covariates), and hence are subject to model misspecification. We propose a new method for correcting verification bias based on the propensity score, defined as the predicted probability of verification given the test result and observed covariates. This is estimated separately for those with positive and negative test results. The new method classifies the verified sample into several subsamples that have homogeneous propensity scores and allows correction for verification bias. Simulation studies demonstrate that the new estimators are more robust to model misspecification than existing methods, but still perform well when the models for the probability of disease and probability of verification are correctly specified.
Diagnostic test; Model misspecification; Propensity score; Sensitivity; Specificity
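A minimal R sketch of the propensity-score subclassification idea for test-positive subjects, under assumed variable names (verified, disease, test, covariates x1 and x2) and a simple quintile grouping that the paper's subclassification may refine:

# Verification propensity model among test positives
pos <- subset(dat, test == 1)
ps_model <- glm(verified ~ x1 + x2, family = binomial, data = pos)
pos$ps <- predict(ps_model, type = "response")

# Subclassify on propensity-score quintiles
pos$stratum <- cut(pos$ps, quantile(pos$ps, probs = seq(0, 1, 0.2)),
                   include.lowest = TRUE)

# Bias-corrected Pr(D = 1 | T = 1): within each stratum, disease
# prevalence among verified subjects, averaged with stratum-size weights
# (assumes every stratum contains at least one verified subject)
p_dis_pos <- sum(sapply(split(pos, pos$stratum), function(s)
  nrow(s) / nrow(pos) * mean(s$disease[s$verified == 1])))

Repeating the same steps among test negatives gives a corrected Pr(D = 1 | T = 0), from which bias-corrected sensitivity and specificity follow by Bayes' rule using the observed distribution of the test result.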
Many applications of biomedical science involve unobservable constructs, from measurement of health states to severity of complex diseases. The primary aim of measurement is to identify relevant pieces of observable information that thoroughly describe the construct of interest. Validation of the construct is often performed separately. Noting the increasing popularity of latent variable methods in biomedical research, we propose a Multiple Indicator Multiple Cause (MIMIC) latent variable model that combines item reduction and validation. Our joint latent variable model accounts for the bias that occurs in the traditional 2-stage process. The methods are motivated by an example from the Physical Activity and Lymphedema clinical trial in which the objectives were to describe lymphedema severity through self-reported Likert scale symptoms and to determine the relationship between symptom severity and a “gold standard” diagnostic measure of lymphedema. The MIMIC model identified 1 symptom as a potential candidate for removal. We present this paper as an illustration of the advantages of joint latent variable models and as an example of the applicability of these models for biomedical research.
Factor analysis; Latent variable models; Lymphedema; Multiple Indicator Multiple Cause models
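For readers unfamiliar with the MIMIC structure, a generic one-factor version (our notation, simplifying away the ordinal-threshold treatment that Likert items require): the latent severity $\eta$ is measured by the items and regressed on the causes,
$$y_j = \lambda_j \eta + \varepsilon_j \quad (j = 1, \dots, p), \qquad \eta = \gamma^{\top} x + \zeta,$$
so estimating the measurement loadings $\lambda_j$ and the structural coefficients $\gamma$ jointly is what removes the bias incurred by the traditional 2-stage reduce-then-validate process.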
We present a new method for Bayesian Markov chain Monte Carlo–based inference in certain types of stochastic models, suitable for modeling noisy epidemic data. We apply the so-called uniformization representation of a Markov process in order to efficiently generate appropriate conditional distributions in the Gibbs sampler algorithm. The approach is shown to work well in various data-poor settings, that is, when only partial information about the epidemic process is available, as illustrated on synthetic data from SIR-type epidemics and Centers for Disease Control and Prevention data from the onset of the H1N1 pandemic in the United States.
Gibbs sampler; Kinetic constants; Maximum likelihood; SIR model; Stochastic kinetics network
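As background on the representation invoked above: for a continuous-time Markov chain with generator $Q$, choosing $\Lambda \ge \max_i |q_{ii}|$ and setting $R = I + Q/\Lambda$ gives the uniformization identity
$$P(t) = e^{Qt} = \sum_{n=0}^{\infty} e^{-\Lambda t} \frac{(\Lambda t)^n}{n!} R^n,$$
so paths can be generated by drawing a Poisson$(\Lambda t)$ number of potential jump times and evolving a discrete-time chain with kernel $R$; this is what makes the conditional path updates in the Gibbs sampler tractable.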
Understanding conception probabilities is important not only for helping couples to achieve pregnancy but also in identifying acute or chronic reproductive toxicants that affect the highly timed and interrelated processes underlying hormonal profiles, ovulation, libido, and conception during menstrual cycles. Currently, 2 statistical approaches are available for estimating conception probabilities, depending upon the research question and extent of data collection during the menstrual cycle: a survival approach when interested in modeling time-to-pregnancy (TTP) in relation to women's or couples' purported exposure(s), or a hierarchical Bayesian approach when one is interested in modeling day-specific conception probabilities during the estimated fertile window. We propose a biologically valid discrete survival model that unifies the above 2 approaches while relaxing some assumptions that may not be consistent with human reproduction or behavior. This approach combines both the survival and the hierarchical models, allowing investigators to obtain the distribution of TTP and day-specific probabilities during the fertile window in a single model. Our model allows for the consideration of covariate effects at both the cycle and the daily level while accounting for daily variation in conception. We conduct extensive simulations and utilize the New York State Angler Prospective Pregnancy Cohort Study to illustrate our approach. We also provide the code to implement the model in R in the supplementary material available at Biostatistics online.
Censoring; Conception; Discrete survival; Fecundity; Random effects; Time-varying covariates
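A hedged sketch of the day-specific structure being unified (in the spirit of the classical Barrett–Marshall/Schwartz model; the paper's hierarchical version adds covariates and random effects): if $\lambda_{ijd}$ is the probability that intercourse on day $d$ of cycle $j$ leads to conception for couple $i$, and $x_{ijd}$ indicates intercourse on that day, then
$$\Pr(\text{conception in cycle } j) = 1 - \prod_{d \in \mathcal{F}} (1 - \lambda_{ijd})^{x_{ijd}},$$
with the product taken over the fertile window $\mathcal{F}$; treating cycles as discrete survival times with these cycle-specific conception probabilities then yields the TTP distribution within the same model.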
High-dimensional biomarker data are often collected in epidemiological studies when assessing the association between biomarkers and human disease is of interest. We develop a latent class modeling approach for joint analysis of high-dimensional semicontinuous biomarker data and a binary disease outcome. To model the relationship between complex biomarker expression patterns and disease risk, we use latent risk classes to link the 2 modeling components. We characterize complex biomarker-specific differences through biomarker-specific random effects, so that different biomarkers can have different baseline (low-risk) values as well as different between-class differences. The proposed approach also accommodates data features that are common in environmental toxicology and other biomarker exposure data, including a large number of biomarkers, numerous zero values, and a complex mean–variance relationship in the biomarker levels. A Monte Carlo EM (MCEM) algorithm is proposed for parameter estimation. Both the MCEM algorithm and model selection procedures are shown to work well in simulations and applications. In applying the proposed approach to an epidemiological study that examined the relationship between environmental polychlorinated biphenyl (PCB) exposure and the risk of endometriosis, we identified a highly significant overall effect of PCB concentrations on the risk of endometriosis.
Categorical data; Chemical exposure biomarkers; Latent variables; Monte Carlo EM algorithm; Random effects
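Schematically (our generic notation, suppressing the zero-inflation and mean–variance components described above), the latent risk class $C$ links the $M$ biomarkers $Y_1, \dots, Y_M$ and the disease outcome $D$ through
$$\Pr(Y, D) = \sum_{c} \Pr(C = c)\, \Pr(D \mid C = c) \int \prod_{m=1}^{M} f(Y_m \mid C = c, b_m)\, dG(b),$$
where the biomarker-specific random effects $b_m$ give each biomarker its own baseline level and between-class shift; the integral over $G$ is what necessitates the Monte Carlo E-step.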
Clinical demand for individualized “adaptive” treatment policies in diverse fields has spawned development of clinical trial methodology for their experimental evaluation via multistage designs, building upon methods intended for the analysis of naturalistically observed strategies. Because there is often no need to parametrically smooth multistage trial data (in contrast to observational data for adaptive strategies), it is possible to establish direct connections among different methodological approaches. We show by algebraic proof that the maximum likelihood (ML) and optimal semiparametric (SP) estimators of the population mean of the outcome of a treatment policy and its standard error are equal under certain experimental conditions. This result is used to develop a unified and efficient approach to design and inference for multistage trials of policies that adapt treatment according to discrete responses. We derive a sample size formula expressed in terms of a parametric version of the optimal SP population variance. Nonparametric (sample-based) ML estimation performed well in simulation studies, in terms of achieved power, for scenarios most likely to occur in real studies, even though sample sizes were based on the parametric formula. ML outperformed the SP estimator; differences in achieved power predominantly reflected differences in their estimates of the population mean (rather than estimated standard errors). Neither methodology could mitigate the potential for overestimated sample sizes when strong nonlinearity was purposely simulated for certain discrete outcomes; however, such departures from linearity may not be an issue for many clinical contexts that make evaluation of competitive treatment policies meaningful.
Adaptive treatment strategy; Efficient SP estimation; Maximum likelihood; Multi-stage design; Sample size formula
Semiparametric transformation models provide a very general framework for studying the effects of (possibly time-dependent) covariates on survival time and recurrent event times. Assessing the adequacy of these models is an important task because model misspecification affects the validity of inference and the accuracy of prediction. In this paper, we introduce appropriate time-dependent residuals for these models and consider the cumulative sums of the residuals. Under the assumed model, the cumulative sum processes converge weakly to zero-mean Gaussian processes whose distributions can be approximated through Monte Carlo simulation. These results enable one to assess, both graphically and numerically, how unusual the observed residual patterns are in reference to their null distributions. The residual patterns can also be used to determine the nature of model misspecification. Extensive simulation studies demonstrate that the proposed methods perform well in practical situations. Three medical studies are provided for illustrations.
Goodness of fit; Martingale residuals; Model checking; Model misspecification; Model selection; Recurrent events; Survival data; Time-dependent covariate
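A hedged sketch of the diagnostic: with $\hat{M}_i(t)$ denoting the time-dependent residual for subject $i$, one examines cumulative-sum processes of the generic form
$$W(t, z) = n^{-1/2} \sum_{i=1}^{n} I(Z_i \le z)\, \hat{M}_i(t),$$
where $Z_i$ is a covariate. Under the assumed transformation model, $W$ converges weakly to a zero-mean Gaussian process, and Monte Carlo realizations from that limit calibrate how extreme the observed residual pattern is, both graphically and through supremum-type statistics.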
Association studies in environmental statistics often involve exposure and outcome data that are misaligned in space. A common strategy is to employ a spatial model such as universal kriging to predict exposures at locations with outcome data and then estimate a regression parameter of interest using the predicted exposures. This results in measurement error because the predicted exposures do not correspond exactly to the true values. We characterize the measurement error by decomposing it into Berkson-like and classical-like components. One correction approach is the parametric bootstrap, which is effective but computationally intensive since it requires solving a nonlinear optimization problem for the exposure model parameters in each bootstrap sample. We propose a less computationally intensive alternative termed the “parameter bootstrap” that only requires solving one nonlinear optimization problem, and we also compare bootstrap methods to other recently proposed methods. We illustrate our methodology in simulations and with publicly available data from the Environmental Protection Agency.
Environmental epidemiology; Environmental statistics; Exposure modeling; Kriging; Measurement error
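The decomposition mentioned above can be sketched as follows (our notation): writing $x$ for the true exposure, $\hat{x}(\eta)$ for the universal kriging prediction at the true exposure-model parameters $\eta$, and $\hat{x}(\hat{\eta})$ for the prediction at estimated parameters,
$$\hat{x}(\hat{\eta}) - x = \underbrace{\hat{x}(\eta) - x}_{\text{Berkson-like}} + \underbrace{\hat{x}(\hat{\eta}) - \hat{x}(\eta)}_{\text{classical-like}};$$
the first term reflects smoothing of the true exposure surface and the second reflects sampling variability in $\hat{\eta}$. Roughly, the parameter bootstrap resamples only the second component by drawing from the estimated distribution of $\hat{\eta}$, which is why a single nonlinear optimization suffices.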
Accurate risk prediction is an important step in developing optimal strategies for disease prevention and treatment. Based on the predicted risks, patients can be stratified to different risk categories where each category corresponds to a particular clinical intervention. Incorrect or suboptimal interventions are likely to result in unnecessary financial and medical consequences. It is thus essential to account for the costs associated with the clinical interventions when developing and evaluating risk stratification (RS) rules for clinical use. In this article, we propose to quantify the value of an RS rule based on the total expected cost attributed to incorrect assignment of risk groups due to the rule. We have established the relationship between cost parameters and optimal threshold values used in the stratification rule that minimizes the total expected cost over the entire population of interest. Statistical inference procedures are developed for evaluating and comparing given RS rules and examined through simulation studies. The proposed procedures are illustrated with an example from the Cardiovascular Health Study.
Disease prognosis; Optimal risk stratification; Risk prediction
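In generic form (our notation; the paper derives optimal thresholds from an objective of this type): with risk categories $k = 1, \dots, K$ defined by threshold values on the predicted risk, and $C_{kl}$ the cost incurred when a subject whose appropriate category is $l$ is assigned to category $k$, the value of an RS rule is summarized by the total expected cost
$$\mathbb{E}[\text{cost}] = \sum_{k \neq l} C_{kl}\, \Pr(\text{assigned } k,\ \text{appropriate category } l),$$
and the optimal thresholds are those minimizing this quantity over the population of interest.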
In high-throughput -omics studies, markers identified from analysis of single data sets often suffer from a lack of reproducibility because of sample limitation. A cost-effective remedy is to pool data from multiple comparable studies and conduct integrative analysis. Integrative analysis of multiple -omics data sets is challenging because of the high dimensionality of data and heterogeneity among studies. In this article, for marker selection in integrative analysis of data from multiple heterogeneous studies, we propose a 2-norm group bridge penalization approach. This approach can effectively identify markers with consistent effects across multiple studies and accommodate the heterogeneity among studies. We propose an efficient computational algorithm and establish the asymptotic consistency property. Simulations and applications in cancer profiling studies show satisfactory performance of the proposed approach.
High-dimensional data; Integrative analysis; 2-norm group bridge
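For concreteness (our notation, hedging on weights and tuning details): with $\beta_{(j)}$ collecting the coefficients of marker $j$ across the $M$ studies, a 2-norm group bridge penalty takes the form
$$\lambda \sum_{j=1}^{p} c_j \,\lVert \beta_{(j)} \rVert_2^{\gamma}, \qquad 0 < \gamma < 1,$$
where the bridge exponent $\gamma$ induces sparsity at the marker (group) level while the inner 2-norm pools a marker's possibly heterogeneous effects across studies, so a marker is selected only when its effects are consistent across the multiple data sets.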
We consider here the problem of classifying a macro-level object based on measurements of embedded (micro-level) observations within each object, for example, classifying a patient based on measurements on a collection of a random number of their cells. Classification problems with this hierarchical, nested structure have not received the same statistical understanding as the general classification problem. Some heuristic approaches have been developed, and a few authors have proposed formal statistical models. We focus on the problem where heterogeneity exists between the macro-level objects within a class. We propose a model-based statistical methodology that models the log-odds of the macro-level object belonging to a class using a latent-class variable model to account for this heterogeneity. The latent classes are estimated by clustering the macro-level object density estimates. We apply this method to the detection of patients with cervical neoplasia based on quantitative cytology measurements on cells in a Papanicolaou smear. Quantitative cytology is much cheaper and potentially faster than the current standard of care. The results show that automated quantitative cytology using the proposed method is roughly equivalent to clinical cytopathology and shows significant improvement over a statistical model that does not account for the heterogeneity of the data.
Automating cervical neoplasia screening; Clustering densities; Cumulative log-odds; Functional data clustering; Macro-level classification; Quantitative cytology
Development of human immunodeficiency virus resistance mutations is a major cause of failure of antiretroviral treatment. We develop a recursive partitioning method to correlate high-dimensional viral sequences with repeatedly measured outcomes. The splitting criterion of this procedure is based on a class of U-type score statistics. The proposed method is flexible enough to apply to a broad range of problems involving longitudinal outcomes. Simulation studies are performed to explore the finite-sample properties of the proposed method, which is also illustrated through analysis of data collected in 3 phase II clinical trials testing the antiretroviral drug efavirenz.
Antiretroviral drugs; Longitudinal data; Recursive partitioning; Repeated measurements; Resistance mutations; Tree method
Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure involves modeling each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified through an application of the fused lasso penalty to each feature. Simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.
Cancer; DNA copy number; False discovery rate; Mutation
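Schematically (our notation, following the description above): each sample's copy number profile $y_s$ across the $L$ probe locations is modeled as a weighted sum of $J$ common features,
$$y_s \approx \sum_{j=1}^{J} \theta_{sj}\, \beta_j,$$
with each feature $\beta_j$ subject to a fused lasso penalty $\lambda_1 \sum_{l} |\beta_{lj}| + \lambda_2 \sum_{l} |\beta_{lj} - \beta_{l-1,j}|$, encouraging features that are sparse and piecewise constant along the genome; the estimated weights $\theta_{sj}$ are what reveal groupings among samples such as the 3 groups found for chromosome 17.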
We investigate a change-point approach for modeling and estimating the regression effects caused by a concomitant intervention in a longitudinal study. Since a concomitant intervention is often introduced when a patient's health status exhibits undesirable trends, statistical models without properly incorporating the intervention and its starting time may lead to biased estimates of the intervention effects. We propose a shared parameter change-point model to evaluate the pre- and postintervention time trends of the response and develop a likelihood-based method for estimating the intervention effects and other parameters. Application and statistical properties of our method are demonstrated through a longitudinal clinical trial in depression and heart disease and a simulation study.
Change-point model; Concomitant intervention; Likelihood; Longitudinal study; Shared parameter model
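A minimal version of such a model (our sketch; the paper's shared-parameter formulation additionally links the change point to the response process): with $\tau_i$ the subject-specific intervention start time,
$$y_{ij} = \beta_{0i} + \beta_{1i} t_{ij} + \beta_{2i} (t_{ij} - \tau_i)_{+} + \varepsilon_{ij},$$
where $(u)_{+} = \max(u, 0)$, so $\beta_{1i}$ captures the preintervention trend and $\beta_{2i}$ the postintervention change in slope. Ignoring the dependence of $\tau_i$ on the subject's own trajectory is precisely what produces the bias described above.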
In air pollution epidemiology, there is a growing interest in estimating the health effects of coarse particulate matter (PM) with aerodynamic diameter between 2.5 and 10 μm. Coarse PM concentrations can exhibit considerable spatial heterogeneity because the particles travel shorter distances and do not remain suspended in the atmosphere for an extended period of time. In this paper, we develop a modeling approach for estimating the short-term effects of air pollution in time series analysis when the ambient concentrations vary spatially within the study region. Specifically, our approach quantifies the error in the exposure variable by characterizing, on any given day, the disagreement in ambient concentrations measured across monitoring stations. This is accomplished by viewing monitor-level measurements as error-prone repeated measurements of the unobserved population average exposure. Inference is carried out in a Bayesian framework to fully account for uncertainty in the estimation of model parameters. Finally, by using different exposure indicators, we investigate the sensitivity of the association between coarse PM and daily hospital admissions based on a recent national multisite time series analysis. Among Medicare enrollees from 59 US counties between the period 1999 and 2005, we find a consistent positive association between coarse PM and same-day admission for cardiovascular diseases.
Air pollution; Coarse particulate matter; Exposure measurement error; Multisite time series analysis
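The measurement model described above can be sketched as (our notation)
$$w_{jt} = x_t + u_{jt},$$
where $w_{jt}$ is the concentration recorded at monitor $j$ on day $t$, $x_t$ is the unobserved population-average ambient exposure that enters the health model, and $u_{jt}$ captures the spatial disagreement across monitors; fitting the health and exposure models jointly in a Bayesian framework propagates the day-to-day uncertainty about $x_t$ into the estimated short-term effect.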
Our research is motivated by 2 methodological problems in assessing the diagnostic accuracy of traditional Chinese medicine (TCM) doctors in detecting a particular symptom whose true status has an ordinal scale and is unknown: imperfect gold standard bias and ordinal scale symptom status. In this paper, we propose a nonparametric maximum likelihood method for estimating and comparing the accuracy of different doctors in detecting a particular symptom without a gold standard when the true symptom status has multiple ordered classes. In addition, we extend the concept of the area under the receiver operating characteristic curve to a hyper-dimensional overall accuracy for diagnostic accuracy, together with alternative graphs for displaying visual results. Simulation studies show that the proposed method has good performance in terms of bias and mean squared error. Finally, we apply our method to our motivating example on assessing the diagnostic abilities of 5 TCM doctors in detecting symptoms related to Chills disease.
Bootstrap; Diagnostic accuracy; EM algorithm; MSE; Ordinal tests; Traditional Chinese medicine (TCM); Volume under the ROC surface (VUS)
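The extension mentioned above builds on the standard volume under the ROC surface: for a test result $X$ and $M$ ordered disease classes,
$$\mathrm{VUS} = \Pr(X_1 < X_2 < \cdots < X_M),$$
the probability that independent measurements drawn from the $M$ classes fall in the correct order, with $M = 2$ recovering the usual AUC; the paper's contribution is estimating this type of quantity without a gold standard for the true class.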
The resampling-based test, which often relies on permutation or bootstrap procedures, has been widely used for statistical hypothesis testing when the asymptotic distribution of the test statistic is unavailable or unreliable. It requires repeated calculation of the test statistic on a large number of simulated data sets to assess its significance level, and thus it can become very computationally intensive. Here, we propose an efficient p-value evaluation procedure by adapting the stochastic approximation Markov chain Monte Carlo algorithm. The new procedure can be used easily for estimating the p-value of any resampling-based test. We show through numeric simulations that the proposed procedure can be 100–500 000 times as efficient (in terms of computing time) as the standard resampling-based procedure when evaluating a test statistic with a small p-value (e.g. less than 10^{-6}). With its computational burden reduced by this proposed procedure, the versatile resampling-based test would become computationally feasible for a much wider range of applications. We demonstrate the application of the new method by applying it to a large-scale genetic association study of prostate cancer.
Bootstrap procedures; Genetic association studies; p-value; Resampling-based tests; Stochastic approximation Markov chain Monte Carlo
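To see why small p-values are expensive, here is the standard resampling estimate that the proposed procedure accelerates (a generic permutation-test sketch, not the paper's algorithm): resolving a p-value near 10^{-6} this way requires on the order of 10^7 or more resamples.

# Standard resampling p-value from B permutations of the labels
perm_pvalue <- function(y, x, stat, B = 1e5) {
  t_obs  <- stat(y, x)
  t_perm <- replicate(B, stat(sample(y), x))
  (1 + sum(t_perm >= t_obs)) / (B + 1)  # add-one estimate
}

# Example statistic: absolute difference in group means
stat <- function(y, x) abs(mean(x[y == 1]) - mean(x[y == 0]))

Roughly speaking, the stochastic approximation MCMC approach instead learns the tail of the null distribution adaptively, so far fewer statistic evaluations are spent in regions that contribute nothing to the tail probability.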
The two-stage design has long been recognized as a cost-effective way of conducting biomedical studies. In many trials, auxiliary covariate information may also be available, and it is of interest to exploit these auxiliary data to improve the efficiency of inferences. In this paper, we propose a 2-stage design with a continuous outcome in which the second-stage data are sampled under an “outcome-auxiliary-dependent sampling” (OADS) scheme. We propose an estimator that maximizes an estimated likelihood function. We show that the proposed estimator is consistent and asymptotically normally distributed. A simulation study indicates that greater efficiency gains can be achieved under the proposed 2-stage OADS design by utilizing the auxiliary covariate information than with alternative sampling schemes. We illustrate the proposed method by analyzing a data set from an environmental epidemiologic study.
Auxiliary covariate; Kernel smoothing; Outcome-auxiliary-dependent sampling; 2-stage sampling design
When the true end points (T) are difficult or costly to measure, surrogate markers (S) are often collected in clinical trials to help predict the effect of the treatment (Z). There is great interest in understanding the relationship among S, T, and Z. A principal stratification (PS) framework has been proposed by Frangakis and Rubin (2002) to study their causal associations. In this paper, we extend the framework to a multiple trial setting and propose a Bayesian hierarchical PS model to assess surrogacy. We apply the method to data from a large collection of colon cancer trials in which S and T are binary. We obtain the trial-specific causal measures among S, T, and Z, as well as their overall population-level counterparts that are invariant across trials. The method allows for information sharing across trials and reduces the nonidentifiability problem. We examine the frequentist properties of our model estimates and the impact of the monotonicity assumption using simulations. We also illustrate the challenges in evaluating surrogacy in the counterfactual framework that result from nonidentifiability.
Bayesian estimation; Counterfactual model; Identifiability; Multiple trials; Principal stratification; Surrogate marker
Within the pattern-mixture modeling framework for informative dropout, conditional linear models (CLMs) are a useful approach to deal with dropout that can occur at any point in continuous time (not just at observation times). However, in contrast with selection models, inferences about marginal covariate effects in CLMs are not readily available if nonidentity links are used in the mean structures. In this article, we propose a CLM for long series of longitudinal binary data with marginal covariate effects directly specified. The association between the binary responses and the dropout time is taken into account by modeling the conditional mean of the binary response as well as the dependence between the binary responses given the dropout time. Specifically, parameters in both the conditional mean and dependence models are assumed to be linear or quadratic functions of the dropout time; and the continuous dropout time distribution is left completely unspecified. Inference is fully Bayesian. We illustrate the proposed model using data from a longitudinal study of depression in HIV-infected women, where the strategy of sensitivity analysis based on the extrapolation method is also demonstrated.
Bayesian analysis; HIV/AIDS; Marginal model; Missing data; Sensitivity analysis
Understanding infectious disease dynamics and the effect on prevalence and incidence is crucial for public health policies. Disease incidence and prevalence are typically not observed directly and increasingly are estimated through the synthesis of indirect information from multiple data sources. We demonstrate how an evidence synthesis approach to the estimation of human immunodeficiency virus (HIV) prevalence in England and Wales can be extended to infer the underlying HIV incidence. Diverse time series of data can be used to obtain yearly “snapshots” (with associated uncertainty) of the proportion of the population in 4 compartments: not at risk, susceptible, HIV positive but undiagnosed, and diagnosed HIV positive. A multistate model for the infection and diagnosis processes is then formulated by expressing the changes in these proportions by a system of differential equations. By parameterizing incidence in terms of prevalence and contact rates, HIV transmission is further modeled. Use of additional data or prior information on demographics, risk behavior change and contact parameters allows simultaneous estimation of the transition rates, compartment prevalences, contact rates, and transmission probabilities.
Bayesian; Dynamic transmission model; Evidence synthesis; HIV; Incidence; Prevalence
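A stylized version of the compartmental dynamics (our sketch; the paper's parameterization, including demographics and behavior change, is richer): with $S$, $U$, and $D$ the proportions susceptible, HIV positive but undiagnosed, and diagnosed,
$$\frac{dS}{dt} = -\lambda(t)\, S, \qquad \frac{dU}{dt} = \lambda(t)\, S - \delta(t)\, U, \qquad \frac{dD}{dt} = \delta(t)\, U,$$
where $\delta(t)$ is the diagnosis rate and the incidence rate $\lambda(t)$ is expressed through prevalence, contact rates, and transmission probabilities; the yearly evidence-synthesis snapshots constrain the compartment proportions that these equations must reproduce.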
Food records, including 24-hour recalls and diet diaries, are considered to provide generally superior measures of long-term dietary intake relative to questionnaire-based methods. Despite the expense of processing food records, they are increasingly used as the main dietary measurement in nutritional epidemiology, in particular in sub-studies nested within prospective cohorts. Food records are, however, subject to excess reports of zero intake. Measurement error is a serious problem in nutritional epidemiology because of the lack of gold standard measurements, and it results in biased estimates of diet–disease associations. In this paper, a 3-part measurement error model, which we call the never and episodic consumers (NEC) model, is outlined for food records. It allows for both real zeros, due to never consumers, and excess zeros, due to episodic consumers (EC). Repeated measurements are required for some study participants to fit the model. Simulation studies are used to compare the results from using the proposed model to correct for measurement error with the results from 3 alternative approaches: a crude approach using the mean of repeated food record measurements as the exposure, a linear regression calibration (RC) approach, and an EC model that does not allow real zeros. The crude approach results in badly attenuated odds ratio estimates, except in the unlikely situation in which a large number of repeat measurements is available for all participants. Where repeat measurements are available for all participants, the 3 correction methods perform equally well. However, when only a subset of the study population has repeat measurements, the NEC model appears to provide the best method for correcting for measurement error, with the 2 alternative correction methods, in particular the linear RC approach, resulting in greater bias and loss of coverage. The NEC model is extended to include adjustment for measurements from food frequency questionnaires, enabling better estimation of the proportion of never consumers when the number of repeat measurements is small. The methods are applied to 7-day diary measurements of alcohol intake in the EPIC-Norfolk study.
Excess zeros; Measurement error; Nutritional epidemiology; Repeated measures
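In outline (our notation for the 3-part structure described above): with $\pi$ the probability of being a never consumer and, for episodic consumers, $p_i$ the person-specific probability of consumption on a recorded day,
$$\Pr(W_{ij} = 0) = \pi + (1 - \pi)(1 - p_i),$$
while positive recorded intakes $W_{ij} > 0$ are modeled as error-prone measurements of the amount consumed on consumption days. Zeros thus mix real zeros (never consumers) with excess zeros (consumers' nonconsumption days), and it is the repeated records, supplemented where available by food frequency questionnaire data, that allow the two sources to be separated.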