We argue that the time from the onset of infectiousness to infectious contact, which we call the “contact interval,” is a better basis for inference in epidemic data than the generation or serial interval. Since contact intervals can be right censored, survival analysis is the natural approach to estimation. Estimates of the contact interval distribution can be used to estimate R0 in both mass-action and network-based models. We apply these methods to 2 data sets from the 2009 influenza A(H1N1) pandemic.
Basic reproductive number (R0); Epidemic data; Generation intervals; Survival analysis
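As a rough illustration of the survival-analysis approach described above, the sketch below computes a product-limit (Kaplan–Meier) estimate of the contact-interval survival function from right-censored intervals. The household data, variable names, and values are hypothetical, and the estimator shown is a generic one rather than the authors' full procedure for estimating R0.

```python
import numpy as np

def kaplan_meier(times, observed):
    """Product-limit estimate of the contact-interval survival function S(t)
    from right-censored data: `times` holds contact intervals or censoring
    times, and `observed` is 1 when the infectious contact was actually seen."""
    times = np.asarray(times, dtype=float)
    observed = np.asarray(observed, dtype=int)
    surv, est = 1.0, []
    for t in np.unique(times[observed == 1]):
        at_risk = np.sum(times >= t)                 # still under observation just before t
        events = np.sum((times == t) & (observed == 1))
        surv *= 1.0 - events / at_risk
        est.append((t, surv))
    return est

# Hypothetical household follow-up data (days from onset of infectiousness).
intervals = [2.1, 3.4, 5.0, 1.2, 4.7, 6.0]
contact_seen = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(intervals, contact_seen):
    print(f"S({t:.1f}) = {s:.3f}")
```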
The outcome-dependent sampling (ODS) design, which allows observation of the exposure variable to depend on the outcome, has been shown to be cost efficient. In this article, we propose a new statistical inference method, an estimated penalized likelihood method, for a partial linear model in the setting of a 2-stage ODS with a continuous outcome. We develop the asymptotic properties and conduct simulation studies to demonstrate the performance of the proposed estimator. A real environmental study data set is used to illustrate the proposed method.
Biased sampling; Partial linear model; P-spline; Validation sample; 2-stage
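The nonparametric component of a partial linear model is often handled with a penalized spline. The sketch below fits a generic penalized spline by ridge-penalized least squares on a truncated-linear basis; the basis, smoothing parameter, and simulated data are arbitrary choices, and the ODS-specific estimated penalized likelihood of the article is not reproduced here.

```python
import numpy as np

def penalized_spline_fit(x, y, n_knots=10, lam=1.0):
    """Penalized-spline fit of E[y|x] with a truncated linear basis:
    f(x) = b0 + b1*x + sum_j u_j*(x - kappa_j)_+, with a ridge penalty
    lam * sum_j u_j^2 on the knot coefficients only."""
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    X = np.column_stack([np.ones_like(x), x] +
                        [np.maximum(x - k, 0.0) for k in knots])
    D = np.diag([0.0, 0.0] + [1.0] * len(knots))   # do not penalize intercept/slope
    beta = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
    return knots, beta

def penalized_spline_predict(xnew, knots, beta):
    X = np.column_stack([np.ones_like(xnew), xnew] +
                        [np.maximum(xnew - k, 0.0) for k in knots])
    return X @ beta

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)
knots, beta = penalized_spline_fit(x, y, lam=0.5)
print(penalized_spline_predict(np.array([0.25, 0.5, 0.75]), knots, beta))
```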
Nucleosomes are units of chromatin structure, consisting of DNA sequence wrapped around proteins called “histones.” Nucleosomes occur at variable intervals throughout genomic DNA and prevent transcription factor (TF) binding by blocking TF access to the DNA. A map of nucleosomal locations would enable researchers to detect TF binding sites with greater efficiency. Our objective is to construct an accurate genomic map of nucleosome-free regions (NFRs) based on data from high-throughput genomic tiling arrays in yeast. These high-volume data typically have a complex structure in the form of dependence on neighboring probes as well as underlying DNA sequence, variable-sized gaps, and missing data. We propose a novel continuous-index model appropriate for non-equispaced tiling array data that simultaneously incorporates DNA sequence features relevant to nucleosome formation. Simulation studies and an application to a yeast nucleosomal assay demonstrate the advantages of using the new modeling framework, as well as its robustness to distributional misspecifications. Our results reinforce the previous biological hypothesis that higher-order nucleotide combinations are important in distinguishing nucleosomal regions from NFRs.
Chromatin structure; Data augmentation; FAIRE; Tiling arrays
For medical classification problems, it is often desirable to have a probability associated with each class. Probabilistic classifiers have received relatively little attention for small n, large p classification problems despite their importance in medical decision making. In this paper, we introduce 2 criteria for the assessment of probabilistic classifiers, well-calibratedness and refinement, and develop corresponding evaluation measures. We evaluated several published high-dimensional probabilistic classifiers and developed 2 extensions of the Bayesian compound covariate classifier. Based on simulation studies and analysis of gene expression microarray data, we found that proper probabilistic classification is more difficult than deterministic classification. It is important to ensure that a probabilistic classifier is well calibrated, or at least not “anticonservative,” using the methods developed here. We provide this evaluation for several probabilistic classifiers and also evaluate their refinement as a function of sample size under weak and strong signal conditions. We also present a cross-validation method for evaluating the calibration and refinement of any probabilistic classifier on any data set.
Gene expression analysis; High-dimensional data; Microarray; Probabilistic classification
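A minimal sketch of one way to check well-calibratedness: bin predicted class probabilities and compare bin means with observed class frequencies. The data are simulated and the binning scheme is an illustrative choice rather than the evaluation measures developed in the paper; in practice the predictions would come from cross-validation, as the paper proposes.

```python
import numpy as np

def calibration_table(pred_prob, outcome, n_bins=10):
    """Within probability bins, compare mean predicted class-1 probability
    with the observed class-1 frequency; a well-calibrated classifier has
    the two close in every bin."""
    pred_prob = np.asarray(pred_prob, float)
    outcome = np.asarray(outcome, int)
    edges = np.linspace(0, 1, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (pred_prob >= lo) & (pred_prob < hi)
        if mask.sum() == 0:
            continue
        rows.append((lo, hi, mask.sum(), pred_prob[mask].mean(), outcome[mask].mean()))
    return rows

rng = np.random.default_rng(1)
p = rng.uniform(0, 1, 500)        # hypothetical predicted probabilities
y = rng.binomial(1, p)            # outcomes drawn from those probabilities
for lo, hi, n, mean_p, obs in calibration_table(p, y):
    print(f"[{lo:.1f},{hi:.1f}) n={n:3d} mean pred={mean_p:.2f} observed={obs:.2f}")
```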
Recurrent events are the natural outcome in many medical and epidemiology studies. To assess covariate effects on the gaps between consecutive recurrent events, the Cox proportional hazards model is frequently employed in data analysis. The validity of statistical inference, however, depends on the appropriateness of the Cox model. In this paper, we propose a class of graphical techniques and formal tests for checking the Cox model with recurrent gap time data. The building block of our model checking method is an averaged martingale-like process, based on which a class of multiparameter stochastic processes is proposed. This maneuver is very general and can be used to assess different aspects of model fit. Numerical simulations are conducted to examine finite-sample performance, and the proposed model checking techniques are illustrated with data from the Danish Psychiatric Central Register.
Correlated failure times; Induced-dependent censoring; Kaplan–Meier estimator; Renewal processes
Genetic mutations may interact to increase the risk of human complex diseases. Mapping of multiple interacting disease loci in the human genome has recently shown promise in detecting genes with weak main effects. The power of interaction association mapping, however, can be greatly influenced by the set of single nucleotide polymorphisms (SNPs) genotyped in a case–control study. Previous imputation methods focus only on imputation of individual SNPs without considering the joint distribution of possible interactions. We present a new method that simultaneously detects multilocus interaction associations and imputes missing SNPs from a full Bayesian model. Our method treats both the case–control sample and the reference data as random observations. The output of our method is the posterior probabilities of SNPs for their marginal and interacting associations with the disease. Using simulations, we show that the method produces accurate and robust imputation with few overfitting problems. We further show that, with the type I error rate maintained at a common level, SNP imputation can consistently and sometimes substantially improve the power of detecting disease interaction associations. We use a data set of inflammatory bowel disease to demonstrate the application of our method.
Bayesian analysis; Case–control studies; Missing data
The Superior Yield of the New Strategy of Enoxaparin, Revascularization, and GlYcoprotein IIb/IIIa inhibitors (SYNERGY) trial was a randomized, open-label, multicenter clinical trial comparing 2 anticoagulant drugs on the basis of time-to-event endpoints. In contrast to other studies of these agents, the primary, intent-to-treat analysis did not find evidence of a difference, leading to speculation that premature discontinuation of the study agents by some subjects may have attenuated the apparent treatment effect, and thus to interest in inference on the difference in survival distributions that would arise were all subjects in the population to follow the assigned regimens, with no discontinuation. Such inference is often attempted via ad hoc analyses that are not based on a formal definition of this treatment effect. We use SYNERGY as a context in which to describe how this effect may be conceptualized and to present a statistical framework in which it may be precisely identified, leading naturally to inferential methods based on inverse probability weighting.
Dynamic treatment regime; Inverse probability weighting; Potential outcomes; Proportional hazards model
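To make the inverse-probability-weighting idea concrete, the sketch below upweights subjects who remain on their assigned regimen by the inverse of an estimated adherence probability and computes a weighted Kaplan–Meier curve. This is a single-time-point simplification with simulated data and a made-up covariate; the paper's framework defines the estimand formally and uses time-dependent weighting under a proportional hazards model.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
severe = rng.binomial(1, 0.4, n)                            # hypothetical baseline covariate
adhered = rng.binomial(1, np.where(severe == 1, 0.6, 0.85))  # stayed on assigned regimen
time = rng.exponential(10, n)
event = rng.binomial(1, 0.7, n)

# P(adhere | covariate) estimated within strata; adherers are upweighted so that
# they stand in for comparable subjects who discontinued (who get weight 0 here).
p_adhere = np.array([adhered[severe == s].mean() for s in (0, 1)])[severe]
w = adhered / p_adhere

def weighted_km(time, event, w):
    """Kaplan-Meier estimate with subject-level weights."""
    surv, out = 1.0, []
    for t in np.sort(np.unique(time[(event == 1) & (w > 0)])):
        d = w[(time == t) & (event == 1)].sum()
        n_at_risk = w[time >= t].sum()
        surv *= 1.0 - d / n_at_risk
        out.append((t, surv))
    return out

print(weighted_km(time, event, w)[:5])
```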
DNA methylation is a key regulator of gene function in a multitude of both normal and abnormal biological processes, but tools to elucidate its roles on a genome-wide scale are still in their infancy. Methylation sensitive restriction enzymes and microarrays provide a potential high-throughput, low-cost platform to allow methylation profiling. However, accurate absolute methylation estimates have been elusive due to systematic errors and unwanted variability. Previous microarray preprocessing procedures, mostly developed for expression arrays, fail to adequately normalize methylation-related data since they rely on key assumptions that are violated in the case of DNA methylation. We develop a normalization strategy tailored to DNA methylation data and an empirical Bayes percentage methylation estimator that together yield accurate absolute methylation estimates that can be compared across samples. We illustrate the method on data generated to detect methylation differences between tissues and between normal and tumor colon samples.
DNA methylation; Epigenetics; Microarray
Suppose that, in a conventional randomized clinical trial setting, a new therapy is compared with a standard treatment. In this article, we propose a systematic, 2-stage estimation procedure for subject-level treatment differences to guide future patients' disease management and treatment selection. To construct this procedure, we first utilize a parametric or semiparametric method to estimate individual-level treatment differences and use these estimates to create an index scoring system for grouping patients. We then consistently estimate the average treatment difference for each subgroup of subjects via a nonparametric function estimation method. Furthermore, pointwise and simultaneous interval estimates are constructed to make inferences about such subgroup-specific treatment differences. The new proposal is illustrated with the data from a clinical trial for evaluating the efficacy and toxicity of a 3-drug combination versus a standard 2-drug combination for treating HIV-1–infected patients.
Cross-validation; HIV infection; Nonparametric function estimation; Personalized medicine; Subgroup analysis
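A simplified, simulated illustration of the 2-stage idea: a working regression with treatment–covariate interactions produces a subject-level score, and the average treatment difference is then estimated nonparametrically within score-defined groups. The quartile grouping and simple differences in arm means are stand-ins for the cross-validated scoring and smoothing-based interval estimation developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.normal(size=(n, 2))                       # baseline covariates
trt = rng.binomial(1, 0.5, n)
y = 1.0 + x[:, 0] + trt * (0.5 + 0.8 * x[:, 1]) + rng.normal(0, 1, n)

# Stage 1: working linear model with treatment-covariate interactions (least squares)
# yields a score estimating each subject's treatment difference.
X = np.column_stack([np.ones(n), x, trt, trt * x[:, 0], trt * x[:, 1]])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
score = beta[3] + x @ beta[4:6]

# Stage 2: group subjects by score quartile and estimate the average treatment
# difference within each group by the difference in arm means.
group = np.digitize(score, np.quantile(score, [0.25, 0.5, 0.75]))
for g in range(4):
    m = group == g
    diff = y[m & (trt == 1)].mean() - y[m & (trt == 0)].mean()
    print(f"score group {g}: estimated treatment difference = {diff:.2f}")
```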
The hazard ratio provides a natural target for assessing a treatment effect with survival data, with the Cox proportional hazards model providing a widely used special case. In general, the hazard ratio is a function of time and provides a visual display of the temporal pattern of the treatment effect. A variety of nonproportional hazards models have been proposed in the literature. However, available methods for flexibly estimating a possibly time-dependent hazard ratio are limited. Here, we investigate a semiparametric model that allows a wide range of time-varying hazard ratio shapes. Point estimates as well as pointwise confidence intervals and simultaneous confidence bands of the hazard ratio function are established under this model. The average hazard ratio function is also studied to assess the cumulative treatment effect. We illustrate corresponding inference procedures using coronary heart disease data from the Women's Health Initiative estrogen plus progestin clinical trial.
Clinical trial; Empirical process; Gaussian process; Hazard ratio; Simultaneous inference; Survival analysis; Treatment–time interaction
This paper addresses the dose-finding problem in cancer trials in which we are concerned with the gradation of severe toxicities that are considered dose limiting. In order to differentiate the tolerance for different toxicity types and grades, we propose a novel extension of the continual reassessment method that explicitly accounts for multiple toxicity constraints. We apply the proposed methods to redesign a bortezomib trial in lymphoma patients and compare their performance with that of the existing methods. Based on simulations, our proposed methods achieve comparable accuracy in identifying the maximum tolerated dose but have better control of the erroneous allocation and recommendation of an overdose.
Design calibration; Dose-finding cancer trials; Toxicity grades and types
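For context, the sketch below implements the standard single-constraint continual reassessment method with a one-parameter power model and a normal prior, updating dose assignments by numerical integration; the skeleton, prior variance, and accrual data are hypothetical, and the paper's extension to multiple toxicity constraints is not shown.

```python
import numpy as np

skeleton = np.array([0.05, 0.10, 0.20, 0.30, 0.45])   # prior guesses of DLT rates by dose
target = 0.25
sigma = np.sqrt(1.34)                                  # prior SD for the model parameter

def crm_next_dose(doses_given, dlt):
    """Posterior-mean toxicity under the power model p_i^exp(a), a ~ N(0, sigma^2),
    computed on a grid; recommends the dose closest to the target DLT rate."""
    a = np.linspace(-4, 4, 2001)
    prior = np.exp(-0.5 * (a / sigma) ** 2)
    loglik = np.zeros_like(a)
    for d, y in zip(doses_given, dlt):
        p = skeleton[d] ** np.exp(a)
        loglik += y * np.log(p) + (1 - y) * np.log(1 - p)
    post = prior * np.exp(loglik - loglik.max())
    post /= np.trapz(post, a)
    p_hat = np.array([np.trapz(skeleton[i] ** np.exp(a) * post, a)
                      for i in range(len(skeleton))])
    return int(np.argmin(np.abs(p_hat - target))), p_hat

# Hypothetical accrual: 3 patients at the second dose level, one DLT observed.
next_dose, p_hat = crm_next_dose(doses_given=[1, 1, 1], dlt=[0, 0, 1])
print("posterior DLT estimates:", np.round(p_hat, 3), "next dose index:", next_dose)
```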
A major goal of genetic association studies concerned with single nucleotide polymorphisms (SNPs) is the detection of SNPs exhibiting an impact on the risk of developing a disease. Typically, this problem is approached by testing each of the SNPs individually. This, however, can lead to an inaccurate measurement of the influence of the SNPs on the disease risk, in particular, if SNPs only show an effect when interacting with other SNPs, as the multivariate structure of the data is ignored. In this article, we propose a testing procedure based on logic regression that takes this structure into account and therefore enables a more appropriate quantification of importance and ranking of the SNPs than marginal testing. Since even SNP interactions often exhibit only a moderate effect on the disease risk, it can be helpful to also consider sets of SNPs (e.g. SNPs belonging to the same gene or pathway) to borrow strength across these SNP sets and to identify those genes or pathways comprising SNPs that are most consistently associated with the response. We show how the proposed procedure can be adapted for testing SNP sets, and how it can be applied to blocks of SNPs in linkage disequilibrium (LD) to overcome problems caused by LD.
Feature selection; GENICA; Importance measure; logicFS; Logic regression
Submicroscopic changes in chromosomal DNA copy number dosage are common and have been implicated in many heritable diseases and cancers. Recent high-throughput technologies have a resolution that permits the detection of segmental changes in DNA copy number that span thousands of base pairs in the genome. Genomewide association studies (GWAS) may simultaneously screen for copy number phenotype and single nucleotide polymorphism (SNP) phenotype associations as part of the analytic strategy. However, genomewide array analyses are particularly susceptible to batch effects as the logistics of preparing DNA and processing thousands of arrays often involves multiple laboratories and technicians, or changes over calendar time to the reagents and laboratory equipment. Failure to adjust for batch effects can lead to incorrect inference and requires inefficient post hoc quality control procedures to exclude regions that are associated with batch. Our work extends previous model-based approaches for copy number estimation by explicitly modeling batch and using shrinkage to improve locus-specific estimates of copy number uncertainty. Key features of this approach include the use of biallelic genotype calls from experimental data to estimate batch-specific and locus-specific parameters of background and signal without the requirement of training data. We illustrate these ideas using a study of bipolar disease and a study of chromosome 21 trisomy. The former has batch effects that dominate much of the observed variation in the quantile-normalized intensities, while the latter illustrates the robustness of our approach to a data set in which approximately 27% of the samples have altered copy number. Locus-specific estimates of copy number can be plotted on the copy number scale to investigate mosaicism and guide the choice of appropriate downstream approaches for smoothing the copy number as a function of physical position. The software is open source and implemented in the R package crlmm at Bioconductor (http://www.bioconductor.org).
Bioinformatics; Hierarchical models; DNA copy number variations; Single nucleotide polymorphism array
The diagnostic likelihood ratio function, DLR, is a statistical measure used to evaluate risk prediction markers. The goal of this paper is to develop new methods to estimate the DLR function. Furthermore, we show how risk prediction markers can be compared using rank-invariant DLR functions. Various estimators are proposed that accommodate cohort or case–control study designs. Performances of the estimators are compared using simulation studies. The methods are illustrated by comparing a lung function measure and a nutritional status measure for predicting subsequent onset of major pulmonary infection in children suffering from cystic fibrosis. For continuous markers, the DLR function is mathematically related to the slope of the receiver operating characteristic (ROC) curve, an entity used to evaluate diagnostic markers. We show that our methodology can be used to estimate the slope of the ROC curve and illustrate use of the estimated ROC derivative in variance and sample size calculations for a diagnostic biomarker study.
Biomarker; Density estimation; Diagnosis; Logistic regression; Rank invariant; Risk prediction; ROC–GLM
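A minimal, density-based sketch of the DLR function: with kernel density estimates of the marker in cases and controls, DLR(y) is their ratio, which for a continuous marker equals the slope of the ROC curve at the operating point defined by threshold y. The simulated Gaussian markers and plug-in kernel estimator are illustrative only; the paper's estimators also accommodate case–control designs and rank-invariant comparisons.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(4)
marker_cases = rng.normal(1.0, 1.0, 300)      # marker in subjects who develop the event
marker_controls = rng.normal(0.0, 1.0, 300)   # marker in subjects who do not

f_case = gaussian_kde(marker_cases)
f_control = gaussian_kde(marker_controls)

def dlr(y):
    """Diagnostic likelihood ratio: ratio of the conditional marker densities,
    which also equals the ROC curve slope at threshold y for a continuous marker."""
    return f_case(y) / f_control(y)

grid = np.linspace(-2, 3, 6)
print(np.round(dlr(grid), 2))
```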
We use the term “index predictor” to denote a score that consists of K binary rules such as “age > 60” or “blood pressure > 120 mm Hg.” The index predictor is the sum of these binary scores, yielding a value from 0 to K. Such indices are often used in clinical studies to stratify population risk; they are usually derived from subject-area considerations. In this paper, we propose a fast data-driven procedure for automatically constructing such indices for linear, logistic, and Cox regression models. We also extend the procedure to create indices for detecting treatment–marker interactions. The methods are illustrated on a study with protein biomarkers as well as a large microarray gene expression study.
Degree of freedom; Index predictor; International prognostic index
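A toy version of data-driven index construction: for each covariate, pick the candidate cutpoint whose binary rule is most strongly associated with the outcome, then sum the binary rules. The greedy association measure and simulated data below are placeholders for the regression-based procedure (linear, logistic, Cox) and the treatment–marker extension described in the paper.

```python
import numpy as np

def build_index(X, y, candidate_qs=(0.25, 0.5, 0.75)):
    """For each covariate, choose the cutpoint (among candidate quantiles) whose rule
    x > c shows the largest absolute difference in event rates above vs. below c."""
    rules = []
    for j in range(X.shape[1]):
        best = None
        for q in candidate_qs:
            c = np.quantile(X[:, j], q)
            above = X[:, j] > c
            if above.all() or (~above).all():
                continue
            assoc = abs(y[above].mean() - y[~above].mean())
            if best is None or assoc > best[0]:
                best = (assoc, c)
        rules.append(best[1])
    return np.array(rules)

def index_score(X, cutpoints):
    return (X > cutpoints).sum(axis=1)        # integer score from 0 to K

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 3))                 # e.g. age, blood pressure, a biomarker
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1]))))
cuts = build_index(X, y)
print("cutpoints:", np.round(cuts, 2))
print("index distribution:", np.bincount(index_score(X, cuts)))
```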
A recurrent statistical problem in cell biology is to draw inference about cell kinetics from observations collected at discrete time points. We investigate this problem when multiple cell clones are observed longitudinally over time. The theory of age-dependent branching processes provides an appealing framework for the quantitative analysis of such data. Because likelihood inference is difficult in this context, we propose an alternative composite likelihood approach, where the estimating function is defined from the marginal or conditional distributions of the number of cells of each observable cell type. These distributions generally have no closed-form expressions, but they can be approximated using simulations. We construct a bias-corrected version of the estimating function, which also offers computational advantages. Two algorithms are discussed to compute parameter estimates. Large sample properties of the estimator are presented. The performance of the proposed method in finite samples is investigated in simulation studies. An application to the analysis of the generation of oligodendrocytes from oligodendrocyte type-2 astrocyte progenitor cells cultured in vitro reveals the effect of neurotrophin-3 on these cells. Our work also demonstrates that the proposed approach outperforms existing ones.
Bias correction; Cell differentiation; Composite likelihood; Discrete data; Monte Carlo; Neurotrophin-3; Oligodendrocytes; Precursor cell; Stochastic model
Although the commonly used log-rank test for comparing survival times between 2 groups enjoys many desirable properties, it and related linear rank tests can perform poorly when sample sizes are small. Similar concerns apply to interval estimates for treatment differences in this setting, though their properties are less well known. Standard permutation tests are one option, but these are not in general valid when the underlying censoring distributions in the comparison groups are unequal. We develop 2 methods for testing and interval estimation, for use with small samples and possibly unequal censoring, based on first imputing survival and censoring times and then applying permutation methods. One provides a heuristic justification for the approach proposed recently by Heinze and others (2003, Exact log-rank tests for unequal follow-up. Biometrics 59, 1151–1157). Simulation studies show that the proposed methods have good Type I error and power properties. For accelerated failure time models, compared to the asymptotic methods of Jin and others (2003, Rank-based inference for the accelerated failure time model. Biometrika 90, 341–353), the proposed methods yield confidence intervals with better coverage probabilities in small-sample settings and similar efficiency when sample sizes are large. The proposed methods are illustrated with data from a cancer study and an AIDS clinical trial.
Accelerated failure time models; Imputation; Log-rank test; Permutation tests
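A sketch of a plain permutation log-rank test on simulated data. As the abstract notes, permuting group labels directly is not valid in general under unequal censoring; the paper's methods first impute survival and censoring times and then permute, which is not reproduced here.

```python
import numpy as np

def logrank_stat(time, event, group):
    """Standardized log-rank statistic (observed minus expected over its SD)."""
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        n, n1 = at_risk.sum(), (at_risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e / np.sqrt(var)

def permutation_pvalue(time, event, group, n_perm=2000, seed=0):
    rng = np.random.default_rng(seed)
    obs = abs(logrank_stat(time, event, group))
    perms = [abs(logrank_stat(time, event, rng.permutation(group)))
             for _ in range(n_perm)]
    return (1 + sum(p >= obs for p in perms)) / (n_perm + 1)

rng = np.random.default_rng(6)
time = np.concatenate([rng.exponential(1.0, 15), rng.exponential(1.8, 15)])
event = rng.binomial(1, 0.8, 30)
group = np.repeat([0, 1], 15)
print("permutation p-value:", permutation_pvalue(time, event, group))
```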
There are many more strategies for early detection of cancer than can be evaluated with randomized trials. Consequently, model-projected outcomes under different strategies can be useful for developing cancer control policy provided that the projections are representative of the population. To project population-representative disease progression outcomes and to demonstrate their value in assessing competing early detection strategies, we implement a model linking prostate-specific antigen (PSA) levels and prostate cancer progression and calibrate it to disease incidence in the US population. PSA growth is linear on the logarithmic scale with a higher slope after disease onset and with random effects on intercepts and slopes; parameters are estimated using data from the Prostate Cancer Prevention Trial. Disease onset, metastatic spread, and clinical detection are governed by hazard functions that depend on age or PSA levels; parameters are estimated by comparing projected incidence under observed screening and biopsy patterns with incidence observed in the Surveillance, Epidemiology, and End Results registries. We demonstrate implications of the model for policy development by projecting early detections, overdiagnoses, and mean lead times for PSA cutoffs 4.0 and 2.5 ng/mL and for screening ages 50–74 or 50–84. The calibrated model validates well, quantifies the tradeoffs involved across policies, and indicates that PSA screening with cutoff 4.0 ng/mL and screening ages 50–74 performs best in terms of overdiagnoses per early detection. The model produces representative outcomes for selected PSA screening policies and is shown to be useful for informing the development of sound cancer control policy.
Decision analysis; Population health; Prostatic neoplasm; Screening
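A small simulation of the PSA growth component described above: log PSA grows linearly in age, with an additional slope after disease onset and subject-level random effects on the intercept and slope. The parameter values below are invented for illustration; in the paper they are estimated from the Prostate Cancer Prevention Trial, and onset, metastatic spread, and detection follow calibrated hazard models.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_log_psa(ages, onset_age, b0=-1.0, b1=0.02, b2=0.10,
                     sd_re=(0.3, 0.01), sd_noise=0.2):
    """log PSA is linear in age with an extra slope after disease onset;
    the intercept and pre-onset slope carry subject-level random effects."""
    u0, u1 = rng.normal(0, sd_re[0]), rng.normal(0, sd_re[1])
    years_since_onset = np.maximum(ages - onset_age, 0.0)
    return ((b0 + u0) + (b1 + u1) * ages + b2 * years_since_onset
            + rng.normal(0, sd_noise, len(ages)))

ages = np.arange(50, 75)
print(np.round(np.exp(simulate_log_psa(ages, onset_age=62.0)), 2))  # PSA in ng/mL
```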
Time series studies of environmental exposures often involve comparing daily changes in a toxicant measured at a point in space with daily changes in an aggregate measure of health. Spatial misalignment of the exposure and response variables can bias the estimation of health risk, and the magnitude of this bias depends on the spatial variation of the exposure of interest. In air pollution epidemiology, there is an increasing focus on estimating the health effects of the chemical components of particulate matter (PM). One issue that is raised by this new focus is the spatial misalignment error introduced by the lack of spatial homogeneity in many of the PM components. Current approaches to estimating short-term health risks via time series modeling do not take into account the spatial properties of the chemical components and therefore could result in biased estimation of those risks. We present a spatial–temporal statistical model for quantifying spatial misalignment error and show how adjusted health risk estimates can be obtained using a regression calibration approach and a 2-stage Bayesian model. We apply our methods to a database containing information on hospital admissions, air pollution, and weather for 20 large urban counties in the United States.
Acute health effects; Cardiovascular disease; Chemical speciation; Measurement error; Particulate matter; Spatial modeling
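To illustrate the attenuation that spatial misalignment can induce, the sketch below applies classical regression calibration: the naive slope from regressing a health outcome on a mismeasured exposure is rescaled by the attenuation factor var(true)/var(measured). Here the true exposure variance is treated as known for simplicity; the paper instead quantifies the misalignment error with a spatial–temporal model and a 2-stage Bayesian approach.

```python
import numpy as np

rng = np.random.default_rng(8)
n_days = 1000
true_exposure = rng.normal(10, 2, n_days)                # county-average PM component
monitor = true_exposure + rng.normal(0, 1.5, n_days)     # single-monitor measurement
deaths = rng.poisson(np.exp(4 + 0.05 * true_exposure))   # true log relative rate 0.05/unit

# Naive slope: regress the log outcome on the mismeasured monitor value.
naive = np.polyfit(monitor, np.log(deaths), 1)[0]

# Regression calibration: rescale by lambda = var(true) / var(measured).
lam = true_exposure.var() / monitor.var()
print(f"naive slope {naive:.4f}, calibrated slope {naive / lam:.4f}, truth 0.05")
```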
Missing data arise in genetic association studies when genotypes are unknown or when haplotypes are of direct interest. We provide a general likelihood-based framework for making inference on genetic effects and gene–environment interactions with such missing data. We allow genetic and environmental variables to be correlated while leaving the distribution of environmental variables completely unspecified. We consider 3 major study designs—cross-sectional, case–control, and cohort designs—and construct appropriate likelihood functions for all common phenotypes (e.g., case–control status, quantitative traits, and potentially censored ages at onset of disease). The likelihood functions involve both finite- and infinite-dimensional parameters. The maximum likelihood estimators are shown to be consistent, asymptotically normal, and asymptotically efficient. Expectation–Maximization (EM) algorithms are developed to implement the corresponding inference procedures. Extensive simulation studies demonstrate that the proposed inferential and numerical methods perform well in practical settings. Illustration with a genome-wide association study of lung cancer is provided.
Association studies; EM algorithm; Genotype; Haplotype; Hardy–Weinberg equilibrium; Maximum likelihood; Semiparametric efficiency; Single nucleotide polymorphisms; Untyped SNPs
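A textbook example of the kind of EM computation involved when haplotypes are of direct interest: estimating 2-SNP haplotype frequencies from unphased genotypes under Hardy–Weinberg equilibrium, where only double heterozygotes have ambiguous phase. This is far simpler than the semiparametric likelihoods of the paper, which also accommodate untyped SNPs, environmental covariates, and censored ages at onset.

```python
import numpy as np
from itertools import product

HAPLOS = list(product([0, 1], repeat=2))     # (allele at SNP1, allele at SNP2)

def compatible_pairs(g1, g2):
    """All unordered haplotype pairs whose allele counts match genotypes (g1, g2)."""
    pairs = []
    for i, hi in enumerate(HAPLOS):
        for j, hj in enumerate(HAPLOS):
            if i <= j and hi[0] + hj[0] == g1 and hi[1] + hj[1] == g2:
                pairs.append((i, j))
    return pairs

def haplotype_em(g1, g2, n_iter=200):
    """EM estimate of 2-SNP haplotype frequencies from unphased genotype
    counts (0/1/2 per SNP), assuming Hardy-Weinberg equilibrium."""
    subjects = [compatible_pairs(a, b) for a, b in zip(g1, g2)]
    h = np.full(4, 0.25)
    for _ in range(n_iter):
        counts = np.zeros(4)
        for pairs in subjects:
            # E-step: posterior probability of each compatible phase given current h.
            w = np.array([(2 - (i == j)) * h[i] * h[j] for i, j in pairs], float)
            w /= w.sum()
            for (i, j), wij in zip(pairs, w):
                counts[i] += wij
                counts[j] += wij
        h = counts / counts.sum()             # M-step: renormalized expected counts
    return h

# Hypothetical unphased genotypes for 6 subjects at 2 SNPs.
g1 = [0, 1, 2, 1, 1, 0]
g2 = [0, 1, 2, 0, 1, 1]
print(np.round(haplotype_em(g1, g2), 3))     # frequencies of haplotypes 00, 01, 10, 11
```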
The nested case–control (NCC) design is a cost-effective sampling method to study the relationship between a disease and its risk factors in epidemiologic studies. NCC data are commonly analyzed using Thomas' partial likelihood approach under Cox's proportional hazards model with constant covariate effects. Here, we are interested in studying the potential time-varying effects of covariates in NCC studies and propose an estimation approach based on a kernel-weighted Thomas' partial likelihood. We establish asymptotic properties of the proposed estimator, propose a numerical approach to construct simultaneous confidence bands for time-varying coefficients, and develop a hypothesis testing procedure to detect time-varying coefficients. The proposed inference procedure is evaluated in simulations and applied to an NCC study of breast cancer in the New York University Women's Health Study.
Kernel estimation; Martingale; Nested case–control study; Proportional hazards model; Risk-set sampling; Time-varying coefficient
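A stripped-down sketch of a kernel-weighted partial likelihood for a single covariate: each case contributes an Epanechnikov-weighted Cox-type term over a sampled risk set, and beta(t0) maximizes the weighted sum. The sampling scheme, kernel, bandwidth, and simulated data are illustrative; the paper develops the estimator, confidence bands, and tests formally for NCC data.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def local_beta(t0, time, event, z, bandwidth, rng, m=5):
    """Kernel-weighted (Epanechnikov) partial-likelihood estimate of a time-varying
    coefficient beta(t0) for one covariate, with m sampled controls per case."""
    terms = []
    for i in np.flatnonzero(event == 1):
        u = (time[i] - t0) / bandwidth
        at_risk = np.flatnonzero(time >= time[i])
        if abs(u) >= 1 or len(at_risk) < 2:
            continue
        controls = rng.choice(at_risk[at_risk != i],
                              size=min(m, len(at_risk) - 1), replace=False)
        terms.append((0.75 * (1 - u ** 2), i, np.append(controls, i)))

    def neg_loglik(beta):
        return -sum(w * (beta * z[i] - np.log(np.exp(beta * z[rset]).sum()))
                    for w, i, rset in terms)

    return minimize_scalar(neg_loglik, bounds=(-5, 5), method="bounded").x

rng = np.random.default_rng(9)
n = 500
z = rng.binomial(1, 0.5, n)
time = rng.exponential(1 / np.exp(0.7 * z))   # constant true log hazard ratio 0.7
event = np.ones(n, int)
for t0 in (0.3, 0.6, 0.9):
    print(f"beta({t0}) ~= {local_beta(t0, time, event, z, 0.4, rng):.2f}")
```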
We propose a formal statistical inference framework for evaluating the penetrance of a rare genetic mutation using family data generated under a kin–cohort type of design, where phenotype and genotype information from first-degree relatives (sibs and/or offspring) of case probands carrying the targeted mutation is collected. Our approach is built upon a likelihood model with some minor assumptions, and it can be used for age-dependent penetrance estimation that permits adjustment for covariates. Furthermore, the derived likelihood allows for unobserved risk factors that are correlated among family members. The validity of the approach is confirmed by simulation studies. We apply the proposed approach to estimating the age-dependent cancer risk among carriers of the MSH2 or MLH1 mutation.
Case–family design; Penetrance; Proportional hazards model; Rare mutation; Unobserved risk factors
Generalized linear mixed models (GLMMs) continue to grow in popularity due to their ability to directly acknowledge multiple levels of dependency and model different data types. For small sample sizes especially, likelihood-based inference can be unreliable, with variance components being particularly difficult to estimate. A Bayesian approach is appealing but has been hampered by the lack of a fast implementation and by the difficulty of specifying prior distributions, with variance components again being particularly problematic. Here, we briefly review previous approaches to computation in Bayesian implementations of GLMMs and illustrate in detail the use of integrated nested Laplace approximations in this context. We consider a number of examples, carefully specifying prior distributions on meaningful quantities in each case. The examples cover a wide range of data types, including those requiring smoothing over time, and a relatively complicated spline model for which we examine our prior specification in terms of the implied degrees of freedom. We conclude that Bayesian inference is now practically feasible for GLMMs and provides an attractive alternative to likelihood-based approaches such as penalized quasi-likelihood. As with likelihood-based approaches, great care is required in the analysis of clustered binary data since approximation strategies may be less accurate for such data.
Integrated nested Laplace approximations; Longitudinal data; Penalized quasi-likelihood; Prior specification; Spline models
With clinical trials under pressure to produce more convincing results faster, we reexamine relative efficiencies for the semiparametric comparison of cause-specific rather than all-cause mortality events, observing that in many settings misclassification of cause of failure is not negligible. By incorporating known misclassification rates, we derive an adapted log-rank test that optimizes power when the alternative treatment effect is confined to the cause-specific hazard. We derive sample size calculations for this test as well as for the corresponding all-cause mortality log-rank test and the naive cause-specific log-rank test, which ignores the misclassification. This may lead to new options at the design stage, which we discuss. We reexamine a recently closed vaccine trial in this light and find the sample size needed for the new test to be 32% smaller than for the equivalent all-cause analysis, leading to a reduction of 41 224 participants.
Cause-specific analysis; Clinical trials; Competing risks; Misclassification; Sample size; Survival analysis; Verbal autopsy
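To see why misclassification matters at the design stage, the sketch below combines the Schoenfeld approximation for the number of events required by a log-rank test with the hazard-ratio dilution suffered by a naive analysis of deaths classified as the cause of interest, given assumed sensitivity and specificity of cause assignment. The hazards and misclassification rates are hypothetical, and the adapted test derived in the paper, which recovers power by incorporating the misclassification rates directly, is not shown.

```python
import numpy as np
from scipy.stats import norm

def events_needed(log_hr, alpha=0.05, power=0.9, alloc=0.5):
    """Schoenfeld approximation: events required by a two-arm log-rank test."""
    za, zb = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return (za + zb) ** 2 / (alloc * (1 - alloc) * log_hr ** 2)

def diluted_hr(hr_cause1, lam1, lam2, sens, spec):
    """Hazard ratio seen by a naive analysis of deaths *classified* as cause 1
    when treatment acts only on cause 1: the classified-cause-1 hazard is
    sens*lam1 + (1-spec)*lam2 in controls and sens*hr*lam1 + (1-spec)*lam2 in treated."""
    num = sens * hr_cause1 * lam1 + (1 - spec) * lam2
    den = sens * lam1 + (1 - spec) * lam2
    return num / den

hr, lam1, lam2 = 0.7, 0.01, 0.02        # hypothetical cause-specific hazards per year
print("events, perfect classification:", round(events_needed(np.log(hr))))
print("events, naive with sens=0.8, spec=0.9:",
      round(events_needed(np.log(diluted_hr(hr, lam1, lam2, 0.8, 0.9)))))
```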
With the analysis of complex, messy data sets, the statistics community has recently focused attention on “reproducible research,” namely research that can be readily replicated by others. One standard that has been proposed is the availability of data sets and computer code. However, in some situations, raw data cannot be disseminated for reasons of confidentiality or because the data are so messy as to make dissemination impractical. For one such situation, we propose 2 steps for reproducible research: (i) presentation of a table of data and (ii) presentation of a formula to estimate key quantities from the table of data. We illustrate this strategy in the analysis of data from the Prostate Cancer Prevention Trial, which investigated the effect of the drug finasteride versus placebo on the period prevalence of prostate cancer. With such an important result at stake, a transparent analysis was essential.
Categorical data; Maximum likelihood; Missing data; Multinomial–Poisson transformation; Propensity-to-be-missing score; Randomized trials