In most analyses of large-scale genomic data sets, differential expression is assessed by testing for differences in the mean of the distributions between 2 groups. Tomlins and others (2005) recently reported a different pattern of differential expression, in which a fraction of samples in one group show overexpression relative to samples in the other group. In this work, we describe a general mixture model framework for the assessment of this type of expression, called outlier profile analysis. We start by considering the single-gene situation and establishing results on identifiability. We propose 2 nonparametric estimation procedures that have natural links to familiar multiple testing procedures. We then develop multivariate extensions of this methodology to handle genome-wide measurements. The proposed methodologies are compared using simulation studies as well as data from a prostate cancer gene expression study.
doi:10.1093/biostatistics/kxn015
PMCID: PMC2605210
PMID: 18539648
Bonferroni correction; DNA microarray; False discovery rate; Goodness of fit; Multiple comparisons; Uniform distribution
Copy number variants (CNVs) constitute an important class of genetic variants in the human genome and have been shown to be associated with complex diseases. Whole-genome sequencing provides an unbiased way of identifying all the CNVs that an individual carries. In this paper, we consider parametric modeling of the read depth (RD) data from whole-genome sequencing with the aim of identifying the CNVs, including both Poisson and negative-binomial modeling of such count data. We propose a unified approach of using a mean-matching variance stabilizing transformation to turn the relatively complicated problem of sparse segment identification for count data into a sparse segment identification problem for a sequence of Gaussian data. We apply the optimal sparse segment identification procedure to the transformed data in order to identify the CNV segments. This provides a computationally efficient approach for RD-based CNV identification. Simulation results show that this approach often results in a small number of false identifications of the CNVs and has similar or better performance in identifying the true CNVs when compared with other RD-based approaches. We demonstrate the methods using the trio data from the 1000 Genomes Project.
doi:10.1093/biostatistics/kxt060
PMCID: PMC4059462
PMID: 24478395
Natural exponential family; Sparse segment identification; Variance stabilization
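The transformation step described in the abstract above can be illustrated with the classical Anscombe transform for Poisson counts; this is a sketch only (the simulated rates, segment sizes, and the choice of the Anscombe form are assumptions for illustration, and the paper's mean-matching transformation and optimal sparse-segment scan are not reproduced here):

```python
import numpy as np

def anscombe_poisson(x):
    # Classical Anscombe transform: if X ~ Poisson(lambda), then
    # 2*sqrt(X + 3/8) has variance close to 1 regardless of lambda.
    return 2.0 * np.sqrt(np.asarray(x, dtype=float) + 3.0 / 8.0)

# Simulated read depth: background rate 30, one duplicated segment at rate 60.
rng = np.random.default_rng(0)
background = rng.poisson(lam=30, size=2000)
duplication = rng.poisson(lam=60, size=200)

z_bg = anscombe_poisson(background)
z_dup = anscombe_poisson(duplication)

# After stabilization both pieces have roughly unit variance, so a Gaussian
# sparse-segment scan can be applied directly to the transformed sequence,
# with the CNV segment still visible as a shift in mean.
print(np.var(z_bg).round(2), np.var(z_dup).round(2))
```

The point of the variance stabilization is visible in the printed variances: both hover near 1 even though the raw Poisson variances differ by a factor of two.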
Mediation analysis serves to quantify the effect of an exposure on an outcome mediated by a certain intermediate and to quantify the extent to which the effect is direct. When the mediator is misclassified, the validity of mediation analysis can be severely undermined. The contribution of the present work is to study the effects of non-differential misclassification of a binary mediator on the estimation of direct and indirect causal effects when the outcome is either continuous or binary and exposure–mediator interaction may be present, and to allow correction of the misclassification. A hybrid likelihood-based and predictive value weighting method for misclassification correction, coupled with sensitivity analysis, is proposed, and a second approach using the expectation–maximization algorithm is developed. The correction strategy requires knowledge of a plausible range of sensitivity and specificity parameters. The approaches are applied to a perinatal epidemiological study of the determinants of pre-term birth.
doi:10.1093/biostatistics/kxu007
PMCID: PMC4059465
PMID: 24671909
EM algorithm; Iteratively re-weighted least squares; Mediation analysis; Misclassification; Predictive value weighting; Pre-eclampsia; Pre-term birth; Sensitivity analysis
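A minimal illustration of the kind of correction the abstract above builds on: the standard matrix-method formula recovers a true prevalence from an observed (misclassified) one given assumed sensitivity and specificity. This is only the simplest ingredient, not the paper's likelihood-based or predictive value weighting procedure; the numbers are hypothetical.

```python
def corrected_prevalence(p_obs, sensitivity, specificity):
    # Non-differential misclassification of a binary variable:
    #   p_obs = sens * p_true + (1 - spec) * (1 - p_true),
    # which inverts to the expression below (valid when sens + spec > 1).
    return (p_obs - (1.0 - specificity)) / (sensitivity + specificity - 1.0)

# If the true prevalence is 0.30, then sens = 0.90 and spec = 0.80 yield an
# observed prevalence of 0.9*0.3 + 0.2*0.7 = 0.41; the formula inverts this.
print(corrected_prevalence(0.41, 0.90, 0.80))  # -> 0.3 (up to float rounding)
```

Ranging sensitivity and specificity over a plausible grid, as the abstract suggests, turns this one-line correction into a sensitivity analysis.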
It is widely recognized that the three-dimensional (3D) architecture of eukaryotic chromatin plays an important role in processes such as gene regulation and cancer-driving gene fusions. Observing or inferring this 3D structure at even modest resolutions had been problematic, since genomes are highly condensed and traditional assays are coarse. However, recently devised high-throughput molecular techniques have changed this situation. Notably, the development of a suite of chromatin conformation capture (CCC) assays has enabled elicitation of contacts—spatially close chromosomal loci—which have provided insights into chromatin architecture. Most analysis of CCC data has focused on the contact level, with less effort directed toward obtaining 3D reconstructions and evaluating the accuracy and reproducibility thereof. While questions of accuracy must be addressed experimentally, questions of reproducibility can be addressed statistically—the purpose of this paper. We use a constrained optimization technique to reconstruct chromatin configurations for a number of closely related yeast datasets and assess reproducibility using four metrics that measure the distance between 3D configurations. The first of these, Procrustes fitting, measures configuration closeness after applying reflection, rotation, translation, and scaling-based alignment of the structures. The others base comparisons on the within-configuration inter-point distance matrix. Inferential results for these metrics rely on suitable permutation approaches. Results indicate that distance matrix-based approaches are preferable to Procrustes analysis, not because of the metrics per se but rather on account of the ability to customize permutation schemes to handle within-chromosome contiguity. It has recently been emphasized that the use of constrained optimization approaches to 3D architecture reconstruction is prone to being trapped in local minima. Our methods of reproducibility assessment provide a means for comparing 3D reconstruction solutions so that we can discern between local and global optima by contrasting solutions under perturbed inputs.
doi:10.1093/biostatistics/kxu003
PMCID: PMC4059464
PMID: 24519450
Chromatin conformation; Distance matrix; Genome architecture; Procrustes analysis
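The first metric in the abstract above can be sketched directly: Procrustes fitting aligns one configuration to another by translation, scaling, and rotation/reflection, and the leftover disparity measures how different the shapes are. The toy configurations below are assumptions for illustration; the paper's permutation inference is not shown.

```python
import numpy as np

def procrustes_disparity(X, Y):
    # Center both configurations, scale each to unit Frobenius norm, then
    # find the best rotation/reflection via SVD. The disparity
    # 1 - (sum of singular values)^2 is the residual after alignment.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Xc = Xc / np.linalg.norm(Xc)
    Yc = Yc / np.linalg.norm(Yc)
    s = np.linalg.svd(Yc.T @ Xc, compute_uv=False)
    return 1.0 - s.sum() ** 2

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                  # a toy 3D configuration of 20 loci
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
Y = 2.5 * X @ R + np.array([1.0, -2.0, 0.5])  # same shape, different pose

print(procrustes_disparity(X, Y))             # ~0: identical up to similarity
```

Two reconstructions of the same architecture should give a disparity near zero, while unrelated configurations give a large one; the paper's permutation schemes calibrate what "large" means under within-chromosome contiguity.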
When there is evidence of long-term survivors, cure models are often used to model the survival curve. A cure model is a mixture model consisting of a cured fraction and an uncured fraction. Traditional cure models assume that the cured or uncured status of the censored observations cannot be distinguished. But in many practical settings, diagnostic procedures may provide partial information about the cured or uncured status at a certain sensitivity and specificity. The traditional cure model does not take advantage of this additional information. Motivated by a clinical study on bone injury in pediatric patients, we propose a novel extension of the traditional Cox proportional hazards (PH) cure model that incorporates the additional information about the cured status. This extension can be applied when the latency part of the cure model is modeled by the Cox PH model. Extensive simulations demonstrated that the proposed extension provides more efficient and less biased estimates, with the gains in efficiency and bias growing as the sensitivity and specificity of the diagnostic procedures increase. When the proposed extended Cox PH cure model was applied to the motivating example, there was a substantial improvement in the estimation.
doi:10.1093/biostatistics/kxu002
PMCID: PMC4059463
PMID: 24511081
Cure model; Expectation-maximization (EM) algorithm; Proportional hazards; Relative efficiency; Sensitivity and specificity
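The mixture structure described above can be written down in a few lines: overall survival is a cured fraction plus the uncured fraction times a latency survival curve. The exponential latency below is an assumption for illustration only; the paper uses a Cox PH latency model fit by EM, which is not shown.

```python
import math

def mixture_cure_survival(t, cured_frac, hazard):
    # S(t) = pi + (1 - pi) * S_u(t): the survival curve plateaus at the
    # cured fraction pi instead of dropping to zero.
    # Toy exponential latency S_u(t) = exp(-hazard * t) for illustration.
    return cured_frac + (1.0 - cured_frac) * math.exp(-hazard * t)

# Long-term survivors: the curve starts at 1 and levels off near the
# 30% cured fraction.
print(mixture_cure_survival(0.0, 0.3, 0.1), mixture_cure_survival(50.0, 0.3, 0.1))
```

The plateau is what distinguishes a cure model from an ordinary survival model, and the diagnostic information in the paper sharpens the estimate of where that plateau sits.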
In event history studies concerning recurrent events, two types of data have been extensively discussed. One is recurrent-event data (Cook and Lawless, 2007. The Analysis of Recurrent Event Data. New York: Springer), and the other is panel-count data (Zhao and others, 2010. Nonparametric inference based on panel-count data. Test 20, 1–42). In the former case, all study subjects are monitored continuously; thus, complete information is available for the underlying recurrent-event processes of interest. In the latter case, study subjects are monitored periodically; thus, only incomplete information is available for the processes of interest. In reality, however, a third type of data could occur in which some study subjects are monitored continuously, but others are monitored periodically. When this occurs, we have mixed recurrent-event and panel-count data. This paper discusses regression analysis of such mixed data and presents two estimation procedures for the problem. One is a maximum likelihood estimation procedure, and the other is an estimating equation procedure. The asymptotic properties of both resulting estimators of regression parameters are established. Also, the methods are applied to a set of mixed recurrent-event and panel-count data that arose from a Childhood Cancer Survivor Study and motivated this investigation.
doi:10.1093/biostatistics/kxu009
PMCID: PMC4059466
PMID: 24648408
Estimating equation-based approach; Maximum likelihood approach; Regression analysis
In cancer studies the disease natural history process is often observed only at a fixed, random point of diagnosis (a survival time), leading to a current status observation (Sun (2006). The statistical analysis of interval-censored failure time data. Berlin: Springer.) representing a surrogate (a mark) (Jacobsen (2006). Point process theory and applications: marked point and piecewise deterministic processes. Basel: Birkhauser.) attached to the observed survival time. Examples include time to recurrence and stage (local vs. metastatic). We study a simple model that provides insights into the relationship between the observed marked endpoint and the latent disease natural history leading to it. A semiparametric regression model is developed to assess the covariate effects on the observed marked endpoint explained by a latent disease process. The proposed semiparametric regression model can be represented as a transformation model in terms of mark-specific hazards, induced by a process-based mixed effect. Large-sample properties of the proposed estimators are established. The methodology is illustrated by Monte Carlo simulation studies and an application to a randomized clinical trial of adjuvant therapy for breast cancer.
doi:10.1093/biostatistics/kxt056
PMCID: PMC4102917
PMID: 24379192
Disease natural history; Marked endpoints; Semiparametric regression
Distributed lag (DL) models relate lagged covariates to a response and are popular statistical models used in a wide variety of disciplines to analyze exposure–response data. However, classical DL models do not account for possible interactions between lagged predictors. In the presence of interactions between lagged covariates, the total effect of a change on the response is not merely a sum of lagged effects as is typically assumed. This article proposes a new class of models, called high-degree DL models, that extend basic DL models to incorporate hypothesized interactions between lagged predictors. The modeling strategy utilizes Gaussian processes to counterbalance predictor collinearity and as a dimension reduction tool. To choose the degree and maximum lags used within the models, a computationally manageable model comparison method is proposed based on maximum a posteriori estimators. The models and methods are illustrated via simulation and an application to investigating the effect of heat exposure on mortality in Los Angeles and New York.
doi:10.1093/biostatistics/kxt031
PMCID: PMC3944968
PMID: 23990524
Dimension reduction; Gaussian process; Heat exposure; Lagged interaction; NMMAPS dataset
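A degree-1 (classical) DL model of the kind the abstract above extends is just a regression on lagged copies of the exposure; the high-degree extension adds interactions among these columns. A minimal sketch of the lagged design matrix with simulated data (the true lag effects here are assumptions for illustration, and the Gaussian-process machinery is not shown):

```python
import numpy as np

def lag_matrix(x, max_lag):
    # Column l holds x lagged by l steps, aligned so that row t corresponds
    # to time max_lag + t; a linear fit on these columns is a basic DL model.
    n = len(x)
    return np.column_stack([x[max_lag - l : n - l] for l in range(max_lag + 1)])

rng = np.random.default_rng(3)
x = rng.normal(size=200)                # exposure series (e.g. daily heat)
y = 2.0 * x[2:] + 3.0 * x[1:-1]         # true lag-0 and lag-1 effects, no lag-2
beta, *_ = np.linalg.lstsq(lag_matrix(x, 2), y, rcond=None)
print(beta.round(6))                    # coefficients recover (2, 3, 0)
```

With interactions, the "total effect" of a sustained exposure change is no longer just the sum of these coefficients, which is the phenomenon the high-degree models target.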
Survival analysis endures as an old, yet active research field with applications that spread across many domains. Continuing improvements in data acquisition techniques pose constant challenges in applying existing survival analysis methods to these emerging data sets. In this paper, we present tools for fitting regularized Cox survival analysis models on high-dimensional, massive sample-size (HDMSS) data using a variant of the cyclic coordinate descent optimization technique tailored for the sparsity that HDMSS data often present. Experiments on two real data examples demonstrate that efficient analyses of HDMSS data using these tools result in improved predictive performance and calibration.
doi:10.1093/biostatistics/kxt043
PMCID: PMC3944969
PMID: 24096388
Big data; Cox proportional hazards; Regularized regression; Survival analysis
Receiver operating characteristic (ROC) curves are widely used to measure the discriminating power of medical tests and other classification procedures. In many practical applications, the performance of these procedures can depend on covariates such as age, naturally leading to a collection of curves associated with different covariate levels. This paper develops a Bayesian heteroscedastic semiparametric regression model and applies it to the estimation of covariate-dependent ROC curves. More specifically, our approach uses Gaussian process priors to model the conditional mean and conditional variance of the biomarker of interest for each of the populations under study. The model is illustrated through an application to the evaluation of prostate-specific antigen for the diagnosis of prostate cancer, in which we contrast the performance of our model against alternative models.
doi:10.1093/biostatistics/kxt044
PMCID: PMC3944970
PMID: 24174579
Bayesian inference; Gaussian process; Non-parametric regression; Receiver operating characteristic curve
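At any fixed covariate level, the quantity being modeled in the abstract above is an ordinary ROC curve, and its area has the familiar probabilistic reading, sketched here empirically with hypothetical marker values (the paper's Gaussian-process regression machinery is not reproduced):

```python
import numpy as np

def empirical_auc(nondiseased, diseased):
    # AUC = P(diseased marker > non-diseased marker), with ties counted 1/2;
    # this equals the area under the empirical ROC curve.
    h = np.asarray(nondiseased, dtype=float)[:, None]
    d = np.asarray(diseased, dtype=float)[None, :]
    return (d > h).mean() + 0.5 * (d == h).mean()

print(empirical_auc([1, 2, 3], [4, 5, 6]))   # perfect separation -> 1.0
print(empirical_auc([1, 2], [1, 2]))         # identical samples -> 0.5
```

Covariate-dependent ROC modeling amounts to letting the two marker distributions in this calculation shift and spread with covariates such as age, which is what the heteroscedastic regression provides.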
A classical approach to combining independent test statistics is Fisher's combination of p-values, whose statistic follows the χ² distribution. When the test statistics are dependent, the gamma distribution (GD) is commonly used for Fisher's combination test (FCT). We propose to use two generalizations of the GD: the generalized and the exponentiated GDs. We study some properties of misusing the GD for the FCT to combine dependent statistics when one of the two proposed distributions is true. Our results show that both generalizations have better control of type I error rates than the GD, which tends to have inflated type I error rates at more extreme tails. In practice, common model selection criteria (e.g. Akaike information criterion/Bayesian information criterion) can be used to help select a better distribution to use for the FCT. A simple strategy for applying the two generalizations of the GD in genome-wide association studies is discussed. Applications of the results to genetic pleiotropic associations are described, where multiple traits are tested for association with a single marker.
doi:10.1093/biostatistics/kxt045
PMCID: PMC3944971
PMID: 24174580
Dependent tests; Fisher's combination; Gamma distributions; Genetic pleiotropic associations; Genome-wide association studies; Type I error
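Under independence, the reference distribution in the abstract above is exact and easy to state: T = -2 Σ log p_i follows a χ² distribution with 2k degrees of freedom. A sketch with hypothetical p-values (the gamma generalizations studied in the paper are fitted distributions for the dependent case and are not shown):

```python
import math

def fisher_combination_test(pvals):
    # T = -2 * sum(log p_i) ~ chi^2 with 2k df for k independent p-values.
    # For even df the chi^2 survival function has the closed Erlang form
    #   P(T > t) = exp(-t/2) * sum_{j < k} (t/2)^j / j!
    t = -2.0 * sum(math.log(p) for p in pvals)
    half = t / 2.0
    k = len(pvals)
    p_comb = math.exp(-half) * sum(half**j / math.factorial(j) for j in range(k))
    return t, p_comb

t, p = fisher_combination_test([0.12, 0.45, 0.80, 0.33])
print(round(t, 3), round(p, 3))   # -> 8.501 0.386
```

When the p-values are dependent, this χ² reference is no longer valid in the tails, which is exactly where the gamma-family approximations of the paper come in.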
We consider inference for longitudinal data based on mixed-effects models with a non-parametric Bayesian prior on the treatment effect. The proposed non-parametric Bayesian prior is a random partition model with a regression on patient-specific covariates. The main feature and motivation for the proposed model is the use of covariates with a mix of different data formats and possibly high-order interactions in the regression. The regression is not explicitly parameterized. It is implied by the random clustering of subjects. The motivating application is a study of the effect of an anticancer drug on a patient's blood pressure. The study involves blood pressure measurements taken periodically over several 24-h periods for 54 patients. The 24-h periods for each patient include a pretreatment period and several occasions after the start of therapy.
doi:10.1093/biostatistics/kxt049
PMCID: PMC3944972
PMID: 24285773
Clustering; Mixed-effects model; Non-parametric Bayesian model; Random partition; Repeated measurement data
For designing, monitoring, and analyzing a longitudinal study with an event time as the outcome variable, the restricted mean event time (RMET) is an easily interpretable, clinically meaningful summary of the survival function in the presence of censoring. The RMET is the average of all potential event times measured up to a time point τ and can be estimated consistently by the area under the Kaplan–Meier curve over [0, τ]. In this paper, we study a class of regression models, which directly relates the RMET to its “baseline” covariates for predicting the future subjects’ RMETs. Since the standard Cox and the accelerated failure time models can also be used for estimating such RMETs, we utilize a cross-validation procedure to select the “best” among all the working models considered in the model building and evaluation process. Lastly, we draw inferences for the predicted RMETs to assess the performance of the final selected model using an independent data set or a “hold-out” sample from the original data set. All the proposals are illustrated with data from an HIV clinical trial conducted by the AIDS Clinical Trials Group and the primary biliary cirrhosis study conducted by the Mayo Clinic.
doi:10.1093/biostatistics/kxt050
PMCID: PMC3944973
PMID: 24292992
Accelerated failure time model; Cox model; Cross-validation; Hold-out sample; Personalized medicine; Perturbation-resampling method
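The RMET estimator described in the abstract above, the area under the Kaplan–Meier curve on [0, τ], can be sketched in a few lines (hypothetical data; ties are handled one observation at a time, and the paper's regression and cross-validated model selection are not shown):

```python
def rmet_km(times, events, tau):
    # Restricted mean event time: area under the Kaplan-Meier step
    # function on [0, tau]. events[i] = 1 for an observed event,
    # 0 for a censored follow-up time.
    surv, prev_t, area, at_risk = 1.0, 0.0, 0.0, len(times)
    for t, d in sorted(zip(times, events)):
        if t > tau:
            break
        area += surv * (t - prev_t)
        prev_t = t
        if d:
            surv *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return area + surv * (tau - prev_t)

# With no censoring and tau beyond the last event, the RMET is the sample mean.
print(rmet_km([1, 2, 3, 4], [1, 1, 1, 1], tau=4.0))   # -> 2.5
```

Unlike the overall mean, this quantity stays estimable under censoring as long as follow-up reaches τ, which is what makes it useful for design and monitoring.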
Principal surrogate (PS) endpoints are relatively inexpensive and easy to measure study outcomes that can be used to reliably predict treatment effects on clinical endpoints of interest. Few statistical methods for assessing the validity of potential PSs utilize time-to-event clinical endpoint information, and to our knowledge none allow for the characterization of time-varying treatment effects. We introduce the time-dependent and surrogate-dependent treatment efficacy curve, TE(t|s), and a new augmented trial design for assessing the quality of a biomarker as a PS. We propose a novel Weibull model and an estimated maximum likelihood method for estimation of the TE(t|s) curve. We describe the operating characteristics of our methods via simulations. We analyze data from the Diabetes Control and Complications Trial, in which we find evidence of a biomarker with value as a PS.
doi:10.1093/biostatistics/kxt055
PMCID: PMC3944974
PMID: 24337534
Case–control study; Causal inference; Clinical trials; Principal stratification; Survival analysis; Treatment efficacy curve; Weibull model
In clinical trials, a surrogate outcome variable (S) can be measured before the outcome of interest (T) and may provide early information regarding the treatment (Z) effect on T. Using the principal surrogacy framework introduced by Frangakis and Rubin (2002. Principal stratification in causal inference. Biometrics 58, 21–29), we consider an approach that has a causal interpretation and develop a Bayesian estimation strategy for surrogate validation when the joint distribution of potential surrogate and outcome measures is multivariate normal. From the joint conditional distribution of the potential outcomes of T, given the potential outcomes of S, we propose surrogacy validation measures from this model. As the model is not fully identifiable from the data, we propose some reasonable prior distributions and assumptions that can be placed on weakly identified parameters to aid in estimation. We explore the relationship between our surrogacy measures and the surrogacy measures proposed by Prentice (1989. Surrogate endpoints in clinical trials: definition and operational criteria. Statistics in Medicine 8, 431–440). The method is applied to data from a macular degeneration study and an ovarian cancer study.
doi:10.1093/biostatistics/kxt051
PMCID: PMC4023321
PMID: 24285772
Bayesian estimation; Principal stratification; Surrogate endpoints
Cause-specific proportional hazards models are commonly used for analyzing competing risks data in clinical studies. Motivated by the objective to assess differential vaccine protection against distinct pathogen types in randomized preventive vaccine efficacy trials, we present an alternative case-only method to standard maximum partial likelihood estimation that applies to a rare failure event, e.g. acquisition of HIV infection. A logistic regression model is fit to the counts of cause-specific events (infecting pathogen type) within study arms, with an offset adjusting for the randomization ratio. This formulation of cause-specific hazard ratio estimation permits immediate incorporation of host-genetic factors to be assessed as effect modifiers, an important area of vaccine research for identifying immune correlates of protection, thus inheriting the estimation efficiency and cost benefits of the case-only estimator commonly used for assessing gene–treatment interactions. The method is used to reassess HIV genotype-specific vaccine efficacy in the RV144 trial, providing nearly identical results to standard Cox methods, and to assess if and how this vaccine efficacy depends on Fc-γ receptor genes.
doi:10.1093/biostatistics/kxt018
PMCID: PMC3862206
PMID: 23813283
Gene–treatment interaction; Sieve analysis; Vaccine efficacy
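For a rare endpoint, the case-only idea in the abstract above reduces to simple arithmetic: among cases of a given pathogen type, the log odds of falling in the vaccine arm, offset by the log randomization ratio, estimates the cause-specific log hazard ratio. The counts below are hypothetical, and this two-arm, single-type formula is only the simplest special case of the paper's offset logistic regression:

```python
import math

def case_only_log_hr(cases_trt, cases_ctl, alloc_ratio=1.0):
    # Among cases of one pathogen type, treatment-arm membership is roughly
    # Bernoulli with odds (hazard ratio) * (allocation ratio); subtracting
    # the log allocation ratio leaves the cause-specific log hazard ratio.
    return math.log(cases_trt / cases_ctl) - math.log(alloc_ratio)

# 30 type-A infections in the vaccine arm vs 50 in placebo, 1:1 randomization:
hr = math.exp(case_only_log_hr(30, 50))
print(round(hr, 3), "VE =", round(1 - hr, 3))   # HR ~0.6, vaccine efficacy ~40%
```

Because only cases enter the calculation, genotyping costs scale with the number of infections rather than the trial size, which is the cost advantage the abstract refers to.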
Blood and tissue are composed of many functionally distinct cell subsets. In immunological studies, these can be measured accurately only using single-cell assays. The characterization of these small cell subsets is crucial to decipher system-level biological changes. For this reason, an increasing number of studies rely on assays that provide single-cell measurements of multiple genes and proteins from bulk cell samples. A common problem in the analysis of such data is to identify biomarkers (or combinations of biomarkers) that are differentially expressed between two biological conditions (e.g. before/after stimulation), where expression is defined as the proportion of cells expressing that biomarker (or biomarker combination) in the cell subset(s) of interest. Here, we present a Bayesian hierarchical framework based on a beta-binomial mixture model for testing for differential biomarker expression using single-cell assays. Our model allows the inference to be subject specific, as is typically required when assessing vaccine responses, while borrowing strength across subjects through common prior distributions. We propose two approaches for parameter estimation: an empirical-Bayes approach using an Expectation–Maximization algorithm and a fully Bayesian one based on a Markov chain Monte Carlo algorithm. We compare our method against classical approaches for single-cell assays including Fisher’s exact test, a likelihood ratio test, and basic log-fold changes. Using several experimental assays measuring proteins or genes at single-cell level and simulations, we show that our method has higher sensitivity and specificity than alternative methods. Additional simulations show that our framework is also robust to model misspecification. Finally, we demonstrate how our approach can be extended to testing multivariate differential expression across multiple biomarker combinations using a Dirichlet-multinomial model and illustrate this approach using single-cell gene expression data and simulations.
doi:10.1093/biostatistics/kxt024
PMCID: PMC3862207
PMID: 23887981
Bayesian modeling; Expectation–Maximization; Flow cytometry; Hierarchical modeling; Immunology; Marginal likelihood; Markov Chain Monte Carlo; MIMOSA; Single-cell gene expression
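The building block of the framework described above is the beta-binomial: a cell count that is binomial given a subject-specific proportion, with the proportion itself beta-distributed, giving extra-binomial variation across subjects. A log-pmf sketch (the hierarchical mixture, EM, and MCMC layers of the paper are not reproduced; the parameter values are arbitrary):

```python
import math

def betabinom_logpmf(k, n, a, b):
    # K ~ BetaBinomial(n, a, b): log of C(n, k) * B(k + a, n - k + b) / B(a, b),
    # written entirely with log-gamma for numerical stability.
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

# Probabilities over k = 0..n sum to one, as for any valid pmf.
total = sum(math.exp(betabinom_logpmf(k, 10, 2.0, 5.0)) for k in range(11))
print(round(total, 6))   # -> 1.0
```

In the paper's setting, k would be the number of cells positive for a biomarker out of n cells measured, and the beta layer is what lets inference borrow strength across subjects.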
Analyzing the failure times of multiple events is of interest in many fields. Estimating the joint distribution of the failure times in a non-parametric way is not straightforward because some failure times are often right-censored and only known to be greater than observed follow-up times. Although it has been studied, there is no universally optimal solution for this problem. It is still challenging and important to provide alternatives that may be more suitable than existing ones in specific settings. Problems with the existing methods are not limited to infeasible computations; they also include the lack of optimality and possible non-monotonicity of the estimated survival function. In this paper, we propose a non-parametric Bayesian approach for directly estimating the density function of multivariate survival times, where the prior is constructed based on the optional Pólya tree. We investigate several theoretical aspects of the procedure and derive an efficient iterative algorithm for implementing the Bayesian procedure. The empirical performance of the method is examined via extensive simulation studies. Finally, we present a detailed analysis using the proposed method of the relationship among organ recovery times in severely injured patients. From the analysis, we identify medically interesting findings that can be further pursued in clinics.
doi:10.1093/biostatistics/kxt025
PMCID: PMC3862208
PMID: 23902636
Multivariate survival analysis; Non-parametric Bayesian; Optional Pólya tree
Empirical Bayes methods have been extensively used for microarray data analysis by modeling the large number of unknown parameters as random effects. Empirical Bayes allows borrowing information across genes and can automatically adjust for multiple testing and selection bias. However, the standard empirical Bayes model can perform poorly if the assumed working prior deviates from the true prior. This paper proposes a new rank-conditioned inference in which the shrinkage and confidence intervals are based on the distribution of the error conditioned on rank of the data. Our approach is in contrast to a Bayesian posterior, which conditions on the data themselves. The new method is almost as efficient as standard Bayesian methods when the working prior is close to the true prior, and it is much more robust when the working prior is not close. In addition, it allows a more accurate (but also more complex) non-parametric estimate of the prior to be easily incorporated, resulting in improved inference. The new method’s prior robustness is demonstrated via simulation experiments. Application to a breast cancer gene expression microarray dataset is presented. Our R package rank.Shrinkage provides a ready-to-use implementation of the proposed methodology.
doi:10.1093/biostatistics/kxt026
PMCID: PMC3862209
PMID: 23934072
Bayesian shrinkage; Confidence intervals; Ranking bias; Robust multiple estimation
Censored quantile regression provides a useful alternative to the Cox proportional hazards model for analyzing survival data. It directly models the conditional quantile of the survival time and hence is easy to interpret. Moreover, it relaxes the proportionality constraint on the hazard function associated with the popular Cox model and is natural for modeling heterogeneity of the data. Recently, Wang and Wang (2009. Locally weighted censored quantile regression. Journal of the American Statistical Association 103, 1117–1128) proposed a locally weighted censored quantile regression approach that allows for covariate-dependent censoring and is less restrictive than other censored quantile regression methods. However, their kernel smoothing-based weighting scheme requires all covariates to be continuous and encounters practical difficulty with even a moderate number of covariates. We propose a new weighting approach that uses recursive partitioning, e.g. survival trees, that offers greater flexibility in handling covariate-dependent censoring in moderately high dimensions and can incorporate both continuous and discrete covariates. We prove that this new weighting scheme leads to consistent estimation of the quantile regression coefficients and demonstrate its effectiveness via Monte Carlo simulations. We also illustrate the new method using a widely recognized data set from a clinical trial on primary biliary cirrhosis.
doi:10.1093/biostatistics/kxt027
PMCID: PMC3862210
PMID: 23975800
Censored quantile regression; Recursive partitioning; Survival analysis; Survival ensembles
doi:10.1093/biostatistics/kxt037
PMCID: PMC3862211
PMID: 24068252
Modern case–control studies typically involve the collection of data on a large number of outcomes, often at considerable logistical and monetary expense. These data are of potentially great value to subsequent researchers, who, although not necessarily concerned with the disease that defined the case series in the original study, may want to use the available information for a regression analysis involving a secondary outcome. Because cases and controls are selected with unequal probability, regression analysis involving a secondary outcome generally must acknowledge the sampling design. In this paper, the author presents a new framework for the analysis of secondary outcomes in case–control studies. The approach is based on a careful re-parameterization of the conditional model for the secondary outcome given the case–control outcome and regression covariates, in terms of (a) the population regression of interest of the secondary outcome given covariates and (b) the population regression of the case–control outcome on covariates. The error distribution for the secondary outcome given covariates and case–control status is otherwise unrestricted. For a continuous outcome, the approach sometimes reduces to extending model (a) by including a residual of (b) as a covariate. However, the framework is general in the sense that models (a) and (b) can take any functional form, and the methodology allows for an identity, log or logit link function for model (a).
doi:10.1093/biostatistics/kxt041
PMCID: PMC3983430
PMID: 24152770
Case–control studies; Generalized linear models; Statistical genetics; Secondary outcomes
We introduce an explicit set of metrics for human activity based on high-density acceleration recordings from a hip-worn tri-axial accelerometer. These metrics are based on two concepts: (i) Time Active, a measure of the length of time when activity is distinguishable from rest and (ii) AI (activity intensity), a measure of the amplitude of activity relative to rest. All measurements are normalized (have the same interpretation across subjects and days), easy to explain and implement, and reproducible across platforms and software implementations. Metrics were validated by visual inspection of results, quantitative in-lab replication studies, and an association study with health outcomes.
doi:10.1093/biostatistics/kxt029
PMCID: PMC4072911
PMID: 23999141
Activity intensity; Movelets; Movement; Signal processing; Time active; Tri-axial accelerometer
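The two concepts in the abstract above can be caricatured in a few lines: call a window "active" when its variability is distinguishable from rest, and measure its amplitude relative to rest. Everything below (the threshold, the window length, the simulated signal) is an assumption for illustration; the paper's exact definitions and normalizations are not reproduced.

```python
import numpy as np

def activity_metrics(acc, rest_sd, window=10):
    # Toy versions of the two concepts. Time Active: fraction of windows whose
    # variability exceeds a rest threshold. AI: average variability of the
    # active windows relative to rest.
    mag = np.linalg.norm(acc, axis=1)                 # tri-axial -> magnitude
    n = len(mag) // window
    window_sd = mag[: n * window].reshape(n, window).std(axis=1)
    active = window_sd > 3.0 * rest_sd                # hypothetical threshold
    time_active = active.mean()
    ai = window_sd[active].mean() / rest_sd if active.any() else 0.0
    return time_active, ai

rng = np.random.default_rng(4)
rest = rng.normal(0.0, 0.01, size=(200, 3))    # sensor noise at rest
moving = rng.normal(0.0, 1.0, size=(200, 3))   # vigorous movement
acc = np.vstack([rest, moving])

time_active, ai = activity_metrics(acc, rest_sd=0.01)
print(time_active, round(ai, 1))
```

Because both outputs are expressed relative to each subject's own rest variability, they carry the same interpretation across subjects and days, which is the normalization property the abstract emphasizes.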
Estimation of the period length of time-course data from cyclical biological processes, such as those driven by the circadian pacemaker, is crucial for inferring the properties of the biological clock found in many living organisms. We propose a methodology for period estimation based on spectrum resampling (SR) techniques. Simulation studies show that SR outperforms a currently used routine based on Fourier approximations and is more robust to non-sinusoidal and noisy cycles. In addition, a simple fit to the oscillations using linear least squares is available, together with a non-parametric test for detecting changes in period length which allows for period estimates with different variances, as frequently encountered in practice. The proposed methods are motivated by and applied to various data examples from chronobiology.
doi:10.1093/biostatistics/kxt020
PMCID: PMC3988453
PMID: 23743206
Circadian rhythms; Non-parametric testing; Period estimation; Resampling; Spectrum
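The starting point that spectrum-based methods like the one above refine is the raw periodogram peak: estimate the period as the reciprocal of the frequency carrying the most spectral power. A sketch on simulated circadian data (the resampling layer of the paper is not shown):

```python
import numpy as np

def dominant_period(y, dt=1.0):
    # Periodogram peak: the period estimate is the reciprocal of the
    # frequency with maximal spectral power (zero frequency excluded).
    y = np.asarray(y, dtype=float) - np.mean(y)
    power = np.abs(np.fft.rfft(y)) ** 2
    freqs = np.fft.rfftfreq(len(y), d=dt)
    k = 1 + np.argmax(power[1:])
    return 1.0 / freqs[k]

# Ten days of hourly observations with a noisy ~24-h rhythm.
rng = np.random.default_rng(2)
t = np.arange(240.0)
y = np.sin(2 * np.pi * t / 24.0) + 0.3 * rng.normal(size=t.size)
print(dominant_period(y))   # close to the true 24-h period
```

The grid of candidate frequencies is coarse for short series, and the estimate degrades for non-sinusoidal cycles, which are the limitations that motivate resampling the spectrum.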