We propose a class of estimation techniques for scalar-on-function regression where both outcomes and functional predictors may be observed at multiple visits. Our methods are motivated by a longitudinal brain diffusion tensor imaging tractography study. One of the study’s primary goals is to evaluate the contemporaneous association between human function and brain imaging over time. The complexity of the study requires the development of methods that can simultaneously incorporate: (1) multiple functional (and scalar) regressors; (2) longitudinal outcome and predictor measurements per patient; (3) Gaussian or non-Gaussian outcomes; and (4) missing values within functional predictors. We propose two versions of a new method, longitudinal functional principal components regression (PCR). These methods extend the well-known functional PCR and allow for different effects of subject-specific trends in curves and of visit-specific deviations from that trend. The new methods are compared with existing approaches, and the most promising techniques are used for analyzing the tractography data.
doi:10.1093/biostatistics/kxs051
PMCID: PMC3677735
PMID: 23292804
Diffusion tensor imaging; Functional principal components; Functional regression; Longitudinal functional principal components regression; Multiple sclerosis; Repeated measurements
Independent component analysis (ICA) is a widely used technique for blind source separation, used heavily in several scientific research areas including acoustics, electrophysiology, and functional neuroimaging. We propose a scalable two-stage iterative true group ICA methodology for analyzing population level functional magnetic resonance imaging (fMRI) data where the number of subjects is very large. The method is based on likelihood estimators of the underlying source densities and the mixing matrix. As opposed to many commonly used group ICA algorithms, the proposed method does not require significant data reduction by a 2-fold singular value decomposition. In addition, the method can be applied to a large group of subjects since the memory requirements are not restrictive. The performance of our approach is compared with a commonly used group ICA algorithm via simulation studies. Furthermore, the proposed method is applied to a large collection of resting state fMRI datasets. The results show that established brain networks are well recovered by the proposed algorithm.
doi:10.1093/biostatistics/kxs055
PMCID: PMC3677736
PMID: 23314416
Functional MRI; Signal processing; Source separation
Random effects models are commonly used to analyze longitudinal categorical data. Marginalized random effects models are a class of models that permit direct estimation of marginal mean parameters and characterize serial correlation for longitudinal categorical data via random effects (Heagerty, 1999). Marginally specified logistic-normal models for longitudinal binary data. Biometrics
55, 688–698; Lee and Daniels, 2008. Marginalized models for longitudinal ordinal data with application to quality of life studies. Statistics in Medicine
27, 4359–4380). In this paper, we propose a Kronecker product (KP) covariance structure to capture the correlation between processes at a given time and the correlation within a process over time (serial correlation) for bivariate longitudinal ordinal data. For the latter, we consider a more general class of models than standard (first-order) autoregressive correlation models, by re-parameterizing the correlation matrix using partial autocorrelations (Daniels and Pourahmadi, 2009). Modeling covariance matrices via partial autocorrelations. Journal of Multivariate Analysis
100, 2352–2363). We assess the reasonableness of the KP structure with a score test. A maximum marginal likelihood estimation method is proposed utilizing a quasi-Newton algorithm with quasi-Monte Carlo integration of the random effects. We examine the effects of demographic factors on metabolic syndrome and C-reactive protein using the proposed models.
doi:10.1093/biostatistics/kxs058
PMCID: PMC3677737
PMID: 23365416
Kronecker product; Metabolic syndrome; Partial autocorrelation
Assumptions regarding the true underlying genetic model, or mode of inheritance, are necessary when quantifying genetic associations with disease phenotypes. Here we propose new methods to ascertain the underlying genetic model from parental data in family-based association studies. Specifically, for parental mating-type data, we propose a novel statistic to test whether the underlying genetic model is additive, dominant, or recessive; for parental genotype–phenotype data, we propose three strategies to determine the true mode of inheritance. We illustrate how to incorporate the information gleaned from these strategies into family-based association tests. Because family-based association tests are conducted conditional on parental genotypes, the type I error rate of these procedures is not inflated by the information learned from parental data. This result holds even if such information is weak or when the assumption of Hardy–Weinberg equilibrium is violated. Our simulations demonstrate that incorporating parental data into family-based association tests can improve power under common inheritance models. The application of our proposed methods to a candidate-gene study of type 1 diabetes successfully detects a recessive effect in MGAT5 that would otherwise be missed by conventional family-based association tests.
doi:10.1093/biostatistics/kxs048
PMCID: PMC3732025
PMID: 23266418
Case–parents; Dominant; Mode of inheritance; Nuclear families; Recessive; Robust
Cost-effectiveness analysis (CEA) is an important component of the economic evaluation of new treatment options. In many clinical and observational studies of costs, censored data pose challenges to the CEA. We consider a special situation where the terminating events for the survival time and costs are different. Traditional methods for statistical inference offer no means for dealing with censored data in these circumstances. To address this gap, we propose a new method for deriving the confidence interval for the incremental cost-effectiveness ratio. The simulation studies and real data example show that our method performs very well for some practical settings, revealing a great potential for application to actual settings in which terminating events for the survival time and costs differ.
doi:10.1093/biostatistics/kxt002
PMCID: PMC3732026
PMID: 23426525
Censored data; Cost-effectiveness analysis; Different terminating events; Fieller method; Survival analysis
doi:10.1093/biostatistics/kxu022
PMCID: PMC4059467
PMID: 24936601
The tumor-node-metastasis staging system has been the lynchpin of cancer diagnosis, treatment, and prognosis for many years. For meaningful clinical use, an orderly grouping of the T and N categories into a staging system needs to be defined, usually with respect to a time-to-event outcome. This can be reframed as a model selection problem with respect to features arranged on a partially ordered two-way grid, and a penalized regression method is proposed for selecting the optimal grouping. Instead of penalizing the L1-norm of the coefficients like lasso, in order to enforce the stage grouping, we place L1 constraints on the differences between neighboring coefficients. The underlying mechanism is the sparsity-enforcing property of the L1 penalty, which forces some estimated coefficients to be the same and hence leads to stage grouping. Partial ordering constraints is also required as both the T and N categories are ordinal. A series of optimal groupings with different numbers of stages can be obtained by varying the tuning parameter, which gives a tree-like structure offering a visual aid on how the groupings are progressively made. We hence call the proposed method the lasso tree. We illustrate the utility of our method by applying it to the staging of colorectal cancer using survival outcomes. Simulation studies are carried out to examine the finite sample performance of the selection procedure. We demonstrate that the lasso tree is able to give the right grouping with moderate sample size, is stable with regard to changes in the data, and is not affected by random censoring.
doi:10.1093/biostatistics/kxs044
PMCID: PMC3590926
PMID: 23221681
Cancer staging; Cox model; Lasso; Lasso tree; Model selection
Group testing is widely used to reduce the cost of screening individuals for infectious diseases. There is an extensive literature on group testing, most of which traditionally has focused on estimating the probability of infection in a homogeneous population. More recently, this research area has shifted towards estimating individual-specific probabilities in a regression context. However, existing regression approaches have assumed that the sensitivity and specificity of pooled biospecimens are constant and do not depend on the pool sizes. For those applications, where this assumption may not be realistic, these existing approaches can lead to inaccurate inference, especially when pool sizes are large. Our new approach, which exploits the information readily available from underlying continuous biomarker distributions, provides reliable inference in settings where pooling would be most beneficial and does so even for larger pool sizes. We illustrate our methodology using hepatitis B data from a study involving Irish prisoners.
doi:10.1093/biostatistics/kxs045
PMCID: PMC3590921
PMID: 23197382
Binary response; Biomarker; Maximum likelihood; Pooled testing; Sensitivity; Specificity
With advancement in genomic technologies, it is common that two high-dimensional datasets are available, both measuring the same underlying biological phenomenon with different techniques. We consider predicting a continuous outcome Y using X, a set of p markers which is the best available measure of the underlying biological process. This same biological process may also be measured by W, coming from a prior technology but correlated with X. On a moderately sized sample, we have (Y,X,W), and on a larger sample we have (Y,W). We utilize the data on W to boost the prediction of Y by X. When p is large and the subsample containing X is small, this is a p>n situation. When p is small, this is akin to the classical measurement error problem; however, ours is not the typical goal of calibrating W for use in future studies. We propose to shrink the regression coefficients β of Y on X toward different targets that use information derived from W in the larger dataset. We compare these proposals with the classical ridge regression of Y on X, which does not use W. We also unify all of these methods as targeted ridge estimators. Finally, we propose a hybrid estimator which is a linear combination of multiple estimators of β. With an optimal choice of weights, the hybrid estimator balances efficiency and robustness in a data-adaptive way to theoretically yield a smaller prediction error than any of its constituents. The methods, including a fully Bayesian alternative, are evaluated via simulation studies. We also apply them to a gene-expression dataset. mRNA expression measured via quantitative real-time polymerase chain reaction is used to predict survival time in lung cancer patients, with auxiliary information from microarray technology available on a larger sample.
doi:10.1093/biostatistics/kxs036
PMCID: PMC3590922
PMID: 23087411
Cross-validation; Generalized ridge; Mean squared prediction error; Measurement error
Motivated by studying the association between nutrient intake and human gut microbiome composition, we developed a method for structure-constrained sparse canonical correlation analysis (ssCCA) in a high-dimensional setting. ssCCA takes into account the phylogenetic relationships among bacteria, which provides important prior knowledge on evolutionary relationships among bacterial taxa. Our ssCCA formulation utilizes a phylogenetic structure-constrained penalty function to impose certain smoothness on the linear coefficients according to the phylogenetic relationships among the taxa. An efficient coordinate descent algorithm is developed for optimization. A human gut microbiome data set is used to illustrate this method. Both simulations and real data applications show that ssCCA performs better than the standard sparse CCA in identifying meaningful variables when there are structures in the data.
doi:10.1093/biostatistics/kxs038
PMCID: PMC3590923
PMID: 23074263
Dimension reduction; Graph; Phylogenetic tree; Regularization; Variable selection
Individual patient-data meta-analysis of randomized controlled trials is the gold standard for investigating how patient factors modify the effectiveness of treatment. Because participant data from primary studies might not be available, reliable alternatives using published data are needed. In this paper, I show that the maximum likelihood estimates of a participant-level linear random effects meta-analysis with a patient covariate-treatment interaction can be determined exactly from aggregate data when the model's variance components are known. I provide an equivalent aggregate-data EM algorithm and supporting software with the R package ipdmeta for the estimation of the “interaction meta-analysis” when the variance components are unknown. The properties of the methodology are assessed with simulation studies. The usefulness of the methods is illustrated with analyses of the effect modification of cholesterol and age on pravastatin in the multicenter placebo-controlled regression growth evaluation statin study. When a participant-level meta-analysis cannot be performed, aggregate-data interaction meta-analysis is a useful alternative for exploring individual-level sources of treatment effect heterogeneity.
doi:10.1093/biostatistics/kxs035
PMCID: PMC3590924
PMID: 23001065
Clinical trials; Meta-analysis; Random effects models; Statistical methods in Health service research
In epidemiological and medical studies, covariate misclassification may occur when the observed categorical variables are not perfect measurements for an unobserved categorical latent predictor. It is well known that covariate measurement error in Cox regression may lead to biased estimation. Misclassification in covariates will cause bias, and adjustment for misclassification will be challenging when the gold standard variables are not available. In general, statistical modeling for misclassification is very different from that of the measurement error. In this paper, we investigate an approximate induced hazard estimator and propose an expected estimating equation estimator via an expectation–maximization algorithm to accommodate covariate misclassification when multiple surrogate variables are available. Finite sample performance is examined via simulation studies. The proposed method and other methods are applied to a human immunodeficiency virus clinical trial in which a few behavior variables from questionnaires are used as surrogates for a latent behavior variable.
doi:10.1093/biostatistics/kxs046
PMCID: PMC3590925
PMID: 23178735
EM algorithm; Estimating equation; Measurement error; Misclassification; Surrogate covariate
In genome-wide association studies, penalization is an important approach for identifying genetic markers associated with disease. Motivated by the fact that there exists natural grouping structure in single nucleotide polymorphisms and, more importantly, such groups are correlated, we propose a new penalization method for group variable selection which can properly accommodate the correlation between adjacent groups. This method is based on a combination of the group Lasso penalty and a quadratic penalty on the difference of regression coefficients of adjacent groups. The new method is referred to as smoothed group Lasso (SGL). It encourages group sparsity and smoothes regression coefficients for adjacent groups. Canonical correlations are applied to the weights between groups in the quadratic difference penalty. We first derive a GCD algorithm for computing the solution path with linear regression model. The SGL method is further extended to logistic regression for binary response. With the assistance of the majorize–minimization algorithm, the SGL penalized logistic regression turns out to be an iteratively penalized least-square problem. We also suggest conducting principal component analysis to reduce the dimensionality within groups. Simulation studies are used to evaluate the finite sample performance. Comparison with group Lasso shows that SGL is more effective in selecting true positives. Two datasets are analyzed using the SGL method.
doi:10.1093/biostatistics/kxs034
PMCID: PMC3590928
PMID: 22988281
Group selection; Regularization; SNP; Smoothing
We recently proposed two novel criteria to assess the usefulness of risk prediction models for public health applications. The proportion of cases followed, PCF(p), is the proportion of individuals who will develop disease who are included in the proportion p of individuals in the population at highest risk. The proportion needed to follow-up, PNF(q), is the proportion of the general population at highest risk that one needs to follow in order that a proportion q of those destined to become cases will be followed (Pfeiffer, R.M. and Gail, M.H., 2011. Two criteria for evaluating risk prediction models. Biometrics 67, 1057–1065). Here, we extend these criteria in two ways. First, we introduce two new criteria by integrating PCF and PNF over a range of values of q or p to obtain iPCF, the integrated PCF, and iPNF, the integrated PNF. A key assumption in the previous work was that the risk model is well calibrated. This assumption also underlies novel estimates of iPCF and iPNF based on observed risks in a population alone. The second extension is to propose and study estimates of PCF, PNF, iPCF, and iPNF that are consistent even if the risk models are not well calibrated. These new estimates are obtained from case–control data when the outcome prevalence in the population is known, and from cohort data, with baseline covariates and observed health outcomes. We study the efficiency of the various estimates and propose and compare tests for comparing two risk models, both of which were evaluated in the same validation data.
doi:10.1093/biostatistics/kxs037
PMCID: PMC3695651
PMID: 23087412
Area under the receiver operator characteristics curve (ROC); AUC; Discrimination; Discriminatory accuracy; Risk models; Study design
Survival data can contain an unknown fraction of subjects who are “cured” in the sense of not being at risk of failure. We describe such data with cure-mixture models, which separately model cure status and the hazard of failure among non-cured subjects. No diagnostic currently exists for evaluating the fit of such models; the popular Schoenfeld residual (Schoenfeld, 1982. Partial residuals for the proportional hazards regression-model. Biometrika
69, 239–241) is not applicable to data with cures. In this article, we propose a pseudo-residual, modeled on Schoenfeld's, to assess the fit of the survival regression in the non-cured fraction. Unlike Schoenfeld's approach, which tests the validity of the proportional hazards (PH) assumption, our method uses the full hazard and is thus also applicable to non-PH models. We derive the asymptotic distribution of the residuals and evaluate their performance by simulation in a range of parametric models. We apply our approach to data from a smoking cessation drug trial.
doi:10.1093/biostatistics/kxs043
PMCID: PMC3695652
PMID: 23197383
Accelerated failure time; Long-term survivors; Proportional hazards; Residual analysis
In this paper, we extend the definitions of the net reclassification improvement (NRI) and the integrated discrimination improvement (IDI) in the context of multicategory classification. Both measures were proposed in Pencina and others (2008. Evaluating the added predictive ability of a new marker: from area under the receiver operating characteristic (ROC) curve to reclassification and beyond. Statistics in Medicine
27, 157–172) as numeric characterizations of accuracy improvement for binary diagnostic tests and were shown to have certain advantage over analyses based on ROC curves or other regression approaches. Estimation and inference procedures for the multiclass NRI and IDI are provided in this paper along with necessary asymptotic distributional results. Simulations are conducted to study the finite-sample properties of the proposed estimators. Two medical examples are considered to illustrate our methodology.
doi:10.1093/biostatistics/kxs047
PMCID: PMC3695653
PMID: 23197381
Area under the ROC curve; Integrated discrimination improvement; Multicategory classification; Multinomial logistic regression; Net reclassification improvement
The concordance probability is a widely used measure to assess discrimination of prognostic models with binary and survival endpoints. We formally define the concordance probability for a prognostic model of the absolute risk of an event of interest in the presence of competing risks and relate it to recently proposed time-dependent area under the receiver operating characteristic curve measures. For right-censored data, we investigate inverse probability of censoring weighted (IPCW) estimates of a truncated concordance index based on a working model for the censoring distribution. We demonstrate consistency and asymptotic normality of the IPCW estimate if the working model is correctly specified and derive an explicit formula for the asymptotic variance under independent censoring. The small sample properties of the estimator are assessed in a simulation study also against misspecification of the working model. We further illustrate the methods by computing the concordance probability for a prognostic model of coronary heart disease (CHD) events in the presence of the competing risk of non-CHD death.
doi:10.1093/biostatistics/kxt059
PMCID: PMC4059461
PMID: 24493091
C index; Competing risks; Concordance probability; Coronary heart disease; Prognostic models; Time-dependent AUC
RNA-sequencing (RNA-seq) is a flexible technology for measuring genome-wide expression that is rapidly replacing microarrays as costs become comparable. Current differential expression analysis methods for RNA-seq data fall into two broad classes: (1) methods that quantify expression within the boundaries of genes previously published in databases and (2) methods that attempt to reconstruct full length RNA transcripts. The first class cannot discover differential expression outside of previously known genes. While the second approach does possess discovery capabilities, statistical analysis of differential expression is complicated by the ambiguity and variability incurred while assembling transcripts and estimating their abundances. Here, we propose a novel method that first identifies differentially expressed regions (DERs) of interest by assessing differential expression at each base of the genome. The method then segments the genome into regions comprised of bases showing similar differential expression signal, and then assigns a measure of statistical significance to each region. Optionally, DERs can be annotated using a reference database of genomic features. We compare our approach with leading competitors from both current classes of differential expression methods and highlight the strengths and weaknesses of each. A software implementation of our method is available on github (https://github.com/alyssafrazee/derfinder).
doi:10.1093/biostatistics/kxt053
PMCID: PMC4059460
PMID: 24398039
Bioinformatics; Differential expression; False discovery rate; Genomics; RNA sequencing
A major biomedical goal associated with evaluating a candidate biomarker or developing a predictive model score for event-time outcomes is to accurately distinguish between incident cases from the controls surviving beyond t throughout the entire study period. Extensions of standard binary classification measures like time-dependent sensitivity, specificity, and receiver operating characteristic (ROC) curves have been developed in this context (Heagerty, P. J., and others, 2000. Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics
56, 337–344). We propose a direct, non-parametric method to estimate the time-dependent Area under the curve (AUC) which we refer to as the weighted mean rank (WMR) estimator. The proposed estimator performs well relative to the semi-parametric AUC curve estimator of Heagerty and Zheng (2005. Survival model predictive accuracy and ROC curves. Biometrics
61, 92–105). We establish the asymptotic properties of the proposed estimator and show that the accuracy of markers can be compared very simply using the difference in the WMR statistics. Estimators of pointwise standard errors are provided.
doi:10.1093/biostatistics/kxs021
PMCID: PMC3520498
PMID: 22734044
AUC curve; Survival analysis; Time-dependent ROC
For time-to-event data with finitely many competing risks, the proportional hazards model has been a popular tool for relating the cause-specific outcomes to covariates (Prentice and others, 1978. The analysis of failure time in the presence of competing risks. Biometrics 34, 541–554). Inspired by previous research in HIV vaccine efficacy trials, the cause of failure is replaced by a continuous mark observed only in subjects who fail. This article studies an extension of this approach to allow a multivariate continuum of competing risks, to better account for the fact that the candidate HIV vaccines tested in efficacy trials have contained multiple HIV sequences, with a purpose to elicit multiple types of immune response that recognize and block different types of HIV viruses. We develop inference for the proportional hazards model in which the regression parameters depend parametrically on the marks, to avoid the curse of dimensionality, and the baseline hazard depends nonparametrically on both time and marks. Goodness-of-fit tests are constructed based on generalized weighted martingale residuals. The finite-sample performance of the proposed methods is examined through extensive simulations. The methods are applied to a vaccine efficacy trial to examine whether and how certain antigens represented inside the vaccine are relevant for protection or anti-protection against the exposing HIVs.
doi:10.1093/biostatistics/kxs022
PMCID: PMC3520499
PMID: 22764174
Competing risks; Failure time data; Goodness-of-fit test; HIV vaccine trial; Hypothesis testing; Mark-specific relative risk; Multivariate data; Partial likelihood estimation; Semiparametric model; STEP trial
In the case-cohort studies conducted within the Atherosclerosis Risk in Communities (ARIC) study, it is of interest to assess and compare the effect of high-sensitivity C-reactive protein (hs-CRP) on the increased risks of incident coronary heart disease and incident ischemic stroke. Empirical cumulative hazards functions for different levels of hs-CRP reveal an additive structure for the risks for each disease outcome. Additionally, we are interested in estimating the difference in the risk for the different hs-CRP groups. Motivated by this, we consider fitting marginal additive hazards regression models for case-cohort studies with multiple disease outcomes. We consider a weighted estimating equations approach for the estimation of model parameters. The asymptotic properties of the proposed estimators are derived and their finite-sample properties are assessed via simulation studies. The proposed method is applied to analyze the ARIC Study.
doi:10.1093/biostatistics/kxs025
PMCID: PMC3520500
PMID: 22826550
Additive hazards model; ARIC study; Case-cohort study; Multivariate failure times; Weighted estimating equations
Classifying patients into different risk groups based on their genomic measurements can help clinicians design appropriate clinical treatment plans. To produce such a classification, gene expression data were collected on a cohort of burn patients, who were monitored across multiple time points. This led us to develop a new classification method using time-course gene expressions. Our results showed that making good use of time-course information of gene expression improved the performance of classification compared with using gene expression from individual time points only. Our method is implemented into an R-package: time-course prediction analysis using microarray.
doi:10.1093/biostatistics/kxs027
PMCID: PMC3520502
PMID: 22926914
Classification; Gene expression; Longitudinal; Time-course
The receiver operating characteristic (ROC) curve is often used to evaluate the performance of a biomarker measured on continuous scale to predict the disease status or a clinical condition. Motivated by the need for novel study designs with better estimation efficiency and reduced study cost, we consider a biased sampling scheme that consists of a SRC and a supplemental TDC. Using this approach, investigators can oversample or undersample subjects falling into certain regions of the biomarker measure, yielding improved precision for the estimation of the ROC curve with a fixed sample size. Test-result-dependent sampling will introduce bias in estimating the predictive accuracy of the biomarker if standard ROC estimation methods are used. In this article, we discuss three approaches for analyzing data of a test-result-dependent structure with a special focus on the empirical likelihood method. We establish asymptotic properties of the empirical likelihood estimators for covariate-specific ROC curves and covariate-independent ROC curves and give their corresponding variance estimators. Simulation studies show that the empirical likelihood method yields good properties and is more efficient than alternative methods. Recommendations on number of regions, cutoff points, and subject allocation is made based on the simulation results. The proposed methods are illustrated with a data example based on an ongoing lung cancer clinical trial.
doi:10.1093/biostatistics/kxs020
PMCID: PMC3577107
PMID: 22723502
Binormal model; Covariate-independent ROC curve; Covariate-specific ROC curve; Empirical likelihood method; Test-result-dependent sampling
Many prognostic models for cancer use biomarkers that have utility in early detection. For example, in prostate cancer, models predicting disease-specific survival use serum prostate-specific antigen levels. These models typically show that higher marker levels are associated with poorer prognosis. Consequently, they are often interpreted as indicating that detecting disease at a lower threshold of the biomarker is likely to generate a survival benefit. However, lowering the threshold of the biomarker is tantamount to early detection. For survival benefit to not be simply an artifact of starting the survival clock earlier, we must account for the lead time of early detection. It is not known whether the existing prognostic models imply a survival benefit under early detection once lead time has been accounted for. In this article, we investigate survival benefit implied by prognostic models where the predictor(s) of disease-specific survival are age and/or biomarker level at disease detection. We show that the benefit depends on the rate of biomarker change, the lead time, and the biomarker level at the original date of diagnosis as well as on the parameters of the prognostic model. Even if the prognostic model indicates that lowering the threshold of the biomarker is associated with longer disease-specific survival, this does not necessarily imply that early detection will confer an extension of life expectancy.
doi:10.1093/biostatistics/kxs018
PMCID: PMC3577108
PMID: 22730510
Disease-specific survival; Early Detection; Proportional hazards model
Immunological experiments that record primary molecular sequences of T-cell receptors produce moderate to high-dimensional categorical data, some of which may be subject to extra-multinomial variation caused by technical constraints of cell-based assays. Motivated by such experiments in melanoma research, we develop a statistical procedure for testing the equality of two discrete populations, where one population delivers multinomial data and the other is subject to a specific form of overdispersion. The procedure computes a conditional-predictive p-value by splitting the data set into two, obtaining a predictive distribution for one piece given the other, and using the observed predictive ordinate to generate a p-value. The procedure has a simple interpretation, requires fewer modeling assumptions than would be required of a fully Bayesian analysis, and has reasonable operating characteristics as evidenced empirically and by asymptotic analysis.
doi:10.1093/biostatistics/kxt039
PMCID: PMC3862212
PMID: 24096387
Bayesian p-value; Dirichlet multinomial; Double overdispersion; Fisher's exact test; HPRT assay; Mass culture experiments; Molecular sequence data; T-cell receptor