Latent class models provide a useful framework for clustering observations based on several features. Application of latent class methodology to correlated, high-dimensional ordinal data poses many challenges. Unconstrained analyses may not result in an estimable model. Thus, information contained in ordinal variables may not be fully exploited by researchers. We develop a penalized latent class model to facilitate analysis of high-dimensional ordinal data. By stabilizing maximum likelihood estimation, we are able to fit an ordinal latent class model that would otherwise not be identifiable without application of strict constraints. We illustrate our methodology in a study of schwannoma, a peripheral nerve sheath tumor, that included three clinical subtypes and 23 ordinal histological measures.
doi:10.1093/biostatistics/kxm026
PMCID: PMC4878392
PMID: 17626225
SUMMARY
We consider the problem of estimating sparse graphs by a lasso penalty applied to the inverse covariance matrix. Using a coordinate descent procedure for the lasso, we develop a simple algorithm—the graphical lasso—that is remarkably fast: It solves a 1000-node problem (~500 000 parameters) in at most a minute and is 30–4000 times faster than competing methods. It also provides a conceptual link between the exact problem and the approximation suggested by Meinshausen and Bühlmann (2006). We illustrate the method on some cell-signaling data from proteomics.
doi:10.1093/biostatistics/kxm045
PMCID: PMC3019769
PMID: 18079126
Gaussian covariance; Graphical model; L1; Lasso
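As a point of reference for the abstract above, the penalized problem it describes can be sketched as follows. This is the usual way an L1-penalized Gaussian log-likelihood is written; the symbols are assumptions introduced here rather than notation from the abstract: $S$ is the empirical covariance matrix, $\Theta$ the inverse covariance matrix being estimated, and $\rho \ge 0$ the penalty weight.

```latex
\hat{\Theta}
  = \arg\max_{\Theta \succ 0}
    \left\{ \log\det\Theta - \operatorname{tr}(S\Theta) - \rho \,\lVert \Theta \rVert_1 \right\},
\qquad
\lVert \Theta \rVert_1 = \sum_{j,k} \lvert \Theta_{jk} \rvert
```

Under a Gaussian model, a zero entry $\hat{\Theta}_{jk}$ corresponds to a missing edge between nodes $j$ and $k$, so larger values of $\rho$ tend to give sparser graphs. Some formulations leave the diagonal of $\Theta$ unpenalized.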
Summary
Positive and negative affect data are often collected over time in psychiatric care settings, yet no generally accepted means are available to relate these data to useful diagnoses or treatments. Latent class analysis attempts data reduction by classifying subjects into one of K unobserved classes based on observed data. Latent class models have recently been extended to accommodate longitudinally observed data. We extend these approaches in a Bayesian framework to accommodate trajectories of both continuous and discrete data. We consider whether latent class models might be used to distinguish patients on the basis of trajectories of observed affect scores, reported events, and presence or absence of clinical depression.
doi:10.1093/biostatistics/kxh022
PMCID: PMC2827342
PMID: 15618532
Cardiovascular disease; Depression; DIC; General growth mixture modeling; Gibbs sampling; Label switching; Model choice
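The classification step mentioned in the abstract above can be illustrated with Bayes' rule. This is a generic sketch, not the specific Bayesian trajectory model of the paper; the symbols are assumptions introduced here: $C_i$ is the unobserved class of subject $i$, $\pi_k$ the prior probability of class $k$, and $f_k$ the class-specific density of the observed data $y_i$.

```latex
\Pr(C_i = k \mid y_i)
  = \frac{\pi_k \, f_k(y_i)}{\sum_{j=1}^{K} \pi_j \, f_j(y_i)},
\qquad k = 1, \dots, K
```

A subject is typically assigned to the class with the largest posterior probability, although the full set of probabilities conveys how uncertain that assignment is.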
In most analyses of large-scale genomic data sets, differential expression is typically assessed by testing for differences in the mean of the distributions between 2 groups. Tomlins and others (2005) recently identified a different pattern of differential expression, in which a fraction of samples in one group are overexpressed relative to samples in the other group. In this work, we describe a general mixture model framework for the assessment of this type of expression, called outlier profile analysis. We start by considering the single-gene situation and establishing results on identifiability. We propose 2 nonparametric estimation procedures that have natural links to familiar multiple testing procedures. We then develop multivariate extensions of this methodology to handle genome-wide measurements. The proposed methodologies are compared using simulation studies as well as data from a prostate cancer gene expression study.
doi:10.1093/biostatistics/kxn015
PMCID: PMC2605210
PMID: 18539648
Bonferroni correction; DNA microarray; False discovery rate; Goodness of fit; Multiple comparisons; Uniform distribution
Summary
Monitoring health care quality involves combining continuous and discrete outcomes measured on subjects across health care units over time. This article describes a Bayesian approach to jointly modeling multilevel multidimensional continuous and discrete outcomes with serial dependence. The overall goal is to characterize trajectories of traits of each unit. Underlying normal regression models for each outcome are used and dependence among different outcomes is induced through latent variables. Serial dependence is accommodated through modeling the pairwise correlations of the latent variables. Methods are illustrated to assess trends in quality of health care units using continuous and discrete outcomes from a sample of adult veterans discharged from 1 of 22 Veterans Integrated Service Networks with a psychiatric diagnosis between 1993 and 1998.
doi:10.1093/biostatistics/kxi036
PMCID: PMC2791405
PMID: 15917373
Bayesian hierarchical model; Correlation matrix; Informative priors; Latent variable; Mental health
SUMMARY
Using validation sets for outcomes can greatly improve the estimation of vaccine efficacy (VE) in the field (Halloran and Longini, 2001; Halloran and others, 2003). Most statistical methods for using validation sets rely on the assumption that outcomes on those with no cultures are missing at random (MAR). However, often the validation sets will not be chosen at random. For example, confirmational cultures are often done on people with influenza-like illness as part of routine influenza surveillance. VE estimates based on such non-MAR validation sets could be biased. Here we propose frequentist and Bayesian approaches for estimating VE in the presence of validation bias. Our work builds on the ideas of Rotnitzky and others (1998, 2001), Scharfstein and others (1999, 2003), and Robins and others (2000). Our methods require expert opinion about the nature of the validation selection bias. In a re-analysis of an influenza vaccine study, we found, using the beliefs of a flu expert, that within any plausible range of selection bias the VE estimate based on the validation sets is much higher than the point estimate using just the non-specific case definition. Our approach is generally applicable to studies with missing binary outcomes with categorical covariates.
doi:10.1093/biostatistics/kxj031
PMCID: PMC2766283
PMID: 16556610
Bayesian; Expert opinion; Identifiability; Influenza; Missing data; Selection model; Vaccine efficacy
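For readers unfamiliar with the quantity being estimated in the abstract above, the following minimal sketch computes vaccine efficacy as one minus the ratio of attack rates. It assumes the standard definition VE = 1 - (attack rate in vaccinated) / (attack rate in unvaccinated); the function name and the counts are hypothetical, and the sketch does not implement the validation-set or selection-bias methods of the paper.

```python
def vaccine_efficacy(cases_vax, n_vax, cases_unvax, n_unvax):
    """Return VE = 1 - relative risk, using simple attack rates.

    All arguments are counts; the names are illustrative assumptions.
    """
    if n_vax <= 0 or n_unvax <= 0:
        raise ValueError("group sizes must be positive")
    if cases_unvax == 0:
        raise ValueError("attack rate in the unvaccinated group is zero")
    attack_rate_vax = cases_vax / n_vax
    attack_rate_unvax = cases_unvax / n_unvax
    return 1.0 - attack_rate_vax / attack_rate_unvax


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
ve = vaccine_efficacy(cases_vax=10, n_vax=1000, cases_unvax=40, n_unvax=1000)
print(f"VE = {ve:.2f}")
```

With a non-specific case definition such as influenza-like illness, some counted cases are not true influenza, which is one reason the abstract compares estimates with and without culture confirmation.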
Summary
In randomized studies with missing outcomes, non-identifiable assumptions are required to hold for valid data analysis. As a result, statisticians have been advocating the use of sensitivity analysis to evaluate the effect of varying assumptions on study conclusions. While this approach may be useful in assessing the sensitivity of treatment comparisons to missing data assumptions, it may be dissatisfying to some researchers/decision makers because a single summary is not provided. In this paper, we present a fully Bayesian methodology that allows the investigator to draw a ‘single’ conclusion by formally incorporating prior beliefs about non-identifiable, yet interpretable, selection bias parameters. Our Bayesian model provides robustness to prior specification of the distributional form of the continuous outcomes.
doi:10.1093/biostatistics/4.4.495
PMCID: PMC2748253
PMID: 14557107
Dirichlet process prior; Identifiability; MCMC; Non-parametric Bayes; Selection model; Sensitivity analysis
This article focuses on parameter estimation of multilevel nonlinear mixed effects models (MNLMEMs). These models are used to analyze data presenting multiple hierarchical levels of grouping (cluster data, clinical trials with several observation periods,…). The variability of the individual parameters of the regression function is thus decomposed as a between-subject variability and higher levels of variability (for example within-subject variability). We propose maximum likelihood estimates of parameters of those MNLMEMs with two levels of random effects, using an extension of the SAEM-MCMC algorithm. The extended SAEM algorithm is split into an explicit direct EM algorithm and a stochastic EM part. Compared to the original algorithm, additional sufficient statistics have to be approximated by relying on the conditional distribution of the second level of random effects. This estimation method is evaluated on pharmacokinetic cross-over simulated trials, mimicking theophylline concentration data. Results obtained on those datasets with either the SAEM algorithm or the FOCE algorithm (implemented in the nlme function of R software) are compared: biases and RMSEs of almost all the SAEM estimates are smaller than the FOCE ones. Finally, we apply the extended SAEM algorithm to analyze the pharmacokinetic interaction of tenofovir on atazanavir, a novel protease inhibitor, from the ANRS 107-Puzzle 2 study. A significant decrease of the area under the curve of atazanavir is found in patients receiving both treatments.
doi:10.1093/biostatistics/kxn020
PMCID: PMC2722900
PMID: 18583352
Algorithms; Anti-HIV Agents/therapeutic use; Area Under Curve; Bias (Epidemiology); Biometry/methods; Cluster Analysis; Cross-Over Studies; Drug Interactions; Humans; Likelihood Functions; Markov Chains; Monte Carlo Method; Nonlinear Dynamics; Oligopeptides/pharmacokinetics; Pyridines/pharmacokinetics; Regression Analysis; Theophylline/pharmacokinetics; Therapeutic Equivalency; Time Factors; Multilevel nonlinear mixed effects models; SAEM algorithm; Multiple periods; Cross-over trial; Bioequivalence trials
SUMMARY
For many diseases, it is difficult or impossible to establish a definitive diagnosis because a perfect “gold standard” may not exist or may be too costly to obtain. In this paper, we propose a method to use continuous test results to estimate prevalence of disease in a given population and to estimate the effects of factors that may influence prevalence. Motivated by a study of human herpesvirus 8 among children with sickle-cell anemia in Uganda, where 2 enzyme immunoassays were used to assess infection status, we fit 2-component multivariate mixture models. We model the component densities using parametric densities that include data transformation as well as flexible transformed models. In addition, we model the mixing proportion, the probability of a latent variable corresponding to the true unknown infection status, via a logistic regression to incorporate covariates. This model includes mixtures of multivariate normal densities as a special case and is able to accommodate unusual shapes and skewness in the data. We assess model performance in simulations and present results from applying various parameterizations of the model to the Ugandan study.
doi:10.1093/biostatistics/kxm018
PMCID: PMC2710882
PMID: 17566074
Diagnostic tests; Mixture models; Semi-nonparametric densities; Semiparametrics; Sensitivity; Specificity; Transformations
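The model structure described in the abstract above, a 2-component mixture whose mixing proportion depends on covariates through a logistic regression, can be sketched as follows. The notation is an assumption introduced here for illustration: $y$ is the vector of continuous test results, $x$ the covariates, $f_1$ and $f_0$ the component densities for infected and uninfected subjects, and $\beta$ the logistic regression coefficients. The parametric and transformed forms of $f_1$ and $f_0$ used in the paper are not specified in the abstract and are left generic.

```latex
f(y \mid x) = \pi(x)\, f_1(y) + \bigl(1 - \pi(x)\bigr)\, f_0(y),
\qquad
\log\frac{\pi(x)}{1 - \pi(x)} = x^{\top}\beta
```

Here $\pi(x)$ plays the role of prevalence among subjects with covariates $x$, which is how the mixture yields prevalence estimates without a gold standard.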
Summary
Genetic epidemiologic studies often involve investigation of the association of a disease with a genomic region in terms of the underlying haplotypes, that is the combination of alleles at multiple loci along homologous chromosomes. In this article, we consider the problem of estimating haplotype–environment interactions from case–control studies when some of the environmental exposures themselves may be influenced by genetic susceptibility. We specify the distribution of the diplotypes (haplotype pair) given environmental exposures for the underlying population based on a novel semiparametric model that allows haplotypes to be potentially related with environmental exposures, while allowing the marginal distribution of the diplotypes to maintain certain population genetics constraints such as Hardy–Weinberg equilibrium. The marginal distribution of the environmental exposures is allowed to remain completely nonparametric. We develop a semiparametric estimating equation methodology and related asymptotic theory for estimation of the disease odds ratios associated with the haplotypes, environmental exposures, and their interactions, parameters that characterize haplotype–environment associations and the marginal haplotype frequencies. The problem of phase ambiguity of genotype data is handled using a suitable expectation–maximization algorithm. We study the finite-sample performance of the proposed methodology using simulated data. An application of the methodology is illustrated using a case–control study of colorectal adenoma, designed to investigate how the smoking-related risk of colorectal adenoma can be modified by “NAT2,” a smoking-metabolism gene that may potentially influence susceptibility to smoking itself.
doi:10.1093/biostatistics/kxm011
PMCID: PMC2683243
PMID: 17490987
Case-control studies; EM algorithm; Gene-environment interactions; Haplotype; Semiparametric methods
Summary
We propose a simple and general resampling strategy to estimate variances for parameter estimators derived from nonsmooth estimating functions. This approach applies to a wide variety of semiparametric and nonparametric problems in biostatistics. It does not require solving estimating equations and is thus much faster than the existing resampling procedures. Its usefulness is illustrated with heteroscedastic quantile regression and censored data rank regression. Numerical results based on simulated and real data are provided.
doi:10.1093/biostatistics/kxm034
PMCID: PMC2673016
PMID: 17925303
Bootstrap; Censoring; Quantile regression; Rank regression; Robustness; Variance estimation
SUMMARY
Microsatellite instability (MSI) testing is a common screening procedure used to identify families that may harbor mutations of a mismatch repair (MMR) gene and therefore may be at high risk for hereditary colorectal cancer. A reliable estimate of sensitivity and specificity of MSI for detecting germline mutations of MMR genes is critical in genetic counseling and colorectal cancer prevention. Several studies published results of both MSI and mutation analysis on the same subjects. In this article we perform a meta-analysis of these studies and obtain estimates that can be directly used in counseling and screening. In particular, we estimate the sensitivity of MSI for detecting mutations of MSH2 and MLH1 to be 0.81 (0.73–0.89). Statistically, challenges arise from the following: (a) traditional mutation analysis methods used in these studies cannot be considered a gold standard for the identification of mutations; (b) studies are heterogeneous in both the design and the populations considered; and (c) studies may include different patterns of missing data resulting from partial testing of the populations sampled. We address these challenges in the context of a Bayesian meta-analytic implementation of the Hui–Walter design, tailored to account for various forms of incomplete data. Posterior inference is handled via a Gibbs sampler.
doi:10.1093/biostatistics/kxi021
PMCID: PMC2274000
PMID: 15831578
Diagnostic test; Hereditary nonpolyposis colorectal cancer; Microsatellite instability; Sensitivity; Specificity
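To make the two quantities estimated in the abstract above concrete, the following minimal sketch computes sensitivity and specificity from a 2x2 table. It assumes the reference result is error-free, which the abstract explicitly says does not hold for traditional mutation analysis; the Bayesian Hui–Walter approach of the paper is not implemented here. The function name and the counts are hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Naive estimates from a 2x2 table, treating the reference as perfect.

    tp/fn: reference-positive subjects testing positive/negative.
    tn/fp: reference-negative subjects testing negative/positive.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one reference-positive and one reference-negative subject")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts for illustration only; not data from the meta-analysis.
sens, spec = sensitivity_specificity(tp=45, fn=5, tn=80, fp=20)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

When the reference test itself misclassifies some subjects, these naive ratios are biased, which motivates latent-class designs such as the one named in the abstract.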