SUMMARY
We consider the problem of estimating sparse graphs by a lasso penalty applied to the inverse covariance matrix. Using a coordinate descent procedure for the lasso, we develop a simple algorithm—the graphical lasso—that is remarkably fast: It solves a 1000-node problem (~500 000 parameters) in at most a minute and is 30–4000 times faster than competing methods. It also provides a conceptual link between the exact problem and the approximation suggested by Meinshausen and Bühlmann (2006). We illustrate the method on some cell-signaling data from proteomics.
doi:10.1093/biostatistics/kxm045
PMCID: PMC3019769
PMID: 18079126
Gaussian covariance; Graphical model; L1; Lasso
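As an illustration of the coordinate descent step mentioned in the summary above, the following is a minimal sketch of a lasso solver using cyclic coordinate descent with soft-thresholding. It assumes the subproblem is posed as minimizing (1/2) b'Vb - s'b + rho * |b|_1, which is a common way to write the lasso inner step. The names V, s, rho and both function names are illustrative assumptions, not taken from the paper, and the outer loop over rows and columns of the covariance estimate is omitted.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t; values within [-t, t] become exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso_coordinate_descent(V, s, rho, n_sweeps=100, tol=1e-10):
    """Minimize 0.5 * b'Vb - s'b + rho * sum(|b_j|) by cyclic coordinate descent.

    V is a symmetric positive definite matrix given as a list of lists,
    s is a list of the same length, and rho >= 0 is the L1 penalty
    (all hypothetical names used for this sketch).
    """
    p = len(s)
    beta = [0.0] * p
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            # Partial residual: remove the contribution of all other coordinates.
            partial = s[j] - sum(V[k][j] * beta[k] for k in range(p) if k != j)
            new_value = soft_threshold(partial, rho) / V[j][j]
            max_change = max(max_change, abs(new_value - beta[j]))
            beta[j] = new_value
        if max_change < tol:
            break  # no coordinate moved appreciably during this sweep
    return beta
```

For V = [[2, 1], [1, 2]], s = [3, 0] and rho = 1, the sketch returns [1.0, 0.0]: the second coefficient is set exactly to zero, which is the sparsity-inducing behavior the lasso penalty is used for.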
SUMMARY
For many diseases, it is difficult or impossible to establish a definitive diagnosis because a perfect “gold standard” may not exist or may be too costly to obtain. In this paper, we propose a method to use continuous test results to estimate prevalence of disease in a given population and to estimate the effects of factors that may influence prevalence. Motivated by a study of human herpesvirus 8 among children with sickle-cell anemia in Uganda, where 2 enzyme immunoassays were used to assess infection status, we fit 2-component multivariate mixture models. We model the component densities using parametric densities that include data transformation as well as flexible transformed models. In addition, we model the mixing proportion, the probability of a latent variable corresponding to the true unknown infection status, via a logistic regression to incorporate covariates. This model includes mixtures of multivariate normal densities as a special case and is able to accommodate unusual shapes and skewness in the data. We assess model performance in simulations and present results from applying various parameterizations of the model to the Ugandan study.
doi:10.1093/biostatistics/kxm018
PMCID: PMC2710882
PMID: 17566074
Diagnostic tests; Mixture models; Semi-nonparametric densities; Semiparametrics; Sensitivity; Specificity; Transformations
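The mixture structure described in the summary above can be sketched in a deliberately simplified form. The example below assumes a single continuous test result, normal component densities with made-up parameters, and a logistic model for the mixing proportion; the paper itself uses multivariate, transformed, and semi-nonparametric densities, none of which are reproduced here. All function and parameter names are hypothetical.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def prevalence(covariates, coefs, intercept):
    """Mixing proportion modeled by logistic regression on the covariates."""
    eta = intercept + sum(b * x for b, x in zip(coefs, covariates))
    return 1.0 / (1.0 + math.exp(-eta))


def posterior_infected(y, covariates, coefs, intercept,
                       uninfected=(0.0, 1.0), infected=(3.0, 1.0)):
    """Posterior probability of the 'infected' component given test result y.

    `uninfected` and `infected` are (mean, sd) pairs chosen only for
    illustration; they are assumptions, not estimates from the study.
    """
    pi = prevalence(covariates, coefs, intercept)
    f1 = normal_pdf(y, *infected)
    f0 = normal_pdf(y, *uninfected)
    return pi * f1 / (pi * f1 + (1.0 - pi) * f0)
```

When the two component densities are equal at y, the posterior probability reduces to the covariate-dependent prevalence, which is one quick sanity check on the sketch.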
The Cochran–Armitage trend test (CATT) is well suited for testing association between a marker and a disease in case–control studies. When the underlying genetic model for the disease is known, the CATT optimal for the genetic model is used. For complex diseases, however, the genetic models of the true disease loci are unknown. In this situation, robust tests are preferable. We propose a two-phase analysis with model selection for the case–control design. In the first phase, we use the difference of Hardy–Weinberg disequilibrium coefficients between the cases and the controls for model selection. Then, an optimal CATT corresponding to the selected model is used for testing association. The correlation of the statistics used for selection and the test for association is derived to adjust the two-phase analysis with control of the Type-I error rate. The simulation studies show that this new approach has greater efficiency robustness than the existing methods.
doi:10.1093/biostatistics/kxm039
PMCID: PMC3294316
PMID: 18003629
Cochran–Armitage trend test; Disease risk; Efficiency robustness; Hardy–Weinberg disequilibrium; SNP
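To make the association-testing step concrete, here is a minimal sketch of the Cochran–Armitage trend statistic for a 2 x 3 genotype table. It uses the conventional score vectors for recessive, additive, and dominant models, which are an assumption here because the summary does not list them. The Hardy–Weinberg-based model selection and the correlation adjustment proposed in the paper are not implemented; the function and variable names are hypothetical.

```python
import math

# Conventional scores for genotypes carrying 0, 1, 2 copies of the risk allele
# (assumed; not stated in the summary).
SCORES = {
    "recessive": (0.0, 0.0, 1.0),
    "additive": (0.0, 0.5, 1.0),
    "dominant": (0.0, 1.0, 1.0),
}


def catt_z(cases, controls, scores):
    """Cochran-Armitage trend statistic Z for genotype counts.

    cases and controls are genotype counts in the same order as scores.
    Under the null hypothesis of no association, Z is approximately
    standard normal in large samples.
    """
    n = [r + s for r, s in zip(cases, controls)]
    total_cases, total_controls = sum(cases), sum(controls)
    total = total_cases + total_controls
    sum_xr = sum(x * r for x, r in zip(scores, cases))
    sum_xn = sum(x * m for x, m in zip(scores, n))
    sum_x2n = sum(x * x * m for x, m in zip(scores, n))
    # Score-weighted difference between cases and its expectation under the null.
    u = total * sum_xr - total_cases * sum_xn
    variance = (total_cases * total_controls / total) * (total * sum_x2n - sum_xn ** 2)
    return u / math.sqrt(variance)
```

For example, `catt_z([10, 20, 30], [30, 20, 10], SCORES["additive"])` yields a statistic whose square is 20, while identical genotype counts in cases and controls yield 0.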
Late-onset (LO) toxicities are a serious concern in many phase I trials. Since most dose-limiting toxicities occur soon after therapy begins, most dose-finding methods use a binary indicator of toxicity occurring within a short initial time period. If an agent causes LO toxicities, however, an undesirably large number of patients may be treated at toxic doses before any toxicities are observed. A method addressing this problem is the time-to-event continual reassessment method (TITE-CRM, Cheung and Chappell, 2000). We propose a Bayesian dose-finding method similar to the TITE-CRM in which doses are chosen using time-to-toxicity data. The new aspect of our method is a set of rules, based on predictive probabilities, that temporarily suspend accrual if the risk of toxicity at prospective doses for future patients is unacceptably high. If additional follow-up data reduce the predicted risk of toxicity to an acceptable level, then accrual is restarted, and this process may be repeated several times during the trial. A simulation study shows that the proposed method provides a greater degree of safety than the TITE-CRM, while still reliably choosing the preferred dose. This advantage increases with accrual rate, but the price of this additional safety is that the trial takes longer to complete on average.
doi:10.1093/biostatistics/kxm044
PMCID: PMC3294317
PMID: 18084008
Adaptive design; Bayesian inference; Dose finding; Isotonic regression; Latent variables; Markov chain Monte Carlo; Ordinal modeling; Predictive probability
When applying hierarchical clustering algorithms to cluster patient samples from microarray data, the clustering patterns generated by most algorithms tend to be dominated by groups of highly differentially expressed genes that have closely related expression patterns. Sometimes, these genes may not be relevant to the biological process under study or their functions may already be known. The problem is that these genes can potentially drown out the effects of other genes that are relevant or have novel functions. We propose a procedure called complementary hierarchical clustering that is designed to uncover the structures arising from these novel genes that are not as highly expressed. Simulation studies show that the procedure is effective when applied to a variety of examples. We also define a concept called relative gene importance that can be used to identify the influential genes in a given clustering. Finally, we analyze a microarray data set from 295 breast cancer patients, using clustering with the correlation-based distance measure. The complementary clustering reveals a grouping of the patients that is uncorrelated with a number of known prognostic signatures and whose groups differ significantly in distant metastasis-free probability.
doi:10.1093/biostatistics/kxm046
PMCID: PMC3294318
PMID: 18093965
Hierarchical clustering; Microarray; Principal components; Relative gene importance
When testing large numbers of null hypotheses, one needs to assess the evidence against the global null hypothesis that none of the hypotheses is false. Such evidence typically is based on the test statistic of the largest magnitude, whose statistical significance is evaluated by permuting the sample units to simulate its null distribution. Efron (2007) has noted that correlation among the test statistics can induce substantial interstudy variation in the shapes of their histograms, which may cause misleading tail counts. Here, we show that permutation-based estimates of the overall significance level also can be misleading when the test statistics are correlated. We propose that such estimates be conditioned on a simple measure of the spread of the observed histogram, and we provide a method for obtaining conditional significance levels. We justify this conditioning using the conditionality principle described by Cox and Hinkley (1974). Application of the method to gene expression data illustrates the circumstances when conditional significance levels are needed.
doi:10.1093/biostatistics/kxm047
PMCID: PMC3294319
PMID: 18089626
Conditional p-value; Gene expression data; Genome-wide association data; Multiple testing; Overall p-value
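The unconditional permutation procedure that the summary above takes as its starting point can be sketched as follows. This is only the standard max-statistic permutation test, using an absolute difference in group means as an assumed per-hypothesis statistic; the conditioning on histogram spread that the paper proposes is not implemented. Function names, the choice of statistic, and the add-one correction are illustrative assumptions.

```python
import random
from statistics import mean


def max_abs_mean_diff(data, labels):
    """Largest absolute difference in group means across all features.

    data is a list of features, each a list of per-sample values;
    labels is a list of 0/1 group indicators, one per sample.
    """
    largest = 0.0
    for row in data:
        group1 = [v for v, g in zip(row, labels) if g == 1]
        group0 = [v for v, g in zip(row, labels) if g == 0]
        largest = max(largest, abs(mean(group1) - mean(group0)))
    return largest


def permutation_global_p(data, labels, n_perm=1000, seed=0):
    """Permutation p-value for the global null, based on the maximum statistic."""
    rng = random.Random(seed)
    observed = max_abs_mean_diff(data, labels)
    permuted = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # permute sample units, keeping features intact
        if max_abs_mean_diff(data, permuted) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Because whole samples are permuted, the correlation among features is preserved within each permutation; the summary's point is that the resulting overall significance level can still be misleading unless it is conditioned on the spread of the observed histogram.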
SUMMARY
Genetic epidemiologic studies often involve investigation of the association of a disease with a genomic region in terms of the underlying haplotypes, that is, the combination of alleles at multiple loci along homologous chromosomes. In this article, we consider the problem of estimating haplotype–environment interactions from case–control studies when some of the environmental exposures themselves may be influenced by genetic susceptibility. We specify the distribution of the diplotypes (haplotype pairs) given environmental exposures for the underlying population based on a novel semiparametric model that allows haplotypes to be potentially related to environmental exposures, while allowing the marginal distribution of the diplotypes to maintain certain population genetics constraints such as Hardy–Weinberg equilibrium. The marginal distribution of the environmental exposures is allowed to remain completely nonparametric. We develop a semiparametric estimating equation methodology and related asymptotic theory for estimation of the disease odds ratios associated with the haplotypes, environmental exposures, and their interactions, as well as the parameters that characterize haplotype–environment associations and the marginal haplotype frequencies. The problem of phase ambiguity of genotype data is handled using a suitable expectation–maximization algorithm. We study the finite-sample performance of the proposed methodology using simulated data. An application of the methodology is illustrated using a case–control study of colorectal adenoma, designed to investigate how the smoking-related risk of colorectal adenoma can be modified by “NAT2,” a smoking-metabolism gene that may potentially influence susceptibility to smoking itself.
doi:10.1093/biostatistics/kxm011
PMCID: PMC2683243
PMID: 17490987
Case-control studies; EM algorithm; Gene-environment interactions; Haplotype; Semiparametric methods
SUMMARY
We propose a simple and general resampling strategy to estimate variances for parameter estimators derived from nonsmooth estimating functions. This approach applies to a wide variety of semiparametric and nonparametric problems in biostatistics. It does not require solving estimating equations and is thus much faster than the existing resampling procedures. Its usefulness is illustrated with heteroscedastic quantile regression and censored data rank regression. Numerical results based on simulated and real data are provided.
doi:10.1093/biostatistics/kxm034
PMCID: PMC2673016
PMID: 17925303
Bootstrap; Censoring; Quantile regression; Rank regression; Robustness; Variance estimation