We address the problem of sparse selection in linear models. A number of nonconvex penalties have been proposed in the literature for this purpose, along with a variety of convex-relaxation algorithms for finding good solutions. In this article we pursue a coordinate-descent approach for optimization, and study its convergence properties. We characterize the properties of penalties suitable for this approach, study their corresponding threshold functions, and describe a df-standardizing reparametrization that assists our pathwise algorithm. The MC+ penalty is ideally suited to this task, and we use it to demonstrate the performance of our algorithm. Certain technical derivations and experiments related to this article are included in the Supplementary Materials section.
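The threshold function that a coordinate-descent update would apply under the MC+ penalty can be sketched as follows. This is a minimal illustration, not the article's implementation: it assumes a standardized (unit-norm) predictor and gamma > 1, uses the three-piece form commonly given for MC+ (zero, rescaled soft threshold, identity), and the name `mcplus_threshold` is hypothetical.

```python
def mcplus_threshold(beta, lam, gamma):
    """Univariate MC+ threshold (assumed form; requires gamma > 1).

    beta  : univariate least-squares coefficient for a standardized predictor
    lam   : penalty level; coefficients with |beta| <= lam are set to zero
    gamma : concavity parameter; large gamma approaches soft thresholding,
            gamma near 1 approaches hard thresholding
    """
    if gamma <= 1:
        raise ValueError("gamma must exceed 1 in this sketch")
    size = abs(beta)
    if size <= lam:
        return 0.0
    if size <= lam * gamma:
        sign = 1.0 if beta > 0 else -1.0
        # Soft threshold, rescaled so the function is continuous at lam * gamma.
        return sign * (size - lam) / (1.0 - 1.0 / gamma)
    # Large coefficients are left unshrunk.
    return float(beta)
```

Large coefficients pass through unshrunk, which is what separates this family from the lasso's soft threshold.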
doi:10.1198/jasa.2011.tm09738
PMCID: PMC4286300
PMID: 25580042
Degrees of freedom; LASSO; Nonconvex optimization; Regularization surface; Sparse regression; Variable selection
We study the variability of predictions made by bagged learners and random forests, and show how to estimate standard errors for these methods. Our work builds on variance estimates for bagging proposed by Efron (1992, 2013) that are based on the jackknife and the infinitesimal jackknife (IJ). In practice, bagged predictors are computed using a finite number B of bootstrap replicates, and working with a large B can be computationally expensive. Direct applications of jackknife and IJ estimators to bagging require B = Θ(n^{1.5}) bootstrap replicates to converge, where n is the size of the training set. We propose improved versions that only require B = Θ(n) replicates. Moreover, we show that the IJ estimator requires 1.7 times fewer bootstrap replicates than the jackknife to achieve a given accuracy. Finally, we study the sampling distributions of the jackknife and IJ variance estimates themselves. We illustrate our findings with multiple experiments and simulation studies.
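A rough sketch of the IJ variance estimate for a bagged prediction is given below. It assumes the usual form: the sum, over training points, of the squared bootstrap covariance between how often a point appears in a replicate and that replicate's prediction. The optional correction subtracts a Monte Carlo bias term of the form (n / B^2) times the sum of squared deviations of the replicate predictions. The abstract does not state these formulas, so both are assumptions here, and the function name and toy data are hypothetical.

```python
import random

def ij_variance(counts, preds, bias_correct=True):
    """Infinitesimal-jackknife variance estimate for a bagged prediction.

    counts[b][i] : times training point i appears in bootstrap replicate b
    preds[b]     : prediction made by the learner fit on replicate b
    """
    B, n = len(preds), len(counts[0])
    t_bar = sum(preds) / B
    total = 0.0
    for i in range(n):
        n_bar = sum(c[i] for c in counts) / B
        cov = sum((c[i] - n_bar) * (t - t_bar) for c, t in zip(counts, preds)) / B
        total += cov ** 2
    if bias_correct:
        # Assumed Monte Carlo bias correction for finite B.
        total -= n / B ** 2 * sum((t - t_bar) ** 2 for t in preds)
    return total

# Toy usage: the "learner" is the sample mean, bagged over B = 2000 replicates.
rng = random.Random(0)
y = [rng.gauss(0.0, 1.0) for _ in range(20)]
n, B = len(y), 2000
counts, preds = [], []
for _ in range(B):
    c = [0] * n
    for _draw in range(n):
        c[rng.randrange(n)] += 1
    counts.append(c)
    preds.append(sum(ci * yi for ci, yi in zip(c, y)) / n)
v_raw = ij_variance(counts, preds, bias_correct=False)
v_corrected = ij_variance(counts, preds)
```

For the sample mean the large-B limit of this estimate is the sum of squared deviations of y divided by n^2, which gives a simple sanity check on the toy run.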
PMCID: PMC4286302
PMID: 25580094
bagging; jackknife methods; Monte Carlo noise; variance estimation
The graphical lasso [5] is an algorithm for learning the structure in an undirected Gaussian graphical model, using ℓ1 regularization to control the number of zeros in the precision matrix Θ = Σ^{-1} [2, 11]. The R package GLASSO [5] is popular, fast, and allows one to efficiently build a path of models for different values of the tuning parameter. Convergence of GLASSO can be tricky; the converged precision matrix might not be the inverse of the estimated covariance, and occasionally it fails to converge with warm starts. In this paper we explain this behavior, and propose new algorithms that appear to outperform GLASSO.
By studying the “normal equations” we see that GLASSO is solving the dual of the graphical lasso penalized likelihood by block coordinate ascent, a result which can also be found in [2]. In this dual, the target of estimation is Σ, the covariance matrix, rather than the precision matrix Θ. We propose similar primal algorithms, P-GLASSO and DP-GLASSO, that also operate by block-coordinate descent, where Θ is the optimization target. We study all of these algorithms, and in particular different approaches to solving their coordinate sub-problems. We conclude that DP-GLASSO is superior from several points of view.
doi:10.1214/12-EJS740
PMCID: PMC4281944
PMID: 25558297
Graphical lasso; sparse inverse covariance selection; precision matrix; convex analysis/optimization; positive definite matrices; sparsity; semidefinite programming
Summary
We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui and his colleagues have proposed ‘SAFE’ rules, based on univariate inner products between each predictor and the outcome, which guarantee that a coefficient will be 0 in the solution vector. This provides a reduction in the number of variables that need to be entered into the optimization. We propose strong rules that are very simple and yet screen out far more predictors than the SAFE rules. This great practical improvement comes at a price: the strong rules are not foolproof and can mistakenly discard active predictors, i.e. predictors that have non-zero coefficients in the solution. We therefore combine them with simple checks of the Karush–Kuhn–Tucker conditions to ensure that the exact solution to the convex problem is delivered. Of course, any (approximate) screening method can be combined with the Karush–Kuhn–Tucker conditions to ensure the exact solution; the strength of the strong rules lies in the fact that, in practice, they discard a very large number of the inactive predictors and almost never commit mistakes. We also derive conditions under which they are foolproof. Strong rules provide substantial savings in computational time for a variety of statistical optimization problems.
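The screen-then-check idea can be sketched as below. The abstract does not state the rule itself; this sketch assumes the basic strong rule takes the form "discard predictor j when |x_j'y| < 2*lam - lam_max", with lam_max the largest absolute inner product, for the lasso objective (1/2)||y - X b||^2 + lam*||b||_1. The function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def basic_strong_rule(columns, y, lam):
    """Split predictor indices into (kept, discarded) using the assumed rule
    |x_j'y| < 2*lam - lam_max  =>  discard j."""
    scores = [abs(dot(x, y)) for x in columns]
    lam_max = max(scores)
    kept = [j for j, s in enumerate(scores) if s >= 2 * lam - lam_max]
    discarded = [j for j, s in enumerate(scores) if s < 2 * lam - lam_max]
    return kept, discarded

def kkt_violations(columns, residual, lam, discarded):
    """Discarded predictors whose zero coefficient violates the KKT condition
    |x_j'r| <= lam; these must be added back and the problem re-solved."""
    return [j for j in discarded if abs(dot(columns[j], residual)) > lam]

# Toy usage with an orthonormal design, where the lasso solution is a
# soft threshold of the inner products.
columns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 1.5, 0.2]
kept, discarded = basic_strong_rule(columns, y, lam=2.0)
residual = [2.0, 1.5, 0.2]  # y - X b for the solution b = (1, 0, 0) at lam = 2
violations = kkt_violations(columns, residual, 2.0, discarded)
```

The second predictor survives the screen even though its coefficient is zero at this lam. That is harmless, because the rule only has to avoid discarding active predictors.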
doi:10.1111/j.1467-9868.2011.01004.x
PMCID: PMC4262615
PMID: 25506256
Presence-only data abounds in ecology, often accompanied by a background sample. Although many interesting aspects of the species’ distribution can be learned from such data, one cannot learn the overall species occurrence probability, or prevalence, without making unjustified simplifying assumptions. In this forum article we question the approach of Royle et al. (2012) that claims to be able to do this.
doi:10.1111/j.1600-0587.2013.00321.x
PMCID: PMC4258395
PMID: 25492992
Statistical modeling of presence-only data has attracted much recent attention in the ecological literature, leading to a proliferation of methods, including the inhomogeneous Poisson process (IPP) model, maximum entropy (Maxent) modeling of species distributions and logistic regression models. Several recent articles have shown the close relationships between these methods. We explain why the IPP intensity function is a more natural object of inference in presence-only studies than occurrence probability (which is only defined with reference to quadrat size), and why presence-only data only allows estimation of relative, and not absolute intensity of species occurrence.
All three of the above techniques amount to parametric density estimation under the same exponential family model (in the case of the IPP, the fitted density is multiplied by the number of presence records to obtain a fitted intensity). We show that IPP and Maxent give the exact same estimate for this density, but logistic regression in general yields a different estimate in finite samples. When the model is misspecified—as it practically always is—logistic regression and the IPP may have substantially different asymptotic limits with large data sets. We propose “infinitely weighted logistic regression,” which is exactly equivalent to the IPP in finite samples. Consequently, many already-implemented methods extending logistic regression can also extend the Maxent and IPP models in directly analogous ways using this technique.
doi:10.1214/13-AOAS667
PMCID: PMC4258396
PMID: 25493106
Presence-only data; logistic regression; maximum entropy; Poisson process models; species modeling; case-control sampling
For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept–reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set.
Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients θ*. By contrast, our estimator is consistent for θ* provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to 1 + 1/c if we multiply the baseline acceptance probabilities by c > 1 (and weight points with acceptance probability greater than 1), taking roughly (1 + c)/2 times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
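A minimal sketch of the accept-reject step and the post-hoc adjustment follows. It assumes that the acceptance probability is |y - p(x)|, with p the pilot model's fitted probability, and that the adjustment adds the pilot coefficients to those fitted on the subsample. Neither formula is stated in the abstract, so both are assumptions; the helper names are hypothetical and the logistic fit on the subsample is left out.

```python
import math
import random

def pilot_probability(x, pilot):
    """Fitted probability under the pilot logistic model."""
    z = sum(t * xi for t, xi in zip(pilot, x))
    return 1.0 / (1.0 + math.exp(-z))

def accept_prob(x, y, pilot):
    """Assumed acceptance probability: large when the response is
    conditionally rare given the features."""
    return abs(y - pilot_probability(x, pilot))

def local_case_control_sample(X, Y, pilot, rng):
    """One pass over the data; each point is kept independently."""
    return [(x, y) for x, y in zip(X, Y) if rng.random() < accept_prob(x, y, pilot)]

def adjust(theta_subsample, pilot):
    """Assumed post-hoc correction: add the pilot back to the subsample fit."""
    return [a + b for a, b in zip(theta_subsample, pilot)]

# Toy usage: intercept-only pilot for a rare positive class.
rng = random.Random(1)
X = [[1.0] for _ in range(1000)]
Y = [1 if rng.random() < 0.02 else 0 for _ in range(1000)]
sample = local_case_control_sample(X, Y, pilot=[-4.0], rng=rng)
```

Because each accept decision depends only on that point and the pilot, the scan parallelizes trivially across chunks of the data.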
doi:10.1214/14-AOS1220
PMCID: PMC4258397
PMID: 25492979
Logistic regression; case-control sampling; subsampling
We consider the sparse inverse covariance regularization problem or graphical lasso with regularization parameter λ. Suppose the sample covariance graph formed by thresholding the entries of the sample covariance matrix at λ is decomposed into connected components. We show that the vertex-partition induced by the connected components of the thresholded sample covariance graph (at λ) is exactly equal to that induced by the connected components of the estimated concentration graph, obtained by solving the graphical lasso problem for the same λ. This characterizes a very interesting property of a path of graphical lasso solutions. Furthermore, this simple rule, when used as a wrapper around existing algorithms for the graphical lasso, leads to enormous performance gains. For a range of values of λ, our proposal splits a large graphical lasso problem into smaller tractable problems, making it possible to solve an otherwise infeasible large-scale problem. We illustrate the graceful scalability of our proposal via synthetic and real-life microarray examples.
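The screening rule described above can be sketched as a thresholding step followed by a connected-components search. This is an illustrative sketch only: the function name is hypothetical, the diagonal is ignored, and each resulting block would then be passed to whatever graphical lasso solver is in use.

```python
def covariance_components(S, lam):
    """Label each variable with the connected component it falls in when
    variables i and j are joined whenever |S[i][j]| > lam (i != j)."""
    p = len(S)
    label = [-1] * p
    current = 0
    for start in range(p):
        if label[start] != -1:
            continue
        label[start] = current
        stack = [start]
        while stack:  # depth-first search over the thresholded graph
            i = stack.pop()
            for j in range(p):
                if j != i and label[j] == -1 and abs(S[i][j]) > lam:
                    label[j] = current
                    stack.append(j)
        current += 1
    return label

# Toy usage: two blocks at lam = 0.3, one block at lam = 0.07.
S = [
    [1.00, 0.80, 0.05, 0.00],
    [0.80, 1.00, 0.10, 0.00],
    [0.05, 0.10, 1.00, 0.60],
    [0.00, 0.00, 0.60, 1.00],
]
labels = covariance_components(S, 0.3)
```

As lam grows the components can only split further, which is why the rule is useful across a whole path of solutions.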
PMCID: PMC4225650
PMID: 25392704
sparse inverse covariance selection; sparsity; graphical lasso; Gaussian graphical models; graph connected components; concentration graph; large scale covariance estimation
Summary
We propose a method to test the correlation of two random fields when they are both spatially auto-correlated. In this scenario, the assumption of independence for the pair of observations in the standard test does not hold, and as a result we reject in many cases where there is no effect (the precision of the null distribution is overestimated). Our method recovers the null distribution taking into account the autocorrelation. It uses Monte-Carlo methods, and focuses on permuting, and then smoothing and scaling one of the variables to destroy the correlation with the other, while maintaining at the same time the initial autocorrelation. With this simulation model, any test based on the independence of two (or more) random fields can be constructed. This research was motivated by a project in biodiversity and conservation in the Biology Department at Stanford University.
doi:10.1111/biom.12139
PMCID: PMC4108159
PMID: 24571609
Geostatistics; Monte-Carlo methods; Resampling; Spatial autocorrelation; Spatial statistics; Variogram
Objective
In addition to increased risks for aneurysm-related death, previous studies have determined that all-cause mortality in abdominal aortic aneurysm (AAA) patients is excessive and equivalent to that associated with coronary heart disease (CHD). These studies largely preceded the current era of CHD risk factor management, however, and no recent study has examined contemporary mortality associated with early AAA disease (aneurysm diameter between 3 and 5 cm). As part of an ongoing natural history study of AAA, we report the mortality risk associated with presence of early disease.
Methods
Participants were recruited from three distinct health care systems in Northern California between 2006 and 2011. Aneurysm diameter, demographic information, co-morbidities, medication history, and plasma for biomarker analysis were collected at study entry. Survival status was determined at follow-up. Data were analyzed with t- or chi-square tests where appropriate. Freedom from death was calculated via Cox proportional hazards modeling; the relevance of individual predictors on mortality was determined by log-rank test.
Results
The study enrolled 645 AAA patients; age 76.4 +/− 8.0 years, aortic diameter 3.86 +/− 0.7 cm. Participants were mostly male (88.8%), not current smokers (81.6%), and taking statins (76.7%). Mean follow-up was 2.1 +/− 1.0 years. Estimated 1- and 3-year survival was 98.2% and 90.9%, respectively. Factors independently associated with mortality included larger aneurysm size (HR 2.12, 95% CI 1.26 – 3.57 for diameter >4.0 cm) and diabetes (HR 2.24, 95% CI 1.12 – 4.47). After adjusting for patient-level factors, health care system independently predicted mortality.
Conclusion
Contemporary all-cause mortality for patients with early AAA disease is lower than that previously reported. Further research is warranted to determine important factors that contribute to improved survival in early AAA disease.
doi:10.1016/j.jvs.2012.04.023
PMCID: PMC3478494
PMID: 22832264
Introduction
Gene expression profiling has been extensively used to predict outcome in breast cancer patients. We have previously reported on biological hypothesis-driven analysis of gene expression profiling data and we wished to extend this approach through the combinations of various gene signatures to improve the prediction of outcome in breast cancer.
Methods
We have used gene expression data (25,000 gene probes) from a previously published study of tumours from 295 early stage breast cancer patients from the Netherlands Cancer Institute using updated follow-up. Tumours were assigned to three prognostic groups using the previously reported wound-response and hypoxia-response signatures, and the outcome in each of these subgroups was evaluated.
Results
We have assigned invasive breast carcinomas from 295 stages I and II breast cancer patients to three groups based on gene expression profiles subdivided by the wound-response signature (WS) and hypoxia-response signature (HS). These three groups are (1) quiescent WS/non-hypoxic HS; (2) activated WS/non-hypoxic HS or quiescent WS/hypoxic tumours and (3) activated WS/hypoxic HS. The overall survival at 15 years for patients with tumours in groups 1, 2 and 3 is 79%, 59% and 27%, respectively. In multivariate analysis, this signature is not only independent of clinical and pathological risk factors; it is also the strongest predictor of outcome. Compared to a previously identified 70-gene prognosis profile, obtained with supervised classification, the combination of signatures performs roughly equally well and might have additional value in the ER-negative subgroup. In the subgroup of lymph node positive patients, the combination signature outperforms the 70-gene signature in multivariate analysis. In addition, in multivariate analysis, the WS/HS combination is a stronger predictor of outcome compared to the recently reported invasiveness gene signature combined with the WS.
Conclusion
A combination of biological gene expression signatures can be used to identify a powerful and independent predictor for outcome in breast cancer patients.
doi:10.1016/j.ejca.2008.07.015
PMCID: PMC3756930
PMID: 18715778
Microarray analysis; Breast cancer; Prognostic markers; Biological gene expression profiles
Shen-Orr, Shai S | Tibshirani, Robert | Khatri, Purvesh | Bodian, Dale L | Staedtler, Frank | Perry, Nicholas M | Hastie, Trevor | Sarwal, Minnie M | Davis, Mark M | Butte, Atul J
We describe cell type–specific significance analysis of microarrays (csSAM) for analyzing differential gene expression for each cell type in a biological sample from microarray data and relative cell-type frequencies. First, we validated csSAM with predesigned mixtures and then applied it to whole-blood gene expression datasets from stable post-transplant kidney transplant recipients and those experiencing acute transplant rejection, which revealed hundreds of differentially expressed genes that were otherwise undetectable.
doi:10.1038/nmeth.1439
PMCID: PMC3699332
PMID: 20208531
Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure involves modeling each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified through an application of the fused lasso penalty to each feature. Some simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.
doi:10.1093/biostatistics/kxr012
PMCID: PMC3169672
PMID: 21642389
Cancer; DNA copy number; False discovery rate; Mutation
We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm Soft-Impute iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices: for example it can obtain a rank-80 approximation of a 10^6 × 10^6 incomplete matrix with 10^5 observed entries in 2.5 hours, and can fit a rank 40 approximation to the full Netflix training set in 6.6 hours. Our methods show very good performance both in training and test error when compared to other competitive state-of-the-art techniques.
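The iteration described above (fill the missing entries from the current estimate, take an SVD, soft-threshold the singular values, repeat) can be sketched with NumPy as below. This is a naive dense version for illustration only. It does not exploit the sparse-plus-low-rank structure that the abstract credits for scalability, and the function name, stopping rule, and defaults are assumptions.

```python
import numpy as np

def soft_impute(X, observed, lam, n_iter=200, tol=1e-7):
    """Naive dense Soft-Impute sketch.

    X        : 2-D array; values at unobserved positions are ignored
    observed : boolean mask, True where X is observed
    lam      : amount subtracted from each singular value
    """
    Z = np.zeros_like(X, dtype=float)
    for _ in range(n_iter):
        # Keep observed entries, fill the rest from the current estimate.
        filled = np.where(observed, X, Z)
        U, d, Vt = np.linalg.svd(filled, full_matrices=False)
        d = np.maximum(d - lam, 0.0)      # soft-threshold the singular values
        Z_new = (U * d) @ Vt
        if np.linalg.norm(Z_new - Z) <= tol * max(1.0, np.linalg.norm(Z)):
            return Z_new
        Z = Z_new
    return Z

# Toy usage: one missing entry in a 2 x 2 matrix.
X = np.array([[2.0, 1.0], [1.0, np.nan]])
Z = soft_impute(X, ~np.isnan(X), lam=0.1)
```

For a path of solutions, the estimate at one value of lam would be reused as the starting Z for the next value. That is the warm-start idea mentioned in the abstract, although this sketch always starts from zero.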
PMCID: PMC3087301
PMID: 21552465
SUMMARY
We consider the problem of estimating sparse graphs by a lasso penalty applied to the inverse covariance matrix. Using a coordinate descent procedure for the lasso, we develop a simple algorithm—the graphical lasso—that is remarkably fast: It solves a 1000-node problem (~500 000 parameters) in at most a minute and is 30–4000 times faster than competing methods. It also provides a conceptual link between the exact problem and the approximation suggested by Meinshausen and Bühlmann (2006). We illustrate the method on some cell-signaling data from proteomics.
doi:10.1093/biostatistics/kxm045
PMCID: PMC3019769
PMID: 18079126
Gaussian covariance; Graphical model; L1; Lasso
We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.
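For the linear-regression case, a cyclical coordinate-descent update under the elastic net penalty can be sketched as below. The sketch assumes the objective (1/(2n))*||y - X b||^2 + lam*(alpha*||b||_1 + (1 - alpha)/2*||b||_2^2) with no intercept and non-zero columns. This parametrization is an assumption, the function names are hypothetical, and the code is plain Python for clarity rather than speed.

```python
def soft_threshold(z, t):
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net_cd(columns, y, lam, alpha=1.0, n_sweeps=100, tol=1e-10):
    """Cyclical coordinate descent; columns[j] is the j-th predictor."""
    n, p = len(y), len(columns)
    beta = [0.0] * p
    resid = list(y)                                   # y - X beta, kept current
    mean_sq = [sum(v * v for v in x) / n for x in columns]
    for _ in range(n_sweeps):
        max_change = 0.0
        for j, x in enumerate(columns):
            # Inner product with the partial residual that excludes predictor j.
            rho = sum(xi * ri for xi, ri in zip(x, resid)) / n + mean_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (mean_sq[j] + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                resid = [ri - xi * delta for xi, ri in zip(x, resid)]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

# Toy usage: two orthogonal predictors, y = 2*x1 + 0.1*x2.
columns = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]
y = [2.1, -1.9, 1.9, -2.1]
lasso_fit = elastic_net_cd(columns, y, lam=0.5, alpha=1.0)
enet_fit = elastic_net_cd(columns, y, lam=0.5, alpha=0.5)
```

Along a regularization path, the fitted coefficients at one lam would serve as the starting point for the next, so each solve typically needs only a few sweeps.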
PMCID: PMC2929880
PMID: 20808728
Beck, Andrew H | Lee, Cheng-Han | Witten, Daniela M | Gleason, Briana C | Edris, Badreddin | Espinosa, Inigo | Zhu, Shirley | Li, Rui | Montgomery, Kelli D | Marinelli, Robert J | Tibshirani, Robert | Hastie, Trevor | Jablons, David M | Rubin, Brian P | Fletcher, Christopher D | West, Robert B | van de Rijn, Matt
Leiomyosarcoma (LMS) is a soft tissue tumor with a significant degree of morphologic and molecular heterogeneity. We employed integrative molecular profiling to discover and characterize molecular subtypes of LMS. Gene expression profiling was performed on 51 LMS samples. Unsupervised clustering demonstrated 3 reproducible LMS clusters. Array comparative genomic hybridization (aCGH) was performed on 20 LMS samples and demonstrated that the molecular subtypes defined by gene-expression showed distinct genomic changes. Tumors from the “muscle-enriched” cluster showed significantly increased copy number changes (p=0.04). Most muscle-enriched cases showed loss at 16q24 which contains FANCA, known to play an important role in DNA repair, and loss at 1p36 which contains PRDM16, whose loss promotes muscle differentiation. Immunohistochemistry was performed on LMS tissue microarrays (n=377) for five markers with high levels of mRNA in the muscle-enriched cluster (ACTG2, CASQ2, SLMAP, CFL2, MYLK) and demonstrated significantly correlated expression of the 5 proteins (all pairwise p < 0.005). Expression of the 5 markers was associated with improved disease-specific survival (DSS) in a multivariate Cox regression analysis (p < 0.04). In this analysis that combined gene expression profiling, aCGH and immunohistochemistry, we characterized distinct molecular LMS subtypes, provided insight into their pathogenesis, and identified prognostic biomarkers.
doi:10.1038/onc.2009.381
PMCID: PMC2820592
PMID: 19901961
sarcoma; leiomyosarcoma; integrative genomics; gene expression profiling; array comparative genomic hybridization; tissue microarrays
We present a penalized matrix decomposition (PMD), a new framework for computing a rank-K approximation for a matrix. We approximate the matrix X as X̂ = Σ_{k=1}^{K} d_k u_k v_k^T, where d_k, u_k, and v_k minimize the squared Frobenius norm of X − X̂, subject to penalties on u_k and v_k. This results in a regularized version of the singular value decomposition. Of particular interest is the use of L1-penalties on u_k and v_k, which yields a decomposition of X using sparse vectors. We show that when the PMD is applied using an L1-penalty on v_k but not on u_k, a method for sparse principal components results. In fact, this yields an efficient algorithm for the “SCoTLASS” proposal (Jolliffe and others 2003) for obtaining sparse principal components. This method is demonstrated on a publicly available gene expression data set. We also establish connections between the SCoTLASS method for sparse principal component analysis and the method of Zou and others (2006). In addition, we show that when the PMD is applied to a cross-products matrix, it results in a method for penalized canonical correlation analysis (CCA). We apply this penalized CCA method to simulated data and to a genomic data set consisting of gene expression and DNA copy number measurements on the same set of samples.
doi:10.1093/biostatistics/kxp008
PMCID: PMC2697346
PMID: 19377034
Canonical correlation analysis; DNA copy number; Integrative genomic analysis; L1; Matrix decomposition; Principal component analysis; Sparse principal component analysis; SVD
Motivation: In ordinary regression, imposition of a lasso penalty makes continuous model selection straightforward. Lasso penalized regression is particularly advantageous when the number of predictors far exceeds the number of observations.
Method: The present article evaluates the performance of lasso penalized logistic regression in case–control disease gene mapping with a large number of SNP (single nucleotide polymorphism) predictors. The strength of the lasso penalty can be tuned to select a predetermined number of the most relevant SNPs and other predictors. For a given value of the tuning constant, the penalized likelihood is quickly maximized by cyclic coordinate ascent. Once the most potent marginal predictors are identified, their two-way and higher order interactions can also be examined by lasso penalized logistic regression.
Results: This strategy is tested on both simulated and real data. Our findings on coeliac disease replicate the previous SNP results and shed light on possible interactions among the SNPs.
Availability: The software discussed is available in Mendel 9.0 at the UCLA Human Genetics web site.
Contact: klange@ucla.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp041
PMCID: PMC2732298
PMID: 19176549
Current work in elucidating relationships between diseases has largely been based on pre-existing knowledge of disease genes. Consequently, these studies are limited in their discovery of new and unknown disease relationships. We present the first quantitative framework to compare and contrast diseases by an integrated analysis of disease-related mRNA expression data and the human protein interaction network. We identified 4,620 functional modules in the human protein network and provided a quantitative metric to record their responses in 54 diseases leading to 138 significant similarities between diseases. Fourteen of the significant disease correlations also shared common drugs, supporting the hypothesis that similar diseases can be treated by the same drugs, allowing us to make predictions for new uses of existing drugs. Finally, we also identified 59 modules that were dysregulated in at least half of the diseases, representing a common disease-state “signature”. These modules were significantly enriched for genes that are known to be drug targets. Interestingly, drugs known to target these genes/proteins are already known to treat significantly more diseases than drugs targeting other genes/proteins, highlighting the importance of these core modules as prime therapeutic opportunities.
Author Summary
Many human diseases are related to each other through shared causes or even shared pathology. Knowledge of these relationships has long been exploited to treat similar diseases with the same therapies. However, most of the traditional approaches to discover these relationships have depended on subjective measures, such as similarity in symptoms, or incomplete knowledge, such as genes with mutations. Here we present the first approach integrating high-throughput datasets such as mRNA expression and large-scale protein-protein interaction networks to discover human disease relationships in a systematic and quantitative way. We discover 138 significant pathological similarities between 54 human diseases, including lung cancer, schizophrenia, and malaria. We also discovered a set of common pathways and processes within the cell that are dysregulated in at least half of the diseases. We infer that these processes correspond to a common response of the human system to a disease state. Interestingly, we find that many of the proteins in these pathways are already known to be targets of existing drugs. In fact, the drugs corresponding to these proteins are known to treat significantly more diseases than expected by chance, highlighting the importance of these common molecular pathological pathways as prime therapeutic opportunities.
doi:10.1371/journal.pcbi.1000662
PMCID: PMC2816673
PMID: 20140234
Tutt, Andrew | Wang, Alice | Rowland, Charles | Gillett, Cheryl | Lau, Kit | Chew, Karen | Dai, Hongyue | Kwok, Shirley | Ryder, Kenneth | Shu, Henry | Springall, Robert | Cane, Paul | McCallie, Blair | Kam-Morgan, Lauren | Anderson, Steve | Buerger, Horst | Gray, Joe | Bennington, James | Esserman, Laura | Hastie, Trevor | Broder, Samuel | Sninsky, John | Brandt, Burkhard | Waldman, Fred
Background
Given the large number of genes purported to be prognostic for breast cancer, it would be optimal if the genes identified are not confounded by the continuously changing systemic therapies. The aim of this study was to discover and validate a breast cancer prognostic expression signature for distant metastasis in untreated, early stage, lymph node-negative (N-) estrogen receptor-positive (ER+) patients with extensive follow-up times.
Methods
197 genes previously associated with metastasis and ER status were profiled from 142 untreated breast cancer subjects. A "metastasis score" (MS) representing fourteen differentially expressed genes was developed and evaluated for its association with distant-metastasis-free survival (DMFS). Categorical risk classification was established from the continuous MS and further evaluated on an independent set of 279 untreated subjects. A third set of 45 subjects was tested to determine the prognostic performance of the MS in tamoxifen-treated women.
Results
A 14-gene signature was found to be significantly associated (p < 0.05) with distant metastasis in a training set and subsequently in an independent validation set. In the validation set, the hazard ratios (HR) of the high risk compared to low risk groups were 4.02 (95% CI 1.91–8.44) for the endpoint of DMFS and 1.97 (95% CI 1.28 to 3.04) for overall survival after adjustment for age, tumor size and grade. The low and high MS risk groups had 10-year estimates (95% CI) of 96% (90–99%) and 72% (64–78%) respectively, for DMFS and 91% (84–95%) and 68% (61–75%), respectively for overall survival. Performance characteristics of the signature in the two sets were similar. Ki-67 labeling index (LI) was predictive for recurrent disease in the training set, but lost significance after adjustment for the expression signature. In a study of tamoxifen-treated patients, the HR for DMFS in high compared to low risk groups was 3.61 (95% CI 0.86–15.14).
Conclusion
The 14-gene signature is significantly associated with risk of distant metastasis. The signature has a predominance of proliferation genes which have prognostic significance above that of Ki-67 LI and may aid in prioritizing future mechanistic studies and therapeutic interventions.
doi:10.1186/1471-2407-8-339
PMCID: PMC2631011
PMID: 19025599
In an effort to deconvolute global gene-expression profiles, an interaction between some breast cancer cells and stromal fibroblasts was found to induce an interferon response, which may be associated with a greater propensity for tumor progression.
Background
Perturbations in cell-cell interactions are a key feature of cancer. However, little is known about the systematic effects of cell-cell interaction on global gene expression in cancer.
Results
We used an ex vivo model to simulate tumor-stroma interaction by systematically co-cultivating breast cancer cells with stromal fibroblasts and determined associated gene expression changes with cDNA microarrays. In the complex picture of epithelial-mesenchymal interaction effects, a prominent characteristic was an induction of interferon-response genes (IRGs) in a subset of cancer cells. In close proximity to these cancer cells, the fibroblasts secreted type I interferons, which, in turn, induced expression of the IRGs in the tumor cells. Paralleling this model, immunohistochemical analysis of human breast cancer tissues showed that STAT1, the key transcriptional activator of the IRGs, and itself an IRG, was expressed in a subset of the cancers, with a striking pattern of elevated expression in the cancer cells in close proximity to the stroma. In vivo, expression of the IRGs was remarkably coherent, providing a basis for segregation of 295 early-stage breast cancers into two groups. Tumors with high compared to low expression levels of IRGs were associated with significantly shorter overall survival; 59% versus 80% at 10 years (log-rank p = 0.001).
Conclusion
In an effort to deconvolute global gene expression profiles of breast cancer by systematic characterization of heterotypic interaction effects in vitro, we found that an interaction between some breast cancer cells and stromal fibroblasts can induce an interferon-response, and that this response may be associated with a greater propensity for tumor progression.
doi:10.1186/gb-2007-8-9-r191
PMCID: PMC2375029
PMID: 17868458
Smooth muscle is present in a wide variety of anatomical locations, such as blood vessels, various visceral organs, and hair follicles. Contraction of smooth muscle is central to functions as diverse as peristalsis, urination, respiration, and the maintenance of vascular tone. Despite the varied physiological roles of smooth muscle cells (SMCs), we possess only a limited knowledge of the heterogeneity underlying their functional and anatomic specializations. As a step toward understanding the intrinsic differences between SMCs from different anatomical locations, we used DNA microarrays to profile global gene expression patterns in 36 SMC samples from various tissues after propagation under defined conditions in cell culture. Significant variations were found between the cells isolated from blood vessels, bronchi, and visceral organs. Furthermore, pervasive differences were noted within the visceral organ subgroups that appear to reflect the distinct molecular pathways essential for organogenesis as well as those involved in organ-specific contractile and physiological properties. Finally, we sought to understand how this diversity may contribute to SMC-involving pathology. We found that a gene expression signature of the responses of vascular SMCs to serum exposure is associated with a significantly poorer prognosis in human cancers, potentially linking vascular injury response to tumor progression.
Author Summary
It has been estimated that the human body contains approximately 200–400 distinct cell types. These estimates are largely based on the morphological characteristics of cells and have yielded, among many others, the category of smooth muscle cells, which have a distinct appearance and are present in a wide variety of tissues. By using DNA microarrays to interrogate the gene expression of anatomically varying smooth muscle cells, we were able to accurately tease apart many of the distinct cell subtypes that are classically categorized as smooth muscle cells. Remarkably, genes expressed by these newly identified, distinct subtypes corroborate many of their known biological properties and give clues about their susceptibility to specific disease states, retained developmental programs, and potential druggable targets. Additionally, from a smooth muscle cell model of vascular injury, we were able to extract a gene expression signature that provides prognostic information for human breast cancers. Of particular interest for modeling tumor progression was the finding that this gene expression signature was associated with tumor hypoxia. This study adds much to our ever-growing depth of understanding of cellular diversity and the contributions of this diversity to normal physiology and disease.
doi:10.1371/journal.pgen.0030164
PMCID: PMC1994710
PMID: 17907811
Chi, Jen-Tsan | Wang, Zhen | Nuyten, Dimitry S. A | Rodriguez, Edwin H | Schaner, Marci E | Salim, Ali | Wang, Yun | Kristensen, Gunnar B | Helland, Åslaug | Børresen-Dale, Anne-Lise | Giaccia, Amato | Longaker, Michael T | Hastie, Trevor | Yang, George P | van de Vijver, Marc J | Brown, Patrick O
Background
Inadequate oxygen (hypoxia) triggers a multifaceted cellular response that has important roles in normal physiology and in many human diseases. A transcription factor, hypoxia-inducible factor (HIF), plays a central role in the hypoxia response; its activity is regulated by the oxygen-dependent degradation of the HIF-1α protein. Despite the ubiquity and importance of hypoxia responses, little is known about the variation in the global transcriptional response to hypoxia among different cell types or how this variation might relate to tissue- and cell-specific diseases.
Methods and Findings
We analyzed the temporal changes in global transcript levels in response to hypoxia in primary renal proximal tubule epithelial cells, breast epithelial cells, smooth muscle cells, and endothelial cells with DNA microarrays. The extent of the transcriptional response to hypoxia was greatest in the renal tubule cells. This heightened response was associated with a uniquely high level of HIF-1α RNA in renal cells, and it could be diminished by reducing HIF-1α expression via RNA interference. A gene-expression signature of the hypoxia response, derived from our studies of cultured mammary and renal tubular epithelial cells, showed coordinated variation in several human cancers, and was a strong predictor of clinical outcomes in breast and ovarian cancers. In an analysis of a large, published gene-expression dataset from breast cancers, we found that the prognostic information in the hypoxia signature was virtually independent of that provided by the previously reported wound signature and more predictive of outcomes than any of the clinical parameters in current use.
Conclusions
The transcriptional response to hypoxia varies among human cells. Some of this variation is traceable to variation in expression of the HIF1A gene. A gene-expression signature of the cellular response to hypoxia is associated with a significantly poorer prognosis in breast and ovarian cancer.
doi:10.1371/journal.pmed.0030047
PMCID: PMC1334226
PMID: 16417408