Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure involves modeling each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified through an application of the fused lasso penalty to each feature. Some simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.
Cancer; DNA copy number; False discovery rate; Mutation
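The latent feature model described in the abstract above can be written out roughly as follows. This is a sketch under assumed notation, none of which appears in the abstract itself: $y_{ls}$ is the aCGH measurement at location $l$ in sample $s$, $\beta_{lj}$ is the value of feature $j$ at location $l$, $\theta_{js}$ is the weight of feature $j$ in sample $s$, and $\lambda_1, \lambda_2$ are tuning parameters for the fused lasso penalty.

```latex
% Each sample is a weighted sum of J features plus noise (assumed notation).
\[
  y_{ls} = \sum_{j=1}^{J} \beta_{lj}\,\theta_{js} + \varepsilon_{ls},
  \qquad l = 1,\dots,L,\quad s = 1,\dots,S .
\]
% Fused lasso penalty on each feature: sparsity plus piecewise-constant smoothness.
\[
  \min_{B,\Theta}\;
  \sum_{s=1}^{S}\sum_{l=1}^{L}\Bigl(y_{ls} - \sum_{j=1}^{J}\beta_{lj}\theta_{js}\Bigr)^{2}
  + \sum_{j=1}^{J}\Bigl(\lambda_1\sum_{l=1}^{L}\lvert\beta_{lj}\rvert
  + \lambda_2\sum_{l=2}^{L}\lvert\beta_{lj}-\beta_{l-1,j}\rvert\Bigr)
  \quad\text{subject to}\quad \sum_{s=1}^{S}\theta_{js}^{2}\le 1 \ \text{for each } j .
\]
```

The first penalty term pushes most of each feature to zero (no copy number change), while the second favors contiguous blocks of equal value, which is what allows regions of CNV to be read off the estimated features.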
We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm Soft-Impute iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices: for example, it can obtain a rank-80 approximation of a 10^6 × 10^6 incomplete matrix with 10^5 observed entries in 2.5 hours, and can fit a rank-40 approximation to the full Netflix training set in 6.6 hours. Our methods show very good performance both in training and test error when compared to other competitive state-of-the-art techniques.
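The Soft-Impute iteration can be summarized in a few lines of notation. This is a sketch in the Lagrangian form of the problem, which is equivalent to the bound-constrained form mentioned in the abstract. The symbols are assumptions introduced here: $\Omega$ is the set of observed entries, $P_\Omega$ keeps the observed entries and zeroes the rest, $P_\Omega^{\perp}$ does the opposite, and $\lambda$ is the regularization parameter.

```latex
% Nuclear-norm regularized matrix completion (Lagrangian form).
\[
  \min_{Z}\; \tfrac{1}{2}\,\bigl\lVert P_\Omega(X) - P_\Omega(Z)\bigr\rVert_F^{2}
  + \lambda\,\lVert Z\rVert_{*}
\]
% One Soft-Impute step: fill in the missing entries from the current estimate,
% then soft-threshold the singular values of the filled-in matrix.
\[
  Z^{\text{new}} \leftarrow S_\lambda\bigl(P_\Omega(X) + P_\Omega^{\perp}(Z^{\text{old}})\bigr),
  \qquad
  S_\lambda(W) = U\,\mathrm{diag}\bigl[(d_i-\lambda)_+\bigr]\,V^{T}
  \ \text{ for } W = U\,\mathrm{diag}[d_i]\,V^{T}.
\]
```

Warm starts along a decreasing grid of $\lambda$ values amount to using the converged $Z$ at one grid point as $Z^{\text{old}}$ at the next.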
We consider the problem of estimating sparse graphs by a lasso penalty applied to the inverse covariance matrix. Using a coordinate descent procedure for the lasso, we develop a simple algorithm—the graphical lasso—that is remarkably fast: It solves a 1000-node problem (~500 000 parameters) in at most a minute and is 30–4000 times faster than competing methods. It also provides a conceptual link between the exact problem and the approximation suggested by Meinshausen and Bühlmann (2006). We illustrate the method on some cell-signaling data from proteomics.
Gaussian covariance; Graphical model; L1; Lasso
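For reference, the penalized likelihood problem that the graphical lasso addresses can be stated as below. The notation is assumed for illustration: $S$ is the empirical covariance matrix, $\Theta$ is the inverse covariance (precision) matrix being estimated, and $\rho \ge 0$ is the lasso tuning parameter.

```latex
% L1-penalized Gaussian log-likelihood over positive definite matrices.
\[
  \max_{\Theta \succ 0}\;
  \log\det\Theta \;-\; \operatorname{tr}(S\,\Theta) \;-\; \rho\,\lVert\Theta\rVert_{1},
  \qquad
  \lVert\Theta\rVert_{1} = \sum_{i,j}\lvert\Theta_{ij}\rvert .
\]
```

Zeros in the estimated $\Theta$ correspond to missing edges in the graph, so larger values of $\rho$ give sparser graphs.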
We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.
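As an illustration of cyclical coordinate descent along a regularization path, here is a minimal pure-Python sketch for the linear regression case with an elastic net penalty. It is not the authors' implementation: the function names, the absence of an intercept and of standardization, and the parameterization `lam * (alpha * L1 + (1 - alpha)/2 * L2^2)` are assumptions made for the example.

```python
def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, lam, alpha=1.0, beta=None, n_sweeps=200, tol=1e-10):
    """Cyclical coordinate descent (illustrative, no intercept) for
    (1/2n) * ||y - X beta||^2 + lam * (alpha*||beta||_1 + (1-alpha)/2*||beta||_2^2).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p if beta is None else list(beta)  # copy so warm starts are not mutated
    # Residual r = y - X beta, kept up to date after every coordinate update.
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # (1/n) * x_j . (partial residual with coordinate j removed)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


def elastic_net_path(X, y, lambdas, alpha=1.0):
    """Solve from the largest lambda to the smallest, warm-starting each fit."""
    beta, path = None, []
    for lam in sorted(lambdas, reverse=True):
        beta = elastic_net_cd(X, y, lam, alpha, beta)
        path.append((lam, beta))
    return path
```

Setting `alpha=1.0` gives the lasso and `alpha=0.0` gives ridge regression. The warm start means each fit begins from the solution at the previous, slightly larger penalty, which typically needs only a few sweeps to converge.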
Leiomyosarcoma (LMS) is a soft tissue tumor with a significant degree of morphologic and molecular heterogeneity. We employed integrative molecular profiling to discover and characterize molecular subtypes of LMS. Gene expression profiling was performed on 51 LMS samples. Unsupervised clustering demonstrated 3 reproducible LMS clusters. Array comparative genomic hybridization (aCGH) was performed on 20 LMS samples and demonstrated that the molecular subtypes defined by gene-expression showed distinct genomic changes. Tumors from the “muscle-enriched” cluster showed significantly increased copy number changes (p=0.04). Most muscle-enriched cases showed loss at 16q24 which contains FANCA, known to play an important role in DNA repair, and loss at 1p36 which contains PRDM16, whose loss promotes muscle differentiation. Immunohistochemistry was performed on LMS tissue microarrays (n=377) for five markers with high levels of mRNA in the muscle-enriched cluster (ACTG2, CASQ2, SLMAP, CFL2, MYLK) and demonstrated significantly correlated expression of the 5 proteins (all pairwise p < 0.005). Expression of the 5 markers was associated with improved disease-specific survival (DSS) in a multivariate Cox regression analysis (p < 0.04). In this analysis that combined gene expression profiling, aCGH and immunohistochemistry, we characterized distinct molecular LMS subtypes, provided insight into their pathogenesis, and identified prognostic biomarkers.
sarcoma; leiomyosarcoma; integrative genomics; gene expression profiling; array comparative genomic hybridization; tissue microarrays
We present a penalized matrix decomposition (PMD), a new framework for computing a rank-K approximation for a matrix. We approximate the matrix $X$ as $\hat{X} = \sum_{k=1}^{K} d_k u_k v_k^T$, where $d_k$, $u_k$, and $v_k$ minimize the squared Frobenius norm of $X - \hat{X}$, subject to penalties on $u_k$ and $v_k$. This results in a regularized version of the singular value decomposition. Of particular interest is the use of L1-penalties on $u_k$ and $v_k$, which yields a decomposition of $X$ using sparse vectors. We show that when the PMD is applied using an L1-penalty on $v_k$ but not on $u_k$, a method for sparse principal components results. In fact, this yields an efficient algorithm for the “SCoTLASS” proposal (Jolliffe and others 2003) for obtaining sparse principal components. This method is demonstrated on a publicly available gene expression data set. We also establish connections between the SCoTLASS method for sparse principal component analysis and the method of Zou and others (2006). In addition, we show that when the PMD is applied to a cross-products matrix, it results in a method for penalized canonical correlation analysis (CCA). We apply this penalized CCA method to simulated data and to a genomic data set consisting of gene expression and DNA copy number measurements on the same set of samples.
Canonical correlation analysis; DNA copy number; Integrative genomic analysis; L1; Matrix decomposition; Principal component analysis; Sparse principal component analysis; SVD
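To make the sparse principal components case concrete, the following pure-Python sketch computes a single rank-1 factor with an L1 constraint on v only. It alternates a normalized update of u with a soft-thresholded, normalized update of v, choosing the threshold by bisection so that the L1 norm of v stays within a bound. The function names, the constant starting vector, the fixed iteration counts, and the bound parameter `c_v` (assumed to be at least 1) are illustrative assumptions, not details taken from the abstract.

```python
import math


def soft(a, delta):
    """Soft-thresholding: sign(a) * max(|a| - delta, 0)."""
    return math.copysign(max(abs(a) - delta, 0.0), a)


def l2_normalize(x):
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x] if norm > 0 else list(x)


def l1_constrained_unit(a, c, n_bisect=60):
    """Unit-L2 vector S(a, delta)/||S(a, delta)||_2 with delta >= 0 chosen
    (by bisection) as small as possible such that the L1 norm is at most c."""
    v = l2_normalize(a)
    if sum(abs(x) for x in v) <= c:
        return v  # constraint inactive, no thresholding needed
    lo, hi = 0.0, max(abs(x) for x in a)
    for _ in range(n_bisect):
        mid = (lo + hi) / 2.0
        v = l2_normalize([soft(x, mid) for x in a])
        if sum(abs(x) for x in v) > c:
            lo = mid
        else:
            hi = mid
    return l2_normalize([soft(x, hi) for x in a])


def pmd_rank1(X, c_v, n_iter=100):
    """Rank-1 factor (d, u, v) with ||u||_2 = ||v||_2 = 1 and ||v||_1 <= c_v.
    Assumes X times the constant starting vector is nonzero."""
    n, p = len(X), len(X[0])
    v = l2_normalize([1.0] * p)  # illustrative start; any v with X v != 0 works
    for _ in range(n_iter):
        u = l2_normalize([sum(X[i][j] * v[j] for j in range(p)) for i in range(n)])
        xtu = [sum(X[i][j] * u[i] for i in range(n)) for j in range(p)]
        v = l1_constrained_unit(xtu, c_v)
    d = sum(u[i] * X[i][j] * v[j] for i in range(n) for j in range(p))
    return d, u, v
```

With `c_v` at least the square root of the number of columns, the constraint is never active and the iteration reduces to the ordinary power method for the leading singular vectors. Smaller values of `c_v` force more entries of v to exactly zero.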
Motivation: In ordinary regression, imposition of a lasso penalty makes continuous model selection straightforward. Lasso penalized regression is particularly advantageous when the number of predictors far exceeds the number of observations.
Method: The present article evaluates the performance of lasso penalized logistic regression in case–control disease gene mapping with a large number of single nucleotide polymorphism (SNP) predictors. The strength of the lasso penalty can be tuned to select a predetermined number of the most relevant SNPs and other predictors. For a given value of the tuning constant, the penalized likelihood is quickly maximized by cyclic coordinate ascent. Once the most potent marginal predictors are identified, their two-way and higher order interactions can also be examined by lasso penalized logistic regression.
Results: This strategy is tested on both simulated and real data. Our findings on coeliac disease replicate the previous SNP results and shed light on possible interactions among the SNPs.
Availability: The software discussed is available in Mendel 9.0 at the UCLA Human Genetics web site.
Supplementary information: Supplementary data are available at Bioinformatics online.
Current work in elucidating relationships between diseases has largely been based on pre-existing knowledge of disease genes. Consequently, these studies are limited in their discovery of new and unknown disease relationships. We present the first quantitative framework to compare and contrast diseases by an integrated analysis of disease-related mRNA expression data and the human protein interaction network. We identified 4,620 functional modules in the human protein network and provided a quantitative metric to record their responses in 54 diseases leading to 138 significant similarities between diseases. Fourteen of the significant disease correlations also shared common drugs, supporting the hypothesis that similar diseases can be treated by the same drugs, allowing us to make predictions for new uses of existing drugs. Finally, we also identified 59 modules that were dysregulated in at least half of the diseases, representing a common disease-state “signature”. These modules were significantly enriched for genes that are known to be drug targets. Interestingly, drugs known to target these genes/proteins are already known to treat significantly more diseases than drugs targeting other genes/proteins, highlighting the importance of these core modules as prime therapeutic opportunities.
Many human diseases are related to each other through shared causes or even shared pathology. Knowledge of these relationships has long been exploited to treat similar diseases with the same therapies. However, most of the traditional approaches to discover these relationships have depended on subjective measures, such as similarity in symptoms, or incomplete knowledge, such as genes with mutations. Here we present the first approach integrating high-throughput datasets such as mRNA expression and large-scale protein-protein interaction networks to discover human disease relationships in a systematic and quantitative way. We discover 138 significant pathological similarities between 54 human diseases, including lung cancer, schizophrenia, and malaria. We also discovered a set of common pathways and processes within the cell that are dysregulated in at least half of the diseases. We infer that these processes correspond to a common response of the human system to a disease state. Interestingly, we find that many of the proteins in these pathways are already known to be targets of existing drugs. In fact, the drugs corresponding to these proteins are known to treat significantly more diseases than expected by chance, highlighting the importance of these common molecular pathological pathways as prime therapeutic opportunities.
Given the large number of genes purported to be prognostic for breast cancer, it would be optimal if the genes identified are not confounded by the continuously changing systemic therapies. The aim of this study was to discover and validate a breast cancer prognostic expression signature for distant metastasis in untreated, early stage, lymph node-negative (N-) estrogen receptor-positive (ER+) patients with extensive follow-up times.
197 genes previously associated with metastasis and ER status were profiled from 142 untreated breast cancer subjects. A "metastasis score" (MS) representing fourteen differentially expressed genes was developed and evaluated for its association with distant-metastasis-free survival (DMFS). Categorical risk classification was established from the continuous MS and further evaluated on an independent set of 279 untreated subjects. A third set of 45 subjects was tested to determine the prognostic performance of the MS in tamoxifen-treated women.
A 14-gene signature was found to be significantly associated (p < 0.05) with distant metastasis in a training set and subsequently in an independent validation set. In the validation set, the hazard ratios (HR) of the high risk compared to low risk groups were 4.02 (95% CI 1.91–8.44) for the endpoint of DMFS and 1.97 (95% CI 1.28–3.04) for overall survival after adjustment for age, tumor size and grade. The low and high MS risk groups had 10-year estimates (95% CI) of 96% (90–99%) and 72% (64–78%), respectively, for DMFS and 91% (84–95%) and 68% (61–75%), respectively, for overall survival. Performance characteristics of the signature in the two sets were similar. Ki-67 labeling index (LI) was predictive for recurrent disease in the training set, but lost significance after adjustment for the expression signature. In a study of tamoxifen-treated patients, the HR for DMFS in high compared to low risk groups was 3.61 (95% CI 0.86–15.14).
The 14-gene signature is significantly associated with risk of distant metastasis. The signature has a predominance of proliferation genes which have prognostic significance above that of Ki-67 LI and may aid in prioritizing future mechanistic studies and therapeutic interventions.
In an effort to deconvolute global gene-expression profiles, an interaction between some breast cancer cells and stromal fibroblasts was found to induce an interferon response, which may be associated with a greater propensity for tumor progression.
Perturbations in cell-cell interactions are a key feature of cancer. However, little is known about the systematic effects of cell-cell interaction on global gene expression in cancer.
We used an ex vivo model to simulate tumor-stroma interaction by systematically co-cultivating breast cancer cells with stromal fibroblasts and determined associated gene expression changes with cDNA microarrays. In the complex picture of epithelial-mesenchymal interaction effects, a prominent characteristic was an induction of interferon-response genes (IRGs) in a subset of cancer cells. In close proximity to these cancer cells, the fibroblasts secreted type I interferons, which, in turn, induced expression of the IRGs in the tumor cells. Paralleling this model, immunohistochemical analysis of human breast cancer tissues showed that STAT1, the key transcriptional activator of the IRGs, and itself an IRG, was expressed in a subset of the cancers, with a striking pattern of elevated expression in the cancer cells in close proximity to the stroma. In vivo, expression of the IRGs was remarkably coherent, providing a basis for segregation of 295 early-stage breast cancers into two groups. Tumors with high compared to low expression levels of IRGs were associated with significantly shorter overall survival; 59% versus 80% at 10 years (log-rank p = 0.001).
In an effort to deconvolute global gene expression profiles of breast cancer by systematic characterization of heterotypic interaction effects in vitro, we found that an interaction between some breast cancer cells and stromal fibroblasts can induce an interferon-response, and that this response may be associated with a greater propensity for tumor progression.
Smooth muscle is present in a wide variety of anatomical locations, such as blood vessels, various visceral organs, and hair follicles. Contraction of smooth muscle is central to functions as diverse as peristalsis, urination, respiration, and the maintenance of vascular tone. Despite the varied physiological roles of smooth muscle cells (SMCs), we possess only a limited knowledge of the heterogeneity underlying their functional and anatomic specializations. As a step toward understanding the intrinsic differences between SMCs from different anatomical locations, we used DNA microarrays to profile global gene expression patterns in 36 SMC samples from various tissues after propagation under defined conditions in cell culture. Significant variations were found between the cells isolated from blood vessels, bronchi, and visceral organs. Furthermore, pervasive differences were noted within the visceral organ subgroups that appear to reflect the distinct molecular pathways essential for organogenesis as well as those involved in organ-specific contractile and physiological properties. Finally, we sought to understand how this diversity may contribute to SMC-involving pathology. We found that a gene expression signature of the responses of vascular SMCs to serum exposure is associated with a significantly poorer prognosis in human cancers, potentially linking vascular injury response to tumor progression.
It has been estimated that the human body contains approximately 200–400 distinct cell types. These estimates are largely based on the morphological characteristics of cells and have yielded, among many others, the category of smooth muscle cells, which have a distinct appearance and are present in a wide variety of tissues. By using DNA microarrays to interrogate the gene expression of anatomically varying smooth muscle cells, we were able to accurately tease apart many of the distinct cell subtypes that are classically categorized as smooth muscle cells. Remarkably, genes expressed by these newly identified, distinct subtypes corroborate many of their known biological properties and give clues about their susceptibility to specific disease states, retained developmental programs, and potential druggable targets. Additionally, from a smooth muscle cell model of vascular injury, we were able to extract a gene expression signature that provides prognostic information for human breast cancers. Of particular interest for modeling tumor progression was the finding that this gene expression signature was associated with tumor hypoxia. This study adds much to our ever-growing depth of understanding of cellular diversity and the contributions of this diversity to normal physiology and disease.
Inadequate oxygen (hypoxia) triggers a multifaceted cellular response that has important roles in normal physiology and in many human diseases. A transcription factor, hypoxia-inducible factor (HIF), plays a central role in the hypoxia response; its activity is regulated by the oxygen-dependent degradation of the HIF-1α protein. Despite the ubiquity and importance of hypoxia responses, little is known about the variation in the global transcriptional response to hypoxia among different cell types or how this variation might relate to tissue- and cell-specific diseases.
Methods and Findings
We analyzed the temporal changes in global transcript levels in response to hypoxia in primary renal proximal tubule epithelial cells, breast epithelial cells, smooth muscle cells, and endothelial cells with DNA microarrays. The extent of the transcriptional response to hypoxia was greatest in the renal tubule cells. This heightened response was associated with a uniquely high level of HIF-1α RNA in renal cells, and it could be diminished by reducing HIF-1α expression via RNA interference. A gene-expression signature of the hypoxia response, derived from our studies of cultured mammary and renal tubular epithelial cells, showed coordinated variation in several human cancers, and was a strong predictor of clinical outcomes in breast and ovarian cancers. In an analysis of a large, published gene-expression dataset from breast cancers, we found that the prognostic information in the hypoxia signature was virtually independent of that provided by the previously reported wound signature and more predictive of outcomes than any of the clinical parameters in current use.
The transcriptional response to hypoxia varies among human cells. Some of this variation is traceable to variation in expression of the HIF1A gene. A gene-expression signature of the cellular response to hypoxia is associated with a significantly poorer prognosis in breast and ovarian cancer.
The transcriptional response to hypoxia varies between cell types. A gene-expression signature of the cellular response to hypoxia is associated with a significantly poorer prognosis in breast and ovarian cancer.
We used DNA microarrays to characterize the global gene expression patterns in surface epithelial cancers of the ovary. We identified groups of genes that distinguished the clear cell subtype from other ovarian carcinomas, grade I and II from grade III serous papillary carcinomas, and ovarian from breast carcinomas. Six clear cell carcinomas were distinguished from 36 other ovarian carcinomas (predominantly serous papillary) based on their gene expression patterns. The differences may yield insights into the worse prognosis and therapeutic resistance associated with clear cell carcinomas. A comparison of the gene expression patterns in the ovarian cancers to published data of gene expression in breast cancers revealed a large number of differentially expressed genes. We identified a group of 62 genes that correctly classified all 125 breast and ovarian cancer specimens. Among the best discriminators more highly expressed in the ovarian carcinomas were PAX8 (paired box gene 8), mesothelin, and ephrin-B1 (EFNB1). Although estrogen receptor was expressed in both the ovarian and breast cancers, genes that are coregulated with the estrogen receptor in breast cancers, including GATA-3, LIV-1, and X-box binding protein 1, did not show a similar pattern of coexpression in the ovarian cancers.
T7-based linear amplification of RNA is used to obtain sufficient antisense RNA for microarray expression profiling. We optimized and systematically evaluated the fidelity and reproducibility of different amplification protocols using total RNA obtained from primary human breast carcinomas and high-density cDNA microarrays.
Using an optimized protocol, the average correlation coefficient of gene expression of 11,123 cDNA clones between amplified and unamplified samples is 0.82 (0.85 when a virtual array was created using repeatedly amplified samples to minimize experimental variation). Less than 4% of genes show changes in expression level by 2-fold or greater after amplification compared to unamplified samples. Most changes due to amplification are not systematic, either within one tumor sample or between different tumors. Amplification appears to dampen the variation of gene expression for some genes when compared to unamplified poly(A)+ RNA. The reproducibility between repeatedly amplified samples is 0.97 when performed on the same day, but drops to 0.90 when performed weeks apart. The fidelity and reproducibility of amplification are not affected by decreasing the amount of input total RNA in the 0.3–3 microgram range. Adding template-switching primer, DNA ligase, or column purification of double-stranded cDNA does not improve the fidelity of amplification. The correlation coefficient between amplified and unamplified samples is higher when total RNA is used as template for both experimental and reference RNA amplification.
T7-based linear amplification reproducibly generates amplified RNA that closely approximates the original sample for gene expression profiling using cDNA microarrays.
We propose a new method for supervised learning from gene expression data. We call it 'tree harvesting'. This technique starts with a hierarchical clustering of genes, then models the outcome variable as a sum of the average expression profiles of chosen clusters and their products. It can be applied to many different kinds of outcome measures such as censored survival times, or a response falling in two or more classes (for example, cancer classes). The method can discover genes that have strong effects on their own, and genes that interact with other genes.
We illustrate the method on data from a lymphoma study, and on a dataset containing samples from eight different cancers. It identified some potentially interesting gene clusters. In simulation studies we found that the procedure may require a large number of experimental samples to successfully discover interactions.
Tree harvesting is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worthy of further investigation.
Large gene expression studies, such as those conducted using DNA arrays, often provide millions of different pieces of data. To address the problem of analyzing such data, we describe a statistical method, which we have called 'gene shaving'. The method identifies subsets of genes with coherent expression patterns and large variation across conditions. Gene shaving differs from hierarchical clustering and other widely used methods for analyzing gene expression studies in that genes may belong to more than one cluster, and the clustering may be supervised by an outcome measure. The technique can be 'unsupervised', that is, the genes and samples are treated as unlabeled, or partially or fully supervised by using known properties of the genes or samples to assist in finding meaningful groupings.
We illustrate the use of the gene shaving method to analyze gene expression measurements made on samples from patients with diffuse large B-cell lymphoma. The method identifies a small cluster of genes whose expression is highly predictive of survival.
The gene shaving method is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worth further investigation.