Chemokine receptor 1 (CCR1) is a G protein–coupled receptor that binds to members of the C-C chemokine family. Recently, CCL3 (MIP-1α), a high-affinity CCR1 ligand, was identified as part of a model that independently predicts survival in patients with diffuse large B-cell lymphoma (DLBCL). However, the role of chemokine signaling in the pathogenesis of human lymphomas is unclear. In normal human hematopoietic tissues, we found CCR1 expression in intraepithelial B cells of human tonsil and granulocytic/monocytic cells in the bone marrow. Immunohistochemical analysis of 944 cases of hematolymphoid neoplasia identified CCR1 expression in a subset of B- and T-cell lymphomas, plasma cell myeloma, acute myeloid leukemia, and classical Hodgkin lymphoma. CCR1 expression correlated with the non–germinal center subtype of DLBCL but did not predict overall survival in follicular lymphoma. These data suggest that CCR1 may be useful for lymphoma classification and support a role for chemokine signaling in the pathogenesis of hematolymphoid neoplasia.
Chemokine; CCR1; Lymphoma; Leukemia
In the sparse linear regression setting, we consider testing the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path. We propose a simple test statistic based on lasso fitted values, called the covariance test statistic, and show that when the true model is linear, this statistic has an Exp(1) asymptotic distribution under the null hypothesis (the null being that all truly active variables are contained in the current lasso model). Our proof of this result for the special case of the first predictor to enter the model (i.e., testing for a single significant predictor variable against the global null) requires only weak assumptions on the predictor matrix X. On the other hand, our proof for a general step in the lasso path places further technical assumptions on X and the generative model, but still allows for the important high-dimensional case p > n, and does not necessarily require that the current lasso model achieves perfect recovery of the truly active variables.
Of course, for testing the significance of an additional variable between two nested linear models, one typically uses the chi-squared test, comparing the drop in residual sum of squares (RSS) to a $\chi^2_1$ distribution. But when this additional variable is not fixed, and has been chosen adaptively or greedily, this test is no longer appropriate: adaptivity makes the drop in RSS stochastically much larger than $\chi^2_1$ under the null hypothesis. Our analysis explicitly accounts for adaptivity, as it must, since the lasso builds an adaptive sequence of linear models as the tuning parameter λ decreases. In this analysis, shrinkage plays a key role: though additional variables are chosen adaptively, the coefficients of lasso active variables are shrunken due to the $\ell_1$ penalty. Therefore, the test statistic (which is based on lasso fitted values) is in a sense balanced by these two opposing properties, adaptivity and shrinkage, and its null distribution is tractable and asymptotically Exp(1).
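The effect of adaptivity on the chi-squared test can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not part of the abstract: the global null holds, the p predictors are orthonormal, and the noise variance is 1, so that the drop in RSS from adding predictor j is the square of a standard normal variable. The function names and the choice p = 10 are hypothetical. It does not implement the covariance test statistic itself.

```python
import random

CHI2_1_CRIT_95 = 3.841  # 95th percentile of the chi-squared distribution with 1 df


def rss_drops(p, n_sim, seed=0):
    """Simulate RSS drops under the global null (orthonormal predictors, unit variance).

    Returns the drop for a predictor fixed in advance and the drop for the
    predictor chosen greedily as the one with the largest drop.
    """
    rng = random.Random(seed)
    fixed, adaptive = [], []
    for _ in range(n_sim):
        drops = [rng.gauss(0.0, 1.0) ** 2 for _ in range(p)]
        fixed.append(drops[0])       # predictor chosen before seeing the data
        adaptive.append(max(drops))  # predictor chosen adaptively
    return fixed, adaptive


def rejection_rate(drops, crit=CHI2_1_CRIT_95):
    """Fraction of simulations in which the chi-squared test rejects at the 5% level."""
    return sum(d > crit for d in drops) / len(drops)


fixed, adaptive = rss_drops(p=10, n_sim=20000)
print("fixed predictor:   ", rejection_rate(fixed))
print("adaptive predictor:", rejection_rate(adaptive))
```

Under these assumptions the fixed-predictor test rejects about 5% of the time, while the adaptive choice rejects at a rate near 1 − 0.95^10, or about 40%, which is the inflation the abstract describes.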
Lasso; least angle regression; p-value; significance test
We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui and his colleagues have proposed ‘SAFE’ rules, based on univariate inner products between each predictor and the outcome, which guarantee that a coefficient will be 0 in the solution vector. This provides a reduction in the number of variables that need to be entered into the optimization. We propose strong rules that are very simple and yet screen out far more predictors than the SAFE rules. This great practical improvement comes at a price: the strong rules are not foolproof and can mistakenly discard active predictors, i.e. predictors that have non-zero coefficients in the solution. We therefore combine them with simple checks of the Karush–Kuhn–Tucker conditions to ensure that the exact solution to the convex problem is delivered. Of course, any (approximate) screening method can be combined with the Karush–Kuhn–Tucker conditions to ensure the exact solution; the strength of the strong rules lies in the fact that, in practice, they discard a very large number of the inactive predictors and almost never commit mistakes. We also derive conditions under which they are foolproof. Strong rules provide substantial savings in computational time for a variety of statistical optimization problems.
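The screen-then-check idea can be sketched as follows. The abstract does not state the rule itself; this sketch assumes the basic strong rule for the objective 0.5·||y − Xβ||² + λ||β||₁, which discards predictor j when |x_jᵀy| < 2λ − λ_max, with λ_max = max_j |x_jᵀy|. The coordinate descent solver, the function names, and the simulated data are illustrative choices, not the authors' implementation.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    resid = np.array(y, dtype=float)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            resid += X[:, j] * beta[j]  # remove predictor j from the fit
            beta[j] = soft_threshold(X[:, j] @ resid, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]  # add the updated contribution back
    return beta


def lasso_with_strong_rule(X, y, lam, tol=1e-8):
    """Screen with the (assumed) basic strong rule, then repair any KKT violations."""
    p = X.shape[1]
    score = np.abs(X.T @ y)
    keep = score >= 2 * lam - score.max()  # strong rule: discard the rest
    while True:
        beta = np.zeros(p)
        if keep.any():
            beta[keep] = lasso_cd(X[:, keep], y, lam)
        grad = np.abs(X.T @ (y - X @ beta))
        violated = (~keep) & (grad > lam + tol)  # KKT check on discarded predictors
        if not violated.any():
            return beta, keep
        keep |= violated  # re-admit violators and solve again


rng = np.random.default_rng(0)
X = rng.standard_normal((100, 30))
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.standard_normal(100)
lam = 0.8 * np.abs(X.T @ y).max()

beta_screened, kept = lasso_with_strong_rule(X, y, lam)
beta_full = lasso_cd(X, y, lam)
print("predictors kept after screening:", int(kept.sum()), "of", X.shape[1])
print("max coefficient difference:", float(np.abs(beta_screened - beta_full).max()))
```

The loop re-admits any discarded predictor whose gradient exceeds λ, so the returned solution satisfies the Karush–Kuhn–Tucker conditions for the full problem even when the screening rule makes a mistake.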
The earliest recognizable stages of breast neoplasia are lesions that represent a heterogeneous collection of epithelial proliferations currently classified based on morphology. Their role in the development of breast cancer is not well understood but insight into the critical events at this early stage will improve efforts in breast cancer detection and prevention. These microscopic lesions are technically difficult to study so very little is known about their molecular alterations.
To characterize the transcriptional changes of early breast neoplasia, we sequenced 3′- end enriched RNAseq libraries from formalin-fixed paraffin-embedded tissue of early neoplasia samples and matched normal breast and carcinoma samples from 25 patients. We find that gene expression patterns within early neoplasias are distinct from both normal and breast cancer patterns and identify a pattern of pro-oncogenic changes, including elevated transcription of ERBB2, FOXA1, and GATA3 at this early stage. We validate these findings on a second independent gene expression profile data set generated by whole transcriptome sequencing. Measurements of protein expression by immunohistochemistry on an independent set of early neoplasias confirms that ER pathway regulators FOXA1 and GATA3, as well as ER itself, are consistently upregulated at this early stage. The early neoplasia samples also demonstrate coordinated changes in long non-coding RNA expression and microenvironment stromal gene expression patterns.
This study is the first examination of global gene expression in early breast neoplasia, and the genes identified here represent candidate participants in the earliest molecular events in the development of breast cancer.
Classifying patients into different risk groups based on their genomic measurements can help clinicians design appropriate clinical treatment plans. To produce such a classification, gene expression data were collected on a cohort of burn patients, who were monitored across multiple time points. This led us to develop a new classification method using time-course gene expression. Our results showed that making good use of the time-course information in gene expression improved classification performance compared with using gene expression from individual time points only. Our method is implemented in an R package, time-course prediction analysis using microarray.
Classification; Gene expression; Longitudinal; Time-course
It has been claimed that most research findings are false, and it is known that large-scale studies involving omics data are especially prone to errors in design, execution, and analysis. The situation is alarming because taxpayer dollars fund a substantial amount of biomedical research, and because the publication of a research article that is later determined to be flawed can erode the credibility of an entire field, resulting in a severe and negative impact for years to come. Here, we urge the development of an online, open-access, postpublication, peer review system that will increase the accountability of scientists for the quality of their research and the ability of readers to distinguish good from sloppy science.
peer review; omics; high-dimensional; transparency; reproducible research
RNA structural transitions are important in the function and regulation of RNAs. Here, we reveal a layer of transcriptome organization in the form of RNA folding energies. By probing yeast RNA structures at different temperatures, we obtained relative melting temperatures (Tm) for RNA structures in over 4000 transcripts. Specific signatures of RNA Tm demarcated the polarity of mRNA open reading frames, and highlighted numerous candidate regulatory RNA motifs in 3′ untranslated regions. RNA Tm distinguished non-coding from coding RNAs and identified mRNAs with distinct cellular functions. We identified thousands of putative RNA thermometers, and their presence is predictive of the pattern of RNA decay in vivo during heat shock. The exosome complex recognizes unpaired bases during heat shock to degrade these RNAs, coupling intrinsic structural stabilities to gene regulation. Thus, genome-wide structural dynamics of RNA can parse functional elements of the transcriptome and reveal diverse biological insights.
We describe cell type–specific significance analysis of microarrays (csSAM) for analyzing differential gene expression for each cell type in a biological sample from microarray data and relative cell-type frequencies. First, we validated csSAM with predesigned mixtures and then applied it to whole-blood gene expression datasets from stable post-transplant kidney transplant recipients and those experiencing acute transplant rejection, which revealed hundreds of differentially expressed genes that were otherwise undetectable.
We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.
Differential expression; FDR; Overdispersion; Poisson log-linear model; RNA-Seq; Score statistic
We suggest a method for estimating a covariance matrix on the basis of a sample of vectors drawn from a multivariate normal distribution. In particular, we penalize the likelihood with a lasso penalty on the entries of the covariance matrix. This penalty plays two important roles: it reduces the effective number of parameters, which is important even when the dimension of the vectors is smaller than the sample size since the number of parameters grows quadratically in the number of variables, and it produces an estimate which is sparse. In contrast to sparse inverse covariance estimation, our method’s close relative, the sparsity attained here is in the covariance matrix itself rather than in the inverse matrix. Zeros in the covariance matrix correspond to marginal independencies; thus, our method performs model selection while providing a positive definite estimate of the covariance. The proposed penalized maximum likelihood problem is not convex, so we use a majorize-minimize approach in which we iteratively solve convex approximations to the original nonconvex problem. We discuss tuning parameter selection and demonstrate on a flow-cytometry dataset how our method produces an interpretable graphical display of the relationship between variables. We perform simulations that suggest that simple elementwise thresholding of the empirical covariance matrix is competitive with our method for identifying the sparsity structure. Additionally, we show how our method can be used to solve a previously studied special case in which a desired sparsity pattern is prespecified.
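The elementwise-thresholding baseline mentioned above is simple enough to sketch. This is a minimal illustration of that baseline only, not of the penalized likelihood method; the hard-threshold form, the function name, the threshold value, and the simulated data are assumptions made for the example. Unlike the proposed estimator, a thresholded matrix is not guaranteed to be positive definite.

```python
import numpy as np


def threshold_covariance(samples, t):
    """Zero out off-diagonal entries of the empirical covariance smaller than t in magnitude."""
    S = np.cov(samples, rowvar=False)  # rows are observations, columns are variables
    off_diag = ~np.eye(S.shape[0], dtype=bool)
    S_thr = S.copy()
    S_thr[off_diag & (np.abs(S) < t)] = 0.0
    return S_thr


rng = np.random.default_rng(0)
x0 = rng.standard_normal(2000)
x1 = x0 + 0.5 * rng.standard_normal(2000)  # correlated with x0
x2 = rng.standard_normal(2000)             # independent of x0 and x1
samples = np.column_stack([x0, x1, x2])

S_thr = threshold_covariance(samples, t=0.2)
print(np.round(S_thr, 2))
```

In this toy example the zeros in the thresholded matrix mark the pairs involving the third variable, matching the marginal independencies built into the simulation.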
Concave-convex procedure; Covariance graph; Covariance matrix; Generalized gradient descent; Lasso; Majorization-minimization; Regularization; Sparsity
We consider the supervised classification setting, in which the data consist of p features measured on n observations, each of which belongs to one of K classes. Linear discriminant analysis (LDA) is a classical method for this problem. However, in the high-dimensional setting where p ≫ n, LDA is not appropriate for two reasons. First, the standard estimate for the within-class covariance matrix is singular, and so the usual discriminant rule cannot be applied. Second, when p is large, it is difficult to interpret the classification rule obtained from LDA, since it involves all p features. We propose penalized LDA, a general approach for penalizing the discriminant vectors in Fisher’s discriminant problem in a way that leads to greater interpretability. The discriminant problem is not convex, so we use a minorization-maximization approach in order to efficiently optimize it when convex penalties are applied to the discriminant vectors. In particular, we consider the use of L1 and fused lasso penalties. Our proposal is equivalent to recasting Fisher’s discriminant problem as a biconvex problem. We evaluate the performances of the resulting methods on a simulation study, and on three gene expression data sets. We also survey past methods for extending LDA to the high-dimensional setting, and explore their relationships with our proposal.
classification; feature selection; high dimensional; lasso; linear discriminant analysis; supervised learning
Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure involves modeling each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified through an application of the fused lasso penalty to each feature. Some simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.
Cancer; DNA copy number; False discovery rate; Mutation
Molecular characterization of tumors has been critical for identifying important genes in cancer biology and for improving tumor classification and diagnosis. Long non-coding RNAs, as a new, relatively unstudied class of transcripts, provide a rich opportunity to identify both functional drivers and cancer-type-specific biomarkers. However, despite the potential importance of long non-coding RNAs to the cancer field, no comprehensive survey of long non-coding RNA expression across various cancers has been reported.
We performed a sequencing-based transcriptional survey of both known long non-coding RNAs and novel intergenic transcripts across a panel of 64 archival tumor samples comprising 17 diagnostic subtypes of adenocarcinomas, squamous cell carcinomas and sarcomas. We identified hundreds of transcripts from among the known 1,065 long non-coding RNAs surveyed that showed variability in transcript levels between the tumor types and are therefore potential biomarker candidates. We discovered 1,071 novel intergenic transcribed regions and demonstrate that these show similar patterns of variability between tumor types. We found that many of these differentially expressed cancer transcripts are also expressed in normal tissues. One such novel transcript specifically expressed in breast tissue was further evaluated using RNA in situ hybridization on a panel of breast tumors. It was shown to correlate with low tumor grade and estrogen receptor expression, thereby representing a potentially important new breast cancer biomarker.
This study provides the first large survey of long non-coding RNA expression within a panel of solid cancers and also identifies a number of novel transcribed regions differentially expressed across distinct cancer types that represent candidate biomarkers for future research.
3SEQ; FFPE; human cancer; intergenic transcripts; lncRNAs; novel transcripts; solid tumors; transcriptional profiling
We propose a hierarchical Bayesian model for analyzing gene expression data to identify pathways differentiating between two biological states (e.g., cancer vs. non-cancer and mutant vs. normal). Finding significant pathways can improve our understanding of biological processes. When the biological process of interest is related to a specific disease, eliciting a better understanding of the underlying pathways can lead to designing a more effective treatment. We apply our method to data obtained by interrogating the mutational status of p53 in 50 cancer cell lines (33 mutated and 17 normal). We identify several significant pathways with strong biological connections. We show that our approach provides a natural framework for incorporating prior biological information, and it has the best overall performance in terms of correctly identifying significant pathways compared to several alternative methods.
Biological pathways; Hierarchical Bayesian models; Mixture priors
Rheumatoid arthritis (RA) is a prototypical autoimmune arthritis affecting nearly 1% of the world population and is a significant cause of worldwide disability. Though prior studies have demonstrated the appearance of RA-related autoantibodies years before the onset of clinical RA, the pattern of immunologic events preceding the development of RA remains unclear. To characterize the evolution of the autoantibody response in the preclinical phase of RA, we used a novel multiplex autoantigen array to evaluate development of the anti-citrullinated protein antibodies (ACPA) and to determine if epitope spread correlates with rise in serum cytokines and imminent onset of clinical RA. To do so, we utilized a cohort of 81 patients with clinical RA for whom stored serum was available from 1–12 years prior to disease onset. We evaluated the accumulation of ACPA subtypes over time and correlated this accumulation with elevations in serum cytokines. We then used logistic regression to identify a profile of biomarkers which predicts the imminent onset of clinical RA (defined as within 2 years of testing). We observed a time-dependent expansion of ACPA specificity with the number of ACPA subtypes. At the earliest timepoints, we found autoantibodies targeting several innate immune ligands including citrullinated histones, fibrinogen, and biglycan, thus providing insights into the earliest autoantigen targets and potential mechanisms underlying the onset and development of autoimmunity in RA. Additionally, expansion of the ACPA response strongly predicted elevations in many inflammatory cytokines including TNF-α, IL-6, IL-12p70, and IFN-γ. Thus, we observe that the preclinical phase of RA is characterized by an accumulation of multiple autoantibody specificities reflecting the process of epitope spread. 
Epitope expansion is closely correlated with the appearance of preclinical inflammation, and we identify a biomarker profile including autoantibodies and cytokines which predicts the imminent onset of clinical arthritis.
We use the term “index predictor” to denote a score that consists of K binary rules such as “age > 60” or “blood pressure > 120 mm Hg.” The index predictor is the sum of these binary scores, yielding a value from 0 to K. Such indices are often used in clinical studies to stratify population risk: They are usually derived from subject area considerations. In this paper, we propose a fast data-driven procedure for automatically constructing such indices for linear, logistic, and Cox regression models. We also extend the procedure to create indices for detecting treatment–marker interactions. The methods are illustrated on a study with protein biomarkers as well as a large microarray gene expression study.
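The definition of an index predictor can be made concrete with a short example. It uses the two rules quoted above with K = 2; the patient records and the function name are hypothetical, and the sketch shows only how an index is evaluated, not the data-driven procedure for constructing one.

```python
def index_predictor(subject, rules):
    """Sum of K binary rules, giving a score from 0 to K."""
    return sum(1 for rule in rules if rule(subject))


# The two example rules from the text (K = 2).
rules = [
    lambda s: s["age"] > 60,
    lambda s: s["blood_pressure"] > 120,  # mm Hg
]

# Hypothetical patients, for illustration only.
patients = [
    {"age": 72, "blood_pressure": 135},
    {"age": 45, "blood_pressure": 128},
    {"age": 38, "blood_pressure": 110},
]

scores = [index_predictor(p, rules) for p in patients]
print(scores)  # [2, 1, 0]
```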
Degree of freedom; Index predictor; International prognostic index
Characterizing dynamic gene expression patterns and predicting patient outcome is already important and will be of greater interest in the future with large-scale clinical investigation of microarrays. However, no method has yet been developed for predicting patient outcome from longitudinal gene expression, where the gene expression of patients is monitored across time. Here, we propose a novel prediction approach for patient survival time that makes use of the time-course structure of gene expression. This method is applied to a burn study. The genes involved in the final predictors are enriched in the inflammatory response and immune system related pathways. Moreover, our method is consistently better than prediction methods using individual time point gene expression or simply pooling gene expression from each time point.
prediction; time course; gene expression; survival
Combining tumor antigens with an immunostimulant can induce the immune system to specifically eliminate cancer cells. Generally, this combination is accomplished in an ex vivo, customized manner. In a preclinical lymphoma model, intratumoral injection of a Toll-like receptor 9 (TLR9) agonist induced systemic antitumor immunity and cured large, disseminated tumors.
Patients and Methods
We treated 15 patients with low-grade B-cell lymphoma using low-dose radiotherapy to a single tumor site and—at that same site—injected the C-G enriched, synthetic oligodeoxynucleotide (also referred to as CpG) TLR9 agonist PF-3512676. Clinical responses were assessed at distant, untreated tumor sites. Immune responses were evaluated by measuring T-cell activation after in vitro restimulation with autologous tumor cells.
This in situ vaccination maneuver was well-tolerated with only grade 1 to 2 local or systemic reactions and no treatment-limiting adverse events. One patient had a complete clinical response, three others had partial responses, and two patients had stable but continually regressing disease for periods significantly longer than that achieved with prior therapies. Vaccination induced tumor-reactive memory CD8 T cells. Some patients' tumors were able to induce a suppressive, regulatory phenotype in autologous T cells in vitro; these patients tended to have a shorter time to disease progression. One clinically responding patient received a second course of vaccination after relapse resulting in a second, more rapid clinical response.
In situ tumor vaccination with a TLR9 agonist induces systemic antilymphoma clinical responses. This maneuver is clinically feasible and does not require the production of a customized vaccine product.
We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudo-likelihood criterion so that each additional iteration moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate.
Markov networks; logistic regression; L1 penalty; model selection; Binary variables
We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm Soft-Impute iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices: for example it can obtain a rank-80 approximation of a $10^6 \times 10^6$ incomplete matrix with $10^5$ observed entries in 2.5 hours, and can fit a rank 40 approximation to the full Netflix training set in 6.6 hours. Our methods show very good performance both in training and test error when compared to other competitive state-of-the-art techniques.
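The iteration described above, filling in missing entries from a soft-thresholded SVD, can be sketched in a few lines. This is a minimal dense version for small matrices: it calls a full SVD at every step rather than the structured low-rank SVD the abstract refers to, and it omits warm starts over a grid of regularization values. The function name, stopping rule, and simulated data are assumptions for illustration.

```python
import numpy as np


def soft_impute(X, observed, lam, n_iter=500, tol=1e-6):
    """Iterate Z <- soft-thresholded SVD of (X on observed entries, Z elsewhere).

    X        : 2-D array; values at unobserved positions are ignored
    observed : boolean mask of the same shape, True where X is observed
    lam      : amount subtracted from each singular value
    """
    Z = np.zeros(X.shape, dtype=float)
    for _ in range(n_iter):
        filled = np.where(observed, X, Z)  # replace missing entries by the current fit
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)       # soft-threshold the singular values
        Z_new = (U * s) @ Vt
        if np.linalg.norm(Z_new - Z) <= tol * max(np.linalg.norm(Z), 1.0):
            return Z_new
        Z = Z_new
    return Z


rng = np.random.default_rng(0)
M = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))  # rank-2 matrix
observed = rng.random(M.shape) < 0.7                             # about 70% observed

Z = soft_impute(M, observed, lam=0.5)
rel_err = np.linalg.norm((Z - M)[~observed]) / np.linalg.norm(M[~observed])
print("relative error on unobserved entries:", float(rel_err))
```

Larger values of `lam` shrink more singular values to zero and so give lower-rank fits; running the function over a decreasing grid of `lam` values, each started from the previous solution, would correspond to the regularization path described in the text.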
CD81 is a tetraspanin cell surface protein that regulates CD19 expression in B lymphocytes and enables hepatitis C virus infection of human cells. Immunohistologic analysis in normal hematopoietic tissue showed strong staining for CD81 in normal germinal center B cells, a cell type in which its increased expression has not been previously recognized. High-dimensional flow cytometry analysis of normal hematopoietic tissue confirmed that among B and T cell subsets, germinal center B cells showed the highest level of CD81 expression. In over 800 neoplastic tissue samples, its expression was also found in a majority of non-Hodgkin lymphomas. Staining for CD81 was rarely seen in multiple myeloma, Hodgkin lymphoma, or myeloid leukemia. In hierarchical cluster analysis of diffuse large B cell lymphoma, staining for CD81 was most similar to other germinal center B cell-associated markers, particularly LMO2. By flow cytometry, CD81 was expressed in diffuse large B cell lymphoma cells independent of the presence or absence of CD10, another germinal center B cell marker. The detection of CD81 in routine biopsy samples and its differential expression in lymphoma subtypes, particularly diffuse large B cell lymphoma, warrants further study to assess CD81 expression and its role in the risk stratification of diffuse large B cell lymphoma patients.
CD81; lymphoma; tissue microarray
We consider the problem of estimating sparse graphs by a lasso penalty applied to the inverse covariance matrix. Using a coordinate descent procedure for the lasso, we develop a simple algorithm—the graphical lasso—that is remarkably fast: It solves a 1000-node problem (~500 000 parameters) in at most a minute and is 30–4000 times faster than competing methods. It also provides a conceptual link between the exact problem and the approximation suggested by Meinshausen and Bühlmann (2006). We illustrate the method on some cell-signaling data from proteomics.
Gaussian covariance; Graphical model; L1; Lasso
Diffuse large B cell lymphoma (DLBCL) is clinically and biologically heterogeneous. In most cases of DLBCL, lymphoma cells coexpress vascular endothelial growth factor (VEGF) and its receptors VEGFR1 and VEGFR2, suggesting autocrine in addition to angiogenic effects. We enumerated microvessel density and scored lymphoma cell expression of VEGF, VEGFR1, VEGFR2 and phosphorylated VEGFR2 in 162 de novo DLBCL patients treated with R-CHOP (rituximab, cyclophosphamide, vincristine, doxorubicin and prednisone)-like regimens. VEGFR2 expression correlated with shorter overall survival (OS) independent of International Prognostic Index (IPI) (p=0.0028). Phosphorylated VEGFR2 (detected in 13% of cases) correlated with shorter progression-free survival (PFS, p=0.044) and trended toward shorter OS on univariate analysis. VEGFR1 was not predictive of survival on univariate analysis, but it did correlate with better OS on multivariate analysis with VEGF, VEGFR2, and IPI (p=0.036); in patients with weak VEGFR2, lack of VEGFR1 coexpression was significantly correlated with poor OS independent of IPI (p=0.01). These results are concordant with our prior finding of an association of VEGFR1 with longer OS in DLBCL treated with chemotherapy alone. We postulate that VEGFR1 may oppose autocrine VEGFR2 signaling in DLBCL by competing for VEGF binding. In contrast to our prior results with chemotherapy alone, microvessel density was not prognostic of PFS or OS with R-CHOP-like therapy.
Non-Hodgkin lymphoma; VEGF; angiogenesis; tumour biology; prognostic factors
We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse K-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters. These approaches are demonstrated on simulated data and on genomic data sets.