Quantitative analysis of whole slide images (WSIs) in a large cohort may provide predictive models of clinical outcome. However, the performance of existing techniques is hindered by the large technical variations (e.g., fixation, staining) and biological heterogeneities (e.g., cell type, cell state) that are always present in a large cohort. Although unsupervised feature learning offers a promising way to learn pertinent features without human intervention, its capability can be greatly limited by the lack of well-curated examples. In this paper, we explored the transferability of knowledge acquired from a well-curated Glioblastoma Multiforme (GBM) dataset through its application to the representation and characterization of tissue histology from the Cancer Genome Atlas (TCGA) Breast Invasive Carcinoma (BRCA) cohort. Our experimental results reveal two major phenotypic subtypes with statistically significantly different survival curves. Further differential expression analysis of these two subtypes indicates enrichment of genes regulated by NF-kB in response to TNF and genes up-regulated in response to IFNG.
Tumor histology provides detailed insight into cellular morphology, organization, and heterogeneity. For example, tumor histological sections can be used to identify mitotic cells, cellular aneuploidy, and autoimmune responses. More importantly, if tumor morphology and architecture can be quantified on large histological datasets, it will pave the way for constructing histological databases that are prognostic, in the same way that genome analysis techniques have identified molecular subtypes and predictive markers. Genome-wide analysis techniques (e.g., microarray analysis and next generation sequencing (NGS)) have the advantage of standardized tools for data analysis and pathway enrichment, which enable hypothesis generation for the underlying mechanism. Histological signatures, on the other hand, are hard to compute because of the technical variations and biological heterogeneities in stained histological sections; however, they offer insights into tissue composition as well as heterogeneity (e.g., mixed populations) and rare events.
Histological sections are often stained with hematoxylin and eosin (H&E). Traditional histological analysis is performed by a trained pathologist through the characterization of phenotypic content, such as various cell types, cellular organization, cell state and health, and cellular secretion. One of the main technical barriers to processing a large collection of histological data is that the color composition is subject to technical variations (e.g., fixation, staining) and biological heterogeneities (e.g., cell type, cell state) across histological tissue sections, especially when these tissue sections are processed and scanned at different laboratories. Here, a histological tissue section refers to an image of a thin slice of tissue applied to a microscope slide and scanned with a light microscope. From an image analysis perspective, color variations can occur both within and across tissue sections. For example, within a tissue section, some nuclei may have low chromatin content (e.g., light blue signals), while others may have higher signals (e.g., dark blue); nuclear intensity in one tissue section may be very close to the background intensity (e.g., cytoplasmic, macromolecular components) in another tissue section.
In this paper, we aim to explore the transferability of knowledge acquired from a well-curated GBM dataset through its application to the phenotypic characterization of breast invasive carcinoma. We suggest that unsupervised feature learning is capable of generating transferable knowledge in tissue histology that can potentially be shared across cohorts with different tumor types, which provides an effective solution when well-curated examples are not available.
The paper is organized as follows: Section 2 reviews related work. Section 3 describes the details of our proposed approach. Section 4 elaborates on our experimental setup, followed by a detailed discussion of the experimental results. Lastly, Section 5 concludes the paper.
Several outstanding reviews of the analysis of histology sections can be found in [1, 2]. From our perspective, four distinct lines of work have defined the trends in histology image analysis: (i) one group of researchers proposed nuclear segmentation and organization for tumor grading and/or the prediction of tumor recurrence [3, 4, 5, 6]; (ii) a second group focused on patch-level analysis (e.g., small regions) [7, 8, 9, 10], using color and texture features, for tumor representation; (iii) a third group focused on block-level analysis to distinguish different states of tissue development using cell-graph representation [11, 12]; (iv) finally, a fourth group has suggested detection and representation of the auto-immune response as a prognostic tool for cancer.
The major challenge for computational histopathology is the large amount of technical variation and biological heterogeneity in the data, which typically results in techniques that are tumor-type specific. To overcome this problem, recent studies have focused on either fine-tuning human-engineered features [7, 8, 14, 15] or applying automatic feature learning [16, 17, 18] for robust representation.
The proposed approach for tissue phenotypic characterization and integrated analysis includes: transferable knowledge learning through predictive sparse decomposition (PSD), tissue phenotypic representation and subtyping via consensus clustering, survival analysis and genomic association.
Given the many visual concepts shared among different tumor types (e.g., cell), we employed predictive sparse decomposition (PSD) to learn transferable knowledge (i.e., sparse tissue morphometric patterns) from a well-curated GBM dataset [15, 17, 18], as shown in Figure 1. Unlike many other unsupervised feature learning algorithms [20, 21, 22, 23], the feed-forward feature inference of PSD is very efficient, as it involves only element-wise nonlinearity and matrix multiplication, which is crucial to the characterization and representation of large cohorts of WSIs.
Given $X = [x_1, \ldots, x_N] \in \mathbb{R}^{m \times N}$ as a set of vectorized image patches, we formulate the PSD optimization problem as:

$$\min_{B, Z, G, W} \; \|X - BZ\|_F^2 + \lambda \|Z\|_1 + \|Z - G\sigma(WX)\|_F^2 \quad \text{s.t.} \quad \|b_i\|_2^2 = 1, \; \forall i = 1, \ldots, h \qquad (1)$$

where $B = [b_1, \ldots, b_h] \in \mathbb{R}^{m \times h}$ is a set of basis functions; $Z = [z_1, \ldots, z_N] \in \mathbb{R}^{h \times N}$ is the sparse feature matrix; $W \in \mathbb{R}^{h \times m}$ is the auto-encoder; $G = \mathrm{diag}(g_1, \ldots, g_h) \in \mathbb{R}^{h \times h}$ is a scaling matrix, with diag being an operator that places the vector $[g_1, \ldots, g_h]$ along the diagonal; $\sigma(\cdot)$ is the element-wise sigmoid function; and $\lambda$ is a regularization constant. The goal of jointly minimizing Eq. (1) with respect to the quadruple $(B, Z, G, W)$ is to force the output of the nonlinear regressor $G\sigma(WX)$ to resemble the optimal sparse codes $Z$ that can reconstruct $X$ over $B$. An iterative process is employed for optimizing Eq. (1); the details can be found in our previous publication.
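As a rough illustration of why feed-forward inference is cheap, the sketch below computes the predicted code Gσ(Wx) for a single patch and evaluates the three terms the text describes (reconstruction error, sparsity penalty weighted by λ, and prediction error). The function names, the equal weighting of the terms, and the toy dimensions are assumptions made for illustration; this is not the authors' implementation.

```python
import math

def sigmoid(t):
    """Numerically stable element-wise logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def psd_encode(x, W, g):
    """Feed-forward code prediction z = G * sigmoid(W x).

    x: patch vector of length m
    W: h x m encoder matrix, given as a list of rows
    g: length-h diagonal of the scaling matrix G
    """
    return [g_i * sigmoid(sum(w * v for w, v in zip(row, x)))
            for g_i, row in zip(g, W)]

def psd_objective(x, z, B, W, g, lam):
    """Per-patch objective: reconstruction + L1 sparsity + prediction error.

    B is the m x h basis matrix, given as a list of rows.
    Equal weighting of the three terms is an assumption of this sketch.
    """
    recon = [sum(b * zk for b, zk in zip(row, z)) for row in B]
    recon_err = sum((xi - ri) ** 2 for xi, ri in zip(x, recon))
    sparsity = lam * sum(abs(zk) for zk in z)
    pred = psd_encode(x, W, g)
    pred_err = sum((zk - pk) ** 2 for zk, pk in zip(z, pred))
    return recon_err + sparsity + pred_err
```

Once B, G and W have been learned, only `psd_encode` is needed at test time, which is one matrix-vector product followed by an element-wise nonlinearity and scaling per patch.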
With the transferable knowledge derived from the GBM dataset through the unsupervised feature learning procedure, as shown in Figure 1, each image patch (i.e., 20×20 sub-image) in the WSI can be represented by a sparse feature vector z. Each WSI is then represented by summarizing the feature vectors of all non-overlapping, non-background and non-border patches within the WSI (e.g., moments of each individual feature dimension).
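The summarization step can be sketched as follows, assuming a regular grid of non-overlapping 20×20 patches and per-dimension mean and variance as the summary moments. The helper names and the `border` margin parameter are hypothetical, and background filtering is assumed to have been applied before `summarize` is called.

```python
def patch_origins(height, width, patch=20, border=0):
    """Top-left corners of non-overlapping patch x patch tiles.

    Tiles that would fall inside the `border` margin (in pixels) are skipped.
    """
    rows = range(border, height - border - patch + 1, patch)
    cols = range(border, width - border - patch + 1, patch)
    return [(r, c) for r in rows for c in cols]

def summarize(features):
    """Per-dimension mean and (population) variance over patch feature vectors."""
    n = len(features)
    dims = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dims)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n
           for d in range(dims)]
    return mean, var
```

For example, `patch_origins(60, 40)` yields six tiles, and `summarize` applied to the sparse codes of the retained tiles gives one fixed-length descriptor per slide regardless of slide size.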
Consensus clustering is performed to identify subtypes/clusters across tissue sections of the TCGA BRCA cohort. The inputs to consensus clustering are the summarized features from all tissue sections. Consensus clustering aggregates consensus across multiple runs of a base clustering algorithm. Moreover, it provides a visualization tool for exploring the number of clusters in the data, as well as for assessing the stability of the discovered clusters.
Tissue phenotypic subtypes derived from consensus clustering of learned features are then associated with clinical outcomes, genomic/methylation subtypes and molecular data for integrated analysis. The Kaplan-Meier estimator, a non-parametric statistic, is used to estimate the survival function from clinical outcomes. The log-rank test, a non-parametric test designed for right-skewed and censored data, is used to compare the survival distributions of two subtypes. Fisher’s exact test is used for the enrichment analysis between tissue phenotypic subtypes and genomic/methylation subtypes. Linear models are used for assessing differential expression of genes between tissue phenotypic subtypes.
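To make the survival statistics concrete, here is a minimal standard-library sketch of the Kaplan-Meier estimator and a two-group log-rank test. It is an illustrative re-implementation under simple assumptions (event indicator 1 for an observed death, 0 for right-censoring; chi-square reference distribution with one degree of freedom), not the code used for the reported results.

```python
import math

def kaplan_meier(times, events):
    """Kaplan-Meier estimate as (time, survival probability) per event time."""
    curve, surv = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    obs_minus_exp, variance = 0.0, 0.0
    for t in sorted({t for t, e in zip(all_times, all_events) if e}):
        n_a = sum(1 for u in times_a if u >= t)        # at risk in group A
        n = n_a + sum(1 for u in times_b if u >= t)    # at risk overall
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e)
        d = d_a + sum(1 for u, e in zip(times_b, events_b) if u == t and e)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

Censored patients contribute to the at-risk counts up to their last follow-up but never to the death counts, which is how both estimators accommodate right-censoring.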
The proposed approach has been applied to the TCGA BRCA cohort, including 273 tissue sections from 273 patients, each of which has labels for both the 50-gene PAM50 subtypes and the methylation subtypes. For quality control purposes, the background and border portions of each whole slide image were detected and removed from the analysis.
The representation for each tissue section consists of a 1024-dimensional feature (the average sparse feature over all non-overlapping, non-background and non-border patches in the tissue section). Hierarchical clustering was chosen as the base clustering algorithm for consensus clustering, with a distance function based on Pearson correlation. The procedure was run for 500 iterations with a sampling rate of 0.8 on 273 tissue sections. Consensus clustering is implemented through the R Bioconductor ConsensusClusterPlus package. Consensus clustering matrices with 2 to 7 clusters are shown in Figure 2, where the matrices with 2 to 4 clusters reveal different levels of similarity among tissue sections and the matrices with 5 to 7 clusters provide little further detail. Interestingly, the top left cluster in the matrices with 2 to 4 clusters remains the same, while the bottom right samples are further divided into sub-clusters.
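The idea behind the consensus matrix can be sketched in a few lines: repeatedly subsample the tissue sections, cluster each subsample, and record how often each pair of sections lands in the same cluster among the runs in which both were sampled. The sketch below is a simplified stand-in for ConsensusClusterPlus, written in Python for illustration; average linkage, the distance 1 − Pearson correlation, and all function names are assumptions.

```python
import math
import random

def pearson_distance(a, b):
    """1 - Pearson correlation between two (non-constant) feature vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)

def average_linkage(items, data, k):
    """Agglomerative clustering of the given item indices into k clusters."""
    clusters = [[i] for i in items]
    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = sum(pearson_distance(data[i], data[j])
                        for i in clusters[p] for j in clusters[q])
                d /= len(clusters[p]) * len(clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        merged = clusters.pop(q)       # q > p, so index p is unaffected
        clusters[p].extend(merged)
    return clusters

def consensus_matrix(data, k, iterations=500, rate=0.8, seed=0):
    """Fraction of co-sampled runs in which each pair shared a cluster."""
    n = len(data)
    rng = random.Random(seed)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(iterations):
        subset = rng.sample(range(n), max(k, int(round(rate * n))))
        for i in subset:
            for j in subset:
                sampled[i][j] += 1
        for cluster in average_linkage(subset, data, k):
            for i in cluster:
                for j in cluster:
                    together[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]
```

Entries near 1 or 0 indicate pairs that are consistently grouped together or apart, so a matrix with clean diagonal blocks suggests a stable choice of k. The naive clustering loop here is far too slow for 273 sections at 500 iterations; it is meant only to show the bookkeeping.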
Survival analysis is implemented through the R survival package. Figure 3 shows the Kaplan-Meier plot for the two subtypes from the two-cluster consensus clustering result. The log-rank test p-value of 0.0005 indicates that the difference between the survival times of these two subtypes is statistically significant. Due to the short median overall follow-up and the small number of overall survival events, survival analysis was not performed on the three-cluster and four-cluster consensus clustering results.
Figures 4 and 5 show tissue phenotypic subtypes and the corresponding 50-gene PAM50 subtypes / methylation subtypes for each tissue section in the consensus clustering results of two, three and four clusters, respectively. Fisher’s exact test reveals no enrichment between tissue phenotypic subtypes and 50-gene PAM50 subtypes / methylation subtypes.
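For intuition, the following sketch computes a two-sided Fisher's exact test on a 2×2 contingency table, for instance membership in one phenotypic cluster versus membership in one PAM50 subtype against the rest. This one-versus-rest reduction is an assumption made to keep the example small; the actual analysis involves tables with more than two categories, which this function does not handle.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one;
    # the small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))
```

A large p-value from such a test means the cluster assignment carries no detectable information about the genomic or methylation label, which is the sense in which the text reports "no enrichment".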
Fifty-six differentially expressed genes between the two subtypes, as shown in Figure 4d, indicate enrichment of genes regulated by NF-kB in response to TNF and genes up-regulated in response to IFNG (via MSigDB). TNF refers to a group of cytokines that induce proliferation, inflammation, and apoptosis, depending upon the adaptor proteins. It has been shown that TNF-α acting on TNFR1 promotes breast cancer growth via p42/p44 MAPK, JNK, Akt and NF-kB-dependent pathways. IFNG is an inflammatory cytokine that induces the expression and function of IRF1, a tumor suppressor gene that can increase antiestrogen responsiveness. Observations also support the exploration of clinical trials combining antiestrogens and compounds that can induce IRF1 for the treatment of some ER-positive breast cancers.
In this paper, we proposed a knowledge sharing approach based on unsupervised feature learning (i.e., predictive sparse decomposition) for tissue phenotypic characterization, followed by phenotypic subtyping and genomic and clinical association. The knowledge (i.e., sparse tissue morphometric patterns) was initially learned from a well-curated GBM dataset, and then transferred to the TCGA BRCA cohort. Experimental results indicate no enrichment between the tissue phenotypic subtypes and 50-gene PAM50 subtypes / methylation subtypes. Instead, they reveal two major phenotypic subtypes with statistically significantly different survival curves. Further differential expression analysis of these two subtypes indicates enrichment of genes regulated by NF-kB in response to TNF and genes up-regulated in response to IFNG.
We suggest that such an approach can potentially be applied to many other biomedical applications when well-curated examples are not easily available. Our future work will focus on (i) validating our findings on breast invasive carcinoma with an independent cohort; and (ii) validating our approach as well as the transferability of the pre-built knowledge (i.e., sparse tissue morphometric patterns) with other biomedical applications, such as the study of biological responses to environmental challenges.
This work was supported by NIH R01 CA184476 and was carried out at Lawrence Berkeley National Laboratory.