

HHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal.
Proc IEEE Int Symp Biomed Imaging. Author manuscript; available in PMC 2016 July 5.
Published in final edited form as:
PMCID: PMC4932846



Quantitative analysis of whole slide images (WSIs) in a large cohort may provide predictive models of clinical outcome. However, the performance of existing techniques is hindered by the large technical variations (e.g., fixation, staining) and biological heterogeneities (e.g., cell type, cell state) that are always present in a large cohort. Although unsupervised feature learning provides a promising way to learn pertinent features without human intervention, its capability can be greatly limited by the lack of well-curated examples. In this paper, we explore the transferability of knowledge acquired from a well-curated Glioblastoma Multiforme (GBM) dataset through its application to the representation and characterization of tissue histology from The Cancer Genome Atlas (TCGA) Breast Invasive Carcinoma (BRCA) cohort. Our experimental results reveal two major phenotypic subtypes with statistically significantly different survival curves. Further differential expression analysis of these two subtypes indicates enrichment of genes regulated by NF-kB in response to TNF and genes up-regulated in response to IFNG.

Keywords: Breast invasive carcinoma, unsupervised feature learning, knowledge sharing, predictive sparse decomposition, consensus clustering, survival analysis, enrichment analysis

1. Introduction


Tumor histology provides detailed insight into cellular morphology, organization, and heterogeneity. For example, tumor histological sections can be used to identify mitotic cells, cellular aneuploidy, and autoimmune responses. More importantly, if tumor morphology and architecture can be quantified on large histological datasets, this will pave the way for constructing histological databases that are prognostic, in the same way that genome analysis techniques have identified molecular subtypes and predictive markers. Genome-wide analysis techniques (e.g., microarray analysis and next generation sequencing (NGS)) have the advantage of standardized tools for data analysis and pathway enrichment, which enable hypothesis generation for the underlying mechanism. Histological signatures, on the other hand, are hard to compute because of the technical variations and biological heterogeneities in stained histological sections; however, they offer insights into tissue composition as well as heterogeneity (e.g., mixed populations) and rare events.

Histological sections are often stained with hematoxylin and eosin (H&E). Traditional histological analysis is performed by a trained pathologist through the characterization of phenotypic content, such as various cell types, cellular organization, cell state and health, and cellular secretion. One of the main technical barriers to processing a large collection of histological data is that the color composition is subject to technical variations (e.g., fixation, staining) and biological heterogeneities (e.g., cell type, cell state) across histological tissue sections, especially when these tissue sections are processed and scanned at different laboratories. Here, a histological tissue section refers to an image of a thin slice of tissue applied to a microscope slide and scanned with a light microscope. From an image analysis perspective, color variations can occur both within and across tissue sections. For example, within a tissue section, some nuclei may have low chromatin content (e.g., light blue signals), while others may have higher signals (e.g., dark blue); nuclear intensity in one tissue section may be very close to the background intensity (e.g., cytoplasmic, macromolecular components) in another tissue section.

In this paper, we aim to explore the transferability of knowledge acquired from a well-curated GBM dataset through its application to the phenotypic characterization of breast invasive carcinoma. We suggest that unsupervised feature learning is capable of generating transferable knowledge in tissue histology that can potentially be shared across cohorts with different tumor types, which provides an effective solution when well-curated examples are not available.

The paper is organized as follows: Section 2 reviews related work. Section 3 describes the details of our proposed approach. Section 4 describes our experimental setup, followed by a detailed discussion of the experimental results. Lastly, Section 5 concludes the paper.

2. Related Work


Several outstanding reviews of the analysis of histology sections can be found in [1, 2]. From our perspective, four distinct lines of work have defined the trends in histology image analysis: (i) one group of researchers proposed nuclear segmentation and organization for tumor grading and/or the prediction of tumor recurrence [3, 4, 5, 6]; (ii) a second group focused on patch-level analysis (e.g., small regions) [7, 8, 9, 10], using color and texture features, for tumor representation; (iii) a third group focused on block-level analysis to distinguish different states of tissue development using cell-graph representation [11, 12]; and (iv) a fourth group suggested detection and representation of the auto-immune response as a prognostic tool for cancer [13].

The major challenge for computational histopathology is the large amount of technical variation and biological heterogeneity in the data [14], which typically results in techniques that are tumor-type specific. To overcome this problem, recent studies have focused on either fine-tuning human-engineered features [7, 8, 14, 15] or applying automatic feature learning [16, 17, 18] for robust representation.

3. Approach


The proposed approach for tissue phenotypic characterization and integrated analysis includes: transferable knowledge learning through predictive sparse decomposition (PSD), tissue phenotypic representation and subtyping via consensus clustering, survival analysis and genomic association.

3.1. Transferable Knowledge Learning

Given that many visual concepts (e.g., cells) are shared among different tumor types, we employed predictive sparse decomposition (PSD) [19] to learn transferable knowledge (i.e., sparse tissue morphometric patterns) from a well-curated GBM dataset [15, 17, 18], as shown in Figure 1. Unlike many other unsupervised feature learning algorithms [20, 21, 22, 23], the feed-forward feature inference of PSD is very efficient, as it involves only an element-wise nonlinearity and matrix multiplication, which is crucial to the characterization and representation of large cohorts of WSIs.

Fig. 1
Computational workflow of unsupervised feature learning with predictive sparse decomposition (PSD).

Given $X = [x_1, \ldots, x_N] \in \mathbb{R}^{m \times N}$ as a set of vectorized image patches, we formulate the PSD optimization problem as:

$$\min_{B, Z, G, W} \; \|X - BZ\|_F^2 + \lambda \|Z\|_1 + \|Z - G\sigma(WX)\|_F^2 \tag{1}$$

where $B \in \mathbb{R}^{m \times h}$ is a set of basis functions; $Z = [z_1, \ldots, z_N] \in \mathbb{R}^{h \times N}$ is the sparse feature matrix; $W \in \mathbb{R}^{h \times m}$ is the auto-encoder; $G = \mathrm{diag}(g_1, \ldots, g_h) \in \mathbb{R}^{h \times h}$ is a scaling matrix, with diag being an operator that aligns the vector $[g_1, \ldots, g_h]$ along the diagonal; $\sigma(\cdot)$ is the element-wise sigmoid function; and $\lambda$ is a regularization constant. The goal of jointly minimizing Eq. (1) with respect to the quadruple $\langle B, Z, G, W \rangle$ is to make the output of the nonlinear regressor $G\sigma(WX)$ resemble the optimal sparse codes $Z$ that can reconstruct $X$ over $B$ [19]. An iterative process is employed for optimizing Eq. (1); the details can be found in our previous publication [17].
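To make the roles of the four variables concrete, the following is a minimal sketch in plain Python of the feed-forward inference Gσ(Wx) and of the objective as we read it from the definitions above (reconstruction error, an L1 sparsity penalty weighted by λ, and the prediction error between Z and Gσ(WX)). The function names `psd_encode` and `psd_objective` and the list-based data layout are assumptions made for illustration; this is not the authors' implementation and it omits the iterative optimization described in [17].

```python
import math


def sigmoid(v):
    """Element-wise logistic function applied to a scalar."""
    return 1.0 / (1.0 + math.exp(-v))


def psd_encode(W, g, x):
    """Feed-forward code z = G * sigmoid(W x), with G = diag(g).

    W is an h x m matrix (list of h rows), g has length h, x has length m.
    Only a matrix-vector product and an element-wise nonlinearity are needed.
    """
    return [g_k * sigmoid(sum(w * xi for w, xi in zip(row, x)))
            for g_k, row in zip(g, W)]


def psd_objective(X, B, Z, W, g, lam):
    """Sum over patches of reconstruction, sparsity and prediction terms.

    X and Z are lists of columns (one patch / one code per entry);
    B is an m x h matrix given as a list of m rows.
    """
    total = 0.0
    for x, z in zip(X, Z):
        recon = [sum(b * z_k for b, z_k in zip(b_row, z)) for b_row in B]
        total += sum((xi - ri) ** 2 for xi, ri in zip(x, recon))   # ||x - Bz||^2
        total += lam * sum(abs(z_k) for z_k in z)                   # lam * ||z||_1
        pred = psd_encode(W, g, x)
        total += sum((z_k - p_k) ** 2 for z_k, p_k in zip(z, pred)) # ||z - G s(Wx)||^2
    return total
```

Once B, W and G have been learned, only `psd_encode` is needed at test time, which is why inference over a whole slide stays cheap.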

3.2. Tissue Phenotypic Representation and Subtyping

With the transferable knowledge derived from the GBM dataset through the unsupervised feature learning procedure, as shown in Figure 1, each image patch (i.e., 20×20 sub-image) in the WSI can be represented by a sparse feature vector z. Each WSI is then represented by summarizing the feature vectors of all non-overlapping, non-background and non-border patches within the WSI (e.g., by the moments of each individual feature dimension).
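The slide-level summary can be sketched as follows, assuming the first moment (the mean) as the summary statistic. The helper names `iter_patches` and `summarize_slide`, the `border` margin parameter, and the assumption that background patches have already been filtered out by the caller are all illustrative and not taken from the paper.

```python
def iter_patches(height, width, size=20, border=0):
    """Yield (top, left) corners of non-overlapping size x size patches.

    Patches that would touch the outer `border` pixels are skipped
    (hypothetical way of excluding border regions).
    """
    for top in range(border, height - border - size + 1, size):
        for left in range(border, width - border - size + 1, size):
            yield top, left


def summarize_slide(patch_codes):
    """Average each feature dimension over the retained patches of one slide.

    patch_codes is a list of equal-length sparse codes, one per patch that
    survived background and border filtering.
    """
    n = len(patch_codes)
    if n == 0:
        raise ValueError("no foreground patches to summarize")
    dim = len(patch_codes[0])
    return [sum(code[d] for code in patch_codes) / n for d in range(dim)]
```

Higher moments (for example the per-dimension variance) could be appended to the same vector in the same way if a richer summary is wanted.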

Consensus clustering [24] is performed to identify subtypes/clusters across tissue sections of the TCGA BRCA cohort. The inputs to consensus clustering are the summarized features from all tissue sections. Consensus clustering aggregates the agreement across multiple runs of a base clustering algorithm, each applied to a resampled subset of the data. Moreover, it provides a visualization tool for exploring the number of clusters in the data, as well as for assessing the stability of the discovered clusters.
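The resampling idea can be illustrated with a small, self-contained sketch: repeatedly subsample the items, cluster each subsample, and record for every pair how often the two items land in the same cluster relative to how often they were sampled together. The base clusterer here is a naive average-linkage agglomerative procedure on a 1 − Pearson correlation distance; the linkage choice, the function names and the default number of runs are assumptions for illustration, and the code is far too slow for real cohorts.

```python
import math
import random


def pearson_distance(a, b):
    """Return 1 - Pearson correlation between two equal-length vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 1.0  # correlation undefined; treat as uncorrelated
    return 1.0 - cov / math.sqrt(va * vb)


def average_linkage(items, dist, k):
    """Merge the two closest clusters (average linkage) until k remain."""
    clusters = [[i] for i in items]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    return clusters


def consensus_matrix(features, k, n_runs=50, rate=0.8, seed=0):
    """Fraction of co-sampled runs in which each pair shared a cluster."""
    rng = random.Random(seed)
    n = len(features)
    dist = [[pearson_distance(features[i], features[j]) for j in range(n)]
            for i in range(n)]
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_runs):
        subset = rng.sample(range(n), max(k, int(round(rate * n))))
        for i in subset:
            for j in subset:
                sampled[i][j] += 1
        for cluster in average_linkage(subset, dist, k):
            for i in cluster:
                for j in cluster:
                    together[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]
```

Entries near 1 or 0 indicate pairs that are consistently grouped together or apart, so a clean block structure in the matrix suggests a stable choice of k.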

3.3. Integrated Analysis with Genomic Signatures and Clinical Outcomes

Tissue phenotypic subtypes derived from consensus clustering of learned features are then associated with clinical outcomes, genomic/methylation subtypes and molecular data for integrated analysis. The Kaplan-Meier estimator, a non-parametric statistic, is used to estimate the survival function from clinical outcomes. The log-rank test, a non-parametric test designed for right-skewed and censored data, is used to compare the survival distributions of two subtypes. Fisher’s exact test is used for the enrichment analysis between tissue phenotypic subtypes and genomic/methylation subtypes. Linear models are used for assessing differential expression of genes between tissue phenotypic subtypes.

4. Experiments and Discussion


The proposed approach has been applied to the TCGA BRCA cohort, including 273 tissue sections from 273 patients, each of which is labeled with both a 50-gene PAM50 subtype and a methylation subtype [25]. For quality control, background and border portions of each whole slide image were detected and removed from the analysis.

4.1. Consensus clustering

The representation for each tissue section consists of a 1024-dimensional feature (the average sparse feature over all non-overlapping, non-background and non-border patches in the tissue section). Hierarchical clustering was chosen as the base clustering algorithm for consensus clustering, with a distance function based on Pearson correlation. The procedure was run for 500 iterations with a sampling rate of 0.8 on 273 tissue sections. Consensus clustering is implemented through the R Bioconductor ConsensusClusterPlus package. Consensus clustering matrices with 2 to 7 clusters are shown in Figure 2, where the matrices with 2 to 4 clusters reveal different levels of similarity among tissue sections and the matrices with 5 to 7 clusters provide little further detail. Interestingly, the top left cluster in the matrices with 2 to 4 clusters remains the same, while the bottom right samples are further divided into sub-clusters.

Fig. 2
Consensus clustering matrices of 273 TCGA patients with BRCA for cluster number of N = 2 to N = 7 based on tissue morphometric features.

4.2. Survival analysis and genomic association

Survival analysis is implemented through the R survival package. Figure 3 shows the Kaplan-Meier plot for two subtypes associated with patient survival from the two-cluster consensus clustering result. The log-rank test p-value of 0.0005 indicates that the difference between survival times of these two subtypes is statistically significant. Due to the short median overall follow-up and the small number of overall survival events, survival analysis was not performed on the three-cluster and four-cluster consensus clustering results.

Fig. 3
Kaplan-Meier plot for the two subtypes associated with patient survival from the two-cluster consensus clustering result (181 patients in Subtype 1 and 92 patients in Subtype 2).
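For readers unfamiliar with the two statistics, the sketch below computes a Kaplan-Meier curve and a two-group log-rank test in plain Python. It is an illustrative re-implementation under simple assumptions (right-censored data, event flag 1 for an observed death and 0 for censoring, chi-square reference distribution with one degree of freedom); the function names are hypothetical and the paper's results were produced with the R survival package, not with this code.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time."""
    survival, curve = 1.0, []
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    times = list(times_a) + list(times_b)
    events = list(events_a) + list(events_b)
    obs_minus_exp = 0.0  # observed minus expected deaths in group A
    variance = 0.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        n = sum(1 for u in times if u >= t)        # at risk, both groups
        n_a = sum(1 for u in times_a if u >= t)    # at risk, group A
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

Censored patients contribute to the at-risk counts up to their last follow-up time but never to the death counts, which is how both statistics use partial follow-up information.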

Figures 4 and 5 show tissue phenotypic subtypes and the corresponding 50-gene PAM50 subtypes / methylation subtypes [25] for each tissue section in the consensus clustering results of two, three and four clusters, respectively. Fisher’s exact test reveals no enrichment between tissue phenotypic subtypes and 50-gene PAM50 subtypes / methylation subtypes.
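As an illustration of the enrichment test, the following sketch computes a two-sided Fisher's exact p-value for a 2×2 contingency table, for example one phenotypic subtype versus the other crossed with one genomic subtype versus all remaining ones. Reducing the comparison to a 2×2 table is a simplifying assumption made here; the paper does not state how the multi-category tables were handled, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))  # tolerance for float ties
    return min(1.0, total)
```

A large p-value, as reported above, means the observed overlap between phenotypic and genomic subtypes is compatible with independence.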

Fig. 4
Coordinated analysis for two-cluster consensus clustering result: a. Phenotypic subtypes; b. 50-gene PAM50 subtypes [25]; c. Methylation subtypes [25]; d. Genes that are differently expressed between the two phenotypic subtypes (FDR-adjusted p-value< ...
Fig. 5
Coordinated analysis for three-cluster and four-cluster consensus clustering results.

Fifty-six differentially expressed genes between the two subtypes, as shown in Figure 4d, indicate enrichment of genes regulated by NF-kB in response to TNF and genes up-regulated in response to IFNG (via MSigDB [26]). TNF refers to a group of cytokines that induce proliferation, inflammation, and apoptosis, depending upon the adaptor proteins. It has been shown that TNF-α acting on TNFR1 promotes breast cancer growth via p42/p44 MAPK, JNK, Akt and NF-kB-dependent pathways [27]. IFNG is an inflammatory cytokine that induces the expression and function of IRF1, a tumor suppressor gene that can increase antiestrogen responsiveness. Observations also support the exploration of clinical trials combining antiestrogens and compounds that can induce IRF1 for the treatment of some ER-positive breast cancers [28].

5. Conclusion


In this paper, we proposed a knowledge sharing approach based on unsupervised feature learning (i.e., predictive sparse decomposition) for tissue phenotypic characterization, followed by phenotypic subtyping and genomic and clinical association. The knowledge (i.e., sparse tissue morphometric patterns) was initially learned from a well-curated GBM dataset, and then transferred to the TCGA BRCA cohort. Experimental results indicate no enrichment between the tissue phenotypic subtypes and 50-gene PAM50 subtypes / methylation subtypes. Instead, they reveal two major phenotypic subtypes with statistically significantly different survival curves. Further differential expression analysis of these two subtypes indicates enrichment of genes regulated by NF-kB in response to TNF and genes up-regulated in response to IFNG.

We suggest that such an approach can potentially be applied to many other biomedical applications when well-curated examples are not easily available. Our future work will focus on (i) validating our findings on breast invasive carcinoma with an independent cohort; and (ii) validating our approach as well as the transferability of the pre-built knowledge (i.e., sparse tissue morphometric patterns) with other biomedical applications, such as the study of biological responses to environmental challenges.


This work was supported by NIH R01 CA184476 and carried out at Lawrence Berkeley National Laboratory.


1. Demir C, Yener B. Automated cancer diagnosis based on histopathological images: A systematic survey. Technical Report, Rensselaer Polytechnic Institute, Department of Computer Science. 2009.
2. Gurcan M, Boucheron LE, Can A, Madabhushi A, Rajpoot NM, Yener B. Histopathological image analysis: a review. IEEE Reviews in Biomedical Engineering. 2009;2:147–171.
3. Axelrod D, Miller N, Lickley H, Qian J, Christens-Barry W, Yuan Y, Fu Y, Chapman J. Effect of quantitative nuclear features on recurrence of ductal carcinoma in situ (DCIS) of breast. Cancer Informatics. 2008;4:99–109.
4. Datar M, Padfield D, Cline H. Color and texture based segmentation of molecular pathology images using HSOMs. ISBI. 2008:292–295.
5. Basavanhally A, Xu J, Madabhushi A, Ganesan S. Computer-aided prognosis of ER+ breast cancer histopathology and correlating survival outcome with oncotype DX assay. ISBI. 2009:851–854.
6. Doyle S, Feldman M, Tomaszewski J, Shih N, Madabhushi A. Cascaded multi-class pairwise classifier (CASCAMPA) for normal, cancerous, and cancer confounder classes in prostate histology. ISBI. 2011:715–718.
7. Bhagavatula R, Fickus M, Kelly W, Guo C, Ozolek J, Castro C, Kovacevic J. Automatic identification and delineation of germ layer components in H&E stained images of teratomas derived from human and nonhuman primate embryonic stem cells. ISBI. 2010:1041–1044.
8. Kong J, Cooper L, Sharma A, Kurc T, Brat D, Saltz J. Texture based image recognition in microscopy images of diffuse gliomas with multi-class gentle boosting mechanism. ICASSP. 2010:457–460.
9. Han J, Chang H, Loss L, Zhang K, Baehner FL, Gray JW, Spellman PT, Parvin B. Comparison of sparse coding and kernel methods for histopathological classification of glioblastoma multiforme. ISBI. 2011:711–714.
10. Mujahid Khan A, Sirinukunwattana K, Rajpoot NM. A global covariance descriptor for nuclear atypia scoring in breast histopathology images. IEEE J. Biomedical and Health Informatics. 2015;19(5):1637–1647.
11. Acar E, Plopper GE, Yener B. Coupled analysis of in vitro and histology samples to quantify structure-function relationships. PLoS One. 2012;7(3):e32227.
12. Bilgin CC, Ray S, Baydil B, Daley WP, Larsen M, Yener B. Multiscale feature analysis of salivary gland branching morphogenesis. PLoS One. 2012;7(3):e32906.
13. Fatakdawala H, Xu J, Basavanhally A, Bhanot G, Ganesan S, Feldman F, Tomaszewski J, Madabhushi A. Expectation-maximization-driven geodesic active contours with overlap resolution (EMaGACOR): Application to lymphocyte segmentation on breast cancer histopathology. IEEE Transactions on Biomedical Engineering. 2010;57(7):1676–1690.
14. Kothari S, Phan JH, Osunkoya AO, Wang MD. Biological interpretation of morphological patterns in histopathological whole slide images. ACM Conference on Bioinformatics, Computational Biology and Biomedicine. 2012.
15. Chang H, Borowsky A, Spellman PT, Parvin B. Classification of tumor histology via morphometric context. Proceedings of the Conference on Computer Vision and Pattern Recognition. 2013.
16. Huang CH, Veillard A, Lomeine N, Racoceanu D, Roux L. Time efficient sparse analysis of histopathological whole slide images. Computerized Medical Imaging and Graphics. 2011;35(7–8):579–591.
17. Chang H, Zhou Y, Borowsky A, Barner KE, Spellman PT, Parvin B. Stacked predictive sparse decomposition for classification of histology sections. IJCV. 2015;113(1):3–18.
18. Zhou Y, Chang H, Barner KE, Spellman PT, Parvin B. Classification of histology sections via multispectral convolutional sparse coding. CVPR. 2014:3081–3088.
19. Kavukcuoglu K, Ranzato M, LeCun Y. Fast inference in sparse coding algorithms with applications to object recognition. Tech. Rep. CBLL-TR-2008-12-01. NYU: Computational and Biological Learning Lab, Courant Institute; 2008.
20. Lee H, Battle A, Raina R, Ng AY. Efficient sparse coding algorithms. NIPS. 2007:801–808.
21. Lee H, Ekanadham C, Ng AY. Sparse deep belief net model for visual area V2. Advances in Neural Information Processing Systems. Vol. 20. MIT Press; 2008.
22. Poultney C, Chopra S, LeCun Y. Efficient learning of sparse representations with an energy-based model. Advances in Neural Information Processing Systems (NIPS 2006). MIT Press; 2006.
23. Yu K, Zhang T, Gong Y. Nonlinear learning using local coordinate coding. In: Bengio Y, Schuurmans D, Lafferty J, Williams CKI, Culotta A, editors. Advances in Neural Information Processing Systems 22. 2009. pp. 2223–2231.
24. Monti S, Tamayo P, Mesirov J, Golub TR. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn. 2003;52:91–118.
25. The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumors. Nature. 2012;490(7418):61–70.
26. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS. 2005;102(43):15545–15550.
27. Rivas MA, Carnevale RP, Proietti CJ, Rosemblit C, Beguelin W, Salatino M, Charreau EH, Frahm I, Sapia S, Brouckaert P, Elizalde PV, Schillaci R. TNF alpha acting on TNFR1 promotes breast cancer growth via p42/p44 MAPK, JNK, Akt and NF-kappa B-dependent pathways. Exp Cell Res. 2008;314(3):509–529.
28. Ning Y, Riggins RB, Mulla JE, Chung H, Zwart A, Clarke R. IFNgamma restores breast cancer sensitivity to fulvestrant by regulating STAT1, IFN regulatory factor 1, NF-kappaB, BCL2 family members, and signaling to caspase-dependent apoptosis. Mol Cancer Ther. 2010;9(5):1274–1285.