Our main motivation for quantifying morphometric composition from histology sections is to gain insight into cellular morphology, organization, and sample tumor heterogeneity in a large cohort. In tumor sections, robust representation and classification can identify mitotic cells, cellular aneuploidy, and autoimmune responses. More importantly, if tissue morphology and architecture can be quantified on a very large dataset, then it will pave the way for constructing prognostic databases, in the same way that genome-wide array technologies have identified molecular subtypes and predictive markers. Genome-wide molecular characterization (e.g., transcriptome analysis) has the advantage of standardized techniques for data analysis and pathway enrichment, which can enable hypothesis generation for the underlying mechanisms. However, array-based analysis (i) can only provide an average measurement of the tissue biopsy, (ii) can be expensive, (iii) can hide occurrences of rare events, and (iv) lacks the clarity for translating a molecular signature into a phenotypic signature. Although nuclear morphology and context are difficult to compute as a result of intrinsic cellular characteristics and technical variations, histology sections can offer insights into tumor architecture and heterogeneity (e.g., mixed populations), in addition to rare events. Moreover, in the presence of a very large dataset, phenotypic signatures can be used to identify intrinsic subtypes within a specific tumor bank through unsupervised clustering. This facet is orthogonal to histological grading, where tumor sections are classified against known grades.
The tissue sections are often visualized with hematoxylin and eosin (H&E) stains, which label DNA content (e.g., nuclei) and protein contents, respectively, in various shades of color. Even though there are inter- and intra-observer variations [1], a trained pathologist can characterize the rich content, such as the various cell types, cellular organization, cell state and health, and cellular secretion. If H&E stained tissue sections can be quantified in terms of cell type (e.g., epithelial, stromal), tumor subtype, and histopathological descriptors (e.g., necrotic rate, nuclear size and shape), then a richer description can be linked with genomic information for improved diagnosis and therapy. This is the main benefit of histological imaging since it can capture tumor architecture.
Ultimately, our goal is to mine a large cohort of tumor data in order to identify morphometric indices (e.g., nuclear size) that define prognostic and/or predictive subtypes. The Cancer Genome Atlas (TCGA) offers such a collection; however, the main issue with processing a large cohort is the inherent variation that results from (i) the sample preparation protocols (e.g., fixation, staining) practiced by different laboratories, and (ii) the intrinsic tumor architecture (e.g., cell type, cell state). For example, with respect to heterogeneity in the tumor architecture, the nuclear color in the RGB space found in one tissue section may be similar to the cytoplasmic color in another tissue section. At the same time, the nuclear color intensity (e.g., chromatin content) can vary within a whole slide image. Therefore, image analysis should be tolerant and robust with respect to variations in sample preparation and tumor architecture, both within the entire slide image and across the tumor cohort.
Stained whole mount tissue sections are scanned at either 20X or 40X, which results in large images on the order of 40k-by-40k pixels or larger. Each image is partitioned into blocks of 1k-by-1k pixels for processing, and cells at the borders of each block are excluded during the processing. The details of the computational pipeline can be found in our earlier paper [2]. Our approach evolved from our observation that simple color decomposition and thresholding misses or over-estimates some of the nuclei in the image; for example, nuclei with low chromatin content are excluded. Further complications ensue as a result of diversity in nuclear size and shape (e.g., the classic scale problem).
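The block partitioning described above can be sketched in a few lines. This is a minimal illustration, not the pipeline from [2]: the function names, the bounding-box representation of a cell, and the rule that "at the border" means "bounding box reaches the block edge" are all assumptions made for the example.

```python
from typing import Iterator, List, Tuple

# (x0, y0, x1, y1) in pixels, end-exclusive. Hypothetical representation.
Box = Tuple[int, int, int, int]


def iter_blocks(width: int, height: int, block: int = 1000) -> Iterator[Box]:
    """Yield block rectangles that tile a width-by-height image."""
    for y0 in range(0, height, block):
        for x0 in range(0, width, block):
            yield (x0, y0, min(x0 + block, width), min(y0 + block, height))


def touches_border(obj: Box, blk: Box) -> bool:
    """True if the object's bounding box reaches any edge of the block."""
    ox0, oy0, ox1, oy1 = obj
    bx0, by0, bx1, by1 = blk
    return ox0 <= bx0 or oy0 <= by0 or ox1 >= bx1 or oy1 >= by1


def keep_interior(objects: List[Box], blk: Box) -> List[Box]:
    """Drop segmented objects that touch the block border (assumed rule)."""
    return [o for o in objects if not touches_border(o, blk)]


# A 40k-by-40k image yields 40 x 40 blocks of 1k-by-1k pixels.
n_blocks = sum(1 for _ in iter_blocks(40_000, 40_000))

# Three hypothetical nuclei in the first block; two of them touch its edge.
first_block = (0, 0, 1000, 1000)
nuclei = [(10, 10, 30, 30), (0, 500, 12, 520), (990, 5, 1000, 25)]
interior = keep_interior(nuclei, first_block)
```

Excluding border cells keeps truncated nuclei out of the morphometric statistics, at the cost of discarding a thin band of cells along each block edge.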
The general approach is shown in the accompanying workflow figure, where the primary novelty is in the image-based modeling of inherent ambiguities that are associated with technical variations and biological heterogeneity. Image-based modeling captures prior knowledge from a diverse set of annotated images (e.g., a dictionary), which is needed in order to model the foreground and background representations. Each annotated image is independent of the other images and signifies one facet (e.g., color space, nuclear shape and size) of the diversity within the cohort. Moreover, each image is represented in the feature space as a Gaussian Mixture Model (GMM) of the Laplacian of Gaussian (LoG) and RGB responses. Collectively, the reference dictionary of annotated images provides the means for color normalization and for capturing global statistics for segmenting test images. The computed global statistics can then be coupled, through a graph cut formulation, with the intrinsic local image statistics and spatial continuity for binarization. Once an input test image has been segmented, each segmented foreground region is validated for nuclear shape and, if needed, decomposed through geometric reasoning. A secondary novelty is in the details of the computational pipeline. For example, we introduce the concepts of (i) "color map normalization" for registering a test image against each of the images in the reference library, and (ii) "blue ratio image" for mapping RGB images into the gray space, so that LoG responses can be computed efficiently in one channel. All important free parameters are selected through cross-validation. Thus far, close to 1000 whole slide images have been processed, and the data have been made publicly available through our website at http://tcga.lbl.gov. In addition, segmentation results from the whole mount tissue sections are available for quality control through a web-based zoomable interface.
Figure: Workflow of nuclear segmentation for a cohort of whole mount tissue sections.
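To make the "blue ratio image" concrete, the sketch below maps RGB pixels to a single gray channel. The text above names the transform but does not give its formula, so the expression used here, BR = 100·B / (1 + R + G) × 256 / (1 + R + G + B), is an assumption taken from a form that appears in the histology image analysis literature; the function names and the sample pixel values are hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def blue_ratio(r: int, g: int, b: int) -> float:
    """Assumed blue-ratio form: large where blue dominates red and green."""
    return (100.0 * b / (1.0 + r + g)) * (256.0 / (1.0 + r + g + b))


def to_blue_ratio(image: Sequence[Sequence[Pixel]]) -> List[List[float]]:
    """Map an RGB image (rows of pixels) to a single-channel image."""
    return [[blue_ratio(*px) for px in row] for row in image]


# Hypothetical pixel colors: hematoxylin-like (nuclear), eosin-like, background.
nuclear, eosin, white = (70, 50, 150), (230, 150, 190), (255, 255, 255)
gray = to_blue_ratio([[nuclear, eosin, white]])
```

With these sample values the nuclear pixel scores highest, followed by the eosin-like pixel and then the white background, which is the ordering a single-channel LoG filter needs in order to respond to nuclei as blobs.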
Essentially, nuclear segmentation provides the basis for morphometric representation on a cell-by-cell basis. As a result, tumor histology can be represented as a meaningful data matrix, to which well-known bioinformatics and statistical tools can be readily applied for hypothesis generation. For example, a large cohort facilitates tumor subtyping based on computed morphometric features. Each subtype can then be (i) tested for its prognostic value, and (ii) utilized for identifying the molecular basis of each subtype for hypothesis generation. In the case of glioblastoma multiforme (GBM), prognostic and/or predictive subtypes have also been posted on our website.
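As an illustration of how cell-by-cell measurements might become a data matrix, the sketch below collapses per-nucleus areas into one row per tissue section. The choice of summary statistics (mean, standard deviation, count), the sample identifiers, and the area values are all hypothetical; the text above does not specify which features populate the matrix.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def morphometric_row(areas: List[float]) -> List[float]:
    """Summarize per-nucleus areas for one section as [mean, std, count]."""
    return [mean(areas), pstdev(areas), float(len(areas))]


def build_matrix(cohort: Dict[str, List[float]]) -> Tuple[List[str], List[List[float]]]:
    """Return (sample_ids, matrix) with one row per sample, sorted by id."""
    ids = sorted(cohort)
    return ids, [morphometric_row(cohort[s]) for s in ids]


# Hypothetical nuclear areas (in pixels) for two tissue sections.
cohort = {
    "sample_B": [90.0, 110.0],
    "sample_A": [40.0, 50.0, 60.0],
}
sample_ids, matrix = build_matrix(cohort)
```

A matrix in this samples-by-features layout can be passed directly to standard clustering or survival analysis tools, which is what makes subtyping across a cohort practical.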
This paper is organized as follows. Section II reviews previous research with a focus on quantitative representation of H&E sections for translational medicine. Sections III and IV describe the details of the image-based modeling for nuclear segmentation and the experimental validation, respectively. Section V examines one application of nuclear segmentation: morphometric subtyping and molecular association for hypothesis generation. Lastly, Section VI concludes the paper.