Over the past decade, there has been increased research interest in decision-support systems for endoscopic imagery. In the context of routine examinations of the colon, an important task is to perform pit pattern
discrimination. This is usually guided by the Kudo criteria [Kudo et al., 1994], based on the observation of a strong correlation between the visual appearance of the highly-magnified mucosa and the visual appearance of dissected specimens under the microscope. Pit pattern analysis not only facilitates in vivo predictions of the histology but also represents a valuable guideline for treatment strategies in general. The Kudo criteria discriminate between five pit pattern types, I-V, where type III is subdivided into III-S and III-L. Types I and II are usually characteristic of non-neoplastic lesions, types III and IV indicate adenomatous polyps, and type V is highly indicative of invasive carcinoma. Apart from incidental image structures, such as colon folds, the pit patterns are the predominant concepts upon which histological predictions are made. While images in which one particular pit pattern type is prevalent are fairly rare, mixtures of pit patterns are quite commonly found in practice. The development of decision-support systems for endoscopic imagery is desirable for several reasons.
First, routine examinations often involve unnecessary biopsies or polyp resections, both because physicians are under serious time pressure and because the standard protocol dictates the use of biopsies in cases of uncertainty. This is controversial, since resecting metaplastic lesions is time-consuming and the removal of invasive cancer can be hazardous.
Second, the interpretation of the acquired image material can be difficult, due to high variability in image appearance depending on the type of imaging equipment. Novel modalities, such as high-magnification endoscopy, narrow band imaging (NBI) or confocal laser endomicroscopy (CLE), all highlight different mucosal structures; CLE even provides an in-vivo view of deeper tissue layers at a microscopic scale. A critical problem is that visual criteria for assessing the malignant potential of colorectal lesions are still under extensive clinical evaluation, and substantial experience [Tung et al., 2001] is usually required to achieve good results under these criteria.
Third, decision-support systems can be a helpful aid to the training of future physicians. Due to differences among endoscopic imaging modalities and endoscope brands, it is advisable to train the physician on data from the very device to be used in practice. However, the learning of Kudo's pit pattern classification requires experienced physicians to go through the time-consuming selection of images representative of the different pit pattern types. This is a tedious process, which becomes unmanageable for large-scale datasets.
For all these reasons, there has been increasing interest in decision-support systems for endoscopic imagery over the last decade. This effort has been predominantly directed to the use of automated image content analysis techniques in the prediction of histopathological results (e.g. [André et al., 2009; Tischendorf et al., 2010; Kwitt et al., 2011; Häfner et al., 2012]). It has led to a plethora of approaches that first compute a collection of localized appearance features and then input these features to a discriminant classifier, usually a support vector machine. From a purely technical point of view, this problem is similar to scene recognition problems in the computer vision literature, with the difference that invariance properties of the image representation, such as invariance to rotation or translation, are considered more important in the medical field. A relevant research trend in computer vision is to replace the inference of scene labels from appearance descriptors alone by more abstract, intermediate-level representations [Fei-Fei and Perona, 2005; Lazebnik et al., 2006; Boureau et al., 2010; Rasiwasia et al., 2006; Rasiwasia and Vasconcelos, 2008; Dixit et al., 2011]. The prevalent approach to scene classification is to learn a codebook of so-called visual words from a large corpus of appearance descriptors, and to represent each image as a histogram — known as the bag-of-words (BoW) histogram — of codeword indices. These mid-level representations are input to a discriminant classifier for scene label prediction.
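The BoW construction described above can be sketched as follows; this is a minimal illustration (toy random descriptors in place of real appearance features, k-means as one common codebook learner), not the specific pipeline of any of the cited works:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy corpus of local appearance descriptors, one array of shape
# (n_descriptors, dim) per training image (stand-in for real features).
rng = np.random.default_rng(0)
corpus = [rng.normal(size=(50, 8)) for _ in range(20)]

# 1) Learn a codebook of K "visual words" from the pooled descriptors.
K = 16
codebook = KMeans(n_clusters=K, n_init=10, random_state=0).fit(np.vstack(corpus))

def bow_histogram(descriptors, codebook, K):
    """Map each descriptor to its nearest codeword and count occurrences."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / hist.sum()  # normalize counts to a distribution

# 2) Each image becomes a K-dimensional histogram of codeword indices,
#    which is then fed to a discriminant classifier.
h = bow_histogram(corpus[0], codebook, K)
```

The normalized histogram `h` is the mid-level representation; note that its K coordinate axes are cluster indices with no intrinsic meaning, which is exactly the interpretability problem discussed next.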
In the context of pit-pattern classification, this classification architecture could, in principle, be used to produce a class label, such as neoplastic, to be presented to a physician. However, while BoW histograms have state-of-the-art recognition rates for both medical and computer vision applications, they are not generally amenable to human interpretation. This is due to the facts that they 1) are high-dimensional, and 2) define a space whose coordinate axes lack semantic interpretation. This lack of interpretability raises a number of difficulties to the clinical deployment of the resulting decision-support systems. First, while the resulting predictions are valuable, it is not uncommon for the medical community to reject black-box solutions that do not provide interpretable information on how these predictions were reached. Second, the lack of insight into the factors that determine the predicted image labels severely compromises their usefulness for physician training. Third, it has been recently argued that a more semantically-focused mid-level representation is conducive to better recognition results (cf. [Schwaninger et al., 2006; Rasiwasia and Vasconcelos, 2008]). Several works have, in fact, shown that an image representation which captures the occurrence probabilities of predefined semantic concepts is not only competitive with BoW, but also computationally more efficient due to its lower dimensionality. Since the semantic concepts can be chosen so as to be interpretable by physicians, the approach is also conducive to wider acceptance by the medical community. For example, [André et al., 2012] demonstrated that low-dimensional semantic encodings are highly beneficial to the interpretation of CLE imagery.
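The idea of encoding an image by occurrence probabilities of semantic concepts can be sketched as follows. This is a hedged illustration, not the exact method of [Rasiwasia and Vasconcelos, 2008] or [André et al., 2012]: concept names, data, and the choice of logistic regression as the appearance-to-concept classifier are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical semantic vocabulary (illustrative pit pattern concepts).
concepts = ["pit_I", "pit_II", "pit_III", "pit_IV", "pit_V"]

# Toy training data: each image is summarized by a 16-dim BoW histogram
# and labeled with the semantic concept it depicts.
rng = np.random.default_rng(0)
X_train = rng.dirichlet(np.ones(16), size=200)   # 200 BoW histograms
y_train = rng.integers(len(concepts), size=200)  # concept labels

# Learn appearance -> concept posteriors.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def semantic_encoding(bow_hist):
    """Vector of posterior concept probabilities: the semantic encoding."""
    return clf.predict_proba(bow_hist.reshape(1, -1))[0]

enc = semantic_encoding(rng.dirichlet(np.ones(16)))
# enc[i] = P(concept i | image); 5 interpretable dimensions instead of
# 16 (or, in practice, hundreds of) anonymous codeword counts.
```

Each coordinate of `enc` now corresponds to a named concept a physician can inspect, which is the source of the interpretability gain claimed above.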
The goal of this work is to establish a semantic encoding of endoscopic imagery so as to produce systems for automated malignancy assessment of colorectal lesions of greater flexibility than those possible with existing approaches. We demonstrate the benefits of the proposed encoding on image material obtained during routine examinations of the colon mucosa. The imaging modality is high-magnification chromo-endoscopy, which offers a level of visual detail suitable for the categorization of mucosal surface structures into different pit pattern types. Some typical images are shown in the top row of Figure 1. The aforementioned shortcomings of previous approaches are addressed by adapting a recent method [Rasiwasia and Vasconcelos, 2008] from the scene recognition literature to the inference of semantic encodings for endoscopic imagery. Some examples of these encodings are shown in the bottom row of Figure 1. While the general principle is well established in the computer vision literature, we demonstrate that it is a principled solution for a number of important applications in the domain of endoscopic image analysis. The first is the automated assessment of the malignant potential of colorectal lesions, where the proposed semantic encoding is shown to enable state-of-the-art image classification with substantially increased human interpretability of classifier predictions. The second is a tool to browse endoscopic image databases by typicality of particular pit patterns, allowing trainees in gastroenterology to find the most-representative cases for each pit pattern class. The third is a strategy to determine images which represent the average-case for a particular pit pattern type. This enables physicians to keep track of what they typically see in clinical practice. A preliminary version of this work appeared in Kwitt et al.
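Once every database image carries a semantic encoding, the browsing applications described above reduce to simple operations on those probability vectors. The following sketch is an assumption about how such ranking could be implemented, not the paper's actual algorithm:

```python
import numpy as np

# Toy database: one 5-dim concept-probability vector per image; in a
# real system these would come from the learned semantic classifier.
rng = np.random.default_rng(1)
encodings = rng.dirichlet(np.ones(5), size=100)  # 100 images, 5 concepts

def most_representative(encodings, concept_idx, top_k=3):
    """Browse by typicality: rank images by the posterior probability
    of one pit pattern concept and return the top-k indices."""
    order = np.argsort(encodings[:, concept_idx])[::-1]
    return order[:top_k]

def average_case(encodings, concept_idx):
    """Average-case image for a concept: the image whose encoding is
    closest to the mean encoding of images assigned to that concept."""
    members = encodings[encodings.argmax(axis=1) == concept_idx]
    mean = members.mean(axis=0)
    return int(np.argmin(np.linalg.norm(encodings - mean, axis=1)))
```

Ranking by a single coordinate of the encoding yields the most-representative cases for training, while distance to the class-mean encoding yields an average-case exemplar.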
Figure 1: Endoscopy images of the colon mucosa (top row), taken by a high-magnification endoscope, showing typical mucosal structures (pit patterns). The bottom row shows the semantic encoding proposed in this work. The height of each bar indicates the probability of the corresponding semantic concept.