Home | About | Journals | Submit | Contact Us | Français |

**|**Springer Open Choice**|**PMC5541128

Formats

Article sections

Authors

Related links

International Journal of Computer Assisted Radiology and Surgery

Int J Comput Assist Radiol Surg. 2017; 12(8): 1281–1292.

Published online 2017 June 2. doi: 10.1007/s11548-017-1620-7

PMCID: PMC5541128

Menglong Ye,^{}^{1} Edward Johns,^{2} Benjamin Walter,^{3} Alexander Meining,^{3} and Guang-Zhong Yang^{1}

Menglong Ye, Email: ku.ca.lairepmi@11ey.gnolgnem.

Received 2017 January 25; Accepted 2017 May 23.

Copyright © The Author(s) 2017

Serial endoscopic examinations of a patient are important for early diagnosis of malignancies in the gastrointestinal tract. However, retargeting for optical biopsy is challenging due to extensive tissue variations between examinations, requiring the method to be tolerant to these changes whilst enabling real-time retargeting.

This work presents an image retrieval framework for inter-examination retargeting. We propose both a novel image descriptor tolerant of long-term tissue changes and a novel descriptor matching method in real time. The descriptor is based on histograms generated from regional intensity comparisons over multiple scales, offering stability over long-term appearance changes at the higher levels, whilst remaining discriminative at the lower levels. The matching method then learns a hashing function using random forests, to compress the string and allow for fast image comparison by a simple Hamming distance metric.

A dataset that contains 13 in vivo gastrointestinal videos was collected from six patients, representing serial examinations of each patient, which includes videos captured with significant time intervals. Precision-recall for retargeting shows that our new descriptor outperforms a number of alternative descriptors, whilst our hashing method outperforms a number of alternative hashing approaches.

We have proposed a novel framework for optical biopsy in serial endoscopic examinations. A new descriptor, combined with a novel hashing method, achieves state-of-the-art retargeting, with validation on in vivo videos from six patients. Real-time performance also allows for practical integration without disturbing the existing clinical workflow.

Endoscopic examinations have been widely used for visualising the human gastrointestinal (GI) tract. Surveillance endoscopy has been a popular approach for monitoring abnormal changes, such as colorectal polyps and Barretts’ esophagus. A typical endoscopic procedure involves taking tissue samples for histological analysis afterwards, which is both time-consuming and expensive. With the advances in biophotonics, optical biopsy has emerged as a technique for providing in vivo, in situ, and real-time tissue characterisation, such that in time, curative treatment can be performed. Techniques for optical biopsy include narrow band imaging (NBI), blue light imaging (BLI), and confocal laser endomicroscopy (CLE), which can be either integrated into endoscope systems, or manufactured as an external probe-based device, to retrieve the cellular details on the tissue.

Despite the advantages provided by optical biopsy, retargeting of a biopsied location remains a challenging problem for both intra- and inter-examination. In [1], a feature matching method based on Markov random fields was proposed for intra-examination retargeting. Allain et al. [2] combined feature matching with epipolar geometry to provide biopsied location estimation with an uncertainty score. Alternatively, a 3D tracking approach was introduced by Mountney et al. [3] that uses simultaneous localisation and mapping (SLAM) to achieve consistent retargeting in a relatively static endoscopic environment. In [4–6], retargeting of a biopsied location was formulated as a 2D object tracking task, where detectors based on random forests were included to learn online the appearance of the biopsied area. Later, a hybrid approach dealing with occlusion was proposed by Mouton et al. [7] to perform efficient retargeting during probe-based CLE examinations. However, the above approaches would encounter difficulties when applied to serial examinations where there is long-term variation in local tissue appearance.

For retargeting over successive examinations of a patient, which we refer to as the inter-examination retargeting problem, endoscopic video manifolds (EVM) was proposed by Atasoy et al. [8], to learn a low-dimensional intrinsic representation of the video collected in the first examination. This mapping was then learned based on locality preserving projections [9], such that retargeting of a query image in the second examination can be achieved via image retrieval. In [10], a detailed study was performed to evaluate visual descriptors used for viewpoint selection in endoscopic surveillance. In addition to vision-based approaches, the use of external positioning sensors has also been considered. In [11], multiple electromagnetic sensors were used to register the trajectories of the endoscope motion across examinations. Although this method is not affected by the issues of image-based inter-examination retargeting, addition of extra sensors could introduce further complexity to the setup.

Our recent work in [12] introduced a vision-based framework for inter-examination retargeting to assist optical biopsy procedures. The proposed framework (see Fig. Fig.1)1) formulates retargeting as an image retrieval task to enable retargeting of biopsied locations in the second (surveillance) examination based on the targets recorded in the first (diagnosis) examination. A global image description scheme is designed by pooling the spatial information obtained from regional comparisons over multiple scales. Inspired by hashing-based techniques, the global descriptors are compressed into short binary strings with a novel random forest-based encoding function. This then enables real-time retargeting, without interfering with the current clinical workflow. Following our previous work, this paper provides extended descriptions of the methodology, as well as new insights into the technical contributions. Furthermore, other alternative approaches are added into our comparison studies with further validation on in vivo GI video sequences collected from six patients.

Over the last two decades, there has been significant progress in using keypoint-based approaches for image description. One of these is the bag-of-words (BOW) framework [13], which builds a dictionary by performing clustering on local features, such as SIFT [14]. A descriptor of an image is obtained by extracting these features and collecting a frequency histogram from individual words (in the dictionary) for this image. Recently, BOW has been combined with geometric constraints for image retrieval [15] and place recognition [16]. However, the success of these approaches depends on re-occurrences of same local keypoints across different views, which is not always possible for endoscopic scenes as these typically undergo long-term appearance changes on the local tissue surface.

Recently, descriptors based on local binary patterns (LBPs) have emerged to be powerful tools for scene recognition [17], object tracking [18], and 3D reconstruction [19]. The main advantages of LBPs include the tolerance to illumination changes and the superior computational efficiency. Compared to keypoint-based descriptions, such as BOW, LBPs-based descriptors also do not rely on consistent detection of same keypoints over images, thus providing more robustness to long-term tissue appearance changes.

In this paper, we use a symmetric version of LBP based on regional comparisons Fig. Fig.2a.2a. Our LBP performs 4 diagonal comparisons inside an image patch, yielding a 4-bit binary string for this patch. This binary string is then converted into an integer ranging from 0 to 15. With this, a 16-*dimensional*(*d*) image histogram descriptor can be simply obtained by sliding this pattern over the entire image. To consider the global geometry that would be effective for endoscopic scene description, we employ the spatial pyramid pooling approach [20] to aggregate the responses of LBP across various scales and locations. Here, we use a three-level coarse-to-fine representation, as shown in Fig. Fig.33.

Proposed binary pattern performs regional comparisons to obtain a single integer describing the image location

Spatial pyramid pooling is applied to aggregate the responses from regional comparisons at multiple scales, which generates a 496-*d* image descriptor

In addition to the first level that produces 16-*d* descriptor, for the second level, the image is divided into 2 × 2 partitions with an additional partition overlapping at the centre, providing a 80-*d* descriptor. In the third level, we divide the image into 4 × 4 partitions, with additional 3 × 3 partitions overlapped, resulting in a 400-*d* descriptor. To balance the contributions from different levels, the LBP masks contain 24 × 24, 12 × 12 and 6 × 6 pixels for the first, second and third levels, respectively. Finally, a 496-*d* global descriptor for this image is obtained by concatenating the descriptors across all levels.

Let us now denote the video sequences collected in the first (diagnosis) and second (surveillance) examinations as 𝒪_{1} and 𝒪_{2}, respectively. During the surveillance examination, retargeting of a query image (in 𝒪_{2}) is required to be real time such that a regular clinical procedure would not be interfered with. To enable the real-time retargeting capability, we adopt hashing which has proved to be efficient for large-scale image retrieval [21–24]). We follow the two-step hashing approach in [24] to compress the image descriptors into compact binary codes and then learn the mapping function via a novel random forests hash. This allows for fast matching between descriptors based on Hamming distance computation. Furthermore, a quadratic loss function is used for learning the hashing function that maps the original descriptors to a new space, where images from the same scene have a smaller distance.

In this work, we adopt supervised hashing, requiring a scene label for each image in the training image set. We define a scene as a cluster of adjacent images which represent the same topological location. To obtain the scene labels for images, we perform image clustering on the diagnosis video collected in the first examination similar to [8]. Specifically, we use an semiautomatic approach that performs K-means (intensity-based) clustering, followed by manually merging similar clusters. This results in an affinity matrix 𝒜 where *a*_{ij} = 1 if *x*_{i} and *x*_{j} have the same scene label, and *a*_{ij} = 0 if not.

Given a set of image descriptors extracted from the diagnosis video, which are denoted as ${\left\{{\mathbf{x}}_{i}\right\}}_{i=1}^{n}$, our aim is to infer their corresponding *m*-bit binary codes ${\left\{{\mathbf{b}}_{i}\right\}}_{i=1}^{n}$. This inference is performed by encouraging the Hamming distance between images of the same scene to be small, whilst large for images of different scenes. We sequentially obtain each bit in the binary code by optimising for *r*-th bit with the objective function:

$$\begin{array}{c}\hfill \begin{array}{cc}\hfill \underset{{\mathbf{b}}_{\left(r\right)}}{min}& \sum _{i=1}^{n}\sum _{j=1}^{n}{l}_{r}\left({b}_{r,i},{b}_{r,j}\u037e{a}_{ij}\right),\hfill \\ \hfill & \text{s.t.}\phantom{\rule{0.166667em}{0ex}}{\mathbf{b}}_{\left(r\right)}\in {\left\{-1,1\right\}}^{n}\hfill \end{array}\end{array}$$

1

where *b*_{r,i} and *b*_{r,j} are the *r*-th bits for images *i* and *j*, respectively. Here, **b**_{(r)} represents a vector that concatenates the *r*-th bits of *n* images. Therefore, this optimisation sequentially seeks the values of **b**_{(r)} for each bit.

Following [24], we consider a hash loss function *l*(*b*_{1}, *b*_{2}) that takes binary variables *b*_{1}, *b*_{2} ∈ { - 1, 1} as input and satisfies *l*(1, 1) = *l*( - 1, - 1) and *l*( - 1, 1) = *l*(1, - 1). This loss can be replaced with an equivalent quadratic function defined as:

$$\begin{array}{c}\hfill \begin{array}{cc}\hfill h\left({b}_{1},{b}_{2}\right)& ={\displaystyle \frac{1}{2}}\left[{b}_{1}{b}_{2}\left({l}^{11}-{l}^{-11}\right)+{l}^{11}+{l}^{-11}\right]\hfill \\ \hfill & =l\left({b}_{1},{b}_{2}\right),\hfill \end{array}\end{array}$$

2

Here, *l*^{11} and *l*^{-11} are the constants that represent *l*(1, 1) and *l*( - 1, 1), respectively. Note that, Eq. 2 can be proved by checking all the possible binary inputs. For example, when *b*_{1} = *b*_{2} = 1, we have

3

and when *b*_{1} = - 1 and *b*_{2} = 1, we can obtain

$$\begin{array}{cc}\hfill h\left(-1,1\right)=& {\displaystyle \frac{1}{2}}\left[-1\xb71\xb7\left({l}^{11}-{l}^{-11}\right)+{l}^{11}+{l}^{-11}\right]\hfill \\ \hfill =& l\left(-1,1\right).\hfill \end{array}$$

4

Similar equations can also be derived for *h*( - 1, - 1) and *h*(1, - 1). Given that *l*^{11} + *l*^{-11} results in a constant, we now use Eq. 2 to reformulate Eq. 1 as

$$\begin{array}{c}\hfill \begin{array}{cc}\hfill \underset{{\mathbf{b}}_{\left(r\right)}}{min}& \sum _{i=1}^{n}\sum _{j=1}^{n}{b}_{r,i}{b}_{b,j}\left({l}_{r,i,j}^{11}-{l}_{r,i,j}^{-11}\right),\hfill \\ \hfill & \text{s.t.}\phantom{\rule{0.166667em}{0ex}}{\mathbf{b}}_{\left(r\right)}\in {\left\{-1,1\right\}}^{n}.\hfill \end{array}\end{array}$$

5

When considering the affinity label between images *i* and *j*, we have ${l}_{r,i,j}^{11}={l}_{r}\left(1,1\u037e{a}_{ij}\right)$ and ${l}_{r,i,j}^{-11}={l}_{r}\left(-1,1\u037e{a}_{ij}\right)$. Let us denote ${c}_{r,i,j}={l}_{r,i,j}^{11}-{l}_{r,i,j}^{-11}$, and define matrix 𝒞 that contains all the *c*_{r,i,j} elements. The objective is finally turned into a matrix representation:

$$\begin{array}{c}\hfill \begin{array}{cc}& \underset{{\mathbf{b}}_{\left(r\right)}}{min}{\mathbf{b}}_{\left(r\right)}^{T}\mathcal{C}{\mathbf{b}}_{\left(r\right)},\hfill \\ \hfill & \text{s.t.}\phantom{\rule{0.166667em}{0ex}}{\mathbf{b}}_{\left(r\right)}\in {\left\{-1,1\right\}}^{n}.\hfill \end{array}\end{array}$$

6

Note that, for solving this unconstrained binary quadratic problem, we perform a series of local optimisations via graph-cut [24]. Furthermore, in this work, we employ a hinge loss function, defined as

$$\begin{array}{cc}& {l}_{r}\left({b}_{r,i},{b}_{r,j}\u037e{a}_{ij}\right)\hfill \\ \hfill & \phantom{\rule{1em}{0ex}}=\left\{\begin{array}{cc}{\left[0-\mathcal{D}\left({\mathbf{b}}_{i}^{r},{\mathbf{b}}_{j}^{r}\right)\right]}^{2},\hfill & \phantom{\rule{1em}{0ex}}\text{if}\phantom{\rule{0.333333em}{0ex}}{a}_{ij}=1\hfill \\ {\left[max\left(0.5m-\mathcal{D}\left({\mathbf{b}}_{i}^{r},{\mathbf{b}}_{j}^{r}\right),0\right)\right]}^{2},\hfill & \phantom{\rule{1em}{0ex}}\text{if}\phantom{\rule{0.333333em}{0ex}}{a}_{ij}=0\hfill \end{array}\right.\hfill \end{array}$$

7

where ${\mathbf{b}}_{i}^{r}$ and ${\mathbf{b}}_{j}^{r}$ denote the first *r* bits for **b**_{i} and **b**_{j}, respectively. 𝒟( · , · ) indicates the Hamming distance. Equation 7 encourages the images of same scene to be close and pushes the images of different scenes to have distances larger than half the maximum distance (0.5 m). It is worth noting that during this sequential optimisation, each current bit (*r*-th bit) derivation uses the results of previous bits (0 - (*r* - 1)-th bits).

After obtaining the binary codes for the training image set (𝒪_{1}), the next step is to obtain the binary code of a query image in 𝒪_{2}, such that efficient Hamming distance-based matching can be performed. Note that the optimisation with Eq. 6 only aims to infer the binary codes on the training image set. To allow for out-of-sample extension, we need to learn a mapping function. In this work, we propose to use random forests as this mapping.

Given the global image descriptors ${\left\{{\mathbf{x}}_{i}\right\}}_{i=1}^{n}$ and their corresponding binary codes ${\left\{{\mathbf{b}}_{i}\right\}}_{i=1}^{n}$, we now formulate this mapping function as a set of binary classification functions ${\left\{{\mathit{\varphi}}_{i}\left(\mathbf{x}\right)\right\}}_{i=1}^{m}$, with each random forest *ϕ*_{i}(**x**) taking the image descriptor as the input, and returning the label { - 1, 1} for the *i*-th bit, defined as:

$$\begin{array}{cc}& {\mathit{\varphi}}_{i}\left(\mathbf{x}\right)\hfill \\ \hfill & \phantom{\rule{1em}{0ex}}=\left\{\begin{array}{cc}-1\hfill & \phantom{\rule{1em}{0ex}}\text{if}\phantom{\rule{0.333333em}{0ex}}\frac{1}{K}{\sum}_{k=1}^{K}{\mathit{\alpha}}_{k}\left(\mathbf{x}\right)<0.5\hfill \\ 1\hfill & \phantom{\rule{1em}{0ex}}\text{otherwise}\hfill \end{array}\right.\hfill \end{array}$$

8

Here, we train *K* decision trees for each *i*-th hash function, and assign - 1 or 1 by calculating the average responses from all trees. The training input for each tree *α*_{k}(**x**) is a subset randomly sampled from ${\left\{{\mathbf{x}}_{i}\right\}}_{i=1}^{n}$.

The split function at each tree node is associated with learning two parameters *s* and *τ*, which performs a comparison on the *s*-th element in **x**_{i} with threshold *τ*. To grow each decision tree, we maximise an information gain to find the optimal parameters that split the input data *X* into left *X*_{L} and right *X*_{R} subsets. We define this information gain *I* as

$$\begin{array}{c}\hfill I=\mathit{\pi}\left(X\right)-{\displaystyle \frac{1}{|X|}}\sum _{t\in \left\{L,R\right\}}|{X}_{t}|\mathit{\pi}\left({X}_{t}\right)\end{array}$$

9

Here, we use the Shannon entropy:

$$\begin{array}{c}\hfill \mathit{\pi}\left(X\right)=-\sum _{y\in \left\{-1,1\right\}}{p}_{y}log\left({p}_{y}\right),\end{array}$$

10

where *p*_{y} indicates the fraction of data in *X* assigned to label *y*. We stop growing a tree when the defined maximum depth has been reached, or the value of *I* is below *e*^{-10}.

In this work, we train *m* random forests, acting as the mapping function ${\left\{{\mathit{\varphi}}_{i}\left(\mathbf{x}\right)\right\}}_{i=1}^{m}$ with each generating one bit of the binary code according to Eq. 8. During the surveillance examination, retargeting of a query image is achieved by obtaining its binary code (via the mapping function), followed by comparing the Hamming distance to the binary codes ${\left\{{\mathbf{b}}_{i}\right\}}_{i=1}^{n}$ from the previous diagnosis video. Finally, the relevant images of the query image are retrieved.

We implemented our framework on an HP workstation with an Intel × 5650 CPU and 24GB RAM, using Matlab and C++. Performance evaluation of our framework was conducted on in vivo data. We collected 13 video sequences ( ≈ 17, 700 images) from standard GI endoscopic examinations on six patients. Two videos were collected in successive endoscopies for each of Patients 1–5. Three videos were collected for Patient 6 in serial examinations with time intervals of 3–4 months apart. Standard Olympus endoscope systems were used for video recording in 720 × 576-pixel size, and the black borders in the images were removed before applying our framework. The NBI mode was turned on during data acquisition for image enhancement.

In this work, we consider retargeting for patient-specific data; therefore, the random forests mapping function needs to be trained separately for each patient.. Leave-one-video-out validation was performed on the patients individually, which results in 16 experiments in total. For each experiment, one video was used as 𝒪_{1} for binary code inference and mapping function learning, and the other video was used as 𝒪_{2} for testing with randomly selected 50 query images. For obtaining the ground truth, intensity-based K-means clustering was performed on 𝒪_{1} and 𝒪_{2}, resulting 10-34 clusters depending on video lengths. The clusters in 𝒪_{1} and 𝒪_{2} are then matched side-by-side manually by an expert, which generates the scene labels for the testing images (by checking their belonged clusters). The value of K is empirically determined according to the number of images contained in each video sequence. Our experiments did not focus on evaluating the sensitivity of the value of K on the framework performance, because this is a parameter which would be defined according to the particular clinical task. For example, for precise retargeting by trading off recall, then a larger value of K would be used to divided the sequence into a greater number of distinct clusters. It took around ten minutes for the expert to review obtained clusters for each video. We provide in Table Table11 the details of the clustered video dataset, and their inter-cluster variances (ICV) [25].

We employed precision-recall analysis in evaluating both our descriptor and hashing framework. Let us now consider the top *U* image attempts retrieved from 𝒪_{1} relevant to a query in 𝒪_{2}. A retrieval attempt is marked as true positive (TP) if it has the same scene label as the query, and false positive (FP), otherwise. Precision is then defined as the fraction of retrievals that are TP: $P=\frac{\#TP}{U}$, and recall is calculated as $R=\frac{\#TP}{V}$, where *V* is the number of all relevant images to the query. Mean average precision (MAP) is also used in evaluation as an indicative measure for image retrieval. When *Q* queries are tested and *U* retrievals are made, the MAP is obtained as

11

where represents the precision of *q*-th query with the top *u* retrieval attempts. In addition, we also define as the mean recognition rate, which represents the reliability of a system for returning its top ranked result.

The proposed descriptor in this work has been validated against several popular image descriptors, including the GIST [26] descriptor based on wavelet responses, and a SPACT descriptor [17] based on pixel comparisons. We also compared to the BOW descriptor [13] using SIFT features. Furthermore, the popular variants of BOW, including Fisher vector (FV) [27] and VLAD [28] are also added into this comparison. For GIST, we performed 4 × 4 partitioning on the image, and each partition was convolved with Gabor filters of 4 scales and 8 orientations, which results in a 512-*d* descriptor. We also followed [17] to implement a 1240-*d* SPACT descriptor using pixel-based census transform. For BOW, we created a dictionary that contains 10,000 words by sampling the SIFT features from the GI video sequences. For FV and VLAD, we used the publically available code to obtain 8192-*d* descriptors, followed by extracting their principal components to finally derive 256-*d* descriptors.

We present in Fig. Fig.44 the precision-recall curves of our descriptor compared to the others. These curves are generated by varying the value of *U* and presented for patient-specific experiments. It can been seen that our descriptor outperforms the others in all experiments. We can also observe that the BOW approach has provided inferior results to the others due to the dependence on consistent keypoint detection, which is not reliable with long-term appearance changes on tissue surface (Patient 6 in Fig. Fig.4).4). This also makes other variants of BOW including FV and VLAD generate similar results. Table Table22 shows the MAP measures with our descriptor presenting the highest values in all experiments. Although GIST provides robustness to deformation, it lacks in encoding of the local texture details. The multi-level spatial pooling scheme in our descriptor ensures the similarities can be obtained across a range of scales. Our descriptor also outperforms the SPACT descriptor for the regional comparisons, due to better tolerance to illumination changes and camera translation.

Mean average precisions for retrieval performance. Our descriptor is compared to a range of popular descriptors

For evaluating the entire framework (after hashing), we compared to a range of state-of-the-art hashing approaches. These include hashing via iterative quantization (ITQ) [23], anchor graph hashing (AGH) [21], kernalised supervised hashing (KSH) [22], and Fasthash [24]. In addition, comparisons to two more recently proposed hashing approaches including hashing with latent factor models (LFH) [29] and column sampling based hashing (CSH) [30] were also performed. We also compared our framework to a relevant retargeting approach named endoscopic video manifolds (EVM) [8]. Each random forest for the mapping function in our framework contained 100 trees, with a maximum depth of 10 for each tree. We provide in Fig. Fig.5a5a the recognition rates of all hashing-based approaches on different lengths (*m* = {16, 32, 48, 64}) of binary codes, where our hashing scheme provides the highest recognition rates {0.75, 0.82, 0.86, 0.87}. We also present the precisions with 50 top retrievals on all lengths in Fig. Fig.5b,5b, showing ours performs the best with {0.79, 0.83, 0.86, 0.87}. It is evident that 64-bit binary codes present the best performance, and we therefore use this length for the remaining evaluation.

Evaluation of different binary code lengths. **a** Means and standard deviations of recognition rates, defined as mean average precisions with top retrievals (MAP@1); **b** means and standard deviations of precision values with top 50 retrievals (P@50)

The precision-recall curves of patient-specific experiments for all hashing-based approaches (64-bit) are provided in Fig. Fig.66 with their associated MAP measures reported in Table Table3.3. We observe from this table that after hashing, the retargeting performance has improved over the original descriptor (Table (Table2).2). In addition, our hashing scheme outperforms other alternatives, providing graceful falloffs in precision-recall, as well as the highest MAPs. The employed two-step hashing scheme provides flexibility in using independent classifiers for learning the mapping function, thus achieving more powerful discrimination than the approaches in [21–23, 29]. We also find that linear classifiers used in [30] are less discriminative than our classifiers, and boosted trees (Fasthash [24]) tend to overfit the training dataset, presenting lower MAP scores to our random forest-based hashing. It is worth noting the comparison to the EVM method, from which we notice that EVM generates inferior results to ours, and its performance on a similar dataset in our experiments is poorer than the one reported in [8]. This is because in our work, we use two different sequences from training and testing, yielding a realistic retargeting scenario, whilst in their studies training and testing data are from the same sequence. Finally, we present example retargeting results of our framework in Figs. Figs.77 and and88.

Precision-recall curves of framework evaluation on patient-specific experiments. Our hashing scheme is compared to state-of-the-art approaches on 64-bit binary code

Mean average precisions for retrieval performance. Our entire framework is compared to state-of-the-art hashing schemes (using 64-bit) and a previous retargeting approach

Example results for Patients 1–3. Top ranked retrievals based on Hamming distances, with *blue*-, *green*-, and *yellow*-border images being queries for retargeting, correct retargeting, and incorrect retargeting results, respectively

Example results for Patients 4–6. Top ranked retrievals based on Hamming distances, with *blue*-, *green*-, and *yellow*-border images being queries for retargeting, correct retargeting and incorrect retargeting results, respectively

Run-time speed is an important factor in using computer vision techniques for endoscopic interventions. A vision algorithm is usually required be real time such that a regular clinical procedure would not be interrupted. Our framework currently performs retargeting of one query within 19*ms*, which includes extracting the image descriptor, mapping into a binary code, and computing Hamming distances. Whilst the querying time using the original descriptor is around 490*ms*, the run-time speed improved by hashing meets the requirements of real-time capability.

It is worth noting the limitation of the current dataset, in which there are three videos collected from one patient within long-term intervals, and the other videos were collected from patients with serial endoscopies during one examination. Nevertheless, our experimental protocol follows realistic scenarios in surveillance endoscopy that only videos collected in ‘previous examinations’ are known, and used for subsequent examinations of the same patients. Our vision-based retargeting framework in this work provides relevant images of a query image of the same patient and does not provide the depth information of the endoscopic cameras [3] or specific locations (within images) of optical biopsies [4]; however, it can be used as an additional function to assist endoscopists by performing image retrieval for patient-specific data collected in serial examinations.

We proposed in this paper an image retrieval framework for inter-examination retargeting in gastrointestinal endoscopy. An image descriptor was proposed to consider the global geometry of an endoscopic scene by pooling the regional information at multi-scale. The extracted image descriptors from a previous video sequence were compressed into short binary codes via hashing. To allow for retargeting of a query image in the current examination, we proposed a novel random forest-based mapping, which provides not only strong discrimination in learning the mapping function, but also real-time retargeting capabilities. We compared our framework to a range of popular descriptors and hashing-based approaches. Experiments were conducted on in vivo video data collected from six patients, demonstrating the consistent state-of-the-art performance provided by our descriptor and hashing.

Currently, the framework learns the mapping function using only one previous video sequence. As further videos could be collected for the same patient, our framework can be readily extended to learn the mapping using two or more previous video sequences, which could further improve the retargeting performance. In addition, future works would also involve performing hierarchical image matching for further speedup or employing convolutional neural networks as more training data become available.

Compliance with ethical standards### Conflict of interest

### Informed consent

The authors declare that they have no conflict of interest.

Informed consent was obtained from all individual participants included in the study.

1. Atasoy S, Glocker B, Giannarou S, Mateus D, Meining A, Yang GZ, Navab N (2009) Probabilistic region matching in narrow-band endoscopy for targeted optical biopsy. In: MICCAI 2009, vol 5761. Springer, Berlin, pp 499–506 [PubMed]

2. Allain B, Hu M, Lovat LB, Cook RJ, Vercauteren T, Ourselin S, Hawkes DJ. Re-localisation of a biopsy site in endoscopic images and characterisation of its uncertainty. Med Image Anal. 2012;16(2):482–496. doi: 10.1016/j.media.2011.11.005. [PubMed] [Cross Ref]

3. Mountney P, Giannarou S, Elson D, Yang GZ (2009) Optical biopsy mapping for minimally invasive cancer screening. In: MICCAI 2009, vol 5761. Springer, Berlin, pp 483–490 [PubMed]

4. Ye M, Giannarou S, Patel N, Teare J, Yang GZ (2013) Pathological site retargeting under tissue deformation using geometrical association and tracking. In: MICCAI 2013, vol 8150. Springer, Berlin, pp 67–74 [PubMed]

5. Ye M, Johns E, Giannarou S, Yang GZ (2014) Online scene association for endoscopic navigation. In: MICCAI 2014, vol 8674. Springer, Berlin, pp 316–323 [PubMed]

6. Ye M, Giannarou S, Meining A, Yang GZ. Online tracking and retargeting with applications to optical biopsy in gastrointestinal endoscopic examinations. Med Image Anal. 2016;30:144–157. doi: 10.1016/j.media.2015.10.003. [PubMed] [Cross Ref]

7. Mouton A, Ye M, Lacombe Françoisand Yang GZ. Hybrid retargeting for high-speed targeted optical biopsies. In: Navab N, Hornegger J, Wells WM, Frangi AF, editors. MICCAI 2015. Berlin: Springer; 2015. pp. 471–479.

8. Atasoy S, Mateus D, Meining A, Yang GZ, Navab N. Endoscopic video manifolds for targeted optical biopsy. IEEE Trans Med Imaging. 2012;31(3):637–653. doi: 10.1109/TMI.2011.2174252. [PubMed] [Cross Ref]

9. He X, Niyogi P (2004) Locality preserving projections. In: Advances in neural information processing systems, pp 153–160

10. Vemuri AS, Nicolau S, Marescaux J, Soler L, Ayache N (2015) Automatic view-point selection for inter-operative endoscopic surveillance. In: Medical content-based retrieval for clinical decision support, Munich, Germany, Tanveer Syeda-Mahmood and Hayit Greenspan and Anant Madabhushi (October 2015), pp 1–8

11. Vemuri AS, Nicolau S, Sportes A, Marescaux J, Soler L, Ayache N. Interoperative biopsy site relocalization in endoluminal surgery. IEEE Trans Biomed Eng. 2016;63(9):1862–1873. doi: 10.1109/TBME.2015.2503981. [PubMed] [Cross Ref]

12. Ye M, Johns E, Walter B, Meining A, Yang GZ (2016) Robust image descriptors for real-time inter-examination retargeting in gastrointestinal endoscopy. In: MICCAI 2016. Springer, Cham, pp 448–456

13. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. In: CVPR

14. Lowe DG. Distinctive image features from scale-invariant keypoints. Int J Comput Vis. 2004;60(2):91–110. doi: 10.1023/B:VISI.0000029664.99615.94. [Cross Ref]

15. Zhang Y, Jia Z, Chen T (2011) Image retrieval with geometry-preserving visual phrases. In: CVPR, IEEE, pp 809–816

16. Johns E, Yang GZ (2014) Pairwise probabilistic voting: fast place recognition without RANSAC. Springer, Cham, pp 504–519

17. Wu J, Rehg J. Centrist: A visual descriptor for scene categorization. IEEE Trans Pattern Anal Mach Intell. 2011;33(8):1489–1501. doi: 10.1109/TPAMI.2010.224. [PubMed] [Cross Ref]

18. Subrahmanyam M, Maheshwari R, Balasubramanian R. Local maximum edge binary patterns: a new descriptor for image retrieval and object tracking. Signal Process. 2012;92(6):1467–1479. doi: 10.1016/j.sigpro.2011.12.005. [Cross Ref]

19. Calonder M, Lepetit V, Strecha C, Fua P (2010) Brief: binary robust independent elementary features. In: European conference on computer vision, Springer, Berlin, pp 778–792

20. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: CVPR, pp 2169–2178

21. Liu W, Wang J, Kumar S, Chang SF (2011) Hashing with graphs. In: ICML, pp 1–8

22. Liu W, Wang J, Ji R, Jiang YG, Chang SF (2012) Supervised hashing with kernels. In: CVPR, pp 2074–2081

23. Gong Y, Lazebnik S, Gordo A, Perronnin F. Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans Pattern Anal Mach Intell. 2013;35(12):2916–2929. doi: 10.1109/TPAMI.2012.193. [PubMed] [Cross Ref]

24. Lin G, Shen C, van den Hengel A. Supervised hashing using graph cuts and boosted decision trees. IEEE Trans Pattern Anal Mach Intell. 2015;37(11):2317–2331. doi: 10.1109/TPAMI.2015.2404776. [PubMed] [Cross Ref]

25. Kovács F, Legány C, Babos A (2005) Cluster validity measurement techniques. In: 6th International symposium of Hungarian researchers on computational intelligence, Citeseer

26. Oliva A, Torralba A. Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis. 2001;42(3):145–175. doi: 10.1023/A:1011139631724. [Cross Ref]

27. Perronnin F, Sánchez J, Mensink T. Improving the fisher kernel for large-scale image classification. Berlin: Springer; 2010.

28. Arandjelovic R, Zisserman A (2013) All about vlad. In: CVPR, pp 1578–1585

29. Zhang P, Zhang W, Li WJ, Guo M (2014) Supervised hashing with latent factor models. In: Proceedings of the 37th international ACM SIGIR conference on research and development in information retrieval. SIGIR ’14, New York, NY, USA. ACM, pp 173–182

30. Kang WC, Li WJ, Zhou ZH (2016) Column sampling based discrete supervised hashing. In: Proceedings of the thirtieth AAAI conference on artificial intelligence. AAAI’16, AAAI Press, pp 1230–1236

Articles from Springer Open Choice are provided here courtesy of **Springer**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |