There is interest in expanding the reach of literature mining to include the analysis of biomedical images, which often contain a paper’s key findings. Examples include recent studies that use Optical Character Recognition (OCR) to extract image text, which is then used to improve biomedical image retrieval and classification. Such studies rely on the robust identification of text elements in biomedical images, which is a non-trivial task. In this work, we introduce a new text detection algorithm for biomedical images based on iterative projection histograms. We study the effectiveness of our algorithm by evaluating its performance on a set of manually labeled random biomedical images, and compare it against other state-of-the-art text detection algorithms. We demonstrate that a projection histogram-based approach is well suited for text detection in biomedical images, achieving an F score of 0.60. The approach performs better than comparable approaches for text detection. Further, we show that the iterative application of the algorithm boosts overall detection performance. A C++ implementation of our algorithm is freely available through email request for academic use.
Biomedical literature mining is concerned with transforming free text into a structured, machine-readable format, to improve tasks such as information retrieval and extraction. Recent work indicates that there is much interest in also considering image information when mining research articles, as images often depict the results of experiments and sum up a paper’s key findings. There are several obstacles when mining image information. First, there are many different types of images, such as graphs, gel electrophoresis and microscopy images, diagrams or heat maps. Second, there is no image publication standard with regard to either image resolution or image file format (images are stored at different resolutions, and in a variety of file formats, such as jpeg, tiff etc). Third, there are no explicit image design guidelines, even though authors seem to follow some universally accepted norms when creating figures such as box plots, heat maps or gel electrophoresis images.
A unifying element across all biomedical images is image text, i.e. text characters that are embedded in images. Text in images serves several purposes, such as labeling a graph, representing genes in a heat map image, or proteins in a pathway diagram. We have previously shown that extracting image text, and making it available to image search, improves biomedical image retrieval. In this work, we are concerned with optimizing the performance of a critical step in image text extraction: locating text regions in images, which is known as text detection in studies on image processing and Optical Character Recognition (OCR).
Generally speaking, text detection is a crucial step in processing textual information in biomedical images. For example, properly finding the text regions is the first stage of a standard OCR pipeline for extracting image text. Determining the location of text is also important for high-level image content understanding, as it is the text location that indicates the meaning of a given image text element, such as whether it is the label of the x-axis or the y-axis in a graph. Practical applications aside, in this paper, we are exclusively concerned with optimizing the performance of text detection, which is a fundamental research problem in image text processing.
In this paper, we introduce a new text detection algorithm suited for biomedical images. We also discuss the methodological details in creating a gold standard biomedical image text detection corpus, and the use of the corpus for evaluating the performance of our algorithm. During the development of the corpus, we laid down clear guidelines on what exactly constitutes an image text region (or element) and how to manually mark the image region linked to the string. We then compared our algorithm against three existing state-of-the-art text detection methods. The evaluation results suggest the advantages of our algorithm for detecting text regions in biomedical images.
First, we are going to briefly look at prior work on image processing algorithms for image text detection, which is concerned with separating image text elements from other elements in an image.  presented an algorithm for text detection from scene images. In their work, they first detect character components according to gray-level differences and then match the results to standard character patterns captured in a database. Their method is very robust to the font, size and intensity variation in the image texts, but is not able to deal with color and orientation changes. To address the text detection problem for color images,  introduced a connected component-based method for locating texts in a complex color image. Their method analyzes the color histogram of the RGB space to detect text regions.  introduced a neural network based approach for identifying text in color images. To attack the text detection problem for texts with different orientations and other distortions,  describe the use of low level image features such as density and contrast to detect image texts, with the ability to deal with skew in the image text.  also proposed a morphological approach for image text detection, which is robust to the presence of noise, text orientation, skew and curvature.
There is a body of work using advanced texture and graph segmentation methods to detect text in images. For example,  introduce a method for learning texture discrimination masks for image text detection.  used a learning based approach to detect image text through image texture analysis.  introduced a system for image text detection and recognition, which adopts a multi-scale texture segmentation scheme. In their method, a collection of second-order Gaussian derivatives is used to detect candidate text regions, followed by a K-means clustering process and a multi-resolution stroke generation, filtering and aggregation process to further refine the detected text region.  proposed a graph-based image segmentation algorithm for efficiently separating textual elements from graphical elements in an image. Their algorithm can automatically adapt itself to variation in the image structure.  proposed a novel method for text detection and segmentation that uses stroke filters for text polarity assessment when analyzing features in local image regions.
There also exists a growing collection of work on text detection from videos or motion images, which is closely related to the image text detection problem studied in this paper. For example,  used a hybrid neural network and projection profile analysis based approach to detect and track text regions in a video.  applied a variety of text detection methods and then fused the individual text detection results together to achieve robust text detection for videos.  introduced a support vector machine based approach for image text detection in videos.  proposed a coarse-to-fine localization scheme for detecting texts in multilingual videos. Recently,  proposed a method based on discrete cosine transform coefficients for text detection in compressed videos. Despite the many commonalities between the video text and image text detection problems, one of the main differences between them is that frame images in a video demonstrate temporal coherence, which offers much useful information for text detection. Such clues are not present in still images, which makes the image text detection problem more challenging than its counterpart in videos.
Our study is related to other projects in biomedical image processing. For example,  used image features for text categorization.  studied the use of natural language processing to index and retrieve molecular images.  described an algorithmic system for accessing fluorescence microscopy images via image classification and segmentation.
In our own prior work, we discussed a novel approach for biomedical image search based on OCR. We have shown that the approach offers additional advantages compared to searching over image captions alone, notably the retrieval of additional and relevant images. The current study is closely linked to that project, discussing the algorithmic details for detecting image text regions.
An overview of our method is shown in Figure 1. An input image (i.e. an image from a biomedical publication) undergoes detection of layout lines and panel boundaries, which are excluded from the image to increase text detection robustness. We implement the algorithm proposed by  for detecting these layout elements. The image is then converted to black and white, and subjected to an edge detection algorithm. The resulting edge image is then subjected to a pivoting text region detection (PTD) algorithm for extraction of text regions. PTD is repeated several times, in order to divide detected text regions into text subregions. If no more text regions are detected, the algorithm exits. Our algorithm is based on traditional histogram analysis-based text region detection, which takes edge images as input. We extend the traditional approach as follows: We perform a pivoting procedure while applying the histogram analysis, and repeat the procedure until no more text (sub)regions are detected.
One of the most popular and well known text region detection methods is through analyzing the vertical and horizontal projection histograms of an image. More concretely, given an input image, we first detect the edge pixels in the image. Then a vertical and a horizontal projection histogram are derived. It is assumed that text regions generally exhibit higher density of edge pixels than non-text regions. The vertical and horizontal histograms will thus show the highest density of edge pixels in text areas. A density threshold defines the exact dimensions of the text area along the vertical and horizontal histogram. The elements of this basic procedure are discussed in more detail in the next section.
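The basic procedure can be sketched as follows. This is a minimal illustration, not the authors' C++ implementation: it assumes the edge image is already available as a row-major list of 0/1 values, and the function names and the default threshold of 1 are hypothetical choices made for the example.

```python
def projection_histograms(edges):
    """Return (row_hist, col_hist): edge-pixel counts per row and per column."""
    row_hist = [sum(row) for row in edges]
    col_hist = [sum(col) for col in zip(*edges)]
    return row_hist, col_hist


def runs_above(hist, threshold):
    """Return half-open index ranges [start, end) where hist >= threshold."""
    runs, start = [], None
    for i, value in enumerate(hist):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(hist)))
    return runs


def classical_detect(edges, threshold=1):
    """Candidate boxes (top, bottom, left, right) from whole-image histograms."""
    row_hist, col_hist = projection_histograms(edges)
    rows = runs_above(row_hist, threshold)
    cols = runs_above(col_hist, threshold)
    # Every dense row range is paired with every dense column range.
    return [(r0, r1, c0, c1) for r0, r1 in rows for c0, c1 in cols]


# Two small "text" blobs placed diagonally (assumed toy input).
edges = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
boxes = classical_detect(edges)
```

On this toy input the whole-image histograms yield four candidate boxes although only two contain edge pixels, because both histograms are computed over the full image.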
One distinct feature of many biomedical images is that they often employ a distributed and nested text layout. Figure 3.(a) and Figure 4.(a) show two typical examples, where text is distributed across many different image regions. Also, text regions often display some degree of nestedness. For example, the numbers along the x axis in Figure 4.(b) can be grouped in one large text area, or (more correctly) into separate inner text areas surrounding each individual number (Figure 4.(d)). The traditional histogram-based analysis technique does not cope well with distributed and nested text layout. To address this problem, we introduce a new iterative pivoting histogram analysis procedure for text region detection.
We introduce a pivoting step into the classical histogram-based text detection algorithm in order to account for the distributed nature of biomedical image text. The pivoting procedure subdivides an image region into its text subcomponents, instead of identifying large text blocks. Our procedure analyzes the histograms of the input image region along the vertical and horizontal directions alternately, hence the name “pivoting”. Figure 2 illustrates the key steps, and Figure 1 in Appendix B (Supplementary Files) shows the working of the algorithm on a sample image. An input image is converted into black and white and subjected to edge detection (Figure 1.d, Appendix B). To detect text areas in a specified region (the whole image in the first iteration of the procedure), we first vertically project all the edge pixels to derive the image region’s horizontal histogram (Figure 1.e, Appendix B). We then divide the horizontal histogram into several segments, each corresponding to a horizontal region in the input image, denoted as Seg1, Seg2, ···. The segments are defined by a threshold on the histogram densities. For each horizontal segment, we then derive a vertical histogram by horizontally projecting all the edge pixels in that segment. (This step differs from the traditional approach, where the horizontal projection is performed on the whole image.) The resultant vertical histogram corresponds to the horizontal segment Segi of the image (Figure 1.g, Appendix B). We then divide this vertical histogram into vertical segments using a threshold on the densities (Figure 1.g1–3, Appendix B). Each such segment corresponds to a vertical region in the input image. By pairing a vertical segment with its corresponding horizontal segment Segi, we are able to specify a rectangular region (bounding box) in the input region (Figure 1.h1–3, Appendix B), corresponding to a text region.
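A single pivoting pass might look like the sketch below. It is an illustration under stated assumptions rather than the published implementation: the edge image is a row-major list of 0/1 values, rows are segmented first and columns second (the order can be swapped by symmetry), and the names `runs_above`, `pivot_detect` and the default threshold are hypothetical.

```python
def runs_above(hist, threshold):
    """Return half-open index ranges [start, end) where hist >= threshold."""
    runs, start = [], None
    for i, value in enumerate(hist):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(hist)))
    return runs


def pivot_detect(edges, top, bottom, left, right, threshold=1):
    """One pivoting pass over rows [top, bottom) and columns [left, right).

    The second histogram is computed separately inside each segment found
    by the first histogram, not over the whole region.
    """
    row_hist = [sum(edges[r][left:right]) for r in range(top, bottom)]
    boxes = []
    for r0, r1 in runs_above(row_hist, threshold):
        # Histogram restricted to the current band of rows (assumed order).
        col_hist = [
            sum(edges[r][c] for r in range(top + r0, top + r1))
            for c in range(left, right)
        ]
        for c0, c1 in runs_above(col_hist, threshold):
            boxes.append((top + r0, top + r1, left + c0, left + c1))
    return boxes


# Two small "text" blobs placed diagonally (assumed toy input).
edges = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
boxes = pivot_detect(edges, 0, len(edges), 0, len(edges[0]))
```

Because the second histogram is restricted to each band, this toy input yields exactly the two boxes that contain edge pixels, whereas pairing whole-image histograms would also produce two empty candidates.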
In Appendix A (Supplementary Files), we formally describe this procedure mathematically.
Our algorithm iteratively constructs vertical and horizontal histograms to find nested text regions. As can be seen in Figure 1.h2, Appendix B, the first round of the PTD algorithm could not resolve the true text areas of the image region. In the image, region 1 groups distinct image text elements, and we propose to repeat the PTD step for separating these elements.
More concretely, our algorithm maintains an active local image region collection (ALIRC) during its running time (Figure 1, main paper). Initially, the collection contains a single image region, which is the full image area of the input image. The algorithm then constructs pivoting vertical and horizontal histograms (see previous section) and detects text regions. Each detected text region is regarded as a new target region and added to ALIRC. The input image region is removed from ALIRC, with one exception: if, after subtracting the text regions from the input image, the input image is nonempty, we populate ALIRC with an updated version of the input image, with the text areas subtracted. We iteratively apply our histogram-based text region segmentation procedure to all the image regions in ALIRC until no finer separation between text and non-text regions can be achieved. We then output all the image regions maintained in ALIRC. A final heuristic removes regions that are maintained in ALIRC but do not correspond to text regions. The heuristic evaluates the overall edge density, removing regions that exhibit a density that is too low or too high.
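The iteration can be sketched with a simple worklist standing in for ALIRC. This is a simplified illustration, not the authors' implementation: it omits the step that re-queues the input region with text areas subtracted, it omits the final edge-density heuristic, and all names and the threshold are hypothetical.

```python
def runs_above(hist, threshold):
    """Return half-open index ranges [start, end) where hist >= threshold."""
    runs, start = [], None
    for i, value in enumerate(hist):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(hist)))
    return runs


def pivot_detect(edges, top, bottom, left, right, threshold=1):
    """One pivoting pass; returns boxes as (top, bottom, left, right)."""
    row_hist = [sum(edges[r][left:right]) for r in range(top, bottom)]
    boxes = []
    for r0, r1 in runs_above(row_hist, threshold):
        col_hist = [
            sum(edges[r][c] for r in range(top + r0, top + r1))
            for c in range(left, right)
        ]
        for c0, c1 in runs_above(col_hist, threshold):
            boxes.append((top + r0, top + r1, left + c0, left + c1))
    return boxes


def iterative_detect(edges, threshold=1):
    """Repeat the pivoting pass until no region can be split further."""
    active = [(0, len(edges), 0, len(edges[0]))]  # worklist in the role of ALIRC
    final = []
    while active:
        region = active.pop()
        boxes = pivot_detect(edges, *region, threshold=threshold)
        if boxes == [region]:
            final.append(region)   # pass returned the region itself: done
        else:
            active.extend(boxes)   # strictly smaller subregions, or nothing
    return sorted(final)


# Nested layout (assumed toy input): the left column holds two separate
# blobs that a single pass merges, because the right column fills the gap row.
edges = [
    [1, 0, 1],
    [0, 0, 1],
    [1, 0, 1],
]
regions = iterative_detect(edges)
```

On this input the first pass returns the whole left column as one box, and the second pass splits it into its two blobs. The loop terminates because every box that is queued again is strictly smaller than the region it came from.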
In Appendix B (Supplementary Files), we show a step-by-step example of text region detection using our iterative and pivoting text detection algorithm for a biomedical image.
In this section, we will first discuss the creation of a gold standard biomedical image text detection corpus. We will then discuss our evaluation strategy to measure the performance of our iterative PTD algorithm for detecting text regions in biomedical images.
To objectively evaluate the performance of our algorithm, and to quantitatively compare the performance of our method to other peer methods, we created a gold standard corpus of biomedical images with manual markup of text regions. To create this corpus, we followed a two-step approach. The first step dealt with the identification of the text regions in the image. We set up guidelines for manual identification of text regions (image text) in biomedical images, which are listed in Table 1. The guidelines define the nature of an image text region in a biomedical image and how to handle Greek letters, other special characters, and strings in superscript or subscript. After selecting 161 random images from biomedical articles indexed in PubMed Central, we used the guidelines to identify the image text regions. In the second step, we identified a minimum rectangular region (bounding box) for each identified text region. Such a bounding box is defined as the smallest rectangular region covering all character pixels of the text region. These image bounding boxes represent the gold standard image text regions.
To evaluate the performance of our PTD text detection algorithm, we can proceed as follows: We compare the predicted text region bounding boxes with the bounding boxes of the gold standard corpus. In our study, we employ two approaches for measuring the degree of overlap between the predicted and gold standard text regions, looking at both the pixel overlap and the percentage of shared region.
One approach for measuring the overlap of two text detection results is to measure the recall, precision and F-rate as determined by shared pixels. More concretely, recall is defined as the fraction of pixels in the gold standard text area that are contained in the (algorithmically) detected text region. Precision is defined as the fraction of pixels in the detected text region that are also contained in the groundtruth text area. F-rate is defined as the harmonic mean of precision and recall, i.e. F-rate = 2 · Precision · Recall / (Precision + Recall).
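The pixel-based measures above can be computed as in the following sketch. The box representation (half-open `(top, bottom, left, right)` tuples), the function names, and the convention of returning 0 for an empty denominator are assumptions made for illustration; how the scores are aggregated across images is not specified here.

```python
def box_pixels(boxes):
    """Set of (row, col) pixels covered by half-open boxes (top, bottom, left, right)."""
    return {
        (r, c)
        for top, bottom, left, right in boxes
        for r in range(top, bottom)
        for c in range(left, right)
    }


def pixel_prf(gold_boxes, detected_boxes):
    """Pixel-level precision, recall and F-rate for one image."""
    gold = box_pixels(gold_boxes)
    detected = box_pixels(detected_boxes)
    shared = len(gold & detected)
    precision = shared / len(detected) if detected else 0.0
    recall = shared / len(gold) if gold else 0.0
    total = precision + recall
    f_rate = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_rate


# Gold box covers 8 pixels, detected box covers 8 pixels, 4 are shared.
precision, recall, f_rate = pixel_prf([(0, 2, 0, 4)], [(0, 2, 2, 6)])
```

In this example half of the detected pixels are correct and half of the gold pixels are found, so all three scores equal 0.5.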
Another intuitive measure of overlap between two text detection results is the overlapping area modulated by the reciprocal of the area of the union of the two text detection results, that is, the intersection area divided by the union area. Mathematically, this measurement can be formulated as:
\[ \mathrm{MOA} = \frac{\mathrm{Area}\left(\text{Text Region}_{\text{groundtruth}} \cap \text{Text Region}_{\text{algorithm}}\right)}{\mathrm{Area}\left(\text{Text Region}_{\text{groundtruth}} \cup \text{Text Region}_{\text{algorithm}}\right)} \]
In the above, Text Region_groundtruth stands for the text region in the gold standard corpus and Text Region_algorithm stands for the algorithmically detected text region. The operator Area(X) computes the area of the region X in pixels. The range of the Modulated Overlapping Area (MOA) measurement as defined above is between 0 and 1. When Text Region_groundtruth fully agrees with Text Region_algorithm, MOA reaches the maximum value of 1. When Text Region_groundtruth is entirely disjoint from Text Region_algorithm, MOA reaches the minimum value of 0.
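The MOA measure described above can be sketched as follows for one image. The half-open `(top, bottom, left, right)` box format, the function names, and the value returned when both regions are empty are assumptions for illustration.

```python
def box_pixels(boxes):
    """Set of (row, col) pixels covered by half-open boxes (top, bottom, left, right)."""
    return {
        (r, c)
        for top, bottom, left, right in boxes
        for r in range(top, bottom)
        for c in range(left, right)
    }


def modulated_overlapping_area(gold_boxes, detected_boxes):
    """Area of the intersection divided by the area of the union, in pixels."""
    gold = box_pixels(gold_boxes)
    detected = box_pixels(detected_boxes)
    union = len(gold | detected)
    if union == 0:
        return 0.0  # assumed convention when neither region contains pixels
    return len(gold & detected) / union


# 4 shared pixels out of 12 pixels in the union.
moa = modulated_overlapping_area([(0, 2, 0, 4)], [(0, 2, 2, 6)])
```

For the same pair of boxes, MOA is lower than the pixel-based F-rate because the union counts both the missed gold pixels and the spurious detected pixels in the denominator.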
We start with a qualitative assessment of the performance of our text detection algorithm. To this end, we provide sample images along with automatically detected text regions (Figures 3 and 4). The blue boxes outline the detected text regions, while the purple lines and areas indicate non-textual elements. A qualitative assessment of our approach is helpful for identifying the strengths and weaknesses of our algorithm. For example, we see satisfactory text detection performance in Figure 3.(b). However, two strings “the” and “number” in the bottom horizontal label of the image are mistakenly detected as one single text region “the number”. In Figure 4, we show the intermediate text detection results of two rounds of the PTD algorithm, from which we can see that our algorithm progressively refines its text detection results.
To explore the effectiveness and advantages of our approach, we also compare the performance of our algorithm with a few state-of-the-art text detection algorithms. To this end, we identified recently published algorithms for text detection, including the DCT feature based text detection method proposed by , the text particle based multi-band fusion method for text detection as proposed by , the visual saliency based and biologically inspired text detection method proposed by , and the fast text detection method proposed by . We also implemented two simplified versions of our algorithm to study the different components of our procedure. To distinguish between these different versions of our algorithm, we call the iterative text detection method introduced in Sec. 2.4 the multistep method, which is denoted as “multiple steps”. We also study the performance of our method when the number of iterations is limited to one round. We call this modification of our algorithm the one step iteration version, denoted as “one step”. Finally, we also implemented the classical histogram-based analysis without pivoting, where the vertical histogram is derived for the full image rather than for the segments from the horizontal histogram (see Sec. 2.2). We refer to this naive version as “naive”.
The results of these evaluations are shown in Table 2. We observe the following: The naive method outperforms the other peer methods in terms of F-rate and MOA. The pivoting procedure improves upon the naive version, with a performance increase of 0.045 in F-rate and 0.051 in MOA. The iterative procedure further improves upon the pivoting result, both in terms of F-rate and MOA. There is no performance increase when conducting more than 2 iterations of our algorithm.
Our evaluation showed that the iterative PTD algorithm performs well on the gold standard text detection corpus (Table 2). The naive (classical) version is outperformed by the pivoting algorithm, which computes the vertical histogram on each image text segment as determined by the horizontal histogram (Section 2.3 and Figure 2). The pivoting algorithm subdivides image text regions into subcomponents, instead of identifying large text blocks as in the naive or classical approach. This subdivision into smaller units seems to cope better with the distributed nature of the biomedical image text. The iterative application of our algorithm results in further performance gains. As discussed, iteration ensures the detection of nested image regions. As can be seen in Table 2, performance seems to stabilize after one iteration. This can be understood as follows: Biomedical images seem to contain (on average) one level of text nesting, which can be recovered by one iteration of our PTD algorithm.
We conducted an extensive comparison with existing text detection algorithms. None of the tested algorithms were able to outperform the histogram-based text detection approach. It should be noted that these algorithms are optimized for a particular text detection task, which might be different from the one encountered in biomedical images. Consequently, the performance of these algorithms as presented in the literature is higher than the numbers presented in Table 2. Our results indicate that we cannot use these algorithms on biomedical images without major modifications.
For comparison, we briefly review the performance of the tested algorithms on other image sets. In , the author reports algorithm performance for two typical settings of his algorithm: a low frequency mode and a high frequency mode. The evaluation is performed on the ICDAR-set, which is from the TrialTrain data used in the ICDAR 2003 Robust Reading Competition, see . For the low frequency mode, the average precision, recall and F-rate of his algorithm are 32.6%, 91.9%, and 43.4% respectively. For the high frequency mode, the average precision, recall and F-rate are 35.6%, 88.6%, and 45.1% respectively. It should be noted that the  algorithm performs well on our gold standard corpus in terms of recall. Precision is low, though, indicating many false positive calls.
 evaluated their method on the Location Detection Database of the ICDAR 2003 Robust Reading Competition Dataset, see . The precision, recall and F-rate on the dataset are 60%, 81%, and 69% respectively. Finally,  reported the performance of their algorithm on an image set consisting of 308 images from the Web, recorded broadcast videos, and digital videos. They reported a recall and accuracy of 91.1% and 95.8% respectively.
Biomedical image search and mining is becoming an increasingly important topic in biomedical informatics. Accessing the biomedical literature via image content is complementary to text-based search and retrieval. A key element in unlocking biomedical image content is to detect and extract (via OCR) text from biomedical images, and to make the text available for image search. In this paper, we are concerned with text detection, i.e. finding the precise areas of image text elements. We propose a new text detection algorithm which is well suited for this purpose. The key feature of our algorithm is that it searches for text regions in a pivoting and iterative fashion. The pivoting procedure allows for recovery of distributed image text, and the iterative procedure uncovers nested image information. We believe that these two algorithm features are crucial for detecting text in biomedical images.
Funding This research has been funded by NLM grant 5K22LM009255 and NLM grant 1R01LM009956.