

Article sections

- Abstract
- 1 Introduction to Second Revision
- 2 Introduction
- 3 Grouping with Factors
- 4 Belief Propagation for Factor Graphs
- 5 Application: Finding Text
- 6 Experimental Results
- 7 Discussion
- References

Image Vis Comput. Author manuscript; available in PMC 2010 June 4.

Published in final edited form as:

Image Vis Comput. 2009 June 4; 27(7): 854–863.

doi: 10.1016/j.imavis.2009.02.006 | PMCID: PMC2755638

NIHMSID: NIHMS99067

Smith-Kettlewell Eye Research Institute, San Francisco, CA 94115, USA

Foreground-background segmentation has recently been applied [26,12] to the detection and segmentation of specific objects or structures of interest from the background as an efficient alternative to techniques such as deformable templates [27]. We introduce a graphical model (i.e. Markov random field)-based formulation of structure-specific figure-ground segmentation based on simple geometric features extracted from an image, such as local configurations of linear features, that are characteristic of the desired figure structure. Our formulation is novel in that it is based on factor graphs, which are graphical models that encode interactions among arbitrary numbers of random variables. The ability of factor graphs to express interactions higher than pairwise order (the highest order encountered in most graphical models used in computer vision) is useful for modeling a variety of pattern recognition problems. In particular, we show how this property makes factor graphs a natural framework for performing grouping and segmentation, and demonstrate that the factor graph framework emerges naturally from a simple maximum entropy model of figure-ground segmentation.

We cast our approach in a learning framework, in which the contributions of multiple grouping cues are learned from training data, and apply our framework to the problem of finding printed text in natural scenes. Experimental results are described, including a performance analysis that demonstrates the feasibility of the approach.

We thank the reviewers for the comments we received in Oct. 2008. Below we describe our responses to each comment (with the original reviewer comments in italics).

My only comment is to present images with more brightness (not so dark). Please correct the citation error in the second paragraph of page 14. Also add a comment/reference on the usefulness of Chernoff information to quantify discriminability between Pon and Poff at the end of section 5.2.

We have complied with all three requests.

Overall, I feel that the paper would benefit from more fully developed experiments showing the impact the method could have on vision in general and on text finding in particular. The experiments were performed on 71 training and 20 test images only. The scale of these experiments should be increased.

In our experiments we have now doubled the number of test images from 20 to 40, and we now show more results in the figures. In addition, we would like to point out that the authors have recently used the same figure-ground approach (also using factor graphs) for finding crosswalks in street intersection scenes, with excellent results (see [9]).

In addition, the paper should be compared against at least one alternative approach. Since only the performance of the proposed work is given in Section 6, it is hard to judge the true positive and false positive rates.

Unfortunately, it is difficult to directly compare our experimental results with those of other text-finding methods since source code (or executables) for other methods was not available. We do not claim that our approach in its current form would be competitive with other methods; rather we claim that our approach is novel in its reliance on geometric features and its formulation in terms of factor graphs, that the approach generalizes to texture-like patterns other than text (such as crosswalks) and that our experiments demonstrate feasibility. We have added some details on how the algorithm could be improved in Sec. 7.

Also, I think that it helps the reader if the paper explains the differences between the proposed work and ref[24] and ref[18] before Section 3.1. Currently, it is written that the proposed work is inspired by these approaches, but the novelty that the paper brings over these references should be justified here.

We have added a description of how our work differs from those references in Sec. 3.1.

The format of the references should be consistent. For instance, "CVPR" is added after ref. [1] but not after ref. [3] or ref. [6]. And in ref. [10], the name of the conference was not added (just the short form is used). Similarly, while "Proceedings" is spelled out in some references, "Proc." is used in others. More importantly, some references start with the last name of the authors but some with the first initial and then the last name. They need to be fixed.

We have standardized the format of the references.

Originally proposed as a generic process for segmenting the foreground of a scene from the background, figure-ground (foreground-background) segmentation has recently been successfully applied [26,12] to the detection and segmentation of specific objects or structures (i.e. targets) of interest from the background. Standard techniques such as deformable templates [27] are poorly suited to finding some targets, such as printed text, stripe patterns, vegetation or buildings, particularly when the targets are regular (e.g. quasi-periodic) or texture-like structures with widely varying extent, shape and scale.

In these cases it seems more appropriate to group target features into a common foreground class (at least as an initial segmentation step to precede further processing), rather than directly attempt to find a detailed correspondence between a prototype and the target in the image, as is typically done with deformable template and shape matching techniques.

Our graphical model-based approach to figure-ground segmentation is an outgrowth of earlier work on figure-ground segmentation applied to finding crosswalks in traffic intersections [5] and builds on more recent work in [19] and [9]. The approach emphasizes the use of the *geometric* relationships of features extracted from an image as a means of grouping the target features into the foreground. In contrast with related MRF techniques [28] for classifying individual image patches into small numbers of categories, our approach seeks to make maximal use of geometric features extracted from images, rather than raw pixel information. Geometric information is generally more intuitive to understand than filter-based feature information, and it may also be more appropriate when lighting conditions are highly variable.

We formulate our approach in the general case of figure-ground segmentation and apply it to the problem of finding printed text in natural scenes. Experimental results are described, including a performance analysis that demonstrates the feasibility of the approach.

In this paper we describe the use of factor graphs as a natural framework for grouping (i.e. segmenting) features in an image. Any grouping process analyzes relationships among features in deciding how to group them; these relationships are interactions among features that reflect their compatibility for inclusion in the same group. In much past work on grouping and segmentation, such as normalized cuts [22,26] and graphical-model based typical cuts [20] (all of which inspired our approach), these interactions are pairwise measures of the similarity (or affinity) of two features. Our approach is similar to that of [20] in that it constructs a Markov random field with one binary-valued (0 or 1) node variable for each feature to be grouped, and uses belief propagation to decide how to group the features (two features are assigned to the same group if the probability that they are jointly assigned the same binary label is above a threshold); however, their approach imposes a fundamental symmetry between the two possible states of each node variable, whereas in our approach the two possible states represent figure (state 1) or ground (state 0), which are not symmetric. The object-specific figure-ground technique in [26] imposes the same type of figure-ground asymmetry as our work does, but it was applied to specific targets such as telephones and mugs, and it is unclear how it would generalize to texture-like patterns with highly variable numbers of elements.

The above techniques all use *pairwise* measures of the similarity (or affinity) of two features. However, many clustering problems necessitate the use of higher-order interactions; for instance, the problem of grouping points on a 2-D plane into lines requires an interaction defined on triplets of points, since every pair of points is trivially collinear. Some recent work [1] has investigated hypergraph partitioning techniques for handling these higher-order interactions.

Factor graphs [11] are graphical models designed to express interactions of any order (generalizing the pairwise interactions often used in graphical models, i.e. Markov random fields), and may be used for formulating simple and efficient grouping algorithms. We apply this formulation to the problem of object-specific figure-ground segmentation, which is how we cast the problem of text detection.

In the next section we introduce factor graphs, and in subsequent sections we describe how a particular functional form of factor graph appropriate for figure-ground segmentation is suggested by a maximum entropy model of grouping.

Factor graphs [11] are representations of probability distributions of many variables that can be expressed as a product of several factors, where each factor is an explicit interaction among a subset of these variables. (For a comprehensive treatment of factor graphs, see Chapter 8 of [2], which can be downloaded at http://research.microsoft.com/~cmbishop/PRML/.) Fig. 1 shows an example of a factor graph depicted in a graphical format. Each square node represents a factor, or interaction, among one or more variables, depicted by circles, and the topology of the factor graph indicates how the joint distribution of all variables factors.

Factor graph. This graph represents a distribution on four variables *w*,*x*,*y*,*z* (drawn as circles) using four factors *f*, *g*, *h*, *i* (drawn as squares). Edges connect factors with the variables they influence. The joint distribution represented by this factor **...**

The *arity* of a factor is the number of variables that interact in the factor. An arity-1 factor is called a *unitary* factor, and an arity-2 factor is sometimes referred to as a *pairwise interaction*. For the example shown in Fig. 1, factor *h* is arity-1, factor *i* is arity-2 and factors *f* and *g* are arity-3. In our figure-ground segmentation application, one factor arises from each constraint (measurement), and its arity equals the number of features (nodes) included in the measurement constraint.
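To make the factorization concrete, here is a minimal sketch in Python of a factor graph with the arities stated above (one unitary factor *h*, one pairwise factor *i*, and two arity-3 factors *f* and *g*). The particular variable assignments and factor values are hypothetical, chosen only to illustrate how the joint distribution is the normalized product of the factors.

```python
from itertools import product

# Hypothetical factor graph on four binary variables (w, x, y, z), mirroring
# the arities stated in the text: h is arity-1 (unitary), i is arity-2
# (pairwise), and f and g are arity-3. Variable assignments are illustrative.
factors = {
    "f": (("w", "x", "y"), lambda w, x, y: 1.0 + w * x * y),        # arity-3
    "g": (("x", "y", "z"), lambda x, y, z: 1.0 + 2.0 * x * y * z),  # arity-3
    "h": (("w",),          lambda w: 2.0 if w else 1.0),            # arity-1
    "i": (("y", "z"),      lambda y, z: 1.5 if y == z else 1.0),    # arity-2
}
variables = ("w", "x", "y", "z")

def unnormalized(assignment):
    """Product of all factors evaluated at one joint assignment."""
    p = 1.0
    for vars_, func in factors.values():
        p *= func(*(assignment[v] for v in vars_))
    return p

# Normalize over all 2^4 joint states to obtain the joint distribution.
states = [dict(zip(variables, bits)) for bits in product([0, 1], repeat=4)]
Z = sum(unnormalized(s) for s in states)
joint = {tuple(s[v] for v in variables): unnormalized(s) / Z for s in states}

print(sum(joint.values()))  # 1.0 up to rounding
```

The topology (which variables each factor touches) is all that is needed to evaluate the joint, which is why the graphical depiction of Fig. 1 fully specifies the factorization.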

The factor graph framework is an extremely general framework that is convenient for representing a large variety of probabilistic models. As we will see in a later section, inference on factor graphs is made tractable by approximate techniques such as belief propagation, which we use in our text-finding algorithm.

In this section we discuss the theoretical motivation for the functional form of the model we use for figure-ground segmentation, which is inspired from a simple maximum entropy probability model. As we will see, this functional form is naturally described using a factor graph.

The motivation for our maximum entropy model is that, given a collection of features to be segmented into figure or ground, evidence for how to assign (segment) features to figure or ground arises from considering *relationships among two or more features*. (Depending on the nature of the segmentation task, there may also be evidence for *individual* features belonging to figure or ground that is independent of other features.) The goal is to combine noisy evidence pertaining to groups of features in such a way that we can decide the likelihood that any *individual* feature should be assigned to figure or ground. Our approach is to construct a joint probability distribution of the assignments of all the features that is consistent with the evidence from groups of features. The joint distribution is chosen to be the least biased distribution that is consistent with the evidence; we enforce this “minimal-bias” criterion using a maximum entropy approach [17].

We now present the details of our maximum entropy model of figure-ground segmentation. We are given a collection of *n* features (i.e. nodes) in an image, each of which has an unknown assignment to figure or ground, denoted by *x*_{i} ∈ {0, 1} for *i* = 1, …, *n*, where *x*_{i} = 1 denotes figure and *x*_{i} = 0 denotes ground.

We can regard the probability measure associated with each feature subset as a measurement that constrains the joint distribution of all nodes given the measurement data, which yields the posterior distribution *P*(*x*_{1}, *x*_{2}, …, *x*_{n}|*data*).

We demonstrate this fact with an illustrative example (see Fig. 2); it is straightforward to extend the analysis to an arbitrary collection of features and feature subsets. In this example there are four nodes, *x*_{1}, *x*_{2}, *x*_{3}, *x*_{4}, and two subset constraints: *P*(*x*_{1} = 1, *x*_{2} = 1, *x*_{3} = 1|*data*) = *p*_{1} and *P*(*x*_{3} = 1, *x*_{4} = 1|*data*) = *p*_{2}. The joint distribution we are seeking, *P*(*x*_{1}, *x*_{2}, *x*_{3}, *x*_{4}|*data*), has entropy equal to −Σ_{X} *P*(*X*|*data*) log *P*(*X*|*data*). Maximizing this entropy subject to the two subset constraints and the normalization constraint yields the Lagrangian:

$$\begin{array}{l}E=-\sum _{X}P(X\mid \mathit{data})logP(X\mid \mathit{data})\\ +{\lambda}_{1}(\sum _{{x}_{4}}P({x}_{1}=1,{x}_{2}=1,{x}_{3}=1,{x}_{4}\mid \mathit{data})-{p}_{1})\\ +{\lambda}_{2}(\sum _{{x}_{1},{x}_{2}}P({x}_{1},{x}_{2},{x}_{3}=1,{x}_{4}=1\mid \mathit{data})-{p}_{2})\\ +\tau (\sum _{X}P(X\mid \mathit{data})-1)\end{array}$$

(1)

Example of probability measure applied to subsets of features (i.e. candidate groupings). Each feature (node) is depicted by a circle, labeled 1 through 4; the two ovals depict candidate groupings of features. Values *p*_{1} and *p*_{2} denote the probability that **...**

The Lagrange multipliers *λ*_{1} and *λ*_{2} enforce the two subset constraints, and *τ* enforces the fact that *P*(*X*|*data*) is a normalized distribution. The partial derivative ∂*E*/∂*P*(*X*|*data*) is:

$$\partial E/\partial P(X\mid \mathit{data})=-logP(X\mid \mathit{data})-1+{\lambda}_{1}{x}_{1}{x}_{2}{x}_{3}+{\lambda}_{2}{x}_{3}{x}_{4}+\tau $$

(2)

where we notice that *x*_{1}*x*_{2}*x*_{3} = 1 only when *x*_{1} = 1, *x*_{2} = 1 and *x*_{3} = 1. If we set ∂*E*/∂*P*(*X*|*data*) = 0, then we get the following solution:

$$P(X\mid \mathit{data})=\frac{1}{Z}{e}^{{\mu}_{1}{x}_{1}{x}_{2}{x}_{3}}{e}^{{\mu}_{2}{x}_{3}{x}_{4}}$$

(3)

The key point to notice is that *P*(*X*|*data*) contains one factor (term) for each constraint, which in this case we label as *f*(*x*_{1}, *x*_{2}, *x*_{3}) = *e*^{*μ*_{1}*x*_{1}*x*_{2}*x*_{3}} and *g*(*x*_{3}, *x*_{4}) = *e*^{*μ*_{2}*x*_{3}*x*_{4}}, where the constants *μ*_{1} and *μ*_{2} are determined by the constraint probabilities *p*_{1} and *p*_{2}.

This model is naturally formulated as a factor graph, since the joint posterior probability *P*(*X*|*data*) is expressible as a product of the factors *f*(.) and *g*(.) (neglecting the normalization constant *Z* that is independent of *X*). In the next section we will develop a simpler version of the model, whose parameters can be easily learned from training data. Factor BP will then be used to determine how to label individual features (figure or ground), which we describe in Section 4.
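As a numerical sanity check of this maximum entropy argument, the sketch below fits the least-biased distribution for the four-node example by iterative proportional scaling (which converges to the maximum entropy solution for consistent constraints), and verifies that the result depends on *X* only through the products *x*_{1}*x*_{2}*x*_{3} and *x*_{3}*x*_{4} — exactly the factored form of Eq. 3. The constraint values *p*_{1} and *p*_{2} are arbitrary numbers chosen for illustration.

```python
from itertools import product

# Start from the uniform distribution on (x1, x2, x3, x4) and repeatedly
# rescale to enforce the two subset constraints of Sec. 3. Iterative
# proportional fitting of this kind converges to the maximum entropy
# distribution, which by Eq. 3 should depend on X only through the
# indicator products x1*x2*x3 and x3*x4.
p1, p2 = 0.3, 0.6   # arbitrary constraint values for illustration
states = list(product([0, 1], repeat=4))
P = {X: 1.0 / 16 for X in states}

for _ in range(2000):
    # Enforce P(x1=1, x2=1, x3=1 | data) = p1.
    q = sum(p for X, p in P.items() if X[0] * X[1] * X[2] == 1)
    for X in states:
        P[X] *= (p1 / q) if X[0] * X[1] * X[2] == 1 else ((1 - p1) / (1 - q))
    # Enforce P(x3=1, x4=1 | data) = p2.
    q = sum(p for X, p in P.items() if X[2] * X[3] == 1)
    for X in states:
        P[X] *= (p2 / q) if X[2] * X[3] == 1 else ((1 - p2) / (1 - q))

# The fitted distribution is constant on each set of states sharing the
# same feature values (x1*x2*x3, x3*x4): the product form of Eq. 3.
groups = {}
for X, p in P.items():
    groups.setdefault((X[0] * X[1] * X[2], X[2] * X[3]), []).append(p)
for ps in groups.values():
    assert max(ps) - min(ps) < 1e-9
```

Each rescaling multiplies *P*(*X*) by a quantity that depends only on one of the two indicator products, so the fitted distribution is automatically of the form (1/*Z*) *e*^{*μ*_{1}*x*_{1}*x*_{2}*x*_{3}} *e*^{*μ*_{2}*x*_{3}*x*_{4}}.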

In the previous subsection, all measurements were marginal probabilities that certain features (nodes) all belonged to figure. However, it is rare that we are given measurements that give direct access to such marginals – instead, it is much more likely that we have various *cues* that indirectly reflect such marginal probabilities. For instance, if we want to find those features in an image that group into a (roughly) straight line, an important cue will be the degree of collinearity among triplets of features. Thus, we will use the same functional form of the maximum entropy model in Sec. 3.2, given by Eq. 3, to construct a simpler model that uses arbitrary cues rather than direct measurements of marginal probabilities.

We now express this simpler model for the concrete example given in Sec. 3.2. Two cues are measured in this example, *C*_{123} and *C*_{34}, which are scalar quantities that reflect the likelihood that the corresponding node subsets {1, 2, 3} and {3, 4} should be grouped into the figure. (Note that these are arbitrary cues, *not* direct measurements of marginal probabilities denoted by *p*_{1} and *p*_{2} in the previous section.) Then, if we assume conditional independence of the two measured cues, we have the following expression for the joint likelihood:

$$P(\mathit{data}\mid X)=P({C}_{123}\mid X)P({C}_{34}\mid X)=P({C}_{123}\mid {x}_{1},{x}_{2},{x}_{3})P({C}_{34}\mid {x}_{3},{x}_{4})$$

(4)

Following the form of Eq. 3, we will impose the requirement that the individual likelihood functions *P*(*C*_{123}|*x*_{1}, *x*_{2}, *x*_{3}) and *P*(*C*_{34}|*x*_{3}, *x*_{4}) will each depend only on whether the states they are conditioned on are all figure or not. In other words, *P*(*C*_{123}|*x*_{1}, *x*_{2}, *x*_{3}) = *P*_{on}(*C*_{123}) if *x*_{1}*x*_{2}*x*_{3} = 1 and *P*_{off}(*C*_{123}) otherwise, and similarly *P*(*C*_{34}|*x*_{3}, *x*_{4}) = *P*_{on}(*C*_{34}) if *x*_{3}*x*_{4} = 1 and *P*_{off}(*C*_{34}) otherwise.

Having defined the likelihood functions *P*_{on}(.) and *P*_{off}(.), we have fully specified the likelihood of the data given the figure-ground assignments.

The model is used to perform inference by estimating the MAP (maximum a posteriori) solution, where the posterior is given by Bayes' rule:

$$P(X\mid \mathit{data})=P(X)P(\mathit{data}\mid X)/P(\mathit{data})$$

(5)

Note that maximizing the posterior with respect to *X* is equivalent to maximizing *P*(*X*)*P*(*data*|*X*), since *P*(*data*) is independent of *X*.

As before, *X* = (*x*_{1}, *x*_{2}, …, *x*_{N}), where *x*_{i} ∈ {0, 1} denotes the figure-ground assignment of feature *i*. Assuming one cue *C*_{ijk} is measured for each triple of features, and that the cues are conditionally independent given *X*, the quantity to be maximized is:

$$P(X)P(\mathit{data}\mid X)=P(X)\prod _{(\mathit{ijk})}P({C}_{\mathit{ijk}}\mid {x}_{i},{x}_{j},{x}_{k})$$

(6)

where Π_{(}_{ijk}_{)} denotes a product over all feature triples such that *i* < *j* < *k*.

Note that Π_{(ijk)} *P*(*C*_{ijk}|*x*_{i}, *x*_{j}, *x*_{k}) may be rewritten as:

$$[\prod _{(\mathit{ijk})}{P}_{\mathit{off}}({C}_{\mathit{ijk}})][\prod _{(\mathit{ijk}):{x}_{i}{x}_{j}{x}_{k}=1}{P}_{on}({C}_{\mathit{ijk}})/{P}_{\mathit{off}}({C}_{\mathit{ijk}})]$$

(7)

where the restriction *x*_{i}*x*_{j}*x*_{k} = 1 in the product ensures that only those triples whose nodes all belong to figure are included. Since the term Π_{(ijk)} *P*_{off}(*C*_{ijk}) is a constant independent of *X*, maximizing the posterior is equivalent to maximizing:

$$R(X)=P(X)\prod _{(\mathit{ijk}):{x}_{i}{x}_{j}{x}_{k}=1}{P}_{on}({C}_{\mathit{ijk}})/{P}_{\mathit{off}}({C}_{\mathit{ijk}})$$

(8)

*R*(*X*) is proportional to the posterior distribution (i.e. it is unnormalized, and the constant of proportionality is the normalization factor, which depends on the data). Assuming a simple i.i.d. prior, such as
$P(X)={\prod}_{i=1}^{N}{P}_{i}({x}_{i})$, where *P*_{i}(*x*_{i}) denotes the prior probability that feature *i* is assigned to figure (*x*_{i} = 1) or ground (*x*_{i} = 0), the MAP estimate of *X* is the maximizer of *R*(*X*).

Taking logarithms, an equivalent way of estimating the MAP is to maximize the following function:

$$logR(X)=\sum _{i}log{P}_{i}({x}_{i})+\sum _{(\mathit{ijk})}{x}_{i}{x}_{j}{x}_{k}log[{P}_{on}({C}_{\mathit{ijk}})/{P}_{\mathit{off}}({C}_{\mathit{ijk}})]$$

(9)

where the product *x _{i}x_{j}x_{k}* ensures that the sum is taken only over those triples whose nodes all belong to figure. We will use factor graph BP to estimate the MAP of this function, as described in Sec. 4.
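For small problems the objective of Eq. 9 can be evaluated directly. The sketch below scores every labeling of a five-node toy example by brute force; the prior and the per-triple log-likelihood ratios are made-up numbers for illustration (in the paper they are learned from training data).

```python
from itertools import combinations, product
from math import log

# Brute-force evaluation of log R(X) (Eq. 9) for a toy problem with
# hypothetical parameters. Positive log[P_on/P_off] values favor grouping
# a triple into the figure; unlisted triples get a negative default score.
N = 5
prior_fig = [0.2] * N                      # P_i(x_i = 1), i.i.d. prior
L = {(0, 1, 2): 6.0, (1, 2, 3): 5.0}       # log[P_on(C_ijk)/P_off(C_ijk)]
default_L = -1.0

def log_R(X):
    s = sum(log(prior_fig[i]) if X[i] else log(1 - prior_fig[i])
            for i in range(N))
    for ijk in combinations(range(N), 3):
        if all(X[i] for i in ijk):          # x_i * x_j * x_k = 1
            s += L.get(ijk, default_L)
        # triples not entirely in figure contribute nothing (Eq. 9)
    return s

# Exhaustive MAP over all 2^N labelings (feasible only for tiny N).
best = max(product([0, 1], repeat=N), key=log_R)
print(best)  # (1, 1, 1, 1, 0): nodes 0-3 are figure, node 4 is ground
```

With these numbers the two strong triples outweigh the low figure prior for nodes 0 through 3, while node 4, which participates in no strong triple, stays in the ground. Factor BP (Sec. 4) replaces this exponential search with an iterative message-passing estimate.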

As a simple example of a factor graph for figure-ground segmentation, consider the problem of segmenting a collection of points in an image (Fig. 4(a)), where the figure consists of points that lie on smooth, nearly straight lines, and the ground consists of all other points. We will define factors that express the degree of collinearity among points; since any two points define a line, we will construct arity-3 factors with three points, which are the smallest factors for which collinearity is a meaningful concept. In this example, given point features *i*, *j* and *k* in the image, we define the collinearity *C*_{ijk} of the points as follows:

$${C}_{\mathit{ijk}}=\mid \frac{({\overrightarrow{p}}_{1}-{\overrightarrow{p}}_{2})\cdot ({\overrightarrow{p}}_{2}-{\overrightarrow{p}}_{3})}{\parallel {\overrightarrow{p}}_{1}-{\overrightarrow{p}}_{2}\parallel \phantom{\rule{0.16667em}{0ex}}\parallel {\overrightarrow{p}}_{2}-{\overrightarrow{p}}_{3}\parallel}\mid $$

(10)

where ${\overrightarrow{p}}_{1}$, ${\overrightarrow{p}}_{2}$ and ${\overrightarrow{p}}_{3}$ denote the image coordinates of features *i*, *j* and *k*, respectively. *C*_{ijk} is the magnitude of the cosine of the angle between the two segments joining consecutive points, so it is close to 1 when the points are nearly collinear and close to 0 when they are not.
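A direct implementation of Eq. 10 is only a few lines; the helper below is a sketch (the function name is ours, not from the paper):

```python
from math import hypot

def collinearity(p1, p2, p3):
    """C_ijk from Eq. 10: absolute cosine of the angle between the
    segments p1->p2 and p2->p3; near 1 for nearly collinear points."""
    ux, uy = p1[0] - p2[0], p1[1] - p2[1]
    vx, vy = p2[0] - p3[0], p2[1] - p3[1]
    return abs((ux * vx + uy * vy) / (hypot(ux, uy) * hypot(vx, vy)))

print(collinearity((0, 0), (1, 0), (2, 0)))  # exactly collinear -> 1.0
print(collinearity((0, 0), (1, 0), (1, 1)))  # right angle -> 0.0
```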

Simple factor graph example. (a) Collection of points in an image to be segmented. (b) Points segmented as figure are shown as red circles.

We measure the distribution of *C*_{ijk} from a database of training examples in which the figure-ground labels are known: histogramming the cue values over triples of figure points yields an estimate of *P*_{on}(*C*_{ijk}), and histogramming over the remaining triples yields an estimate of *P*_{off}(*C*_{ijk}).

Finally, we estimate an i.i.d. prior on *X* by empirically counting what proportion of points in the training database belong to figure (see the discussion of the prior just before Eq. 9).

Having learned *P*_{on} and *P*_{off}, as well as the prior, we can evaluate the objective log *R*(*X*) in Eq. 9 for any candidate labeling *X*.

Fig. 4 shows a schematic example of a figure-ground segmentation performed using a simple model of this form. The maximizer of log *R*(*X*) was estimated using factor BP (belief propagation), which is discussed in Sec. 4. Fig. 4(a) shows the point features input to the algorithm and Fig. 4(b) shows the output, with figure points drawn as red circles (all the rest are classified as ground).

Belief propagation (BP) is a standard procedure [25,2] for performing inference on graphical models such as MRFs and factor graphs. A graphical model specifies the joint distribution of all the variables in the model; a fundamental challenge in graphical models is to find the most likely values of all the variables (e.g. the MAP estimate). A naive way of accomplishing this is to exhaustively search over all combinations of variables, and to find the combination which is assigned the highest global probability by the graphical model. Such an approach is clearly intractable because of the exponentially large number of such combinations that must be considered. BP is a fast iterative procedure for estimating the marginal properties of individual variables, which may be used to estimate the most likely combination of all variables. Specifically, BP estimates either the marginal probabilities of each variable or some sort of score indicating which states are most likely for each variable.

The main idea behind BP is that different factors and variables “talk” to each other at each iteration, passing “messages” to their neighbors with their estimates of the neighbors’ likely states. After enough iterations, this series of “conversations” is likely to converge to a consensus, at which time the marginal estimates are fully determined. While BP provides only approximate solutions in general, it converges in a finite number of iterations to the exact solution when the graphical model has no loops (i.e. cycles), which is the case for graphical models representing Markov chains or any graphical models whose connectivity is tree-shaped. Moreover, empirical studies have established [15] that in practice, BP often converges to good (if somewhat sub-optimal) solutions even in the presence of loops.

While most work in computer vision using BP has been done for pairwise MRFs [23,4,21,7], BP is readily extended to factor graphs [11]. Here we present a brief overview of factor graph BP, using notation similar to that of [11]. (A good introduction to factor graph BP is contained in Chapter 8 of [2], which can be downloaded at http://research.microsoft.com/~cmbishop/PRML/.) We describe the "max-product" version of factor BP, which provides scores indicating which states are most likely for each variable, rather than the "sum-product" version, which estimates marginal probabilities for each variable. Not only does max-product BP provide a direct estimate of the MAP, but it is also computationally more efficient.

In factor BP, messages are sent from factors to variables, indicating (at any iteration) the current estimate of the likelihood of each state for that variable according to the factor. (In standard treatments of BP [11] there are also messages sent from variables to factors, but since these are in turn expressed in terms of the opposite messages – from factors to variables – we will substitute these expressions anywhere the messages from variables to factors appear.) The fundamental operation of BP is the *message update equation*, which defines how to update messages at each iteration, and which one iterates enough times until the messages converge (though there is no guarantee of convergence in loopy graphs). Then the messages are used to calculate the *beliefs*, which provide scores indicating which states are most likely for each variable.

We express the max-product factor BP algorithm in the log domain, where it is understood that factor functions here are the log of the factors defining the joint probability of the factor graph, as in Sec. 3.3. In this domain, the message update equation is as follows:

$${m}_{f\to x}(x)\leftarrow \underset{\sim \{x\}}{max}\left(f(X)+\sum _{y\in n(f)\backslash \{x\}}\phantom{\rule{1em}{0ex}}\sum _{h\in n(y)\backslash \{f\}}{m}_{h\to y}(y)\right)$$

(11)

where *X* is the set of arguments of function *f*, and ~ {*x*} denotes the set of all arguments of *f* except for *x*, which we term the set of “siblings” of *x* under factor *f* (see Fig. 5(a)). Also, *n*(*f*) denotes all neighboring variables of factor *f* (i.e. all variables directly influenced by *f*), and *n*(*y*) denotes all neighboring factors of variable *y* (i.e. all factors that directly influence *y*).

(a) The siblings of a node variable *x* under factor *f* are the other variables included in the factor, in this case *y* and *z*. (b) The message update *m*_{*f*→*x*}(*x*), indicated by the solid line with the arrow, is calculated by summing all messages **...**

The message update equation is illustrated in Fig. 5(b). The first sum in the update equation is over all siblings of variable *x*. For each sibling *y*, the second sum includes all messages from factors *other than f* that flow into the sibling.

Note that, for each factor *f* and neighboring variable *x*, updating the message *m*_{*f*→*x*}(*x*) requires an amount of computation that grows exponentially with the arity *M* of the factor.

This is because Eq. 11 must be iterated for each value of *x* on the left-hand side, and for each value of *x* the max must be evaluated over the remaining *M* − 1 sibling variables ~ {*x*}.
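The sketch below implements the log-domain update of Eq. 11 by brute force for binary variables, on a small hypothetical factor graph; the exponential cost in the factor arity *M* is visible in the enumeration over all sibling configurations.

```python
from itertools import product

# Brute-force max-product update (Eq. 11) for binary variables. The factor
# graph below is a made-up example; factor functions are log-domain, i.e.
# the log of the factors defining the joint product.
factors = {
    "f": (("x", "y", "z"), lambda x, y, z: 1.2 * x * y * z),
    "g": (("y", "z"),      lambda y, z: -0.5 * y * z),
    "h": (("x",),          lambda x: 0.3 * x),
}
neighbors = {}                       # n(x): factors neighboring variable x
for fname, (vars_, _) in factors.items():
    for v in vars_:
        neighbors.setdefault(v, []).append(fname)

# Messages m_{f->x}(x), initialized to 0 as is standard in max-product BP.
msg = {(f, v): [0.0, 0.0] for f, (vs, _) in factors.items() for v in vs}

def update(fname, x):
    """One application of Eq. 11 for message m_{f->x}."""
    vars_, func = factors[fname]
    sibs = [v for v in vars_ if v != x]          # ~{x}: siblings of x under f
    for xval in (0, 1):
        best = float("-inf")
        for sibvals in product((0, 1), repeat=len(sibs)):   # max over ~{x}
            assign = dict(zip(sibs, sibvals), **{x: xval})
            total = func(*(assign[v] for v in vars_))
            # Sum messages into each sibling from factors other than f.
            for y, yval in zip(sibs, sibvals):
                total += sum(msg[(h, y)][yval]
                             for h in neighbors[y] if h != fname)
            best = max(best, total)
        msg[(fname, x)][xval] = best

# One sweep: update every message once (asynchronously, in place).
for fname, (vars_, _) in factors.items():
    for v in vars_:
        update(fname, v)

# Beliefs (Eq. 12): b(x) = sum of incoming messages.
belief = {v: [sum(msg[(f, v)][s] for f in neighbors[v]) for s in (0, 1)]
          for v in neighbors}
print(belief)
```

After one sweep the arity-1 message from *h* is [0, 0.3], matching the form of Eqs. 14-15, and the messages from the arity-3 factor have the max(0, ·) form of Eqs. 16-17.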

The order in which messages are updated defines the *message update schedule.* In an asynchronous schedule (which is commonly used on serial computers), one message (i.e. specified by one factor and one variable) is updated at a time according to Eq. 11 at each iteration. An entire message “sweep” refers to a sequence of message updates such that every possible message in the graph is updated once. Generally speaking, message convergence requires several sweeps, depending on the connectivity of the factor graph and the precise nature of the probabilities encoded by it.

Because of a simplification that arises from our figure-ground application, we use instead a fully synchronous schedule, in which all messages are updated in parallel (see Sec. 4.1 for details).

Once the messages have converged to some value (which, in general, we can only hope happens after enough message updates), we can calculate the belief function for each node:

$$b(x)=\sum _{f\in n(x)}{m}_{f\to x}(x)$$

(12)

In max-product BP, the belief is a function with the following property: the state that maximizes the belief of a node is an estimate of the node’s state in the most likely global configuration of states across the entire graphical model (i.e. the MAP estimate if the graphical model is interpreted as representing a posterior distribution).

In this section we consider the behavior of factor BP for the case of a factor graph we have devised for performing figure-ground segmentation. Since we are working in the log domain, our factor graph represents the log posterior (unnormalized) given in Eq. 9. We re-express this equation as follows:

$$logR(X)=G\sum _{i}{x}_{i}+\sum _{(\mathit{ijk})}{x}_{i}{x}_{j}{x}_{k}L({C}_{\mathit{ijk}})$$

(13)

where we define *L*(*C*_{ijk}) = log[*P*_{on}(*C*_{ijk})/*P*_{off}(*C*_{ijk})] and *G* = log[*P*_{i}(1)/*P*_{i}(0)] is the prior log-odds that an individual feature belongs to figure (assumed identical for all features), with a constant term independent of *X* dropped.

Let us assume that we initialize all messages to 0, as is standard in max-product BP. We now analyze a synchronous (parallel) message update schedule applied to this initial condition. It is straightforward in this case to determine the message values after one sweep (i.e. each message has been updated exactly once). For all messages flowing from arity-1 factors (e.g. corresponding to the prior), the expression is:

$${m}_{f\to x}^{(1)}(x=1)=G$$

(14)

$${m}_{f\to x}^{(1)}(x=0)=0$$

(15)

since the variables receiving messages from these factors have no siblings. Here the superscript (1) indicates the value of the messages after one sweep.

All other messages (i.e. greater than arity-1) have the following form after one sweep:

$${m}_{f\to x}^{(1)}(x=1)=max(0,{L}_{f})$$

(16)

$${m}_{f\to x}^{(1)}(x=0)=0$$

(17)

where *L*_{f} denotes the value of *L*(*C*_{ijk}) for the triple of features (*i*, *j*, *k*) associated with factor *f*.

If we use the messages obtained after just one sweep, the beliefs have a simple expression. Since only the difference between beliefs for the figure and ground states matters, we define *B*_{x} = *b*(*x* = 1) − *b*(*x* = 0), which is given by:

$${B}_{x}=\sum _{f\in n(x)}max(0,{L}_{f})+G$$

(18)

A value of *B*_{x} > 0 means that the estimated optimal state of *x* is figure (*x* = 1); otherwise the estimated optimal state is ground (*x* = 0).
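Because the one-sweep beliefs reduce to the closed form of Eq. 18, the segmentation rule can be sketched in a few lines; the factor strengths and prior log-odds below are illustrative values only.

```python
# One-sweep beliefs for the figure-ground factor graph of Eq. 13, using
# the closed form of Eq. 18: B_x = sum over incident factors of
# max(0, L_f), plus the prior log-odds G. Numbers are illustrative.
G = -1.0                       # prior log-odds log[P(x=1)/P(x=0)]
# Each factor is a triple of node indices with its strength L_f.
factors = [((0, 1, 2), 2.5), ((1, 2, 3), 0.8), ((2, 3, 4), -1.5)]
n_nodes = 5

B = [G] * n_nodes
for nodes, L_f in factors:
    for i in nodes:
        B[i] += max(0.0, L_f)  # negative-strength factors contribute nothing

labels = [1 if b > 0 else 0 for b in B]   # B_x > 0 -> figure
print(B, labels)
```

Note how the factor with negative strength contributes nothing under max(0, ·), so nodes supported only by weak or negative factors fall below the prior threshold and are labeled ground.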

Finally, we note that in a previous version of this work [19], we constructed a similar factor graph for which *L*_{f} > 0 for all factors, and no prior was used. In this case we showed that one sweep of factor BP guaranteed convergence. The difference here is that *L*_{f} may be negative for some factors and a nonzero prior term *G* is present, so convergence after a single sweep is no longer guaranteed.

We now describe a specific application of our figure-ground segmentation framework to the problem of finding text. In this application, a bottom-up procedure is used for grouping edges into composite features that are signatures of regions containing text. The next subsection describes how these features are constructed, and subsequent subsections explain how the features are grouped into factors and how the resulting factor graph is used to detect the presence of text features.

We use a very simple edge detector to provide the basic elements to be grouped. First, the image is blurred slightly and decimated by a factor of three in each dimension, yielding a size of 648 × 864 pixels. Two kinds of edges are detected, corresponding to local maxima or minima of the horizontal and vertical image intensity derivatives. The edges are grouped into line segments, which are approximately straight and fully connected sequences of edge pixels (with between 3 and 20 pixels, which sets an appropriate range of scales for text detection). There are two kinds of line segments, those that are oriented (approximately) vertically and those that are oriented (approximately) horizontally. Vertical segments that are sufficiently close together and have opposite polarity are grouped into “weakly matched vertical edges”, shown in Fig. 6(a). “Weakly matched horizontal edges” are determined in a similar way (see Fig. 6(b)). As the figure suggests, weakly matched edges are features designed to be prevalent along the borders of letter strokes.

Edge features used to construct text features shown on cropped image of street sign. (a) Weakly matched vertical edge segments. (b) Weakly matched horizontal edge segments. Edges shown in red and green to indicate opposite polarities.

Next we prune the set of weakly matched vertical segments to obtain our final features, “anchored vertical segments” (see Fig. 7). An anchored vertical feature is a weakly matched vertical segment whose topmost or bottommost pixel lies sufficiently close to the leftmost or rightmost pixel of a weakly matched horizontal segment. By “sufficiently close” pixels we mean that they are either identical or one pixel is one of the eight nearest neighbors of the other.
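The "sufficiently close" test is just a Chebyshev-distance check on pixel coordinates; a minimal sketch (function name ours):

```python
# Two pixels are "sufficiently close" if they are identical or 8-connected
# neighbors, i.e. their Chebyshev (chessboard) distance is at most 1.
def sufficiently_close(p, q):
    return max(abs(p[0] - q[0]), abs(p[1] - q[1])) <= 1

print(sufficiently_close((10, 5), (11, 6)))  # diagonal neighbor -> True
print(sufficiently_close((10, 5), (12, 5)))  # two pixels apart -> False
```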

Anchored verticals. (a) Anchored verticals shown for image in previous figure. (b) Same as (a) but shown for entire image. Note density and regularity of anchored verticals in text region, where bottoms and tops tend to be at the same level. By contrast, **...**

As Fig. 7 shows, anchored verticals have a distribution on text regions that is significantly different from the distribution outside of text regions. Anchored verticals are distributed densely on text regions, and their bottoms and tops tend to be aligned to the same level. By contrast, outside of text regions, anchored verticals are distributed more sparsely and irregularly. We will exploit this differential distribution of anchored verticals (as well as other cues) to segment out text regions in an image.

Having constructed a set of useful anchored vertical features that have a distinctive signature in text regions, we now proceed to construct a factor graph based on these features. In the factor graph, each anchored vertical is a variable node. Factor nodes are defined as triples of anchored verticals that may plausibly belong to one text region. As we have shown, one iteration of factor BP can be performed very simply, requiring minimal calculations (see Eq. 18). However, the search for factors is computationally intensive, since there are many possible triples of anchored verticals to consider in each image.
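To illustrate how cheap one such update can be, the following sketch performs a sum-product factor-to-variable update for an arity-3 factor over binary segment labels (0 = ground, 1 = figure), under an *assumed* factor form, namely a factor that takes value *L*_{f} when all three segments are labeled figure and 1 otherwise; this is an illustration only, not a reproduction of the paper's Eq. 18:

```python
import numpy as np

def factor_to_var_messages(L_f, in_msgs):
    """One sum-product update for a hypothetical arity-3 factor over
    binary segment labels that equals L_f when all three segments are
    figure and 1 otherwise -- an assumed form, shown only to illustrate
    how little computation a single factor-BP iteration requires.
    in_msgs: three length-2 arrays of variable-to-factor messages."""
    out = []
    for i in range(3):
        others = [m for j, m in enumerate(in_msgs) if j != i]
        base = others[0].sum() * others[1].sum()           # all joint states
        bonus = (L_f - 1.0) * others[0][1] * others[1][1]  # all-figure state
        out.append(np.array([base, base + bonus]))         # [ground, figure]
    return out
```

With uniform incoming messages and *L*_{f} > 1, the outgoing message favors the figure label, as expected for a strong grouping factor.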

The search for factors is conducted by considering all triples of anchored verticals that satisfy a number of criteria based on multiple cues. For any triple of anchored verticals, up to two possible factors can be formed: one from the tops of the three anchored verticals, and another from the bottoms. Here "top" and "bottom" refer to the upper and lower endpoints, respectively, of the anchored vertical. Once the factors are extracted from the image, multiple cues are used to define the "strength" *L*_{f} of any factor *f*.

(a) Conditional distributions *P*_{on}(.) (blue) and *P*_{off}(.) (red), measured by histogramming, for an alignment cue. (b) log *P*_{on}(.)/*P*_{off}(.) for the same cue.

Next we discuss the types of grouping cues used in our factor graph before defining them precisely. The following types of cues are used:

- Alignment: this reflects the fact that anchored vertical bottoms belonging to a group of text tend to lie along a straight line. (A similar tendency holds for anchored vertical tops, but to a lesser extent since upper and lower case letters have different heights.)
- Parallelism: neighboring anchored vertical segments tend to be roughly parallel to each other. However, when neighboring segments are *both* very well aligned and very highly parallel, this is a sign that the segments belong to a periodic structure (such as the rails of a fence or the cracks between bricks in a building) that is not text. To exclude such false positives, we have defined a joint (top) alignment and parallelism cue, which is the only two-dimensional vector cue we have used. Such a cue requires two-dimensional histograms to be learned from training data, but is otherwise treated just like the other (one-dimensional scalar) cues.
- Orientation: most text of interest is oriented roughly horizontally. Therefore, we define an orientation cue that measures the (unsigned) orientation of the line connecting the leftmost and rightmost tops (or bottoms) of a segment triple.
- Color consistency: RGB color values are relatively consistent across the background of a typical text sign, since the background is usually a homogeneous color (different from the text letters themselves) and the illumination tends to be consistent across the background.
- Connectivity: this cue reflects the fact that anchored vertical segments are connected to more factors when they belong to text than when they belong to non-text. It is a “meta”-cue since it is calculated on the basis of factors due to all other cues.

In order to define the cues more precisely we introduce the following notation. The subscripts 1, 2 and 3 denote the three tops or bottoms of a triple of anchored verticals, where they are ordered from left to right across the image. Top or bottom locations will generically be referred to as pixel locations (*x*_{i}, *y*_{i}), and *L*_{13} denotes the line connecting points 1 and 3.

Expressed in this notation, the following grouping cues are learned from training data and used to construct factor graphs.

- Bottom alignment: let *y** be the y-coordinate of the point on *L*_{13} (defined in terms of bottoms) having *x*-coordinate equal to *x*_{2}. Then the bottom alignment cue is defined as *a*_{bottom} = |*y** − *y*_{2}|. (Perfect alignment would mean that (*x*_{2}, *y*_{2}) lies exactly on *L*_{13}, i.e. *a*_{bottom} = 0.)
- Joint parallelism and top alignment: this cue is defined as the vector *J* = (*p*, *a*_{top}). Here *p* is defined as max(|θ_{1} − θ_{2}|, |θ_{2} − θ_{3}|), where θ_{i} is the orientation of anchored vertical *i*, and *a*_{top} is defined similarly to the bottom alignment cue above.
- Orientation of tops and of bottoms: given top (or bottom) locations (*x*_{i}, *y*_{i}), the overall orientation is defined by the orientation of the line connecting (*x*_{1}, *y*_{1}) and (*x*_{3}, *y*_{3}). The orientation is an unsigned number defined such that a horizontal line has an orientation of 0. The orientations of the tops and bottoms define two separate cues, θ_{top} and θ_{bottom}.
- Color consistency: denoting an RGB vector by *I* = (*R*, *G*, *B*), we define *I*_{i}^{t} and *I*_{i}^{b} to be the RGB values near the top (superscript "t") and bottom (superscript "b") of anchored vertical *i* = 1, 2, 3. Specifically, the RGB value is sampled at the location 6 pixels above the top point and 6 pixels below the bottom point. Then the color consistency cue is defined as *C* = Σ_{i} |*I*_{i}^{t} − *I*_{i}^{b}|, where |·| is the *L*1 norm.
- Connectivity: for each anchored vertical segment, this arity-1 cue is simply the total number of factors connected to the segment, denoted *T*.
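The scalar cues above can be sketched as follows (orientations are in radians here; the paper's exact angle convention is not specified):

```python
import math

def bottom_alignment(b1, b2, b3):
    """a_bottom = |y* - y2|, where y* is the y-coordinate of the point on
    the line through b1 and b3 whose x-coordinate equals x2."""
    (x1, y1), (x2, y2), (x3, y3) = b1, b2, b3
    y_star = y1 + (x2 - x1) * (y3 - y1) / (x3 - x1)
    return abs(y_star - y2)

def parallelism(theta1, theta2, theta3):
    """p = max(|th1 - th2|, |th2 - th3|) over the segment orientations."""
    return max(abs(theta1 - theta2), abs(theta2 - theta3))

def orientation(p1, p3):
    """Unsigned orientation of the line from leftmost to rightmost point;
    zero for a horizontal line."""
    return abs(math.atan2(p3[1] - p1[1], p3[0] - p1[0]))

def color_consistency(top_rgbs, bottom_rgbs):
    """C = sum_i |I_i^t - I_i^b|, with |.| the L1 norm over RGB."""
    return sum(sum(abs(t - b) for t, b in zip(top, bot))
               for top, bot in zip(top_rgbs, bottom_rgbs))
```

For instance, bottoms (0, 0), (5, 1), (10, 0) give *a*_{bottom} = 1, while perfectly collinear bottoms give 0.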

These cues were selected by exploring a larger set of cues and choosing only those cues with sufficient discriminating power (which we discuss at the end of this subsection).

Next we describe the selection process, which uses some of the grouping cues described above (as well as a few other cues). The first stage of the selection process removes all segments whose length is outside of a certain range (corresponding to the range of scales we are interested in). In addition, the color consistency of the segment all by itself (i.e. the value |*I*^{t} − *I*^{b}| for the single segment) must be below a threshold.

Having pruned many single anchored vertical segments in the first stage of the selection process, the goal of subsequent stages is to construct arity-3 factors. Arity-3 factors are built by combining (temporary) arity-2 factors, which are in turn constructed from pairs of segments that are sufficiently close together, and whose orientation (defined by the line connecting the midpoints of the two segments) is sufficiently close to 0. We impose an ordering constraint to avoid double-counting: the first and second elements of each pair must be ordered from left to right.

Arity-3 factors are constructed by combining two arity-2 factors sharing a common segment. To construct an arity-3 factor, the following constraints must be satisfied: the three segments of the factor must be ordered from left to right; the bottom alignment cue of the three segments must be below a certain threshold; and the color consistency cue must be below a particular threshold. Once all candidate arity-3 factors have been computed, the connectivity cue is then computed for each segment, and all segments (and factors connected to the segment) are pruned for which the connectivity is sufficiently low. (All arity-2 factors are then deleted.)
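The combination step can be sketched as follows (the cue functions and thresholds are placeholders for the trial-and-error values described in the text):

```python
def build_arity3_factors(pairs, align_cue, color_cue,
                         align_thresh, color_thresh):
    """Combine arity-2 factors (left-to-right ordered segment pairs) that
    share a common segment into candidate arity-3 factors, keeping those
    whose bottom-alignment and color-consistency cues fall below the
    given thresholds."""
    factors = []
    for a, b in pairs:
        for c, d in pairs:
            if b != c:          # the two pairs must share a middle segment
                continue
            triple = (a, b, d)  # segments remain ordered left to right
            if align_cue(triple) < align_thresh and \
               color_cue(triple) < color_thresh:
                factors.append(triple)
    return factors
```

Because each arity-2 pair is already left-to-right ordered, chaining pairs that share a middle segment automatically yields left-to-right ordered triples.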

These selection cues and thresholds were selected by trial and error to ensure that few “bad” factors were discarded from true text regions, while keeping the total number of factors in an image to a manageable number.

Empirical distributions *P*_{on}(.) and *P*_{off}(.), learned from training data.

We note that the form of the selection process, in particular the choice of selection cues and thresholds, has a large effect on the resulting empirical distributions *P*_{on}(.) and *P*_{off}(.).

After all factors have been constructed, as described in the previous section, belief propagation is performed to carry out inference with them. Any segment whose belief value indicates it is more likely to belong to figure than to ground is chosen as a candidate winner. For an example of this process, see Fig. 9, which shows the anchored vertical segments extracted in a typical input image, and Fig. 10 for the results of belief propagation applied to this image.

Input image with anchored vertical segments colored to indicate correct (ground truth) labels: green means text and red means non-text.

Output before final processing step: all segments are shown that are designated as figure by belief propagation. Green segments are true positives and red segments are false positives.

Note that belief propagation correctly discards many, but not all, false positives, and detects most of the true text (while leaving some gaps).

A post-processing stage is then used to clean up the results of belief propagation, exploiting the fact that false positive segments tend to appear in isolated clusters, while true positives tend to be densely arrayed along lines of text. This post-processing stage consists of a simple convolution process, in which a sliding horizontal window (one pixel high by 100 pixels across) is applied across the image. At each location of the window, the number of candidate winner segments that it intersects is computed, and if this number is above a threshold then the window location (i.e. the center of the window) is classified as a winner. This process serves to remove isolated candidate winners and reinforce those that are densely arrayed along lines of text. See Fig. 11 for an example.
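The sliding-window step amounts to a one-dimensional convolution applied to each row of a binary candidate-winner mask (the count threshold below is a placeholder for the tuned value):

```python
import numpy as np

def postprocess(winner_mask, width=100, thresh=5):
    """Slide a 1-pixel-high, `width`-pixel-wide window across each row of
    a binary mask of candidate-winner segment pixels; a window center is
    classified as a winner if the window covers at least `thresh`
    candidates. `thresh` is a placeholder, not the paper's tuned value."""
    kernel = np.ones(width)
    counts = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, winner_mask)
    return counts >= thresh
```

A dense run of candidates survives (and reinforces its neighborhood), while an isolated candidate contributes too small a count and is suppressed.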

Output after final processing step (convolution), with lines of text detected by algorithm shown as horizontal green lines.

We show another result (also taken from one of the 40 test images), this time demonstrating the presence of a false positive and a false negative text detection. Fig. 12 shows the input image with extracted segments; Fig. 13 shows the results of running BP on the image, which incorrectly removes some of the valid text segments and fails to remove some clusters of non-text segments. The final result is given in Fig. 14, which shows that an insufficient number of true positives were detected by BP to declare the presence of text, and that the chance alignment of features in the tree generated a false positive text detection.

A second image with anchored vertical segments colored to indicate correct (ground truth) labels: green means text and red means non-text.

Output before final processing step: all segments are shown that are designated as figure by belief propagation. Green segments are true positives and red segments are false positives.

Our algorithm was programmed in unoptimized Matlab code, which took about one minute to search for the presence of text in each image. We trained our algorithm on 71 images and tested it on 40 images (different from the training images). In these 40 test images a total of 18142 anchored vertical segments were extracted, of which 767 belong to text and 17375 belong to non-text (based on manual inspection of the images). The results of belief propagation (before the convolution post-processing step) are that 647 out of 767 text segments were detected, with 540 false positives, corresponding to a true positive rate of 84% and a false positive rate of 3%.
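The reported rates follow directly from these raw counts:

```python
# Raw segment counts reported for the 40 test images.
text_total, nontext_total = 767, 17375
true_pos, false_pos = 647, 540

tpr = true_pos / text_total        # true positive rate
fpr = false_pos / nontext_total   # false positive rate
print(f"TPR = {tpr:.1%}, FPR = {fpr:.1%}")  # TPR = 84.4%, FPR = 3.1%
```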

We also evaluate the performance of the algorithm after the convolution post-processing step, quantifying detection performance at the level of entire lines of text (e.g. a street sign typically contains one line of text) rather than individual anchored vertical segments. Among the 40 test images there are 53 hits (true positive detections), 8 misses (false negatives) and 4 false alarms (false positives). (False alarms mainly came from building windows and railings.)

We note that there is some ambiguity in these performance statistics, since it is not always clear from examining an image how to distinguish between text and non-text (e.g. when the resolution is poor). Moreover, in some text regions too few anchored vertical segments are extracted, which obviously impairs performance; the solution to this problem is to improve the feature extraction procedure.

We have presented a novel feature grouping framework, based on factor graphs, which emerges naturally from a simple maximum entropy model of figure-ground segmentation. The framework allows the use of multiple grouping cues, and the contributions of these cues are learned from training data. We have applied the framework to the problem of segmenting text in natural scenes, and experimental results demonstrate the feasibility of the approach.

Future work will focus on improving the reliability of text detection. The main limitation of the current approach is the use of anchored vertical segments, which are not reliably detected in some text regions (especially those containing letters with few vertical elements). One way to ameliorate this problem is to simply lower the threshold of detection so as to detect more segments; another way is to eliminate the requirement that anchored vertical segments must lie near horizontal edge segments. Either solution will increase the number of true and false candidate segments, in which case additional cues may need to be strengthened in the selection process to decrease the total number of candidates, such as color consistency (which currently samples very few pixels but could easily be extended to sample more neighboring pixels). The connectivity cue might be strengthened to reward connections to segments lying to the left or right as opposed to segments lying above or below (since we are assuming the text is roughly horizontal). Another issue that may impair performance is the conditional independence assumption that underlies our approach (Eq. 4), which could be alleviated by adopting a conditional random field (CRF) framework [16].

In order for our algorithm to function as part of a practical system for finding and reading text, we will also have to use the text features output by it to determine appropriate bounding boxes to enclose the text, and use OCR to actually read the text. We note that the OCR stage will have the benefit of discarding some false positives which cannot be ruled out by our algorithm alone. Finally, the algorithm needs to be optimized for speed so that it can be ported to a camera cell (mobile) phone, as was accomplished in a previous version of this work [19].

The authors would like to acknowledge support from NIH grants EY015187-01A2 and 1 R01 EY018345-01, the National Institute on Disability and Rehabilitation Research grant H133E060001, and the Claire Giannini Fund.


1. Agarwal S, Lim J, Zelnik-Manor L, Perona P, Kriegman D, Belongie S. Beyond pairwise clustering. Computer Vision and Pattern Recognition (CVPR 2005) 2005;2:838–845.

2. Bishop C. Pattern Recognition and Machine Learning. Springer; 2007.

3. Chen X, Yuille AL. Detecting and reading text in natural scenes. Computer Vision and Pattern Recognition (CVPR 2004)

4. Coughlan J, Ferreira S. Finding Deformable Shapes using Loopy Belief Propagation; European Conference on Computer Vision (ECCV 2002); Denmark: Copenhagen; May, 2002. pp. 453–468.

5. Coughlan J, Shen H. A fast algorithm for finding crosswalks using figure-ground segmentation; 2nd Workshop on Applications of Computer Vision, in conjunction with European Conference on Computer Vision (ECCV 2006).

6. Cover TM, Thomas JA. Elements of Information Theory. Wiley Interscience Press; New York: 1991.

7. Felzenszwalb P, Huttenlocher D. Efficient Belief Propagation for Early Vision. Computer Vision and Pattern Recognition (CVPR 2004)

8. Gao J, Yang J. An adaptive algorithm for text detection from natural scenes. Computer Vision and Pattern Recognition (CVPR 2001)

9. Ivanchenko V, Coughlan J, Shen H. Detecting and Locating Crosswalks using a Camera Phone; Fourth IEEE Workshop on Embedded Computer Vision, in conjunction with Computer Vision and Pattern Recognition (CVPR 2008); Anchorage, Alaska. Jun, 2008. [PMC free article] [PubMed]

10. Jain A, Tu B. Automatic text localization in images and video frames. Pattern Recognition. 1998;31(12):2055–2076.

11. Kschischang FR, Frey BJ, Loeliger HA. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory. 2001;47

12. Kumar S, Hebert M. Man-Made Structure Detection in Natural Images using a Causal Multiscale Random Field. Computer Vision and Pattern Recognition (CVPR 2003)

13. Li H, Doermann D, Kia O. Automatic text detection and tracking in digital videos. IEEE Transactions on Image Processing. 2000;9(1):147–156. [PubMed]

14. Liang J, Doermann D, Li H. Camera-based analysis of text and documents: a survey. International Journal on Document Analysis and Recognition. 2005;7:84–104.

15. Murphy KP, Weiss Y, Jordan MI. Loopy belief propagation for approximate inference: an empirical study. Uncertainty in AI. 1999

16. Quattoni A, Collins M, Darrell T. Conditional Random Fields for Object Recognition. Neural Information Processing Systems (NIPS 2004)

17. Ratnaparkhi A. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08. Institute for Research in Cognitive Science, University of Pennsylvania. http://citeseer.ist.psu.edu/128751.html.

18. Shen H, Coughlan J. Finding text in natural scenes by figure-ground segmentation. International Conference on Pattern Recognition (ICPR 2006) 2006;(4):113–118.

19. Shen H, Coughlan J. Grouping Using Factor Graphs: an Approach for Finding Text with a Camera Phone; Workshop on Graph-based Representations in Pattern Recognition (GbR 2007, associated with The International Association for Pattern Recognition); Alicante, Spain. Jun, 2007.

20. Shental N, Zomet A, Hertz T, Weiss Y. Pairwise clustering and graphical models. Neural Information Processing Systems (NIPS 2003)

21. Sigal L, Isard M, Sigelman BH, Black MJ. Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation. Neural Information Processing Systems (NIPS 2003)

22. Shi J, Malik J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000;22(8):888–905.

23. Sun J, Shum H, Zheng N. Stereo matching using belief propagation. European Conference on Computer Vision (ECCV 2002) 2002:510–524.

24. Wu V, Manmatha R, Riseman EM. Finding text in images. ACM DL. 1997:3–12.

25. Yedidia JS, Freeman WT, Weiss Y. Understanding Belief Propagation and Its Generalizations. In: Exploring Artificial Intelligence in the New Millennium. Science & Technology Books; Jan, 2003. pp. 239–269. Chap. 8. Also available as MERL Cambridge Research Technical Report TR2001-022.

26. Yu SX, Shi J. Object-specific figure-ground segregation. Computer Vision and Pattern Recognition (CVPR 2003)

27. Yuille AL. Deformable Templates for Face Recognition. Journal of Cognitive Neuroscience. 1991;3(1) [PubMed]

28. He X, Zemel RS, Carreira-Perpinan MA. Multiscale Conditional Random Fields for Image Labeling. CVPR. 2004

29. Zhang D, Chang S. Learning to detect scene text using a higher-order mrf with belief propagation. Computer Vision and Pattern Recognition (CVPR 2004)

30. Zheng Y, Li H, Doermann D. Text identification in noisy document images using markov random field. International Conference on Document Analysis and Recognition. 2003
