


Sensors (Basel). 2017 March; 17(3): 617.

Published online 2017 March 17. doi: 10.3390/s17030617

PMCID: PMC5375903

Joonki Paik, Academic Editor

School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju 61005, Korea; Email: ehwa@gist.ac.kr (E.Y.); Email: james.han.gwak@gmail.com (J.G.)

Received 2016 December 6; Accepted 2017 March 14.

Copyright © 2017 by the authors.

Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Due to the reasonably acceptable performance of state-of-the-art object detectors, tracking-by-detection has become the standard strategy for visual multi-object tracking (MOT). Online MOT is particularly demanding because of its diverse applications in time-critical situations. A main issue in realizing online MOT is how to associate noisy object detection results in a new frame with the objects currently being tracked. In this work, we propose a multi-object tracking method called CRF-boosting, which uses a hybrid data association method based on online hybrid boosting facilitated by a conditional random field (CRF) to establish online MOT. For data association, the learned CRF is used to generate reliable low-level tracklets, which are then used as the input of the hybrid boosting. Whereas existing data association methods based on boosting algorithms require training data with ground truth information to improve robustness, CRF-boosting ensures sufficient robustness without such information thanks to its synergetic cascaded learning procedure. Further, a hierarchical feature association framework is adopted to further improve MOT accuracy. Experimental results on public datasets show a noticeable benefit of the proposed hybrid approach over other competitive MOT systems.

Multiple object tracking (MOT) [1,2] is one of the most important and active areas of computer vision research, and recent advances in the detection and tracking of multiple objects have led to its application to diverse practical problems such as bio-medical imaging, visual surveillance systems and augmented reality. The main tasks in establishing an MOT system are to extract the positions of objects, to generate the trajectory of each individual object, and to maintain the identity of each object, even in crowded environments. Several issues increase MOT complexity, such as imprecise and noisy detections, occlusions by other objects or the background, and dynamic interactions among objects.

Due to the success in developing robust object detectors [3,4,5], many recent studies on MOT adopt tracking-by-detection approaches [6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], where the key research topic is data association to link object detections or tracklets (i.e., track fragments) in a sequence of frames for assembling the final trajectories of the objects. Such MOT systems based on data association consist of two main components: (1) *a tracklet affinity model* measuring the likelihood (or linking probability) that two detection responses or tracklets belong to the same target; and (2) *a global optimization framework* determining which detection responses or tracklets should be linked based on the affinity measurement, which is commonly formulated as a maximum a posteriori problem.

Although many methods have been proposed to develop global optimization frameworks based on linear programming [6], the min-cost flow algorithm [7] and the Hungarian algorithm [8], relatively little effort has been devoted to improving the affinity model. The simple affinity models widely adopted for efficiency are mostly based on straightforward parametric models (e.g., Gaussian distributions for object location changes and distances between color histograms for object appearance affinity). Moreover, in many cases, the model parameters and the relative emphases of different cues are determined from prior knowledge or human observation of the data. When the environment changes or multiple cues (e.g., appearance, motion, and context information) are combined into one affinity model, manual tuning becomes practically impossible.

To overcome such difficulties, we propose a hybrid data association algorithm combining a conditional random field (CRF) [21,22] and online hybrid boosting for building robust MOT. Existing data association approaches that adopt machine learning techniques such as boosting need training sets with ground truth information [23,24] to achieve higher accuracy. While *rank boost* [23] achieves better performance than *binary boost* [23], its ranking concept makes it very difficult to design an online algorithm. The CRF is a powerful model adopted in many computer vision research fields, but it has not been widely utilized in data association for MOT. In our work, with the aim of designing an online MOT system, we incorporate a CRF, which enables low-level data association, into a hybrid boosting-based data association approach with a ranking concept. Specifically, we represent the association of detection responses between two frames as a CRF graph, and design an online algorithm by applying the results of CRF-based pairwise similarity matching to build the training data. Finally, the CRF learning output is used as the input to the hybrid boosting algorithm that learns tracklet affinity models. In summary, the contributions of this work are as follows:

- A robust hybrid data association is proposed by cascading robust CRF-based pairwise similarity matching and online hybrid boosting.
- A hierarchical feature association framework is adopted to improve the accuracy.
- A fully automated online MOT method called CRF-boosting is established.

The rest of this paper is organized as follows: related work is reviewed in Section 2. The preliminaries of this work, the CRF and boosting approaches, are described in Section 3. Section 4 describes the details of the proposed hybrid MOT approach. The experimental results and analysis are given in Section 5. Finally, the conclusions and future work are given in Section 6.

One key issue in MOT is how to distinguish targets from the background and from other objects. To do this, researchers usually try to find or learn proper appearance models capable of identifying one target among all other objects or the background. In addition, to perform effective tracklet association, data association frameworks have been widely studied. Most MOT methods take tracking-by-detection approaches [6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,25,26,27,28,29,30,31,32,33,34,35] and can be classified into two categories: (1) MOT utilizing only past and current frames for association decisions (e.g., [34,35]); and (2) MOT using all frames, including past, current, and future frames (e.g., [7,8,9,24,36]). The former usually adopts a particle filtering framework based on detection responses and is more suitable for time-critical applications because it does not require future frames; however, it is very vulnerable to noisy observations and long-term occlusions of targets. To obtain further improved results, the latter uses all frames and adopts global optimization. Tracking-by-detection-based MOT methods usually associate detection responses obtained from a pre-trained detector into tracklets progressively and finally construct trajectories for all targets. Appearance models, whether pre-trained or learned online, are commonly adopted to distinguish targets. In addition, motion models can be adopted to predict the feasible positions of objects in future frames, which reduces the search space. The appearance and motion models may be optimized, but in differentiating all targets from each other, there remain challenges such as: (1) similar appearance of targets; (2) complex interactions among objects; (3) frequent occlusions; (4) different sizes of targets; and (5) initialization and termination of tracks.

Among the related works, we mainly review the closely related studies on boosting-based MOT [9,24,27,37] and CRF-based MOT [33,38]. Boosting-based MOT is easier to implement than CRF-based MOT, and boosting can be combined with different learning algorithms to improve its performance. In boosting-based MOT, most studies focus on improving the robustness and effectiveness of the appearance models that serve as distinctive feature information. In contrast, studies on CRF-based MOT have usually focused on data association to generate the final trajectories. In Li et al. [24], HybridBoost was used to learn an appearance model integrated into a hierarchical data association framework [39] to progressively grow tracklets. Yang et al. [9] devised a part-based appearance modelling and grouping-based data association framework to alleviate the problems of frequent occlusions and similar appearances among objects; a boosting algorithm was used to learn the part-based appearance model. In Kuo et al. [37], an online learning approach to build a discriminative appearance model was proposed, where the AdaBoost algorithm combines effective image descriptors and their corresponding similarity measurements. To make online learning possible, positive and negative training samples are obtained from short but reliable tracklets produced by a dual-threshold method [39]. Bae and Yoon [27] proposed online MOT based on tracklet confidence and online discriminative appearance learning. Effective tracklets are obtained by sequentially linking detections/tracklets using local and global association according to their confidence levels, and incremental linear discriminant analysis [32] is used for online discriminative appearance model learning. Yang et al. [33] proposed a CRF model that considers both tracklet affinities and dependences among tracklets, transforming the MOT problem into an energy minimization task.
In Yang and Nevatia [38], an online learned CRF model was used to generate final trajectories. For online learning, low-level tracklets are required and are generated by simply using color or location information between two consecutive frames. However, in many cases, this is not practical because it can increase association errors under noisy observation conditions.

In this section, the two key elements, the CRF and the hybrid boosting method, which are used to build an effective tracklet affinity model in Section 4, are explained in detail.

CRFs are discriminative undirected probabilistic graphical models developed for labeling/segmenting structural and sequential data [21,40], and it is shown in [41] that they are competent at modelling spatial relationships. We define a conditional distribution $p(x|z)$ over the hidden variables *x* given observation *z*, where nodes $x=<{x}_{1},{x}_{2},\dots ,{x}_{n}>$ represent hidden states and nodes $z=<{z}_{1},{z}_{2},\dots ,{z}_{n}>$ represent the data. Using the nodes *x _{i}* and their connectivity structure represented by undirected edges, the conditional distribution $p(x|z)$ over the set of cliques *C* is defined as:

$$p(x|z)=\frac{1}{Z(z)}{\displaystyle {\prod}_{c\in C}{\varphi}_{c}(z,{x}_{c})},$$

(1)

where $Z(z)={\displaystyle {\sum}_{x}{\displaystyle {\prod}_{c\in C}{\varphi}_{c}(z,{x}_{c}})}$ is the normalizing partition function. Also, ${\varphi}_{c}(z,{x}_{c})$ is described by log-linear combinations of feature functions *f _{c}* as follows:

$${\varphi}_{c}(z,{x}_{c})=\mathrm{exp}({w}_{c}^{T}\cdot {f}_{c}(z,{x}_{c})),$$

(2)

where ${w}_{c}^{T}$ is a weight vector, and ${f}_{c}(z,{x}_{c})$ is a feature function. Then, (1) can be rewritten as:

$$p(x|z)=\frac{1}{Z(z)}\mathrm{exp}({\displaystyle {\sum}_{c\in C}{w}_{c}^{T}}\cdot {f}_{c}(z,{x}_{c})).$$

(3)

The weights of the feature functions in (3) are determined by CRF parameter learning. The CRF learns the weights discriminatively by maximizing the conditional likelihood of labeled training data. We could find the global optimum of (3) using a numerical gradient method, but this is very inefficient because the inference procedure of the optimization must be executed at each iteration. Thus, we adopt the method of maximizing the pseudo-likelihood of the training data, which is given by the sum of local likelihoods $p({x}_{i}|MB({x}_{i}))$, where $MB({x}_{i})$ is *x _{i}*’s Markov blanket, i.e., the set of its immediate neighbors in the graph:

$$L(w)=-{\displaystyle {\sum}_{i=1}^{n}\mathrm{log}p({x}_{i}|MB({x}_{i}),w)+\frac{{(w-\tilde{w})}^{T}(w-\tilde{w})}{2{\sigma}^{2}}},$$

(4)

where the rightmost term represents a Gaussian shrinkage prior with mean $\tilde{w}$ and variance ${\sigma}^{2}$. We use unconstrained L-BFGS [36] as a gradient-based method to optimize (4). Then, at the inference stage on new test data, the learned CRF estimates the most likely configuration of all hidden variables *x* using belief propagation [40].
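The pseudo-likelihood optimization in (4) can be illustrated with a minimal sketch. The toy chain CRF, its two feature functions (observation agreement and neighbor smoothness), and all numeric values below are illustrative assumptions, not the paper's model; the regularized negative log pseudo-likelihood is minimized with SciPy's L-BFGS-B routine, standing in for the unconstrained L-BFGS the text mentions.

```python
import numpy as np
from scipy.optimize import minimize

def local_log_prob(w, i, x, feats):
    # log p(x_i | MB(x_i), w): normalize the log-linear score over the two
    # candidate states of node i, with its neighbors fixed at their labels
    scores = np.array([feats(i, s, x) @ w for s in (0, 1)])
    scores -= scores.max()                      # numerical stability
    return scores[x[i]] - np.log(np.exp(scores).sum())

def objective(w, x, feats, w_prior, sigma2):
    # Equation (4): negative log pseudo-likelihood + Gaussian shrinkage prior
    nll = -sum(local_log_prob(w, i, x, feats) for i in range(len(x)))
    d = w - w_prior
    return nll + d @ d / (2.0 * sigma2)

# toy 4-node chain: one observation-agreement feature, one smoothness feature
obs = np.array([0.9, 0.8, 0.1, 0.2])            # hypothetical per-node evidence
x_train = np.array([1, 1, 0, 0])                # labeled training configuration

def feats(i, s, x):
    f_obs = obs[i] if s == 1 else 1.0 - obs[i]
    nbrs = [j for j in (i - 1, i + 1) if 0 <= j < len(x)]   # Markov blanket
    f_smooth = sum(x[j] == s for j in nbrs) / len(nbrs)
    return np.array([f_obs, f_smooth])

res = minimize(objective, x0=np.zeros(2), method="L-BFGS-B",
               args=(x_train, feats, np.zeros(2), 10.0))
w_learned = res.x
```

Since both features agree with the training labels, both learned weights come out positive, with the Gaussian prior shrinking them toward the prior mean.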

Boosting has been successfully used in a variety of machine learning tasks and widely applied to computer vision tasks as well. In this section, for learning an appearance affinity model, we introduce a hybrid boosting algorithm having the property of both a ranking function and a binary classifier.

A ranking problem includes an instance space *X* with a ranking function *H* that defines a linear ordering of instances in *X*. *H* takes the form $H:X\to R$. Proposed by Freund et al. [23], rank boost is an algorithm invented for this purpose. In rank boost, a set of instance pairs $R=\left\{<{x}_{i},{x}_{j}>|{x}_{i},{x}_{j}\in X\right\}$ constitutes the training data, where *x _{j}* should be ranked higher than *x _{i}*.

We can map the ranking problem onto the data association problem. We define the instance space *X* to be *T* × *T*, where *T* is the set of tracklets that may be associated. For example, given tracklets ${T}_{1},{T}_{2},{T}_{3},{T}_{4}\in T$, if *T*_{1} and *T*_{3} belong to the real trajectory that should be correctly linked, then the ranking must satisfy $H(<{T}_{1},{T}_{3}>)>H(<{T}_{1},{T}_{2}>)$ and $H(<{T}_{1},{T}_{3}>)>H(<{T}_{1},{T}_{4}>)$. However, when *T ^{t}* is the terminating tracklet of a target trajectory, no further tracklet should be associated with it; a pure ranking function cannot express this, since it can only order candidate associations rather than reject all of them.

To resolve the problem, in the hybrid boosting algorithm, the training set is composed of a ranking sample set *R* and a binary sample set *B*. The ranking sample set is denoted by:

$$R=\left\{({x}_{i,0},{x}_{i,1})|{x}_{i,0}\in X,{x}_{i,1}\in X\right\},$$

(5)

where each pair indicates that ${x}_{i,1}$ should be ranked higher than ${x}_{i,0}$. The binary sample set is denoted by:

$$B=\left\{({x}_{j},{y}_{j})|{x}_{j}\in X,{y}_{j}\in \{-1,1\}\right\},$$

(6)

where *y _{j}* = 1 indicates that the corresponding tracklet pair should be linked and *y _{j}* = −1 that it should not. The loss function *Z* over both sample sets is defined as:

$$Z=\beta {\displaystyle {\sum}_{({x}_{i,0},{x}_{i,1})\in R}{w}_{0}({x}_{i,0},{x}_{i,1})}\mathrm{exp}(H({x}_{i,0})-H({x}_{i,1}))+(1-\beta ){\displaystyle {\sum}_{({x}_{j},{y}_{j})\in B}{w}_{0}({x}_{j},{y}_{j})\mathrm{exp}(-{y}_{j}H({x}_{j}))},$$

(7)

where *β* is a constant coefficient and *w*_{0} is the initial weight function. In the boosting algorithm, to find *H*(*x*), we need to minimize *Z*, and *H* is obtained by adding new weak ranking classifiers sequentially. Therefore, (7) can be written using a weak ranking classifier ${h}_{t}:X\to R$ and its weight *α _{t}* as follows:

$$Z=\beta {\displaystyle \sum _{({x}_{i,0},{x}_{i,1})\in R}{w}_{0}({x}_{i,0},{x}_{i,1})\mathrm{exp}({\alpha}_{t}({h}_{t}({x}_{i,0})-{h}_{t}({x}_{i,1})))+(1-\beta ){\displaystyle \sum _{({x}_{j},{y}_{j})\in B}{w}_{0}({x}_{j},{y}_{j})\mathrm{exp}(-{y}_{j}{\alpha}_{t}{h}_{t}({x}_{j}))}},$$

(8)

The final strong ranking classifier is the weighted combination of the selected weak ranking classifiers:

$$H(x)={\displaystyle {\sum}_{t=1}^{n}{\alpha}_{t}{h}_{t}(x)},$$

(9)

where *n* is the number of boosting rounds. Owing to the loss function *Z*, *H*(*x*) combines the advantages of both a ranking classifier and a binary classifier.
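The hybrid loss (7)/(8) and the round-by-round construction of (9) can be sketched as follows. This is a minimal illustration, not the paper's implementation: weak learners are single-feature threshold stumps in the spirit of (15), each weight *α _{t}* is found by a crude grid line search rather than a closed form, and the toy data are invented for the example.

```python
import numpy as np

def train_hybrid_boost(X, rank_pairs, bin_samples, beta=0.5, rounds=5):
    """Greedily minimize the hybrid loss: rank_pairs are (i0, i1) index pairs
    where instance i1 should outrank i0; bin_samples are (j, y) labeled
    instances with y in {-1, +1}."""
    H = np.zeros(len(X))
    learned = []
    for _ in range(rounds):
        best = None
        for d in range(X.shape[1]):            # every feature ...
            for thr in np.unique(X[:, d]):     # ... and candidate threshold
                h = np.where(X[:, d] > thr, 1.0, -1.0)
                for alpha in np.linspace(0.05, 1.0, 20):   # crude line search
                    Hc = H + alpha * h
                    z = (beta * sum(np.exp(Hc[i0] - Hc[i1])
                                    for i0, i1 in rank_pairs)
                         + (1 - beta) * sum(np.exp(-y * Hc[j])
                                            for j, y in bin_samples))
                    if best is None or z < best[0]:
                        best = (z, alpha, d, thr)
        _, alpha, d, thr = best
        H += alpha * np.where(X[:, d] > thr, 1.0, -1.0)    # grow H(x) as in (9)
        learned.append((alpha, d, thr))
    return H, learned

# toy instances with one affinity feature each; 0 and 2 are true associations
X = np.array([[0.9], [0.2], [0.8], [0.1]])
rank_pairs = [(1, 0), (3, 2)]        # instance 0 should outrank 1, 2 outrank 3
bin_samples = [(0, +1), (1, -1)]
H, stumps = train_hybrid_boost(X, rank_pairs, bin_samples)
# H ranks the true associations above the false ones: H[0] > H[1], H[2] > H[3]
```

The binary samples pull the scores of terminating tracklets below zero while the ranking pairs order the remaining candidates, which is exactly the dual role the loss *Z* is meant to play.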

In this section, based on the CRF and hybrid boosting discussed in Section 3, we describe how to design a robust online MOT system called CRF-boosting.

To track multiple objects robustly under difficult conditions such as noisy or missed detections, many boosting-based data association methods have used training data with ground truth (GT) information or the like. In many cases, because obtaining training data with accurate GT information for different situations is impractical and inconvenient, offline learning of an affinity model was commonly adopted. With such an approach, however, it is very difficult to implement robust online MOT with real-time processing capability. To overcome this drawback, in this work, we generate a CRF model for intermittent temporary tracklet association between two consecutive frames, and the results (i.e., the selected good samples) from the CRF model are used as the training data for hybrid boosting to establish an online MOT system called CRF-boosting. In addition, through hierarchical feature association with the online hybrid boosting algorithm, detection responses are progressively linked into longer tracklets to form the final tracking outcomes in an online manner. Figure 1 shows the overall schematic of the proposed system.

In the first step, detection responses obtained from the image sequences serve as input data. Since the hybrid boosting algorithm uses not only ranking information but also binary information, it is crucial to use accurate and reliable tracklet information in its training process. To do this, we use a learned CRF model [40] that gives the similarity information between objects in two consecutive frames. The construction of the CRF model is described in Section 4.2. The reliable short tracklets constructed by the CRF model are used as the input of the hybrid boosting-based data association algorithm that produces the final trajectory information. The details of the hybrid boosting are described in Section 4.3.

In the CRF, intermittent temporary connections among detected objects between frames are made using the feature information of the objects. To find the links between two frames, we generate a CRF graph containing hidden nodes ${x}_{t}^{i}$ indicating object *i* in frame *t*. In generating the CRF graph, node ${x}_{t-1}^{i}$ is not connected with all nodes ${x}_{t}^{j}$ in the next frame *t*; it is connected only with nodes ${x}_{t}^{j}$ within a certain boundary *σ* of its position (i.e., only neighboring objects are connected), under the regional (i.e., local proximity-based) assumption that an object does not suddenly move far away between two consecutive frames. Here, we set $\sigma =2.5\times \mathrm{height}\text{}\mathrm{of}\text{}\mathrm{object}\text{}i$. Considering this local proximity, an efficient CRF model can be constructed. An example is given in Figure 2.
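The proximity gating described above can be sketched in a few lines. The `(x, y, height)` detection tuples and the function name are illustrative assumptions; only the gate $\sigma = 2.5\times$ height comes from the text.

```python
def candidate_edges(prev, curr, k=2.5):
    """Connect object i in frame t-1 only to detections in frame t within
    sigma = k * height(i) of its position; prev/curr hold (x, y, height)."""
    edges = []
    for i, (xi, yi, hi) in enumerate(prev):
        sigma = k * hi
        for j, (xj, yj, _hj) in enumerate(curr):
            if ((xi - xj) ** 2 + (yi - yj) ** 2) ** 0.5 <= sigma:
                edges.append((i, j))            # candidate CRF edge
    return edges

prev = [(100, 50, 40), (480, 60, 40)]   # frame t-1 detections
curr = [(110, 55, 40), (500, 60, 40)]   # frame t detections
# each object gates only to its nearby detection: [(0, 0), (1, 1)]
```

The quadratic loop over all detection pairs is replaced by the gate's much smaller candidate set, which is what makes the resulting CRF graph efficient.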

Node ${z}_{t}^{i}$ in Figure 2 corresponds to the local features (i.e., observation data) of hidden node ${x}_{t}^{i}$ (i.e., object *i*). In this work, we use the spatial distance [40] and visual appearance, including the color histogram [22] and covariance [37,43], as the features. Then, (3) in Section 3.1 expresses the conditional distribution of the CRF, and the function of each feature for similarity measurement is defined as the difference of that feature between objects. In this CRF, the feature function of the spatial distance between object *j* in frame *t* − 1 and object *i* in frame *t* is defined as:

$${f}_{sd}(i,j,{z}_{t}^{i,sd},{z}_{t-1}^{j,sd})=\frac{{\Vert {z}_{t}^{i,sd}-{z}_{t-1}^{j,sd}\Vert}^{2}}{{\sigma}_{sd}^{2}},$$

(10)

where ${z}_{t}^{i,sd}$ is the position of individual *i*, ${z}_{t-1}^{j,sd}$ is the position of individual *j*, and ${\sigma}_{sd}^{2}$ is the variance of the distances in the training data. The feature function of the color histogram is defined as:

$${f}_{ch}(i,j,{z}_{t}^{i,ch},{z}_{t-1}^{j,ch})=\frac{{\Vert {z}_{t}^{i,ch}-{z}_{t-1}^{j,ch}\Vert}^{2}}{{\sigma}_{ch}^{2}},$$

(11)

where ${z}_{t}^{i,ch}$ is the color histogram of *i*, ${z}_{t-1}^{j,ch}$ is the color histogram of *j*, and ${\sigma}_{ch}^{2}$ is the variance of the color histogram differences in the training data. Single-channel histograms with 8 bins per channel are concatenated into a single vector, resulting in a 24-dimensional vector. Next, the feature function of covariance is computed by:

$${f}_{cov}(i,j,{C}_{i},{C}_{j})=\sqrt{{\displaystyle {\sum}_{k=1}^{7}{\mathrm{ln}}^{2}{\lambda}_{k}({C}_{i},{C}_{j})}},$$

(12)

where ${\left\{{\lambda}_{k}({C}_{i},{C}_{j})\right\}}_{k=1,\dots ,7}$ are the generalized eigenvalues of *C _{i}* and *C _{j}*. The covariance matrix of region *i* is computed as:

$${C}_{i}=\frac{1}{P-1}{\displaystyle {\sum}_{p=1}^{P}({z}_{i,p}-{\mu}_{i}}){({z}_{i,p}-{\mu}_{i})}^{T},$$

(13)

where *P* is the number of pixels in the region of *i*, denoted ${R}_{i}$, ${\mu}_{i}$ is the mean feature vector over ${R}_{i}$, *I* is the pixel intensity, and ${z}_{i,p}$ is the vector consisting of the first and second derivatives of ${R}_{i}$ at the *p*-th pixel, given as:

$${z}_{i,p}={\left[\left|\frac{\partial I}{\partial x}\right|\text{\hspace{1em}}\left|\frac{\partial I}{\partial y}\right|\text{\hspace{1em}}\left|\frac{{\partial}^{2}I}{\partial {x}^{2}}\right|\text{\hspace{1em}}\left|\frac{{\partial}^{2}I}{\partial {y}^{2}}\right|\text{\hspace{1em}}\left|\frac{{\partial}^{2}I}{\partial xy}\right|\right]}^{T}.$$

(14)

Similar to [43], the image derivatives are computed using the filters ${[-1\text{}0\text{}1]}^{T}$ and ${[-1\text{}2\text{}-1]}^{T}$; the resulting covariance of a region is a 9 × 9 matrix.
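The two appearance feature functions (11)–(14) can be sketched together. This is an illustrative reduction, not the paper's descriptor: the full covariance descriptor in the text is 9 × 9 (presumably including position and intensity channels), whereas this sketch keeps only the five derivative magnitudes explicit in (14), and the ε-regularization in the eigenvalue distance is an added numerical convenience.

```python
import numpy as np
from scipy.ndimage import convolve1d
from scipy.linalg import eigh

def color_hist_24(patch):
    """8-bin histogram per channel of an HxWx3 patch with values in [0, 1],
    concatenated into the 24-dimensional vector used by (11)."""
    n = patch.shape[0] * patch.shape[1]
    return np.concatenate([
        np.histogram(patch[..., c], bins=8, range=(0.0, 1.0))[0]
        for c in range(3)
    ]).astype(float) / n

def f_ch(hi, hj, sigma2_ch=1.0):
    # feature function (11): squared L2 distance scaled by training variance
    d = hi - hj
    return float(d @ d) / sigma2_ch

D1 = np.array([-1.0, 0.0, 1.0])    # first-derivative filter from the text
D2 = np.array([-1.0, 2.0, -1.0])   # second-derivative filter from the text

def region_covariance(I):
    """Covariance (13) of per-pixel features; z_p stacks the five derivative
    magnitudes shown in (14)."""
    feats = np.stack([
        np.abs(convolve1d(I, D1, axis=1)),                          # |dI/dx|
        np.abs(convolve1d(I, D1, axis=0)),                          # |dI/dy|
        np.abs(convolve1d(I, D2, axis=1)),                          # |d2I/dx2|
        np.abs(convolve1d(I, D2, axis=0)),                          # |d2I/dy2|
        np.abs(convolve1d(convolve1d(I, D1, axis=0), D1, axis=1)),  # |d2I/dxdy|
    ], axis=-1).reshape(-1, 5)
    return np.cov(feats, rowvar=False)   # divides by P-1 as in (13)

def f_cov(Ci, Cj, eps=1e-6):
    # distance (12): sqrt of summed squared log generalized eigenvalues
    lam = eigh(Ci + eps * np.eye(5), Cj + eps * np.eye(5), eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))
```

For identical regions the generalized eigenvalues are all one, so `f_cov` is zero, matching the intuition that each feature function measures dissimilarity.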

To learn the hybrid boosting algorithm in an online manner, we must compose the training sets automatically. In this work, the matched detection responses produced by CRF matching (Section 4.2) in consecutive frames are employed for this purpose, together with spatio-temporal distance information. The training data are divided into a ranking dataset and a binary dataset, each consisting of positive and negative samples for learning the boosting algorithm. We assume that each tracklet corresponds to one object and that the targets present in the same frame (i.e., at the same time) belong to different tracklets. That is, since objects in different trajectories at the same time trivially cannot be the same target, we use this spatio-temporal constraint to build the training data. In this way, using the reliable tracklet output of CRF matching, we can construct the training dataset for the boosting algorithm: the ranking training set defined in (5) and the binary training set defined in (6). Figure 3 shows an example of constructing the training dataset.
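The spatio-temporal sample construction can be sketched as follows. The tracklet representation (lists of `(frame, detection_id)` pairs) and the convention of treating a reliable tracklet's own pieces as a positive association are illustrative assumptions; only the co-occurrence constraint itself comes from the text.

```python
def build_training_sets(tracklets):
    """Build ranking pairs per (5) and labeled binary samples per (6) from
    CRF-matched tracklets; a candidate association is a tracklet-id pair."""
    positives, negatives = [], []
    for a in range(len(tracklets)):
        positives.append((a, a))               # pieces of one reliable tracklet
        for b in range(a + 1, len(tracklets)):
            frames_a = {f for f, _ in tracklets[a]}
            frames_b = {f for f, _ in tracklets[b]}
            if frames_a & frames_b:            # co-occurring => not same target
                negatives.append((a, b))
    # ranking set (5): each correct association should outrank a wrong one
    rank_pairs = [(neg, pos) for pos in positives for neg in negatives]
    # binary set (6): +1 / -1 labeled association samples
    bin_samples = [(p, +1) for p in positives] + [(n, -1) for n in negatives]
    return rank_pairs, bin_samples

trks = [[(1, 'a'), (2, 'b')], [(1, 'c'), (2, 'd')], [(3, 'e'), (4, 'f')]]
rank_pairs, bin_samples = build_training_sets(trks)
# only tracklets 0 and 1 share frames, so ((0, 1), -1) is a negative sample
```

Note that tracklet 2 never co-occurs with the others, so no negative sample is generated for it, mirroring how only temporally overlapping targets yield hard negatives.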

Similar to [24], as shown in Table 1, 13 types of hierarchical features, representing the length of tracklets (idx 1 to 3), appearance information of tracklets (idx 4, 5), frame gap information of tracklets (idx 6 to 9), and motion information of tracklets (idx 10, 11), are adopted in this work. The online hybrid boosting algorithm is given in Algorithm 1 (the details were discussed in Section 3.2). In the boosting algorithm, each feature is a function $f:x\to R$ that takes a pair of tracklets $x=<{T}_{i},{T}_{j}>$ as its input and outputs a real value. The weak ranking classifier is defined as:

$$h(x)=\{\begin{array}{ll}+1& if\text{\hspace{0.17em}}f(x)>\delta \\ -1& otherwise\end{array}.$$

(15)

As described in Figure 4, we design the boosting algorithm with a two-stage training procedure. In constructing trajectories, the two training stages help to exploit more accurate ranking information (e.g., appearance affinity models covering different poses within a trajectory) by considering tracklets of different lengths. For this, the maximum tracklet length in the first stage is defined as 1/4 of the full training image sequence, and that in the second stage as 1/2 of the sequence. By training incrementally, we obtain more accurate tracklet information than by using the entire image sequence at once, which improves the robustness of the MOT system (i.e., reduces tracking errors).

Two-stage training procedure: (**a**) 1st stage: the maximum length of tracklets is 1/4 of the whole image sequence used for training; (**b**) 2nd stage: the maximum length of tracklets is 1/2 of the whole image sequence used for training.

The procedure of the proposed CRF-boosting algorithm is given in Algorithm 2. In the proposed CRF-Boosting tracker, two-stage training is performed. As a result of CRF-based pairwise similarity matching, robust low-level tracklets are obtained and using them, ranking and binary classification samples are formed in an online manner. Then, a strong ranking classifier $H(x)$ is learned using hybrid boosting in Algorithm 1. The CRF-boosting tracker using $H(x)$ as the tracklet affinity model is then applied to generate the 1st stage association. The above procedures are repeated to establish the 2nd stage association. Finally, trajectories for all targets are constructed.
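The control flow of Algorithm 2 can be sketched as a skeleton in which the components described above are injected as callables. Every helper name here (`crf_match`, `build_samples`, `boost`, `associate`) is a hypothetical stand-in, not the paper's API; only the two-stage 1/4-then-1/2 length caps come from the text.

```python
def crf_boosting_track(detections, crf_match, build_samples, boost, associate,
                       stages=(0.25, 0.5)):
    """Skeleton of the two-stage CRF-boosting procedure: CRF matching yields
    low-level tracklets; each stage caps tracklet length, builds training
    samples online, learns H(x) by hybrid boosting, and re-associates."""
    n_frames = len(detections)
    tracklets = crf_match(detections)            # reliable low-level tracklets
    for frac in stages:                          # stage caps: 1/4 then 1/2
        max_len = max(1, int(frac * n_frames))
        capped = [t[:max_len] for t in tracklets]
        rank_s, bin_s = build_samples(capped)    # online training sets
        H = boost(rank_s, bin_s)                 # strong ranking classifier (9)
        tracklets = associate(tracklets, H)      # grow tracklets using H
    return tracklets

# exercise the skeleton with trivial stand-ins on a 4-frame toy sequence
caps = []
trajectories = crf_boosting_track(
    detections=[['a'], ['b'], ['c'], ['d']],
    crf_match=lambda dets: [[(t, d) for t, fr in enumerate(dets) for d in fr]],
    build_samples=lambda ts: (caps.append(max(len(t) for t in ts)) or ([], [])),
    boost=lambda r, b: (lambda x: 0.0),
    associate=lambda ts, H: ts,
)
# caps == [1, 2]: length capped at 1/4 then 1/2 of the 4-frame sequence
```

Separating the stage logic from the learned components makes the "repeat the procedure for the 2nd stage association" step of Algorithm 2 a plain loop.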

Algorithm 1: Online Hybrid Boosting Algorithm

Algorithm 2: CRF-Boosting Tracker with the Two-Stage Training Procedure

In this section, the experimental results, their analyses, and the conclusions that can be drawn from them are discussed. We evaluate the effectiveness of the proposed MOT system on three widely used public surveillance datasets: CAVIAR [44], PETS2009 [45] and ETH [46]. The CAVIAR dataset contains 26 video sequences of a corridor in a shopping mall taken by a single camera with a frame size of 384 × 288 and a frame rate of 25 FPS. The PETS2009 dataset includes the “S2.L1” (sparsely crowded scenes), “S2.L2” (moderately crowded scenes) and “S2.L3” (densely crowded scenes) videos taken by multiple static cameras with a frame size of 768 × 576 pixels and a frame rate of 25 FPS. The ETH dataset contains video sequences taken by a forward-looking stereo camera mounted on a moving child’s stroller in busy street scenes; the frame rate is 14 FPS and the image size is 640 × 480 pixels. We chose the “Bahnhof” and “Sunny day” sequences from the ETH dataset. The human detection results are the same as those used in [37,38] and were kindly provided by the authors of [22].

Following the metrics used in [24], we use the evaluation metrics described in Table 2. Better MOT performance is indicated by *higher* values of RC and MT and *lower* values of FAF, ML, FRG and IDS. By definition, MT, PT and ML sum to 100%. In general, a higher value of PT is better, but if MOT improves MT through better association capability, PT can decrease because fewer trajectories remain partial. That is, PT depends on the tracklet association performance of the MOT system. Therefore, we exclude PT from the analyses of the experimental results, but it remains in the result tables (Table 3, Table 4, Table 5 and Table 6) for the readers’ reference.
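The MT/PT/ML relationship noted above can be made concrete with a small sketch. The 80%/20% coverage thresholds follow the common convention from the metric literature and are an assumption here, since Table 2's exact definitions are not reproduced in this text.

```python
def mt_pt_ml(coverage, mt_thr=0.8, ml_thr=0.2):
    """MT/PT/ML from per-trajectory coverage ratios: coverage[i] is the
    fraction of ground-truth trajectory i that the tracker covers."""
    n = len(coverage)
    mt = sum(c >= mt_thr for c in coverage) / n   # mostly tracked
    ml = sum(c < ml_thr for c in coverage) / n    # mostly lost
    pt = 1.0 - mt - ml    # partially tracked; the three sum to 100%
    return mt, pt, ml

mt, pt, ml = mt_pt_ml([0.95, 0.5, 0.1, 0.85])
# mt == 0.5, pt == 0.25, ml == 0.25
```

This also illustrates why PT is excluded from the analysis: improving association moves trajectories from PT into MT, so PT falls even as tracking improves.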

*Results and Analysis on the CAVIAR dataset:* Wu and Nevatia [47] presented body-part-detection-based MOT in which a human is represented by four body parts: full body, head-shoulder, torso and legs. Zhang et al. [7] introduced a min-cost flow network-based data association framework with a non-overlap constraint on trajectories. Huang et al. [39] devised a three-level hierarchical data association approach. At the low level, reliable short tracklets are obtained; at the middle level, the Hungarian algorithm further associates the short tracklets; and at the high level, entries/exits and occlusions are estimated from the computed tracklets and used to refine the final trajectories. Li et al. [24] proposed the HybridBoost algorithm for learning tracklet affinity models, in which ranking and classification are jointly considered. Kuo et al. [37] proposed online learned discriminative appearance models (OLDAM) to enhance MOT accuracy through discriminative appearance modelling using an AdaBoost algorithm. Bak et al. [28] proposed an algorithm to learn discriminative appearance models based on a mean Riemannian covariance grid descriptor obtained from tracklets given by short-term tracking. Yang et al. [48] devised MOT by online nonlinear motion pattern learning and multiple instance learning based on an incrementally learned entry/exit map. Table 3 compares the proposed approach with the competing MOT methods on the CAVIAR dataset. From Table 3, it can be seen that the proposed approach achieves the best performance in terms of RC and PRCS, and generally good performance in terms of FAF, MT and IDS. Instances of the tracking results using CRF-Boosting MOT are shown in Figure 5.

*Results and Analysis on the PETS dataset:* Kuo et al. [22] proposed a Person Identity Recognition-based Multi-Person Tracking (PIRMPT) method, which applies person recognition and divides reliable tracklets into query tracklets and gallery tracklets, learning a target-specific appearance-based affinity model for each gallery tracklet.

PIRMPT used a framework similar to that of OLDAM [37] for collecting training samples to learn online discriminative appearance models, but further improved it by automatically learning discriminative features obtained from target-specific appearance information. From Table 4, it can be seen that, compared to the other algorithms, CRF-Boosting obtains the best performance in terms of ML and IDS and comparable performance in terms of PRCS and FRG. Instances of the tracking results using CRF-Boosting MOT are shown in Figure 6.

*Results and Analysis on the ETH dataset:* Kim et al. [49] proposed an online data association method formulated as bipartite matching and solved with structural support vector machines (S-SVM). In Yang and Nevatia [38], an online-learned CRF model is used and MOT is formulated as an energy minimization problem, where the energy function consists of a set of unary terms based on appearance and motion models to discriminate targets.

From Table 5, it can be clearly seen that the proposed method outperforms the other competitive MOT methods in terms of RC, PRCS, MT, ML, FRG and IDS, which shows the significance and robustness of the proposed combination of CRF matching and online hybrid boosting in associating tracklets. Instances of the tracking results using CRF-Boosting MOT are shown in Figure 7.

*Conclusions from Experimental Results on Different Datasets:* The experimental results show the general outperformance of the proposed MOT approach on the CAVIAR dataset, and its good performance compared to the other online MOT methods is also verified on the ETH dataset. However, the results on the PETS dataset suggest that the proposed MOT approach may need to adopt a motion pattern learning approach to improve performance further by modelling nonlinear motion affinity. Also, like the other MOT methods, CRF-Boosting MOT suffers from performance degradation under dense crowding and long-term occlusions. To remedy these issues, it would be beneficial to devise a more advanced appearance modelling approach (e.g., considering different poses and a person re-identification module) and a robust motion modelling approach (e.g., learning different types of motion patterns).

*Discussion on Efficiency of CRF-Boosting Hybridization:* As can be seen from Table 6: (i) ‘CRF-Boosting MOT w/o Boosting’ (i.e., using only CRF matching) produced the worst performance in terms of all metrics; (ii) ‘CRF-Boosting MOT w/o CRF Matching’ (i.e., using only online hybrid boosting) was slightly better than ‘CRF-Boosting MOT w/o Boosting’; and (iii) CRF-Boosting MOT (i.e., with both CRF matching and online hybrid boosting) outperformed the others. From this, we conclude that combining the two components improves MOT performance.

*Discussion on Computational Speed:* We tested the proposed system on a PC equipped with an Intel^{®} Core™ i7-3770 CPU @ 3.40 GHz and 32 GB RAM; the program was coded in Visual Studio Professional 2010 C++ without any parallel programming. As shown in Table 7, the tracking speed of our system is approximately 17 FPS at an image size of 400 × 300. This indicates that the proposed online MOT system is highly feasible for real-time execution with reasonable tracking accuracy.

6. Conclusions and Future Research Agendas

We have presented an online hybrid data association method based on hybrid boosting employing CRF matching to facilitate robust online MOT systems. In the proposed approach, called CRF-boosting, the learned CRF is used to construct reliable low-level tracklets for data association, and these tracklets are then used as the input of the hybrid boosting. Due to the synergetic cascaded learning procedure, CRF-boosting is capable of ensuring sufficient robustness with noisy detection results (i.e., without accurate ground truth information). In addition, a hierarchical association framework is established to improve tracking accuracy. Experiments on public datasets show that the proposed approach generally outperforms the other competitive methods, from which we conclude that the proposed hybridization is effective. We demonstrated hierarchical association of simple features only. Although challenging hand-crafted features such as color similarity-based histograms of oriented gradients with the HSV color space [50] could also be adopted, we did not incorporate such computationally expensive features in this work in consideration of tracking speed. As future work, we will further optimize the code to improve MOT speed, and the challenging features will be incorporated into the hierarchical feature association framework. Finally, we note that a study on substituting a deep learning methodology for the data association scheme is being carried out to obtain significant improvements in tracking accuracy.
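To give a concrete sense of the kind of appearance affinity that color-based features supply, the following sketch compares hypothetical hue histograms (HSV-style hue in [0, 180), OpenCV convention) via histogram intersection; the synthetic patches and bin count are illustrative assumptions and do not reproduce the CS-HOG descriptor of [50]:

```python
import numpy as np

def hue_histogram(patch_hue, bins=16):
    """Normalized histogram of hue values, assumed to lie in [0, 180)."""
    hist, _ = np.histogram(patch_hue, bins=bins, range=(0, 180))
    return hist / max(hist.sum(), 1)

def appearance_affinity(h1, h2):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint."""
    return float(np.minimum(h1, h2).sum())

# Synthetic hue samples: two similar "greenish" patches and one "bluish" patch.
rng = np.random.default_rng(0)
a = rng.normal(60, 5, 500).clip(0, 179)
b = rng.normal(62, 5, 500).clip(0, 179)
c = rng.normal(120, 5, 500).clip(0, 179)

sim_ab = appearance_affinity(hue_histogram(a), hue_histogram(b))
sim_ac = appearance_affinity(hue_histogram(a), hue_histogram(c))
print(sim_ab > sim_ac)  # similar patches score higher
```

Such a similarity could serve as one term in a tracklet-association affinity, though richer descriptors (e.g., gradient-based ones) are needed for robustness to illumination and pose changes.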

Acknowledgments

This work was supported by the ICT R&D program of MSIP/IITP (B0101-16-0525, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis) and by the Brain Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2016M3C7A1905477, NRF-2014M3C7A1046050).

Author Contributions

E. Yang designed the initial model, performed the experiments, and wrote the initial rough draft; J. Gwak further refined and modified the research proposal, carried out experimental analysis, and wrote the final manuscript and responses; M. Jeon administered the experiments and gave technical support and conceptual advice.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Fleuret F., Berclaz J., Lengagne R., Fua P. Multicamera people tracking with a probabilistic occupancy map. IEEE Trans. Pattern Anal. Mach. Intell. 2008;30:267–282. doi: 10.1109/TPAMI.2007.1174. [PubMed] [Cross Ref]

2. Berclaz J., Fleuret F., Turetken E., Fua P. Multiple object tracking using k-shortest paths optimization. IEEE Trans. Pattern Anal. Mach. Intell. 2011;33:1806–1819. doi: 10.1109/TPAMI.2011.21. [PubMed] [Cross Ref]

3. Dalal N., Triggs B. Histograms of oriented gradients for human detection; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; San Diego, CA, USA. 20–26 June 2005.

4. Bourdev L., Maji S., Brox T., Malik J. Detecting people using mutually consistent poselet activations; Proceedings of the 11th European Conference on Computer Vision; Crete, Greece. 5–11 September 2010.

5. Felzenszwalb P.F., Girshick R.B., McAllester D., Ramanan D. Object detection with discriminatively trained part based models. IEEE Trans. Pattern Anal. Mach. Intell. 2010;32:1627–1645. doi: 10.1109/TPAMI.2009.167. [PubMed] [Cross Ref]

6. Jiang H., Fels S., Little J.J. A linear programming approach for multiple object tracking; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Minneapolis, MN, USA. 18–23 June 2007.

7. Zhang L., Li Y., Nevatia R. Global data association for multi object tracking using network flows; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Anchorage, AK, USA. 24–26 June 2008.

8. Perera A.G.A., Srinivas C., Hoogs A., Brooksby G., Hu W. Multi-object tracking through simultaneous long occlusions and split-merge conditions; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; New York, NY, USA. 17–22 June 2006.

9. Yang E., Gwak J., Jeon M. Multi-human tracking using part-based appearance modelling and grouping-based tracklet association for visual surveillance applications. Multimedia Tools Appl. 2016 doi: 10.1007/s11042-015-3219-8. [Cross Ref]

10. Milan A., Schindler K., Roth S. Multi-target tracking by discrete-continuous energy minimization. IEEE Trans. Pattern Anal. Mach. Intell. 2016 doi: 10.1109/TPAMI.2015.2505309. [PubMed] [Cross Ref]

11. Dehghan A., Assari S.M., Shah M. GMMCP Tracker: Globally Optimal Generalized Maximum Multi Clique Problem for Multiple Object Tracking; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Boston, MA, USA. 8–10 June 2015.

12. Milan A., Leal-Taixe L., Schindler K., Reid I. Joint Tracking and Segmentation of Multiple Targets; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Boston, MA, USA. 8–10 June 2015.

13. Chari V., Lacoste-Julien S., Laptev I., Sivic J. On Pairwise Costs for Network Flow Multi-Object Tracking; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Boston, MA, USA. 8–10 June 2015.

14. Tang S., Andres B., Andriluka M., Schiele B. Subgraph Decomposition for Multi-Target Tracking; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Boston, MA, USA. 8–10 June 2015.

15. Dehghan A., Tian Y., Torr P.H.S., Shah M. Target Identity-aware Network Flow for Online Multiple Target Tracking; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Boston, MA, USA. 8–10 June 2015.

16. Xu Y., Liu X., Liu Y., Zhu S. Multi-view People Tracking via Hierarchical Trajectory Composition; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA. 27–30 June 2016.

17. Yu S., Meng D., Zuo W., Hauptmann A. The Solution Path Algorithm for Identity-Aware Multi-Object Tracking; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA. 27–30 June 2016.

18. Milan A., Rezatofighi S.H., Dick A., Schindler K., Reid I. Online Multi-target Tracking using Recurrent Neural Networks. IEEE Conf. Comput. Vis. Pattern Recognit. 2016

19. Xiang Y., Alahi A., Savarese S. Learning to Track: Online Multi-Object Tracking by Decision Making; Proceedings of the International Conference on Computer Vision; Santiago, Chile. 10–18 December 2015.

20. Wen L., Lei Z., Lyu S., Li S.Z., Yang M. Exploiting Hierarchical Dense Structures on Hypergraphs for Multi-Object Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2016;38:1983–1996. doi: 10.1109/TPAMI.2015.2509979. [PubMed] [Cross Ref]

21. Lafferty J., McCallum A., Pereira F.C.N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data; Proceedings of the International Conference on Machine Learning; Williamstown, MA, USA. 28 June–1 July 2001.

22. Kuo C.H., Nevatia R. How does person identity recognition help multi-person tracking?; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Colorado Springs, CO, USA. 20–25 June 2011.

23. Freund Y., Iyer R., Schapire R.E., Singer Y. An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res. 2003;4:933.

24. Li Y., Huang C., Nevatia R. Learning to Associate: Hybrid Boosted Multi-Target Tracker for Crowded Scene; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Miami, FL, USA. 20–25 June 2009.

25. Milan A., Schindler K., Roth S. Detection- and trajectory-level exclusion in multiple object tracking; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Portland, OR, USA. 23–28 June 2013.

26. Poiesi F., Mazzon R., Cavallaro A. Multi-target tracking on confidence maps: An application to people tracking. Comput. Vis. Image Underst. 2013;117:1257–1272. doi: 10.1016/j.cviu.2012.08.008. [Cross Ref]

27. Bae S., Yoon K. Robust Online Multi-Object Tracking based on Tracklet Confidence and Online Discriminative Appearance Learning; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Columbus, OH, USA. 24–27 June 2014.

28. Bak S., Chau D., Badie J., Corvee E., Bremond F., Thonnat M. Multi-target tracking by Discriminative analysis on Riemannian Manifold; Proceedings of the IEEE International Conference on Image Processing; Orlando, FL, USA. 30 September–3 October 2012.

29. Ba S., Alameda-Pineda X., Xompero A., Horaud R. An on-line variational Bayesian model for multi-person tracking from cluttered scenes. Comput. Vis. Image Underst. 2016;153:64–76. doi: 10.1016/j.cviu.2016.07.006. [Cross Ref]

30. Ukita N., Okada A. High-order framewise smoothness-constrained globally-optimal tracking. Comput. Vis. Image Underst. 2016;153:130–142. doi: 10.1016/j.cviu.2016.05.012. [Cross Ref]

31. Milan A., Roth S., Schindler K. Continuous energy minimization for multitarget tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2014;36:58–72. doi: 10.1109/TPAMI.2013.103. [PubMed] [Cross Ref]

32. Kim T.-K., Stenger B., Kittler J., Cipolla R. Incremental linear discriminant analysis using sufficient spanning sets and its applications. Int. J. Comput. Vis. 2011;91:216–232. doi: 10.1007/s11263-010-0381-3. [Cross Ref]

33. Yang B., Huang C., Nevatia R. Learning Affinities and Dependencies for Multi-Target Tracking using a CRF Model; Proceedings of the IEEE Computer Vision and Pattern Recognition; Colorado Springs, CO, USA. 21–23 June 2011.

34. Yang M., Lv F., Xu W., Gong Y. Detection driven adaptive multi-cue integration for multiple human tracking; Proceedings of the IEEE International Conference on Computer Vision; Kyoto, Japan. 29 September–2 October 2009.

35. Wu B., Nevatia R. Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. Int. J. Comput. Vis. 2007;75:247–266. doi: 10.1007/s11263-006-0027-7. [Cross Ref]

36. Liu D.C., Nocedal J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989;45:503. doi: 10.1007/BF01589116. [Cross Ref]

37. Kuo C.-H., Huang C., Nevatia R. Multi-Target Tracking by On-Line Learned Discriminative Appearance Model; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; San Francisco, CA, USA. 13–18 June 2010.

38. Yang B., Nevatia R. An Online Learned CRF Model for Multi-Target Tracking; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Providence, RI, USA. 16–21 June 2012.

39. Huang C., Wu B., Nevatia R. Robust object tracking by hierarchical association of detection responses; Proceedings of the 10th European Conference on Computer Vision; Marseille, France. 12–18 October 2008.

40. Ramos F., Fox D., Durrant-Whyte H. CRF-Matching: Conditional random fields for feature-based scan matching; Proceedings of the Robotics Science and Systems; Atlanta, GA, USA. 27–30 June 2007.

41. Sutton C., Mccallum A. An Introduction to Conditional Random Fields for Relational Learning. In: Getoor L., Taskar B., editors. Introduction to Statistical Relational Learning. MIT Press; Cambridge, MA, USA: 2007.

42. Besag J. Statistical Analysis of Non-lattice Data. Statistician. 1975;24:179. doi: 10.2307/2987782. [Cross Ref]

43. Tuzel O., Porikli F., Meer P. Region covariance: A fast descriptor for detection and classification; Proceedings of the 9th European Conference on Computer Vision; Graz, Austria. 7–13 May 2006.

44. CAVIAR Test Case Scenarios. [(accessed on 15 August 2016)]. Available online: http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/

45. PETS 2009 Benchmark Data. [(accessed on 15 August 2016)]. Available online: http://www.cvg.rdg.ac.uk/PETS2009/a.html.

46. ETH Data. [(accessed on 15 August 2016)]. Available online: https://data.vision.ee.ethz.ch/cvl/aess/dataset/

47. Wu B., Nevatia R. Tracking of multiple, partially occluded humans based on static body part detection; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; New York, NY, USA. 17–22 June 2006.

48. Yang B., Nevatia R. Multi-target tracking by online learning of non-linear motion patterns and robust appearance models; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Providence, RI, USA. 16–21 June 2012.

49. Kim S., Kwak S., Feyereisl J., Kim B.H. Online Multi-Target Tracking by Large Margin Structured Learning; Proceedings of the 11th Asian Conference on Computer Vision; Daejeon, Korea. 5–9 November 2012.

50. Goto Y., Yamauchi Y., Fujiyoshi H. CS-HOG: Color similarity-based hog; Proceedings of the Korea–Japan Joint Workshop on Frontiers of Computer Vision; Incheon, Korea. 30 January–1 February 2013.

Articles from Sensors (Basel, Switzerland) are provided here courtesy of **Multidisciplinary Digital Publishing Institute (MDPI)**