|Home | About | Journals | Submit | Contact Us | Français|
Due to the reasonably acceptable performance of state-of-the-art object detectors, tracking-by-detection is a standard strategy for visual multi-object tracking (MOT). In particular, online MOT is more demanding due to its diverse applications in time-critical situations. A main issue of realizing online MOT is how to associate noisy object detection results on a new frame with previously being tracked objects. In this work, we propose a multi-object tracker method called CRF-boosting which utilizes a hybrid data association method based on online hybrid boosting facilitated by a conditional random field (CRF) for establishing online MOT. For data association, learned CRF is used to generate reliable low-level tracklets and then these are used as the input of the hybrid boosting. To do so, while existing data association methods based on boosting algorithms have the necessity of training data having ground truth information to improve robustness, CRF-boosting ensures sufficient robustness without such information due to the synergetic cascaded learning procedure. Further, a hierarchical feature association framework is adopted to further improve MOT accuracy. From experimental results on public datasets, we could conclude that the benefit of proposed hybrid approach compared to the other competitive MOT systems is noticeable.
Multiple object tracking (MOT) [1,2] is one of the most important and hectic areas in the field of computer vision research, and recent advances on detection and tracking of multiple objects have led to its application to diverse practical problems such as bio-medical imaging, visual surveillance systems and augmented reality. The main tasks of establishing MOT systems are to extract positions of objects, to generate the trajectories of each individual object, and to maintain the identity of each object, even for crowded environments. There are several issues that increase MOT complexity such as imprecise and noisy detections, occlusions by the other objects or background, and dynamic interactions among objects.
Due to the success in developing robust object detectors [3,4,5], many recent studies on MOT adopt tracking-by-detection approaches [6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], where the key research topic is data association to link object detections or tracklets (i.e., track fragments) in a sequence of frames for assembling the final trajectories of the objects. Such MOT systems based on data association consist of two main components: (1) a tracklet affinity model measuring the likelihood (or linking probability) that two detection responses or tracklets belong to the same target; and (2) a global optimization framework determining which detection responses or tracklets should be linked based on the affinity measurement, which is commonly formulated as a maximum a posteriori problem.
Although many methods have been proposed to develop global optimization frameworks based on linear programming , min-cost flow algorithm  and Hungarian algorithm , relatively less effort has been devoted to improving the affinity model. Simple affinity models widely adopted for efficiency purposes are mostly based on straightforward parametric models (e.g., Gaussian distributions for object location changes and distance between color histograms for object appearance affinity measurement). Moreover, in many cases, the model parameters and the relative emphases of different cues are determined depending on prior knowledge or human observation of the data. When environmental changes or different cues (e.g., appearance, motion, and context information) are combined into one affinity model, it is almost impossible to tune the model manually.
To overcome such difficulties, we propose a hybrid data association algorithm combining conditional random field (CRF) [21,22] and online hybrid boosting for building robust MOT. Existing data association approaches adopting different machine learning techniques such as boosting need training sets with ground truth information [23,24] for higher accuracy. While rank boost  achieves better performance than binary boost , it is very difficult to design an online algorithm for this because of its ranking concept. CRF is a powerful model adopted in many computer vision research fields, but not widely utilized in data association for MOT. In our work, with the aim of designing an online MOT system, we incorporate CRF, which enables low-level data association into a hybrid boosting-based data association approach with a ranking concept. Specifically, we represent the association of detection responses between two frames as a graph for CRF, and design an online algorithm by applying the results of CRF-based pairwise similarity matching to build the training data. Finally, the CRF learning output is used for the input to the hybrid boosting algorithm that learns tracklet affinity models. To this end, the contributions of this work are as follows:
The rest of this paper is organized as follows: the preliminaries of this work, CRF and boosting approaches, are described in Section 2. Section 3 describes the details of the proposed hybrid MOT approach. The experimental results and analysis are given in Section 4. Finally, the conclusions and future work are given in Section 5.
One key issue in MOT is how to distinguish targets from background and other objects. To do this, researchers usually try to find or learn proper appearance models which have the capabilities of identifying one target from among all other objects or background. Also, to perform effective tracklets associations, data association frameworks have been widely studied. Most of the MOT methods usually take the tracking-by-detection approaches [6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,25,26,27,28,29,30,31,32,33,34,35] and can be classified into two categories: (1) MOT utilizing past and current frames for association decisions (e.g., [34,35]); and (2) MOT using all the frames, including past, current, and future frames (e.g., [7,8,9,24,36]). The former usually adopts a particle filtering framework based on detection responses, and it is more suitable for time-critical applications and systems because it does not require future frames. However, it is very vulnerable to noisy observations and long-term occlusions of targets. To obtain further improved results, the latter uses all the frames and adopts global optimization. The tracking-by-detection-based MOT methods usually associate detection responses obtained from a pre-trained detector into tracklets progressively and finally construct trajectories for all targets. Appearance models, whether pre-trained or online learned, are commonly adopted to distinguish targets. In addition, motion models can be also adopted to predict the feasible position of objects in the future frames, which reduces the search space. The appearance and motion models may be optimized, but for differentiating all targets each other, there are still some challenges such as: (1) similar appearance of targets; (2) complex interactions among objects; (3) frequent occlusions; (4) different size of targets, and (5) initialization and termination of tracks.
Among the related works, we mainly focus on reviewing the closely related works on boosting-based MOT [9,24,27,37] and CRF-based MOT [33,38]. Boosting-based MOT is easier to implement than CRF-based MOT, and boosting can be used in combination with different learning algorithms to improve its performance. In boosting-based MOT, most studies focus on improving the robustness and effectiveness of appearance models which can be used as distinctive feature information. In contrast, studies on CRF-based MOT have usually focused on data association to generate final trajectories. In Li et al. , HybridBoost was used to learn an appearance model which is integrated in a hierarchical data association framework  to progressively grow tracklets. In Yang et al.  devised a part-based appearance modelling and grouping-based data association framework to alleviate the problems of frequent occlusions and similar appearances among objects. A boosting algorithm was used to learn a part-based appearance model. In Kuo et al. , an online learning approach to build a discriminative appearance model was proposed. The AdaBoost algorithm is used to combine effective image descriptors and their corresponding similarity measurements. To make online learning possible, positive and negative training samples are obtained from the results of short but reliable tracklets using a dual-threshold method . Bae and Yoon  proposed online MOT based on tracklet confidence and online discriminative appearance learning. Effective tracklets are obtained by sequentially linking detections/tracklets using local and global association according to their confidence levels, and incremental linear discriminant analysis  is used for online discriminative appearance model learning. Yang et al.  proposed a CRF model to consider both tracklet affinities and dependences among tracklets, and to transform the problem of MOT into an energy minimization task. In Yang and Nevatia , an online learned CRF model was used to generate final trajectories. For online learning, low-level tracklets are required and are generated by simply using color or location information between two consecutive frames. However, in many cases, this is not practical because it can increase association errors under noisy observation conditions.
In this section, the two key elements, CRF and hybrid boosting methods, which are used to build an effective tracklet affinity model in Section 3, are explained in detail.
CRFs are discriminative undirected probabilistic graphical models developed for labeling/segmenting structural and sequential data [21,40], and it is shown in  that they are competent in modelling spatial relationships. We can define conditional distribution over the hidden variables x given observation z where nodes represents hidden states and nodes indicates data. Using the nodes xi and their connectivity structure represented by undirected edges, we define the conditional distribution over x. Suppose C is the set of cliques which are fully connected subsets in the graph of a CRF, the CRF can factorize the conditional distribution into a product of pairwise clique potentials , where every is a clique in the graph, xc is the variable of the hidden node and z is the observation in the clique. By clique potentials, the conditional distribution over hidden states is written as:
where is the normalizing partition function. Also, is described by log-linear combinations of feature functions fc as follows:
where is a weight vector, and is a feature function. Then, (1) can be rewritten as:
The weights of the feature functions in (3) are determined by the CRF parameter learning. CRF learns the weights discriminatively through maximizing the conditional likelihood of labeled training data. We can find the global optimum of (3) using a numerical gradient method, but it is very inefficient because the inference procedure of the optimization should be executed at each iteration. Thus, we adopt the method of maximizing the pseudo-likelihood of the training data and it is given by the sum of local likelihoods , where is the xi’s Markov blanket indicating the set of the immediate neighbors of xi in the CRF graph . The optimization is performed by minimizing:
where the rightmost term represents a Gaussian shrinkage prior with mean and variance . We use unconstrained L-BFGS  as a gradient descent method to optimize (4). Then, at the inference stage using a new test data, the learned CRF estimate the most likely configuration of all hidden variables x using belief propagation .
Boosting has been successfully used in a variety of machine learning tasks and widely applied to computer vision tasks as well. In this section, for learning an appearance affinity model, we introduce a hybrid boosting algorithm having the property of both a ranking function and a binary classifier.
A ranking problem includes an instance space X with a ranking function H that defines a linear ordering of instances in X. H takes the form of . Proposed by Freund et al. , rank boost is an algorithm invented for this purpose. In rank boost, a set of instance pairs constitute training data, where xj should be ranked higher than xi, . The aim is finding such H that describes the ranking over X.
We can map the ranking problem onto the data association problem. We define instance X to be T × T where T is the set of tracklets to be possibly associated. For example, given tracklets , if T1 and T3 are the real trajectory that should be correctly linked, then the ranking must be and . When Tt is the terminating tracklet of a target trajectory, to prevent associating Tt to any other tracklet Tc, it is defined as where is a rejection threshold. Also, objects in different tracklets in a frame (i.e., at the same time) cannot be the same target. In these cases, it becomes the problem of both ranking and binary classification to define an impossible association link.
To resolve the problem, in the hybrid boosting algorithm, the training set is composed of a ranking sample set R and a binary sample set B. The ranking sample set is denoted by:
where each xi,0 and xi,1 represents a pair of tracklets, and means that the association of xi,1 is ranked higher than xi,0. The binary sample set is denoted by:
where yj = 1 indicates the corresponding xj should be associated at any time, and yj = −1 means the corresponding xj should not be associated. A loss function for the hybrid boosting is defined as a linear combination of the ranking loss function and the binary classification loss function given as:
where β is a constant coefficient and ω0 is the initial weight function. In the boosting algorithm, to find H(x), we need to minimize Z, and H can be obtained by adding new weak ranking classifiers sequentially. Therefore, (7) can be written using weak ranking classifier and its weight αt as follows:
Finally, the final strong ranking classifier is the weighted combination of the selected weak ranking classifiers as follows:
where n is the number of boosting rounds. Attributed to the loss function Z, H(x) contains the advantage of both a ranking classifier and a binary classifier.
In this section, based on the CRF and hybrid boosting discussed in Section 3, we demonstrate how to design a robust online MOT system called CRF-boosting.
For tracking multiple objects robustly under difficult conditions such as with noisy or missed detections, many boosting-based data association methods have used training data with ground truth (GT) information or the like. In many cases, due to the impracticality and inconvenience of obtaining training data with accurate GT information in different situations, offline learning of an affinity model was commonly adopted. However, in such a way, it is very difficult to implement robust online MOT with real-time processing capability. To overcome this drawback, in this work, we generate a CRF model for intermittent temporary tracklet association between two consecutive frames, and the results (i.e., those with selected good samples) from the CRF model are used as the training data for hybrid boosting to establish an online MOT system called CRF-boosting. In addition, based on hierarchical feature association through online hybrid boosting algorithm, detection responses are progressively linked into longer ones to form final tracking outcomes in an online manner. Figure 1 shows the overall schematics of the proposed system.
At the first step, as input data, detection responses are obtained from image sequences. In the hybrid boosting algorithm, we use not only ranking information, but also binary information, and thus it is very crucial to utilize accurate and reliable tracklet information in its training process. To do this, we use a learned CRF model  which can give the similarity information between objects in two consecutive frames. The construction of the CRF model is described in Section 4.2. The reliable short tracklets constructed by the CRF model are used as the input of the hybrid boosting-based data association algorithm that produces the final trajectory information. The details of the hybrid boosting are described in Section 4.3.
In CRF, intermittent temporary connections among detected objects between frames are made with the feature information of the objects. To find the links between two frames, we generate a CRF graph that contains hidden node indicating object i in frame t. In generating a graph of CRF, node is not connected with all nodes at the next frame t; Node is connected with within certain boundary σ from its position (i.e., only neighboring objects are connected) using regional (i.e., local proximity-based) connectivity assuming that the object is not moving suddenly far away between two consecutive frames. Here, we set the . Then, considering the local proximity, an efficient CRF model can be constructed. An example is given in Figure 2.
Node in Figure 2 corresponds to the local features (i.e., observation data) of hidden node (i.e., object i). In this work, we use the spatial distance  and visual appearance including color histogram  and covariance [37,43] as the features. Then, (3) in Section 3.1 expresses conditional distribution of the CRF, and the function of each feature for similarity measurement is defined as the differences of features among the objects. In this CRF, the feature function of spatial distance between object j in frame t − 1 and object i in frame t is defined as:
where is the position of individual points in i, is the position of individual points in j, and σ2 is the variance of the distances in the training data. The feature function of color histogram is defined as:
where is the color histogram of i, is the color histogram of j, and σ2 is the variance of the color histogram differences in the training data. Single channel histograms are concatenated to construct a single vector with 8 bins for each channel, resulting a 24-dimensional vector. Next, the feature function of covariance is computed by:
where are the generalized eigenvalues of Ci and Cj computed from where are generalized eigenvectors; Ci corresponds to the covariance matrix defined as:
where P is the number of pixels in the region of i, denoted as , is the pixel mean vector over , I is the intensity of the pixel and is the vector consists of the first and second derivatives of at p-th pixel, which is given as:
Similar to , the image derivatives are computed using the filters and , resulting covariance of a region is a 9 × 9 matrix.
For learning a hybrid boosting algorithm in an online manner, we have to compose training sets automatically. In this work, the information of matched detection responses as a result of CRF matching (Section 4.2) in consecutive frames are employed for the purpose. The spatio-temporal distance information is used for composing training dataset. The training datasets are divided into the ranking dataset and the binary dataset, where each dataset consists of positive and negative datasets for learning the boosting algorithm. Then, we assume that each tracklet corresponds to an object and the targets at a frame (i.e., at the same time) constitutes the tracklets different from each other. That is, since it is trivial that the objects in different trajectories cannot be the same target, we use this spatio-temporal constraint for building the training data. In this way, using the reliable tracklets output of the CRF matching, we can construct the training dataset for the boosting algorithm. We used the ranking training set defined in (5) and the binary training set defined in (6). Figure 3 shows an example of constructing the training dataset.
Similar to , as shown in Table 1, 13 types of hierarchical features, representing length of tracklets (idx 1 to 3), appearance information of tracklets (idx 4, 5), frame gap information of tracklets (idx 6 to 9), and motion information of tracklets (idx 10, 11), are adopted in this work. The online hybrid boosting algorithm is given in Algorithm 1 (details was discussed in Section 3.2). In the boosting algorithm, each feature is a function , which takes a pair of tracklets as its input and outputs a real value. The weak ranking classifier is defined as:
As described in Figure 4, we design the boosting algorithm with two stages in its training procedure. In constructing trajectories, the two stage training procedures can help to exploit more accurate ranking information, e.g., by appearance affinity models with different poses in a trajectory, through considering different length of tracklets. For this, the maximum length in the first stage is defined as the 1/4 of the full training image sequences, and that in the second stage is 1/2 of the sequences. By training incrementally, we can obtain the more accurate tracklets information rather than utilizing all image sequences at once, which improves the MOT system robustness (i.e., capable of reducing tracking errors).
The procedure of the proposed CRF-boosting algorithm is given in Algorithm 2. In the proposed CRF-Boosting tracker, two-stage training is performed. As a result of CRF-based pairwise similarity matching, robust low-level tracklets are obtained and using them, ranking and binary classification samples are formed in an online manner. Then, a strong ranking classifier is learned using hybrid boosting in Algorithm 1. The CRF-boosting tracker using as the tracklet affinity model is then applied to generate the 1st stage association. The above procedures are repeated to establish the 2nd stage association. Finally, trajectories for all targets are constructed.
|Algorithm 1: Online Hybrid Boosting Algorithm|
|Algorithm 2: CRF-Boosting Tracker with the Two-Stage Training Procedure|
In this section, the experimental results, their analyses, and the experimental conclusions that can be drawn are discussed. We evaluate the effectiveness of our proposed MOT system with three widely used public surveillance datasets: CAVIAR , PETS2009  and ETH . The CAVIAR dataset contains 26 video sequences of corridor in a shopping mall taken by a single camera with frame size of 384 × 288 and frame rate of 25 FPS. The PETS2009 dataset include the “S2.L1” (sparsely crowded scenes), “S2.L2” (moderately crowded scenes), “S2.L3” (densely crowded scenes) videos taken by a multiple static camera with frame size of 768 × 576 pixels and frame rate of 25 FPS. The ETH dataset contains video sequences taken by a stereo forward-looking camera mounted on a moving children’s stroller on busy street scenes. The frame rate is 14 FPS and the image size is 640 × 480 pixels for the videos. We chose the “Bahnhof” and “Sunny day” sequences from the ETH dataset. The human detection results are the same as used in [37,38] and are provided by courtesy of authors of .
Following the metrics used in , we use the evaluation metrics described in Table 2. The better MOT performance is obtained for the higher values in RC and MT and for the lower values in FAF, ML, FRG and IDS. By the definitions, the total sum of MT, PT and ML should be 100%. In general, a higher value of PT is better, but if MOT improves MT by better association capability PT can be decreased because it can result in lesser partial trajectories. That is, PT depends on the tracklet association performance of MOT. Therefore, we exempt PT from the analyses of the experimental results, but it is remained in the resulting tables, Table 3, Table 4, Table 5 and Table 6, for the readers’ reference.
Results and Analysis on the CAVIAR dataset: Wu and Nevatia  presented body-part detection based MOT in which a human are represented by four body parts including full-body, head-shoulder, torso and legs. Zhang et al.  introduced a min-cost flow network based data association framework with a non-overlap constraint on trajectories. Huang et al.  devised three-level hierarchical data association approach. At the low level, reliable short tracklets are obtained, and at the middle level, the Hungarian algorithm is applied to further associate the short tracklets. At the high level, using the computed tracklets, entries/exits and occlusions are estimated, and final trajectories are refined using them. Li et al.  proposed a HybridBoost algorithm for learning tracklet affinity models in which the problem of ranking and classification is jointly considered. Kuo et al.  proposed online learned discriminative appearance models (OLDAM) to enhance MOT accuracy through discriminative appearance modelling using an AdaBoost algorithm. Bak et al.  proposed an algorithm to learn discriminative appearance models based on a mean Riemannian covariance grid descriptor obtained from tracklets given by short-term tracking. Yang et al.  devised MOT by online nonlinear motion patterns learning and a multiple instance learning based on incrementally learned entry/exit map. Table 3 shows the comparison results of the proposed approach with the competing MOT methods on the CAVIAR dataset. From Table 3, it is obviously seen that the proposal could achieve the best performance than the others in terms of RC and PRCS, and generally good performance in terms of FAF, MT and IDS. The instances of the tracking results using CRF-Boosting MOT are shown in Figure 5.
Results and Analysis on the PETS dataset: Kuo et al.  proposed a Person Identity Recognition- based Multi-Person Tracking (PIRMPT) method where they used person recognition and divided reliable tracklets as query tracklets and gallery tracklets in which for each gallery tracklet a target-specific appearance-based affinity model is learned.
PIRMPT used the similar framework of OLDAM  in collecting training samples for learning online discriminative appearance models but it further improved by automatic learning of discriminative features obtained from the target-specific appearance information. From Table 4, compared to the other algorithms it can be seen that CRF-Boosting could obtain best performance in terms of ML and IDS and comparable performance in terms of PRCS and FRG. The instances of the tracking results using CRF-Boosting MOT are shown in Figure 6.
Results and Analysis on the ETH dataset: Kim et al.  proposed an online data association which is formulated as a bipartite matching and solved by structural support vector machines (S-SVM). In Bo and Nevatia , an online learned CRF model is used and MOT is formulated as an energy minimization problem where energy functions consists of a set of unary functions based on appearance and motion models to discriminate targets.
From Table 5, it can be trivially seen that the proposed method could outperform the other competitive MOT methods in terms of RC, PRCS, MT, ML, FRG and IDS, which shows the significance and robustness of the proposed synthesizing of CRF matching and online hybrid boosting in associating tracklets. The instances of the tracking results using CRF-Boosting MOT are shown in Figure 7.
Conclusions from Experimental Results on Different Datasets: From the experimental results on different datasets, we could show the general outperformance of the proposed MOT approach on the CAVIAR dataset and its good performance compared to the other online MOT methods is also verified on the ETH dataset. However, from the results on the PETS dataset, we found that it may be required for the proposed MOT approach to adopt a motion pattern learning approach to improve MOT performance further through modelling nonlinear motion affinity. Also, as the other MOT methods, CRF-Boosting MOT also suffers from performance degradation problems for densely crowded and long-term occlusions. To remedy these issues, it would be beneficial to devise more an advanced appearance modelling approach (e.g., considering different poses and person re-identification module) and robust motion modelling approach (e.g., by learning different types of motion patterns).
Discussion on Efficiency of CRF-Boosting Hybridization: As we can easily can be seen from Table 6 that (i) ‘CRF-Boosting MOT w/o Boosting’ (i.e., only using CRF matching) produced the worst performance in terms of all metrics; (ii) ‘CRF-Boosting MOT w/o CRF Matching’ (i.e., only using online hybrid boosting) was slightly better than ‘CRF-Boosting MOT w/o Boosting’; and (iii) CRF-Boosting MOT (i.e., with CRF matching and online hybrid boosting) outperformed the others. From this, we can conclude that by synthesizing the two components together we could improve MOT performance.
Discussion on Computational Speed: We tested our proposed system on a PC equipped with an Intel® Core™ i7-3770 CPU @ 3.40 GHz and 32 GB RAM, and the program was coded in Visual Studio Professional 2010 C++ without any parallel programming. As shown in Table 7, the tracking speed of our system is approximately 17 FPS on the image size of 400 × 300. This indicates that that the proposed online MOT system has high feasibility to be executed in real-time with reasonable tracking accuracy.
We have presented an online hybrid data association method based on hybrid boosting employing CRF matching to facilitate robust online MOT systems. In the proposed approach, called CRF-boosting, for data association, learned CRF is used to construct reliable low-level tracklets and then they are used as the input of the hybrid boosting. Due to the synergetic cascaded learning procedure, CRF-boosting is capable of ensuring sufficient robustness with noisy detection results (i.e., without accurate ground truth information). Also, a hierarchical association framework is established to improve tracking accuracy. Experiments on public datasets show that the proposed approach could generally outperform the other competitive methods, from which we could naturally conclude that such a hybridized proposal is effective. We only demonstrated hierarchical association of simple features. Although the challenging hand-crafted features such as color similarity-based histograms of oriented gradients with the HSV color space  can be also adopted, we did not consolidate such computationally expensive features in this work considering the tracking speed. As a future work, we will further optimize the codes to get better performance in terms of MOT speed. Also, the challenging features will be also incorporated into the hierarchical feature association framework. Finally, we note that the study of substituting the data association scheme based on deep learning methodology is being carried out to obtain significant performance enhancement in terms of tracking accuracy.
This work was supported by the ICT R&D program of MSIP/IITP. (B0101-16-0525, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis), and the Brain Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2016M3C7A1905477, NRF-2014M3C7A1046050).
E. Yang designed the initial model, performed the experiments, and wrote the initial rough draft; J. Gwak further refined and modified the research proposal, carried out experimental analysis, and wrote the final manuscript and responses; M. Jeon administered the experiments and gave technical support and conceptual advice.
The authors declare no conflict of interest.