|Home | About | Journals | Submit | Contact Us | Français|
Recent developments in social media and cloud storage lead to an exponential growth in the amount of multimedia data, which increases the complexity of managing, storing, indexing, and retrieving information from such big data. Many current content-based concept detection approaches lag from successfully bridging the semantic gap. To solve this problem, a multi-stage random forest framework is proposed to generate predictor variables based on multivariate regressions using variable importance (VIMP). By fine tuning the forests and significantly reducing the predictor variables, the concept detection scores are evaluated when the concept of interest is rare and imbalanced, i.e., having little collaboration with other high level concepts. Using classical multivariate statistics, estimating the value of one coordinate using other coordinates standardizes the covariates and it depends upon the variance of the correlations instead of the mean. Thus, conditional dependence on the data being normally distributed is eliminated. Experimental results demonstrate that the proposed framework outperforms those approaches in the comparison in terms of the Mean Average Precision (MAP) values.
The complexity and cost of the data storage and retrieval for multimedia research and applications have increased tremendously [10,14,21,25,26,28,47]. How to store and index multimedia data in various media types including video, audio, image, text, etc. for efficient and effective data retrieval has drawn a lot of attention [16, 31, 42, 43]. To solve this problem, multimedia data is labeled with respect to their real high-level semantic meanings such as “Person”, “Boat”, and “Football”. These labels are often referred to as “concepts” or “semantic concepts” [8, 32, 41, 44]. The foremost challenge in this research domain is to reduce the gap between the low-level features [19, 29] and high-level semantic concepts [7,10,15,29,48], i.e., to build a connection between the different meanings and conceptions formed by different representation systems.
To bridge the semantic gap [27, 58, 59], a lot of effort has been put into Scale Invariant Feature Transform (SIFT) and Histogram of Oriented Gradients (HOG) based feature detectors [9, 11–13, 15, 45]. Other methods try to increase the ratio of positive and negative data (for example, video frames) to improve the classification accuracy for automatic labeling and to build the correlations between the labeled concepts to utilize underlying predictors [6, 30, 40, 46, 55, 57]. Some notable solutions include the conditional random field (CRF) methods that improve object classification by maximizing its inter-label agreements [12, 37]. In , the CRF method is extended by creating a database of semantic concepts for event detection. On a similar pattern, the ontology based methods utilize the fusion of concept detection confidence scores such as fused Neural Network and concept ontologies to improve the concept identification . In , the authors fused the ontologies with fuzzy logic to deduce the correlations among concepts. Other correlation based frameworks such as  introduced a Domain Adaptive Semantic Diffusion (DASD) based approach to capture the correlations using Pearson Product. More recent ontology based models use linguistic ontology models to correlate different concepts . For instance, [3, 45] united the WordNet model and Association Rule Mining (ARM) for video retrieval. A more recent and promising approach is to use tree based frameworks that model the contextual correlation using a probabilistic tree method and the conditional probability to evaluate the scores using weights [1,17]. The bag-of-words (BoW) model in  effectively uses random forests and K-Nearest Neighbor (KNN) for large datasets. Similar models assign each descriptor to a single concept or multiple concepts using KNN [36, 52, 56].
Random forests are a notion of the general technique of random decision forests that are an ensemble learning method for classification, regression and other tasks. Using random forest classifiers,  proposed a framework for similarity based labeling of concepts to cluster the training images. It has been observed in  that the soft assignment to multiple concepts improves the prediction at the cost of an increased computation time. An interesting framework using random forests and supervised learning reported an improvement in the processing time with a smaller number of classes . An extension of  uses random forests in their image segmentation stage by applying the forest on image pixels . However, several random forest based methods reported challenges with noisy attributes and error propagation and their effects on inter-concept collaboration; while others reported shortcomings on either relying on the conditional independence within concepts and depending highly on the prior knowledge and domain knowledge of the data. Some of the data-oriented approaches rely on the assumption that the data is normally distributed and the distribution of the training and testing datasets are the same. These conditions served as the motivation to our work because several of these requirements are not necessarily valid in video dataset detection. Our proposed framework tries to overcome these shortcomings by extending the work from [33,52,56] where the noise issue was minimized and a good retrieval accuracy was achieved by using unsupervised random forests and large datasets.
The paper is organized as follows. In Section 2, the proposed framework is introduced and descriptions are provided for the important components of the developed random forests. Experimental setup based on the TRECVID dataset and the results are discussed in Section 3. Section 4 concludes the paper with the summary of the key findings and important future directions.
Our framework is modeled as a random forest based regression problem with big data. The model utilizes the semantic content of images to improve the confidence scores in the retrieval of video shots (keyframes). It was deduced that utilizing the correlations of the concepts assume that the data is normally distributed and centered at zero. This represents a case of conditional expectation and the optimal way to improve the annotation would be to calculate the covariance matrix. However, this is not always the real case so that the proposed model was developed for such cases without the normal distribution assumption. Since there is no “mean” at all, the problem is just a multivariate regression problem with correlation due conditional expectation to calculate the predicted value. This is achieved by using an unsupervised multivariate regression forest that does not require any domain knowledge or does not necessitate any distribution requirement. In classical multivariate statistics, estimating the value of one coordinate using other coordinates standardizes them and the predicted outcome, instead of the mean, depends upon the variance of the correlations.
We consider the scores of 346 concepts from the IACC.1.B dataset in TRECVID 2015 as a 346-dimensional multivariate vector and there are more than 130,000 observations (video shots). Sample images for some of the concepts are depicted in Figure 1.
Our proposed framework first splits the TRECVID 2015 data equally into a training data and a testing data. The two data sets are used in the training and testing parts respectively as shown in Figures 2 and and3.3. The goal is to improve the confidence scores of each concept for all of the observations. Since there is no output variable, we model each instance as a conditional regression problem to predict its best estimate. For any given testing instance, to predict Ci, we take all other variables from C1, C2, C3, …, Ci−1, Ci+1, …, C346 and regress the value of Ci, using random forests, against this high dimensional large dataset. This process is repeated for all concepts and video shots.
In the training part, a state-of-the-art concept detection framework is applied to the video shots in the training data set and the detection confidence scores for each concept are evaluated. Please note that the focus of this paper is not on the initial concept detection performance but rather on the score improvement in the latter step. Thus, the central part of the proposed framework is kept flexible so that the scores output from any concept detection framework could be utilized with our framework. The variable importance (VIMP) evaluator permutes all 346 concepts and identifies the most significant concepts in the prediction of each concept. This results in significantly reducing the dimensionality and the output of this essential component is used in the testing part. We also grow a synthetic forest to empirically identify the most suitable forest tuning parameters such as mtry and node size for the domain of multimedia concepts detection.
In the testing part, after the detection scores are generated from the concept detection framework, the scores are forwarded to the multivariate regression forest where each concept is predicted as a missing value problem treated by multivariate regression. The VIMP and tuning parameters are used to reduce the dimensionality and fine tune the forest. Finally, the scores output from all the randomly grown trees are assembled together to give the final predicted confidence scores of each concept.
The prediction of each testing video shot is performed by a process called Bootstrap Aggregating (BAGGING). Bootstrap aggregating and random forests were introduced in  where it was concluded that the model is always overfitted and by randomly perturbing the dataset and taking the ensemble of that dataset will reduce the overall variance and effectively turn the random forests into highly accurate estimators. It was also proposed that the random forest is a great way for noise reduction and for building a model with low variance .
A random forest is an aggregation of ntree number of trees, usually in thousands, and each tree is grown by bootstrapping a randomly sampled vector mtry from the complete dataset. Each tree in the random forest collection is grown non-deterministically with a two stage method. In the first stage, randomization is induced in each tree by randomly selecting sub-sampled data (bootstrapping) from the original data. The second stage randomization is applied at the node level, where each node is split by randomly selecting a variable from the sub-sampled variables and only those variables are utilized to get the best possible split. This process results in substantially de-correlating the trees so that the final ensemble or the average among the trees will have low variance. Each tree is grown to a depth where the terminal nodes contain at least nodesize number of video frames or cases. Algorithm 1 lists the steps of constructing a random forest.
To achieve this, we begin by modeling the prediction based on the regression setting for which we have a numerical outcome called Y. The learned or observed data is assumed to be independently drawn from the joint distribution of (X, Y) and comprises n * (p + 1) samples, namely (x1, y1), …, (xn, yn). X is an n by p matrix indicating the total number of video frames (or samples) and their features Y, where X=[x1, …, xn]T, Y =[y1, …, yn]T, xi is the subsampled vector (of size 1 by p) from X for the ith sample, p is the total number of features (or dimensions), and Y indicates the vector of outcome variables (yi, i=1 to n) that are to be regressed using the random forest.
The random forest for regression is built by growing the trees based on a random vector θk such that the tree predictor h(x, θk) takes on numerical values as opposed to class labels. The vector θk contains regressed values of the outcome variable Y. The output values are numerical values and we assume that the training dataset is independently drawn from the distribution of the random vector X and random vector Y.
Then, the regression based random forest prediction is defined as the unweighted average over the collection of the predictor trees as shown in Equation (2), where h(x; θk), k = 1, …, ntree are the collection of the tree predictors and x represents the observed input variable vector of length mtry with the associated i.i.d random vector θk.
As k → ∞, the Law of Large Numbers ensures:
where θ represents the regressed outcome variable average over ntree trees. The quantity on the right is the prediction (or generalization) error for the random forest, designated . The convergence in Equation (3) implies that the random forests do not overfit. Now the average prediction error for each individual tree is defined in Equation (4).
The common element in all of these procedures is that for the kth tree, a random vector θk is generated, independent of the past random vectors θ1, …, θk−1 but with the same distribution; and a tree is grown using the training dataset and θk, resulting in a classifier h(x, θk) where x is an input vector. After developing the forest, we further fine tune it by reducing the dimensionality of the features. This is achieved by optimizing mtry, nodesize, and variable importance (VIMP) as described in the following subsection.
There are three key factors to optimize the maximum throughput from a random forest, namely nodesize, mtry, and VIMP. Their parameters as used in the proposed framework and subsequent justifications are provided as follows. When deciding upon nodesize, some methods like  argue that large sampled terminal nodes provide consistent results. On the other hand,  advises to grow the random forest trees very deeply, i.e., expanding the trees until the terminal nodes contain only one variable. Although this causes very skewed and deep trees that require relatively longer times to compute, it has been observed empirically that near singular terminal sizes are more effective in high dimensional problems . This is because that the trees are grown to purity, i.e., single sampled terminal nodes resulting in a much lower bias. While deep trees result in low bias values, the final ensemble or aggregation of all the trees reduces the variance. Thus we opt our forest to be grown in near purity.
VIMP is another tuning feature of the random forests that we utilize to rank each variable based on its predictability. VIMP calculates the increase in the prediction error for the forest aggregation by randomly noising up a variable and permuting its value. The larger the VIMP value of each variable, the more predictive the variable is. VIMP helps to select only the most predictive variables in the prediction process and helps implement the dimensionality reduction in an efficient way. Empirical results show that in some cases the number of prediction variables were reduced down to 1%, which also significantly reduced the computation time. The most commonly used permutation method is the Breiman-Cutler importance measure for the random forest. In the method, the variable importance VI of a feature variable Xj in tree k is evaluated as shown in Equation (5).
where Xj is the jth feature from X and k is the out-of-bag (OOB) sample of the variable for a particular tree k, with k 1, …, ntree. Moreover, is described as the selected class for observation i before permuting, is the class for observation i after permuting its value for variable Xj, and I(.) is the identity function. γi represents the observed class for the observation i. Please note that if variable Xj is not in tree k, V I(k)(Xj) = 0 by definition. The raw variable importance score for each variable is then computed as the mean importance over all trees as given in Equation (6).
One of the key techniques in calculating the VIMP variable is to keep the mtry variable very close to p, where p is the total number of predicting variables (in our case 346), and mtry is the number of randomly subsampled variables to be used in each tree. The default setting for choosing mtry is , but it has been argued in  by several empirical studies to keep mtry close to M = 7/8 × p. This is because if the mtry variables chosen for the root node are noisy (i.e., they are not predictive for the outcome), then the predicted variable and the permuted importance of the variable are also noised up . This principle is depicted in Figure 4, i.e., the larger number of mtry helps better identify the variable importance (VIMP). The colors are used to indicate the relevance of the variables with color red being highly predictive.
For this paper, we use TRECVID 2015 dataset which is a huge dataset with lots of imbalanced concepts. The TRECVID conference series is sponsored by the National Institute of Standards and Technology (NIST) with additional support from other U.S. government agencies. The goal of the conference series is to encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. The TRECVID dataset is very suitable for our experiment due to its vast volume. We choose the IACC.1.B dataset used in the TRECVID 2015 semantic indexing (SIN) task which aims to detect the semantic concept contained within a video shot. Challenges such as data imbalance , scalability, and the semantic gap  make the SIN task tough.
In the IACC.1.B dataset, there are 137,327 observations by extracting a keyframe from each shot. Totally 346 concepts are given including many popular semantic concepts include “Vehicle”, “Airplane”, and “Cloud” which are common and appear in many papers. It also contains many rare and imbalanced concepts such as “Security Checkpoint”, “Helicopter Hovering”, and “Mosques”. The distribution of some concepts are highly skewed in which the majority of the data instances belong to one class and far fewer data instances belong to others. The list of concepts and detailed explanations can be found in .
The average precision (AP) value is used as a metric which is widely used in the multimedia concept retrieval domain. For a given concept, P re(i) indicates the precision at cut-off i in the item list, and N is for the number of the retrieved data instances. The average precision at N (i.e., AP @N) is defined in Equation (7). If the denominator is zero, AP is set to zero. By generating AP for all concepts and calculating the mean value of them, the mean average precision (MAP) value is calculated for evaluation.
In our experiment, we choose 20 highly imbalanced concepts for testing including “Airplane Takeoff”, “Emergency Vehicles”, “Military”, “Natural-Disaster”, “US Flags”, “Airplane Landing”, “Airport Or Airfield”, “Car Crash”, “Cigar Boats”, “Earthquake”, “Military Base”, “Rowboat”, “Election Campaign Debate”, “Election Campaign Greeting”, “Exiting A Vehicle”, “Exiting Car”, “Flags”, “Military Aircraft”, “Rescue Vehicle”, and “Prisoner”. Also, the detection scores from the group of DVMM Lab of Columbia University  for shots are used as the raw scores and the benchmark. Their group got the best performance on TRECVID IACC.1.B dataset but the raw scores for the many imbalanced concepts are relatively low and need to be enhanced.
To conduct the comparison, the proposed framework is evaluated against the following four approaches. The first one, “Benchmark”, is the raw scores we got from  without any modification. The “Naive Bayes” approach is based on applying the Bayes’ theorem with strong independence assumptions between the scores. In the implementation of our approach, the selected 20 imbalanced concepts with the p/n ratio values lower than 0.001 are tested and the VIMP-based random forests are applied. We also compare our work with random forests without VIMP. In the proposed work, the dataset is split in half, one for training and one for testing. The comparison results are shown in Table 1.
As can be seen from Table 1, since the assumption of the “Naive Bayes” approach is not true for many concepts like “sea” and “fish”, the accuracy is very low as expected. The random forests without VIMP also fail to enhance the raw scores as well, and this may be caused by the inappropriate tree built process. Among all the four methods, our proposed framework achieves the best performance and successfully enhances the raw scores, which proves the novelty of using random forests with VIMP and shows good MAP results of our proposed framework.
Many of the multimedia content based semantic data mining methods face a very complex challenge known as the semantic gap problem. This is the problem of connecting low level details of the image with its high level concepts. The problem becomes even more challenging with those concepts that are rare and imbalanced. In this paper, the proposed framework attempts to solve this problem by utilizing the unsupervised random forest classifiers. Several experiments were conducted on the TRECVID dataset and the results were compared with several existing frameworks. The proposed method illustrates the improvement in terms of the Mean Average Precision (MAP) values for the rare and imbalanced concepts. Furthermore, our proposed random forest approach with VIMP successfully reduces the dependency on domain knowledge and the restriction on data distributions.
For Shu-Ching Chen, this research is partially supported by DHS’s VACCINE Center under Award Number 2009-ST-061-CI0001 and NSF HRD-0833093, HRD-1547798, CNS-1126619, and CNS-1461926.