Proc IEEE Int Conf Inf Reuse Integr. Author manuscript; available in PMC 2017 September 17.
Published in final edited form as:
PMCID: PMC5600875
NIHMSID: NIHMS902611

Enhancing Multimedia Imbalanced Concept Detection Using VIMP in Random Forests

Abstract

Recent developments in social media and cloud storage have led to exponential growth in the amount of multimedia data, which increases the complexity of managing, storing, indexing, and retrieving information from such big data. Many current content-based concept detection approaches fall short of bridging the semantic gap. To address this problem, a multi-stage random forest framework is proposed to generate predictor variables based on multivariate regressions using variable importance (VIMP). By fine-tuning the forests and significantly reducing the number of predictor variables, the concept detection scores are evaluated when the concept of interest is rare and imbalanced, i.e., having little collaboration with other high-level concepts. In classical multivariate statistics, estimating the value of one coordinate from the other coordinates standardizes the covariates, and the estimate depends on the variance of the correlations instead of the mean. Thus, the dependence on the data being normally distributed is eliminated. Experimental results demonstrate that the proposed framework outperforms the other approaches in the comparison in terms of Mean Average Precision (MAP) values.

Keywords: Multimedia imbalanced concept detection, Multivariate regression, Variable importance (VIMP), Random forests

1. Introduction

The complexity and cost of data storage and retrieval for multimedia research and applications have increased tremendously [10,14,21,25,26,28,47]. How to store and index multimedia data of various media types, including video, audio, image, and text, for efficient and effective data retrieval has drawn a lot of attention [16, 31, 42, 43]. To solve this problem, multimedia data is labeled with respect to its real high-level semantic meanings such as “Person”, “Boat”, and “Football”. These labels are often referred to as “concepts” or “semantic concepts” [8, 32, 41, 44]. The foremost challenge in this research domain is to reduce the gap between the low-level features [19, 29] and high-level semantic concepts [7,10,15,29,48], i.e., to build a connection between the different meanings and conceptions formed by different representation systems.

To bridge the semantic gap [27, 58, 59], a lot of effort has been put into Scale Invariant Feature Transform (SIFT) and Histogram of Oriented Gradients (HOG) based feature detectors [9, 11–13, 15, 45]. Other methods try to increase the ratio of positive to negative data (for example, video frames) to improve the classification accuracy for automatic labeling and to build the correlations between the labeled concepts to utilize underlying predictors [6, 30, 40, 46, 55, 57]. Some notable solutions include the conditional random field (CRF) methods that improve object classification by maximizing inter-label agreement [12, 37]. In [34], the CRF method is extended by creating a database of semantic concepts for event detection. Similarly, the ontology based methods utilize the fusion of concept detection confidence scores, such as fused neural networks and concept ontologies, to improve concept identification [4]. In [18], the authors fused the ontologies with fuzzy logic to deduce the correlations among concepts. Other correlation based frameworks such as [24] introduced a Domain Adaptive Semantic Diffusion (DASD) based approach to capture the correlations using the Pearson product-moment correlation. More recent ontology based models use linguistic ontology models to correlate different concepts [2]. For instance, [3, 45] united the WordNet model and Association Rule Mining (ARM) for video retrieval. A more recent and promising approach is to use tree based frameworks that model the contextual correlation using a probabilistic tree method and the conditional probability to evaluate the scores using weights [1,17]. The bag-of-words (BoW) model in [51] effectively uses random forests and K-Nearest Neighbor (KNN) for large datasets. Similar models assign each descriptor to a single concept or multiple concepts using KNN [36, 52, 56].

Random forests, an instance of the general technique of random decision forests, are an ensemble learning method for classification, regression, and other tasks. Using random forest classifiers, [20] proposed a framework for similarity-based labeling of concepts to cluster the training images. It has been observed in [53] that soft assignment to multiple concepts improves the prediction at the cost of increased computation time. An interesting framework using random forests and supervised learning reported an improvement in processing time with a smaller number of classes [35]. An extension of [35] uses random forests in the image segmentation stage by applying the forest to image pixels [39]. However, several random forest based methods reported challenges with noisy attributes and error propagation and their effects on inter-concept collaboration, while others reported shortcomings from either relying on conditional independence among concepts or depending heavily on prior knowledge and domain knowledge of the data. Some of the data-oriented approaches rely on the assumptions that the data is normally distributed and that the training and testing datasets have the same distribution. These conditions motivated our work because several of these requirements are not necessarily valid for video datasets. Our proposed framework tries to overcome these shortcomings by extending the work from [33,52,56], where the noise issue was minimized and good retrieval accuracy was achieved by using unsupervised random forests and large datasets.

The paper is organized as follows. In Section 2, the proposed framework is introduced. Section 3 describes the important components of the developed VIMP-based random forests. The experimental setup based on the TRECVID dataset and the results are discussed in Section 4. Section 5 concludes the paper with a summary of the key findings and important future directions.

2. The Proposed Framework

Our framework is modeled as a random forest based regression problem with big data. The model utilizes the semantic content of images to improve the confidence scores in the retrieval of video shots (keyframes). Methods that utilize the correlations among concepts typically assume that the data is normally distributed and centered at zero. This represents a case of conditional expectation, and the optimal way to improve the annotation would be to calculate the covariance matrix. However, this assumption does not always hold in practice, so the proposed model was developed for cases without the normal distribution assumption. Since there is no “mean” at all, the problem is just a multivariate regression problem in which the predicted value is calculated from the correlations through conditional expectation. This is achieved by using an unsupervised multivariate regression forest that requires neither domain knowledge nor any distributional assumption. In classical multivariate statistics, estimating the value of one coordinate from the other coordinates standardizes them, and the predicted outcome depends on the variance of the correlations instead of the mean.

We consider the scores of 346 concepts from the IACC.1.B dataset in TRECVID 2015 as a 346-dimensional multivariate vector and there are more than 130,000 observations (video shots). Sample images for some of the concepts are depicted in Figure 1.

Figure 1
Sample images of concepts from TRECVID 2015 data

Our proposed framework first splits the TRECVID 2015 data equally into a training set and a testing set. The two sets are used in the training and testing parts, respectively, as shown in Figures 2 and 3. The goal is to improve the confidence scores of each concept for all of the observations. Since there is no output variable, we model each instance as a conditional regression problem to predict its best estimate. For any given testing instance, to predict Ci, we take all other variables C1, C2, C3, …, Ci−1, Ci+1, …, C346 and regress the value of Ci on them, using random forests, over this high-dimensional large dataset. This process is repeated for all concepts and video shots.
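The per-concept regression setup can be illustrated with a small sketch. It assumes the detection scores are held as a list of rows (one row per video shot, one column per concept); the function name `leave_one_out_design` and the toy 3 × 4 score matrix are hypothetical and only show how the predictors and the target are separated for one concept Ci.

```python
def leave_one_out_design(scores, i):
    """Split a score matrix into predictors (all concepts except i) and target (concept i)."""
    X = [row[:i] + row[i + 1:] for row in scores]  # C1..Ci-1, Ci+1..Cp
    y = [row[i] for row in scores]                 # Ci, the value to be regressed
    return X, y

# Toy score matrix: 3 shots x 4 concepts (the paper uses 346 concepts).
scores = [
    [0.9, 0.1, 0.4, 0.2],
    [0.2, 0.8, 0.5, 0.1],
    [0.7, 0.3, 0.6, 0.9],
]

# Treat the third concept (index 2) as the outcome and the rest as predictors.
X, y = leave_one_out_design(scores, 2)
```

Repeating this call for every concept index yields one regression problem per concept, each of which could then be passed to a regression forest.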

Figure 2
Forest optimization using the training dataset
Figure 3
Multivariate regression forest grown on the testing dataset

In the training part, a state-of-the-art concept detection framework is applied to the video shots in the training data set and the detection confidence scores for each concept are evaluated. Please note that the focus of this paper is not on the initial concept detection performance but rather on the score improvement in the latter step. Thus, the central part of the proposed framework is kept flexible so that the scores output from any concept detection framework could be utilized with our framework. The variable importance (VIMP) evaluator permutes all 346 concepts and identifies the most significant concepts in the prediction of each concept. This results in significantly reducing the dimensionality and the output of this essential component is used in the testing part. We also grow a synthetic forest to empirically identify the most suitable forest tuning parameters such as mtry and node size for the domain of multimedia concepts detection.

In the testing part, after the detection scores are generated from the concept detection framework, the scores are forwarded to the multivariate regression forest where each concept is predicted as a missing value problem treated by multivariate regression. The VIMP and tuning parameters are used to reduce the dimensionality and fine tune the forest. Finally, the scores output from all the randomly grown trees are assembled together to give the final predicted confidence scores of each concept.

The prediction for each testing video shot is performed by a process called bootstrap aggregating (bagging). Bootstrap aggregating and random forests were introduced in [5], where it was argued that a single model is prone to overfitting, and that randomly perturbing the dataset and taking the ensemble over the perturbed datasets reduces the overall variance and effectively turns the random forest into a highly accurate estimator. It was also proposed that the random forest is an effective method for noise reduction and for building a model with low variance [5].
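As a rough illustration of bootstrap aggregating on its own, the sketch below bags a deliberately high-variance base learner (a one-nearest-neighbor regressor, chosen here only for brevity and not part of the proposed framework). The data points, the function names, and the number of bootstrap replicates are all assumptions made for the example.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) cases with replacement from the original data."""
    return [rng.choice(data) for _ in data]

def one_nn_predict(sample, x):
    """High-variance base learner: return the outcome of the nearest case."""
    return min(sample, key=lambda case: abs(case[0] - x))[1]

def bagged_predict(data, x, n_boot=200, seed=0):
    """Average the base learner's predictions over n_boot bootstrap samples."""
    rng = random.Random(seed)
    preds = [one_nn_predict(bootstrap_sample(data, rng), x) for _ in range(n_boot)]
    return sum(preds) / len(preds)

# Toy (x, y) pairs; hypothetical values for illustration only.
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]
estimate = bagged_predict(data, 2.4)
```

Each bootstrap replicate may pick a different nearest neighbor, so the individual predictions jump between observed outcomes, while their average changes more smoothly with the query point.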

3. VIMP-based Random Forests

3.1. Random Forests

A random forest is an aggregation of ntree trees, usually in the thousands, and each tree is grown on a bootstrap sample of the complete dataset using mtry randomly sampled variables. Each tree in the random forest collection is grown non-deterministically with a two-stage method. In the first stage, randomization is induced in each tree by randomly selecting sub-sampled data (bootstrapping) from the original data. The second stage of randomization is applied at the node level, where each node is split by randomly selecting a subset of the variables, and only those variables are considered to find the best possible split. This process substantially de-correlates the trees so that the final ensemble, or the average among the trees, has low variance. Each tree is grown to a depth where the terminal nodes contain at least nodesize video frames or cases. Algorithm 1 lists the steps of constructing a random forest.

To achieve this, we begin by modeling the prediction in the regression setting, for which we have a numerical outcome called Y. The learned or observed data is assumed to be independently drawn from the joint distribution of (X, Y) and comprises n samples, i.e., n × (p + 1) values, namely (x1, y1), …, (xn, yn). X is an n by p matrix whose rows are the video frames (or samples) and whose columns are their features, where X = [x1, …, xn]T and Y = [y1, …, yn]T; xi is the feature vector (of size 1 by p) for the ith sample, p is the total number of features (or dimensions), and Y is the vector of outcome variables (yi, i = 1 to n) that are to be regressed using the random forest.

The random forest for regression is built by growing the trees based on a random vector θk such that the tree predictor h(x, θk) takes on numerical values as opposed to class labels. The vector θk contains regressed values of the outcome variable Y. The output values are numerical values and we assume that the training dataset is independently drawn from the distribution of the random vector X and random vector Y.

Then, the regression based random forest prediction is defined as the unweighted average over the collection of the predictor trees as shown in Equation (2), where h(x; θk), k = 1, …, ntree are the collection of the tree predictors and x represents the observed input variable vector of length mtry with the associated i.i.d. random vector θk.

$$\bar{h}(x) = \frac{1}{\mathit{ntree}} \sum_{k=1}^{\mathit{ntree}} h(x; \theta_k). \tag{2}$$

As ntree → ∞, the Law of Large Numbers ensures:

$$E_{X,Y}\left(Y - \bar{h}(X)\right)^2 \rightarrow E_{X,Y}\left(Y - E_{\theta}\, h(X; \theta)\right)^2, \tag{3}$$

where θ represents the regressed outcome variable average over ntree trees. The quantity on the right is the prediction (or generalization) error for the random forest, designated PEf. The convergence in Equation (3) implies that the random forests do not overfit. Now the average prediction error for each individual tree is defined in Equation (4).

$$PE_t = E_{\theta} E_{X,Y}\left(Y - h(X; \theta)\right)^2. \tag{4}$$
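A small numeric check can make the relationship between the two error quantities concrete. The sketch below uses made-up predictions from three hypothetical trees on three cases, so it computes empirical analogues of PEf and PEt on a fixed sample rather than the expectations in Equations (3) and (4). Because squared error is convex, the error of the averaged prediction can never exceed the average of the individual tree errors.

```python
def mse(preds, ys):
    """Mean squared error between predictions and observed outcomes."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Hypothetical predictions: 3 trees (rows) x 3 cases (columns).
tree_preds = [
    [1.0, 2.0, 3.5],
    [1.4, 1.6, 2.5],
    [0.6, 2.6, 3.0],
]
ys = [1.0, 2.0, 3.0]

# Forest prediction: unweighted average over trees, case by case (Equation (2)).
forest_pred = [sum(col) / len(col) for col in zip(*tree_preds)]

pe_forest = mse(forest_pred, ys)                                  # analogue of PEf
pe_tree = sum(mse(p, ys) for p in tree_preds) / len(tree_preds)   # analogue of PEt
print(pe_forest <= pe_tree)
```

In this toy sample the individual trees err in different directions, so averaging cancels much of their error.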

The common element in all of these procedures is that for the kth tree, a random vector θk is generated, independent of the past random vectors θ1, …, θk−1 but with the same distribution; and a tree is grown using the training dataset and θk, resulting in a predictor h(x, θk) where x is an input vector. After developing the forest, we further fine-tune it by reducing the dimensionality of the features. This is achieved by optimizing mtry, nodesize, and variable importance (VIMP) as described in the following subsection.

Algorithm 1

Construction of Random Forests

  1. Draw the ntree bootstrap samples from the original data.
  2. Grow a tree for each bootstrap data set. At each node of the tree, randomly select mtry variables for splitting. Grow the tree so that each terminal node has no fewer than the nodesize cases.
  3. Aggregate the information from the ntree trees for a new data prediction, e.g., majority voting for classification or averaging for regression.
  4. Compute an out-of-bag (OOB) error rate by using the data not in the bootstrap samples (Equation (1)).
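    $$\mathrm{MSE}_{OOB} = n^{-1} \sum_{i=1}^{n} \left\{ y_i - \hat{y}_i^{OOB} \right\}^2, \tag{1}$$
    where n indicates the total number of OOB observations (video frames), $y_i$ is the observed outcome of the ith observation, and $\hat{y}_i^{OOB}$ is the average prediction for the ith observation over the trees for which it is out-of-bag.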
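The steps of Algorithm 1 can be sketched in plain Python for the regression case. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the synthetic data, the helper names, and the parameter values (ntree = 30, mtry = 3, nodesize = 2) are all hypothetical, and splits are chosen by minimizing the within-node sum of squared errors.

```python
import random

def sum_sq(ys):
    """Sum of squared deviations from the mean (node impurity for regression)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(rows, features, nodesize):
    """Best (sse, feature, threshold) over `features`, or None if no valid split."""
    best = None
    for f in features:
        values = sorted({x[f] for x, _ in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            if len(left) < nodesize or len(right) < nodesize:
                continue  # keep at least `nodesize` (>= 1) cases in each child
            sse = sum_sq(left) + sum_sq(right)
            if best is None or sse < best[0]:
                best = (sse, f, t)
    return best

def grow_tree(rows, mtry, nodesize, rng):
    """Step 2: recursive partitioning with mtry randomly chosen variables per node."""
    features = rng.sample(range(len(rows[0][0])), mtry)
    split = best_split(rows, features, nodesize)
    if split is None:
        return sum(y for _, y in rows) / len(rows)  # terminal node: mean outcome
    _, f, t = split
    left = [r for r in rows if r[0][f] <= t]
    right = [r for r in rows if r[0][f] > t]
    return (f, t, grow_tree(left, mtry, nodesize, rng), grow_tree(right, mtry, nodesize, rng))

def predict_tree(tree, x):
    """Follow the splits until a terminal node (a float) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def grow_forest(data, ntree, mtry, nodesize, seed=0):
    """Steps 1 and 2: one bootstrap sample and one tree per iteration."""
    rng = random.Random(seed)
    n = len(data)
    forest = []
    for _ in range(ntree):
        in_bag = [rng.randrange(n) for _ in range(n)]
        tree = grow_tree([data[i] for i in in_bag], mtry, nodesize, rng)
        forest.append((tree, set(range(n)) - set(in_bag)))  # keep the OOB indices
    return forest

def predict_forest(forest, x):
    """Step 3 for regression: unweighted average over the trees."""
    return sum(predict_tree(tree, x) for tree, _ in forest) / len(forest)

def oob_mse(forest, data):
    """Step 4: squared error of OOB predictions, over cases that are OOB at least once."""
    errors = []
    for i, (x, y) in enumerate(data):
        preds = [predict_tree(tree, x) for tree, oob in forest if i in oob]
        if preds:
            errors.append((y - sum(preds) / len(preds)) ** 2)
    return sum(errors) / len(errors)

# Synthetic data: 60 cases, 4 variables, only variable 0 is informative.
data_rng = random.Random(1)
data = []
for _ in range(60):
    x = [data_rng.random() for _ in range(4)]
    data.append((x, 3.0 * x[0] + data_rng.gauss(0, 0.05)))

forest = grow_forest(data, ntree=30, mtry=3, nodesize=2)
mse = oob_mse(forest, data)
print(f"OOB MSE: {mse:.4f}")
```

The exhaustive threshold search is quadratic in the node size and is only practical for toy data; a production library would sort each variable once per node.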

3.2. Optimizing the Forest

Three key factors are tuned to get the most out of a random forest, namely nodesize, mtry, and VIMP. The settings used in the proposed framework and their justifications are as follows. When deciding upon nodesize, some methods like [38] argue that terminal nodes with large samples provide consistent results. On the other hand, [5] advises growing the random forest trees very deeply, i.e., expanding the trees until the terminal nodes contain only one sample. Although this produces very skewed and deep trees that take relatively longer to compute, it has been observed empirically that near-singular terminal node sizes are more effective in high-dimensional problems [22]. This is because the trees are grown to purity, i.e., to single-sample terminal nodes, resulting in a much lower bias. While deep trees result in low bias, the final ensemble or aggregation of all the trees reduces the variance. Thus, we opt to grow our forest to near purity.

VIMP is another tuning feature of random forests, which we use to rank each variable by its predictive power. VIMP measures the increase in the prediction error of the forest when a variable is noised up by randomly permuting its values. The larger the VIMP value of a variable, the more predictive the variable is. VIMP helps to select only the most predictive variables in the prediction process and helps implement the dimensionality reduction in an efficient way. Empirical results show that in some cases the number of predictor variables was reduced to 1% of the original, which also significantly reduced the computation time. The most commonly used permutation method is the Breiman-Cutler importance measure for random forests. In this method, the variable importance VI of a feature variable Xj in tree k is evaluated as shown in Equation (5).

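$$VI^{(k)}(X_j) = \frac{\sum_{i \in \bar{B}^{(k)}} I\left(\gamma_i = \gamma_i^{(k)}\right)}{\left|\bar{B}^{(k)}\right|} - \frac{\sum_{i \in \bar{B}^{(k)}} I\left(\gamma_i = \gamma_{i,\pi_j}^{(k)}\right)}{\left|\bar{B}^{(k)}\right|}, \tag{5}$$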

where Xj is the jth feature from X and $\bar{B}^{(k)}$ is the out-of-bag (OOB) sample for a particular tree k, with k ∈ {1, …, ntree}. Moreover, $\gamma_i^{(k)}$ is the predicted class for observation i before permuting, $\gamma_{i,\pi_j}^{(k)}$ is the predicted class for observation i after permuting its value of variable Xj, and I(·) is the indicator function. $\gamma_i$ represents the observed class for observation i. Please note that if variable Xj is not in tree k, VI(k)(Xj) = 0 by definition. The raw variable importance score for each variable is then computed as the mean importance over all trees, as given in Equation (6).

$$VI(X_j) = \frac{\sum_{k=1}^{\mathit{ntree}} VI^{(k)}(X_j)}{\mathit{ntree}}. \tag{6}$$
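The permutation importance in Equations (5) and (6) can be sketched as follows. To keep the example short, the "trees" are hypothetical stand-ins: each is a pair of a hand-written prediction function and a list of OOB indices, not a fitted tree. The toy data is constructed so that the class depends only on variable 0. Equation (5) is written in terms of classification accuracy; for a regression forest one would presumably substitute the change in squared error, which is not shown here.

```python
import random

def tree_importance(predict, X, y, oob, j, rng):
    """Equation (5) for one tree: OOB accuracy before minus after permuting X_j."""
    correct_before = sum(predict(X[i]) == y[i] for i in oob)
    permuted = [X[i][j] for i in oob]
    rng.shuffle(permuted)
    correct_after = 0
    for i, value in zip(oob, permuted):
        x = list(X[i])
        x[j] = value  # replace X_j by its permuted value for this OOB case
        correct_after += predict(x) == y[i]
    return (correct_before - correct_after) / len(oob)

def forest_importance(trees, X, y, j, seed=0):
    """Equation (6): raw importance is the mean of the per-tree importances."""
    rng = random.Random(seed)
    return sum(tree_importance(p, X, y, oob, j, rng) for p, oob in trees) / len(trees)

# Toy data: 40 cases, 2 variables; the class depends only on variable 0.
X = [[i / 40.0, ((7 * i) % 40) / 40.0] for i in range(40)]
y = [int(x[0] > 0.5) for x in X]

# Hypothetical stand-ins for fitted trees: (predict function, OOB indices).
def stump(x):
    return int(x[0] > 0.5)

trees = [(stump, list(range(start, 40, 3))) for start in range(3)]

vimp = [forest_importance(trees, X, y, j) for j in range(2)]
ranked = sorted(range(2), key=lambda j: vimp[j], reverse=True)  # most predictive first
```

Variable 1 never appears in the stand-in trees, so permuting it cannot change any prediction and its importance is exactly zero, matching the convention stated for Equation (5). Keeping only the top-ranked variables is one way the dimensionality reduction could be carried out.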

One of the key techniques in calculating VIMP is to keep mtry very close to p, where p is the total number of predictor variables (in our case 346), and mtry is the number of randomly subsampled variables to be used in each tree. The default setting for choosing mtry is mtry=p, but several empirical studies in [22] argue for keeping mtry close to M = 7/8 × p. This is because if the mtry variables chosen for the root node are noisy (i.e., they are not predictive of the outcome), then the predicted variable and the permuted importance of the variable are also noised up [50]. This principle is depicted in Figure 4, i.e., a larger mtry helps to better identify the variable importance (VIMP). Colors indicate the relevance of the variables, with red being highly predictive.

Figure 4
Example of a parametric plot

4. Experiments and Results

4.1 Experiment Setup

For this paper, we use the TRECVID 2015 dataset, which is a very large dataset with many imbalanced concepts. The TRECVID conference series is sponsored by the National Institute of Standards and Technology (NIST) with additional support from other U.S. government agencies. The goal of the conference series is to encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. The TRECVID dataset is well suited to our experiment due to its large volume. We choose the IACC.1.B dataset used in the TRECVID 2015 semantic indexing (SIN) task, which aims to detect the semantic concepts contained within a video shot. Challenges such as data imbalance [54], scalability, and the semantic gap [27] make the SIN task difficult.

In the IACC.1.B dataset, there are 137,327 observations, obtained by extracting a keyframe from each shot. A total of 346 concepts are given, including many popular semantic concepts such as “Vehicle”, “Airplane”, and “Cloud”, which are common and appear in many papers. It also contains many rare and imbalanced concepts such as “Security Checkpoint”, “Helicopter Hovering”, and “Mosques”. The distributions of some concepts are highly skewed, in that the majority of the data instances belong to one class and far fewer data instances belong to the other. The list of concepts and detailed explanations can be found in [49].

The average precision (AP) value, which is widely used in the multimedia concept retrieval domain, is used as the metric. For a given concept, Pre(i) indicates the precision at cut-off i in the ranked item list, and N is the number of retrieved data instances. The average precision at N (i.e., AP@N) is defined in Equation (7). If the denominator is zero, AP is set to zero. The mean average precision (MAP) used for evaluation is the mean of the AP values over all concepts.

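$$\mathrm{AP@}N = \frac{\sum_{i=1}^{N} \mathit{Pre}(i) \times \mathit{rel}(i)}{\#\text{ of relevant instances at } N}; \qquad \mathit{rel}(i) = \begin{cases} 0, & \text{if instance } i \text{ is negative}, \\ 1, & \text{if instance } i \text{ is positive}. \end{cases} \tag{7}$$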
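Equation (7) and the MAP computation can be sketched directly. The function names and the toy rankings below are assumptions for illustration; each ranking is a list of 0/1 relevance labels ordered by decreasing detection score, and the denominator is taken to be the number of relevant instances within the top N.

```python
def average_precision_at_n(relevance, n):
    """AP@N for one concept; `relevance` is a ranked list of 0/1 labels."""
    hits = 0
    total = 0.0
    for i, rel in enumerate(relevance[:n], start=1):
        if rel:
            hits += 1
            total += hits / i  # Pre(i) x rel(i); terms with rel(i) = 0 vanish
    return total / hits if hits else 0.0  # AP is zero when the denominator is zero

def mean_average_precision(rankings, n):
    """MAP: mean of AP@N over all concepts."""
    return sum(average_precision_at_n(r, n) for r in rankings) / len(rankings)

# Toy rankings for two hypothetical concepts.
rankings = [
    [1, 0, 1, 0, 0],  # positives at ranks 1 and 3
    [0, 1],           # single positive at rank 2
]
ap_first = average_precision_at_n(rankings[0], 5)
map_value = mean_average_precision(rankings, 5)
```

For the first ranking, the precision is 1 at rank 1 and 2/3 at rank 3, so AP@5 is (1 + 2/3) / 2 = 5/6.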

4.2 Experimental Results

In our experiment, we choose 20 highly imbalanced concepts for testing, including “Airplane Takeoff”, “Emergency Vehicles”, “Military”, “Natural-Disaster”, “US Flags”, “Airplane Landing”, “Airport Or Airfield”, “Car Crash”, “Cigar Boats”, “Earthquake”, “Military Base”, “Rowboat”, “Election Campaign Debate”, “Election Campaign Greeting”, “Exiting A Vehicle”, “Exiting Car”, “Flags”, “Military Aircraft”, “Rescue Vehicle”, and “Prisoner”. Also, the shot-level detection scores from the DVMM Lab of Columbia University [23] are used as the raw scores and the benchmark. Their group achieved the best performance on the TRECVID IACC.1.B dataset, but the raw scores for many imbalanced concepts are relatively low and need to be enhanced.

To conduct the comparison, four approaches are evaluated. The first one, “Benchmark”, is the raw scores obtained from [23] without any modification. The “Naive Bayes” approach applies Bayes’ theorem with strong independence assumptions between the scores. The third approach is random forests without VIMP. In our approach, the 20 selected imbalanced concepts, with p/n ratio values lower than 0.001, are tested and the VIMP-based random forests are applied. In the proposed work, the dataset is split in half, one half for training and one for testing. The comparison results are shown in Table 1.
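The selection of imbalanced concepts by p/n ratio can be illustrated with a short sketch. It assumes that "p/n" denotes the ratio of positive to negative instances of a concept, which the text does not spell out; the concept names, label counts, and function names are hypothetical.

```python
def pn_ratio(labels):
    """Positive-to-negative ratio of a list of 0/1 labels (assumed meaning of p/n)."""
    pos = sum(labels)
    neg = len(labels) - pos
    return pos / neg if neg else float("inf")

def select_imbalanced(label_table, threshold=0.001):
    """Names of concepts whose p/n ratio falls below the threshold."""
    return [name for name, labels in label_table.items() if pn_ratio(labels) < threshold]

# Hypothetical ground-truth labels for two concepts.
label_table = {
    "Rare concept": [1] + [0] * 2000,           # ratio 1/2000 = 0.0005
    "Common concept": [1] * 500 + [0] * 1500,   # ratio 500/1500
}
selected = select_imbalanced(label_table)
```

With a threshold of 0.001, a concept needs fewer than one positive per thousand negatives to be selected, so only the first toy concept qualifies.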

Table 1
Results Comparison

As can be seen from Table 1, since the independence assumption of the “Naive Bayes” approach does not hold for many concepts, such as “sea” and “fish”, its accuracy is very low, as expected. The random forests without VIMP also fail to enhance the raw scores, which may be caused by an inappropriate tree-building process. Among the four methods, our proposed framework achieves the best performance and successfully enhances the raw scores, which demonstrates the benefit of using random forests with VIMP in terms of MAP.

5. Conclusions

Many of the multimedia content based semantic data mining methods face a very complex challenge known as the semantic gap problem. This is the problem of connecting low level details of the image with its high level concepts. The problem becomes even more challenging with those concepts that are rare and imbalanced. In this paper, the proposed framework attempts to solve this problem by utilizing the unsupervised random forest classifiers. Several experiments were conducted on the TRECVID dataset and the results were compared with several existing frameworks. The proposed method illustrates the improvement in terms of the Mean Average Precision (MAP) values for the rare and imbalanced concepts. Furthermore, our proposed random forest approach with VIMP successfully reduces the dependency on domain knowledge and the restriction on data distributions.

Acknowledgments

For Shu-Ching Chen, this research is partially supported by DHS’s VACCINE Center under Award Number 2009-ST-061-CI0001 and NSF HRD-0833093, HRD-1547798, CNS-1126619, and CNS-1461926.

References

1. Aytar Y, Orhan BO, Shah M. Improving semantic concept detection and retrieval using contextual estimates. Proceedings of the IEEE International Conference on Multimedia & Expo; IEEE; 2007. pp. 536–539.
2. Bai L, Lao S, Guo J. Video semantic concept detection using ontology. Proceedings of the International Conference on Internet Multimedia Computing and Service; ACM; 2011. pp. 158–163.
3. Ballan L, Bertini M, Del Bimbo A, Serra G. Video annotation and retrieval using ontologies and rule learning. IEEE MultiMedia. 2010;17(4):80–88.
4. Benmokhtar R, Huet B. An ontology-based evidential framework for video indexing using high-level multimodal fusion. Multimedia Tools and Applications. 2014;73(2):663–689.
5. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32.
6. Chen L-C, Hsieh J-W, Yan Y, Chen D-Y. Vehicle make and model recognition using sparse representation and symmetrical SURFs. Pattern Recognition. 2015;48(6):1979–1998.
7. Chen S-C, Ghafoor A, Kashyap RL. Semantic Models for Multimedia Database Searching and Browsing. Kluwer Academic Publishers; Norwell, MA, USA: 2000.
8. Chen S-C, Kashyap R. Temporal and spatial semantic models for multimedia presentations. Proceedings of the International Symposium on Multimedia Information Processing; 1997. pp. 441–446.
9. Chen S-C, Rubin S, Shyu M-L, Zhang C. A dynamic user concept pattern learning framework for content-based image retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews. 2006 Nov;36(6):772–783.
10. Chen S-C, Shyu M-L, Kashyap R. Augmented transition network as a semantic model for video data. International Journal of Networking and Information Systems. 2000;3(1):9–25.
11. Chen S-C, Shyu M-L, Zhang C. An intelligent framework for spatio-temporal vehicle tracking. Proceedings of the IEEE International Conference on Intelligent Transportation Systems; August 2001. pp. 213–218.
12. Chen S-C, Shyu M-L, Zhang C. Innovative shot boundary detection for video indexing. In: Deb S, editor. Video Data Management and Information Retrieval. Idea Group Publishing; 2005. pp. 217–236.
13. Chen S-C, Shyu M-L, Zhang C, Kashyap RL. Identifying overlapped objects for video indexing and modeling in multimedia database systems. International Journal on Artificial Intelligence Tools. 2001;10(4):715–734.
14. Chen S-C, Sista S, Shyu M-L, Kashyap R. Augmented transition networks as video browsing models for multimedia databases and multimedia information systems. Proceedings of the IEEE International Conference on Tools with Artificial Intelligence; 1999. pp. 175–182.
15. Chen X, Zhang C, Chen S-C, Chen M. A latent semantic indexing based method for solving multiple instance learning problem in region-based image retrieval. Proceedings of the IEEE International Symposium on Multimedia; Dec 2005. pp. 37–44.
16. Chen X, Zhang C, Chen S-C, Rubin S. A human-centered multiple instance learning framework for semantic video retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews. 2009;39(2):228–233.
17. Choi MJ, Torralba A, Willsky AS. A tree-based context model for object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2012;34(2):240–252.
18. Elleuch N, Zarka M, Ammar AB, Alimi AM. A fuzzy ontology: based framework for reasoning in visual video content analysis and indexing. Proceedings of the International Workshop on Multimedia Data Mining; ACM; 2011. pp. 1:1–1:8.
19. Fan J, Luo H, Elmagarmid AK. Concept-oriented indexing of video databases: toward semantic sensitive retrieval and browsing. IEEE Transactions on Image Processing. 2004;13(7):974–992.
20. Feng H, Shi R, Chua T-S. A bootstrapping framework for annotating and retrieving www images. Proceedings of the ACM International Conference on Multimedia; ACM; 2004. pp. 960–967.
21. Huang X, Chen S-C, Shyu M-L, Zhang C. User concept pattern discovery using relevance feedback and multiple instance learning for content-based image retrieval. Proceedings of the International Workshop on Multimedia Data Mining; July 2002. pp. 100–108.
22. Ishwaran H, Kogalur UB, Chen X, Minn AJ. Random survival forests for high-dimensional data. Statistical Analysis and Data Mining. 2011;4(1):115–132.
23. Jiang Y-G. Prediction scores on TRECVID 2010 data set. 2010. http://www.ee.columbia.edu/ln/dvmm/CU-VIREO374/ [Last accessed on September 2011].
24. Jiang Y-G, Wang J, Chang S-F, Ngo C-W. Domain adaptive semantic diffusion for large scale context-based video annotation. Proceedings of the IEEE International Conference on Computer Vision; IEEE; 2009. pp. 1420–1427.
25. Li X, Chen S-C, Shyu M-L, Furht B. An effective content-based visual image retrieval system. Proceedings of the IEEE International Computer Software and Applications Conference; August 2002. pp. 914–919.
26. Li X, Chen S-C, Shyu M-L, Furht B. Image retrieval by color, texture, and spatial information. Proceedings of the International Conference on Distributed Multimedia Systems; September 2002. pp. 152–159.
27. Lin L, Chen C, Shyu M-L, Chen S-C. Weighted subspace filtering and ranking algorithms for video concept retrieval. IEEE Multimedia. 2011 Mar;18(3):32–43.
28. Lin L, Ravitz G, Shyu M-L, Chen S-C. Video semantic concept discovery using multimodal-based association classification. Proceedings of the IEEE International Conference on Multimedia & Expo; July 2007. pp. 859–862.
29. Lin L, Ravitz G, Shyu M-L, Chen S-C. Effective feature space reduction with imbalanced data for semantic concept detection. Proceedings of the IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing; June 2008. pp. 262–269.
30. Lin L, Shyu M-L. Weighted association rule mining for video semantic detection. International Journal of Multimedia Data Engineering and Management. 2010;1(1):37–54.
31. Lin L, Shyu M-L, Ravitz G, Chen S-C. Video semantic concept detection via associative classification. Proceedings of the IEEE International Conference on Multimedia & Expo; IEEE; 2009. pp. 418–421.
32. Liu D, Yan Y, Shyu M-L, Zhao G, Chen M. Spatio-temporal analysis for human action detection and recognition in uncontrolled environments. International Journal of Multimedia Data Engineering and Management. 2015 Jan;6(1):1–18.
33. Marszałek M, Schmid C, Harzallah H, Van De Weijer J. Learning object representations for visual object class recognition. Proceedings of the Visual Recognition Challenge Workshop; 2007.
34. Merler M, Huang B, Xie L, Hua G, Natsev A. Semantic model vectors for complex video event recognition. IEEE Transactions on Multimedia. 2012;14(1):88–101.
35. Moosmann F, Triggs B, Jurie F. Fast discriminative visual codebooks using randomized clustering forests. Twentieth Annual Conference on Neural Information Processing Systems; MIT Press; 2007. pp. 985–992.
36. Philbin J, Chum O, Isard M, Sivic J, Zisserman A. Lost in quantization: Improving particular object retrieval in large scale image databases. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE; 2008. pp. 1–8.
37. Rabinovich A, Vedaldi A, Galleguillos C, Wiewiora E, Belongie S. Objects in context. Proceedings of the IEEE International Conference on Computer Vision; IEEE; 2007. pp. 1–8.
38. Segal MR. Machine learning benchmarks and random forest regression. Center for Bioinformatics & Molecular Biostatistics; 2004.
39. Shotton J, Johnson M, Cipolla R. Semantic texton forests for image categorization and segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE; 2008. pp. 1–8.
40. Shyu M-L, Chen S-C, Chen M, Zhang C. A unified framework for image database clustering and content-based retrieval. Proceedings of the ACM International Workshop on Multimedia Databases; New York, NY, USA. ACM; 2004. pp. 19–27.
41. Shyu M-L, Chen S-C, Chen M, Zhang C, Sarinnapakorn K. Image database retrieval utilizing affinity relationships. Proceedings of the ACM International Workshop on Multimedia Databases; New York, NY, USA. ACM; 2003. pp. 78–85.
42. Shyu M-L, Chen S-C, Haruechaiyasak C. Mining user access behavior on the www. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics; IEEE; 2001. pp. 1717–1722.
43. Shyu M-L, Chen S-C, Kashyap R. Generalized affinity-based association rule mining for multimedia database queries. Knowledge and Information Systems (KAIS): An International Journal. 2001 Aug;3(3):319–337.
44. Shyu M-L, Haruechaiyasak C, Chen S-C. Category cluster discovery from distributed www directories. Information Sciences. 2003;155(3):181–197.
45. Shyu M-L, Haruechaiyasak C, Chen S-C, Zhao N. Collaborative filtering by mining association rules from user access sequences. Proceedings of the International Workshop on Challenges in Web Information Retrieval and Integration; April 2005. pp. 128–135.
46. Shyu M-L, Quirino T, Xie Z, Chen S-C, Chang L. Network intrusion detection through adaptive sub-eigenspace modeling in multiagent systems. ACM Transactions on Autonomous and Adaptive Systems. 2007;2(3):9:1–9:37.
47. Shyu M-L, Sarinnapakorn K, Kuruppu-Appuhamilage I, Chen S-C, Chang L, Goldring T. Handling nominal features in anomaly intrusion detection problems. Proceedings of the International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications; IEEE; 2005. pp. 55–62.
48. Shyu M-L, Xie Z, Chen M, Chen S-C. Video semantic event/concept detection using a subspace-based multimedia data mining framework. IEEE Transactions on Multimedia. 2008 Feb;10(2):252–259.
49. Smeaton AF, Over P, Kraaij W. Evaluation campaigns and TRECVid. Proceedings of the ACM International Workshop on Multimedia Information Retrieval; October 2006. pp. 321–330.
50. Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A. Conditional variable importance for random forests. BMC Bioinformatics. 2008;9(1):1–11.
51. Uijlings JR, Smeulders AW, Scha RJ. Real-time visual concept classification. IEEE Transactions on Multimedia. 2010;12(7):665–681.
52. Van de Sande KE, Gevers T, Snoek CG. A comparison of color features for visual concept classification. Proceedings of the International Conference on Content-based Image and Video Retrieval; ACM; 2008. pp. 141–150.
53. Van Gemert JC, Veenman CJ, Smeulders AW, Geusebroek J-M. Visual word ambiguity. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2010;32(7):1271–1283.
54. Yan Y, Chen M, Shyu M-L, Chen S-C. Deep learning for imbalanced multimedia data classification. Proceedings of the IEEE International Symposium on Multimedia; Dec 2015. pp. 483–488.
55. Yan Y, Liu Y, Shyu M-L, Chen M. Utilizing concept correlations for effective imbalanced data classification. Proceedings of the IEEE International Conference on Information Reuse and Integration; Aug 2014. pp. 561–568.
56. Zhang J, Marszałek M, Lazebnik S, Schmid C. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision. 2007;73(2):213–238.
57. Zhu Q, Lin L, Shyu M-L, Chen S-C. Feature selection using correlation and reliability based scoring metric for video semantic detection. Proceedings of the IEEE International Conference on Semantic Computing; 2010. pp. 462–469.
58. Zhu Q, Lin L, Shyu M-L, Liu D. Utilizing context information to enhance content-based image classification. International Journal of Multimedia Data Engineering and Management. 2011;2(3):34–51.
59. Zhu Q, Shyu M-L. Sparse linear integration of content and context modalities for semantic concept retrieval. IEEE Transactions on Emerging Topics in Computing. 2015 Jun;3(2):152–160.