Deformable shape detection is an important problem in computer vision and pattern recognition. However, standard detectors are typically limited to locating only a few salient landmarks such as landmarks near edges or areas of high contrast, often conveying insufficient shape information. This paper presents a novel statistical pattern recognition approach to locate a dense set of salient and non-salient landmarks in images of a deformable object. We explore the fact that several object classes exhibit a homogeneous structure such that each landmark position provides some information about the position of the other landmarks. In our model, the relationship between all pairs of landmarks is naturally encoded as a probabilistic graph. Dense landmark detections are then obtained with a new sampling algorithm that, given a set of candidate detections, selects the most likely positions as to maximize the probability of the graph. Our experimental results demonstrate accurate, dense landmark detections within and across different databases.
Shape modeling; detailed face shape detection; face detection; probabilistic graphical model; landmark detection
Much is known on how facial expressions of emotion are produced, including which individual muscles are most active in each expression. Yet, little is known on how this information is interpreted by the human visual system. This paper presents a systematic study of the image dimensionality of facial expressions of emotion. In particular, we investigate how recognition degrades when the resolution of the image (i.e., number of pixels when seen as a 5.3 by 8 degree stimulus) is reduced. We show that recognition is only impaired in practice when the image resolution goes below 20 × 30 pixels. A study of the confusion tables demonstrates that each expression of emotion is consistently confused by a small set of alternatives and that the confusion is not symmetric, i.e., misclassifying emotion a as b does not imply we will mistake b for a. This asymmetric pattern is consistent over the different image resolutions and cannot be explained by the similarity of muscle activation. Furthermore, although women are generally better at recognizing expressions of emotion at all resolutions, the asymmetry patterns are the same. We discuss the implications of these results for current models of face perception.
resolution; facial expressions; emotion
The manual signs in sign languages are generated and interpreted using three basic building blocks: handshape, motion, and place of articulation. When combined, these three components (together with palm orientation) uniquely determine the meaning of the manual sign. This means that the use of pattern recognition techniques that only employ a subset of these components is inappropriate for interpreting the sign or to build automatic recognizers of the language. In this paper, we define an algorithm to model these three basic components form a single video sequence of two-dimensional pictures of a sign. Recognition of these three components are then combined to determine the class of the signs in the videos. Experiments are performed on a database of (isolated) American Sign Language (ASL) signs. The results demonstrate that, using semi-automatic detection, all three components can be reliably recovered from two-dimensional video sequences, allowing for an accurate representation and recognition of the signs.
American Sign Language; handshape; motion reconstruction; multiple cue recognition; computer vision
We argue that to make robust computer vision algorithms for face analysis and recognition, these should be based on configural and shape features. In this model, the most important task to be solved by computer vision researchers is that of accurate detection of facial features, rather than recognition. We base our arguments on recent results in cognitive science and neuroscience. In particular, we show that different facial expressions of emotion have diverse uses in human behavior/cognition and that a facial expression may be associated to multiple emotional categories. These two results are in contradiction with the continuous models in cognitive science, the limbic assumption in neuroscience and the multidimensional approaches typically employed in computer vision. Thus, we propose an alternative hybrid continuous-categorical approach to the perception of facial expressions and show that configural and shape features are most important for the recognition of emotional constructs by humans. We illustrate how these image cues can be successfully exploited by computer vision algorithms. Throughout the paper, we discuss the implications of these results in applications in face recognition and human-computer interaction.
To fully define the grammar of American Sign Language (ASL), a linguistic model of its nonmanuals needs to be constructed. While significant progress has been made to understand the features defining ASL manuals, after years of research, much still needs to be done to uncover the discriminant nonmanual components. The major barrier to achieving this goal is the difficulty in correlating facial features and linguistic features, especially since these correlations may be temporally defined. For example, a facial feature (e.g., head moves down) occurring at the end of the movement of another facial feature (e.g., brows moves up), may specify a Hypothetical conditional, but only if this time relationship is maintained. In other instances, the single occurrence of a movement (e.g., brows move up) can be indicative of the same grammatical construction. In the present paper, we introduce a linguistic–computational approach to efficiently carry out this analysis. First, a linguistic model of the face is used to manually annotate a very large set of 2,347 videos of ASL nonmanuals (including tens of thousands of frames). Second, a computational approach is used to determine which features of the linguistic model are more informative of the grammatical rules under study. We used the proposed approach to study five types of sentences – Hypothetical conditionals, Yes/no questions, Wh-questions, Wh-questions postposed, and Assertions – plus their polarities – positive and negative. Our results verify several components of the standard model of ASL nonmanuals and, most importantly, identify several previously unreported features and their temporal relationship. Notably, our results uncovered a complex interaction between head position and mouth shape. These findings define some temporal structures of ASL nonmanuals not previously detected by other approaches.
We address the classical computer vision problems of rigid and non-rigid structure from motion (SFM) with occlusion. We assume that the columns of the input observation matrix W describe smooth 2D point trajectories over time. We then derive a family of efficient methods that estimate the column space of W using compact parameterizations in the Discrete Cosine Transform (DCT) domain. Our methods tolerate high percentages of missing data and incorporate new models for the smooth time-trajectories of 2D-points, affine and weak-perspective cameras, and 3D deformable shape. We solve a rigid SFM problem by estimating the smooth time-trajectory of a single camera moving around the structure of interest. By considering a weak-perspective camera model from the outset, we directly compute Euclidean 3D shape reconstructions without requiring post-processing steps such as Euclidean upgrade and bundle adjustment. Our results on real SFM datasets with high percentages of missing data were positively compared to those in the literature. In non-rigid SFM, we propose a novel 3D shape trajectory approach that solves for the deformable structure as the smooth time-trajectory of a single point in a linear shape space. A key result shows that, compared to state-of-the-art algorithms, our non-rigid SFM method can better model complex articulated deformation with higher frequency DCT components while still maintaining the low-rank factorization constraint. Finally, we also offer an approach for non-rigid SFM when W is presented with missing data.
Structure from motion; matrix factorization; missing data; camera trajectory; shape trajectory
Non-rigid structure from motion (NRSFM) is a difficult, underconstrained problem in computer vision. The standard approach in NRSFM constrains 3D shape deformation using a linear combination of K basis shapes; the solution is then obtained as the low-rank factorization of an input observation matrix. An important but overlooked problem with this approach is that non-linear deformations are often observed; these deformations lead to a weakened low-rank constraint due to the need to use additional basis shapes to linearly model points that move along curves.
Here, we demonstrate how the kernel trick can be applied in standard NRSFM. As a result, we model complex, deformable 3D shapes as the outputs of a non-linear mapping whose inputs are points within a low-dimensional shape space. This approach is flexible and can use different kernels to build different non-linear models. Using the kernel trick, our model complements the low-rank constraint by capturing non-linear relationships in the shape coefficients of the linear model. The net effect can be seen as using non-linear dimensionality reduction to further compress the (shape) space of possible solutions.
Non-rigid structure from motion (NRSFM) is a classical underconstrained problem in computer vision. A common approach to make NRSFM more tractable is to constrain 3D shape deformation to be smooth over time. This constraint has been used to compress the deformation model and reduce the number of unknowns that are estimated. However, temporal smoothness cannot be enforced when the data lacks temporal ordering and its benefits are less evident when objects undergo abrupt deformations. This paper proposes a new NRSFM method that addresses these problems by considering deformations as spatial variations in shape space and then enforcing spatial, rather than temporal, smoothness. This is done by modeling each 3D shape coefficient as a function of its input 2D shape. This mapping is learned in the feature space of a rotation invariant kernel, where spatial smoothness is intrinsically defined by the mapping function. As a result, our model represents shape variations compactly using custom-built coefficient bases learned from the input data, rather than a pre-specified set such as the Discrete Cosine Transform. The resulting kernel-based mapping is a by-product of the NRSFM solution and leads to another fundamental advantage of our approach: for a newly observed 2D shape, its 3D shape is recovered by simply evaluating the learned function.
The appearance-based approach to face detection has seen great advances in the last several years. In this approach, we learn the image statistics describing the texture pattern (appearance) of the object class we want to detect, e.g., the face. However, this approach has had a limited success in providing an accurate and detailed description of the internal facial features, i.e., eyes, brows, nose and mouth. In general, this is due to the limited information carried by the learned statistical model. While the face template is relatively rich in texture, facial features (e.g., eyes, nose and mouth) do not carry enough discriminative information to tell them apart from all possible background images. We resolve this problem by adding the context information of each facial feature in the design of the statistical model. In the proposed approach, the context information defines the image statistics most correlated with the surroundings of each facial component. This means that when we search for a face or facial feature we look for those locations which most resemble the feature yet are most dissimilar to its context. This dissimilarity with the context features forces the detector to gravitate toward an accurate estimate of the position of the facial feature. Learning to discriminate between feature and context templates is difficult however, because the context and the texture of the facial features vary widely under changing expression, pose and illumination, and may even resemble one another. We address this problem with the use of subclass divisions. We derive two algorithms to automatically divide the training samples of each facial feature into a set of subclasses, each representing a distinct construction of the same facial component (e.g., closed versus open eyes) or its context (e.g., different hairstyles). The first algorithm is based on a discriminant analysis formulation. The second algorithm is an extension of the AdaBoost approach. We provide extensive experimental results using still images and video sequences for a total of 3, 930 images. We show that the results are almost as good as those obtained with manual detection.
Face detection; facial feature detection; shape extraction; subclass learning; discriminant analysis; adaptive boosting; face recognition; American sign language; nonmanuals
We present an information theoretic approach to define the problem of structure from motion (SfM) as a blind source separation one. Given that for almost all practical joint densities of shape points, the marginal densities are non-Gaussian, we show how higher-order statistics can be used to provide improvements in shape estimates over the methods of factorization via Singular Value Decomposition (SVD), bundle adjustment and Bayesian approaches. Previous techniques have either explicitly or implicitly used only second-order statistics in models of shape or noise. A further advantage of viewing SfM as a blind source problem is that it easily allows for the inclusion of noise and shape models, resulting in Maximum Likelihood (ML) or Maximum a Posteriori (MAP) shape and motion estimates. A key result is that the blind source separation approach has the ability to recover the motion and shape matrices without the need to explicitly know the motion or shape pdf. We demonstrate that it suffices to know whether the pdf is sub-or super-Gaussian (i.e., semi-parametric estimation) and derive a simple formulation to determine this from the data. We provide extensive experimental results on synthetic and real tracked points in order to quantify the improvement obtained from this technique.
Structure from motion; Bundle adjustment; Blind source separation; Subspace analysis; Bayesian analysis
Facial expressions of emotion are essential components of human behavior, yet little is known about the hierarchical organization of their cognitive analysis. We study the minimum exposure time needed to successfully classify the six classical facial expressions of emotion (joy, surprise, sadness, anger, disgust, fear) plus neutral as seen at different image resolutions (240 × 160 to 15 × 10 pixels). Our results suggest a consistent hierarchical analysis of these facial expressions regardless of the resolution of the stimuli. Happiness and surprise can be recognized after very short exposure times (10–20 ms), even at low resolutions. Fear and anger are recognized the slowest (100–250 ms), even in high-resolution images, suggesting a later computation. Sadness and disgust are recognized in between (70–200 ms). The minimum exposure time required for successful classification of each facial expression correlates with the ability of a human subject to identify it correctly at low resolutions. These results suggest a fast, early computation of expressions represented mostly by low spatial frequencies or global configural cues and a later, slower process for those categories requiring a more fine-grained analysis of the image. We also demonstrate that those expressions that are mostly visible in higher-resolution images are not recognized as accurately. We summarize implications for current computational models.
facial expressions of emotion; exposure time threshold; computational model; cognitive hierarchy; emotion categorization; spatial frequencies; configural processing; face perception
Kernel mapping is one of the most used approaches to intrinsically derive nonlinear classifiers. The idea is to use a kernel function which maps the original nonlinearly separable problem to a space of intrinsically larger dimensionality where the classes are linearly separable. A major problem in the design of kernel methods is to find the kernel parameters that make the problem linear in the mapped representation. This paper derives the first criterion that specifically aims to find a kernel representation where the Bayes classifier becomes linear. We illustrate how this result can be successfully applied in several kernel discriminant analysis algorithms. Experimental results using a large number of databases and classifiers demonstrate the utility of the proposed approach. The paper also shows (theoretically and experimentally) that a kernel version of Subclass Discriminant Analysis yields the highest recognition rates.
Kernel functions; kernel optimization; feature extraction; discriminant analysis; nonlinear classifiers; face recognition; object recognition; pattern recognition; machine learning
Research suggests that configural cues (second-order relations) play a major role in the representation and classification of face images; making faces a “special” class of objects, since object recognition seems to use different encoding mechanisms. It is less clear, however, how this representation emerges and whether this representation is also used in the recognition of facial expressions of emotion. In this paper, we show how configural cues emerge naturally from a classical analysis of shape in the recognition of anger and sadness. In particular our results suggest that at least two of the dimensions of the computational (cognitive) space of facial expressions of emotion correspond to pure configural changes. The first of these dimensions measures the distance between the eyebrows and the mouth, while the second is concerned with the height-width ratio of the face. Under this proposed model, becoming a face “expert” would mean to move from the generic shape representation to that based on configural cues. These results suggest that the recognition of facial expressions of emotion shares this expertise property with the other processes of face processing.
Many problems in paleontology reduce to finding those features that best discriminate among a set of classes. A clear example is the classification of new specimens. However, these classifications are generally challenging because the number of discriminant features and the number of samples are limited. This has been the fate of LB1, a new specimen found in the Liang Bua Cave of Flores. Several authors have attributed LB1 to a new species of Homo, H. floresiensis. According to this hypothesis, LB1 is either a member of the early Homo group or a descendent of an ancestor of the Asian H. erectus. Detractors have put forward an alternate hypothesis, which stipulates that LB1 is in fact a microcephalic modern human. In this paper, we show how we can employ a new Bayes optimal discriminant feature extraction technique to help resolve this type of issues. In this process, we present three types of experiments. First, we use this Bayes optimal discriminant technique to develop a model of morphological (shape) evolution from Australopiths to H. sapiens. LB1 fits perfectly in this model as a member of the early Homo group. Second, we build a classifier based on the available cranial and mandibular data appropriately normalized for size and volume. Again, LB1 is most similar to early Homo. Third, we build a brain endocast classifier to show that LB1 is not within the normal range of variation in H. sapiens. These results combined support the hypothesis of a very early shared ancestor for LB1 and H. erectus, and illustrate how discriminant analysis approaches can be successfully used to help classify newly discovered specimens.
Pattern recognition; paleontology; discriminant analysis; morphological model; physical anthropology; Homo floresiensis
Microarray-based tumor classification is characterized by a very large number of features (genes) and small number of samples. In such cases, statistical techniques cannot determine which genes are correlated to each tumor type. A popular solution is the use of a subset of pre-specified genes. However, molecular variations are generally correlated to a large number of genes. A gene that is not correlated to some disease may, by combination with other genes, express itself.
In this paper, we propose a new classiification strategy that can reduce the effect of over-fitting without the need to pre-select a small subset of genes. Our solution works by taking advantage of the information embedded in the testing samples. We note that a well-defined classification algorithm works best when the data is properly labeled. Hence, our classification algorithm will discriminate all samples best when the testing sample is assumed to belong to the correct class. We compare our solution with several well-known alternatives for tumor classification on a variety of publicly available data-sets. Our approach consistently leads to better classification results.
Studies indicate that thousands of samples may be required to extract useful statistical information from microarray data. Herein, it is shown that this problem can be circumvented by using the information embedded in the testing samples.