The purpose of this article was to present a set of facial expression stimuli together with data describing judgments of these stimuli by untrained, healthy adult research participants. This face set is large, is multiracial, features contemporary-looking professional actors, contains a variety of expressions, and is available to the scientific community online. For these reasons, this set may be a resource for scientists who study face perception.
Validity was indexed by how accurately participants identified each emotional expression, and these scores were high. We examined proportion correct, as others have, in order to compare our results with those from other sets. The calculations included ratings from the entire set, and therefore included stimuli of both high and low posing quality. Nonetheless, the overall mean proportion correct obtained with this set was 0.79, well above the 0.70 criterion used by the other sets that include models from non-European backgrounds. The scores are comparable to those reported for the Pictures of Facial Affect (Ekman and Friesen, 1976), where the mean accuracy was 88%, and for the JACFEE set (Ekman and Matsumoto, 1993–2004), where the average percent correct was 74% (Biehl et al., 1997).
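As a concrete illustration, proportion correct for a single stimulus is simply the fraction of participants whose label matched the intended expression. A minimal sketch follows; the function name and ratings are hypothetical, not taken from the study:

```python
def proportion_correct(responses, intended):
    """Fraction of participants whose label matched the intended expression.
    A "none of the above" choice simply counts as a non-match."""
    return sum(r == intended for r in responses) / len(responses)

# Hypothetical ratings of one fear stimulus by ten participants:
ratings = ["fear"] * 8 + ["surprise", "none of the above"]
print(proportion_correct(ratings, "fear"))  # 0.8
```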
Different face sets have provided different response options to participants. As Russell (1994) points out, the response options can bias the level of accuracy obtained in expression recognition studies. Forced-choice methods, like those commonly used in validating facial expression sets, can inflate accuracy because these procedures bias the participant towards a particular hypothesis. However, the other extreme of freely chosen labels is also not ideal, because participants tend to provide scenarios (e.g., "she saw a ghost") rather than expressions (e.g., fear) or, as a group, rarely choose the same word to describe the expression, forcing researchers to make judgments regarding individual responses. The current study used a semi-forced-choice method that was less strict than forced choice while remaining more interpretable than the free-label method: participants were provided with a "none of the above" option, and as a result they labeled only those faces that they felt adequately conveyed an emotional expression. In light of this more stringent procedure, the high accuracy scores obtained with the NimStim set are all the more striking.
Calculating proportion correct does not account for false alarms (e.g., the number of times participants called a non-fear face "fear"); therefore, we also calculated kappa for each stimulus to measure concordance between participants' labels and the intended expressions. Landis and Koch (1977) defined "moderate" concordance as kappas between 0.4 and 0.6, "substantial" concordance as kappas between 0.6 and 0.8, and "almost perfect" concordance as kappas of 0.8 or greater. The mean kappa obtained in this set was 0.79, well within the "substantial" range. Kappas calculated for each actor were nearly all in the "substantial" or "almost perfect" range; the one actor whose average kappa fell below the "substantial" range has had his or her images removed from the set. Version 2.0 of the NimStim set, now available online, contains only stimuli from actors with kappas in the "substantial" or "almost perfect" range.
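For reference, Cohen's kappa corrects raw agreement for the agreement expected by chance. The following is a minimal two-rater sketch; the per-stimulus aggregation used in the study is not reproduced here, and the function and labels are illustrative:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two sequences of categorical labels:
    kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((counts_a[lab] / n) * (counts_b[lab] / n)        # agreement expected
              for lab in set(labels_a) | set(labels_b))        # by chance alone
    return (p_o - p_e) / (1 - p_e)

# Hypothetical: participants' labels versus the intended expressions
participant = ["happy", "sad", "fear", "happy", "fear", "sad"]
intended    = ["happy", "sad", "fear", "happy", "sad",  "sad"]
print(round(cohens_kappa(participant, intended), 2))  # 0.75
```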
The high variability in scores across emotion categories was expected, since emotion recognition differs across expressions (Strauss and Moscovitch, 1981; Calder et al., 2003), and this variability has been shown in other sets (Ekman and Friesen, 1976). It is unlikely that the source of the variability is the posed nature of the images, since the same variability is observed when viewers judge spontaneously produced faces (Gur et al., 2002). Typically, and in the present study, happy expressions have high recognition rates, whereas negative expressions (in particular, sad, fearful, and surprised) have poor recognition rates. Many hypotheses could be generated regarding this variability; for example, the greater accuracy in recognizing happy faces may result from greater familiarity with happy faces or from their rewarding aspects (Hare et al., 2005). While this article cannot address the cause of inter-expression variability, the results from this study replicate what has been shown in other studies, where happy expressions are accurately recognized and negative expressions are poorly recognized (Biehl et al., 1997; Lenti et al., 1999; Gur et al., 2002; Elfenbein and Ambady, 2003).
Reliability was indexed by comparing judgments from time 1 to judgments from time 2. Kappa was not an appropriate statistic for these ratings, since agreement was very high and the corresponding rates of disagreement were too low (Kraemer et al., 2002); therefore, we calculated proportion agreement between times 1 and 2 for each stimulus. Test-retest reliability, as measured by proportion agreement, was high, with some variability from one expression to another and little variability from one actor to another. No other study has reported test-retest reliability for judgments of face stimuli, so it is not possible to say whether this pattern of reliability is common. These data suggest that the stimuli are rated consistently across multiple sessions.
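Proportion agreement here is simply the fraction of participants who gave the same label to a stimulus at both sessions. A minimal sketch, with hypothetical names and data:

```python
def proportion_agreement(time1, time2):
    """Test-retest consistency: fraction of identical labels across sessions."""
    return sum(a == b for a, b in zip(time1, time2)) / len(time1)

# Hypothetical labels for one stimulus from five participants at two sessions:
session1 = ["happy", "happy", "surprise", "happy", "happy"]
session2 = ["happy", "happy", "happy",    "happy", "happy"]
print(proportion_agreement(session1, session2))  # 0.8
```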
The method of expression creation itself can bias interpretation. Models could be instructed to move certain muscle groups to produce an expression (Ekman and Friesen, 1976), which produces uniform expressions across models but might jeopardize the ecological validity of the images (Russell, 1994). On the other hand, naturally occurring facial expressions might lend greater authenticity to the images, but they can also vary greatly from one stimulus to another (Russell, 1994), which may not be ideal for experimental paradigms. To preserve natural variability across models while keeping enough uniformity across exemplars that viewers can easily interpret them, the actors in the NimStim set were given an emotion category and instructed to create the expression themselves (Mandal, 1987). The results of this study demonstrate that this method, like other posed methods (Ekman and Friesen, 1976), yields highly accurate judgments, although the ecological validity of these faces cannot be determined by the current study. There is growing interest in the dynamics of facial expressions (Ambadar et al., 2005), and the stimuli presented in this article are deliberate, strong in intensity, and displayed as static photographs. While they may at times exaggerate aspects of naturally occurring facial expressions, the merit of self-posed expressions is that they make possible the creation of a large bank of uniform stimuli.
Because of concerns regarding the emotional "neutrality" of neutral faces (Thomas et al., 2001; Donegan et al., 2003), this set of facial expressions included a calm face. The intention was to create a facial expression that participants would explicitly label as neutral/plain but would interpret as less emotionally significant than a neutral face. Validity scores were lower for the calm/neutral ratings than for the set as a whole, which was expected given the difficulty of judging the difference between calm and neutral. Despite this difficulty, a significant proportion of the neutral and calm faces were correctly labeled above chance levels (although neutral was correctly identified more often than calm). These expressions were posed consistently within actor, such that when an actor posed a calm face well, he or she also posed a neutral face well, making a complete set of calm and neutral faces available to researchers.
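One standard way to test whether a face is labeled correctly above chance is an exact binomial test. A minimal sketch, assuming a hypothetical chance level of 1/9 (eight emotion labels plus "none of the above") and invented counts; this is not the study's reported analysis:

```python
from scipy.stats import binomtest

# Hypothetical: 14 of 20 participants labeled a calm face correctly.
# Chance is assumed to be 1/9 under a nine-option response format.
result = binomtest(k=14, n=20, p=1/9, alternative="greater")
print(result.pvalue)  # far below 0.05, i.e., above-chance labeling
```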
There are some shortcomings of the set. First, although the rating system used here was conservative (i.e., not a forced-choice paradigm), it would be even more informative if future studies employed a response system in which participants rated each face on a continuum for each type of emotion label. This "quantitative rating" (Russell, 1994) approach is a sensitive measure for capturing subtle combinations of emotions within a face. Given the large number of images in the set, however, collecting data in this quantitative manner would preclude a within-participants design (e.g., group A would rate the first third, group B the second third, and group C the final third), and a between-participants design would weaken conclusions regarding the consensus of the interpretations. A second issue concerns the open- and closed-mouth versions, which were created to control for the strong perceptual feature of an open mouth. We tested recognition across both open- and closed-mouth versions of each expression, and on average this manipulation disrupted recognition of some facial expressions; manipulating the mouth while maintaining the intended expression is evidently difficult for most models. On an individual level, however, there are actors whose expressions are accurately identified in both open- and closed-mouth versions. Third, unlike other stimulus sets that strip faces of extra paraphernalia (Erwin et al., 1992), actors were not instructed to remove make-up, jewelry, or facial hair. This decision could bias judgments, but it also results in a set of faces that are more representative of the faces people see every day. Finally, while images from this set are available to the scientific community for use in experiments at no cost, only a subset of the images may be printed in scientific publications; these models are listed online at http://www.macbrain.org/resources.htm. The remaining models may not be published in any form.
The goal in creating this set of facial expressions was to provide a large, multiracial set of photographs of professional actors posing expressions that untrained experimental participants could identify. Having untrained participants identify the emotions on the faces is the best way to assess how similar participants in future studies will interpret the expressions. The data presented in this article should raise experimenters' confidence in both the validity and reliability of these expressions, insofar as untrained participants in a face-processing study perceive them. The versatility of this set makes it a useful resource for asking new questions and drawing conclusions about social perception that generalize to a broader range of faces.