In this study, we took advantage of the rich, but simple, face code supplied by cartoon faces to probe strategies for face detection and differentiation in the middle face patch. By studying this problem in a small cortical volume, we identified new coding principles that may be of general importance to the extraction of complex form in inferotemporal cortex.
Cells in the middle face patch detect a wide range of faces, as evidenced by their vigorous responses to both real and cartoon faces compared with objects (). However, different cells accomplish this by different means. No cell required the presence of a whole face to respond, indicating that the detection process is not strictly holistic. Instead, responses to systematically decomposed cartoon faces showed that different cells were selective for different face parts and interactions between parts (), and even the same cell can respond maximally to different combinations of face parts (). Thus, there is no single blueprint for detecting the form of a face in the middle face patch.
The mechanism for distinguishing between individual faces appears to rely on a division of labor among cells tuned to different subsets of facial features. This was revealed by dense parametric mapping [17]; responses were measured to a cartoon stimulus in which all of the face parameters were independently varied. Tuning to individual features was almost universal (found in 90% of cells) and each cell was tuned, on average, to only three feature dimensions (), which were integrated in a separable manner (). Together, cells in the middle face patch span a face space [18, 19] with three salient characteristics. First, the axes of the space, represented by the tuning curves of individual cells, correspond to basic face features and not to holistic exemplars. Second, the dimensionality of the space is reduced compared with the physical face space [20] () and the population focuses on features related to the eyes and to face layout geometry. Third, the location in the space is coded predominantly by the firing rates of cells with broad, monotonic tuning curves (). A majority of tuning curves peaked at one extreme and showed a minimal response at the opposite extreme, and the dynamic range of tuning often spanned or slightly exceeded the range of physical plausibility. Monotonic tuning allows for simple readout [21] and may be a general principle for high-level coding of visual shapes [22, 23]. It may also aid in emphasizing what makes an individual face unique (that is, what separates it from the standard face) [7, 24–26], as population response variance is highest to such ‘unusual’ features (). In addition, the breadth of tuning underscores the fact that cells in the middle face patch encode axes and not individual faces. This finding indicates that coding in the middle face patch is coarse, and is not sparse as it is at higher stages of the processing hierarchy [27], and substantiates theoretical proposals that coarse population codes are advantageous for representing high-dimensional stimulus spaces [28, 29].
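The claim that monotonic tuning allows simple readout can be illustrated with a minimal sketch. The feature axis (11 steps from -5 to +5), the baseline, the slope and the bell-curve parameters below are all hypothetical values chosen for illustration, not measurements from this study. Under these assumptions, a ramp-shaped curve can be inverted with a single linear step, whereas a bell-shaped curve assigns the same firing rate to two different feature values.

```python
import math

# Assumed feature axis: 11 steps from one extreme (-5) to the other (+5).
FEATURE_VALUES = list(range(-5, 6))


def ramp_rate(x, baseline=10.0, slope=2.0):
    """Hypothetical ramp-shaped tuning curve (spikes/s), monotonic in x."""
    return baseline + slope * x


def decode_from_ramp(rate, baseline=10.0, slope=2.0):
    """Linear readout: invert the ramp to recover the feature value."""
    return (rate - baseline) / slope


def bell_rate(x, peak=20.0, center=0.0, width=2.0):
    """Hypothetical bell-shaped tuning curve, shown only for contrast."""
    return peak * math.exp(-((x - center) ** 2) / (2.0 * width ** 2))


# Every feature value is recovered from the ramp cell's firing rate alone.
decoded = [decode_from_ramp(ramp_rate(x)) for x in FEATURE_VALUES]
ramp_is_invertible = decoded == [float(x) for x in FEATURE_VALUES]

# A bell-shaped cell fires identically for x = -2 and x = +2,
# so its rate alone cannot tell the two feature values apart.
bell_is_ambiguous = bell_rate(-2) == bell_rate(2)

print(ramp_is_invertible, bell_is_ambiguous)
```

In this toy setting a downstream reader of the ramp cell needs only a scale and an offset, which is one way to picture why monotonic codes are considered easy to read out.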
Psychophysicists have long proposed that perception of face identity has a holistic component [12, 13, 30], in which a face is obligatorily processed as a whole. We found two lines of evidence for holistic processing of face geometry. First, we found that the presence of a whole, upright face increased the gain of feature tuning curves by an average factor of 2.2 (). Our finding that holistic coding uses gain modulation underscores the idea that gain modulation may be a computational mechanism of general importance to cortical function [31, 32], extending beyond its roles in coordinate transformations [33, 34] and attention [35, 36]. Second, by comparing responses to upright and inverted cartoon faces, we found that the identity of individual features is interpreted according to the heuristics of an upright face template (). These two results demonstrate specific neural mechanisms by which the presence of an upright facial gestalt influences feature measurement in single cells. In our experiments, cartoons were presented rapidly, testing the system in a feedforward mode [37–39] in the sense that no expectations about the upcoming feature values could be formed. In real-world face perception, top-down feedback [40] is important and may be necessary for additional effects of holistic processing.
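As a rough illustration of what gain modulation of a tuning curve means, the sketch below scales a hypothetical ramp-shaped curve by the average factor of 2.2 reported above. The curve itself, the feature axis and the purely multiplicative form of the modulation are assumptions made for illustration; the text does not specify them. Under these assumptions the dynamic range grows by the gain factor while the normalized shape of the curve is unchanged.

```python
import math

GAIN = 2.2  # average gain factor reported in the text
FEATURE_VALUES = list(range(-5, 6))  # assumed feature axis, -5 .. +5


def rate_part_alone(x):
    """Hypothetical ramp tuning to a feature shown without the rest of the face."""
    return 9.0 + 1.0 * x


def rate_whole_face(x, gain=GAIN):
    """Same tuning, multiplicatively scaled by whole-face context (assumption)."""
    return gain * rate_part_alone(x)


def dynamic_range(rates):
    """Difference between the largest and smallest response."""
    return max(rates) - min(rates)


def normalized(rates):
    """Rescale a curve so that its maximum is 1."""
    top = max(rates)
    return [r / top for r in rates]


alone = [rate_part_alone(x) for x in FEATURE_VALUES]
whole = [rate_whole_face(x) for x in FEATURE_VALUES]

# The dynamic range scales by the gain factor ...
range_ratio = dynamic_range(whole) / dynamic_range(alone)

# ... while the shape of the tuning curve is preserved.
same_shape = all(
    math.isclose(a, w) for a, w in zip(normalized(alone), normalized(whole))
)

print(round(range_ratio, 3), same_shape)
```

The point of the sketch is only that a gain change leaves the feature preference of the cell intact while making its responses to different feature values easier to tell apart.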
We found a high incidence of tuning to some facial features, mostly to eyes and facial layout, and a paucity of tuning to others, mostly mouth- and nose-related ones. It seems plausible that such a spatial bias of tuning preference in the face may be the result of attention or preferential looking rather than a computational strategy for face processing, as attention has been shown to augment feature tuning [35, 36, 41]. Several results are, however, incompatible with a spatial attention or preferential looking account of tuning biases. First, these accounts would predict stronger tuning to isolated face parts as a result of the absence of other potentially distracting visual stimuli (the rest of the face). However, we found the opposite (). Second, preference for face aspect ratio over both internal features (nose, mouth, eyes and eyebrows) and external ones (hair) can only be explained by a donut-shaped spotlight of attention, and even this would not cause preferential tuning for face direction or height of feature assembly, two popular parameters defined by the relative positioning of the internal features to the face layout. Third, the feature tuning bias occurred independently of slight gaze-direction biases above or below the fixation spot (Supplementary Text 1); for example, the preference for eye parameters remained even during fixations below the fixation spot. This rules out the possibility of a preferential looking account and renders a spatial attention confound unlikely, as spatial attention is tied to eye movements. Thus, preferential tuning for facial layout and eye parameters seems to be influenced little, if at all, by attention and eye positioning, but instead seems to be the result of computational mechanisms of shape analysis in the middle face patch.
For the same reasons, it seems unlikely that preferential representation of extreme feature values is a byproduct of attentional capture. Furthermore, the attentional capture account would predict response maxima for both extremes, that is, U-shaped tuning curves, because both ends of the shape spectrum are equally extreme shapes in most feature dimensions. Instead, we found that tuning curves were ramp-shaped, and response minima occurred at the extremes even more often than response maxima. Similarly, the special status of extreme feature values cannot be explained by shape changes (whether attention capturing or not) rather than genuine shape preferences, as significant interactions between responses to successive feature values were found in less than 2% of all tuned feature dimensions (P < 0.05).
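The distinction between the two predictions can be made concrete with a small sketch. The classification rule, the function name `classify_tuning_shape` and the example curves are hypothetical and are not the analysis used in this study; they only show how a ramp-shaped curve (maximum at one extreme, minimum at the other) differs from the U-shaped curve that an attentional capture account would predict (high responses at both extremes, minimum in between).

```python
def classify_tuning_shape(rates):
    """Label a sampled tuning curve by where its extremes fall.

    Illustrative criterion only:
    - "ramp": maximum at one end of the feature axis, minimum at the other
    - "U-shaped": minimum in the interior, both ends above the mean rate
    - "other": anything else (for example, a bell-shaped curve)
    """
    last = len(rates) - 1
    i_max = rates.index(max(rates))
    i_min = rates.index(min(rates))
    if {i_max, i_min} == {0, last}:
        return "ramp"
    mean_rate = sum(rates) / len(rates)
    if 0 < i_min < last and rates[0] > mean_rate and rates[last] > mean_rate:
        return "U-shaped"
    return "other"


# Assumed feature axis and hypothetical example curves (spikes/s).
feature_values = list(range(-5, 6))
ramp_curve = [10.0 + 2.0 * x for x in feature_values]
u_curve = [2.0 + 0.8 * x * x for x in feature_values]

print(classify_tuning_shape(ramp_curve))
print(classify_tuning_shape(u_curve))
```

A rule of this kind treats rising and falling ramps alike, which matches the observation that the extremes hosted both response maxima and response minima.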
Our results expand existing conceptions about inferotemporal organization [42–44] in two major ways. First, it has been suggested that an IT cell can be characterized by its ‘critical feature’, defined as the simplest stimulus that still elicits a maximal response [42]. Our results suggest that such a characterization is incomplete and needs to be augmented by a description of the cell’s feature tuning and its full selectivity for parts and part interactions. Cells in the middle face patch are not only selective for the presence of subsets of face parts (), but also show tuning to subsets of face features (). The critical feature for a cell would be a face optimized along all dimensions to which the cell is tuned. However, knowing this single best image would not allow one to distinguish between features to which the cell is tuned, and parts that are simply required to be present (in whatever shape). Furthermore, the predominance of broad, ramp-shaped tuning suggests that all levels of response to a tuned feature, including minimal responses, are important (minimal responses are just as informative about what feature is present as maximal responses, see ref. 8 for a related idea). This notion, that all levels of response to a tuned feature are informative, is not included in the critical feature account of IT. On average, the response to a full face was less than the sum of the responses to each part and cells often fired maximally to different combinations of face parts (). Therefore, an IT cell, at least in the middle face patch, is only incompletely characterized by a single critical feature; instead, it is necessary to describe all of the parts and part combinations for which the cell is selective.
The second major insight from our findings concerns the functional organization of IT. It has been suggested that cells selective for visually similar critical features are grouped into columns. Our results indicate that what cells in the middle face patch have in common is a strong preference for faces over other objects, but this preference is a true form selectivity that cannot be captured by common selectivity to any fixed visual feature. There was a marked diversity in part selectivity and feature tuning in the middle face patch, and the tuned features of two neighboring face cells often shared no visual similarity at all (for example, hair width versus eyebrow slant). This diversity of feature tuning provides the brain with a rich vocabulary to describe faces and shows how a high-dimensional parameter space may be encoded even in a small region of IT. The macaque temporal lobe contains three face patches anterior to the middle face patch, and future experiments may reveal how the vocabulary of the middle face patch is used by the anterior face patches.