We have constructed a model for visual pattern representation in the mid-level ventral stream, based on local correlations amongst V1 responses within eccentricity-dependent pooling regions. We have developed a method for generating images with identical model responses, and used these synthetic images to show that: (1) when the pooling region sizes of the model are set correctly, images with identical model responses are indistinguishable (metameric) to human observers, despite severe distortion of features in the periphery; (2) the critical pooling size required to produce metamericity is robust to bottom-up and top-down manipulations of discrimination performance; (3) critical pooling sizes are consistent with the eccentricity dependence of receptive field sizes of neurons in ventral visual area V2; and (4) the model can predict degradations of peripheral recognition known as “crowding”, as a function of both spacing and eccentricity.
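The invariance to feature scrambling described in (1) can be illustrated with a deliberately reduced sketch. It assumes a one-dimensional layout, a uniform pooling window, and a single correlation-like statistic (the average product of two response channels); the full model uses many statistics over smooth, overlapping regions. The function names, the response values, and the scaling value of 0.5 are hypothetical choices for illustration only.

```python
import math


def pooling_width(eccentricity_deg, scaling):
    """Width of a pooling region, assumed to grow linearly with eccentricity."""
    return scaling * eccentricity_deg


def pooled_product(resp_a, resp_b):
    """Average product of two response channels over one pooling region.

    A uniform window is assumed here purely for simplicity.
    """
    return sum(x * y for x, y in zip(resp_a, resp_b)) / len(resp_a)


# Hypothetical responses of two V1-like channels at four positions that
# all fall inside a single pooling region.
channel_a = [1.0, -2.0, 0.5, 3.0]
channel_b = [0.5, 1.0, -1.0, 2.0]

# Scramble the positions, applying the same permutation to both channels.
order = [2, 0, 3, 1]
scrambled_a = [channel_a[i] for i in order]
scrambled_b = [channel_b[i] for i in order]

original_stat = pooled_product(channel_a, channel_b)
scrambled_stat = pooled_product(scrambled_a, scrambled_b)

# The pooled statistic is identical although the spatial layout differs.
print(original_stat, scrambled_stat)

# With an illustrative scaling of 0.5, regions at 20 deg eccentricity are
# ten times wider than regions at 2 deg, so more scrambling goes unnoticed.
print(pooling_width(2.0, 0.5), pooling_width(20.0, 0.5))
```

In this toy setting the two layouts are metameric with respect to the single pooled statistic, and the linear growth of `pooling_width` is what confines such distortions mainly to the periphery.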
Perceptual deficits in peripheral vision have been recognized for centuries. Most early literature focuses on the loss of acuity that results from eccentricity-dependent sampling and blurring in the earliest visual stages. Crowding is a more complex deficit [39]. In a prescient article in 1976, Jerome Lettvin gave a subjective account of this phenomenon, describing letters embedded in text as having “lost form without losing crispness”, and concluding that “the embedded [letter] only seems to have a ‘statistical’ existence” [20]. Lettvin’s article seems to have drifted into obscurity, but these ideas have been formalized in recent literature that explains crowding in terms of excessive averaging or pooling of features [21–24]. Balas et al. (2009), in particular, hypothesized that crowding is a manifestation of the representation of peripheral visual content with local summary statistics. They showed that human recognition performance for crowded letters was matched to that of foveally viewed images synthesized to match the statistics of the original stimulus (computed over a localized region containing both the letter and flankers).
Our model provides an instantiation of these pooling hypotheses that operates over the entire visual field, which, in conjunction with the synthesis methodology, enabled several scientific advances. First, we validated the model with a metamer discrimination paradigm, which provides a more direct test than comparisons to recognition performance in a crowding experiment. Second, the parameterization of eccentricity dependence allowed us to estimate the size of pooling regions, and thus to associate the model with a distinct stage of ventral stream processing. Third, the full-field implementation allowed us to examine crowding in stimuli extending beyond a single pooling region, and thus to account for the dependence of recognition on both eccentricity and spacing, the defining properties of crowding [18]. Finally, the fact that the model operates on arbitrary photographic images allows generalization of the laboratory phenomenon of crowding to complex scenes and everyday visual tasks. For example, crowding places limits on reading speed, because only a small number of letters around each fixation point are recognizable [40]. Model-synthesized metamers can be used to examine this “uncrowded” window (Figure 7a). We envision that the model could be used to optimize fonts, letter spacings, or line spacings for robustness to crowding effects, potentially improving reading performance. There is also some evidence linking dyslexia to crowding with larger-than-normal critical spacing [18,41,42], and the model might serve as a useful tool for investigating this hypothesis. Additional examples are provided in Figure 7, which show how camouflaged objects, which are already difficult to recognize foveally, blend into the background when viewed peripherally.
Figure 7. Effects of crowding on reading and searching. (a) Two metamers, matched to the model responses of a page of text from the first paragraph of Herman Melville’s “Moby Dick”. Each metamer was synthesized using a different foveal location ...
The interpretation of our experimental results relies on assumptions about the representation of, and access to, information in the brain. This is perhaps best understood by analogy to trichromacy [14]. Color metamers occur because information is lost by the cones and cannot be recovered in subsequent stages. But color appearance judgements clearly do not imply direct, conscious access to the responses of those cones. Analogously, our experiments imply that the information loss ascribed to areas V1 and V2 cannot be recovered or accessed by subsequent stages of processing (two stimuli that are V1 metamers, for example, should also be V2 metamers). But this does not imply that observers directly access the information represented in V1 or V2. Indeed, if observers could access V1 responses, then any additional information loss incurred when those responses are combined and pooled in V2 would have no perceptual consequence, and the stimuli generated by the mid-ventral model would not appear metameric.
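The logic of this cascade argument can be made concrete with a small sketch. It assumes a toy "cone" stage that is a linear projection of a four-sample spectrum onto three sensors, followed by an arbitrary deterministic later stage; the sensitivities, spectra, and function names are all hypothetical and are not measured quantities.

```python
def cone_responses(spectrum, cones):
    """Linear first stage: each cone sums the spectrum weighted by its sensitivity."""
    return [sum(c * s for c, s in zip(row, spectrum)) for row in cones]


def later_stage(responses):
    """Any deterministic function of the cone responses (opponent-like here)."""
    l, m, s = responses
    return (l - m, l + m - 2.0 * s, l + m + s)


# Hypothetical sensitivities of three cone classes at four wavelengths.
CONES = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

# Two physically different spectra. Their difference, proportional to
# [1, -1, 1, -1], is invisible to every cone class defined above.
spectrum_1 = [2.0, 3.0, 2.0, 3.0]
spectrum_2 = [2.5, 2.5, 2.5, 2.5]

first_1 = cone_responses(spectrum_1, CONES)
first_2 = cone_responses(spectrum_2, CONES)

# Metamers at the first stage are necessarily metamers at every later stage,
# because the later stage only ever sees the first-stage responses.
print(first_1 == first_2)
print(later_stage(first_1) == later_stage(first_2))
```

The converse does not hold: a later stage may discard further information and so admit metamers that the first stage distinguishes, which is why the size of the metameric set is informative about where the loss occurs.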
The loss of information in our model arises directly from its architecture (the set of statistics, and the pooling regions over which they are computed), and this determines the set of metameric stimuli. Discriminability of non-metameric stimuli depends on the strength of the information preserved by the model, relative to noise. As seen in the presentation time and attention control experiments, manipulations of signal strength do not alter the metamericity of stimuli, and thus do not affect estimates of critical scaling. These results are also consistent with the crowding literature. Crowding effects are robust to presentation time [43], and attention can increase performance in crowding tasks while yielding small or no changes in critical spacing [19,44]. Certain kinds of exogenous cues, however, may reduce critical spacing [45], and perceptual learning has been shown to reduce critical spacing through several days of intensive training [46]. If either manipulation were found to reduce critical scaling (as estimated from a metamer discrimination experiment), we would interpret this as arising from a reduction in receptive field sizes, which could be verified through electrophysiological measurements.
From a physiological perspective, our model is deliberately simplistic: we expect that incorporating more realistic response properties (e.g., spike generation, feedback circuitry) would not significantly alter the information represented in model populations, but would render the synthesis of stimuli computationally intractable. Despite the simplicity of the model, the metamer experiments do not uniquely constrain the response properties of individual model neurons. This may again be understood by analogy with the case of trichromacy: color matching experiments constrain the linear subspace spanned by the three cone absorption spectra, but do not uniquely constrain the spectra of the individual cones [14]. Thus, identification of V2 as the area in which the model resides does not imply that responses of individual V2 neurons encode local correlations. Our results, however, do suggest new forms of stimuli that could be used to explore such responses in physiological experiments. Within a single pooling region, the model provides a parametric representation of local texture features [17]. Stochastic stimuli containing these features are more complex than sine gratings or white noise, but better controlled (and more hypothesis driven) than natural scenes or objects, and are thus well suited for characterizing responses of individual cells [47].
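The point that matching experiments constrain only a subspace, and not the individual sensors, can be sketched as follows. The example assumes toy four-sample spectra, hypothetical cone sensitivities, and an arbitrary invertible mixing matrix; none of these values come from measurements.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical sensitivities of three cone classes at four wavelengths.
CONES = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

# An arbitrary invertible mixing matrix (its determinant is 2).
MIX = [
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
]

# Alternative sensors that span the same subspace as the cones.
ALT_SENSORS = matmul(MIX, CONES)

spectrum_1 = [2.0, 3.0, 2.0, 3.0]  # metameric with spectrum_2
spectrum_2 = [2.5, 2.5, 2.5, 2.5]
spectrum_3 = [3.0, 3.0, 2.0, 3.0]  # not metameric with spectrum_1

# Both sensor sets agree on which pairs match and which do not, so a
# matching experiment alone cannot tell the two sets apart.
print(matvec(CONES, spectrum_1) == matvec(CONES, spectrum_2))
print(matvec(ALT_SENSORS, spectrum_1) == matvec(ALT_SENSORS, spectrum_2))
print(matvec(CONES, spectrum_1) == matvec(CONES, spectrum_3))
print(matvec(ALT_SENSORS, spectrum_1) == matvec(ALT_SENSORS, spectrum_3))
```

By the same reasoning, any invertible recombination of the model's pooled statistics would yield the same set of metamers, which is why the experiments identify a population-level representation rather than the tuning of individual neurons.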
Finally, one might ask why the ventral stream discards such a significant amount of information. Theories of object recognition posit that the growth of receptive field sizes in consecutive areas confers invariance to geometric transformations, and cascaded models based on filtering, simple nonlinearities, and successively broader spatial pooling have been used to explain such invariances measured in area IT [8–10,48]. Our model closely resembles the early stages of these models, but our inclusion of eccentricity-dependent pooling, and the invariance to feature scrambling revealed by the metamericity of our synthetic stimuli, seems to be at odds with the goal of object recognition. One potential resolution of this conundrum is that the two forms of invariance arise in distinct parallel pathways. An alternative possibility is that a texture-like representation in the early ventral stream provides a substrate for object representations in later stages. Such a notion was suggested by Lettvin, who hypothesized that “texture, somewhat redefined, is the primitive stuff out of which form is constructed” [20]. If so, the metamer paradigm introduced here provides a powerful tool for exploring the nature of invariances arising in subsequent stages of the ventral stream.