To understand why human sensitivity for complex objects is so low, we study how word identification combines eye and ear or parts of a word (features, letters, syllables). Our observers identify printed and spoken words presented concurrently or separately. When researchers measure threshold (energy of the faintest visible or audible signal) they may report either sensitivity (one over the human threshold) or efficiency (ratio of the best possible threshold to the human threshold). When the best possible algorithm identifies an object (like a word) in noise, its threshold is independent of how many parts the object has. But, with human observers, efficiency depends on the task. In some tasks, human observers combine parts efficiently, needing hardly more energy to identify an object with more parts. In other tasks, they combine inefficiently, needing energy nearly proportional to the number of parts, over a 60∶1 range. Whether presented to eye or ear, efficiency for detecting a short sinusoid (tone or grating) with few features is a substantial 20%, while efficiency for identifying a word with many features is merely 1%. Why? We show that the low human sensitivity for words is a cost of combining their many parts. We report a dichotomy between inefficient combining of adjacent features and efficient combining across senses. Joining our results with a survey of the cue-combination literature reveals that cues combine efficiently only if they are perceived as aspects of the same object. Observers give different names to adjacent letters in a word, and combine them inefficiently. Observers give the same name to a word’s image and sound, and combine them efficiently. The brain’s machinery optimally combines only cues that are perceived as originating from the same object. Presumably such cues each find their own way through the brain to arrive at the same object representation.
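The two measures defined in this abstract, sensitivity and efficiency, can be illustrated with a minimal sketch. The threshold energies below are hypothetical values in arbitrary units, chosen only so that the resulting efficiencies match the 20% and 1% figures quoted above; the function names are illustrative and not taken from the study.

```python
def sensitivity(human_threshold: float) -> float:
    """Sensitivity is one over the human threshold energy."""
    return 1.0 / human_threshold


def efficiency(ideal_threshold: float, human_threshold: float) -> float:
    """Efficiency is the best possible (ideal) threshold over the human threshold."""
    return ideal_threshold / human_threshold


# Hypothetical threshold energies in arbitrary units (not data from the study).
ideal = 1.0             # threshold of the best possible algorithm
human_sinusoid = 5.0    # human threshold for detecting a short sinusoid
human_word = 100.0      # human threshold for identifying a word

eff_sinusoid = efficiency(ideal, human_sinusoid)  # 0.2, i.e. 20%
eff_word = efficiency(ideal, human_word)          # 0.01, i.e. 1%
```

Because efficiency divides out the ideal threshold, it isolates the human cost of a task, whereas sensitivity alone confounds that cost with the intrinsic difficulty of the task.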
The external world is mapped retinotopically onto the primary visual cortex (V1). We show here that objects in the world, unless they are very dissimilar, can be recognized only if they are sufficiently separated in visual cortex: specifically, in V1, at least 6 mm apart in the radial direction (increasing eccentricity) or 1 mm apart in the circumferential direction (equal eccentricity). Objects closer together than this critical spacing are perceived as an unidentifiable jumble. This is called “crowding”. It severely limits visual processing, including speed of reading and searching. The conclusion about visual cortex rests on three findings. First, psychophysically, the necessary “critical” spacing, in the visual field, is proportional to (roughly half) the eccentricity of the objects. Second, the critical spacing is independent of the size and kind of object. Third, anatomically, the representation of the visual field on the cortical surface is such that position in V1 (and several other areas) is the logarithm of eccentricity in the visual field. Furthermore, we show that much of this can be accounted for by supposing that each “combining field”, defined by the critical spacing measurements, is implemented by a fixed number of cortical neurons.
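The link between the first and third findings can be sketched numerically. If critical spacing in the visual field is a fixed fraction of eccentricity, and cortical position is the logarithm of eccentricity, then the cortical distance between a target and its nearest tolerable radial flanker does not depend on eccentricity. The sketch assumes a pure logarithmic map; the scale `K_MM` is a hypothetical value picked so that the result lands near the 6 mm quoted above, and is not a measurement from the paper.

```python
import math

BOUMA = 0.5   # critical spacing / eccentricity ("roughly half" in the text)
K_MM = 15.0   # hypothetical: mm of cortex per natural-log unit of eccentricity


def cortical_position_mm(ecc_deg: float) -> float:
    """Assumed pure log map from eccentricity (deg) to radial V1 position (mm)."""
    return K_MM * math.log(ecc_deg)


def cortical_critical_spacing_mm(ecc_deg: float) -> float:
    """Cortical distance between a target and a flanker at the critical spacing."""
    flanker_ecc = ecc_deg * (1.0 + BOUMA)
    return cortical_position_mm(flanker_ecc) - cortical_position_mm(ecc_deg)


# The same cortical spacing, K_MM * ln(1 + BOUMA), at every eccentricity.
spacings = [cortical_critical_spacing_mm(e) for e in (2.0, 5.0, 10.0, 20.0)]
```

Under these assumptions every entry of `spacings` equals `K_MM * ln(1.5)`, about 6.1 mm, which is the sense in which a proportional law in the visual field becomes a constant in cortex.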
Binding of features helps object recognition in contour integration, but hinders it in crowding. In contour integration, aligned adjacent objects group together to form a path. In crowding, flanking objects make the target unidentifiable. But, to date, the two tasks have only been studied separately. May and Hess (2007) suggested that the same binding mediates both tasks. To test this idea, we ask observers to perform two different tasks with the same stimulus. We present oriented grating patches that form a “snake letter” in the periphery. Observers report either the identity of the whole letter (contour integration task) or the phase of one of the grating patches (crowding task). We manipulate the strength of binding between gratings by varying the alignment between them, i.e. the Gestalt goodness of continuation, measured as “wiggle”. We find that better alignment strengthens binding, which improves contour integration and worsens crowding. Observers show equal sensitivity to alignment in these two very different tasks, suggesting that the same binding mechanism underlies both phenomena. It has been claimed that grouping among flankers reduces their crowding of the target. Instead, we find that these published cases of weak crowding are due to weak binding resulting from target-flanker misalignment. We conclude that crowding is mediated solely by the grouping of flankers with the target and is independent of grouping among flankers.
crowding; wiggle; grouping; binding; Gestalt; contour integration; good continuation; alignment; object recognition; snake letter
Amblyopia is a much-studied but poorly understood developmental visual disorder that reduces acuity, profoundly reducing contrast sensitivity for small targets. Here we use visual noise to probe the letter identification process and characterize its impairment by amblyopia. We apply five levels of analysis — threshold, threshold in noise, equivalent noise, optical MTF, and noise modeling — to obtain a two-factor model of the amblyopic deficit: substantially reduced efficiency for small letters and negligibly increased cortical noise. Cortical noise, expressed as an equivalent input noise, varies among amblyopes but is roughly 1.4× normal, as though only 1/1.4 the normal number of cortical spikes are devoted to the amblyopic eye. This raises threshold contrast for large letters by a factor of √1.4 = 1.2×, a negligible effect. All 16 amblyopic observers showed near-normal efficiency for large letters (> 4× acuity) and greatly reduced efficiency for small letters: 1/4 normal at 2× acuity and approaching 1/16 normal at acuity. Finding that the acuity loss represents a loss of efficiency rules out all models of amblyopia except those that predict the same sensitivity loss on blank and noisy backgrounds. One such model is the last-channel hypothesis, which supposes that the highest-spatial-frequency channels are missing, leaving the remaining highest-frequency channel struggling to identify the smallest letters. However, this hypothesis is rejected by critical band masking of letter identification, which shows that the channels used by the amblyopic eye have normal tuning for even the smallest letters. Finally, based on these results, we introduce a new “Dual Acuity” chart that promises to be a quick diagnostic test for amblyopia.
amblyopia; noise; efficiency; cortical noise; Pelli-Levi Dual Acuity Chart
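The arithmetic behind the two-factor model in the amblyopia abstract can be sketched as follows. The sketch assumes that, on a blank background, threshold energy scales as equivalent noise divided by efficiency, and that energy scales as contrast squared; the abstract states the √1.4 step, and the rest is an illustrative extension under those assumptions. The function name is hypothetical.

```python
import math


def contrast_threshold_ratio(noise_ratio: float, efficiency_ratio: float) -> float:
    """Amblyopic / normal threshold contrast on a blank background.

    Assumes threshold energy is proportional to (equivalent noise / efficiency)
    and that energy is proportional to contrast squared.
    """
    return math.sqrt(noise_ratio / efficiency_ratio)


NOISE_RATIO = 1.4  # equivalent cortical noise, amblyopic relative to normal

large_letters = contrast_threshold_ratio(NOISE_RATIO, 1.0)     # about 1.18
at_2x_acuity = contrast_threshold_ratio(NOISE_RATIO, 1 / 4)    # about 2.37
at_acuity = contrast_threshold_ratio(NOISE_RATIO, 1 / 16)      # about 4.73
```

Under these assumptions the noise factor contributes the same small multiplier at every size, while the efficiency factor grows as letters shrink toward the acuity limit.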
The Gestalt psychologists reported a set of laws describing how vision groups elements to recognize objects. The Gestalt laws “prescribe for us what we are to recognize ‘as one thing’” (Köhler, 1920). Were they right? Does object recognition involve grouping? Tests of the laws of grouping have been favourable, but mostly assessed only detection, not identification, of the compound object. The grouping of elements seen in the detection experiments with lattices and “snakes in the grass” is compelling, but falls far short of the vivid everyday experience of recognizing a familiar, meaningful, named thing, which mediates the ordinary identification of an object. Thus, after nearly a century, there is hardly any evidence that grouping plays a role in ordinary object recognition. To assess grouping in object recognition, we made letters out of grating patches and measured threshold contrast for identifying these letters in visual noise as a function of perturbation of grating orientation, phase, and offset. We define a new measure, “wiggle”, to characterize the degree to which these various perturbations violate the Gestalt law of good continuation. We find that efficiency for letter identification is inversely proportional to wiggle and is wholly determined by wiggle, independent of how the wiggle was produced. Thus the effects of three different kinds of shape perturbation on letter identifiability are predicted by a single measure of goodness of continuation. This shows that letter identification obeys the Gestalt law of good continuation and may be the first confirmation of the original Gestalt claim that object recognition involves grouping.
Gestalt; Grouping; Contour integration; Good continuation; Letter identification; Object recognition; Features; Snake in the grass; Snake letters; Dot lattice
Reading speed matters in most real-world contexts, and it is a robust and easy aspect of reading to measure. Theories of reading should account for speed.
Unless we fixate directly on it, it is hard to see an object among other objects. This breakdown in object recognition, called crowding, severely limits peripheral vision. The effect is more severe when objects are more similar. When observers mistake the identity of a target among flanker objects, they often report a flanker. Many have taken these flanker reports as evidence of internal substitution of the target by a flanker. Here, we ask observers to identify a target letter presented in between one similar and one dissimilar flanker letter. Simple substitution takes in only one letter, which is often the target but, by unwitting mistake, is sometimes a flanker. The opposite of substitution is pooling, which takes in more than one letter. Having taken only one letter, the substitution process knows only its identity, not its similarity to the target. Thus, it must report similar and dissimilar flankers equally often. Contrary to this prediction, the similar flanker is reported much more often than the dissimilar flanker, showing that rampant flanker substitution cannot account for most flanker reports. Mixture modeling shows that simple substitution can account for, at most, about half the trials. Pooling and nonpooling (simple substitution) together include all possible models of crowding. When observers are asked to identify a crowded object, at least half of their reports are pooled, based on a combination of information from target and flankers, rather than being based on a single letter.
Electronic supplementary material
The online version of this article (doi:10.3758/s13414-011-0229-0) contains supplementary material.
Crowding; Substitution; Pooling; Mixture modeling
Previous cue integration studies have examined continuous perceptual dimensions (e.g., size) and have shown that human cue integration is well described by a normative model in which cues are weighted in proportion to their sensory reliability, as estimated from single-cue performance. However, this normative model may not be applicable to categorical perceptual dimensions (e.g., phonemes). In tasks defined over categorical perceptual dimensions, optimal cue weights should depend not only on the sensory variance affecting the perception of each cue but also on the environmental variance inherent in each task-relevant category. Here, we present a computational and experimental investigation of cue integration in a categorical audio-visual (articulatory) speech perception task. Our results show that human performance during audio-visual phonemic labeling is qualitatively consistent with the behavior of a Bayes-optimal observer. Specifically, we show that the participants in our task are sensitive, on a trial-by-trial basis, to the sensory uncertainty associated with the auditory and visual cues during phonemic categorization. In addition, we show that while sensory uncertainty is a significant factor in determining cue weights, it is not the only one: participants' performance is consistent with an optimal model in which environmental, within-category variability also plays a role in determining cue weights. Furthermore, we show that in our task, the sensory variability affecting the visual modality during cue combination is not well estimated from single-cue performance, but can be estimated from multi-cue performance. The findings and computational principles described here represent a principled first step towards characterizing the mechanisms underlying human cue integration in categorical tasks.
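The normative model for continuous dimensions mentioned at the start of this abstract weights each cue by its reliability, that is, its inverse variance. A minimal sketch of that standard rule follows; the cue estimates and standard deviations are hypothetical, and the categorical extension that the abstract goes on to describe (where within-category variance also enters the weights) is not reproduced here.

```python
def reliability_weights(sigmas):
    """Weights proportional to each cue's reliability (1 / variance), summing to 1."""
    reliabilities = [1.0 / s ** 2 for s in sigmas]
    total = sum(reliabilities)
    return [r / total for r in reliabilities]


def combine(estimates, sigmas):
    """Reliability-weighted average of the cues and the combined standard deviation."""
    weights = reliability_weights(sigmas)
    combined = sum(w * x for w, x in zip(weights, estimates))
    combined_sigma = (1.0 / sum(1.0 / s ** 2 for s in sigmas)) ** 0.5
    return combined, combined_sigma


# Hypothetical example: an auditory cue (sd 1.0) and a visual cue (sd 2.0).
weights = reliability_weights([1.0, 2.0])             # [0.8, 0.2]
estimate, sigma = combine([10.0, 20.0], [1.0, 2.0])   # 12.0 and about 0.894
```

The combined standard deviation is smaller than that of either cue alone, which is the benchmark against which human cue integration is usually compared.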
Understanding foreign speech is difficult, in part because of unusual mappings between sounds and words. It is known that listeners in their native language can use lexical knowledge (about how words ought to sound) to learn how to interpret unusual speech-sounds. We therefore investigated whether subtitles, which provide lexical information, support perceptual learning about foreign speech. Dutch participants, unfamiliar with Scottish and Australian regional accents of English, watched Scottish or Australian English videos with Dutch, English or no subtitles, and then repeated audio fragments of both accents. Repetition of novel fragments was worse after Dutch-subtitle exposure but better after English-subtitle exposure. Native-language subtitles appear to create lexical interference, but foreign-language subtitles assist speech learning by indicating which words (and hence sounds) are being spoken.
It is now emerging that vision is usually limited by object spacing rather than size. The visual system recognizes an object by detecting and then combining its features. ‘Crowding’ occurs when objects are too close together and features from several objects are combined into a jumbled percept. Here, we review the explosion of studies on crowding—in grating discrimination, letter and face recognition, visual search, selective attention, and reading—and find a universal principle, the Bouma law. The critical spacing required to prevent crowding is equal for all objects, although the effect is weaker between dissimilar objects. Furthermore, critical spacing at the cortex is independent of object position, and critical spacing at the visual field is proportional to object distance from fixation. The region where object spacing exceeds critical spacing is the ‘uncrowded window’. Observers cannot recognize objects outside of this window and its size limits the speed of reading and search.
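The uncrowded window described above follows directly from the proportionality between critical spacing and eccentricity. The sketch below assumes a proportionality constant of 0.5, which this abstract does not state, and treats objects as equally spaced along a single meridian; both are simplifying assumptions, and the function names are illustrative.

```python
BOUMA = 0.5  # assumed ratio of critical spacing to eccentricity (not given in this abstract)


def critical_spacing_deg(ecc_deg: float, bouma: float = BOUMA) -> float:
    """Critical spacing grows in proportion to distance from fixation."""
    return bouma * abs(ecc_deg)


def is_uncrowded(ecc_deg: float, spacing_deg: float, bouma: float = BOUMA) -> bool:
    """An object escapes crowding when its spacing is at least the critical spacing."""
    return spacing_deg >= critical_spacing_deg(ecc_deg, bouma)


def uncrowded_half_width_deg(spacing_deg: float, bouma: float = BOUMA) -> float:
    """Largest eccentricity at which objects with this spacing remain uncrowded."""
    return spacing_deg / bouma


# Hypothetical text with 1 deg center-to-center letter spacing.
half_width = uncrowded_half_width_deg(1.0)  # 2.0 deg on each side of fixation
```

With these assumed values, letters within 2 deg of fixation can be recognized and letters farther out cannot, so wider letter spacing enlarges the window in proportion.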
Along with physical luminance, the perceived brightness is known to depend on the spatial structure of the stimulus. Often it is assumed that neural computation of the brightness is based on the analysis of luminance borders of the stimulus. However, this has not been tested directly. We introduce a new variant of the psychophysical reverse-correlation or classification image method to estimate and localize the physical features of the stimuli which correlate with the perceived brightness, using a brightness-matching task. We derive classification images for the illusory Craik-O'Brien-Cornsweet stimulus and a “real” uniform step stimulus. For both stimuli, classification images reveal a positive peak at the stimulus border, along with a negative peak at the background, but are flat at the center of the stimulus, suggesting that brightness is determined solely by the border information. Features in the perceptually completed area in the Craik-O'Brien-Cornsweet do not contribute to its brightness, nor could we see low-frequency boosting, which has been offered as an explanation for the illusion. Tuning of the classification image profiles changes remarkably little with stimulus size. This supports the idea that only certain spatial scales are used for computing the brightness of a surface.
We investigate the channels underlying identification of second-order letters using a critical-band masking paradigm. We find that observers use a single 1–1.5 octave-wide channel for this task. This channel’s best spatial frequency (c/letter) did not change across different noise conditions (indicating the inability of observers to switch channels to improve signal-to-noise ratio) or across different letter sizes (indicating scale invariance), for a fixed carrier frequency (c/letter). However, the channel’s best spatial frequency does change with stimulus carrier frequency (both in c/letter); one is proportional to the other. Following Majaj et al. (Majaj, N. J., Pelli, D. G., Kurshan, P., & Palomares, M. (2002). The role of spatial frequency channels in letter identification. Vision Research, 42, 1165–1184), we define “stroke frequency” as the line frequency (strokes/deg) in the luminance image. That is, for luminance-defined letters, stroke frequency is the number of lines (strokes) across each letter divided by letter width. For second-order letters, letter texture stroke frequency is the number of carrier cycles (luminance lines) within the letter ink area divided by the letter width. Unlike the nonlinear dependence found for first-order letters (implying scale-dependent processing), for second-order letters the channel frequency is half the letter texture stroke frequency (suggesting scale-invariant processing).
Letter identification; Second-order vision; Critical-band masking; Scale invariance; Channel switching
The speed and accuracy of decision-making have a well-known trading relationship: hasty decisions are more prone to errors while careful, accurate judgments take more time. Despite the pervasiveness of this speed-accuracy trade-off (SAT) in decision-making, its neural basis is still unknown.
Using functional magnetic resonance imaging (fMRI) we show that emphasizing the speed of a perceptual decision at the expense of its accuracy lowers the amount of evidence-related activity in lateral prefrontal cortex. Moreover, this speed-accuracy difference in lateral prefrontal cortex activity correlates with the speed-accuracy difference in the decision criterion metric of signal detection theory. We also show that the same instructions increase baseline activity in a dorso-medial cortical area involved in the internal generation of actions.
These findings suggest that the SAT is neurally implemented by modulating not only the amount of externally-derived sensory evidence used to make a decision, but also the internal urge to make a response. We propose that these processes combine to control the temporal dynamics of the speed-accuracy trade-off in decision-making.
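The decision criterion of signal detection theory referred to in this abstract can be computed from hit and false-alarm rates. The sketch below uses one common definition, the criterion c of the equal-variance Gaussian yes/no model; whether the study used exactly this measure is an assumption, and the rates shown are hypothetical.

```python
from statistics import NormalDist


def dprime_and_criterion(hit_rate: float, false_alarm_rate: float):
    """Return (d', c) under the equal-variance Gaussian signal detection model.

    d' = z(H) - z(F) measures discriminability;
    c = -(z(H) + z(F)) / 2 measures response bias (0 means unbiased).
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    z_hit, z_fa = z(hit_rate), z(false_alarm_rate)
    return z_hit - z_fa, -0.5 * (z_hit + z_fa)


# Hypothetical rates: an unbiased observer and a more liberal one.
d_unbiased, c_unbiased = dprime_and_criterion(0.80, 0.20)  # c is 0
d_liberal, c_liberal = dprime_and_criterion(0.90, 0.40)    # c is negative
```

A negative c indicates a liberal tendency to respond "yes", and a positive c a conservative one; rates of exactly 0 or 1 must be adjusted before use because their z-scores are infinite.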
Research in object recognition has tried to distinguish holistic recognition from recognition by parts. One can also guess an object from its context. Words are objects, and how we recognize them is the core question of reading research. Do fast readers rely most on letter-by-letter decoding (i.e., recognition by parts), whole word shape, or sentence context? We manipulated the text to selectively knock out each source of information while sparing the others. Surprisingly, the effects of the knockouts on reading rate reveal a triple dissociation. Each reading process always contributes the same number of words per minute, regardless of whether the other processes are operating.
A Noh mask, worn by expert actors during performances of traditional Japanese Noh drama, conveys various emotional expressions despite its fixed physical properties. How does the mask change its expressions? Shadows change subtly during an actual Noh performance, and these changes play a key role in creating elusive artistic enchantment. Here we describe evidence from two experiments on how the attached shadows of Noh masks influence observers’ recognition of emotional expressions.
In Experiment 1, neutral-faced Noh masks bearing the attached shadows of happy/sad masks were recognized as showing happy/sad expressions, respectively. This was true for all four types of masks, each of which represented a character differing in sex and age, even though the original characteristics of the masks also greatly influenced the evaluation of emotions. Experiment 2 further revealed that frontal Noh mask images bearing the shadows of upward/downward tilted masks were evaluated as sad/happy, respectively. This was consistent with outcomes from preceding studies that used images of actually tilted Noh masks.
Results from the two experiments concur that manipulating only the attached shadows of the different types of Noh masks significantly alters emotion recognition. These findings are in line with the mysterious facial expressions observed in Western paintings, such as the elusive quality of Mona Lisa’s smile. They also agree with the aesthetic principle of traditional Japanese art, “yugen” (profound grace and subtlety), which highly values subtle emotional expressions in darkness.
We conducted a preliminary study to examine whether Chinese readers’ spontaneous word segmentation processing is consistent with the national standard rules of word segmentation based on the Contemporary Chinese language word segmentation specification for information processing (CCLWSSIP). Participants were asked to segment Chinese sentences into individual words according to their prior knowledge of words. The results showed that Chinese readers did not follow the segmentation rules of the CCLWSSIP, and their word segmentation processing was influenced by the syntactic categories of consecutive words. In many cases, the participants did not treat auxiliary words, adverbs, adjectives, nouns, verbs, numerals and quantifiers as single word units. Generally, Chinese readers tended to combine function words with content words to form single word units, indicating that they were inclined to chunk single words into larger information units during word segmentation. Additionally, the “overextension of monosyllable words” hypothesis was tested and may need to be corrected to some degree, implying that word length has an implicit influence on Chinese readers’ segmentation processing. Implications of these results for models of word recognition and eye movement control are discussed.
We tested whether eye color influences perception of trustworthiness. Facial photographs of 40 female and 40 male students were rated for perceived trustworthiness. Eye color had a significant effect, the brown-eyed faces being perceived as more trustworthy than the blue-eyed ones. Geometric morphometrics, however, revealed significant correlations between eye color and face shape. Thus, face shape likewise had a significant effect on perceived trustworthiness but only for male faces, the effect for female faces not being significant. To determine whether perception of trustworthiness was being influenced primarily by eye color or by face shape, we recolored the eyes on the same male facial photos and repeated the test procedure. Eye color now had no effect on perceived trustworthiness. We concluded that although the brown-eyed faces were perceived as more trustworthy than the blue-eyed ones, it was not brown eye color per se that caused the stronger perception of trustworthiness but rather the facial features associated with brown eyes.