Human stereopsis has two well-known constraints: the disparity-gradient limit, which is the inability to perceive depth when the change in disparity within a region is too large, and the limit of stereoresolution, which is the inability to perceive spatial variations in disparity that occur at too fine a spatial scale. We propose that both limitations can be understood as byproducts of estimating disparity by cross-correlating the two eyes’ images, the fundamental computation underlying the disparity-energy model. To test this proposal, we constructed a local cross-correlation model with biologically motivated properties. We then compared model and human behaviors in the same psychophysical tasks. The model and humans behaved quite similarly: they both exhibited a disparity-gradient limit and had similar stereoresolution thresholds. Performance was affected similarly by changes in a variety of stimulus parameters. By modeling the effects of stimulus blur and of using different sizes of image patches, we found evidence that the smallest neural mechanism humans use to estimate disparity is 3–6 arcmin in diameter. We conclude that the disparity-gradient limit and stereoresolution are indeed byproducts of using local cross-correlation to estimate disparity.
Stereopsis, the ability to perceive depth from binocular disparity, is limited by a number of factors. The variation in disparity from one part of the stimulus to another must be large enough to produce a discernible variation in perceived depth. This just-noticeable variation is the disparity threshold. But the magnitude of disparity variation must not be too great; otherwise, the two eyes’ images cannot be fused and the depth percept collapses. This maximum disparity is the fusion limit or Dmax. Finally, the spatial variation in disparity from one part of the stimulus to another must not occur at too fine a scale. The finest perceptible variation is the stereoresolution limit. These limits to stereopsis are summarized in Figure 1. The upper panel is a stereogram of sinusoidal corrugations in which disparity amplitude increases from left to right and spatial frequency increases from bottom to top. View the stereogram at a distance of 40 cm and cross-fuse or divergently fuse to see the corrugations. One can perceive the sinusoidal depth variation in the middle of the stereogram but not elsewhere. The lower panel is a graph, replotted from Tyler (1975), showing the combinations of disparity amplitude and spatial frequency for which the corrugation in depth is perceived and the combinations for which it is not. Our purpose is to better understand the determinants of the boundary conditions for stereopsis.
To estimate disparity, the visual system must determine which parts of the two retinal images correspond. Doing this by cross-correlating the two eyes’ images has been used successfully in computer vision (Clerc & Mallat, 2002; Kanade & Okutomi, 1994), in modeling human vision (Banks, Gepshtein, & Landy, 2004; Cormack, Stevenson, & Schor, 1991; Fleet, Wagner, & Heeger, 1996; Harris, McKee, & Smallman, 1997), and in modeling binocular interaction in visual cortex. The prevailing model of binocular integration in visual cortex is the disparity-energy model (Cumming & DeAngelis, 2001; Ohzawa, 1998; Ohzawa, DeAngelis, & Freeman, 1990). In this model, the output of binocular complex cells can be expressed as
$$C = (S_{L,\mathrm{even}} + S_{R,\mathrm{even}})^2 + (S_{L,\mathrm{odd}} + S_{R,\mathrm{odd}})^2 = S_{L,\mathrm{even}}^2 + S_{R,\mathrm{even}}^2 + S_{L,\mathrm{odd}}^2 + S_{R,\mathrm{odd}}^2 + 2 S_{L,\mathrm{even}} S_{R,\mathrm{even}} + 2 S_{L,\mathrm{odd}} S_{R,\mathrm{odd}},$$
where $S_L$ and $S_R$ are the responses of simple cells driven by the left and right eyes, and even and odd refer to the symmetry of the simple-cell receptive fields (Prince & Eagle, 2000). In the last two terms, the left eye’s response is multiplied by the right eye’s response. A bank of such cells, each tuned to a different disparity, therefore performs the equivalent of windowed, or local, cross-correlation.
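As an illustrative sketch (not code from the study), the following Python separates the complex-cell energy into its monocular terms and the binocular cross terms that carry the correlation signal. The function and argument names are hypothetical.

```python
def complex_cell_energy(sl_even, sl_odd, sr_even, sr_odd):
    """Binocular energy: squared sums of left- and right-eye simple-cell responses."""
    return (sl_even + sr_even) ** 2 + (sl_odd + sr_odd) ** 2


def energy_terms(sl_even, sl_odd, sr_even, sr_odd):
    """Split the energy into monocular terms and binocular (left x right) cross terms."""
    monocular = sl_even ** 2 + sl_odd ** 2 + sr_even ** 2 + sr_odd ** 2
    binocular = 2.0 * (sl_even * sr_even + sl_odd * sr_odd)
    return monocular, binocular
```

The binocular term is positive when the two eyes' responses match and negative when they are opposite in sign, which is the sense in which the cell behaves like a local correlator.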
In this paper, we investigate whether the limits of stereopsis revealed in Figure 1 are byproducts of using local cross-correlation to estimate disparity from the two eyes’ images. To do so, we construct a local cross-correlator with biologically motivated properties and compare its behavior to that of human observers. In the first section, we consider the cause of the limit on the upper right side of the shaded area in Figure 1; we point out, as have Burt and Julesz (1980) and Tyler (1974, 1975), that the transition with increasing disparity from perceptible to imperceptible is well described by the disparity-gradient limit. By comparing human and model performances, we show that this limit is a byproduct of estimating disparity by correlation. In the second section, we consider the cause of the limit on the upper part of the shaded region in Figure 1. We again compare human and model performances and show that this stereoresolution limit is also a byproduct of estimating disparity by correlation.
Sinusoidal corrugations in random-element stereograms cannot be discriminated when the product of corrugation spatial frequency and disparity amplitude exceeds a critical value (Tyler, 1974, 1975; Ziegler, Hess, & Kingdom, 2000). A similar phenomenon was observed by Burt and Julesz (1980), who reported that two-element stereograms cannot be fused when the angular separation between the elements is less than the disparity. Burt and Julesz argued that the limit to fusion is not disparity per se, as suggested by the notion of Panum’s fusional area (Panum, 1858). Rather, the limit is a ratio: the disparity divided by the separation. The critical value of this ratio is the disparity-gradient limit.
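The disparity gradient is clearly defined for two-element stimuli. For elements P and Q in a stereogram, the coordinates in the left eye’s image are $(x_{PL}, y_{PL})$ and $(x_{QL}, y_{QL})$, and the coordinates in the right eye’s image are $(x_{PR}, y_{PR})$ and $(x_{QR}, y_{QR})$. The separation is the vector S from the average position of P to the average position of Q. Its magnitude is
$$|S| = \sqrt{\left(\frac{x_{QL} + x_{QR}}{2} - \frac{x_{PL} + x_{PR}}{2}\right)^2 + \left(\frac{y_{QL} + y_{QR}}{2} - \frac{y_{PL} + y_{PR}}{2}\right)^2}$$
and its direction is
$$\tan^{-1}\left(\frac{(y_{QL} + y_{QR}) - (y_{PL} + y_{PR})}{(x_{QL} + x_{QR}) - (x_{PL} + x_{PR})}\right).$$
The disparity is the vector D; its magnitude is
$$|D| = \sqrt{\left[(x_{QR} - x_{QL}) - (x_{PR} - x_{PL})\right]^2 + \left[(y_{QR} - y_{QL}) - (y_{PR} - y_{PL})\right]^2}$$
and its direction is
$$\tan^{-1}\left(\frac{(y_{QR} - y_{QL}) - (y_{PR} - y_{PL})}{(x_{QR} - x_{QL}) - (x_{PR} - x_{PL})}\right).$$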
The disparity gradient is |D|/|S|. In Burt and Julesz (1980), the direction of S was varied, but D was always horizontal. They found that two-element stereograms could not be fused when the disparity gradient exceeded 1 regardless of the direction of S. In other words, they found that the disparity-gradient limit was unaffected by the tilt of the stimulus (Stevens, 1979).
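The two-element definition can be sketched in a few lines of Python. This is a minimal illustration with hypothetical function names. Taking disparity as right-eye minus left-eye position is an assumed sign convention that does not affect the magnitude |D|/|S|.

```python
import math


def separation(p_left, p_right, q_left, q_right):
    """Vector S from the average position of P to the average position of Q."""
    sx = (q_left[0] + q_right[0]) / 2 - (p_left[0] + p_right[0]) / 2
    sy = (q_left[1] + q_right[1]) / 2 - (p_left[1] + p_right[1]) / 2
    return sx, sy


def relative_disparity(p_left, p_right, q_left, q_right):
    """Vector D: disparity of Q minus disparity of P (right minus left; assumed convention)."""
    dx = (q_right[0] - q_left[0]) - (p_right[0] - p_left[0])
    dy = (q_right[1] - q_left[1]) - (p_right[1] - p_left[1])
    return dx, dy


def disparity_gradient(p_left, p_right, q_left, q_right):
    """Disparity gradient |D| / |S| for a two-element stereogram."""
    s = math.hypot(*separation(p_left, p_right, q_left, q_right))
    d = math.hypot(*relative_disparity(p_left, p_right, q_left, q_right))
    if s == 0:
        raise ValueError("P and Q have the same average position; gradient undefined")
    return d / s
```

For example, with P at (0, 0) in both eyes and Q at (-0.5, 1) in the left eye and (0.5, 1) in the right eye, the separation is 1, the disparity is 1, and the gradient equals the limit of 1 reported by Burt and Julesz (1980).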
On a surface, the disparity gradient is not clearly defined because the gradient can in principle be measured in any direction. The gradient is, however, largest in the direction in which depth increases most rapidly (i.e., parallel to surface tilt). For this reason, we will define the disparity gradient in the direction of most rapidly increasing depth: i.e., S will be parallel to tilt. The definition of the disparity gradient for a horizontally oriented sawtooth corrugation is schematized in Figure 2.
The observation that the product of spatial frequency and amplitude must not exceed a critical value is also a manifestation of the disparity-gradient limit. As shown in Figure 2, the disparity gradient for sawtooth slats can be expressed as
where A is the amplitude of the sawtooth wave and SF is the spatial frequency. The disparity gradient for the discontinuities between the slats is infinite. While sine waves do not have a constant disparity gradient, the disparity gradient of the steepest part of the waveform will have a similar relationship to the amplitude and spatial frequency. The corrugations in Figure 1 are horizontal (as they were in Tyler, 1973, 1974, 1975) and are defined by horizontal disparity only; thus, S is vertical and D is horizontal.
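The dependence of the slat gradient on amplitude and spatial frequency can be illustrated numerically. The sketch below assumes that A denotes peak-to-trough disparity amplitude, so that each slat changes by A over one period 1/SF; if A were instead a zero-to-peak amplitude, the slope would be twice as large. The function names are hypothetical.

```python
def sawtooth_disparity(y_deg, amplitude_deg, sf_cpd, phase=0.0):
    """Horizontal disparity of a sawtooth corrugation at vertical position y.

    amplitude_deg is taken to be peak-to-trough (an assumption of this sketch).
    """
    cycle_position = (y_deg * sf_cpd + phase) % 1.0
    return amplitude_deg * (cycle_position - 0.5)


def slat_gradient(amplitude_deg, sf_cpd):
    """Disparity gradient within a slat: disparity change A over one period 1/SF."""
    return amplitude_deg * sf_cpd
```

Under this assumption, halving the amplitude while doubling the spatial frequency leaves the gradient unchanged, which is why a limit on the gradient appears as a limit on the product of amplitude and frequency.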
Why does the disparity-gradient limit exist? There are two general hypotheses. First, the limit might be a manifestation of constraints built into the visual system to help minimize false matches between the two eyes’ images; we will refer to this as the constraint hypothesis. Second, the disparity-gradient limit might be a byproduct of the manner in which binocular correspondence is solved; we will refer to this as the correlation hypothesis.
The constraint hypothesis states that the disparity-gradient limit is a topology constraint. Consider a small frontoparallel surface with a vertical rotation axis (tilt = 0°). As we rotate the surface, the slant increases. At some point, one eye will see the surface “edge on” such that points in the direction of most rapidly increasing depth will be superimposed in one eye’s retinal image and not in the other eye’s image. When this occurs, the disparity gradient is 2 (Trivedi & Lloyd, 1985). If we rotate the surface further, the gradient exceeds 2, and the order of the image points is reversed in the two eyes (specifically, the points occur in opposite orders along an epipolar line in the two eyes’ images). This observation is, however, only correct for surfaces with a tilt of 0° (i.e., when the directions of S and D are both horizontal). Trivedi and Lloyd (1985) argued that the visual system avoids matching elements with different orderings in the two eyes by invoking the disparity-gradient limit. Specifically, matches consistent with a small gradient should be favored over matches with a large one. Invoking a disparity-gradient constraint in solving the correspondence problem is consistent with other constraints that have been proposed, such as the uniqueness, ordering, and smoothness constraints (Li & Hu, 1996; Marr & Poggio, 1976; Pollard, Mayhew, & Frisby, 1985; Prazdny, 1985).
Although Trivedi and Lloyd (1985) showed that ordering is preserved when the disparity gradient is less than 2, they also noted that the converse is not true: not all correctly ordered points have a disparity gradient less than 2. To expand on this observation, we calculated the disparity gradient at which occlusion occurs in one eye for different surface tilts. Figure 3 shows the results. The critical gradient is indeed 2 at tilt = 0°, but it increases with increasing tilt until it becomes infinite at tilt = 90°. Thus, the value of the disparity gradient at which the order of image points in the two eyes’ images reverses is strongly dependent on surface tilt. If the constraint serves the function of avoiding matches of one element with two in the other eye (uniqueness constraint) or of matching elements in different orders in the two eyes (ordering constraint), one would expect the disparity-gradient limit to increase with increasing surface tilt. The fact that it does not (Burt & Julesz, 1980) suggests that the limit is caused by something else.
The correlation hypothesis states that the disparity-gradient limit is a byproduct of the fundamental calculation involved in solving the correspondence problem. The normalized cross-correlation (Equation 9) between two images reaches its greatest value of 1 when the two eyes’ images are identical. Spatial variations in disparity cause differences in the two images and thereby cause a decrease in correlation. For this reason, one expects the correlation to decrease as the disparity gradient increases. The decrease in correlation occurs whether the disparity gradient is greatest horizontally (surface with tilt = 0°), vertically (tilt = 90°), or anything in-between. The fact that the disparity-gradient limit is not dependent on tilt is consistent with the correlation hypothesis and not with the constraint hypothesis.
We tested the idea that the disparity-gradient limit is a byproduct of estimating disparity by correlation by examining how the disparity gradient affects human stereopsis, how it affects the performance of a correlation model, and then comparing the two on the same task with the same stimuli.
The observers were the two authors and two other adults; the latter were unaware of the experimental hypotheses. All had normal visual acuity and stereopsis.
Stimuli were displayed on a haploscope consisting of two monochrome CRT displays (58 cm on the diagonal) each seen in a mirror by one eye (Backus, Banks, van Ee, & Crowell, 1999; Hillis & Banks, 2001). The lines of sight from the eyes to the centers of the displays were perpendicular to the display surfaces. The displays were 39 cm from the eyes. Despite the short distance, the visual locations of the elements in our stimuli were specified to within ~30 arcsec. We fixed eye position relative to the displays by using custom bite bars.
The stimuli were random-dot stereograms with a dot density of 10 dots/deg² and an extent of 15° horizontally and vertically. The average luminous intensity of a dot was 1.72 × 10⁻⁶ cd and the size was 0.53 arcmin. The dots were randomly distributed in the half-images. Two methods were used to create disparity. In the first, we shifted the dots horizontally in screen coordinates (which correspond to horizontal in Helmholtz coordinates). This is the most common method for creating stereograms, but such stimuli presented in a haploscope do not have the vertical disparities that are produced by real-world stimuli at finite distances (Held & Banks, 2008). In the second method, we used “back projection” to create the appropriate horizontal and vertical disparities (Backus et al., 1999).
The disparities specified a horizontally oriented sawtooth corrugation (Figure 2). Each slat in the sawtooth had a constant disparity gradient that was proportional to both the spatial frequency and amplitude of the corrugation.
Before each trial, a dichoptic nonius fixation target was visible. Observers made sure the nonius lines were aligned before initiating a stimulus presentation, which assured that their vergence eye position was appropriate. With a key press, the observer initiated a 600-ms presentation of the sawtooth stimulus. It was presented in one of two parities: with the slats slanted top-back or top-forward. The observer then indicated which parity had been presented. The absolute phase of the sawtooth was varied randomly from trial to trial, so that the task could not be performed by determining the depth of a single dot. Because the corrugations were horizontal (i.e., tilt = 90°), there were no monocular artifacts; performing the task required perceiving the cyclopean waveform.
We wanted to keep the corrugation waveform fixed for each experimental condition, so we measured parity-discrimination thresholds by adding uninformative noise to the stimulus rather than by manipulating disparity amplitude, which would have changed the disparity gradient of the waveform. The uninformative noise was dots randomly positioned in 3D; the depths of the noise dots were drawn from a uniform random distribution with a fixed range that was greater than the depth range of the corrugation waveform. Coherence is the number of signal dots (those specifying the corrugation) divided by the total number of dots (signal dots plus noise dots). A coherence of 1 means that all dots are signal dots, while a coherence of 0 means that all dots are noise dots. The sum of signal and noise dots was always the same. We varied coherence using the method of constant stimuli in order to determine the threshold value. We fit the psychometric data with a cumulative Gaussian using a maximum-likelihood criterion and used the coherence at 75% correct as the threshold estimate (Wichmann & Hill, 2001).
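The threshold procedure can be sketched as follows. This is an illustrative reconstruction, not the fitting code used in the study. It assumes a two-alternative psychometric function that rises from 50% to 100% correct as a cumulative Gaussian, so that the mean of the Gaussian is the 75%-correct point, and it replaces a proper optimizer with a simple grid search over hypothetical parameter ranges.

```python
import math


def coherence(n_signal, n_noise):
    """Fraction of dots that specify the corrugation."""
    return n_signal / (n_signal + n_noise)


def p_correct(c, mu, sigma):
    """Assumed 2AFC psychometric function: 0.5 at chance, 0.75 at c = mu."""
    phi = 0.5 * (1.0 + math.erf((c - mu) / (sigma * math.sqrt(2.0))))
    return 0.5 + 0.5 * phi


def neg_log_likelihood(mu, sigma, data):
    """Binomial negative log-likelihood; data holds (coherence, n_correct, n_trials)."""
    nll = 0.0
    for c, n_correct, n_trials in data:
        p = min(max(p_correct(c, mu, sigma), 1e-9), 1.0 - 1e-9)
        nll -= n_correct * math.log(p) + (n_trials - n_correct) * math.log(1.0 - p)
    return nll


def fit_threshold(data, mus, sigmas):
    """Grid-search maximum-likelihood fit; returns (mu, sigma), with mu the threshold."""
    return min(((m, s) for m in mus for s in sigmas),
               key=lambda ms: neg_log_likelihood(ms[0], ms[1], data))
```

A call such as `fit_threshold(data, [i / 100 for i in range(5, 96)], [i / 100 for i in range(5, 41)])` returns the coherence at which the fitted function crosses 75% correct.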
Our cyclopean discrimination task required that observers perceive at least part of the disparity-defined corrugation. We assume that the two eyes’ images had to be fused to perform this task.
The first measurements were made with stereograms in which the disparities were created by horizontal shifting. The haploscope arms were rotated such that the vergence distance was 39 cm, which matched the physical distance to the CRTs.
Figures 4 and 5 plot the results. In Figure 4, coherence threshold is plotted as a function of disparity amplitude. The three sets of data points correspond to the thresholds for three different spatial frequencies: 0.15, 0.3, and 0.6 cpd. Coherence threshold generally rose with increasing disparity amplitude, but the amplitude at which the rise occurred differed across spatial frequencies. Figure 5 plots the same data as a function of disparity gradient. With increasing gradient, threshold rose in similar fashion for all three spatial frequencies. Thus, perception of the cyclopean waveform began to collapse as the disparity gradient reached a value of approximately 1, which is consistent with earlier work (Burt & Julesz, 1980; Tyler, 1973, 1974, 1975; Ziegler et al., 2000). The worsening of performance was not precipitous: observers could still discriminate the cyclopean waveform when the gradient was as high as 1.6. This is consistent with previous work showing that humans can perceive surfaces when the disparity gradient exceeds 1 (McKee & Verghese, 2002).
We also measured coherence thresholds for vertically oriented waveforms (tilt = 0°). The task was generally more difficult because vertical corrugations produced severe monocular artifacts (regions in which no dots appeared and regions of high dot density), but we were able to obtain reasonable data from two observers. As with horizontal corrugations, the rise in coherence threshold was determined by the disparity gradient. Thus, the disparity gradient is a key determinant of the ability to perceive cyclopean stimuli whether the variations in depth are vertical (tilt = 90°) or horizontal (tilt = 0°).
A given disparity gradient specifies different slants at different distances. The first experiment was conducted at one simulated viewing distance, so the relationship between disparity gradient and slant was fixed. It is therefore possible that slant rather than the disparity gradient was the primary limit to performance. To investigate this possibility, we tested two observers at different viewing distances. To vary the simulated distance, we manipulated the vergence stimulus by rotating the haploscope arms to the appropriate values. It was also important to present the appropriate vertical-disparity gradient because this gradient is an effective distance cue (Rogers & Bradshaw, 1993, 1995). We therefore used the back-projection method to create disparities in the stereograms. The vertical-disparity gradient and the horizontal vergence of the back-projected stimuli specified viewing distances of 17 or 39 cm. At 17 cm, a gradient of 1 specifies a slant of 71°; at 39 cm, the same gradient specifies a slant of 80°.
The results are shown in Figures 6 and 7. The data from the two viewing distances nearly superimposed when plotted as a function of disparity gradient (Figure 7) and did not when plotted as a function of slant (Figure 6). Thus, the disparity gradient rather than slant was the primary limit to performance.
The human results show, as others have (Tyler, 1974, 1975; Ziegler et al., 2000), that the disparity gradient is a key determinant of the ability to construct stereoscopic percepts. As the gradient increases, percept construction becomes successively, but not precipitously, more problematic. The constraints imposed by high disparity gradients do not depend on stimulus tilt. We next examined the cause of the degradation in performance with increasing disparity gradient.
If the constraints imposed by large disparity gradients are a byproduct of estimating disparity via correlation, we should see similar behavior in humans and a local cross-correlator like the one described by Banks et al. (2004). As noted earlier, such a correlation algorithm has the same fundamental properties as the disparity-energy calculation that characterizes binocular neurons in visual cortex (Anzai, Ohzawa, & Freeman, 1999; Ohzawa et al., 1990).
We presented the same stimuli and task to the model that were presented to human observers. The stimuli were again random-dot stereograms specifying sawtooth corrugations and the task was again to determine the parity of the sawtooth.
The stereo half-images were blurred according to the optics of the human eye. Specifically, we convolved the half-images with the point-spread function of the well-focused eye with a 3-mm pupil (h(x, y)):
and a = 0.583, s1 = 0.443 arcmin, and s2 = 2.04 arcmin (Geisler & Davila, 1985). The resulting images were scaled such that the spacing between rows and columns was 0.6 arcmin, corresponding roughly to the spacing between foveal cones (Geisler & Davila, 1985). These values were chosen to best approximate the analogous viewing situation for the human observers.
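The blurring stage can be sketched as a sampled convolution kernel. The expression for h(x, y) does not appear in this copy of the text, so the sketch assumes a weighted sum of two unit-volume, isotropic Gaussians with weights a and 1 − a; that functional form, the kernel extent, and the function names are all assumptions. The parameter values and the 0.6-arcmin sample spacing are those quoted above.

```python
import math

A_WEIGHT, S1, S2 = 0.583, 0.443, 2.04  # values quoted in the text (s1, s2 in arcmin)
SPACING = 0.6                          # sample spacing in arcmin


def gaussian2d(r, s):
    """Unit-volume isotropic 2D Gaussian evaluated at radius r."""
    return math.exp(-0.5 * (r / s) ** 2) / (2.0 * math.pi * s * s)


def psf(x, y):
    """Assumed two-Gaussian point-spread function (x, y in arcmin)."""
    r = math.hypot(x, y)
    return A_WEIGHT * gaussian2d(r, S1) + (1.0 - A_WEIGHT) * gaussian2d(r, S2)


def sampled_psf(half_width=15, spacing=SPACING):
    """Square kernel sampled every `spacing` arcmin and normalized to sum to 1."""
    coords = [i * spacing for i in range(-half_width, half_width + 1)]
    kernel = [[psf(x, y) for x in coords] for y in coords]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]
```

Normalizing the sampled kernel to unit sum keeps the mean luminance of a half-image unchanged after convolution.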
The half-images were then sent to the binocular cross-correlator, which computed the correlation between samples of the left and right half-images:
where L(x, y) and R(x, y) are the image intensities in the left and right half-images, WL and WR are the windows applied to the half-images, µL and µR are the mean intensities within the two windows, and δx is the displacement of WR relative to WL (where the displacement is disparity). The normalization by mean intensity assures that the correlation is always between −1 and 1; the correlation for identical images is 1. Without the normalization, the resultant would depend on the contrast and average intensity of the half-images. WL and WR were identical two-dimensional Gaussian weighting functions:
We used isotropic functions, so σx and σy had the same values (for an example of the use of anisotropic functions, see Kanade & Okutomi, 1994). These weighting functions were used to select patches of the left and right half-images, which were then cross-correlated. Throughout the manuscript, we refer to the size of these weighting functions as “window size”; the window size we report is the diameter of the part of the Gaussian containing ±1σ. The actual windows used in the simulations extended to ±3σ until they were truncated. Our weighting functions mimic the envelopes associated with cortical receptive fields, but not the even- and odd-symmetric weighting functions of the disparity-energy model (Ohzawa et al., 1990).
To estimate disparity across the stimulus, we shifted WL along a vertical line perpendicular to the sawtooth corrugations in the middle of the left eye’s half-image. For each position of WL, we then computed the correlation for different horizontal positions of WR relative to WL (Equation 9; horizontal defined in Helmholtz coordinates). The restriction of shifting WL along one vertical line greatly reduced computation time but did not affect the main results.
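The procedure described above can be sketched in simplified form. The code below is a one-dimensional illustration rather than the model itself: it applies a Gaussian window truncated at ±3σ, computes a window-weighted normalized correlation for each candidate displacement, and takes the displacement with the highest correlation as the disparity estimate. The weighted-correlation form and all function names are assumptions of this sketch, not a transcription of Equation 9.

```python
import math


def gaussian_window(n, center, sigma):
    """Gaussian weights over n samples, truncated at +/-3 sigma."""
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2)
            if abs(i - center) <= 3 * sigma else 0.0
            for i in range(n)]


def windowed_correlation(left, right, center, sigma, shift):
    """Normalized correlation of a left patch with the right patch displaced by `shift`."""
    n = len(left)
    w = gaussian_window(n, center, sigma)
    pairs = [(w[i], left[i], right[i + shift])
             for i in range(n) if w[i] > 0 and 0 <= i + shift < n]
    if not pairs:
        return 0.0
    total = sum(wi for wi, _, _ in pairs)
    mu_l = sum(wi * l for wi, l, _ in pairs) / total
    mu_r = sum(wi * r for wi, _, r in pairs) / total
    cov = sum(wi * (l - mu_l) * (r - mu_r) for wi, l, r in pairs)
    var_l = sum(wi * (l - mu_l) ** 2 for wi, l, _ in pairs)
    var_r = sum(wi * (r - mu_r) ** 2 for wi, _, r in pairs)
    denom = math.sqrt(var_l * var_r)
    return cov / denom if denom > 0 else 0.0


def estimate_disparity(left, right, center, sigma, max_shift):
    """Displacement (in samples) that maximizes the windowed correlation."""
    return max(range(-max_shift, max_shift + 1),
               key=lambda d: windowed_correlation(left, right, center, sigma, d))
```

Because the means are subtracted and the result is divided by the weighted standard deviations, the output lies between −1 and 1 and equals 1 when the two windowed patches are identical.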
The lower part of Figure 8 provides an example of the output of the cross-correlator. The abscissa represents the position of WL along the vertical search line and the ordinate represents the horizontal position of WR relative to WL; thus, the ordinate is the horizontal disparity. Red corresponds to high correlations, green to correlations near 0, and blue to negative correlations.
To compare model and human behaviors for the same task, we needed a decision rule for the model. It would have been best to use an ideal decision rule because such a rule is information preserving and is therefore best at revealing constraints imposed by earlier processing stages (Watson, 1985). However, to construct an ideal rule, we would need to know the means and standard deviations for each pixel in the correlation image for all the relevant parameters. The required computation was too complex and time consuming, so we chose a simpler rule: template matching (Watson, Barlow, & Robson, 1983). We first constructed templates of the post-optics stimuli in disparity-estimation space. Specifically, we constructed a bank of templates with the same dimensions as the cross-correlator output. Each template in this bank had the same spatial frequency as the corrugation waveform. All relevant amplitudes were included as were the two parities. To minimize computation time, phase was varied in steps of 10° (the phase of the stimulus was also limited to 10° steps). We next found the similarity of each template to the output of the cross-correlator by using an abbreviated form of cross-correlation: each template was multiplied element by element by the cross-correlator output and the result was summed across both dimensions. The model then picked the stimulus whose template had the highest correlation with the cross-correlator output. The model therefore knew the spatial frequency of the stimulus but was uncertain about everything else. By recording the model’s responses, we were able to construct psychometric functions like the ones generated in the human experiments. These functions were fit by cumulative Gaussians using a maximum-likelihood criterion; the threshold estimates were the means of the best-fitting functions (Wichmann & Hill, 2001).
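The template-matching rule amounts to an element-by-element product followed by a sum, with the response given by the best-scoring template. A minimal sketch, with hypothetical names and with templates stored as nested lists keyed by a label, might look like this:

```python
def template_score(template, response):
    """Sum of the element-by-element product of a template and the correlator output."""
    return sum(t * r
               for t_row, r_row in zip(template, response)
               for t, r in zip(t_row, r_row))


def pick_template(templates, response):
    """Return the label of the template with the highest score."""
    return max(templates, key=lambda label: template_score(templates[label], response))
```

In the model, the bank would hold one template for each combination of amplitude, phase, and parity at the stimulus spatial frequency, and the parity of the winning template would be taken as the response.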
The results are shown in Figures 9 and 10. In Figure 9, the model’s coherence threshold is plotted against disparity amplitude. Threshold rose with increasing amplitude, but at different rates depending on the spatial frequency of the horizontal corrugation stimulus. In Figure 10, the same data are plotted as a function of disparity gradient. Now the curves superimpose, so the disparity gradient, not the disparity amplitude, was the primary constraint for the local cross-correlator.
We also conducted simulations with vertically oriented sawtooth corrugations and observed quite similar behavior (not shown).
The size of the window used to correlate the two eyes’ images is an important aspect of disparity estimation (Banks et al., 2004; Harris et al., 1997; Kanade & Okutomi, 1994). Larger windows contain more variation in luminance and thus generally allow better discrimination between correct and false matches. However, when disparity varies at a finer scale than the window, the ability to estimate the variation is reduced. To investigate the influence of window size on disparity estimation, we ran the local cross-correlator with window sizes of 6, 18, and 30 arcmin. Figure 11 shows the results. For every window size, threshold grew systematically with an increase in disparity gradient, and the growth was the same for different spatial frequencies. Thus, the fall-off in performance with increasing disparity gradient is similar regardless of the size of the correlation window. Overall performance was somewhat reduced with the smallest window of 6 arcmin because the dot density of 10 dots/deg² was too low to provide sufficient luminance variation over that small a region. We conclude that disparity estimation via correlation suffers from a disparity-gradient limit, and that the limit cannot be avoided by choosing smaller window sizes.
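A back-of-the-envelope calculation shows why the smallest window is short of dots. The sketch below treats the window as a hard-edged disc, which is a simplification of the Gaussian weighting, and computes the expected number of dots inside it; `support_scale=3` corresponds to the ±3σ extent of the truncated window, given that the reported size is the ±1σ diameter. The function name is hypothetical.

```python
import math


def expected_dots_in_window(density_per_deg2, window_arcmin, support_scale=1.0):
    """Expected dot count in a disc of diameter support_scale * window_arcmin."""
    radius_deg = support_scale * window_arcmin / 60.0 / 2.0
    return density_per_deg2 * math.pi * radius_deg ** 2


for w in (6, 18, 30):
    core = expected_dots_in_window(10, w)
    full = expected_dots_in_window(10, w, support_scale=3)
    print(f"{w:2d} arcmin: {core:.2f} dots within +/-1 sigma, {full:.2f} within +/-3 sigma")
```

At 10 dots/deg², a 6-arcmin window contains on average fewer than 0.1 dots within ±1σ and fewer than 1 dot within ±3σ, whereas a 30-arcmin window contains roughly 2 and 18 dots, respectively.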
We next compared the performance of the local cross-correlator model with human performance. Figure 12 plots coherence thresholds for the model and a representative human observer for the same stimulus and task. In both cases, the rise in threshold was determined by the disparity gradient, not the disparity amplitude. Thus, the human visual system is constrained in much the same way as a local cross-correlator when it is presented with two images that differ greatly.
There was one systematic difference between human and model behaviors: The rise in threshold as a function of disparity gradient was steeper in humans. This difference is probably caused by differences in the way large absolute disparities are treated by humans and the model. In humans, disparity-discrimination thresholds increase with the value of absolute disparity because disparity estimation is not as precise off the horopter as it is near the horopter (Blakemore, 1970; Ogle, 1953). Furthermore, the detection of correlation in dichoptic stimuli worsens systematically with increases in absolute disparity (Stevenson, Cormack, Schor, & Tyler, 1992). As the disparity gradient increased in our experiment, more of the stimulus fell farther from the horopter, which would have made it more difficult for humans to perceive the signal corrugation. Our model had no provision for penalizing solutions involving large disparities, so it did not suffer in the same way as humans when parts of the stimulus created large absolute disparities.
In summary, the disparity gradient of a stimulus is a critical determinant of humans’ ability to construct stereoscopic percepts. Its influence seems to be unaffected by the orientation or tilt of the depth variations. These same properties are exhibited by a local cross-correlation model. Thus, the disparity-gradient limit appears to be a byproduct of using this method to estimate disparities.
Although humans can detect very small disparities, stereoresolution—the finest spatial variation in disparity that can be reliably perceived—is relatively poor (Bradshaw & Rogers, 1999; Tyler, 1973, 1974, 1975). The highest detectable spatial frequency for disparity-defined corrugations is only 2–3 cpd (Figure 1); in contrast, the highest detectable frequency for luminance-defined waveforms is about 50 cpd (Campbell & Robson, 1968). Banks et al. (2004) proposed that relatively coarse stereoresolution is a byproduct of estimating disparity by correlating the two eyes’ images. To further investigate this possibility, we compared the performance of humans and the local cross-correlator in the same stereoresolution task.
The human data are from Banks et al. (2004); here we describe the most important aspects of the methods; for details, refer to the original paper.
The stimuli were random-dot stereograms specifying sinusoidal corrugations in depth. The orientations of the corrugations were ±20° from horizontal. There were two viewing distances: 39 and 154 cm. At the shorter viewing distance, the stimuli were viewed using the haploscope described in the disparity-gradient experiment. At the longer distance, the CRTs were detached from the haploscope arms and the stimuli were free-fused. The stimuli subtended 35.5 × 35.5° and 9.3 × 9.3° at the short and long distances, respectively. The apparent positions of the dots were accurate to better than 30 and 8 arcsec at the two distances. By using nearly horizontal corrugations, we minimized monocular artifacts (signal-dependent changes in dot density in the monocular half-images). We varied dot density in the half-images from 0.15 to 145 dots/deg². The disparity amplitude was 4.8 or 16 arcmin.
Stimuli were presented for 600 ms at the shorter viewing distance and 1500 ms at the longer one. After each presentation, observers indicated which of the two corrugation orientations had been presented. The spatial frequency of the corrugation was varied by an adaptive staircase procedure to find the just-discriminable value.
Figure 13 plots human stereoresolution as a function of dot density. The three curves represent the data when different amounts of spatial blur were applied to the half-images.
The red circles represent the data when no blur was applied. For this condition, resolution increased systematically with increasing density until reaching a plateau at 3–4 cpd. The initial rise in resolution is understandable from sampling considerations. In random-dot stereograms, the discrete dot sampling in the half-images limits the highest frequency one can reconstruct to the Nyquist sampling frequency:
$$f_N = \frac{1}{2}\sqrt{\frac{N}{A}},$$
where N is the number of dots and A is the stimulus area. The diagonal line in Figure 13 is $f_N$ for the various dot densities. Human stereoresolution followed this sampling limit up to a density of ~30 dots/deg², so the highest resolvable spatial frequency was determined by dot sampling in the half-images at low to medium densities. At higher densities, it was restricted by something else. Banks et al. (2004) showed that the restriction was not the disparity-gradient limit; they showed this by reducing disparity and thereby reducing the disparity gradient, and observing that the asymptotic spatial frequency was unaffected.
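The sampling limit is easy to evaluate numerically. The sketch below assumes the usual two-dimensional form of the Nyquist frequency, half the square root of the dot density N/A, and describes the human data with a deliberately crude rule: resolution follows the Nyquist frequency until it reaches a plateau. The plateau value passed in and the function names are illustrative assumptions.

```python
import math


def nyquist_frequency(n_dots, area_deg2):
    """Assumed 2D Nyquist frequency (cpd) for n_dots spread over area_deg2."""
    return 0.5 * math.sqrt(n_dots / area_deg2)


def predicted_resolution(density_per_deg2, plateau_cpd):
    """Crude two-regime description: Nyquist-limited, then capped at a plateau."""
    return min(nyquist_frequency(density_per_deg2, 1.0), plateau_cpd)
```

At 30 dots/deg², `nyquist_frequency(30, 1.0)` is about 2.7 cpd, which is in the neighborhood of the unblurred asymptote, so sampling stops being the limiting factor at roughly that density.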
A logical candidate for the cause of the asymptote is low-pass spatial filtering at the front end of the visual system because such filtering limits performance in many resolution tasks including luminance contrast sensitivity at high spatial frequency (Banks, Geisler, & Bennett, 1987; Banks, Sekuler, & Anderson, 1991; MacLeod, Williams, & Makous, 1992) and various types of visual acuity (Levi & Klein, 1985). Banks et al. (2004) examined the contribution of low-pass spatial filtering to stereoresolution by blurring the stimulus. Such filtering reduces fine-scale luminance variation in the half-images. The three sets of curves in Figure 13 represent the data when different amounts of blur were applied. Stereoresolution always followed the Nyquist frequency at low densities and always asymptoted at higher densities. The asymptotic resolution was ~4 cpd when the standard deviation of the stimulus blur was 0, ~2 cpd when the standard deviation was 3 arcmin, and ~1 cpd when it was 6 arcmin. Thus, stereoresolution in humans is limited by the sampling properties of the stimulus at low dot densities and seems to be limited by the luminance spatial-frequency content of the half-images at high densities.
We examined these determinants further by comparing human behavior to the behavior of the cross-correlator. We did so for the same stimuli and task in order to determine whether the relatively coarse stereoresolution of the human visual system is a byproduct of using correlation to estimate disparity.
The stimuli and task were nearly identical to those in Banks et al. (2004). Sine wave corrugations were presented with orientations that were ±10° from horizontal. The model judged on each presentation which orientation had been presented. The spatial frequency of the corrugations was varied to find the highest value for which orientation discrimination was 75% correct. We varied dot density from 0.6 to 300 dots/deg². Peak-to-trough disparity amplitude was fixed at 10 arcmin, a value midway between the values of 4.8 and 16 arcmin used in Banks et al. (2004). For some modeling runs, we blurred the half-images with isotropic Gaussians before presenting them to the rest of the model.
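The stimulus described above can be sketched as a random-dot stereogram whose dot disparities follow a sinusoid tilted ±10° from horizontal. This is only an illustration: the function names, the sign convention for disparity, the square stimulus area, and the way each dot's disparity is split between the two half-images are our assumptions, not details taken from the paper.

```python
import math
import random

def corrugation_disparity(x, y, freq_cpd, tilt_deg, peak_to_trough_arcmin=10.0):
    """Disparity (arcmin) of a sinusoidal corrugation at position (x, y) in deg.

    tilt_deg = 0 gives a horizontal corrugation (disparity varies with y only).
    """
    theta = math.radians(tilt_deg)
    u = y * math.cos(theta) - x * math.sin(theta)  # coordinate across the ridges
    return 0.5 * peak_to_trough_arcmin * math.sin(2.0 * math.pi * freq_cpd * u)

def make_stereogram(density, size_deg, freq_cpd, tilt_deg, seed=0):
    """Return (left, right) lists of dot positions in deg for a square stimulus."""
    rng = random.Random(seed)
    n_dots = round(density * size_deg ** 2)
    left, right = [], []
    for _ in range(n_dots):
        x = rng.uniform(-size_deg / 2, size_deg / 2)
        y = rng.uniform(-size_deg / 2, size_deg / 2)
        d_deg = corrugation_disparity(x, y, freq_cpd, tilt_deg) / 60.0
        # Assumed convention: split the disparity symmetrically between the eyes.
        left.append((x + d_deg / 2, y))
        right.append((x - d_deg / 2, y))
    return left, right

left, right = make_stereogram(density=10, size_deg=4, freq_cpd=1.0, tilt_deg=10)
print(len(left), "dots per half-image")
```

A trial in the discrimination task would then draw `tilt_deg` from +10 or -10 and ask the observer (or model) which was shown.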
The properties of the model were identical to those in the analysis of the disparity-gradient limit with the following exceptions.
When the stimulus frequency exceeded the Nyquist frequency, aliases were created. Because of this, a lower frequency template fit the stimulus as well as a template at the stimulus frequency. By including alias templates, we did not unfairly bias the model toward picking the correct template. As in the human experiments, the model did not know the spatial frequency of the stimulus but did know the amplitude.
Banks et al. (2004) argued that there are two primary limiting factors in estimating disparity via correlation: the rate at which disparity changes spatially relative to the correlation window and the amount of luminance variation within the correlation window. With respect to the first factor, the cross-correlator should fail to detect disparity variation when the variation occurs at too fine a spatial scale relative to the size of the window because the correlation signal is adversely affected and because the now noisier disparity estimate at each position will smear rather than follow the spatial variation. As a consequence, the highest corrugation frequency the cross-correlator can resolve should level off at a value that is inversely proportional to window size: fA ∝ 1/w, where fA is the asymptotic spatial frequency and w is the window size. The model should be able to detect disparity variation at a finer scale simply by using smaller windows. However, for each window size there should also be a dot density below which estimation fails because the number of dots falling within the correlation window becomes on average too small, and correlation rises for false matches leading to failures in disparity estimation. From this argument, the limiting dot density would be inversely proportional to window area: d ∝ 1/w². Thus, for each combination of corrugation frequency and dot density, there should be an optimal size for the correlation window, a size constrained by the two limiting factors.
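The windowed estimate that this argument concerns can be illustrated in one dimension. The sketch below is a deliberately simplified stand-in, not the model used in the paper: it uses a uniform (boxcar) window, integer shifts, and no optical or neural filtering, and all names are ours. It slides a window from the left image across the right image and picks the shift with the highest normalized correlation.

```python
import math
import random

def normalized_correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # no luminance variation in the window: correlation undefined
    return cov / math.sqrt(var_a * var_b)

def estimate_disparity(left, right, center, half_width, max_shift):
    """Return (shift, correlation) maximizing correlation around `center`."""
    window = left[center - half_width:center + half_width + 1]
    best_shift, best_corr = 0, -2.0
    for shift in range(-max_shift, max_shift + 1):
        start = center - half_width + shift
        candidate = right[start:start + 2 * half_width + 1]
        corr = normalized_correlation(window, candidate)
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    return best_shift, best_corr

# Toy images: the right image is the left image displaced by 3 samples.
rng = random.Random(1)
n, true_shift = 200, 3
left = [rng.gauss(0.0, 1.0) for _ in range(n)]
right = [left[(i - true_shift) % n] for i in range(n)]
shift, corr = estimate_disparity(left, right, center=100, half_width=10, max_shift=8)
print(shift, round(corr, 3))
```

In this toy case the disparity is constant across the window, so the peak correlation is essentially perfect. When disparity varies within the window, no single shift registers the whole patch and the peak correlation drops, which is the first limiting factor described above; a window containing too little luminance variation triggers the second.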
We investigated whether the stereoresolution of the cross-correlation model follows these expectations. Figure 14 shows the model’s stereoresolution as a function of dot density; different symbols represent resolution for correlation windows of different sizes. In each case, the model’s stereoresolution followed the Nyquist frequency up to a particular dot density and then leveled off.
We next investigated the asymptotes where performance fell short of the Nyquist frequency. To examine the possibility that low-pass spatial filtering at the front end of the visual system limits stereoresolution, we manipulated the blur of the images delivered to the cross-correlation stage (Equation 8) and re-measured the model’s stereoresolution. We manipulated blur by varying the blur of the stimulus and the optical point-spread function of the pre-correlation stages. Figure 15 plots the asymptotic spatial frequency against window size for different amounts of blur. The squares, diamonds, and triangles represent the results when the optical point-spread function was the standard for the well-focused eye (Geisler & Davila, 1985) and before entering the eyes the stereo half-images were convolved with isotropic Gaussians with standard deviations of 0, 2.4, and 4.8 arcmin, respectively. The circles represent the results when the spread of the optical function was half the standard value and no blur was applied to the stimulus before entering the eyes. For large windows, the data are well fit by fA = 45/w (diagonal line; fA in cpd and w in arcmin). (The absolute values of the spatial frequency plateaus can depend on other stimulus parameters, such as disparity amplitude, but a description of the effects of those parameters is beyond the scope of this paper.) For smaller window sizes, however, the asymptotic frequency leveled off at values lower than predicted. Those asymptotic frequencies depended strongly on blur magnitude: lower asymptotes for greater blur. Thus, there are two things that limit the highest discriminable corrugation frequency: the size of the correlation window (summarized by fA = 45/w), and the blur associated with the images sent to the cross-correlator (with greater blur, the luminance variation is insufficient to yield robust correlations between the two eyes’ images).
The most important issue is whether the model’s stereoresolution is similar to that of humans. Figure 16 plots model and human stereoresolutions for similar conditions. The similarity is striking. In both cases, resolution grew with increasing dot density in a fashion limited by the Nyquist frequency until it leveled off at a particular spatial frequency. The asymptotic spatial frequency of the model and humans was inversely related to the magnitude of spatial blur. To examine whether the effect of blur was similar in machine and man, we plotted in Figure 17 the humans’ and model’s asymptotic frequencies as a function of blur magnitude. We could manipulate the size of the model’s correlation window, but we obviously could not directly manipulate the size of the mechanism humans use. It is significant to note, however, that the model’s stereoresolution was most similar to human resolution when the model’s window size was set to 6 arcmin: the slopes of the two lines are nearly the same.
We also found, as expected, that for each window size the dot density had to be greater than ~150/w² (where dot density is in dots/deg² and w is in arcmin) for the model to estimate disparity reliably. Thus, we were able to confirm that there is indeed an optimal window size for each combination of corrugation frequency and element density (Banks et al., 2004; Kanade & Okutomi, 1994).
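The two empirical constraints reported for the model, fA = 45/w for large windows and a minimum dot density of about 150/w², together bracket the window sizes that can work for a given stimulus. The sketch below simply inverts both relations. It is a simplification that ignores the blur-limited plateau seen with small windows, and the function name is ours.

```python
import math

def usable_window_range(corrugation_cpd, dot_density):
    """Window sizes (arcmin) satisfying both constraints, or None if none exist.

    corrugation_cpd: corrugation frequency in cycles/deg.
    dot_density: dot density in dots/deg^2.
    """
    w_min = math.sqrt(150.0 / dot_density)  # from density > 150 / w^2
    w_max = 45.0 / corrugation_cpd          # from f_A = 45 / w
    if w_min > w_max:
        return None  # too few dots to support a window small enough
    return w_min, w_max

print(usable_window_range(1.0, 150.0))  # dense stimulus, coarse corrugation
print(usable_window_range(4.0, 1.0))    # sparse stimulus, fine corrugation
```

When the function returns `None`, the stimulus is too sparse for any window small enough to follow the corrugation, which corresponds to the regime where resolution is set by dot sampling rather than by window size.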
Disparity estimation by local cross-correlation has been a useful tool in computer vision (Clerc & Mallat, 2002; Kanade & Okutomi, 1994) and physiology (Cumming & DeAngelis, 2001; Ohzawa, 1998; Ohzawa et al., 1990). Now we see that it can explain some important limitations in human vision as well: namely, our inability to perceive large disparity gradients and our comparatively low stereoresolution. The primary difficulty in using local cross-correlation to estimate disparities is choosing an appropriate size for the image patches sampled in each eye (Banks et al., 2004; Kanade & Okutomi, 1994). Using patches that are too large leads to lower stereoresolution, while patches that are too small may not contain enough luminance variation to enable a reliable disparity estimate (Figure 14). This trade-off between the consequences of using too large or too small a correlation window is fundamental to understanding the disparity-gradient limit and stereoresolution.
As shown in Figure 14, smaller windows generally permit higher stereoresolution, but windows smaller than ~6 arcmin do not yield significant improvements (Figure 15): with optical blur equivalent to the well-focused eye (Geisler & Davila, 1985), the finest discernible corrugation frequency levels off at ~4 cpd no matter what the window size is. Why does decreasing the window size below ~6 arcmin no longer improve the model’s stereoresolution? The answer becomes clear by considering the blur of the inputs to the cross-correlation stage. When we added blur to the stimulus, thereby increasing the input blur, stereoresolution worsened. When we subtracted blur from the inputs by improving the eyes’ optics beyond normal, stereoresolution improved, particularly for window sizes smaller than 6 arcmin. The unfilled symbols in Figure 17 represent the model’s asymptotic corrugation frequency as a function of stimulus blur for different window sizes; the filled symbols represent human asymptotic frequencies as a function of stimulus blur. Interestingly, model and human behaviors are most similar when the model’s window size was 6 arcmin; the model’s threshold frequencies were slightly higher (perhaps because it uses a more efficient decision rule than humans), but the effect of blur was the same when window diameter was 6 arcmin. This suggests that the smallest mechanism in humans has a diameter of roughly 3–6 arcmin, which is the smallest useful size given the optics of the human eye.
Harris et al. (1997) also attempted to determine the smallest spatial mechanism used in disparity estimation. They measured the ability to judge whether a central dot was farther or nearer than surrounding dots as a function of the disparity of the central dot and its separation from the surrounding dots. They compared human performance in that task with that of a binocular-matching algorithm that correlated the two eyes’ images. Human and model performances were most similar when the width of the model’s square correlation window was 4–6 arcmin. In our analysis, we concluded that the smallest mechanism used by human observers has a diameter of 3–6 arcmin (twice the standard deviation of the Gaussian envelope). Thus, despite differences in the psychophysical tasks, we and Harris et al. (1997) both conclude that the smallest useful mechanism in disparity estimation has a diameter of 3–6 arcmin.
As Burt and Julesz (1980) showed, the disparity-gradient limit constrains the fusibility of multiple objects. Because of this limit, a visible point creates zones in front of and behind itself within which a second point cannot be fused. The limit is independent of the direction in which the gradient is measured (i.e., independent of the direction of S in Equation 3), so the forbidden zones are cones with their apices at each visible point.
How constraining is the disparity-gradient limit for everyday vision? To examine this question, we calculated the slants that correspond to different disparity gradients for a variety of distances; the geometric relationship between slant and disparity gradient is tilt independent, so the calculations are valid for horizontal-disparity gradients in any direction. Figure 18 shows the results. For a typical reading distance of 40 cm, a disparity-gradient limit of 1 means that opaque surfaces with slants greater than ±80° cannot be readily perceived stereoscopically. Moreover, slants of 80° or more are relatively unlikely to stimulate a given part of the retina. If slants (S) are uniformly distributed in the world, the distribution of slants that stimulate a region in the retina is proportional to cos(S) defined from −90° to +90° (Arnold & Binford, 1980; Hillis, Watt, Landy, & Banks, 2004). As a consequence, it is quite uncommon for a given part of the retina to be stimulated by slants sufficiently large to exceed the disparity-gradient limit at distances of 40 cm and beyond. We conclude that the gradient limit is generally not problematic for everyday viewing of opaque surfaces. Gradients exceeding the disparity-gradient limit are much more likely when viewing transparent surfaces (Akerstrom & Todd, 1988).
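The numbers in this paragraph can be roughly reproduced with a first-order geometric sketch. For a surface patch straight ahead at distance d, viewed with interocular separation I, the disparity gradient (disparity divided by cyclopean separation) works out to (I/d)·tan(S) for slant S. The code below assumes that relation and an interocular distance of 6.5 cm; both are our assumptions and may differ from the calculation behind Figure 18. It also evaluates the fraction of a cos(S) slant distribution that lies beyond a given slant, which is 1 − sin(S).

```python
import math

def critical_slant_deg(gradient_limit, distance_cm, interocular_cm=6.5):
    """Slant (deg) at which the disparity gradient reaches gradient_limit.

    Assumes gradient = (I / d) * tan(slant) for a patch straight ahead.
    """
    return math.degrees(math.atan(gradient_limit * distance_cm / interocular_cm))

def fraction_beyond(slant_deg):
    """Fraction of a cos(S) slant distribution with |S| greater than slant_deg."""
    return 1.0 - math.sin(math.radians(slant_deg))

slant = critical_slant_deg(1.0, 40.0)
print(f"critical slant at 40 cm: {slant:.1f} deg")
print(f"fraction of slants beyond 80 deg: {fraction_beyond(80.0):.3f}")
```

Under these assumptions the critical slant at 40 cm comes out close to the 80° quoted above, and only about 1.5% of the cos(S) distribution lies beyond 80°.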
In Equations 2–5, we defined the disparity gradient between two points on a surface as |D|/|S|, where D is the vector representing the binocular disparity between the points and S is the vector representing the separation between them. As far as we know, all previous investigations of the disparity-gradient limit have only considered horizontal disparities. We have argued here that the disparity-gradient limit is caused by the decrease in local cross-correlation that occurs when the two eyes’ images become too different, as happens with large disparity gradients. From this point of view, it should make little difference whether the images differed horizontally or vertically. We wondered, therefore, whether a similar disparity-gradient limit applies for vertical disparity. Figure 19 demonstrates that such a limit does exist for vertical disparity and that the critical value is about the same as it is for horizontal disparity.
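The definition |D|/|S| can be made concrete with a short sketch. Equations 2–5 are not reproduced here, so the code assumes the common convention that a point's disparity is its left-eye position minus its right-eye position, D is the difference in disparity between the two points, and S is the separation between their cyclopean (averaged) positions. The function name and these conventions are ours. Because D is treated as a vector, the same computation covers horizontal and vertical disparities.

```python
import math

def disparity_gradient(p1_left, p1_right, p2_left, p2_right):
    """|D| / |S| for two points given as (x, y) image positions in each eye."""
    # Disparity of each point: left-eye position minus right-eye position.
    d1 = (p1_left[0] - p1_right[0], p1_left[1] - p1_right[1])
    d2 = (p2_left[0] - p2_right[0], p2_left[1] - p2_right[1])
    D = (d2[0] - d1[0], d2[1] - d1[1])
    # Cyclopean position of each point: mean of the two eyes' positions.
    c1 = ((p1_left[0] + p1_right[0]) / 2, (p1_left[1] + p1_right[1]) / 2)
    c2 = ((p2_left[0] + p2_right[0]) / 2, (p2_left[1] + p2_right[1]) / 2)
    S = (c2[0] - c1[0], c2[1] - c1[1])
    separation = math.hypot(S[0], S[1])
    if separation == 0:
        return math.inf  # the two points share a cyclopean direction
    return math.hypot(D[0], D[1]) / separation

# Horizontal disparity of 1 unit across a cyclopean separation of 1 unit.
print(disparity_gradient((0, 0), (0, 0), (1.5, 0), (0.5, 0)))
# Vertical disparity of 0.5 units across the same separation.
print(disparity_gradient((0, 0), (0, 0), (1, 0.25), (1, -0.25)))
```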
It is interesting to note that the causes of vertical disparity are quite different from the causes of horizontal disparity. Specifically, non-zero vertical disparities arise because of differences in the distance of surface patches from the two eyes. Large vertical disparities, therefore, only occur at extreme azimuths (as measured from straight ahead) and near distances; they are, in other words, a product of position relative to the head, not surface slant. Non-zero horizontal disparities arise because of differences in the distance from the two eyes but also because of surface slant. Thus, the smoothness and ordering constraints proposed as the motivation for a disparity-gradient limit with horizontal disparity (Porrill, Mayhew, & Frisby, 1985; Trivedi & Lloyd, 1985) do not apply to vertical disparity. As a consequence, the presence of a disparity-gradient limit with vertical disparities is further evidence that the limit is a byproduct of estimating disparity by correlating the two eyes’ images.
When using local cross-correlation to estimate disparity, the correlation is highest when the two retinal images, once shifted into registration, are very similar. The highest correlations are therefore observed when the disparity gradient is zero and the images are identical, which occurs only when the stimulus surface is straight ahead and frontoparallel (tangent to the Vieth–Müller Circle). Any deviation from those conditions causes dissimilarities in the retinal images and therefore reduces correlation. One could modify the cross-correlation algorithm to overcome this difficulty by warping the images in the two eyes, before the cross-correlation stage, to account for the expected magnitude and direction of the gradient in each region of the image (Panton, 1978). Constructing such an algorithm would, however, be computationally expensive. For each visual direction, the algorithm would require several units, each designed to warp the images in the fashion appropriate for the expected surface tilt. To encompass 360° of tilt, the algorithm would require at least four units tuned to tilts differing by 90° from one another. The fact that the local cross-correlator presented here mimics human behavior reasonably well suggests that such image warping does not occur before the site of binocular interaction.
Nienborg, Bridge, Parker, and Cumming (2004) came to a similar conclusion. They measured the sensitivity of disparity-selective V1 neurons to sinusoidal disparity waveforms and found that such neurons are low-pass: that is, their sensitivity to low corrugation frequencies is not significantly lower than their peak sensitivity. From this, they concluded that V1 neurons are not tuned for any non-zero disparity gradients and therefore they limit the ability of the visual system to resolve fine variations in disparity. If V1 neurons are in fact not selective for slant and tilt, as Nienborg and colleagues propose, how are the slant- and tilt-selective neurons in extrastriate cortex (Nguyenkim & DeAngelis, 2003; Shikata, Tanaka, Nakamura, Taira, & Sakata, 1996) created? It could be done by combining the outputs of V1 neurons tuned for different depths. With the right combination, a higher order neuron selective for a particular magnitude and direction of disparity gradient could be constructed. As Nienborg et al. point out, such higher order neurons could not have greater stereoresolution than observed among V1 neurons. We believe that our psychophysical results manifest this: Although the human visual system is exquisitely sensitive to small disparities, stereoresolution is relatively poor due to the manner in which disparity estimation is done.
The boundary on the top-right and top of the stereo depth region in Figure 1 represents the disparity-gradient limit and stereoresolution limit. We have shown here that the local cross-correlation model exhibits both phenomena. If this is the case, the model should have constraints in the combinations of disparity amplitude and spatial frequency like the ones in Figure 1. To examine this, we constructed an orientation-discrimination task similar to the one used in the Stereoresolution section of this paper and presented it to the model. A random-dot stereogram defined a disparity-modulated sine wave rotated ±10° from horizontal. For the right part of the depth region in Figure 1, spatial frequency was fixed and the disparity amplitude threshold was measured. For the top part of the depth region, amplitude was fixed and the spatial-frequency threshold was found. Figure 20 shows the modeling results (red points) along with Tyler’s data from human observers. As we expected, the model’s fusion limit in the upper right falls on a line with a slope of −1; this is very similar to Tyler’s results except for the fact that the model can handle larger disparities in part because it is not penalized for off-horopter disparity estimates. The model’s stereoresolution levels off at ~6 cpd, a bit higher than the human asymptote. The human data and model data are therefore quite similar in shape except for the model’s generally higher sensitivity.
We showed that the disparity-gradient limit and stereoresolution can both be understood as consequences of using local cross-correlation to estimate disparity. Thus, these limitations of human stereopsis are byproducts of the method used for disparity estimation.
The authors thank Sergei Gepshtein for helpful discussion in the beginning of the project and Björn Vlaskamp for helpful comments on an earlier draft. This research was supported by NIH Research Grant EY-R01-08266 to MSB.
Commercial relationships: none.
Heather R. Filippini, UCSF and UCB Joint Graduate Group in Bioengineering, University of California at Berkeley, Berkeley, CA, USA.
Martin S. Banks, School of Optometry, Psychology, Wills Neuroscience, University of California at Berkeley, Berkeley, CA, USA.