The display is shown schematically in . In each eye’s view, a mirror and two plate beam splitters create a light field that is the sum of aligned images drawn at three image planes. A periscope assembly produces the binocular visual field. Because the light comes from sources at different distances, the result is a volumetric stereoscopic display. One way in which our display differs from other volumetric approaches is that we fix the eyes’ positions. Knowing the observer’s position, we can calculate each eye’s view, display the correct disparities, and preserve viewpoint-specific lighting effects such as occlusion, specularity, and reflection. The display presents disparities and scene geometry at high resolution. It presents focus cues at relatively low resolution in depth, because sensitivity to focus cues is far poorer than sensitivity to spatial position (Campbell, 1957; Rolland, Krueger, & Goon, 1999). shows how we create three focal distances for each eye by dividing the screen into six viewports.
Under typical viewing conditions, the eye’s depth of focus is ±0.25 to ±0.3 D (Campbell, 1957; Charman & Whitefoot, 1977), which corresponds to a range of 0.5–0.6 D around fixation. In our display, the image planes are placed slightly farther apart, at 1.87 D (far plane), 2.54 D (mid plane), and 3.21 D (near plane): separations of 0.67 D (). The image-plane spacing is therefore just slightly greater than a standard observer’s depth of focus. The monitor, an IBM T221 liquid-crystal display (LCD), has a maximum resolution of 3840 × 2400 pixels. At that resolution, pixels subtend 1.38, 1.09, and 0.80 arcmin at the near, mid, and far image planes, respectively, which for a standard observer is near the acuity limit. We drive the display with an Nvidia Quadro 4 900XGL graphics card. At maximum resolution, the card supports only a 12-Hz refresh rate; because of the sample-and-hold manner in which LCD pixels are illuminated, the images nonetheless did not appear to flicker. For time-sensitive experiments, we ran the display at half resolution, 1920 × 1200, which boosts the refresh rate to 41 Hz.
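The relationship between the dioptric plane positions and the quoted pixel subtenses can be checked with a short calculation. This is an illustrative sketch: the pixel pitch below is the IBM T221’s nominal 0.1245 mm (204 ppi), an assumption not stated in the text.

```python
import math

# Nominal pixel pitch of the IBM T221 (204 ppi); an assumption, not from the text.
PIXEL_PITCH_MM = 0.1245

def pixel_arcmin(plane_diopters, pitch_mm=PIXEL_PITCH_MM):
    """Angular subtense of one pixel, in arcmin, for an image plane
    specified by its dioptric distance."""
    distance_mm = 1000.0 / plane_diopters          # diopters -> millimeters
    return math.degrees(pitch_mm / distance_mm) * 60.0

for name, d in [("far", 1.87), ("mid", 2.54), ("near", 3.21)]:
    print(f"{name} plane ({d} D): {pixel_arcmin(d):.2f} arcmin/pixel")
```

With these assumptions the far, mid, and near planes come out at roughly 0.80, 1.09, and 1.37 arcmin per pixel, matching the values in the text to within rounding.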
Because the light rays approaching the eyes come from different distances, alignment of the eyes with the viewports is critical. We first use a sighting device (Hillis & Banks, 2001) to adjust the bite bar (and therefore the observer’s eyes) relative to the apparatus so that the eyes are placed in the appropriate position. Once the observer is in position, the separation between the viewing apertures is set equal to the observer’s inter-ocular distance, and the corresponding software parameter is set to the same value. We then fine-tune the alignment in software using a between-image-plane vernier alignment technique (Akeley et al., 2004). The alignment is accurate to within seconds of arc and ensures that light from the appropriate pixels on different image planes sums along lines of sight.
No need to track accommodation
With this display, we can simulate the effects of differential focus without tracking the observer’s accommodation. To see this, consider a simple situation: We want to portray a small object at the distance of the far plane (53.6 cm) and another small object at the distance of the near plane (31.1 cm). For real objects positioned at those distances, accommodation to 53.6 cm would make the far object in focus and the near one out of focus. With accommodation to the near object, the reverse occurs. Exactly the same applies for our display because the light comes from different distances. The situation is more complicated for simulated object positions in between the planes. We discuss this in the next section.
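The two-object example above can be made concrete with geometric optics: the angular diameter of the retinal blur circle is (to first order) the pupil diameter times the defocus in diopters. This is a sketch under stated assumptions (a 4-mm pupil, a thin-lens eye, and neglect of diffraction and higher-order aberrations); none of these parameters come from the text.

```python
import math

def blur_circle_arcmin(pupil_mm, accommodation_D, target_D):
    """Geometric blur-circle diameter, in arcmin, for a thin-lens model eye.
    Ignores diffraction and higher-order aberrations (an approximation)."""
    defocus_D = abs(accommodation_D - target_D)      # defocus in diopters
    blur_rad = (pupil_mm / 1000.0) * defocus_D       # pupil (m) x defocus (1/m)
    return math.degrees(blur_rad) * 60.0

# Far plane at 53.6 cm (1.87 D), near plane at 31.1 cm (3.21 D); 4-mm pupil assumed.
print(blur_circle_arcmin(4.0, 1.87, 1.87))   # far object, eye focused far: sharp
print(blur_circle_arcmin(4.0, 1.87, 3.21))   # near object, eye focused far: blurred
```

Accommodating to the near plane instead simply swaps which object is blurred, which is why no accommodation tracking is needed: the display’s optics produce the correct blur for whatever state the eye adopts.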
To create the retinal images that best approximate images formed in real-world viewing, we render objects as unblurred on the image planes and let the observer’s optics create the appropriate blur.
For all but the very unlikely case in which the depth of a point in the scene coincides exactly with the depth of one of the image planes, a rule is required to assign image intensities to image planes. The simplest rule is the box filter (Figure 4, left): each point in the scene is drawn at the image plane to which it is closest. However, this approach produces blur discontinuities. Consider, for example, a line that extends from a near image plane to a farther one (Figure 4). For a given accommodative state, the retinal-image blur of the line will have one value for the parts drawn on one plane and another value for the parts drawn on the other plane, producing a visible discontinuity in the retinal image of the line. To minimize this problem, we use a tent filter (Figure 4, right). With this rule, the image intensity at each image plane is weighted according to the dioptric distance of the point from that plane, determined along a line of sight. This approach, which we call depth-weighted blending, eliminates the discontinuity. This blending technique is a significant technical development that can be applied to any multi-plane display.
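The tent-filter rule can be sketched in a few lines. This is a minimal illustration of depth-weighted blending as described above (linear interpolation of intensity in diopters between the two bracketing planes); the function name and interface are our own, not the authors’ implementation.

```python
def tent_weights(point_D, plane1_D, plane2_D):
    """Depth-weighted blending (tent filter): split a point's intensity between
    the two image planes that bracket it, in proportion to dioptric proximity."""
    near_D, far_D = max(plane1_D, plane2_D), min(plane1_D, plane2_D)
    assert far_D <= point_D <= near_D, "point must lie between the two planes"
    w_near = (point_D - far_D) / (near_D - far_D)   # 1 at the near plane, 0 at the far
    return w_near, 1.0 - w_near

# A point dioptrically midway between the far (1.87 D) and mid (2.54 D) planes
# is drawn at half intensity on each plane:
print(tent_weights(2.205, 2.54, 1.87))
```

Because the weights vary continuously with dioptric distance, a surface that slopes between two planes produces a smoothly varying blend rather than the abrupt intensity hand-off of the box filter.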
Figure 4. Pixel lighting with box and tent depth filters. The horizontal lines represent two image planes, the thick blue diagonal line the surface we wish to draw, and the thick blue horizontal lines the pixel intensities on the image planes.
Analysis of retinal-image formation
The retinal images formed when viewing real-world stimuli, stimuli on conventional 3D displays, and stimuli on our multi-plane display can differ substantially. In this section, we examine how the retinal images differ across these three viewing situations in order to better understand the costs and benefits of each kind of display.
We measured the optics of an individual eye (the first author’s left eye) using a Shack–Hartmann wavefront sensor. The optical aberrations were represented by Zernike polynomials, from which we computed the point-spread function (PSF; the retinal image created by a point source) for objects at various distances and for different accommodative states. We assumed that the aberrations do not change over small visual angles, so the retinal image of a small object can be calculated as the convolution of the object with the PSF. To calculate the PSF for varying amounts of defocus, we used the Zernike aberrations measured at one particular accommodative state and added the appropriate amount of Zernike defocus; this assumes that the higher-order aberrations do not change with accommodation (Cheng et al., 2004).
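The computation described above, adding defocus to a pupil function and Fourier-transforming to get the PSF, can be sketched as follows. This is an illustrative toy model only: it assumes an otherwise aberration-free eye with pure defocus (the authors used measured Zernike aberrations), and the pupil size, wavelength, and grid size are our own assumptions.

```python
import numpy as np

def defocus_psf(defocus_D, pupil_mm=4.0, wavelength_nm=555.0, n=256):
    """PSF of an aberration-free model eye with pure defocus, computed as the
    squared modulus of the Fourier transform of the generalized pupil function."""
    r_pupil = pupil_mm / 2000.0                      # pupil radius in meters
    x = np.linspace(-1, 1, n)                        # normalized pupil coordinates
    xx, yy = np.meshgrid(x, x)
    rho2 = xx**2 + yy**2
    aperture = rho2 <= 1.0
    # Wavefront error of defocus: W(r) = (defocus_D / 2) * r^2, r in meters.
    W = 0.5 * defocus_D * rho2 * r_pupil**2
    k = 2.0 * np.pi / (wavelength_nm * 1e-9)         # wavenumber
    pupil = aperture * np.exp(1j * k * W)            # generalized pupil function
    psf = np.abs(np.fft.fftshift(np.fft.fft2(pupil))) ** 2
    return psf / psf.sum()                           # normalize to unit volume
```

Convolving a scene with this PSF (or, for measured eyes, with the PSF built from the full Zernike expansion) then yields the simulated retinal image; 1 D of defocus noticeably flattens and spreads the PSF relative to the in-focus case.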
Although we are ultimately interested in image formation with natural, broadband stimuli, it is instructive to study spatial sinusoids: broadband stimuli can be synthesized from sinusoids, and the primary effect of defocus on a sinusoid is simply a reduction in its contrast. The upper, middle, and lower rows of Figure 5 show retinal-image contrast for sinusoids in the real world, on a conventional 3D display, and in our volumetric display, respectively. The left, middle, and right columns show the plots for stimulus spatial frequencies of 2, 6, and 18 cpd, respectively. The abscissas represent the real or simulated focal distance of the stimulus in diopters; the ordinates represent the eye’s focal distance in diopters. Color represents the contrast ratio, i.e., retinal-image contrast divided by stimulus contrast (yellow representing the highest ratio and black the lowest).
Figure 5. Retinal-image contrast for different displays and hypothetical accommodative responses. The top, middle, and bottom rows represent stimuli in the real world, in a conventional 3D display, and in our multi-plane display, respectively.
Consider a real 6-cpd stimulus (Figure 5, top center) at a distance of 2.5 D. As the observer accommodates from far to near (i.e., from 1.5 to 3.5 D, a vertical slice through the graph), retinal-image contrast first rises to a maximum of 88.5% of the stimulus contrast and then falls. As one would expect, focusing the eye at 2.5 D, the actual object distance, yields maximum retinal contrast. At the lower frequency of 2 cpd, the rise and fall of image contrast is shallower and the peak contrast is higher, at 96.4%. At the higher frequency of 18 cpd, the rise and fall is steeper and the peak contrast is much lower, at 48.2%. These plots represent the normal relationship between object distance, accommodative response, and retinal-image contrast.
Next consider a conventional 3D display (Figure 5, middle row). Because the distance to the display surface is fixed at 40 cm (2.5 D), the relationship between simulated object distance, accommodation, and retinal-image contrast is altogether different: retinal contrast is now maximized by accommodating to the distance of the display surface rather than to the object’s simulated distance. Thus, to maintain a clear and single percept, the observer must hold accommodation fixed despite changes in simulated distance, which requires dissociating accommodation and vergence.
Now consider our multi-plane display (Figure 5, bottom row). The three image planes are positioned at intervals of 0.67 D, so the workspace is a 1.33-D volume. When the simulated distance coincides with an image plane, the retinal contrast produced by our display is identical to that produced by viewing the real world. When the simulated distance lies between planes, the retinal image is formed by a depth-weighted blend (Figure 4) of intensities from the two nearest planes. At 2 cpd, the blended image within the display’s workspace is a nearly perfect approximation to the image produced by a real target. Importantly, retinal-image contrast is maximized by focusing at the simulated distance rather than at one of the image planes. At 6 cpd, the blended image is still a good approximation, and retinal contrast is again maximized by focusing at the simulated distance. At 18 cpd, the blended image is a poorer approximation: peak contrast occurs near the image planes rather than at the simulated distance. This analysis shows that, with image planes separated by 0.67 D, the multi-plane approximation to the real world can be very good for spatial frequencies up to 6 cpd, and probably higher. The results depend, of course, on the eye’s pupil size and aberrations: an eye with a smaller pupil and/or greater aberrations has a greater depth of focus, so for such an eye the volumetric display approximates the real world even better.
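The qualitative pattern described above can be reproduced with a toy model that treats the retinal contrast of a blended sinusoid as the weight-averaged contrast from the two bracketing planes. This is a rough sketch, not the authors’ analysis: it uses a crude Gaussian approximation to the defocus MTF, assumes a 4-mm pupil, and assumes the two plane images stay in phase on the retina.

```python
import math

def mtf_defocus(f_cpd, defocus_D, pupil_mm=4.0):
    """Crude Gaussian approximation to the defocus MTF (assumptions: geometric
    blur circle, sigma = diameter / 4, no diffraction or aberrations)."""
    blur_deg = math.degrees((pupil_mm / 1000.0) * abs(defocus_D))
    sigma = blur_deg / 4.0
    return math.exp(-2.0 * (math.pi * sigma * f_cpd) ** 2)

def blended_contrast(f_cpd, eye_D, sim_D, near_plane_D=2.54, far_plane_D=1.87):
    """Retinal contrast of a sinusoid at simulated distance sim_D, drawn as a
    depth-weighted blend on the two bracketing image planes (phases assumed aligned)."""
    w_near = (sim_D - far_plane_D) / (near_plane_D - far_plane_D)
    c_near = mtf_defocus(f_cpd, eye_D - near_plane_D)
    c_far = mtf_defocus(f_cpd, eye_D - far_plane_D)
    return w_near * c_near + (1.0 - w_near) * c_far
```

Even this simplified model shows the key behaviors: for a target simulated midway between planes (2.205 D), focusing at the simulated distance yields higher contrast at 2 cpd than focusing at a plane, whereas at 18 cpd the contrast peak shifts toward the image planes.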
Figure 5 shows that our display creates retinal images that are an excellent approximation to the real world at low spatial frequencies, a good approximation at medium frequencies, and a poor approximation at high frequencies. An important question is how good the approximation is perceptually. In particular, how are blur perception and accommodation affected in multi-plane displays? Although the retinal blur created by our display is an approximation to the blur created by real stimuli, the ability to distinguish stimuli presented at different simulated distances may be similar in the two cases because perceived blur is affected most strongly by medium spatial frequencies (Granger & Cupery, 1972; Walsh & Charman, 1988), where the multi-plane approximation is good. Furthermore, we suspect that accommodation to multi-plane and real-world stimuli will be similar because human accommodation is controlled primarily by medium spatial frequencies (4–8 cpd; Mathews & Kruger, 1994; Owens, 1980; Phillips, 1974; Tucker, Charman, & Ward, 1986; Ward, 1987).