|Home | About | Journals | Submit | Contact Us | Français|
Because object and self-motion are ubiquitous in natural viewing conditions, understanding how the human visual system achieves a relatively clear perception for moving objects is a fundamental problem in visual perception. Several studies have shown that the visible persistence of a briefly presented stationary stimulus is approximately 120 ms under normal viewing conditions. Based on this duration of visible persistence, we would expect moving objects to appear highly blurred. However, in human vision, objects in motion typically appear relatively sharp and clear. We suggest that clarity of form in dynamic viewing is achieved by a synergy between masking, perceptual grouping, and motion computation across retinotopic and non-retinotopic representations. We also argue that dissociations observed in masking are essential to create and maintain this synergy.
Studies at the turn of the 20th century analyzed the perception of moving form and laid foundations of important discoveries related to visual masking (e.g., McDougall, 1904; Piéron, 1935) as well as the relationship between form and motion processing (Kolers, 1972). Surprisingly, however, most of the studies during the last three decades have focused on static form perception, and very little is known about mechanisms underlying moving form perception. The goal of this paper is to provide a short overview of findings related to the perception of moving form and to lay the foundations of a theory of dynamic form perception. In this theory, masking, perceptual grouping, and motion computation interact within and across retinotopic and non-retinotopic representations of the stimuli.
The visible persistence of a briefly presented stationary stimulus is approximately 120 ms under normal viewing conditions (e.g., Haber & Standing, 1970; see also Coltheart, 1980). Based on this duration of visible persistence, one would expect moving objects to appear highly blurred. For example, a target moving at a speed of 10 deg/s should generate a comet-like trailing smear of 1.2 deg extent. The situation is similar to pictures of moving objects taken at an exposure duration that mimics visible persistence. As illustrated in Fig. 1, in such a picture, stationary objects are relatively clear but moving objects exhibit extensive blur.
Unlike photographic images, however, visual objects in motion typically appear relatively sharp and clear (e.g., Bex, Edgar, & Smith, 1995; Burr & Morgan, 1997; Farrell, Pavel, & Sperling, 1990; Hammett, 1997; Hogben & Di Lollo, 1985; Ramachandran, Rao, & Vidyasagar, 1974; Westerink & Teunissen, 1995). Because object and self-motion are ubiquitous in natural viewing conditions, understanding how the human visual system achieves a rela-tively clear perception for moving objects is a fundamental problem in visual perception. While pursuit eye movements can retinotopically stabilize a moving target and help reduce its perceived smear (Bedell & Lott, 1996; Tong, Patel, & Bedell, 2005), even under these conditions, the problem of smear remains for other ob-jects present in the scene. Furthermore, the initiation of an eye movement can take about 150–200 ms dur-ing which a moving object can generate considerable smear. In the next section we present evidence that one mechanism that contributes to the perceived clarity of moving objects is metacontrast masking. This is followed by a section that highlights the importance of dissociation properties of metacontrast in achieving this task. In the subsequent section, we argue that, while metacontrast masking can reduce the extent of smear for moving objects, the synthesis of form for moving objects necessitates non-retinotopic feature processing. This leads to the section where, we formulate specific hypotheses for dynamic form perception. Findings from anorthoscopic perception to provide empirical evidence for the proposed non-retinotopic form perception mechanisms are reviewed next. In the following section, we present our recent results indicating that non-retinotopic perception is not limited to anorthoscopic perception but applies to perception in general. Possible neural correlates for non-retinotopic mechanisms are discussed next. The final section concludes the manuscript.
Burr (1980) and Hogben & Di Lollo (1985) measured the perceived extent of motion smear produced by a random array of moving dots as a function of exposure duration. For exposure durations shorter than approximately 40 ms, the extent of perceived smear increased with exposure duration, as one would expect from the visible persistence of static objects. However, for exposure durations longer than 40 ms, the length of perceived smear was much less than that predicted from the persistence of static targets. This reduction of perceived smear for moving objects has been termed “motion deblurring” (Burr, 1980; Burr & Morgan, 1997).
Contrary to the reports of motion deblurring, it has been long known that isolated targets in real motion (e.g., Bidwell, 1899; McDougall, 1904) and in apparent motion (Castet, 1994; Di Lollo & Hogben, 1985; Dixon & Hammond, 1972; Farrell, 1984; Farrell et al., 1990) exhibit extensive smear. In order to reconcile the apparently contradictory observations of motion deblurring for a field of moving dots and extensive smear for isolated moving targets, we conducted experiments in which the density of moving dots was varied systematically, ranging from a single dot to 7.5 dots/sq-deg (Chen, Bedell, & Öğmen, 1995). Our results showed that isolated targets moving on a uniform background are perceived with extensive motion blur and the reduction in the spatial extent of perceived motion blur (motion deblurring) increases as the density of moving dots in the array is increased. In other words, the motion deblurring reported by Burr (1980) is not a general phenomenon and applies principally to displays containing a relatively dense array of moving objects.
Several models have been proposed to explain motion deblurring based on a motion estimation procedure which is used to compensate for the adverse blurring effect resulting from the object motion (e.g. Anderson & van Essen, 1987; Burr, 1980; Burr, Ross & Morone, 1986; Martin & Marshall, 1993). According to Burr (1980) , motion estimation is achieved by the spatio-temporally oriented receptive fields of motion mecha-nisms. Martin and Marshall (1993) proposed a similar model wherein excitatory and inhibitory feedback connections suppress the persistent activity of neurons along the motion path. The “shifter-circuit” model of Anderson and van Essen (1987) uses an estimation of motion in order to generate a cortically localized (i.e. stabilized) representation of moving stimuli thereby avoiding the smear which would result from the change of cortical locus of neural activities. All these motion estimation/compensation models predict that an isolated moving target should produce no visual blur provided that it sufficiently stimulates the motion estimation/compensation mechanisms. However, as stated above, this prediction, is in sharp contradiction with the extensive blur observed for a moving isolated target (e.g. Bidwell, 1899; Chen et al., 1995; Lubimov & Logvinenko, 1993; McDougall, 1904; Smith, 1969a, b). In our study (Chen et al., 1995), by using several paradigms directly tailored to test the predictions of motion compensation models, we showed that the activation of motion mechanisms is not a sufficient condition for motion deblurring and that the reduction of perceived blur requires the presence of spatio-temporally adjacent targets. Taken together, these findings provide strong evidence against motion estimation/compensation models.
Several researchers suggested inhibition as a candidate mechanism for motion deblurring (e.g., Castet, 1994; Di Lollo & Hogben, 1985, 1987; Dixon & Hammond, 1972; Francis, Grossberg, & Mingolla, 1994; McDougall, 1904; Öğmen, 1993). Because inhibition is a rather general concept, it is important to determine how and where it operates to achieve motion deblurring. Empirical evidence supports the view that the inhibitory mechanisms underlying metacontrast masking are the ones involved in motion deblurring. Metacontrast masking refers to the reduced visibility of a target stimulus by a spatially non-overlapping and temporally following mask stimulus (Bachmann, 1984; Breitmeyer & Öğmen, 2000; 2006). Several studies using stimuli in apparent motion showed that the duration of visible persistence decreases as the spatial separation between successively presented targets is reduced (Castet, Lorenceau, & Bonnet, 1993; Di Lollo & Hogben, 1985; Farrell, 1984). Similarly, the metacontrast suppression of the target increases as the spatial separation between the target and mask decreases (e.g., Alpern, 1953; Breitmeyer & Horman, 1981; Growney, Weisstein, & Cox, 1977; Kolers & Rosner, 1960; Lefton, 1973). When the target and mask have similar energy, optimal metacontrast masking occurs when the mask follows the target approximately by 40–100 ms, depending on the stimulus parameters and task (rev. Breitmeyer & Öğmen, 2006). Breitmeyer and Horman (1981) showed that for high-contrast stimuli in apparent motion, optimal metacontrast occurred at a stimulus onset asynchrony of about 65–100 ms, depending on the spatial separation of the targets. Chen et al. (1995) reported that mo-tion deblurring is stronger in the periphery than in the fovea, in agreement with stronger metacontrast in the periphery in general (e.g., Alpern, 1953; Stewart & Purcell, 1974). Motion deblurring is closely related to “sequential masking” (Otto, Öğmen, & Herzog, 2006; Piéron, 1935) which in turn can be viewed as a form of metacontrast (Breitmeyer & Öğmen, 2006).
To test the relationship between metacontrast and motion deblurring computationally, we used a model of REtino-COrtical Dynamics (RECOD) (Öğmen, 1993), which has been applied to both paradigms. The general structure of this model is discussed in the next section. This model suggests that the main inhibitory process in metacontrast is the inhibition of sustained activities, originating from the parvocellular or P pathway, by transient activities, originating from magnocellular of M pathway (“transient-on-sustained inhibition”, see also Breitmeyer & Ganz, 1976). Simulations of the model for a widerange of metacontrast and motion deblurring data provided evidence that metacontrast masking is the key mechanism for motion deblurring (Breitmeyer & Öğmen, 2006; Purushothaman, Öğmen, Chen & Bedell, 1998). There is also clinical evidence supporting the model’s prediction that transient-on-sustained (M-on-P) inhibition plays a major role in motion deblurring: Tassinari et al. (Tassinari, Marzi, Lee, Di Lollo, & Campara, 1999) found that patients with a likely deficit in the M pathway, due to a compression of the ventral part of the pre-geniculate pathway, had substantially less motion deblurring than normal controls.
In summary: (1) isolated targets moving on a uniform background are perceived with extensive motion blur; (2) the presence of spatio-temporally proximal stimuli can reduce the spatial extent of perceived motion blur (motion deblurring); (3) motion mechanisms cannot account for motion deblurring; (4) metacontrast masking (theorized to occur as transient-on-sustained inhibition) can account for motion deblurring.
Figure 2 depicts the stimulus arrangements used by McDougall (1904) and Piéron (1935). McDougall reported that the blur generated by a leading stimulus (“a” in Fig. 2A) could be curtailed by adding a second stimulus (labeled “b” in Fig. 2A) in spatiotemporal proximity. This finding is in agreement with the more recent findings discussed in the previous section. Piéron (1935) modified McDougall’s stimulus to devise a “sequential” version as shown in Figure 2B. A notable aspect of the percept generated by this sequential version (see also Otto et al., 2006) is that, under appropriate parametric conditions, segment “a” can suppress the visibility of segment “b”, segment “b” in turn can suppress the visibility of segment “c”, etc. In other words, even though segment “b’’s visibility is suppressed, its effectiveness as a mask suppressing the visibility of segment “c” remains intact, i.e. a dissociation occurs between the visibility of a stimulus and its masking effectiveness.
Such a dissociation is necessary for metacontrast to act as an effective deblurring mechanism, otherwise motion blur would not be curtailed but transformed into an oscillatory profile. In the example of Figure 2B, without a dissociation between visibility and masking effectiveness, “b” would be invisible, but “c’ would be visible (because “b” would no longer be able to mask “c”) and this cycle of visibility and invisibility would repeat itself. The relationship between visibility and masking effectiveness in metacontrast was investigated systematically by Breitmeyer, Rudd and Dunn (1981). Their findings were modeled (Francis, 1997; Öğmen, Breitmeyer, & Bedell, 2006) and extended (Öğmen, Breitmeyer, Todd, & Mardon, 2006).
Figure 3 provides a schematic description of the RECOD model (Breitmeyer & Öğmen, 2006; Öğmen, 1993) whose dual-channel structure can account for the dissociation between visibility and masking effectiveness. In this model, the input is conveyed to post-retinal networks through two major pathways corresponding to parvocellular (P) and magnocellular (M) pathways of the primate visual system. The post retinal areas receiving their major inputs from P and M pathways are also refered to as sustained and transient channels. In the model, the visibility of a stimulus as it relates to its brightness, contours, etc. is associated with activity in the sustained channels. The major suppressive effect in metacontrast is an inhibition from the transient channel on the sustained channel. Thus, because visibility and metacontrast masking effectiveness relate to two different processes, sustained and transient channel activities, respectively, the model can account for the aforementioned dissociation. The validity of this claim has been demonstrated by quantitative simulations (Öğmen, Breitmeyer, & Bedell, 2006; Öğmen, Breitmeyer, Todd et al., 2006). In summary, the RECOD model provides a mechanistic explanation of how motion deblurring can take place in retinotopic space.
The dual-channel structure of the model also allows it to account for another dissociation observed in visual masking (Öğmen et al., 2003): A U-shaped masking function can be obtained when observers make judgments related to the target’s surface (e.g., perceived brightness, contrast) and contour (e.g., contour completeness, contour shape) properties or figural identity (e.g., letter recognition). Under these conditions, if the observer’s task is changed to report the presence or the spatial location of the target, instead of its visibility, the metacontrast mask has no effect on the observer’s performance, as measured by simple/choice RTs or by response accuracy (e.g., Fehrer & Raab, 1962; Öğmen et al., 2003; Schiller & Smith, 1966). This dissociation can be readily explained by the RECOD model since target localization can be carried out by the activity in the transient channel ragardless whether the activity in the sustained channel is suppressed or not (for detailed predictions and a quantitative analysis, see Öğmen et al., 2003). This dissociation is important for the theory discussed in this manuscript in that, as we argue in the following sections, motion-induced perceptual grouping is essential for the computation of form for moving objects. The aforementioned dissociation suggests that transient, and by extension motion signals, remain intact under conditions in which the visibility of the stimulus is suppressed. As a result, motion-induced grouping operations can operate without being negatively affected by motion deblurring operations in the retinotopic space.
Metacontrast mechanisms solve only partly the motion blur problem. If we consider the example shown in Fig. 1, metacontrast mechanisms would make the motion streaks appear shorter thereby reducing the amount of blur in the picture. Yet, although deblurred, moving objects would still suffer from having a ghost-like appearance. For example, in Fig. 1 notice the appearances of targets moving fast (e.g., the vehicles close to the observer), those that are moving more slowly (e.g., the white truck in the background approaching the traffic jam) and the stationary objects. The vehicles in front have a ghost-like appearance without any significant form information while the vehicles far, which move more slowly, have a more developed form, and finally static objects possess the clearest form. This is because static objects remain long enough on a fixed region of the film to expose sufficiently the chemicals while moving objects expose each part of the film only briefly thus failing to provide sufficient exposure to any specific part of the film. Similarly, in the retinotopic space, a moving object will stimulate each retinotopically localized receptive-field briefly and an incompletely processed form information would spread across the retinotopic space just like the ghost-like appearances in Fig. 1. We hypothesize that information about the form of moving targets is conveyed to a non-retinotopic space where it can accrue over time to allow neural processing to synthesize shape information.
We put forward the following hypotheses for the basis of moving form perception:
Hypothesis 1: Low-level encoding of moving stimuli occurs in a retinotopic space and metacontrast masking, theorized as transient-on-sustained inhibition, controls the extent of motion blur in this reti-notopic space.
Hypothesis 2: Accrual and processing of form information for moving objects occur in non-retinotopic space.
Hypothesis 3: The transfer of information from the retinotopic to the non-retinotopic space is guided by perceptual grouping operations.
Hypothesis 4: Non-retinotopic representation of moving objects consists of a joint representation of form and motion information. Motion vectors are specific to parts of objects.
Hypothesis 5: Phenomenal visibility of form requires correlated activity at both retinotopic and non-retinotopic spaces. Non-retinotopic activity that lacks correlated retinotopic activity leads to “dynamic amodal” perception (defined and discussed in the following section).
Fig. 4 provides a schematic description of the proposed scheme. In the retinotopic space, which is depicted at the bottom of the figure, two objects, one notional triangle composed of three dots and one notional square composed of four dots are shown moving in two different directions. Perceptual grouping operations determine, through space and time, the individual identities of objects. Observers perceptually group the dots into a triangular and a rectangular group based on the Gestalt principles of common fate (same velocity vector) and proximity. These perceptual grouping relations map in real-time the triangular and rectangular shapes to a non-retinotopic space where the accrual of information allows the processing of dynamic form perception. The accrual of information results from the fact that form information for a given object is mapped to the same group of neurons in the non-retinotopic space regardless the position of the object in the retinotopic space. Hence, these neurons can integrate and process this information over time. The dashed double-headed arrows between the retinotopic and non-retinotopic spaces indicate grouping-based mapping of activities. It is highly likely that grouping and form processing are interactive processes. The double-headed arrows in Fig. 4 are intended to depict such interactions. Notice that while the retinotopic position of the stimuli is changing in the retinotopic space, it remains fixed in the non-retinotopic space generating a “position-invariant” representation. This position-invariant representation allows the accrual, processing, and synthesis of form information for moving objects. It is likely that position-invariance involves multiple mechanisms. According to our model, perceptual grouping may play an important role in establishing and maintaining position-invariant representations. Assume that an object moves in the retinotopic space. When perceptual grouping identifies a stimulus configuration at some retinotopic neighborhood, R0 at time t0 to be the same object as a stimulus at a retinotopic location R1 at time t1, the corresponding activities are mapped to the same locus in the non-retinotopic space leading to a position-invariant representation. Perceptual grouping itself also involves several mechanisms and principles, “common fate” being one of them. Another example of motion-based grouping and non-retinotopic representation will be discussed in the section “Non-retinotopic perception is not restricted to anorthoscopic perception”.
The dashed ellipses around the objects in the non-retinotopic space highlight separate groups (objects). As depicted in the figure, form information in the non-retinotopic space is represented jointly with motion vectors. These motion vectors are associated with different parts of objects. Neurophysiologically, it is likely that these motion vectors are encoded in a separate area (e.g., MT+) and linked to the non-retinotopic space through grouping relations (see the Section “Potential neural correlates”). The dynamic grouping-based mapping between the two layers provides temporal correspondences between abstract form information in the non-retinotopic space and its underlying retinotopic activity. Our Hypothesis 5 states that phenomenal visibility at a given instant requires a correlated activity at both of these levels. This hypothesis is elaborated further in the next section where we apply the theory to anorthoscopic perception.
The first part of the evidence to support our Hypothesis 2 (accrual and processing of form information for moving objects takes place in non-retinotopic space) comes from the classical phenomenon known as anorthoscopic perception (rev. Rock, 1981). The term anorthoscope refers to a device invented by Plateau in 19th century to demonstrate how static percepts can be generated from moving stimuli (Plateau, 1836). The anorthoscope consists of two disks rotating in opposite directions. One of the disks has slits through which parts of an image painted on the second disk are visible. In addition to leading to the development of contemporary cinematographic equipment, the anorthoscope also found use in scientific laboratories to study human perception (e.g., Helmholtz, 1867; Rothschild, 1922; Zöllner, 1862). The designs of this device and its contemporary computer emulations include a variety of versions depending on the number of slits, and on the combinations of whether the slit, the partially occluded image, and/or the eyes are moving (e.g., Anstis & Atkinson, 1967; Casco & Morgan, 1984; Fahle & Poggio, 1981; Haber & Nathanson, 1968; Mateeff, Popov, & Hohnsbein, 1993; Morgan, Findlay & Watt, 1982; Nishida, 2004). It is important to make distinctions between different configurations because they can activate drastically different visual mechanisms. Our main focus in this manuscript is for the case where there is only one stationary slit, the eyes are also stationary and an image moves behind the slit (Figure 5). Under these conditions, all information about the moving object’s shape collapses temporally on a narrow retinotopic locus in a fragmented manner, i.e. there is no spatially extended retinotopic image of the shape. Yet, observers perceive a spatially extended shape moving behind the slit instead of a fragmented pattern that is confined to the region of the slit. Thus, a retinotopic image is not necessary for the perception of form.
The mechanisms underlying anorthoscopic perception are poorly understood. One of the early explanations, the “retinal painting” hypothesis (Helmholtz, 1867), was based on eye movements. If the eyes move while viewing the stimulus, then successive parts of the stimulus fall on adjacent retinotopic loci thereby “painting” a retinotopic picture of the figure. Subsequent research showed that while retinal painting can give rise to the perception of form, it cannot explain anorthoscopic perception in general: Measurement of eye movements and studies using retinal stabilization showed that anorthoscopic perception does occur in the absence of eye movements (Fendrich, Rieger, & Heinze, 2005; Morgan et al., 1982). The percepts resulting from eye movements can be explained simply by using the visible persistence characteristics of the human visual system. The critical findings to assess our theory are the ones in which an extended perception of the object occurs in the absence of eye movements. There have been two types of theories to explain anorthoscopic percepts in the absence of eye movements. According to Parks (1965), a post-retinal mechanism stores in memory the information available through the slit and reconstructs the figure according to a “time-of-arrival coding”. Figure 6 shows a stimulus configuration used to test both the retinal painting and the time-of-arrival reconstruction theories (McCloskey & Watkins, 1978; Sohmiya & Sohmiya, 1992, 1994). The stimulus consists of two triangular shapes moving in opposite directions. The tips of the triangles pass through the slit simultaneously, followed by the middle segments and finally the longest segments. Assume that the tip, the middle, and the base of the triangles cross the slit at t0, t1, and t2, respectively with t0<t1<t2. Observers are required to fixate on the fixation cross and report the perceived shape of stimuli. The time-of-arrival coding theory states that the time-of-arrival will be used to construct spatial form. As shown in Fig. 7, according to this theory these time-of-arrivals are converted to spatial positions s0, s1, and s2, respectively with s0<s1<s2. As a result, the theory predicts that the observers should perceive the two triangles pointing in the same direction. The same prediction is made by the retinal painting theory. This theory assumes that an involuntary eye movement shifts the retina with respect to the stimulus. Assume that the eye movement brings retinotopic positions s0, s1, and s2 in alignment with the slit at time instants t0, t1, and t2, respectively. As depicted in Fig. 8, this would result in the two triangles pointing in the same direction. However, observers’ perception corresponds to the actual stimulus configuration, i.e. the upper and the lower triangles pointing to the left and right, respectively (McCloskey & Watkins, 1978; Sohmiya & Sohmiya, 1992, 1994). Not only does this experiment reject these two theories but it also highlights an essential part of anorthoscopic perception: If the direction of motion is not known, the stimulus is ambiguous in that a leftward moving image and its mirror-symmetric version moving rightward generate identical patterns in the slit. Therefore, the determination of the direction of motion is critical for anorthoscopic perception. Indeed, anorthoscopic percepts consist of the shape moving in the correct direction. Our Hypothesis 4 incorporates this critical observation into our theory.
Fig. 9 depicts how our theory can account for the perception generated by the stimulus in Fig. 6. At time t0, the slit and the two dots constituting the tips of the triangles generate their representations in the non-retinotopic space with their corresponding motion vectors. For depiction purposes, each “object” in the non-retinotopic space is highlighted by a dashed ellipse. Note that because representation in the non-retinotopic space is position-invariant, the relative positions of different objects in this space do not carry spatial information. The relative spatial positions of different objects at a given time instant are encoded by their mapping via grouping operations to the retinotopic space (shown by the dashed lines with double-headed arrows between the retinotopic and non-retinotopic representations). At time t1 the array of middle dots are mapped according to grouping relations based on common fate and proximity so that the upper and lower dots map to the corresponding tip points. The relative positions of different parts of a given object in the non-retinotopic space are important, because they encode the shape of that object. We suggest that the motion direction vectors determine the relative position of the middle array with respect to the tip positions. For an object moving to the right (left), the temporally lagging part of the shape would be placed to the left (right) as is the case for the lower (upper) triangle. The same operation occurs at time t2 for the bases of the triangles. To complete the account of what is perceived, we need to consider the effects of occlusions.
When viewing the stimulus shown in Fig. 10, observers typically “perceive” a circle and a square even though part of the square is not directly visible. This type of figural completion is called amodal completion (Michotte, Thinès, & Crabbé, 1964). From a terminological point of view, to distinguish this type of perception from the perception that arises in response to “directly visible” stimulus, we use the term amodal visibility as opposed to phenomenal visibility. What is perceived behind the slit in anorthoscopic perception can be viewed as a dynamic version of amodal visibility. Even though all parts of the figure passing behind the slit are not simultaneously visible, observers “perceive” the complete shape. For example, after the tip of the triangle falls behind the occluder, observers continue to perceive the tip moving forward even though they do not directly see it. To accommodate this amodal effect, we simply assume that, at any given instant, the retinotopic and non-retinotopic activities that are linked by perceptual grouping (e.g., the tips of the triangle for t0, the middle parts of the triangles for t1, etc. in Fig. 9) become phenomenally visible. At any instant, the activity in the non-retinotopic space that has no correlated activity in the retinotopic space would be perceived “amodally”. We designate this as dynamic amodal perception in that the non-retinotopic activity without correlated retinotopic activity will appear to move according to the velocity vector associated with that part of the figure.
Finally, let us point out that, due to the “aperture problem”, the recovery of motion and form information in anorthoscopic perception is illposed (e.g., Shimojo & Richards, 1986). Our theory relates shape and motion distortions reported in anorthoscopic percepts to the errors in estimation of velocity vectors.
While anorthoscopic perception shows clearly that form perception can take place in the absence of a retinotopic image, generalization of underlying non-retinotopic mechanisms to normal viewing requires the demonstration of non-retinotopic perception without the use of occluders or slits. Previous research revealed illusions where features of objects are perceived non-retinotopically, i.e. at different locations than their retinotopic location. Treisman and Schmidt (1982) showed examples of illusory feature conjunctions when observers’ attention is divided. For example, in a small number of trials observers may report seeing a green square in response to a display containing red squares and green circles. This indicates that the retinotopic loci of the shape and color information can be incorrectly combined (Treisman & Schmidt, 1982). Because such illusory feature conjunctions typically occur when observers’ attention is divided, this illusion has been interpreted to reflect an error resulting from the limited attentional resources of the observer.
Similarly, many other feature mislocalizations in human vision have been attributed to “errors” stemming from limitations of perceptual processing such as masking (Stewart & Purcell, 1970; Stoper & Banffy, 1977; Werner, 1935; Wilson & Johnson, 1985), feature migration (Butler, Mewhort, & Browse, 1991; Herzog & Koch, 2001), feature misbinding in object substitution (Enns, 2002), crowding (Parkes, Lund, Angelucci, Solomon, & Morgan, 2001), pooling (Baldassi & Burr, 2000), sampling of continuous information stream (Cai & Schlag, 2001), distributed micro-consciousness (Zeki, 2001; Zeki & Bartels, 1998), and differential latencies (Arnold & Clifford, 2002; Bedell, Chung, Öğmen, & Patel, 2003). On the other hand, to provide support for our theory, we need to demonstrate cases of non-retinotopic perception that result not from errors of the visual system, but rather from its fundamental and lawful aspects. In particular, our Hypothesis 3 states that the transfer of information from the retinotopic to the non-retinotopic space is guided by perceptual grouping operations. Recently, by using a stimulus known as the “Ternus-Pikler display” (e.g., Dawson & Wright, 1994; Grossberg & Rudd, 1989; He & Ooi, 1999; Kramer & Yantis, 1997; Pantle & Picciano, 1976; Petersik, Schellinger, & Geiger, 2003; Petersik & Rice, 2006; Pikler, 1917; Ternus, 1926) we showed a new illusion where non-retinotopic feature perception obeys rules of perceptual grouping. Introduced by Gestalt psychologists, the basic Ternus-Pikler display consists of two frames separated by an inter-stimulus interval (ISI). The first frame contains a given number of elements (e.g., three line segments) and the second frame consists of a spatially shifted version of the elements of the first frame such that a subset of the elements spatially overlaps in the two frames. An example is shown in Fig. 11A where the two frames contain three elements arranged in such a way that two of the elements spatially overlap.
These displays are designed to investigate factors that control how objects, or parts thereof, maintain their identities during motion. When ISI is short, the prevailing percept is that of element motion (Fig. 11B), i.e. the leftmost element in the first frame is seen to move directly to the rightmost element in the second frame while the two central elements are perceived stationary (as depicted by the dashed arrows in Fig. 11B). When ISI is long, the prevailing percept is that of group motion, i.e. the three elements in the first frame move as a single group to match the corresponding three elements in the second frame (as depicted by the dashed arrows in Fig. 11C). Thus the resulting percepts can be understood in terms of motion-induced grouping operations. In element motion, the leftmost element in the first frame and the rightmost element in the second frame are perceived as “one object” moving from left to right. The remaining two elements form together a second group. This latter two-element group is perceived stationary and matched with the two element group in the second frame according to the arrows in Fig. 11B. In group motion, the three elements in the first frame form a single group to match the corresponding elements of the three-element group in the second frame as shown by the arrows in Fig. 11C. Inserting a figural feature (a Vernier offset) to the central element in the first frame (“probe Vernier” in Fig. 12A) allowed us to investigate whether features are perceived according to retinotopic or according to perceptual grouping relations. Observers were instructed to attend to one of the Ternus-Pikler elements in the second frame, labeled as 1, 2, or 3 (see Fig. 12A) and to report the perceived direction of the Vernier offset (left or right) for this attended element. The offset direction for the probe Vernier was selected randomly in each trial. Naïve observers had no knowledge about where the Vernier offset was physically presented and no feed-back was given.
To derive specific predictions from retinotopic versus grouping-based hypotheses of feature attribution, consider first the retinotopic hypothesis. According to this hypothesis, features are perceived at the retinotopic positions where they are presented. Furthermore, features can be integrated retinotopically due to temporal integration properties of the visual system (Herzog, Parish, Koch, & Fahle, 2003). Consider for example the static control condition (Fig. 12C) which is identical to the Ternus display in Fig. 12A with the exception that the leftmost element of the first and the rightmost element of the second frame are omitted. In this control experiment no motion percept is elicited and the spatiotemporal integration combines the probe Vernier offset information retinotopically across the two frames. As shown in Fig. 12C, the percentage of responses in agreement with the probe Vernier is high for element 1 and near chance for element 2 for ISI = 0 and 100 ms. If the attribution of features in the two-frame display were made according to retinotopic relationships, we would expect a similar outcome for the Ternus-Pikler display provided that ISI is short enough to fall in the range where temporal integration occurs. Thus, we would expect the percentage of responses in agreement with the probe Vernier to be high for element 1 and near chance for elements 2 and 3 for ISI = 0 and 100 ms.
On the other hand, if attribution of features were made according to non-retinotopic relations, in particular according to motion-induced grouping, two different outcomes would be expected according to ISI: For short ISIs, because the central element in the first frame is perceptually identified with the element labeled 1 in the second frame (Fig. 11B), we would expect the percentage of responses in agreement with the probe Vernier to be high for element 1 as in the retinotopic case. At long ISIs, however, because the central element in the first frame is perceptually grouped with the element labeled 2 in the second frame (Fig. 11C), we would expect the percentage of responses in agreement with the probe Vernier to be high for element 2 even though there is no Vernier information at this retinotopic position.
Results in Fig. 12A and andBB support the predictions of grouping based non-retinotopic feature perception hypothesis. Not only does this experiment [additional data in (Öğmen, Otto, & Herzog, 2006)] show non-retinotopic feature perception but it also highlights that grouping operations are critical in establishing the mappings from retinotopic to non-retinotopic space. A depiction of our stimulus in a space (horizontal axis) time (vertical axis) diagram is shown in Fig. 13. For simplicity a one-dimensional space is used. The circles and the triangle represent the spatial positions of the straight and offset (probe) Verniers, respectively. The Ternus-Pikler stimulus will activate a large number of integrative mechanisms, some of which are shown superimposed on the stimulus. To explain our data, only an exclusive subset of these mechanisms – specific to the spatial locus and to the prevailing grouping relation (shown by the solid line in the figure) can be in operation. The remaining mechanisms (shown by dashed lines) will integrate information in a manner inconsistent with our data. The oriented receptive-field and the shifter circuit models show two major deficiencies in explaining these data. First, because they do not take into account grouping mechanisms, they will integrate the Vernier information in multiple (inappropriate) ways following the activation of multiple motion detectors. Second, because they lack proper metacontrast mechanisms, they cannot predict when and how motion blur will be curtailed (Section “Motion deblurring in human vision”).
Pääkkönen and Morgan (1994) proposed a two-phase motion deblurring model wherein the first stage is “camera like exposure phase” that always produces motion blur. The second phase is proposed to carry out a “translation-invariant integration” of moving stimuli. This phase does not produce motion blur. No specific mechanisms were suggested for how translation-invariance is obtained. This model cannot explain the results discussed in the section “Motion deblurring in human vision”: Because motion blur in the first phase is assumed to be “camera-like”, the model predicts that motion blur should not to depend on the density of dots, contrary to the empirical findings. Neither can this model explain the results discussed in this section, because without grouping operations, the model cannot predict which specific translation will superimpose the elements in the two frames.
Our proposed theory goes beyond these previous models by including a retinotopic stage with “camera like” persistence whose extent is controlled by metacontrast interactions. Furthermore, the transition to non-retinotopic representation is governed by perceptual grouping operations, a property that allows us to explain Öğmen, Otto, & Herzog’s (2006) experimental results summarized in this section. The theory can also be applied to other non-retinotopic percepts observed in anorthoscopic viewing conditions.
The current neurophysiological knowledge of primate brain is not detailed enough to map directly our theory to neural structures. However, it is well known that early visual areas V1, V2, V3, V4/V8 and V3a are retinotopic and contain a complete eccentricity and polar angle map. Beyond retinotopic cortex, the polar angle representation becomes cruder. Interestingly, a recent study by Yin, Shimojo, Moore and Engel (2002) investigated neural correlates of anorthoscopic perception using fMRI. Their experiments included anorthoscopic percepts and control conditions with distorted stimuli that failed to generate anorthoscopic percepts. The activities in the retinotopic cortex did not correlate with whether the observers experienced anorthoscopic percepts or not. On the other hand, cortical activities in “object areas”, in the Lateral Occipital Complex (LOC), a mainly non-reti-notopic area, as well as those in the human motion area MT+ were correlated with anorthoscopic perception. Human motion area MT+, which is a likely homologue of the macaque motion-sensitive area MT/V5 (Heeger, Huk, Geisler, & Albrecht, 2000; Rees, Friston, & Koch, 2000), contains an orderly eccentricity organization within a hemifield representation (Dukelow, DeSouza, Culham, van den Berg, Menon & Vilis, 2001; Huk et al., 2002). LOC is a cortical region that exhibits selectivity to pictures of intact “meaningful” objects compared to scrambled objects and pictures that lack a clear meaningful object interpretation (Allison, Ginter, McCarthy, Nobre, Puce, Luby et al., 1994; Allison, Puce, Spencer & McCarthy, 1999; Doniger, Foxe, Murray, Higgins, Snodgrass & Schroeder, 2000; Faillenot, Toni, Decety, Gregoire & Jeannerod, 1997; Grill-Spector, Kushnir, Edelman, Itzchak & Malach, 1998; Grill-Spector, Kushnir, Hendler, Edelman, Itzchak & Malach, 1998; Grill-Spector, Kushnir, Hendler & Malach, 2000; Kanwisher, McDermott & Chun, 1997; Kourtzi & Kanwisher, 2000; Malach, Reppas, Benson, Kwong, Jiang, Kennedy et al., 1995; Murtha et al., 1999; Sergent, Ohta, & MacDonald, 1992). LOC also exhibits strong size and position invariance (Grill-Spector, Kushnir, Edelman, Avidan-Carmet, Itzchak & Malach, 1999; Malach et al., 1995). Hence, LOC and other similar non-retinotopic areas showing object selectivity can be candidates for our “non-retinotopic space”. Yin et al.’s (2002) study suggests that the motion vectors, directly depicted in the non-retinotopic area in Fig. 4, may physically reside in area MT+. A recent study by Kim & Kim (2005) provides evidence that LOC has direct connections to MT+ and V3A and that MT+ and V3A have reciprocal connections. V3A is part of the V3 complex which has been implicated in the analysis of dynamic form (Zeki, 1991). Thus a tentative mapping would include areas extending to V3 complex as our retinotopic space, LOC as the non-retinotopic space, and the connectivities between MT+, V3A, and LOC establishing the coupling of dynamic form and motion vector representations between these areas. While this mapping is highly speculative at this point, we believe that future neurophysiological studies can test more directly neural correlates of the proposed functional theory.
The three-dimensional structure of an object is mapped through the optics of the eye on two-dimensional retinae creating a “retinotopic image” of the object. Retino-cortical pathways provide an orderly projection to the lateral geniculate nucleus and to the primary visual cortex so that neighboring points on the retina map to neighboring points in these areas, a property known as retinotopy. This retinotopic organization is found in numerous visual cortical areas. Through their “classical” receptive fields, neurons in these visual areas process information locally in the retinotopic space. Retinotopic organization and retinotopically localized receptive-fields have been two fundamental pillars upon which most theoretical accounts of visual form perception are built. However, these theories are based mainly on a static characterization of visual perception and focus on how form information is processed for static objects. On the other hand, very little is known on how the nervous system computes the form of moving objects. Based on an analysis of dynamic aspects of vision, we argued that non-retinotopic computational principles and mechanisms are needed to compute the form of moving objects. We designate as “non-retinotopic” those mechanisms that can generate perception of form in the absence of a retinotopic image. Indeed, perceptual data demonstrate that a retinotopic image is neither necessary nor sufficient for the perception of form: When a moving object is viewed behind a narrow slit cut out of an opaque surface (anorthoscopic perception, Fig. 5), all information about the moving object’s shape collapses temporally on a narrow retinotopic locus in a fragmented manner, i.e. there is no spatially extended retinotopic image of the shape. Yet, observers perceive a spatially extended and perceptually integrated shape moving behind the slit instead of a series of fragmented patterns that is confined to the region of the slit. Anorthoscopic perception shows that a retinotopic image is not necessary for the perception of form.
The visibility of a “target stimulus” can be completely suppressed by a retinotopically non-overlapping “mask stimulus” that is presented in the spatio-temporal vicinity of the target stimulus, phenomena known as para- and metacontrast masking (Bachmann, 1984; Breitmeyer & Öğmen, 2006). These masking effects indicate that the existence of a retinotopic image is not a sufficient condition for the perception of form and that the dynamic context within which the stimulus is embedded plays a major role in determining whether form perception will take place.
In this manuscript, we presented a theory of moving form perception where masking, perceptual grouping, and motion computation interact across retinotopic and non-retinotopic representations. Due to visible persistence, moving targets are expected to generate extensive blur in retinotopic representations implemented in early visual cortex. We provided evidence showing that metacontrast masking controls the spatial extent of this blur. While this first step is critical in limiting the deleterious effect of motion blur; the computation of clear percepts for moving objects requires a non-retinotopic representation where figural information about moving objects is processed. We argued that motion-induced grouping is critical in transferring information from the retinotopic to non-retinotopic space. Dissociation between visibility and masking effectiveness allows metacontrast to be effective in a sequential mode. The RECOD model captures this property. The RECOD model can also explain the dissociation between visibility and spatial localization. This dissociation, allows the computation of motion information that can lead to motion grouping under metacontrast suppression conditions. Thus, taken together RECOD can implement the deblurring of retinotopic activity while preserving information for motion-induced grouping. In addition to normal viewing conditions, the proposed theory can also be applied to anorthoscopic perception which provides strong evidence that a “retinotopic image” is not necessary for the synthesis of a spatially extended percept. Our current work focuses on the interactions between perceptual grouping operations and non-retinotopic representations in order to develop a more detailed quantitative account for the remaining parts of the theory.
I would like to thank M. Aydin, H. E. Bedell, B. G. Breitmeyer, S. Chen, F. Hermens, M. H. Herzog, H. Kafaligonul, T. Kammer, T. U. Otto, G. Purushothaman, and F. Scharnowski whose past and current collaborations contributed to many aspects of the work described in this manuscript. I would like to thank Jim Enns for bringing to my attention the work by Tassinari et al. (1999) and the reviewers for helpful comments on the manuscript. The research described in the manuscript has been supported in part by a fellowship from the Hanse Wissenschaftskolleg (HWK) and by grants R01MH49892 from NIH and BCS 0114533 from NSF. I would like to extend my special thanks and appreciation to Michael Herzog for his hospitality as well as his scholarly support during my stay at HWK.