The estimation of the direction of visual attention is critical to a large number of interactive systems. This paper investigates the cross-modal relation of the position of one's feet (or standing stance) to the focus of gaze. The intuition is that while one CAN have a range of attentional foci from a particular stance, one may be MORE LIKELY to look in specific directions given an approach vector and stance. We posit that the cross-modal relationship is constrained by biomechanics and personal style. We define a stance vector that models the approach direction before stopping and the pose of a subject's feet. We present a study in which the subjects' feet and approach vector are tracked. The subjects read aloud the contents of note cards at 4 locations. The order of `visits' to the cards was randomized. Ten subjects read 40 lines of text each, yielding 400 stance vectors and gaze directions. We divided our data into 4 sets of 300 training and 100 test vectors and trained a neural net to estimate the gaze direction given the stance vector. Our results show that 31% of our gaze orientation estimates were within 5°, 51% were within 10°, and 60% were within 15°. Given the ability to track foot position, the procedure is minimally invasive.
The estimation of a subject's zone of attention is important to such domains in human-centered computing as computer-supported collaboration, teaching and learning environments, context-aware interaction, large-scale visualization, smart homes, multimodal interfaces, wearable computing, and analysis of group interaction. Systems that estimate such attention typically involve intrusive sensing technology such as video tracking and wearable technology. In this paper, we explore the possibility of estimating one's `zone of attention' by tracking one's footfalls and standing stance. The specific mode of tracking is not the focus of this paper, although one might imagine pressure-sensitive carpets with thin piezoelectric cables [1, 2], sensorized tiled floors [4–8], and a variety of wearable devices [9–11]. If such zone of attention estimation is possible, it will enable a range of multimodal interactive systems that present timely information on ambient displays, support human meetings by estimating the zone of attention of one's interlocutor, and create active displays that are attention-sensitive.
By `zone of attention', we mean the sector of visual space of a subject. Our interest here is not in the exact angle of gaze as might be required when using an eye-tracker to support gaze control of a screen cursor. Rather, we want to estimate the general zone of attention (or where the subject is looking) centered on some sector axis as shown in Figure 1. Observe that our estimate does not require that the center of the attention zone be coincident with the `nose-forward' vector.
Our approach is to exploit the anatomical and behavioral constraints of the subject when her feet are set and to employ a biomechanical model from whose parameters we estimate the zone of attention using a classifier or by function estimation. In Section 2, we discuss the rationale for our approach by reviewing the need for gaze/awareness, and reviewing existing attention awareness approaches. We show that there is need for a coarse-scale attention estimation approach that is able to work over a large area. In Section 3, we ground our approach by outlining our biomechanical assumptions and discussing our model. In Section 4, we describe our empirical approach and report our experimental results. We conclude in Section 5.
We make a distinction between gaze tracking and attention awareness. The former typically requires precise detection of the angle or locus of gaze so that, for example, one may control a screen cursor with the output. Attention awareness requires that the zone of gaze be detected to determine the object or area of visual attention. The focus in this paper is the latter. The ability to detect attentional focus is critical to a wide variety of multimodal interaction and interface systems. These include computer-supported group interaction [12–16], wearable computing, augmented reality [17–21], context/attention aware applications [21–26], education and learning [26, 27], smart homes [28, 29], and meeting analysis [30–36]. In many of these applications, non-intrusiveness in the mode of sensing is more important than the accuracy of gaze angle tracking. We investigate the estimation of the zone of attention from a standing stance unobtrusively using information that may be obtained entirely from a sensing carpet, smart floor, or instrumented shoes. In this section, we review the state of the art in tracking visual attention, discuss the biometric and psycho-social behavioral constraints in direction of visual attention with respect to stance, and advance our model for zone of attention detection and tracking.
Allocation of attention is a key aspect of collaboration, meeting conduct, and interaction that is a prime determiner of information flow among participants or supporting technologies. Gaze has long been recognized as a primary indicator of zone of attention and as a conversational resource that assists participants in assessing connection, comprehension, reaction, responsiveness, and in interpreting intention. Other researchers have investigated the attentional behavior of subjects who are observing the gaze of others [38, 39]. As several researchers point out, however, it is still possible to visually fixate one location while diverting attention to another [40, 41]; even though eye tracking may be highly accurate, gaze direction is not necessarily a highly accurate estimator of attentional focus.
Eye tracking has been by far the technology of choice for gaze estimation. There has been much interest in the use of gaze to control interaction [42, 43], to modify information presentation [44–46], and to interact directly with data. Duchowski partitions eye tracking applications into either diagnostic or interactive categories, depending upon whether the tracker provides objective and quantitative evidence of the user's visual and attentional processes or whether it serves as an interaction device.
Eye tracking has been of intense interest for many years, and many techniques have been proposed. Some, such as electrooculography, which involves attaching electrodes near the eyes, and magnetic eye-coil tracking, which involves special contact lenses, are particularly invasive and uncommon. Most current tracking methods are video based and fall into one of two categories, using either infrared illumination or passive tracking. The use of remote fixed cameras is not of great interest except in special studies because of the problems of occlusion, head tracking, and tracking multiple subjects simultaneously. Excellent work has also been done in the case of head mounted trackers to minimize the invasiveness of the camera, mount, and cabling, but the apparatus is always present in front of the wearer.
The so-called limbus trackers are usually passive trackers that utilize ambient light to track the limbus, which is the junction between the iris and the white surrounding sclera. These trackers are somewhat simpler since they do not require a special illumination source but suffer from the uncontrolled nature of the ambient light and limitations in vertical tracking due to eyelid movements. On the other hand, the use of an infrared illumination source makes it practical to track the pupil, which is a more sharply defined and less occluded feature than the limbus. However, infrared trackers can suffer in the presence of other infrared sources such as natural sunlight. Other infrared trackers make use of the so-called Purkinje images, which are due to reflections from the several optical boundaries within the eye such as the surfaces of the lens and cornea. The measurement of individual features requires that the head position be fixed or tracked; sophisticated trackers can avoid this by tracking features that move differentially when the eye moves as opposed to the head. More importantly, neither approach is feasible for estimating attention of subjects over a large space.
Some eye trackers separate the process of determining head orientation from the local process of determining eye orientation given the head pose [52–55]. Others have proposed the use of head pose alone as an estimator of gaze direction [41, 56, 57] in order to eliminate the need for invasive head mounted hardware. Stiefelhagen's results [41, 57] provide strong evidence to support the effectiveness of head orientation alone as an estimator of focus of attention.
The prime deterrents to the use of eye trackers for gaze analysis have been the cost of eye tracking, its invasiveness, its lack of robustness, and the difficulty of performing the analysis simultaneously on large groups of meeting or collaboration participants. Yet other problems concern calibration, dynamic range, response time, and angular range. Since gaze direction does not uniquely determine focus of attention, there are many applications in which determining zone of attention to high accuracy (< 1 degree) is unnecessary, applications for which gait, location, identity, and pose are sufficient estimators of the desired information.
The advent of wearable computing has sensitized researchers to the need for deeper context awareness that includes, among other things, the pose and location of the wearer. As always, the dilemma is how to determine this information as noninvasively as possible, simultaneously for multiple individuals, and over a large area where the motion of users is minimally constrained. Our hypothesis is that useful estimates of zone of attention can be obtained from floor stance and approach vector, so that the search for suitable sensor systems can be shifted to shoe and floor systems. An extensive sensor system for determining floor stance has been proposed by the Responsive Environments Group at the MIT Media Laboratory in which each participant's shoe is fitted with a rich complement of wireless sensors [9–11].
In summary, head-mounted and wearable trackers encumber the user; many are still tethered by cabling, which is especially problematic in multi-user environments. Other systems that employ video and electromagnetic technology restrict the movement of the user to the effective tracking volume of the technology.
Floor stance and approach vector during locomotion can provide useful estimates of zone of attention. Figure 3 shows the degrees of freedom available to a human viewer when the feet are set. Human locomotion is guided by optic flow and egocentric direction strategies that use varying degrees of target visual context [58–61]. Optic flow describes temporal changes in image structure as a walker moves, and egocentric direction strategies describe how one walks in different contexts (e.g., in dimly lit areas one may use egocentric coding that minimizes angular distances to the goal). The assumption behind both of these locomotion strategies is that the goal is visible and, as such, directly related to the salience of the visual context. The saliency dependence of the visual context suggests that gaze transient (i.e., flow and direction) of a target is an important parameter for goal directed gait. This can be conceptualized as a constraint on the way one approaches a target of focal attention prior to the static (standing) stance or configuration.
Gaze control involves motion coordination of eyes, head and trunk to allow both flexibility of movements and stability of gaze. During straight walking, gaze is maintained in the direction of forward locomotion with small head yaw oscillations in space, despite relatively large oscillations and lateral displacements of the body. In a study investigating three-dimensional head, body and eye angles during walking and turning, it was found that the peak body yaw of 3.5° in space was compensated by the relative peak head yaw of 3°, which consequently resulted in very small head yaw angles (less than 1°) in space. Additionally, the naso-occipital axis of the head was closely aligned with the anterio-posterior direction of locomotion. The head pitch and roll angles peaked at approximately 3° as observed both in over-ground walking and in treadmill walking [64, 65]. In terms of gaze behavior, eyes were found to spend the majority of the time (78.8%) fixating the aspects of the environment along the direction of locomotion and a small amount of time (16.3%) searching for possible future routes. What appears to be random point inspection took only 4.9% of the time during walking. Furthermore, such gazing patterns (fixating along the direction of walking) appeared not to be influenced by individual differences.
During turning, gaze is directed in advance of the body heading, and after turning, gaze is returned to align with the direction of motion. During a 90° turn while walking, head yaw was maintained smoothly in space, with a maximum 25° deviation from the heading direction of the body. Eye position, however, was found to shift in saccades in the direction of turn (Figure 2), reaching yaw angles as high as 50° relative to the head. Once the turn was complete, eye position and foot position returned to zero relative to the head. Our goal, then, is to determine the patterns of behavior that relate both the vector of approach and the final pose of the static stance to the likely final focus of attention.
In addition to biomechanical constraints in the previous section, we add behavioral constraints of instrumental gaze. In our work on meeting analysis [30–32, 35, 68], we observed that there is a difference between interactive deployment of gaze and an instrumental one [32, 69–71]. Interactive gaze takes place between people, and is influenced by aspects of social behavior such as the avoidance of `nose-to-nose' fixations, and back-channeling behavior. Instrumental gaze involves the deployment of gaze for the purpose of acquiring information (such as reading, or viewing a graphic). Our preliminary analysis of meeting room data suggests that there is greater variation in gaze deflection from the `nose-forward' vector of head orientation for interactive gaze with respect to instrumental gaze. Since our interest is in instrumental use of gaze with technology, our expectation is that eye deflection variability is reduced for such activity. Furthermore, the kind of instrumental gaze necessary to access information requires the deployment of central-foveal vision.
Figure 4 illustrates our base model of stance for the estimation of the zone of visual attention. We call this the base model because we expect that our model will have to evolve as more is known about the relationship between gaze and stance. The reference frame of the model is formed by the connecting line between the centers of mass of the feet, and the normal to that line in the forward direction of the subject (shown as the x–y reference frame in Figure 4). The orientations of the right and left feet are described by the angles ϕr and ϕl respectively. The approach angle γ describes the direction of locomotion prior to stopping in the resultant pose. d describes the width of the stance. The angle θT is the angle of gaze to the target of attention as a deflection off the stance normal. By this model, vi = [ϕr, ϕl, d, γ] constitutes an input stance vector, and the value θT is the output value to be estimated.
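As a rough illustration of the model above, the following Python sketch assembles a stance vector from 2D floor-plane measurements. It is a minimal sketch under stated assumptions, not the authors' software: the function names, the input layout (foot centers, toe-forward directions, and an approach direction as 2D tuples), and the sign convention (counterclockwise positive when viewed from above) are all hypothetical choices made for this example.

```python
import math

def signed_angle(ref, vec):
    """Signed angle in degrees from 2D vector ref to vec.
    Assumed convention: counterclockwise is positive, viewed from above."""
    cross = ref[0] * vec[1] - ref[1] * vec[0]
    dot = ref[0] * vec[0] + ref[1] * vec[1]
    return math.degrees(math.atan2(cross, dot))

def stance_normal(right_pos, left_pos, right_toe, left_toe):
    """Unit normal to the line joining the foot centers, facing forward.
    'Forward' is taken here as the side the toes point to (an assumption)."""
    bx = right_pos[0] - left_pos[0]
    by = right_pos[1] - left_pos[1]
    d = math.hypot(bx, by)
    normal = (-by / d, bx / d)
    toe_sum = (right_toe[0] + left_toe[0], right_toe[1] + left_toe[1])
    if normal[0] * toe_sum[0] + normal[1] * toe_sum[1] < 0:
        normal = (-normal[0], -normal[1])
    return normal, d

def stance_vector(right_pos, left_pos, right_toe, left_toe, approach_dir):
    """Return [phi_r, phi_l, d, gamma] in the stance reference frame."""
    normal, d = stance_normal(right_pos, left_pos, right_toe, left_toe)
    phi_r = signed_angle(normal, right_toe)    # right foot orientation
    phi_l = signed_angle(normal, left_toe)     # left foot orientation
    gamma = signed_angle(normal, approach_dir) # approach angle
    return [phi_r, phi_l, d, gamma]

# Illustrative values only: feet 0.3 m apart, right toe turned 15 degrees
# outward, left toe turned 10 degrees outward, approach straight ahead.
v = stance_vector(
    right_pos=(0.3, 0.0),
    left_pos=(0.0, 0.0),
    right_toe=(math.sin(math.radians(15)), math.cos(math.radians(15))),
    left_toe=(-math.sin(math.radians(10)), math.cos(math.radians(10))),
    approach_dir=(0.0, 1.0),
)
print(v)  # approximately [-15.0, 10.0, 0.3, 0.0]
```

The same `signed_angle` helper could be applied to the vector from the stance midpoint to the target to obtain θT as a deflection off the stance normal.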
To test the hypothesis that stance may be a predictor of gaze direction, we designed an experiment where subjects are required to read a series of lines of text that are mounted on aluminum posts. The text is small so that the subjects had to move to the target to read the lines. We tracked the feet of the subjects to obtain the stance vector and used a neural net approach to learn the gaze direction. The point of this experiment is not to advance any specific learning approach. It is to ascertain if any patterning exists by which our cross-modal hypothesis may be validated.
Figure 5 shows the plan view of our experimental configuration. Since our model describes only horizontal gaze deployment (Figure 4 does not include viewing pitch) the target cards are set at eye height for each subject. Figure 6 shows a picture of our experimental setup in the laboratory. Two gaze targets can be seen. We employ our Vicon near-infrared motion trackers to estimate the parameters in our model to obtain the stance vector, vi = [ϕr, ϕl, d, γ]. By tracking the retro-reflector marker configurations on the frame attached to the subject's shoes (see Figure 6 inset), our experiment software produces a time-stamped stream of quaternions from which we derive the basis vectors of the tracked frame for each foot. To simplify the determination of the approach vector, we also track the location of the subject's head (tracked goggles in Figure 6). This also gives us access to the subject's head orientation, although we did not use it for this experiment.
By having the subject place her foot in a box of known coordinates and orientation marked on the floor, we obtain the toe-forward vector from the basis frame of each tracked position. Given the unit basis matrix BC of the calibration box, and the unit basis matrix B0 of the tracked frame attached to a foot, we obtain the tracking transformation Mf = BC × B0ᵀ, where B0ᵀ is the transpose of B0 (for foot f, where f is r or l for the right and left foot respectively). Given a subsequent tracked unit basis matrix Bi, the toe-forward frame is simply given by Mf × Bi.
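The calibration step can be sketched in a few lines of Python. This is an illustration only: it assumes that each basis matrix stores its unit basis vectors as columns in laboratory coordinates, and the helper names (`calibrate`, `toe_forward_frame`, `rot_z`) and the example angles are hypothetical. Because B0 is orthonormal, applying Mf to B0 itself recovers BC, which is the property the calibration relies on.

```python
import math

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def rot_z(deg):
    """Rotation about the vertical axis; used only to build example frames."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def calibrate(b_c, b_0):
    """Tracking transformation Mf = BC x B0^T for one foot."""
    return matmul(b_c, transpose(b_0))

def toe_forward_frame(m_f, b_i):
    """Toe-forward frame Mf x Bi for a later tracked frame Bi."""
    return matmul(m_f, b_i)

# Example with made-up angles: the calibration box is aligned with the lab
# axes, and the marker frame happens to be mounted 37 degrees off the toe.
b_c = rot_z(0)
b_0 = rot_z(37)
m_f = calibrate(b_c, b_0)

# Later the foot has turned 20 degrees, so the tracker reports 57 degrees.
b_i = rot_z(57)
frame = toe_forward_frame(m_f, b_i)
heading = math.degrees(math.atan2(frame[1][0], frame[0][0]))
print(round(heading, 6))  # 20.0, the mounting offset has been removed
```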
The subjects are directed to read lines in 12-point font printed on 3×5 cards placed at four known coordinates in the laboratory (pictured in Figure 6, and labeled A, B, C, D in Figure 5). Each line of each card contains three columns: a sequential index (A.1, A.2 … for card A, B.1, B.2 … for card B etc.), a line of text to be read, and the index of the next line to be read. The station for the next line to be read is randomized so that the subject will go from station to station to read the next line. The small font ensures that the subject must move from one station to the next. The subject reads each line aloud so that we know when her attention is fixed on the target 3×5 cards. With this information, we can extract the parameters described in Figure 4 (Section 3.2). Each time a target is read, we record a stance vector vi = [ϕr, ϕl, d, γ] and attention angle θT. For each trial, the subject reads 40 lines randomly located at the 4 targets. The trial is run once for each of 10 subjects, yielding a dataset containing 400 vectors.
We employed a standard three-layer backpropagation neural network approach to estimate θT from vi. Kolmogorov [73, 74] showed that any continuous function of several variables can be represented as a superposition of sums of continuous functions of a single variable. In our implementation, the input layer has four neurons for the input of the four parameters ϕr, ϕl, d, γ. The output layer has one neuron for the parameter θT and the hidden layer has 15 neurons (using a rule of thumb of between 4 and 5 times the number of input neurons). The network was initialized with random weights. After training with samples, the network can learn the relationship [ϕr, ϕl, d, γ] → θT. We can then apply it to estimate θT for some new vi. For our study, we divided our dataset into four sets of 300 training vectors and 100 test vectors. We trained our network on the former, and ran the resulting network using the stance vectors from the latter group.
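To make the setup concrete, here is a minimal pure-Python sketch of a 4-15-1 network trained by backpropagation over four 300/100 splits. Much of it is assumed rather than taken from the study: the tanh hidden units, linear output, learning rate, epoch count, and weight range are illustrative choices, and the 400 samples are synthetic stand-ins generated from an arbitrary formula, not the recorded stance data.

```python
import math
import random

class ThreeLayerNet:
    """4 inputs -> 15 tanh hidden units -> 1 linear output (assumed design)."""

    def __init__(self, n_in=4, n_hidden=15, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2
        return h, y

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr):
        """One stochastic gradient step on the loss 0.5 * (y - target)^2."""
        h, y = self.forward(x)
        err = y - target
        for j, hj in enumerate(h):
            # Backpropagate through the output weight before updating it.
            grad_h = err * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * err * hj
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_h * xi
            self.b1[j] -= lr * grad_h
        self.b2 -= lr * err

def mean_squared_error(net, samples):
    return sum((net.predict(x) - t) ** 2 for x, t in samples) / len(samples)

def four_fold_splits(samples, fold_size=100):
    """Yield (300 training, 100 test) partitions of a 400-sample dataset."""
    for k in range(0, len(samples), fold_size):
        yield samples[:k] + samples[k + fold_size:], samples[k:k + fold_size]

# Synthetic stand-in data: inputs scaled to [-1, 1], target from a made-up rule.
rng = random.Random(1)
data = []
for _ in range(400):
    x = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    t = 0.2 * x[0] - 0.2 * x[1] + 0.1 * x[2] + 0.6 * x[3]
    data.append((x, t))

results = []
for train, test in four_fold_splits(data):
    net = ThreeLayerNet()
    before = mean_squared_error(net, test)
    for _ in range(10):  # epochs
        for x, t in train:
            net.train_step(x, t, lr=0.02)
    results.append((before, mean_squared_error(net, test)))

for before, after in results:
    print(f"test MSE before {before:.4f}, after {after:.4f}")
```

With real data, the angles and stance width would presumably need to be scaled to a comparable range before training, as the synthetic inputs are here.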
Figure 7 is a histogram of the absolute difference abs(θ̂T − θT), where θ̂T is the estimated attention direction and θT is the measured direction. For this dataset, 31% of the estimates fell within 5° of the measurements, 51% were within 10°, and 60% were within 15°.
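The threshold percentages reported above follow from a simple count over the absolute errors. A short sketch of that calculation, using a hypothetical helper name and four made-up angle pairs rather than the study's data:

```python
def fraction_within(estimates, measured, threshold_deg):
    """Fraction of estimates whose absolute error is at most threshold_deg."""
    errors = [abs(e - m) for e, m in zip(estimates, measured)]
    return sum(err <= threshold_deg for err in errors) / len(errors)

# Made-up angles in degrees, for illustration only.
estimated = [2.0, 12.0, -8.0, 30.0]
measured = [0.0, 5.0, -20.0, 10.0]   # absolute errors: 2, 7, 12, 20

for threshold in (5, 10, 15):
    print(threshold, fraction_within(estimated, measured, threshold))
# prints 0.25, 0.5 and 0.75 for the 5, 10 and 15 degree thresholds
```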
Figure 8 plots the absolute value of the measured θT against the absolute error abs(θ̂T − θT) for a particular dataset (testing against 100 vectors). This shows that our estimation error increases with the size of deflection. Given the limited size of our dataset (only 400 samples), this might be expected since the data becomes sparser with larger θT.
These results show an estimation accuracy far in excess of chance. For example, assuming that the subject is capable of viewing 180° from a particular stance, chance would predict a success rate of 2.78% for the 5° estimate (5/180), 5.56% for the 10° estimate, and 8.33% for the 15° estimate. It should be noted that this experiment did not take individual differences into account, and the training sets are not extensive. Hence, one might expect the results to improve with more user-specific training. Also, we acknowledge that our stance vector is an initial principled guess. One might imagine that the estimate could be improved by extending the stance vector to include weight distribution, subject parameters (e.g., height), and so on. Our purpose here is to advance a proof-of-concept for consideration by the research community.
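The chance figures can be reproduced with a one-line calculation. The sketch below follows the text's assumption that chance is the threshold divided by a 180° viewing range; the function name and the printed ratio are additions for illustration.

```python
def chance_rate(threshold_deg, view_range_deg=180.0):
    """Chance success rate in percent, as threshold / viewing range."""
    return 100.0 * threshold_deg / view_range_deg

# Observed rates are the percentages reported for the three thresholds.
observed = {5: 31.0, 10: 51.0, 15: 60.0}

for threshold, rate in observed.items():
    chance = chance_rate(threshold)
    print(f"{threshold} deg: chance {chance:.2f}%, observed {rate:.0f}%, "
          f"ratio {rate / chance:.1f}x")
```

Under this baseline the observed rates are roughly 7 to 11 times the chance rates.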
We have demonstrated a rather audacious presupposition that one is able to estimate a subject's instrumental gaze direction or attentional focus from her approach vector and standing stance.
We presented our rationale for our research by reviewing the need and the technologies for gaze/attention estimation. We showed that there is need for a non-intrusive coarse scale attention estimation approach that is able to track over a large area.
We ground our proposed stance model and the expectation that we may be able to estimate attention from stance by discussing the biomechanics of approach and gaze fixation. We present our stance model comprising only four parameters.
We present a set of experiments by which we track subjects' feet and approach vector to an attention target using a motion tracking system. Subjects were required to move to one of four stations randomly and read a line of text. We extracted 400 stance vector – attention direction sets, and employed a neural net system to learn the relationship. The results are promising.
While the results are promising, more needs to be done. The approach is to find a mapping between the stance vector and the direction of attention. Our initial stance vector, while arrived at in a principled manner, ignores many other possible parameters that may help determine gaze direction. Examples of these include weight balance (right foot vs left foot, forward lean vs backward lean), dynamics of approach, and duration of gaze.
Also, in our study, our subject approached an initial target and directed visual attention at it. We can think of this as the initial zone of attention from a particular stance. This does not address the retargeting of attentional focus from such a fixed stance after the initial attentional gaze. We conjecture that once a stance is fixed, there is a `zone of comfort' where a subject can redeploy gaze without moving her feet (shifting her stance). This might occur when the subject has selected a stance for a particular initial target and a new target appears in close proximity to the original. Let δx be the distance of some secondary target from the initial target. To characterize the range of δx, a second type of experiment is required that utilizes a large display system such as our tiled wall-sized display (seen in the background in Figure 6). When an initial target is displayed, the subject approaches and reads as before. Secondary targets are displayed at different δx's to determine typical range thresholds that engage adjustment of stance. The range of these `within-stance attention redeployments' may require extension of the stance vector to include balance components, or it may define a zone of uncertainty of secondary gaze targets.
This research has been partially supported by the U.S. National Science Foundation NSF ITR program, Grant No. ITR-0219875, “Beyond the Talking Head and Animated Icon: Behaviorally Situated Avatars for Tutoring,” IIS-0624701, CRI-0551610, “Embodied Communication: Vivid Interaction with History and Literature,” and NSF-IIS-0451843, “Interacting with the Embodied Mind,” and “Embodiment Awareness, Mathematics Discourse and the Blind.” We also acknowledge Yingen Xiong and Pak-Kiu Chung, who conducted the experiments and ran the neural net pattern classification system.
Categories and Subject Descriptors H.5 INFORMATION INTERFACES AND PRESENTATION (e.g., HCI) (I.7) I.5 PATTERN RECOGNITION,
General Terms Human Factors