Simulate Human Saccadic Scan-Paths in Target Searching

Human saccade is a dynamic process of information pursuit. Many methods model human saccadic scan-paths using either global context or local context cues. In contrast, this paper introduces a model for gaze movement control that uses both global and local cues. To test the performance of this model, we collected human eye movement data with an SMI iVIEW X Hi-Speed eye tracker at a sampling rate of 1250 Hz. The experiment used a two-by-four mixed design with the location of the targets and the four initial positions. We compare the saccadic scan-paths generated by the proposed model against human eye movement data on a face benchmark dataset. Experimental results demonstrate that the scan-paths simulated by the proposed model are similar to human saccades in terms of the fixation order, Hausdorff distance, and prediction accuracy for both static fixation locations and dynamic scan-paths.


Introduction
Locating targets is still a challenging problem in computer vision. However, humans perform this task in a more intuitive and efficient manner by selecting only a few regions to focus on; observers never form a complete and detailed representation of their surroundings [1]. Due to the high efficiency of this biological approach, more and more researchers are devoting effort to probing the nature of attention [2].
Usually two kinds of cues are used to predict human gaze location in dynamic scenes [3] and to control gaze movement when searching for a target: cues about bottom-up features such as shape, color, and scale [4]-[7], and cues about the top-down visual context, which contains the target as well as other relevant objects' spatial relationships and their environmental features [8]-[10].
In classical search tasks, target features are an important source of guidance [11]-[15]. Although a natural object, such as an animal (cat or dog), does not have a single defining feature, its statistically reliable properties (round head, straight body, legs and others) can be selected by visual attention. There has been little research using visual context in object search. Torralba used global context to predict the region where the target is more likely to be found [16]. Ehinger and Paletta [17] [18] used object detectors to search for the targets within the region predicted as in [16], for accurate localization. Kruppa and Santana used an extended object template containing local context to detect extended targets, and inferred the location of the targets from the ratio between the size of the target and the size of the extended template [19]. Most of the above methods are based on either global context or local context cues alone. However, Miao et al. proposed a series of neural coding networks in [20]-[23] that use both.
In this study, our main purpose is to simulate human saccadic scan-paths with the model proposed in [23]. To test the performance of the model, we collect human eye movement data on a face dataset by using an SMI iVIEW X Hi-Speed eye tracker with a sampling rate of 1250 Hz. We compare the saccadic scan-paths generated by the model against actual human eye movement data from the face dataset [28].
The paper is organized as follows: the model of gaze movement control in target searching proposed in [23] is introduced briefly in Section 2. In Section 3, we compare our saccadic scan-paths with previous methods and with scan-paths from eye tracking data. Our conclusions are presented in Section 5.

Review of the Gaze Movement Control Model in Target Searching
This paper applies the target searching model in [23] to simulate eye-motion traces. The feature used in the model is a kind of binary code called Local Binary Pattern (LBP) [32], which our earlier work showed to be superior, with respect to search performance, to the orientation features used in the same system [33] [34]. LBP is a simple and fast encoding scheme that maps a 3 × 3 image patch to a local feature pattern in the form of an 8-bit code. The encoding scheme has no parameters: it outputs 0 or 1 for each bit by comparing the value of the central pixel with that of each of the eight surrounding pixels. The model in [23] has encoding and decoding parameters, such as P, which determines how many context coding neurons are activated through competition. In our experiments, the best value for this parameter was 70%. So in this paper, we use the best model, with the LBP feature and P = 70%, to simulate eye-motion traces.
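The LBP encoding described above can be illustrated with a minimal sketch. The bit ordering (clockwise from the top-left neighbor) and the use of "greater than or equal to" as the comparison are assumptions made here for illustration; the text only states that each bit comes from comparing the central pixel with one of its eight neighbors. The function names are hypothetical.

```python
# Offsets (row, col) of the eight neighbors, clockwise from the top-left.
# This ordering is an assumption; the paper does not specify one.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                    (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(patch):
    """Map a 3x3 gray-level patch (list of 3 rows) to an 8-bit LBP code."""
    center = patch[1][1]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOR_OFFSETS):
        # The bit is 1 when the neighbor is at least as bright as the center
        # (assumed comparison direction).
        if patch[1 + dr][1 + dc] >= center:
            code |= 1 << bit
    return code


def lbp_image(gray):
    """Compute the LBP code of every interior pixel of a 2-D gray image."""
    rows, cols = len(gray), len(gray[0])
    codes = []
    for r in range(1, rows - 1):
        row_codes = []
        for c in range(1, cols - 1):
            patch = [gray[r + dr][c - 1:c + 2] for dr in (-1, 0, 1)]
            row_codes.append(lbp_code(patch))
        codes.append(row_codes)
    return codes
```

Because the code depends only on the sign of local differences, it needs no tuning, which matches the remark that the scheme has no parameters.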
The learning and testing algorithm for target search is illustrated in Figure 1 and described in Sections 2.1 and 2.2. Here the visual context means the visual field image and the spatial relationship from the center of the visual field to the center of the target. In order to encode such context, we need to calculate and store the representation coefficients of the spatial relationship and the visual field images. The model's learning algorithm and test method are introduced in this part. In this experiment, we use the head-shoulder image database from the University of Bern [24].
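As a rough illustration of the multi-scale visual fields listed in the caption of Figure 1 (16 × 16 pixel fields at scales 5 to 1, sub-sampled with intervals of 16, 8, 4, 2 and 1 pixels), the following sketch extracts one visual field image around a gaze point. The centering convention and the constant fill value for samples that fall outside the image are assumptions, and `visual_field` is a hypothetical name, not a function from [23].

```python
FIELD_SIZE = 16  # each visual field image is 16 x 16 pixels (Figure 1)
# Sampling interval in pixels for each scale, as listed in Figure 1.
INTERVALS = {5: 16, 4: 8, 3: 4, 2: 2, 1: 1}


def visual_field(image, center_x, center_y, scale, fill=0):
    """Sub-sample a FIELD_SIZE x FIELD_SIZE visual field around a gaze point.

    image is a 2-D list indexed as image[y][x]. Samples outside the image
    are replaced by `fill` (an assumed border policy).
    """
    step = INTERVALS[scale]
    rows, cols = len(image), len(image[0])
    half = FIELD_SIZE // 2
    field = []
    for i in range(FIELD_SIZE):
        y = center_y + (i - half) * step
        row = []
        for j in range(FIELD_SIZE):
            x = center_x + (j - half) * step
            inside = 0 <= y < rows and 0 <= x < cols
            row.append(image[y][x] if inside else fill)
        field.append(row)
    return field
```

At scale 5 the field spans 256 pixels per side at coarse resolution, while at scale 1 it covers only a 16 × 16 pixel neighborhood at full resolution, which is what allows coarse-to-fine search.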

Model Training
The learning algorithm is described in [23] as follows:
1) Choose a value s from the scale set {s_j} for the visual field that will be processed;
2) Choose an initial view point (x_j, y_j) as the center of the visual field from an initial point set {(x_j, y_j)} covering the surrounding area of the target;
3) Receive signals from the current visual field, and output a relative position evaluation for the target with view point moving distances (Δx, Δy);
4) If the prediction error err is larger than the limit ERR(s) for the scale s of the current visual field, move the visual field center to a new position randomly; go to 3 until err ≤ ERR(s) or the iteration number is larger than a limit;
5) If err > ERR(s), generate a new VF-image encoding neuron (let its response R_k = 1); encode the visual context by calculating and memorizing the connection weights {w_ij,k} between the new VF-image encoding neuron and the feature neurons, and the connection weights w_k,uv between the new VF-image encoding neuron and the motion encoding neurons (let their response R_uv = 1), respectively, using the Hebbian learning rule Δw_a,b = αR_a R_b;
6) Go to 2 until all initial view points are chosen;
7) Go to 1 until all scales are chosen.
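The Hebbian rule in step 5, Δw_a,b = αR_a R_b, can be sketched as a plain outer-product update. This is only an illustration of the rule itself, not the network of [23]: the function name, the list-of-lists weight layout, and the default learning rate α = 1.0 are assumptions.

```python
def hebbian_update(weights, pre_responses, post_responses, alpha=1.0):
    """Return weights updated by the rule dw[a][b] = alpha * R_a * R_b.

    weights[a][b] connects unit a (response pre_responses[a]) to unit b
    (response post_responses[b]). The input weights are not modified.
    """
    return [
        [w + alpha * r_a * r_b for w, r_b in zip(row, post_responses)]
        for row, r_a in zip(weights, pre_responses)
    ]


# Example: a newly created VF-image encoding neuron has response R_k = 1,
# so its weights to three feature neurons simply store alpha times their
# responses (toy values, chosen for illustration only).
feature_responses = [0.2, 0.0, 1.0]
new_neuron_response = [1.0]
initial = [[0.0], [0.0], [0.0]]
stored = hebbian_update(initial, feature_responses, new_neuron_response)
```

With R_k = 1 and zero initial weights, the update amounts to memorizing a scaled copy of the current feature responses, which is consistent with the description of step 5 as memorizing the visual context.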

Model Prediction
In the test stage, the entire algorithm of view point control for object locating is as follows:
1) Get a pre-given view point (x, y);
2) Choose a scale s from the set {s_i} for the current visual field, from the maximum to the minimum;
3) Receive signals from the current visual field, and calculate the responses of the feature neurons and the context encoding neurons;
4) Predict a relative position (Δx, Δy) for the real position of the object;
5) If (Δx, Δy) = (0, 0), the object is located;
6) If (Δx, Δy) ≠ (0, 0), move the view point by (Δx, Δy) and go to 2 until all scales are chosen.
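The control flow of the test stage can be sketched as a coarse-to-fine loop. The predictor is passed in as a callable, `predict_offset(x, y, scale)`, which is a hypothetical stand-in for the trained network of [23]; only the loop structure follows the steps listed above.

```python
def locate_target(start, scales, predict_offset):
    """Move a view point from `start` toward the target, coarse to fine.

    predict_offset(x, y, scale) must return the predicted displacement
    (dx, dy) to the target; it stands in for the trained model.
    Returns the list of visited view points (the simulated scan-path).
    """
    x, y = start
    path = [(x, y)]
    # Step 2: scales are visited from the maximum to the minimum.
    for scale in sorted(scales, reverse=True):
        dx, dy = predict_offset(x, y, scale)  # steps 3 and 4
        if (dx, dy) == (0, 0):
            break  # step 5: object located
        x, y = x + dx, y + dy  # step 6: move the view point
        path.append((x, y))
    return path


# Toy usage with a predictor that always knows the exact offset to a
# made-up target position; a real predictor would be learned.
toy_target = (100, 80)


def exact_offset(x, y, scale):
    return toy_target[0] - x, toy_target[1] - y


toy_path = locate_target((10, 10), [1, 2, 3, 4, 5], exact_offset)
```

Because at most one move is made per scale, a search over five scales yields at most five saccades after the initial fixation.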

Participants
Fifteen female and twelve male college students of Beijing University of Technology participated in this study. The age range was 23-26 and the average was 24 years old. All of the twenty-seven students had normal or corrected-to-normal vision.

Stimuli
A set of 30 face pictures was prepared as stimuli. Of these 30 pictures, 15 are female faces and 15 are male faces, and the size of each picture is 1024 × 768 pixels. The pictures were presented one at a time on a color computer monitor at a resolution of 1024 by 768 pixels. The monitor size was 41 cm by 33.8 cm, and the participants were seated in a chair about 76 cm in front of the screen.

Design
A new searching task was used in this study: participants were asked to search for the left and right eyes in a face from a pre-given starting point. Thirty pictures of faces were used as stimuli, including 15 female and 15 male faces. The size of each picture was 1024 × 768 pixels. There were four pre-given starting points, named the first, second, third and fourth quadrant respectively in a counterclockwise direction, as in a coordinate system. The starting point and the target eye together determined the searching distance and direction. Figure 1 illustrated the searching targets and the definition of the quadrants.

Procedure
For each trial, as shown in Figure 2, a black trial indicator was first presented in the middle of the white screen for 1000 ms to indicate whether the target was the left or the right eye. Then a "+" marking one of the pre-defined starting positions was presented, with the positions appearing in random order. After that, the picture of a face appeared in the middle of the screen for 2000 ms, and participants were asked to find the target eye (left or right) as accurately and quickly as possible. Participants were told not to look at other parts of the picture after finding the target.

Preprocess
The real fixation points are collected on images with a size of 1024 × 768 pixels. However, the model can only deal with gray images no larger than 320 × 320 pixels. So when evaluating the performance of the model, we compress the original 1024 × 768 color images into 320 × 240 gray images. Ten face images are used in the learning stage and the other 20 face images are used for evaluation. The algorithms for the learning and the prediction stages are described in Section 2. When predicting fixation order and scan-paths, the same initial positions were used as in the above experiment. Our model searches for the left eye and the right eye separately from four different initial points, as in the above experiment. Each participant is asked to search for the left and right eyes from four different starting points on a face, which produces 8 scan-paths per face. For 27 subjects and 20 face images, 27 × 20 × 8 = 4320 scan-paths are recorded in total.
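Since fixations are recorded at 1024 × 768 pixels while the model works on 320 × 240 images, comparing the two requires bringing the recorded coordinates into the model's coordinate frame. The sketch below shows one straightforward way to do this with a linear rescaling; the paper does not state how the mapping was done, so this is an assumption, and the function names are hypothetical. The second function simply reproduces the scan-path count given above.

```python
ORIGINAL_SIZE = (1024, 768)  # width, height of the presented pictures
MODEL_SIZE = (320, 240)      # width, height of the compressed gray images


def to_model_coords(x, y):
    """Linearly rescale a recorded fixation to the model's image size."""
    scale_x = MODEL_SIZE[0] / ORIGINAL_SIZE[0]
    scale_y = MODEL_SIZE[1] / ORIGINAL_SIZE[1]
    return x * scale_x, y * scale_y


def count_scan_paths(subjects=27, images=20, targets=2, starts=4):
    """Number of recorded scan-paths: subjects x images x (targets x starts)."""
    return subjects * images * targets * starts
```

Both axes shrink by the same factor (320/1024 = 240/768 = 0.3125), so the rescaling preserves the aspect ratio and distances measured in the model frame are uniformly scaled versions of those in the original frame.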

Evaluation of Fixation Order
We are aware of only limited literature on computational models of active visual attention, a topic that needs further investigation. Lee and Yu's work in [25] provided a conceptual framework but did not provide a fully implemented solution with experimental results. Renninger et al. in [26] simulated scan-paths on novel shapes, but it is not clear how to adapt their method to natural images. Itti et al. in [27], however, proposed a method that generates scan-paths from static saliency maps based on winner-takes-all (WTA) and inhibition-of-return (IoR) regulations. Tom Foulsham looked for evidence for the model of Itti et al. in normal and gaze-contingent search tasks in natural scenes [28]. Marco Wischnewski proposed a model combining static and dynamic proto-objects in a TVA-based model of visual attention to predict where to look next [29]. Gert Kootstra proposed a model that predicts eye fixations on complex visual stimuli using local symmetry [30]. De Croon [31] proposed a novel gaze-control model, named act-detect, which uses the information from local image samples to shift its gaze towards object locations for detecting objects in images. Our system can automatically generate fixations, and the fixation can move to the target in four or five steps under the control of learned memory and experience. Here we compare the simulated scan-paths generated by the model of [23] with human saccades. We select the initial positions on the four quadrants of the image shown in Figure 3. The experimental results are illustrated in Figure 4. The scan-paths simulated by our model are similar to human saccades.

Distance of Scan-paths
In order to quantitatively compare the stochastic and dynamic scan-paths, we divide scan-paths into pieces of length 2. We use the Hausdorff distance to compare the scan-paths produced by the model of Miao et al. with the scan-paths of all subjects recorded by the eye tracker, and to compare the scan-paths of different subjects. The results are shown in Table 1.
In Table 1, Model-Human means the average of the Hausdorff distances between the scan-path generated by the model and that from each of the 27 subjects on the corresponding images. Human-Human means the average of the Hausdorff distances between the scan-paths generated by any two of the 27 subjects. Table 1 shows that the scan-paths simulated by the model of Miao et al. are similar to human saccades: the average Hausdorff distance between the scan-paths generated by the model and those of each subject on all the corresponding images is 29.18, which is close to the average (26.36) of the Hausdorff distances between the scan-paths of every two of the 27 subjects. We also compute the average of the Hausdorff distances for the cases in which the initial position is in the second, third and fourth quadrants respectively, as shown in Table 2.
In Table 2, Model-Human means the average of the Hausdorff distances between the scan-paths generated by the model and those of each of the 27 subjects on all the corresponding images. Human-Human means the average of the Hausdorff distances between the scan-paths generated by every two of the 27 subjects from the first, the second, the third and the fourth quadrants. The average of the Hausdorff distances from all four initial quadrants is 24.09. We conclude that the model of Miao et al. [23] achieves good predictive accuracy on both static fixation locations and dynamic scan-paths.
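The Model-Human and Human-Human averages rest on the Hausdorff distance between two scan-paths. A minimal sketch follows, treating each scan-path as a set of 2-D fixation points with the Euclidean metric; this is the standard symmetric Hausdorff distance and is an assumed reading of the procedure. The division of scan-paths into pieces of length 2 is not reproduced here because the text does not specify how the pieces are formed, and the function names are hypothetical.

```python
import math
from itertools import combinations


def directed_hausdorff(path_a, path_b):
    """Largest distance from a point of path_a to its nearest point in path_b."""
    return max(min(math.dist(p, q) for q in path_b) for p in path_a)


def hausdorff(path_a, path_b):
    """Symmetric Hausdorff distance between two scan-paths (point lists)."""
    return max(directed_hausdorff(path_a, path_b),
               directed_hausdorff(path_b, path_a))


def mean_model_human(model_path, human_paths):
    """Average distance between the model's path and each subject's path."""
    total = sum(hausdorff(model_path, h) for h in human_paths)
    return total / len(human_paths)


def mean_human_human(human_paths):
    """Average distance over every pair of subjects' paths."""
    pairs = list(combinations(human_paths, 2))
    return sum(hausdorff(a, b) for a, b in pairs) / len(pairs)
```

Comparing the Model-Human average with the Human-Human average is informative because the latter gives the natural variability between subjects: a model whose distance to humans is close to the distance between humans is about as consistent with a subject as another subject would be.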

Evaluation of Search Precision
We also compute the search precision from the four different quadrants to the left eye and the right eye. The results are shown in Table 3. There is a discrepancy between the average search precision for the left eye and that for the right eye. This may be caused by the different contextual information that is coded and used by the search model.

Discussion and Conclusions
Miao et al. presented a new architecture for gaze movement control in target searching in [23]. This paper uses that model to simulate human saccadic scan-paths in target searching. To test the performance of the model, we collect human eye movement data by using an SMI iVIEW X Hi-Speed eye tracker at a sampling rate of 1250 Hz. We compare the saccadic scan-paths generated by the model against human eye movement data. Experimental results demonstrate that the scan-paths simulated by the model are similar to human saccades in terms of the fixation order and the Hausdorff distance of scan-paths. The results indicate that the model achieves good prediction accuracy on both static fixation locations and dynamic scan-paths.
The model is suitable for target searching in strong-context cases. However, it performs less effectively in weak-context cases. As future work, we therefore hope to use a bottom-up saliency map together with a top-down target template to assist context-based object searching, in order to achieve good prediction accuracy on both static fixation locations and dynamic scan-paths in weak-context cases. The current simulation is based on the model with the optimal features and parameters tuned on the real face data. How much the variation of features and parameters affects the simulation is a valuable question to be investigated. Evaluating the model's performance on pictures of people's faces rather than real faces is also an interesting question. These are what we will study in future work.

Figure 1. Illustration of the learning and testing algorithm for target search. (a) Five visual fields centered at a gaze point (here, the left eye center); (b) Five visual field images (16 × 16 pixels, scales = 5, 4, 3, 2 and 1) sub-sampled from the original image (320 × 214 pixels) with intervals = 16, 8, 4, 2, 1 pixel(s); (c) The spatial relationship between one given starting gaze point and the target center; (d) Memorizing the visual context, or predicting the target center from the current gaze points at different scales.

Figure 2. Sketch map of pre-given starting points in the face picture.

Figure 3. Procedure of the task.

Figure 4. The left column shows fixations predicted by the model proposed in [23]; the right column shows the real fixations recorded by the SMI iVIEW X Hi-Speed eye tracker (Note: the example face images are processed with mosaics).

Table 1. The average of the Hausdorff distances between the model and each of the 27 subjects, and that between each pair of subjects.

Table 2. The average of the Hausdorff distances.

Table 3. Search precision from four different quadrants to the left eye and the right eye.