Attention-Guided Organized Perception and Learning of Object Categories Based on Probabilistic Latent Variable Models

This paper proposes a probabilistic model of object category learning in conjunction with attention-guided organized perception. The model consists of two parts: a model of attention-guided organized perception of object segments on Markov random fields and a model of learning object categories based on a probabilistic latent component analysis. In attention-guided organized perception, concurrent figure-ground segmentation is performed on dynamically-formed Markov random fields around salient preattentive points, and co-occurring segments are grouped in the neighborhood of selective attended segments. In object category learning, a set of classes of each object category is obtained by the probabilistic latent component analysis with a variable number of classes from bags of features of segments extracted from images which contain the categorical objects in context, and an object category is represented by a composite of object classes. Through experiments using two image data sets, it is shown that the model learns a probabilistic structure of intra-categorical composition and inter-categorical difference of object categories and achieves high performance in object category recognition.


Introduction
Human visual processing is guided by attention, which circumscribes regions for high-level processing such as learning and recognition. An attention process can be divided into two stages: a preattentive process and a focal attentional process [1]. In the preattentive process, local saliency is detected in parallel over the entire visual field. In the focal attentional process, local saliencies are successively integrated, and attention works in two distinct and complementary modes, a space-based mode and an object-based mode [2]: the former selects locations where finer segmentation is promoted, the latter selects organized segments of objects through figure-ground segmentation and perceptual organization, and they operate in concert to influence the allocation of attention. An organized percept of segments tends to attract attention automatically [3]. Thus attention and organized perception can affect the high-level processing of learning and recognition.
The problem addressed in this paper is learning and recognition of object categories through attention-guided organized perception. In this problem, a set of scene images, each of which is labeled with one of plural objects in a scene, is provided for learning, and a scene image which contains a labeled object is provided for recognition. Here a labeled object in a scene is considered to be in the foreground through attention and other co-occurring objects are in the background. An image set which contains the same categorical object in the foreground is used for learning about the object category. This paper proposes a probabilistic model of attention-guided organized perception and learning of object categories which consists of the following two sub-models: one is a model of attention-guided organized perception of segments on Markov random fields (MRFs) [4] and the other is a model of learning object categories based on a probabilistic latent component analysis (PLCA) [5,6]. In attention-guided organized perception of segments, concurrent figure-ground segmentation is performed on dynamically-formed MRFs around salient points and co-occurring segments are grouped in the neighborhood of selective attended segments. In learning object categories, a set of object classes which composes each object category is obtained based on the PLCA with a variable number of classes (V-PLCA) from bags of features (BoFs) [7] of segments extracted from images in the object category. Here a BoF of a segment is calculated by using a code book, which is a set of key features generated by clustering SIFT features [8] of salient points of all the segments extracted from the set of all the scene images. The V-PLCA learns a probabilistic structure of object classes in each object category, where an object class represents an appearance of the categorical object or of another co-occurring categorical object and a composite of object classes represents an object category.
As for related work, many computational models of visual attention have been proposed, among which the saliency map model [9] is well-known and has had a great influence on later studies [10][11][12][13][14]. Image segmentation methods based on MRF models, which date back to Geman's work [15], are also widely studied, and an attention-based segmentation method using MRFs has been proposed [16], as has a salient object detection method using a conditional random field [17]. Our model of attention-guided organized perception is unique in that it links spatial preattention and object-based attention through figure-ground segmentation on dynamically-formed MRFs and groups segments in the neighborhood of selective attended segments. Several methods have been proposed which apply probabilistic latent semantic analysis to learning object or scene categories [18][19][20] and which incorporate attention into object recognition [21]. It is known that context improves category recognition of ambiguous objects in a scene [22], and several methods have been proposed which incorporate context into object categorization [23][24][25][26][27][28]. Our learning method differs from those existing ones in that it uses attended co-occurring segments for learning and learns a probabilistic structure of each categorical object and its context, which makes it possible to recognize objects in context. This paper is organized as follows. Section 2 presents a model of attention-guided organized perception. Section 3 describes a probabilistic learning model of object categories. Experimental results are shown in Section 4, in which the Caltech-256 image data set is used for evaluating learning through attention-guided organized perception and the MSRC labeled image data set v2 is used for evaluating recognition through categorical object learning. We discuss our results in Section 5 and conclude our work in Section 6.

Attention-Guided Organized Perception
The model of attention-guided organized perception consists of a saliency map for preattention, a collection of dynamically-formed MRFs for figure-ground segmentation, a visual working memory for maintaining segments and perceptually organizing them around selective attention, and an attention system over the saliency map and the visual working memory. Figure 1 depicts the organization and the computational steps of the model, which are explained in the following subsections.

Saliency Map
A saliency map is in general computed by integrating several visual features such as contrast, orientation, motion and so forth. The saliency map in this paper is a simplified version of the multi-level saliency map proposed in [12]. As features of an image, brightness, hue and their contrasts are obtained on a Gaussian resolution pyramid of the image. Brightness contrast and hue contrast are nominally computed by convolving brightness and hue with a LoG (Laplacian of Gaussian) kernel. However, since a hue value represents a color category by an angle in [0, 2π) on a continuous color spectrum circle, hue contrast is instead obtained by performing convolution over the circular hue difference between each point and its neighboring points. A saliency map is obtained by calculating saliency from brightness contrast and hue contrast on each level of the Gaussian resolution pyramid [12] and combining the multi-level saliency into one map by taking their sum.
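The computation above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the kernel size and σ, the use of the absolute LoG response as per-level saliency, and the nearest-neighbor up-sampling of coarser pyramid levels are all assumptions made for the sketch.

```python
import numpy as np

def log_kernel(size=9, sigma=1.5):
    """Discrete Laplacian-of-Gaussian (LoG) kernel, forced to zero mean
    so that uniform regions produce zero contrast."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    g = np.exp(-r2 / (2 * sigma ** 2))
    k = (r2 - 2 * sigma ** 2) / sigma ** 4 * g
    return k - k.mean()

def conv2_same(img, k):
    """Naive 'same' 2-D convolution with edge padding."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    p = np.pad(img, ((ph, ph), (pw, pw)), mode="edge")
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(p[i:i + kh, j:j + kw] * k)
    return out

def hue_contrast(hue, k):
    """Hue contrast for circular hue values in [0, 2*pi): the kernel is
    applied to the circular difference between each point and its
    neighbours instead of to the raw hue values."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    p = np.pad(hue, ((ph, ph), (pw, pw)), mode="edge")
    out = np.empty_like(hue, dtype=float)
    for i in range(hue.shape[0]):
        for j in range(hue.shape[1]):
            d = np.abs(p[i:i + kh, j:j + kw] - hue[i, j])
            d = np.minimum(d, 2 * np.pi - d)   # wrap on the colour circle
            out[i, j] = np.sum(np.abs(k) * d)
    return out

def saliency_map(brightness, hue, levels=3):
    """Sum of per-level saliency over a coarse resolution pyramid,
    up-sampled back to full resolution before accumulation."""
    k = log_kernel()
    sal = np.zeros_like(brightness, dtype=float)
    b, h = brightness.astype(float), hue.astype(float)
    for _ in range(levels):
        s = np.abs(conv2_same(b, k)) + hue_contrast(h, k)
        rep = sal.shape[0] // s.shape[0]
        sal += np.kron(s, np.ones((rep, rep)))[:sal.shape[0], :sal.shape[1]]
        if min(b.shape) < 4:
            break
        b, h = b[::2, ::2], h[::2, ::2]        # next (coarser) level
    return sal
```

A bright spot on a uniform background yields its maximum saliency at the spot, and a completely uniform image yields zero saliency everywhere, as expected from the zero-mean kernel.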

Segmentation through Preattention
Figure-ground segmentation is performed by figure-ground labeling on dynamically-formed MRFs of brightness and hue around preattentive points. In the first step (Figure 1), plural preattentive points are stochastically selected from the saliency map according to their degrees of saliency. In the second step (Figure 1), an MRF is dynamically formed on a domain W around each preattentive point. Let l_w ∈ {+1, −1} be the segment label (figure or ground) at a site w ∈ W. Then, for a given observed feature x_w at each site w ∈ W, the problem of estimating segment labels is solved by using the EM algorithm with the mean field approximation [29]. The mean field local energy function is defined by

  E_w^mf(l_w; Θ^(t)) = −log p(x_w | l_w, Θ^(t)) + λ Σ_{w′∈B_w} V(l_w, ⟨l_{w′}⟩^(t)),

where V is the potential of a pair-site clique, B_w is the 8-neighborhood system of w, λ is an interaction coefficient which is preset in this study, ⟨l_{w′}⟩ is the expectation of a segment label in the neighborhood, t is the EM iteration number and Θ is the parameter set of the means and variances of the multivariate Gaussian distributions of figure and ground features. A posterior probability of a segment label is then given by

  p(l_w | x_w) = exp(−E_w^mf(l_w; Θ^(t))) / Z_w^mf,

where Z_w^mf is the partition function, and the expectation of a segment label is obtained as ⟨l_w⟩ = Σ_{l_w} l_w p(l_w | x_w). In the E-step, for each site in the domain of an MRF, the expectation of the segment label ⟨l_w⟩ is repeatedly calculated until all the expectations of segment labels converge; usually only a few iterations are required. A segment label is estimated as "1" if ⟨l_w⟩ > 0 and "−1" otherwise. In the M-step, the means and variances of the multivariate Gaussian distributions of figure and ground features are updated by using the results of the E-step.
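The EM estimation with the mean field approximation can be sketched roughly as below. This is a simplified stand-in, not the paper's implementation: it assumes a single brightness feature per site with a fixed variance, uses a 4-neighborhood rather than the 8-neighborhood, and adopts an Ising-type pair potential, for which the mean field update reduces to a tanh of the log-likelihood ratio plus the neighbor field; all function and symbol names are illustrative.

```python
import numpy as np

def mean_field_segment(img, mu_f, mu_g, sigma=0.1, lam=1.5,
                       em_iters=5, mf_iters=10):
    """Figure-ground labelling l in {+1, -1} on a grid MRF by EM with
    the mean-field approximation.  `img` holds one feature per pixel
    (e.g. brightness); `lam` is the interaction coefficient."""
    H, W = img.shape
    m = np.zeros((H, W))          # mean field <l_w>, initialised neutral
    for _ in range(em_iters):
        # E-step: relax the mean field until (approximate) convergence
        for _ in range(mf_iters):
            # data term: half the log-likelihood ratio figure vs ground
            data = ((img - mu_g) ** 2 - (img - mu_f) ** 2) / (4 * sigma ** 2)
            # neighbour term: sum of expected labels in the 4-neighbourhood
            nb = np.zeros((H, W))
            nb[1:, :] += m[:-1, :]
            nb[:-1, :] += m[1:, :]
            nb[:, 1:] += m[:, :-1]
            nb[:, :-1] += m[:, 1:]
            m = np.tanh(data + lam * nb)
        # M-step: refit the figure/ground Gaussian means from the
        # posterior probability of "figure", (1 + <l_w>) / 2
        w_f = (1 + m) / 2
        w_g = 1 - w_f
        mu_f = np.sum(w_f * img) / max(np.sum(w_f), 1e-9)
        mu_g = np.sum(w_g * img) / max(np.sum(w_g), 1e-9)
    labels = np.where(m > 0, 1, -1)
    return labels, mu_f, mu_g
```

On a synthetic bright square over a dark background, a few EM iterations recover the square as the figure and refit the two means toward the true foreground and background values.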
Segments are merged if they spatially overlap and the Mahalanobis generalized distance for brightness and hue between them is not greater than a certain threshold. Let s_1 and s_2 be a pair of segments; the Mahalanobis generalized distance D_bh(s_1, s_2) for brightness and hue between s_1 and s_2 is computed from the means and covariances of the brightness-hue features of the two segments.
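The merge test can be sketched as follows. The use of the pooled within-segment covariance and the small regularization term are assumptions of this sketch, as the paper's exact expression for D_bh was not preserved.

```python
import numpy as np

def mahalanobis_bh(seg1, seg2):
    """Mahalanobis generalized distance between the mean (brightness,
    hue) feature vectors of two segments, using their pooled covariance
    (regularized so it stays invertible)."""
    seg1, seg2 = np.asarray(seg1, float), np.asarray(seg2, float)
    mu1, mu2 = seg1.mean(axis=0), seg2.mean(axis=0)
    n1, n2 = len(seg1), len(seg2)
    cov = ((n1 - 1) * np.cov(seg1, rowvar=False) +
           (n2 - 1) * np.cov(seg2, rowvar=False)) / (n1 + n2 - 2)
    cov += 1e-6 * np.eye(2)
    d = mu1 - mu2
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def should_merge(seg1, seg2, overlap, threshold=1.0):
    """Merge two segments iff they spatially overlap and their
    brightness-hue Mahalanobis distance is within the threshold."""
    return overlap and mahalanobis_bh(seg1, seg2) <= threshold
```

Two overlapping segments with nearly identical brightness-hue statistics are merged, while a segment with a clearly shifted mean is not.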

Organized Perception through Object-Based Attention
Figure segments are maintained in a visual working memory and organized perception is performed around selective attended segments through object-based attention. In the third step (Figure 1), for each extracted figure segment, the attention degree of the segment is calculated from its saliency, closedness and attention bias for object-based attention. Saliency of a segment is defined by both the degree to which a surface of the segment stands out against its surrounding region and the degree to which a spot in the segment stands out by itself. The former is called the degree of surface attention and the latter the degree of spot attention. The degree of surface attention is defined by the distance between the mean features (brightness and hue) of a figure segment and those of its surrounding ground segment. The degree of spot attention is defined by the maximum saliency over the points of a segment. Closedness of a segment is judged by whether it is closed in an image, that is, whether or not it extends outside the bounds of the image; a segment is defined as closed if it does not intersect the border of the image at more than a specified number of points. Attention bias represents an a priori or experientially acquired attentional tendency toward a region with a particular feature such as a face-like region; in the experiments in Section 4, a segment is judged as a face simply by its hue and aspect ratio. The attention degree A(s) of a segment s is then defined in expression (6) as a weighted sum of the degrees of surface attention, spot attention and attention bias, attenuated by a decrease rate of attention when the segment is not closed.
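A sketch of the attention degree computation follows. The symbol names (w_sf, w_sp, w_b, gamma) and the multiplicative form of the closedness penalty are assumptions; the paper's expression (6) is not reproduced exactly here.

```python
import numpy as np

def surface_attention(fig_mean, ground_mean):
    """Distance between mean (brightness, hue) features of a figure
    segment and its surrounding ground segment."""
    return float(np.linalg.norm(np.asarray(fig_mean) - np.asarray(ground_mean)))

def spot_attention(saliency_values):
    """Maximum saliency over the points of the segment."""
    return float(np.max(saliency_values))

def is_closed(border_hits, max_hits=0):
    """A segment is closed if it meets the image border at no more
    than `max_hits` points."""
    return border_hits <= max_hits

def attention_degree(surface, spot, bias, closed,
                     w_sf=0.5, w_sp=0.5, w_b=1.0, gamma=0.5):
    """Attention degree A(s): weighted sum of surface attention, spot
    attention and attention bias, attenuated by the decrease rate
    gamma when the segment is not closed (assumed form)."""
    a = w_sf * surface + w_sp * spot + w_b * bias
    return a if closed else gamma * a
```

A closed, high-contrast segment thus receives a larger attention degree than an open, low-contrast one, which is what drives the selection in the next step.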
In the fourth step (Figure 1), from these segments, the specified number of segments whose attention degrees are larger than the others' are selected as selective attended segments. In the fifth step (Figure 1), each selective attended segment and its neighboring segments are grouped as a co-occurring segment. If two sets of co-occurring segments overlap, they are combined into one co-occurring segment. This makes it possible to group part segments of an object, or to group salient contextual segments with an object.
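The grouping-and-combining rule can be sketched as follows, treating segments as ids and the neighborhood relation as given; the representation is an assumption of the sketch.

```python
def group_cooccurring(attended, neighbours):
    """Group each selective attended segment with its neighbouring
    segments, then merge any overlapping groups into one co-occurring
    segment.  `neighbours` maps a segment id to the ids of segments
    in its neighbourhood."""
    groups = [frozenset({a} | set(neighbours.get(a, ()))) for a in attended]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if groups[i] & groups[j]:          # overlap -> combine
                    groups[i] = groups[i] | groups[j]
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return [set(g) for g in groups]
```

Two attended segments that share a neighbor end up in a single co-occurring segment, which is how part segments of one object get collected together.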

Probabilistic Learning of Object Categories
The problem to be modeled is learning a probabilistic structure of object classes from object segments in each object category, where an object class statistically represents an appearance feature of the categorical object or of a co-occurring categorical object in context. In this problem, for each object category, a set of object segments is extracted through attention-guided organized perception from a set of scene images each of which contains the categorical object. Each object segment is represented by a BoF, and the proposed V-PLCA is applied to each object category to learn the probabilistic structure from the BoFs of object segments in the category.

Object Representation by Bags of Features
Let C be a set of categories and N_C be the number of categories. A category is a set of images each of which contains an object of the category in the foreground and other categorical objects in the background. Let s_{c,i,j} be the j-th segment extracted from an image i of a category c, S_c be the set of segments extracted from the images of category c, and N_{S_c} be the number of segments in S_c. Each object segment is represented by a BoF of the local features of its salient points. In order to calculate a BoF, first of all, any points in a segment whose saliency is above a given threshold are extracted as salient points at each level of the multi-level saliency map. As a local feature, a 128-dimensional SIFT feature is calculated for each salient point at its resolution level. Next, all the SIFT features for all the segments of all the images are clustered by the K-tree method [30] to obtain a set of key features as a code book. Let F be the set of key features of the code book, f_n be the n-th key feature of F, and N_F be the number of key features. Then a BoF of a segment is calculated for the SIFT features of its salient points by using this code book.
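The code book and BoF construction can be sketched as follows. Plain Lloyd k-means stands in for the K-tree clustering used in the paper, and low-dimensional random vectors stand in for 128-dimensional SIFT descriptors; both substitutions are for illustration only.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd k-means standing in for the paper's K-tree
    clustering; returns the code book of key features (centres)."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for c in range(k):
            pts = X[assign == c]
            if len(pts):
                centres[c] = pts.mean(axis=0)
    return centres

def bag_of_features(descriptors, codebook):
    """BoF of a segment: histogram of its local descriptors over the
    nearest key features of the code book, normalised to sum to 1."""
    d = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    hist = np.bincount(assign, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```

Each segment thus becomes a fixed-length distribution over the N_F key features, regardless of how many salient points it contains.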

Learning about Object Categories
The V-PLCA computes a probabilistic structure of classes Q_c = {q_{c,1}, …, q_{c,N_{Q_c}}} for each category c, where N_{Q_c} is the number of classes in c. The problem to be solved is estimating the class probabilities p(q_{c,r}), the conditional probabilities of segments p(s_{c,i,j} | q_{c,r}), the conditional probability distributions of key features p(f_n | q_{c,r}) and the number of classes that maximize the log-likelihood for the set of BoFs of the segments in the category. The class probability represents the composition ratio of object classes in an object category, the conditional probability of segments represents the degree to which object segments are instances of an object class and the conditional probability distribution of key features represents the feature of an object class.
When the number of classes is given, these probabilities are estimated by the tempered EM algorithm, in which the following E-step and M-step are iterated until convergence:

  E-step: p(q | s, f) = p(q) [p(s | q) p(f | q)]^β / Σ_{q′} p(q′) [p(s | q′) p(f | q′)]^β,

  M-step: p(f | q) ∝ Σ_s n(s, f) p(q | s, f), p(s | q) ∝ Σ_f n(s, f) p(q | s, f), p(q) ∝ Σ_s Σ_f n(s, f) p(q | s, f),

where n(s, f) is the count of the key feature f in the BoF of the segment s and β is a temperature coefficient.
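The tempered EM iteration above can be sketched as follows. Whether the temperature also raises p(q) to the power β varies between formulations; here only the conditional terms are tempered, which is an assumption of the sketch.

```python
import numpy as np

def tempered_em_plca(n_sf, n_classes, beta=1.0, iters=50, seed=0):
    """Tempered EM for PLCA on a segment-by-feature count matrix
    n(s, f).  Estimates p(q), p(s|q), p(f|q); beta is the temperature
    coefficient (beta = 1 gives ordinary EM)."""
    rng = np.random.default_rng(seed)
    S, F = n_sf.shape
    p_q = np.full(n_classes, 1.0 / n_classes)
    p_s_q = rng.random((S, n_classes)); p_s_q /= p_s_q.sum(0)
    p_f_q = rng.random((F, n_classes)); p_f_q /= p_f_q.sum(0)
    for _ in range(iters):
        # E-step: tempered posterior p(q | s, f), shape (S, F, Q)
        joint = (p_q[None, None, :] *
                 (p_s_q[:, None, :] * p_f_q[None, :, :]) ** beta)
        post = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-300)
        # M-step: re-estimate from posterior-weighted counts
        w = n_sf[:, :, None] * post            # n(s, f) p(q | s, f)
        p_q = w.sum(axis=(0, 1)); p_q /= p_q.sum()
        p_s_q = w.sum(axis=1); p_s_q /= np.maximum(p_s_q.sum(0), 1e-300)
        p_f_q = w.sum(axis=0); p_f_q /= np.maximum(p_f_q.sum(0), 1e-300)
    return p_q, p_s_q, p_f_q

def log_likelihood(n_sf, p_q, p_s_q, p_f_q):
    """Log-likelihood of the counts under the mixture."""
    mix = np.einsum("q,sq,fq->sf", p_q, p_s_q, p_f_q)
    return float(np.sum(n_sf * np.log(np.maximum(mix, 1e-300))))
```

With β = 1 this is ordinary EM, so the log-likelihood is non-decreasing over iterations, which the test below checks on a small block-structured count matrix.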
The number of classes is determined through an EM iterative process with subsequent class division. The process starts with one or a few classes, pauses after every certain number of EM iterations (up to an upper limit) and calculates, for each class, an index called the dispersion index. The class whose dispersion index takes the maximum value among all classes is divided into two classes, and this iterative process is continued until the dispersion indices of all classes become less than a certain threshold. When a class is divided, the new class is set up by specifying its conditional probability distribution of key features, its conditional probabilities of segments and its class probability from those of the source class. As a result of subsequent class division, the classes can be represented in a binary tree form.

The temperature coefficient β is set to 1.0 until the number of classes is fixed; after that it is gradually decreased according to a given schedule of the tempered EM until convergence.
The feature of an object category is represented by composing the conditional probability distributions of key features of the classes in the category. The composite probability distribution of key features for an object category c is obtained as the class-probability-weighted sum over its set of classes, p(f_n | c) = Σ_r p(q_{c,r}) p(f_n | q_{c,r}).
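The composition is a single weighted sum, sketched below; since the class probabilities sum to one and each p(f | q) is a distribution, the composite is itself a distribution over key features.

```python
import numpy as np

def composite_distribution(p_q, p_f_q):
    """Composite probability distribution of key features for a
    category: the class-probability-weighted sum of the per-class
    conditional distributions p(f | q).
    p_q: shape (Q,);  p_f_q: shape (N_F, Q)."""
    return p_f_q @ p_q      # shape (N_F,): sum_r p(q_r) p(f | q_r)
```

This composite vector is what is later compared against the BoF of an input image during recognition.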

Experiments
Two experiments were conducted to evaluate attention-guided organized perception and learning of object categories. The first experiment evaluates learning through attention-guided organized perception by using the Caltech-256 image data set [31], and the second evaluates recognition through learning about object categories by using the MSRC labeled image data set v2.

Experiment of Learning through Attention-Guided Organized Perception
The Caltech-256 image data set was used for evaluating learning through attention-guided organized perception.
For each of 20 categories, 4 images, each of which contains the categorical object and other categorical objects in context, were selected and used for experiments. Figure 2 shows some categorical images.
Main parameters were set as follows. The number of levels of the Gaussian resolution pyramid was 5. As for attention-guided organized perception, the interaction coefficient λ was 1.5, the threshold for segment mergence was 1.0, the weighting coefficients and the decrease rate for the attention degree of segments in expression (6) were 0.5, 0.5 and 1.0 respectively, and the upper bound number of selective attended segments was 4. As for learning, the threshold for salient points was 0.1, the threshold of class division was 0.07 and the correction coefficient α in expression (14) was 2.0. In the tempered EM, the temperature coefficient β was decreased by multiplying it by 0.95 at every 20 iterations until it became 0.8. The 20 categories used in the experiments were "bear", "butterfly", "chimp", "dog", "elk", "frog", "giraffe", "goldfish", "grasshopper", "helicopter", "hibiscus", "horse", "hummingbird", "ipod", "iris", "palm-tree", "people", "school-bus", "skyscraper" and "telephone-box".
Learning was performed on the sets of co-occurring segments extracted from the images of each category through attention-guided organized perception. The number of salient points, that is, of 128-dimensional SIFT features extracted from all these segments, was 76,019. The code book size of key features obtained by the K-tree method was 438. BoFs were calculated for the 181 segments whose numbers of salient points were more than 100.
Figure 3 shows co-occurring segments and their labels for some categorical images, extracted by attention-guided organized perception. Three types of co-occurring segments were observed. In the first type, an object consists of one segment and is grouped with its contextual segments; the examples of "telephone-box" and "hibiscus" in Figure 3 show organized perception of this type. In the second type, each co-occurring segment is a part of an object and the object consists of those segments; the examples of "people" and "school-bus" in Figure 3 show this type. In the third type, an object consists of plural segments and is also grouped with its contextual segments; the examples of "chimp" and "butterfly" in Figure 3 show this type.
Figure 4 shows some results of the V-PLCA, that is, object classes for some object categories in a binary tree form. In Figure 4, a typical segment of a class q_{c,r} of each category c is the segment s_{c,i,j} that maximizes p(q_{c,r} | s_{c,i,j}).

Experiment of Recognition through Learning about Object Categories
A composite probability distribution of key features for an object category is a weighted sum of the conditional probability distributions of key features of its object classes with their class probabilities. Figure 5 shows composite probability distributions of key features for all categories and Figure 6 shows the distance between each pair of them, which is defined by expression (18).
The MSRC labeled image data set v2 was used for evaluating recognition through learning about object categories. This data set contains 23 object categories and each image has a pixel-level ground truth in which each pixel is labeled as one of the 23 object categories or "void". Most images are associated with more than one object category. A collection of 14 sets of images was arranged from this data set; each set contained about 30 images, and each image in a set contained the same categorical object, considered to be in the foreground, with other categorical objects in the background. This made 14 object categories, and an image in each object category contained an object with the category label and other co-occurring objects with other labels among the 23 category labels. The 14 categories used in the experiments were "tree", "building", "airplane", "cow", "person", "car", "bicycle", "sheep", "sign", "bird", "chair", "cat", "dog" and "boat"; here a face and a body were both interpreted as a person. The total number of images was 420. Figure 7 shows some categorical images and their object segments with labels. In this experiment, labeled co-occurring object segments are supposed to be extracted from an image by attention-guided organized perception and used for learning and recognition. Images in the object categories were split into two parts for 2-fold cross validation. In order to represent features of segments, the 128-dimensional SIFT features of keypoints in all the segments were clustered by the K-tree method to generate a set of key features as a code book, and a BoF of each segment was calculated for its SIFT features by using this code book. The code book sizes of key features were 412 and 438 for the two learning sets respectively.
Main learning parameters were set as follows. The threshold of class division was 0.046 and the correction coefficient α in expression (14) was 2.0. In the tempered EM, the temperature coefficient β was decreased by multiplying it by 0.95 at every 20 iterations until it became 0.8.
Figure 8 shows some results of the V-PLCA, that is, object classes for some object categories in a binary tree form. In Figure 8, a typical segment of a class q_{c,r} of a category c is the segment s_{c,i,j} that maximizes p(q_{c,r} | s_{c,i,j}). The mean number of classes per category over the 14 categories was 7.21. Figure 9 shows the distance between each pair of composite probability distributions of key features for all categories, as defined by expression (18). The mean distance over all pairs of categories was 0.35.
Recognition is performed by computing the object category ĉ = argmin_c D(p(f | c), h_i), i.e., the category whose composite probability distribution of key features p(f | c), calculated by expression (17), gives the minimum distance D of expression (18) to the BoF h_i of an input categorical image i. Table 1 shows the average classification accuracy over the two image subsets for four different settings of recognition. In the rows of Table 1, a BoF for co-occurring segments is calculated for the region of a categorical image which consists of the categorical segment and its co-occurring segments, while a BoF for an entire image is calculated for the entire region of a categorical image. In the columns, training samples and test samples refer to the image subsets that are respectively used and not used for learning in the 2-fold cross validation. Since object category learning is performed on the co-occurring segments of training sample images, recognition using the entire region of training sample images is not the same as recognition using the same features as were used in learning: it uses features not only in the co-occurring segments of training sample images but also in the rest of those images. As a result, classification accuracy when using co-occurring segments of test sample images was higher than when using the entire region of training sample images, and naturally classification accuracy when using co-occurring segments of training sample images was the highest of the four settings. Thus, it was confirmed that extraction of co-occurring segments from images was effective for recognition through learning by our method.
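The recognition rule can be sketched as the following nearest-distribution classifier. The L1 distance is used here as a stand-in for the paper's distance of expression (18), which was not preserved; the category names in the usage example are illustrative.

```python
import numpy as np

def l1_distance(p, q):
    """Stand-in for the paper's distance (expression (18)) between two
    probability distributions of key features."""
    return float(np.sum(np.abs(np.asarray(p) - np.asarray(q))))

def recognize(bof, category_dists):
    """Recognise the category whose composite key-feature distribution
    is closest to the (normalised) BoF of the input image."""
    bof = np.asarray(bof, float)
    bof = bof / bof.sum()
    best, best_d = None, float("inf")
    for cat, dist in category_dists.items():
        d = l1_distance(bof, dist)
        if d < best_d:
            best, best_d = cat, d
    return best
```

An input BoF concentrated on the same key features as a category's composite distribution is assigned to that category.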

Discussion
The proposed attention-guided organized perception selects an object segment together with its contextual segments based on their saliency, and the proposed V-PLCA learns a probabilistic structure of appearance features of categorical objects in context from those segments for object category recognition. The distinguishing characteristic of the attention-guided organized perception is that spatial preattention is integrated into object-based selective attention for organized perception through segmentation on dynamically-formed MRFs. In the V-PLCA, the number of object classes in an object category need not be fixed in advance and is determined depending on the learning samples. This characteristic makes it easy to adapt to various features and data sets for learning without tuning size parameters of the method.
In the experiments of learning through attention-guided organized perception using the Caltech-256 image data set and of learning from co-occurring segments using the MSRC labeled image data set v2, it was confirmed that the probabilistic structure of appearance features of objects with context distinctively characterized object categories. It was also confirmed that extraction of co-occurring segments was effective for recognition, by showing through experiments using the MSRC labeled image data set v2 that classification accuracy was higher when using features of co-occurring segments than when using features of entire images. Note that recognition performance depends not only on learning and recognition methods but also on feature coding and pooling methods and on learning data sets [32]. The performance of our method is relatively high in comparison with existing methods which used SIFT-based features and the MSRC data set [25,26]. These results demonstrate that our categorical object learning achieves high recognition performance by using co-occurring segments extracted through attention-guided organized perception.

Conclusion
We have proposed a probabilistic model of learning object categories through attention-guided organized perception.In this model, a probabilistic structure of object categories is learned and used for recognition based on the probabilistic latent component analysis with the variable number of classes, which uses co-occurring segments extracted through the attention-guided organized perception on dynamically-formed Markov random fields.
Through experiments using images of plural categories in the Caltech-256 image data set and the MSRC labeled image data set v2, it was demonstrated that, by the attention-guided organized perception, our method extracted a set of co-occurring segments which consisted of objects and their context and that, from those co-occurring segments, our method learned a probabilistic structure which represented intra-categorical composition of objects and distinguished inter-categorical difference of objects.It was also confirmed that our method achieved high recognition performance of object categories.

Figure 3. Examples of (a) images, (b) co-occurring segments and (c) labels for some categories. Different labels are illustrated by different colors.

Figure 4. Object classes for some object categories in a binary tree form. A colored square shows an object class of the given category and a white square shows a co-occurring categorical object class in context. A value in parentheses represents a class probability, and a typical segment of each class is depicted beside the class. A representative co-occurring segment of each category is also depicted above each tree.

Figure 5. Probability distributions of key features for all object categories.

Figure 6. Distance between probability distributions of key features for pairs of object categories.

Figure 9. Distance between probability distributions of key features of object categories for the two learning sets.

Figure 8. Object classes for some object categories in a binary tree form. A colored square shows an object class of the given category and a white square shows a co-occurring categorical object class in context. A value in parentheses represents a class probability, and a typical segment of each class is depicted beside the class. A representative co-occurring segment of each category is also depicted above each tree.