Unsupervised Multi-Level Non-Negative Matrix Factorization Model: Binary Data Case

Rank determination is one of the most significant issues in non-negative matrix factorization (NMF) research. However, the rank determination problem has not received as much emphasis as the sparseness regularization problem. Usually, the rank of the base matrix needs to be assumed. In this paper, we propose an unsupervised multi-level non-negative matrix factorization model to extract the hidden data structure and seek the rank of the base matrix. From a machine learning point of view, the learning result depends on the prior knowledge. In our unsupervised multi-level model, we construct a three-level data structure for the non-negative matrix factorization algorithm. Such a construction applies more prior knowledge to the algorithm and obtains a better approximation of the real data structure. The final selection of bases is achieved through L2-norm optimization. We run our experiments on binary datasets. The results demonstrate that our approach is able to retrieve the hidden structure of the data and thus determine the correct rank of the base matrix.


Introduction
Non-negative matrix factorization (NMF) was proposed by Lee and Seung [1] in 1999. NMF has become a widely used technique over the past decade in the machine learning and data mining fields. Its most significant properties are that it is non-negative, intuitive, and yields a part-based representation. Specific applications of the NMF algorithm include image recognition [2], audio and acoustic signal processing [3], and semantic analysis and content surveillance [4]. In NMF, given a non-negative dataset $V \in \mathbb{R}_{\ge 0}^{M \times N}$, the objective is to find two non-negative factor matrices $W \in \mathbb{R}_{\ge 0}^{M \times K}$ and $H \in \mathbb{R}_{\ge 0}^{K \times N}$. Here W is called the base matrix and H is called the feature matrix. In addition, W and H satisfy $V \approx WH$ (1), where K is the rank of the base matrix and satisfies the inequality $K < MN/(M+N)$. In NMF research, the cost function and the initialization problem have been the main issues for researchers. More recently, the rank determination problem has attracted growing attention. The rank of the base matrix is indeed an important parameter for evaluating the accuracy of structure extraction. On the one hand, it reflects the real features and properties of the data; on the other hand, more accurate learning helps us better understand and analyze the data, thus improving the performance in applications such as recognition [5,6], surveillance, and tracking. The main challenge of the rank determination problem is that the rank is pre-defined. Therefore, it is hard to know the correct rank of the base matrix before the components are updated. As with the cost function, previous methods add no further priors to the algorithm. That is why the canonical NMF method and traditional probabilistic methods (ML, MAP) cannot handle the rank determination problem. Therefore, in this paper we propose an unsupervised multi-level model to automatically seek the correct rank of the base matrix. Furthermore, we use the L2-norm to show the contribution of the hyper-prior to the learning of the correct bases. Experimental results on two binary datasets demonstrate that our method is efficient and robust.
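The rank bound stated in the definition above can be checked numerically. The following is a minimal sketch; the helper name `max_rank` is hypothetical and does not come from the paper, and the example sizes assume that each 32 × 32 image is stored as one column of V.

```python
def max_rank(M, N):
    """Largest integer K satisfying the strict bound K < M*N / (M + N)."""
    # Integer arithmetic avoids floating-point issues: K * (M + N) < M * N.
    return (M * N - 1) // (M + N)


# Example (assumed layout): 69 images of 32 x 32 pixels, one image per column,
# so M = 1024 and N = 69.
print(max_rank(1024, 69))  # prints 64
```

Under this assumed layout, an initial rank such as K = 16 stays well inside the bound.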
The rest of this paper is organized as follows: Section 2 provides a brief review of related works. In Section 3, we describe our unsupervised multi-level NMF model in detail. The experimental results on two binary datasets are shown in Section 4. Section 5 concludes the paper.

Related Works
[...] needs to pass through all the possible values of the rank of the base matrix to choose the best one. Obviously, this method is not well suited to unsupervised learning. In [8], the author proposed a rank determination method based on automatic relevance determination. In this method, a parameter is defined that is relevant to the columns of W. The EM algorithm is then used to find a subset of bases; however, this subset is not accurate enough to represent the true bases. In fact, the role of this hyper-parameter is to affect the updating procedure of the base matrix and the feature matrix, and thus the distributions of the components.
The only feasible solutions are fully Bayesian models. Such methods have been proposed in [9], where the author presents an EM-based fully Bayesian algorithm to discover the rank of the base matrix. EM-based methods are approximate solutions. In comparison, Gibbs-sampling-based methods are somewhat more accurate. Such an approach is used to find the correct rank in [10]. Although such methods are flexible, they require successive calculation of the marginal likelihood for each possible value of the rank K. The drawback is the high computation cost involved. Moreover, when such methods are applied to real-time applications or to applications based on large-scale datasets, the high computation load is impractical. Motivated by this situation, we propose a low-computation, robust multi-level model for NMF to solve the rank determination problem. Our unsupervised model with multi-level priors calculates the rank of the base matrix only once and is able to find the correct rank of the base matrix given a large enough initial rank K. Therefore, our method involves less computation. This will be discussed in detail in the next section.

Unsupervised Multi-Level Non-Negative Matrix Factorization Model
In our model, we seek the solutions by optimizing the maximum a posteriori (MAP) criterion. Our approach can be depicted by equation (2), in which $\stackrel{c}{=}$ denotes equality up to a constant and the prior parameter is shared by both W and H.
The difference between our approach and the traditional MAP criterion is that the traditional one adds no hyper-prior to the model. Moreover, in our model we update the hyper-priors recursively rather than setting them to a constant.
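One plausible form of the MAP objective described above is sketched below. The symbol $\lambda$ for the shared prior parameter and the split into four terms are assumptions made for illustration; the paper's own equation (2) may be arranged differently.

```latex
\log p(W, H, \lambda \mid V) \stackrel{c}{=}
  \log p(V \mid WH) + \log p(W \mid \lambda) + \log p(H \mid \lambda) + \log p(\lambda)
```

In this sketch, fixing $\lambda$ to a constant removes the last term and leaves a traditional MAP criterion without a hyper-prior.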

Model Construction
In the NMF algorithm, the updating rules are based on the specific data model. Therefore, the first step is to set a data model for our problem. In our experiments, we assume that the data follow a Poisson distribution. Consequently, the cost function of our model is the generalized KL-divergence. Each entry $v_{mn}$ of V is modeled as a Poisson variable whose parameter is the corresponding entry $[WH]_{mn}$, so given the dataset V we have the likelihood $p(V \mid WH) = \prod_{m,n} [WH]_{mn}^{v_{mn}} e^{-[WH]_{mn}} / \Gamma(v_{mn}+1)$. The generalized KL-divergence is given by $D_{KL}(V \,\|\, WH) = \sum_{m,n} \left( v_{mn} \log \frac{v_{mn}}{[WH]_{mn}} - v_{mn} + [WH]_{mn} \right)$. Thus, the log-likelihood of the dataset V can be rewritten as $\log p(V \mid WH) = -D_{KL}(V \,\|\, WH) + \sum_{m,n} \left( v_{mn} \log v_{mn} - v_{mn} - \log \Gamma(v_{mn}+1) \right)$ (5). From (2) and (5) we conclude that maximizing the posterior is equivalent to maximizing the log-likelihood, and maximizing the log-likelihood is equivalent to minimizing the KL-divergence. Thus, maximizing the posterior is equivalent to minimizing the KL-divergence. Therefore, it is possible to find a base matrix W and a feature matrix H that approximate the dataset V via the maximum a posteriori criterion.
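The equivalence between maximizing the Poisson log-likelihood and minimizing the generalized KL-divergence can be checked numerically: their sum should not depend on the approximation WH. The following is a minimal sketch using only the standard library; the function names and the toy matrices are illustrative assumptions, not taken from the paper.

```python
import math


def generalized_kl(V, A):
    """Generalized KL-divergence D(V || A), with the convention 0 * log(0) = 0."""
    total = 0.0
    for v_row, a_row in zip(V, A):
        for v, a in zip(v_row, a_row):
            if v > 0:
                total += v * math.log(v / a)
            total += a - v
    return total


def poisson_log_likelihood(V, A):
    """log p(V | A) when each entry v is Poisson distributed with rate a > 0."""
    total = 0.0
    for v_row, a_row in zip(V, A):
        for v, a in zip(v_row, a_row):
            total += v * math.log(a) - a - math.lgamma(v + 1)
    return total


# Toy binary data and two different strictly positive approximations of it.
V = [[1, 0], [0, 1]]
A1 = [[0.9, 0.2], [0.1, 0.8]]
A2 = [[0.5, 0.5], [0.5, 0.5]]

# The sum of log-likelihood and divergence depends only on V.
c1 = poisson_log_likelihood(V, A1) + generalized_kl(V, A1)
c2 = poisson_log_likelihood(V, A2) + generalized_kl(V, A2)
print(abs(c1 - c2) < 1e-12)  # prints True
```

Because the sum is constant in the approximation, any change that raises the log-likelihood lowers the divergence by the same amount.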
Then the log-likelihood of the priors could be rewritten as: [...] as a constant, the diversity of [...]

Inference
After the establishment of the data model and the deduction of the log-likelihood of each prior, we can obtain the maximum a posteriori equation (12). Since the first factor in (12) has nothing to do with the priors W and H, and we have discussed the relationship between the posterior probability and the KL-divergence, here we minimize the second factor to seek the solutions for this criterion. In our paper, we choose the gradient descent updating method as our updating rule. Although the multiplicative method is simpler, it comes with no detailed deduction of why the approach works. In contrast, gradient descent updating gives a clear deduction of the whole updating procedure. We utilize this method to infer the priors W and H, as well as the hyper-priors, namely the prior parameter of W and H and b. First we find the gradient of the parameters. Then we utilize the gradient coefficient to get rid of the subtraction operation during the updating procedure for W and H, which guarantees the non-negativity constraint. The per-component prior parameter and $b_k$ are updated by setting the corresponding gradients to zero. The updating rules are listed as follows: [...] Then we find the correct bases and determine the order of the data model by: [...] where B is defined as [...], and $R_0$ is the rank of the base matrix.
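The updating procedure described above can be sketched as follows. This is not the paper's exact rule set: it assumes the symbol `lam` for the per-component prior parameter, an exponential prior with rate `lam[k]` on column k of W and row k of H, a Gamma(a, b) hyper-prior with a fixed rate `b` (the recursive update of b_k is not reproduced here), and the common step-size choice that turns the gradient step into a multiplicative one. The value `b = 1.0` and the random initialization are placeholders.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def ratio(V, W, H, eps):
    """Element-wise quotient V / (WH)."""
    WH = matmul(W, H)
    return [[v / (wh + eps) for v, wh in zip(v_row, wh_row)]
            for v_row, wh_row in zip(V, WH)]


def map_nmf(V, K, a=2.0, b=1.0, iters=100, seed=0, eps=1e-12):
    """MAP-style NMF sketch: KL cost, exponential priors, closed-form lam update."""
    rng = random.Random(seed)
    M, N = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(K)] for _ in range(M)]
    H = [[rng.random() + 0.1 for _ in range(N)] for _ in range(K)]
    lam = [1.0] * K
    for _ in range(iters):
        # H step: the step size is chosen so that the subtraction disappears.
        R = ratio(V, W, H, eps)
        w_sum = [sum(W[m][k] for m in range(M)) for k in range(K)]
        H = [[H[k][n] * sum(W[m][k] * R[m][n] for m in range(M))
              / (w_sum[k] + lam[k]) for n in range(N)] for k in range(K)]
        # W step: same construction, so W also stays non-negative.
        R = ratio(V, W, H, eps)
        h_sum = [sum(H[k]) for k in range(K)]
        W = [[W[m][k] * sum(R[m][n] * H[k][n] for n in range(N))
              / (h_sum[k] + lam[k]) for k in range(K)] for m in range(M)]
        # lam step: zero of the gradient of the log-posterior with respect to lam[k].
        w_sum = [sum(W[m][k] for m in range(M)) for k in range(K)]
        lam = [(M + N + a - 1.0) / (w_sum[k] + h_sum[k] + b) for k in range(K)]
    return W, H, lam


# Toy binary data: four pixels, three samples, deliberately over-sized rank K = 3.
V = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]]
W, H, lam = map_nmf(V, K=3)
```

Every factor in the W and H steps is non-negative, so the non-negativity constraint holds without any projection. A component whose column of W and row of H shrink receives a larger `lam[k]`, which shrinks it further in later iterations.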

Experimental Results
In this section, we apply our unsupervised multi-level NMF algorithm to two binary datasets. One is the fence dataset, and the other is the well-known swimmer dataset. Both experimental results demonstrate the efficacy of our method on the rank determination issue.

Fence Dataset
We first performed our experiments on the fence dataset.
Here we define the data with four row bars (each of size 1 × 32) and four column bars (each of size 32 × 1). The size of each image is 32 × 32 with a zero-valued background, and the value of each pixel in the eight bars is one. Each image is separated into five parts in both the horizontal and the vertical direction. Additionally, in each image the number of row bars and the number of column bars must be the same. For instance, if there are two row bars in a sample image, then there must be two column bars in this image. Hence, the total number of images in the fence dataset is N = 69 (choosing k row bars and k column bars from the four available positions of each, for k = 1, ..., 4, gives 16 + 36 + 16 + 1 = 69 images). The samples of the fence dataset are shown in Figure 2.
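The construction of the fence dataset can be sketched as below, which also confirms the count N = 69. The bar positions in `BAR_POSITIONS` are hypothetical, since the paper does not list the exact rows and columns; any four interior positions that split the image into five parts would serve.

```python
from itertools import combinations

SIZE = 32
# Hypothetical bar positions: four interior indices that split 0..31 into five parts.
BAR_POSITIONS = (6, 12, 18, 24)


def fence_images():
    """All binary images with k row bars and k column bars, for k = 1..4."""
    images = []
    for k in range(1, len(BAR_POSITIONS) + 1):
        for rows in combinations(BAR_POSITIONS, k):
            for cols in combinations(BAR_POSITIONS, k):
                # A pixel is 1 if it lies on a selected row bar or column bar.
                img = [[1 if (r in rows or c in cols) else 0
                        for c in range(SIZE)] for r in range(SIZE)]
                images.append(img)
    return images


images = fence_images()
print(len(images))  # prints 69
```

Although there are 69 images, they are built from only eight distinct bars, which is the rank that the factorization is expected to recover.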
Here, we set the initial rank K = 16 (the initial value of rank K needs to be larger than the real rank of the base matrix) and the hyper-parameter a = 2, [...] 1 0.05 0.05. Figure 3 shows the base matrix and feature matrix learned via our unsupervised multi-level NMF approach. We can see that the data is sparse, especially the base matrix. In both images, the colored parts denote the effective bases or features, and the black parts denote irrelevant bases or features. In addition, from an image processing perspective, we can conclude that the values of irrelevant bases and features are very small compared to those of effective bases and features, since such pixels are very dark. We can clearly see that there are eight colored column vectors in the first image. Among these eight vectors, four are composed of several separated colored pixels, whereas the other four are composed of contiguous pixels. In fact, the former four vectors are row bars, and the latter four are column bars. We reshape the images into columns during the factorization procedure; hence the row bars and column bars have different structures. Furthermore, there are also eight rows in the second image, which are the corresponding coefficients of the bases. Therefore, we conclude that our algorithm is powerful and efficient at finding the real basic components and the correct rank.

Swimmer Dataset
The other dataset we used is the swimmer dataset. The swimmer dataset is a typical dataset for feature extraction. Because of the clear definition and composition of its 16 dynamic parts, it is well suited to the characteristic strength of the NMF algorithm, which is learning part-based data. The swimmer dataset, however, is a gray-level image dataset. Since our experiments focus on binary datasets, we first convert this gray-level dataset to a binary dataset and then apply our approach to perform inference. The swimmer dataset contains 256 images in total, each of which depicts a swimming gesture using one torso and four dynamic limbs. The size of each image is 32 × 32. Each dynamic part can appear at four different positions. Figure 6 shows the experimental results for the swimmer dataset. It can be observed that for this dataset, too, our algorithm finds the correct bases. In this figure there are 25 base images. The black ones correspond to irrelevant bases, and the other 17 images depict the torso and the limbs at each possible position. We can see that the correct torso and limbs are discovered successfully.
The differences between the black images and the correct base images are shown in Figure 7.
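The counts quoted for the swimmer dataset and the gray-level to binary conversion can be sketched as follows. The thresholding rule (any pixel above `threshold` becomes 1, with a default of 0) is an assumption, because the paper does not state how the conversion was done.

```python
from itertools import product


def binarize(image, threshold=0):
    """Map a gray-level image (list of rows) to {0, 1}; pixels above threshold become 1."""
    return [[1 if px > threshold else 0 for px in row] for row in image]


# Four limbs, each at one of four positions: 4 ** 4 = 256 configurations.
configurations = list(product(range(4), repeat=4))
# Distinct parts: 4 limbs x 4 positions, plus the static torso.
parts = 4 * 4 + 1

print(len(configurations), parts)  # prints 256 17
print(binarize([[0, 128], [255, 0]]))  # prints [[0, 1], [1, 0]]
```

The 17 distinct parts match the 17 non-black base images reported for this dataset.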

Conclusion
We have presented an unsupervised multi-level non-negative matrix factorization algorithm which is powerful and efficient at seeking the correct rank of a data model. This is achieved by introducing a multi-prior structure.
The experimental results on binary datasets demonstrate the efficacy of our algorithm. Compared to the fully Bayesian method, it is simpler and more convenient. The crucial points of this method are how to introduce the hyper-priors and which kind of prior is appropriate for a given data model. This algorithm could also be extended to other data models and noise models.
Although our experiments are based on binary datasets, this algorithm is also suitable for other datasets such as gray-level datasets, color datasets, etc.


In the data model $p(V \mid WH)$ we regard WH as the parameter of the data V. With respect to the base matrix W and the feature matrix H, we also introduce a parameter as a prior on them. Moreover, we define an independent exponential distribution for each column of W and each row of H, with a prior parameter indexed by the component k, because the exponential distribution is more sharply peaked. Undoubtedly, we could choose other exponential family distributions such as the Gaussian distribution, the Gamma distribution, etc. Therefore, the columns of W and rows of H yield:
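A standard way to write the exponential priors described above is sketched below. The symbol $\lambda_k$ for the prior parameter of component k and the rate parameterization are assumptions; the paper's own notation did not survive in this copy.

```latex
p(w_{mk} \mid \lambda_k) = \lambda_k \, e^{-\lambda_k w_{mk}}, \qquad
p(h_{kn} \mid \lambda_k) = \lambda_k \, e^{-\lambda_k h_{kn}}, \qquad
w_{mk} \ge 0, \; h_{kn} \ge 0

\log p(W \mid \lambda) = \sum_{k=1}^{K} \Big( M \log \lambda_k - \lambda_k \sum_{m=1}^{M} w_{mk} \Big)
```

Under this form, a large $\lambda_k$ concentrates the prior near zero, which is how an irrelevant column of W and its matching row of H are driven toward zero together.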

[...] and the recursive updating of the prior parameter enable the inference procedure to converge at the stationary point. By calculating the L2-norm of each column of the base matrix W, we find that the columns finally fall into two clusters. One cluster contains the points whose L2-norms are much larger than 0, whereas in the other cluster the L2-norm values are 0 or almost 0. In order to find the best value for the prior parameter of each component k, we introduce a hyper-prior for it. Since it is the parameter of [...]
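The L2-norm clustering of the columns of W described above can be sketched as a simple thresholding step. The relative tolerance `rel_tol` and the function names are assumptions for illustration; the paper separates the two clusters without stating a numeric threshold.

```python
import math


def column_norms(W):
    """L2-norm of each column of W, where W is a list of rows."""
    return [math.sqrt(sum(x * x for x in col)) for col in zip(*W)]


def effective_rank(W, rel_tol=1e-3):
    """Count the columns whose norm is not negligible relative to the largest norm."""
    norms = column_norms(W)
    largest = max(norms)
    if largest == 0.0:
        return 0, []
    kept = [k for k, n in enumerate(norms) if n > rel_tol * largest]
    return len(kept), kept


# Toy base matrix: one effective column, one zero column, one almost-zero column.
W = [[3.0, 0.0, 1e-9], [4.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(effective_rank(W))  # prints (1, [0])
```

When the two clusters are well separated, as the text reports, the result is insensitive to the exact tolerance.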

Figure 2. Sample images of fence dataset.

Figure 3. Base matrix W and feature matrix H learned via [...]

Figure 5 shows some sample images of the swimmer dataset. In this experiment, [...] the initial values of the hyper-parameters are a = 2, [...]

Figure 4. The bases obtained by our algorithm on fence dataset.

Figure 6. The bases of swimmer dataset learned by our algorithm.
[1] D. D. Lee and H. S. Seung, "Learning the Parts of Objects by Non-Negative Matrix Factorization," Nature, 1999.