Generate Faces Using Ladder Variational Autoencoder with Maximum Mean Discrepancy (MMD)

Generative models have been shown to be extremely useful in learning features from unlabeled data. In particular, variational autoencoders are capable of modeling highly complex natural distributions such as images, while extracting natural and human-understandable features without labels. In this paper we combine two highly useful classes of models, variational ladder autoencoders and MMD variational autoencoders, to model face images. In particular, we show that we can disentangle highly meaningful and interpretable features. Furthermore, we are able to perform arithmetic operations on faces and modify faces to add or remove high-level features.


Introduction
Generative models learn a probability density over the data and generate new observations from it. These models have been highly successful in a wide variety of tasks, such as semi-supervised learning, missing data imputation, and generation of novel data samples.
The variational autoencoder is a very important class of generative models [1] [2]. These models map a prior on latent variables to conditional distributions on the input space. Training by maximum likelihood is intractable, so a parametric approximate inference distribution is jointly trained; surprisingly, jointly training the generative model for maximum likelihood and the inference distribution to approximate the true posterior is tractable, through a "reparameterization trick" [1]. These models have been highly successful in modeling complex natural distributions such as natural images. In addition, it has been observed that these models can make use of the latent space in a meaningful manner; for example, they can learn to map different regions of the latent variable space to different object classes. It has also been observed that the evidence lower bound (ELBO) used in traditional variational autoencoders suffers from the uninformative latent feature problem [4], where these models tend to under-use the latent variables. Multiple methods have been proposed to alleviate this [4] [5]. In particular, [5] showed that this problem can be avoided altogether if an MMD loss is used instead of the KL divergence in the original ELBO.
In this paper we combine these ideas to build a variational ladder autoencoder with an MMD loss instead of the KL divergence, and utilize this model to analyze the structure and hidden features of human faces. As an application we use this model to perform "arithmetic" operations on faces. For example, we can perform operations such as: men with pale skin − men with dark skin + women with dark skin = women with pale skin. We do this by performing the arithmetic in the feature space and transforming the result back into image space. This can be potentially useful in games and virtual reality, where arbitrary features can be added to a face through the above process of analogy. It further demonstrates the effectiveness of our model in learning highly meaningful latent features.

Generative Modeling and Variational Autoencoders
Generative models seek to model a distribution $p_{\text{data}}(x)$ in some input space $\mathcal{X}$. Variational autoencoders posit a prior $p(z)$ over latent variables and a conditional distribution $p_\theta(x|z)$, and are trained by maximizing the evidence lower bound (ELBO)

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \mathrm{KL}\left(q_\phi(z|x) \,\|\, p(z)\right),$$

where KL denotes the Kullback-Leibler divergence. Intuitively this model achieves its goal by first applying an "encoder" $q_\phi(z|x)$ that maps each input to a distribution over latent codes, and then a "decoder" $p_\theta(x|z)$ that maps latent codes back to distributions over the input space.
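As a concrete illustration, the following is a minimal sketch of this objective, assuming PyTorch, a Gaussian encoder, and a unit-variance Gaussian decoder; the function and variable names are ours for illustration, not from the original implementation:

```python
import torch

def elbo(x, encoder, decoder):
    """One-sample Monte Carlo estimate of the Gaussian-VAE ELBO (sketch)."""
    mu, log_var = encoder(x)                    # parameters of q_phi(z|x)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps     # reparameterization trick
    x_recon = decoder(z)                        # mean of p_theta(x|z)
    # E_q[log p_theta(x|z)] up to an additive constant (unit-variance Gaussian decoder)
    recon = -0.5 * ((x - x_recon) ** 2).sum(dim=-1)
    # KL(q_phi(z|x) || N(0, I)) in closed form for a Gaussian encoder
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=-1)
    return (recon - kl).mean()
```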

Ladder Variational Autoencoder
Ladder variational autoencoders [3] add additional structure to the latent code by adding multiple layers to the model. The model is shown in Figure 1. High-level latent features are connected to the input through a deep network, while low-level features are connected through a shallow network. The intuition is that complicated features require deeper networks to model, so the high-level latent variables will be used to model high-level features, and vice versa. This makes it possible to disentangle simple and sophisticated features.
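A minimal PyTorch sketch may make this layered inference structure concrete; the architecture below (layer sizes, module and class names) is our own simplified rendering, not the authors' exact network:

```python
import torch
import torch.nn as nn

class LadderEncoder(nn.Module):
    """Infers one latent code per layer; deeper layers model higher-level features."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=16, n_layers=3):
        super().__init__()
        self.rungs = nn.ModuleList()   # deterministic backbone h_1 -> h_2 -> ...
        self.heads = nn.ModuleList()   # one stochastic head (mu, log_var) per layer
        d = x_dim
        for _ in range(n_layers):
            self.rungs.append(nn.Sequential(nn.Linear(d, h_dim), nn.ReLU()))
            self.heads.append(nn.Linear(h_dim, 2 * z_dim))
            d = h_dim

    def forward(self, x):
        zs, h = [], x
        for rung, head in zip(self.rungs, self.heads):
            h = rung(h)                                   # deeper h = deeper network
            mu, log_var = head(h).chunk(2, dim=-1)
            z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
            zs.append(z)                                  # z_1 shallow ... z_L deep
        return zs
```

Because the first code sits behind a single layer of processing while the last sits behind the full backbone, simple features naturally land in the shallow codes and abstract ones in the deep codes.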

MMD Regularization
The maximum mean discrepancy (MMD) between two distributions $q(z)$ and $p(z)$ can be estimated from samples as

$$\mathrm{MMD}^2\big(q(z), p(z)\big) = \mathbb{E}_{q(z),\,q(z')}\left[k(z, z')\right] + \mathbb{E}_{p(z),\,p(z')}\left[k(z, z')\right] - 2\,\mathbb{E}_{p(z),\,q(z')}\left[k(z, z')\right],$$

where $k(z, z')$ is a kernel function such as the Gaussian kernel. Intuitively $k(z, z')$ measures the similarity between $z$ and $z'$, and $\mathbb{E}_{p(z),\,q(z')}[k(z, z')]$ measures the average similarity between samples from $p(z)$ and $q(z')$. If the two distributions are identical, then the average kernel value among samples from $p$, among samples from $q$, and across samples from $p$ and $q$ should all be identical, so the MMD distance is zero. This term can be used to replace $\mathrm{KL}(q_\phi(z|x) \,\|\, p(z))$ in the VAE ELBO to achieve better properties.
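The sample-based form above translates directly into code. Below is a minimal sketch of an MMD estimate with a Gaussian kernel, assuming PyTorch; the bandwidth value and function names are our own choices:

```python
import torch

def gaussian_kernel(a, b, bandwidth=1.0):
    """k(z, z') = exp(-||z - z'||^2 / (2 * bandwidth)), evaluated for all pairs."""
    sq_dist = (a.unsqueeze(1) - b.unsqueeze(0)).pow(2).sum(-1)
    return torch.exp(-sq_dist / (2.0 * bandwidth))

def mmd(p_samples, q_samples):
    """Estimate of MMD^2(p, q) from two batches of samples."""
    return (gaussian_kernel(p_samples, p_samples).mean()
            + gaussian_kernel(q_samples, q_samples).mean()
            - 2.0 * gaussian_kernel(p_samples, q_samples).mean())
```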

MMD Variational Ladder Autoencoder
We apply MMD regularization to variational ladder autoencoders. In particular, we regularize each layer's latent features separately. This combines the advantages of both models and learns meaningful hierarchical features.
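Concretely, the objective becomes a reconstruction term plus one MMD penalty per layer, pulling each code's distribution toward its prior. A minimal sketch, reusing the mmd and LadderEncoder sketches above; the decoder and the regularization weight lam are hypothetical placeholders:

```python
import torch

def vlae_mmd_loss(x, encoder, decoder, lam=1000.0):
    """Reconstruction + per-layer MMD regularization (sketch, not the exact objective)."""
    zs = encoder(x)                                   # one latent code per layer
    x_recon = decoder(zs)                             # decoder consumes all layers
    recon = ((x - x_recon) ** 2).sum(dim=-1).mean()
    # regularize every layer's code toward its N(0, I) prior with MMD instead of KL
    reg = sum(mmd(torch.randn_like(z), z) for z in zs)
    return recon + lam * reg
```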

Experiments
To verify the effectiveness of our method we performed experiments on MNIST and CelebA [6]. We visualize the manifold learned for each dataset and observe extremely rich disentangled features.
Samples from MNIST are shown in Figure 2. We are able to disentangle visual features such as stroke width, inclination, and digit identity. For example, the bottom layer represents the style of the stroke, such as its width; the middle layer represents inclination; and the top layer mostly represents digit identity.
Samples from CelebA are shown in Figure 3. We are able to disentangle features such as lighting, hair style, face identity, and pose.
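The traversal behind these visualizations can be sketched as follows. This is one simple interpretation of the sampling scheme described in the Figure 2 caption (sweeping a whole layer's code with a single scalar while drawing the other layers from the prior), with hypothetical helper names:

```python
import torch

def traverse_layer(decoder, layer, n_layers=3, z_dim=16, n_steps=8):
    """Sweep one layer's code over [-3, 3]; sample the other layers from the prior."""
    sweep = torch.linspace(-3.0, 3.0, n_steps)
    zs = [torch.randn(n_steps, z_dim) for _ in range(n_layers)]   # prior samples
    zs[layer] = sweep.unsqueeze(1).expand(n_steps, z_dim)         # swept layer
    return decoder(zs)                                            # n_steps images
```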

Arithmetic Operations on Faces
We observed that by adding or subtracting values in the latent code, we can modify high-level features of the generated faces: we perform the arithmetic in latent space and decode the result back into image space with the decoder, which was trained to match generated results to the original data $x$ through the cost function $\log p_\theta(x|z)$. We observed convincing results from these experiments (as shown in Figure 4). The final results in the fourth column exhibit several arithmetic properties.
For example, the colors and brightnesses of all images are explicitly represented in the arithmetic result: the fourth-column images share similar colors and brightnesses with the first and third columns, while these properties differ from the second column. Moreover, more complicated features are also learned and applied, the most notable being facial expression.
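In code, the analogy reduces to vector arithmetic on the per-layer latent codes followed by decoding. A minimal sketch, reusing the encoder/decoder sketches above; for simplicity it operates on sampled codes, though the posterior means could be used instead:

```python
def face_arithmetic(encoder, decoder, x_a, x_b, x_c):
    """Decode enc(a) - enc(b) + enc(c), e.g. a = pale man, b = dark man, c = dark woman."""
    zs_a, zs_b, zs_c = encoder(x_a), encoder(x_b), encoder(x_c)
    zs_d = [za - zb + zc for za, zb, zc in zip(zs_a, zs_b, zs_c)]
    return decoder(zs_d)  # expected fourth-column result: woman with pale skin
```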

Figure 1. Structure of the VLAE (variational ladder autoencoder). Here circles are stochastic variables and diamonds are deterministic variables.

Figure 2. Training results on MNIST after 1 hour on a GTX 1080 Ti. Each plot is obtained by sampling one layer uniformly in the [−3, 3] range and the other layers randomly. Left: stroke style and width. Middle: digit inclination. Right: digit identity.

Figure 3. Training results on CelebA after roughly 8 hours on a GTX 1080 Ti. Left: lighting and white balance. Middle left: hair color, face color, and minor variations of facial features. Middle right: face identity. Right: pose and expression of the face.

Figure 4. Faces in the fourth column are obtained by subtracting the second column from the first and then adding the third column (i.e., column 1 − column 2 + column 3 in latent space).