Research on Image Generation and Style Transfer Algorithm Based on Deep Learning

In current artistic and animation creation, the conversion from sketch to stylized image involves a large amount of repeated manual work. This paper presents a solution based on a deep learning framework to realize image generation and style transfer. The method first uses a conditional generative adversarial network, optimizing the loss function that trains the mapping relationship, to generate a realistic image from an input sketch. Then, by defining and optimizing the perceptual loss function of a style transfer model, style features are extracted from a reference image, converting the realistic image into a stylized art image. Experiments show that this method can greatly reduce the work of coloring and of conversion between different artistic effects, and achieves the goal of transforming simple stick figures into realistic object images.

Although the learning process is automated, it still requires considerable manpower to design labels. In contrast, generative adversarial networks (GANs) pit a generative model against a discriminative model: while minimizing the adversarial loss, they effectively learn a loss function, which can then be used to generate new images.
Style transfer generates a new image by migrating the style of one reference image onto the content of another. Feed-forward image transformation has been widely applied: many such tasks train a deep convolutional neural network with a per-pixel difference loss [2], and some, by formulating the CRF as an RNN, train it jointly with the rest of the network. The structure of our transformation network was inspired by [3] and [4], using downsampling inside the network to reduce the spatial extent of the feature maps, followed by upsampling to produce the final output image. Some methods replace the per-pixel difference with a penalty on image gradients, or use a CRF loss layer to enforce consistency in the output image. A feed-forward model in [5] is trained with a per-pixel difference loss for colorizing grayscale images. A number of papers use optimization-based methods to produce images whose objectives are perceptual, where perceptual quality depends on high-level features extracted from a CNN. Mahendran and Vedaldi inverted features from convolutional networks by minimizing a feature-reconstruction loss, in order to understand the image information stored in different network layers; similar methods were also used to invert local binary descriptors [6] and HOG features [7]. The work of Dosovitskiy and Brox is most relevant to ours: they train a feed-forward neural network to invert convolutional features and quickly approximate the solution of the corresponding optimization problem. However, their feed-forward network is trained with a per-pixel reconstruction loss, while our network directly uses the feature-reconstruction loss of [8]. Gatys et al. demonstrated artistic style transfer [9] [10], combining a content image with a style image by minimizing a cost function reconstructed from features; the cost function for style reconstruction is likewise based on high-level features from a pretrained model, and a similar method had previously been used for texture synthesis. Their approach yields high-quality results, but the computational cost is very high because each iteration of the optimization requires a forward and backward pass through the pretrained network. To overcome this computational burden, this paper trains a feed-forward neural network to obtain a feasible solution quickly.
Our system consists of two parts: an image transformation network f_W and a loss network φ. The image transformation network is a deep residual network [11] with weights W; it converts an input image x into an output image y through the mapping y = f_W(x). Each loss function computes a scalar value l_i(y, y_i) measuring the difference between the output y and a target image y_i. The image transformation network is trained with SGD so that a weighted sum of these loss functions decreases. This paper implements the task of generating stylized art images from sketches. First, a conditional generative adversarial network [12] is used, optimizing the loss function that trains the mapping relationship, to generate a realistic image from the input sketch. We train a feed-forward network for the image transformation task, but instead of constructing the loss from per-pixel differences, we use a perceptual loss function based on high-level features extracted from a pretrained network. During training, the perceptual loss function is better suited than a per-pixel loss to measuring the similarity between images. After training, the image translation network achieves the expected effect, and because of the properties of the adversarial network, we no longer need to hand-design the mapping function as with an ordinary CNN. Experiments show that reasonable results can be achieved even without manually designing the loss function.
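The contrast drawn above between per-pixel and perceptual losses can be sketched in a few lines of numpy. The feature extractor below is a hypothetical stand-in (a fixed random linear map over 3x3 neighborhoods), not the pretrained CNN the paper uses; only the shape of the two loss definitions is meant to match the text.

```python
import numpy as np

def pixel_loss(y, y_target):
    """Per-pixel mean squared error: the baseline loss the paper argues against."""
    return float(np.mean((y - y_target) ** 2))

def feature_loss(y, y_target, extract):
    """Perceptual loss: mean squared error between high-level features of the
    two images, as produced by a fixed feature extractor."""
    return float(np.mean((extract(y) - extract(y_target)) ** 2))

# Stand-in extractor (illustrative only): a fixed random linear map that turns
# each 3x3 neighborhood into an 8-dimensional feature vector. A real system
# would use activations from a network pretrained for image classification.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 9))

def extract(img):
    H, Wd = img.shape
    patches = np.stack([img[i:i + 3, j:j + 3].ravel()
                        for i in range(H - 2) for j in range(Wd - 2)])
    return patches @ W.T  # one 8-dim feature per 3x3 neighborhood

target = rng.standard_normal((8, 8))
```

Both losses are zero when output and target coincide; they differ in what deviations they penalize, which is exactly the property the perceptual loss exploits.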

Image Generation Modeling: Structured Loss
Image-to-image transformation problems are usually formulated as per-pixel classification or regression [13], where the output space is treated as "unstructured": each output pixel is regarded as conditionally independent of all other pixels given the input image. Conditional GANs instead learn a structured loss, which penalizes the joint configuration of the output. A large body of literature considers losses of this type, such as conditional random fields [14], the SSIM metric [15], feature matching [16], nonparametric losses [17], the convolutional pseudo-prior [18], and losses based on matching covariance statistics [19]. Our conditional GAN differs from these in that the loss is learned and can, in theory, penalize any possible structure that differs between output and target.

Conditional GANs
This paper is not the first to apply GANs in a conditional setting. Previous works have conditioned GANs on discrete labels [20], on text, and so on. Image-conditioned GANs have addressed image inpainting [21], image prediction from normal maps [22], image editing under user constraints, video prediction, state prediction, and product generation and style transfer from photos [23] [24]. Each of these methods was tailored to a specific application; our method is simpler than most of them.
Our approach also differs from previous work in several architectural choices for the generator and discriminator. Unlike prior work, our generator uses a "U-Net" structure [25], and our discriminator uses a convolutional "PatchGAN" classifier. A similar PatchGAN structure was previously proposed to capture local style statistics.

Image Generation
GANs learn a mapping from a random noise vector z to an output image y, G: z → y. By contrast, conditional GANs learn a mapping from an observed image x and a random noise vector z to y:

G: {x, z} → y

The generator G is trained to produce images that the discriminator D cannot distinguish from real ones, while the discriminator D is trained to detect the generator's "fakes" as reliably as possible.

Objective Function for Image Generation
The objective function of a conditional GAN is:

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 - D(x, G(x, z)))]

G tries to minimize this objective while D tries to maximize it:

G* = arg min_G max_D L_cGAN(G, D)

To test the importance of conditioning the discriminator, we compare against a variant in which the discriminator does not observe x. Previous conditional-GAN work found that mixing the GAN objective with a traditional loss is beneficial: the discriminator's job remains the same, but the generator must not only fool the discriminator, it must also produce images close to the ground truth. Based on this consideration, the L1 distance is used rather than L2, because L1 encourages less blurring:

L_L1(G) = E_{x,y,z}[||y - G(x, z)||_1]

The final objective is:

G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G)
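The hybrid objective can be sketched in numpy. This is a minimal illustration of the loss terms only, assuming the discriminator outputs probabilities in (0, 1); the function names and the λ value used here are illustrative, not the paper's implementation, and the adversarial term is written in the non-saturating form commonly used in practice.

```python
import numpy as np

def l1_loss(y, y_hat):
    """L1 distance between real image y and generated image y_hat."""
    return float(np.mean(np.abs(y - y_hat)))

def generator_loss(d_fake, y, y_hat, lam=100.0):
    """Hybrid generator objective: fool the discriminator AND stay close to
    the ground truth in L1. lam is the weighting lambda (100 is only an
    illustrative choice here)."""
    adv = -float(np.mean(np.log(d_fake + 1e-8)))  # non-saturating GAN term
    return adv + lam * l1_loss(y, y_hat)

def discriminator_loss(d_real, d_fake):
    """The discriminator maximizes log D(x, y) + log(1 - D(x, G(x, z)));
    we return the negated value so it can be minimized."""
    return -float(np.mean(np.log(d_real + 1e-8)
                          + np.log(1.0 - d_fake + 1e-8)))
```

A perfect generator drives both the adversarial term and the L1 term toward zero, while a confident, correct discriminator attains a lower `discriminator_loss` than an uncertain one.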

Network Structure
This paper uses the generator and discriminator structures of [9], both built from "conv-BatchNorm-ReLU" convolution units. The appendix provides details of the network structure; below we discuss only the main features. Constructing a generator with skip connections: one feature of image transformation problems is that they map a high-resolution input grid to a high-resolution output grid. Moreover, for the problems we consider, input and output differ in surface appearance but share the same underlying structure, so the structure of the input is roughly aligned with that of the output. We design the generator with these considerations in mind, mimicking "U-Net" by adding skip connections. Specifically, we add a skip connection between layer i and layer n - i, where n is the total number of layers in the network; each skip connection simply concatenates the feature channels of layer i with those of layer n - i.
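The skip-connection scheme just described can be made concrete in a short sketch. The helper names are hypothetical and the "features" are plain arrays; the point is only the i ↔ n - i pairing and the channel concatenation the text describes.

```python
import numpy as np

def skip_pairs(n):
    """Layer pairs (i, n - i) joined by skip connections in an n-layer generator."""
    return [(i, n - i) for i in range(1, n) if i < n - i]

def skip_concat(enc_feat, dec_feat):
    """A skip connection simply concatenates feature channels (axis 0 = channels)."""
    return np.concatenate([enc_feat, dec_feat], axis=0)

print(skip_pairs(8))  # [(1, 7), (2, 6), (3, 5)]
```

Concatenating a 16-channel encoder map with a 16-channel decoder map yields a 32-channel map, which is why U-Net decoder layers take twice the channel count of their mirrored encoder layers.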
Constructing a Markovian discriminator (PatchGAN): it is well known that L1 and L2 losses produce blurry results in image generation problems. The discriminator structure we design therefore penalizes structure only at the scale of patches: it classifies each N × N patch of the image as real or fake.

We run this discriminator convolutionally (as a sliding window) across the entire image and average the responses to obtain the final output of D. Such a discriminator models the image as a Markov random field, assuming that pixels separated by more than a patch diameter are independent of one another. This assumption has been studied before and is commonly used in texture and style models, so our PatchGAN can be understood as a form of texture/style loss. Optimization and inference: to optimize the network, we use the standard approach of alternating between training D and G. We use minibatch SGD with the Adam optimizer. At inference time, we run the generator in exactly the same way as during training.
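The patch-then-average behavior of the discriminator can be sketched as follows. For simplicity this sketch uses non-overlapping patches rather than the sliding window described above, and the patch "classifier" is a toy stand-in, not a trained convolutional network.

```python
import numpy as np

def patchgan_output(img, classify_patch, N=4):
    """Apply an N x N patch classifier across the image and average the
    per-patch real/fake scores into a single scalar output for D."""
    H, W = img.shape
    scores = [classify_patch(img[i:i + N, j:j + N])
              for i in range(0, H - N + 1, N)
              for j in range(0, W - N + 1, N)]
    return float(np.mean(scores))

# Toy stand-in classifier: calls a patch "real" when it has enough local
# variance (real textures are rarely flat).
classify = lambda patch: float(patch.var() > 0.1)
```

Because the score is an average over local decisions, the discriminator's judgment depends only on patch-scale statistics, which is exactly the Markovian assumption stated above.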

Style Transfer
The system consists of two parts: an image transformation network f_W and a loss network φ (used to define a series of loss functions l_1, l_2, l_3). The image transformation network is a deep residual network with weights W; it converts the input image x into the output image y through the mapping y = f_W(x). Each loss function computes a scalar value l_i(y, y_i), which measures the difference between the output y and a target image y_i. The image transformation network is trained by SGD; the resulting effect is shown in Figure 1.
The goal is to minimize a weighted sum of the loss functions:

W* = arg min_W E_x[ Σ_i λ_i l_i(f_W(x), y_i) ]

We use a network φ pretrained for image classification to define our loss functions, and then train our deep convolutional transformation network against them. The loss network φ defines a feature (content) loss l_feat and a style loss l_style, measuring the difference in content and in style respectively. For each input image x we have a content target y_c and a style target y_s. For style transfer, the content target y_c is the input image x, and the output image y should combine the style of y_s with the content of x = y_c. We train one network per target style.

Construction of the Image Transformation Network
We do not use any pooling layers; instead we use strided and fractionally strided convolutions for in-network downsampling and upsampling. The body of our network consists of five residual blocks. All non-residual convolutional layers are followed by spatial batch normalization and a ReLU nonlinearity, with the exception of the last output layer, which uses a scaled tanh to keep the output pixels in range. For super-resolution with upsampling factor f, we use several residual blocks followed by log2(f) convolutional layers with stride 1/2. This differs from [1], where bicubic interpolation is used to upsample the low-resolution input before it is fed into the network. Rather than relying on any fixed upsampling function, fractionally strided convolution allows the upsampling function to be learned together with the rest of the network. For image transformation, our network uses two stride-2 convolutions to downsample the input, followed by several residual blocks, followed by two fractionally strided convolution layers (stride 1/2) to upsample.
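The residual blocks that form the body of the network follow the standard identity-shortcut pattern. The sketch below uses fixed channel-mixing matrices as toy stand-ins for the convolutions (an assumption for brevity, not the actual 3x3 convolutions), to show the shape-preserving structure.

```python
import numpy as np

def residual_block(x, conv1, conv2):
    """Residual block: out = x + conv2(relu(conv1(x))). The identity shortcut
    makes it easy for the network to preserve the input's structure."""
    return x + conv2(np.maximum(conv1(x), 0.0))

# Toy stand-ins for the convolutions: fixed channel-mixing matrices.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8)) * 0.1
B = rng.standard_normal((8, 8)) * 0.1
x = rng.standard_normal((8, 32))  # 8 channels x 32 spatial positions
y = residual_block(x, lambda v: A @ v, lambda v: B @ v)
print(y.shape)  # (8, 32): a residual block preserves the feature-map shape
```

Because the block's output shape equals its input shape, any number of residual blocks can be stacked between the downsampling and upsampling stages without changing the spatial bookkeeping.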

Perceptual Loss Function
We define two perceptual loss functions that measure high-level perceptual differences from the target y. Beyond content, we also want to penalize deviations in style: colors, textures, common patterns, and so on. To achieve this, Gatys et al. proposed the following style-reconstruction loss. Let φ_j(x) denote the activations of the j-th layer of network φ with input x, a feature map of shape C_j × H_j × W_j. Define the Gram matrix G_j(x) as the C_j × C_j matrix with elements

G_j(x)_{c,c'} = (1 / (C_j H_j W_j)) Σ_{h,w} φ_j(x)_{h,w,c} φ_j(x)_{h,w,c'}

If we interpret φ_j(x) as H_j × W_j samples of a C_j-dimensional feature, then G_j(x) is proportional to the uncentered covariance of that feature, with each grid location treated as an independent sample.
The Gram matrix therefore captures information about which features tend to activate together. It can be computed efficiently by reshaping φ_j(x) into a matrix ψ of shape C_j × H_j W_j; then G_j(x) = ψψ^T / (C_j H_j W_j). The style-reconstruction loss is the squared Frobenius norm of the difference between the Gram matrices of the output and target images. This loss is well defined even when the output and target have different sizes, because their Gram matrices both have shape C_j × C_j.
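The Gram-matrix computation and the style loss built on it translate directly into numpy. The function names here are illustrative; the formulas follow the definitions above.

```python
import numpy as np

def gram_matrix(phi):
    """G_j(x) = psi psi^T / (C*H*W), where psi reshapes the C x H x W feature
    map phi into a C x (H*W) matrix, as defined above."""
    C, H, W = phi.shape
    psi = phi.reshape(C, H * W)
    return psi @ psi.T / (C * H * W)

def style_loss(phi_y, phi_target):
    """Squared Frobenius distance between Gram matrices; well defined even
    when the two feature maps have different spatial sizes."""
    d = gram_matrix(phi_y) - gram_matrix(phi_target)
    return float(np.sum(d ** 2))
```

Note that a 4 × 8 × 8 feature map and a 4 × 16 × 16 feature map both yield 4 × 4 Gram matrices, which is what makes the loss size-independent.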

Conditional Adversarial Network Model
To test the versatility of conditional GANs, we evaluated the method on a variety of tasks and datasets, including graphics tasks (such as photo generation) and vision tasks (such as semantic segmentation). We found that good results are often obtained even on small datasets: our training set contains only 400 images, and training is very fast at this size.
Some of the hyperparameters are shown in Table 1.
Qualitative results: the trained model and its actual generated outputs are displayed. Three groups of images are shown in Figure 3: the first column is the input, the second column is the output (the model's generated result), and the third column is the ground truth.

Style Transfer Results
The goal of style transfer is to produce an image with both the content information of the content image and the style information of the style image. As a baseline, we reproduce the method of Gatys et al.: given style and content targets y_s and y_c, with layers j and J used for feature and style reconstruction respectively, an image is produced by solving

y = arg min_y λ_c l_feat^{φ,j}(y, y_c) + λ_s l_style^{φ,J}(y, y_s)

Model Combination
We combine the conditional adversarial network model with the style transfer model to good effect. The results are shown in Figure 5: column (a) is the sketch, column (b) is the generated result, and column (c) is the result after style transfer.

Conclusion
In this paper, we combine the advantages of feed-forward networks and optimization-based approaches, achieving good quality and speed by training a feed-forward network with a perceptual loss function. We use a conditional adversarial network to implement image translation, and finally combine the two models to achieve the desired effect in a specific application scenario. However, the transfer of fine details is still lacking for some image styles, and we will improve the network's capabilities in two directions. First, although a trained model generates images very quickly, training the model still takes several hours; we hope to optimize the training process and reduce training time. Second, to better handle image detail, more detail-extraction capability can be added to the network when transferring image style, in order to achieve more realistic comic-style transfer effects, to imitate different painters' brushstrokes, and to adapt parameters separately for buildings and characters.