Fine-Grained Classification of Product Images Based on Convolutional Neural Networks

With the rapid development of the Internet of Things and e-commerce, feature-based image retrieval and classification have become a serious challenge for shoppers searching websites for relevant product information. The last decade has witnessed great interest in research on content-based feature extraction techniques. However, semantic attributes cannot fully express the rich information in an image. This paper designs and trains a deep convolutional neural network in which the convolution kernel sizes and the order of network connections are chosen for high efficiency in terms of filter capacity and coverage. To address the long training time and high resource usage of the deep convolutional neural network, this paper also designs a shallow convolutional neural network that achieves similar classification accuracy. Both the deep and the shallow networks comprise data pre-processing, feature extraction, and softmax classification. To evaluate the classification performance of the networks, experiments were conducted on the public Caltech256 database and on a homemade product image database containing 15 categories of garments and 5 categories of shoes, with a total of 20,000 color images from shopping websites. Compared with the classification accuracy of 76.3% to 86.2% obtained by combining content-based feature extraction techniques with traditional support vector machines, the deep convolutional neural network obtains a state-of-the-art classification accuracy of 92.1%, and the shallow convolutional neural network reaches 90.6%. Moreover, the proposed convolutional neural networks can be integrated and implemented for other color image databases.


Introduction
With the popularity of the Internet and a variety of terminal equipment, online shopping has become a regular part of people's lives with the rise of websites such as Amazon, Dangdang, Taobao, and Jingdong. Customers view a large number of product images, and there is an urgent need for efficient product image classification methods. At present, most studies have focused on keyword-based, label-based, and content-based image retrieval. Zhou [1] used a querying and relevance feedback scheme based on keywords and low-level visual content, incorporating keyword similarities. He [2] proposed a method based on the Multi-Modal Semantic Association Rule (MMSAR) to automatically combine keywords with visual features for image retrieval. Xu [3] used Bayes with expectation maximization to learn an initial query concept from the labeled and formerly unlabeled images, with an active learning algorithm selecting the most useful images in the database to ask the user to label. However, keywords and labels can only convey basic information about the goods, such as the product name, origin, size, and price, and can hardly reflect the complete characteristics of the products. By contrast, images carry more information and express it more intuitively. If an image classification filter is set up on a shopping website, users can browse conveniently and quickly find their favorite products.
The last decade has also witnessed great interest in research on content-based image classification. Content-based image classification relies on image features, including shape, color, and texture.
Jia [4] adopted a gist descriptor and three complementary features, namely Pyramid Histogram of Oriented Gradients (PHOG), Pyramid Histogram of Words (PHOW), and Local Binary Pattern (LBP), to extract and describe the features of product images. Valuable product information (such as long skirts versus skirts, and turtlenecks versus round collars) can be labeled based on the image features and classification algorithms. Furthermore, they combined discriminative features for the SVM classifier. Experimental results showed that performance on the product image database (PI 100) improved significantly with feature fusion.
Nilsback and Zisserman [5] used Histogram of Oriented Gradients (HOG), HSV values, and Scale Invariant Feature Transform (SIFT) features, combined with an SVM classifier and a multiple-kernel learning framework, to classify flower images. The classification accuracy ranged from 76.3% to 95.2%.
Yao and Khosla [6] proposed a random forest in which every tree node is a discriminative classifier that can combine the information of the node with that of all upstream nodes. This method identified meaningful visual information in both subordinate categorization and activity recognition databases.
For fine-grained classification, Yao [7] presented a codebook-free and annotation-free approach for fine-grained image categorization of birds. Experimental results showed that the method outperformed state-of-the-art classification approaches. However, it is difficult to find features, like the above-mentioned shape, color, and texture, that can be applied to all product image classification tasks.
Compared to these classification methods, convolutional neural networks (CNNs) are deep learning algorithms with a strong ability to acquire features, a simple structure, and few parameters [10]. Nevertheless, the fine-grained classification of categories in product images has rarely been studied.
In recent years, CNNs have received much attention in the computer vision research community, mainly because they have proven capable of effectively classifying images and of breaking previous records in image recognition challenges. Most notable is the work of Krizhevsky, Sutskever, and Hinton [11], who in 2012 had a margin of 10.9% over the second-best entry in the ImageNet Large Scale Visual Recognition Challenge [12]. The ImageNet challenge distinguishes objects such as cat, car, tree, and house from 1000 different categories. CNNs are an application of deep learning to images, and they simulate the activity of neurons in the neocortex, where most thinking happens, as LeCun describes [13]. The main benefit of CNNs over traditional, fully connected neural networks is that they reduce the number of parameters to be learned. Convolution layers effectively extract high-level features with small kernels and feed the features to fully connected layers. According to Rumelhart, Hinton, and Williams [14], the training of CNNs is performed through back-propagation and stochastic gradient descent.
This study proposes a novel deep CNN that comprises data augmentation pre-processing, feature extraction, and softmax classification. To address the long training time and high resource usage of the deep convolutional neural network, this paper also designs a shallow convolutional neural network that achieves similar classification accuracy. To evaluate the classification performance of the networks, experiments were conducted on the public Caltech256 database and on a homemade product image database collected from shopping websites.

Caltech256 Database
The Caltech256 database contains 256 object categories with a total of 30,607 images. This paper selected 20 object categories similar to product images as input for training the deep convolutional neural network. For each category, 100 images were randomly selected for training and 50 images for testing. The input images were normalized to 256 × 256 during the experiment.

Homemade Database
The data used in the numerical analysis were mainly obtained from Internet-based e-commerce databases, including T-mall, Jingdong, and Amazon. As shown in

Methods
This section describes the pre-processing of the product images and the architecture of the deep convolutional neural network used to classify the garments and shoes.

Data Augmentation
A CNN is translation invariant but not rotation invariant. The number of product images available on the websites is, however, limited; we can therefore generate additional training and testing data by transforming the original images with affine transformations. The data were thereby increased five-fold by mirroring the images horizontally and vertically and rotating them by 90° and 180°. After data augmentation, there were 20,000 product images, of which 16,000 were used for training and 4000 for testing.
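The five-fold augmentation described above can be sketched in plain Python. This is only an illustrative sketch, not the authors' MATLAB pipeline: the function names are hypothetical, an image is assumed to be a nested list of rows (each entry a grey value or an RGB tuple), and the 90° rotation is assumed to be clockwise.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def rotate_90(img):
    """Rotate the image by 90 degrees clockwise (direction is an assumption)."""
    return [list(col) for col in zip(*img[::-1])]


def rotate_180(img):
    """Rotate the image by 180 degrees."""
    return [row[::-1] for row in img[::-1]]


def augment(img):
    """Return the original image plus four transformed copies (five-fold)."""
    return [img, flip_horizontal(img), flip_vertical(img),
            rotate_90(img), rotate_180(img)]


toy = [[1, 2],
       [3, 4]]
copies = augment(toy)
print(len(copies))
```

Applied to the 4000 collected images, one such call per image would give the 20,000 images mentioned in the text.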

Model Architecture
Several pre-trained networks for image classification exist, such as AlexNet [11], VGGNet [15], and GoogLeNet [16], which won the championship in the Image- [...] to speed up the gradient update with a learning rate set to 0.001.

Input Layer
Data are fed to the network, and the input layer produces an output vector that serves as input to the convolution layer. Input data can be either raw image pixels or transformations of them that emphasize specific aspects of the image. In this study, the input consists of three-channel product images produced by the data augmentation method.

Convolution Layers
The convolution layer is the feature extraction layer. The input of each neuron is connected to a local receptive field of the previous layer, from which a local feature is extracted. An important property of the convolution operation is that it enhances the characteristics of the original signal and reduces noise. Filter kernels are slid over the original image, and at each position the dot product between the filter kernel and the part of the image covered by the kernel is computed.
The calculation of the convolution layer is

$$x_j^{l} = f\Big(\sum_{i} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\Big)$$

where $l$ is the layer index, $k_{ij}^{l}$ represents the convolution kernel connecting map $j$ in layer $l$ with map $i$ in layer $l-1$, $x_i^{l-1}$ is an input feature map of layer $l-1$, $*$ represents convolution, $b_j^{l}$ is the bias, and $f(\cdot)$ is the nonlinear activation function.
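As a rough illustration of the convolution layer described above, the following pure-Python sketch slides a kernel over each input map, takes the dot product at each position, sums over the input maps, adds the bias, and applies the activation. The function names, the absence of padding, and the stride of 1 are assumptions made for the sketch; as in most CNN implementations, the kernel is not flipped.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and take
    the dot product at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def conv_layer(inputs, kernels, biases, activation):
    """Compute the output maps of one layer.

    inputs        list of input feature maps (layer l-1)
    kernels[i][j] kernel linking input map i to output map j
    biases[j]     bias of output map j
    """
    outputs = []
    for j, bias in enumerate(biases):
        acc = None
        for i, x in enumerate(inputs):
            y = conv2d_valid(x, kernels[i][j])
            if acc is None:
                acc = y
            else:
                acc = [[a + b for a, b in zip(ra, rb)]
                       for ra, rb in zip(acc, y)]
        outputs.append([[activation(v + bias) for v in row] for row in acc])
    return outputs


relu = lambda v: max(0.0, v)
img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
k = [[1, 0],
     [0, 1]]
print(conv_layer([img], [[k]], [-10.0], relu))
```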

Max-Pooling Layers
Pooling is a form of aggregate statistics that uses the maximum or mean value of a region to reduce the spatial size of a feature map and to provide invariance to the network. Max-pooling layers reduce the size of the feature maps passed to the next layer, thereby reducing the parameters and computation of the network. This is done by keeping only the maximum value within each k × k neighborhood of the feature map.
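The k × k max-pooling operation can be sketched as follows. This is a minimal illustration that assumes no padding; the function name and the toy feature map are hypothetical.

```python
def max_pool(feature_map, k, stride):
    """Keep only the maximum value of each k x k neighborhood."""
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(0, w - k + 1, stride)]
            for r in range(0, h - k + 1, stride)]


fm = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
print(max_pool(fm, k=2, stride=2))
```

With a 2 × 2 window and a stride of 2, the 4 × 4 map is reduced to 2 × 2, which is how pooling shrinks the input of the next layer.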

Batch Normalization
The role of batch normalization [17] is to keep the inputs of each layer in the same range even as earlier layers are updated. According to Dieleman [18], during each stochastic gradient descent (SGD) step the corresponding activations are normalized over the mini-batch, so that each dimension of the output signal has mean 0 and variance 1. The calculation of the batch normalization is

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^{2} + \varepsilon}}, \qquad y = \gamma \hat{x} + \beta$$

where $\mu$ and $\sigma^{2}$ are the mean and variance of the image batch $x$, and $\gamma$ and $\beta$ are trainable parameters that are updated after each batch. $\varepsilon$ is a small constant that is added to the variance to avoid division by zero.
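A minimal sketch of batch normalization for a single activation dimension is given below. It assumes the standard training-time form (normalize by the mini-batch mean and variance, then scale by γ and shift by β); the function name and default values are illustrative only.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one activation dimension over a mini-batch."""
    n = len(batch)
    mu = sum(batch) / n                          # mini-batch mean
    var = sum((x - mu) ** 2 for x in batch) / n  # mini-batch variance
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in batch]


out = batch_norm([1.0, 2.0, 3.0, 4.0])
mean_out = sum(out) / len(out)
var_out = sum((v - mean_out) ** 2 for v in out) / len(out)
print(mean_out, var_out)
```

With γ = 1 and β = 0, the output has a mean of approximately 0 and a variance of approximately 1, as stated in the text.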

Activation Functions
The activation functions in deep learning apply a non-linear function to the output of the previous layer. Sigmoid, hyperbolic tangent (tanh), rectified linear unit (ReLU), and softplus are commonly used in deep learning.
The non-linear sigmoid function has a large signal gain in the central region and a relatively small signal gain on both sides [19]. The output of the sigmoid function is mapped into the interval between 0 and 1, so it has a good effect on the feature space mapping of the signal. However, this kind of activation function cannot solve the vanishing gradient problem and is slow in network training. The calculation of the sigmoid function is

$$f(x) = \frac{1}{1 + e^{-x}}$$

This study used ReLU as the activation function. In 2011, the ReLU activation function was proposed by Glorot [20]. According to Krizhevsky [11], the ReLU function effectively suppresses the vanishing gradient problem and converges faster under gradient descent than traditional saturating nonlinear functions. It can speed up training and keep the gradient relatively constant across all network layers. The ReLU is defined as

$$f(x) = \max(0, x)$$

The rectifier function is one-sided and therefore does not enforce sign symmetry or antisymmetry. Instead, the response to the opposite of an excitatory input pattern is 0 (no response). It is therefore more biologically plausible and provides good results.
A smooth approximation to the rectifier is the softplus function. The softplus is not completely one-sided, so it is less biologically plausible and is not used as widely as ReLU. The calculation of the softplus function is

$$f(x) = \ln(1 + e^{x})$$

where $x$ is the value of the input signal.
Figure 3 shows the corresponding curves of the activation functions.
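For reference, the four activation functions named in this section can be written in a few lines of Python using their standard textbook definitions. The sketch is intended for moderate input values only and makes no attempt to guard against overflow for very large magnitudes.

```python
import math


def sigmoid(x):
    """Maps the input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Maps the input into the interval (-1, 1), centered at 0."""
    return math.tanh(x)


def relu(x):
    """One-sided rectifier: zero for negative inputs."""
    return max(0.0, x)


def softplus(x):
    """Smooth approximation to the rectifier."""
    return math.log1p(math.exp(x))


for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), softplus(x))
```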

Fully Connected Layers
As in traditional neural networks, all inputs of a fully connected layer are connected to all outputs of the previous layer. The fully connected layers are used to map spatial features to image labels. After training, the features extracted in these layers can also be used to train another classifier.

Softmax
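This study used the softmax classifier, which generalizes the logistic model to multi-class classification. The softmax classifier is an algorithm that assigns the target variable to one of several classes. Suppose there are $N$ input images $\{(x_i, y_i)\}_{i=1}^{N}$, where each image is marked with one of $k$ classes, $y_i \in \{1, 2, \ldots, k\}$. For a given input $x_i$, the probability of class $j$ is

$$p(y_i = j \mid x_i; \theta) = \frac{e^{\theta_j^{T} x_i}}{\sum_{l=1}^{k} e^{\theta_l^{T} x_i}}$$

where $1 / \sum_{l=1}^{k} e^{\theta_l^{T} x_i}$ represents the normalization of the probability distribution, that is, the sum of all probabilities is 1, and $\theta$ is the parameter of the softmax function.
The calculation of the loss function is

$$J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} 1\{y_i = j\} \log \frac{e^{\theta_j^{T} x_i}}{\sum_{l=1}^{k} e^{\theta_l^{T} x_i}}$$

where $1\{\cdot\}$ is the indicator function, defined by $1\{\text{expression is true}\} = 1$ and $1\{\text{expression is false}\} = 0$. Finally, the error function is minimized by stochastic gradient descent.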
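The softmax classifier and its loss can be sketched as follows. The sketch assumes the standard softmax regression form, works directly on per-class scores (the inner products of the class parameters with the input), and uses class labels that start at 0; the function names are illustrative.

```python
import math


def softmax(scores):
    """Turn per-class scores into probabilities that sum to 1."""
    m = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_scores, labels):
    """Average negative log-probability of the true class.

    The indicator function of the loss simply selects the probability
    of the labeled class for each image.
    """
    loss = 0.0
    for scores, y in zip(batch_scores, labels):
        loss -= math.log(softmax(scores)[y])
    return loss / len(labels)


probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
print(cross_entropy([[1.0, 2.0, 3.0], [0.5, 0.1, 0.2]], [2, 0]))
```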

Filter Capacity
In this study, the efficiency of the network was determined by evaluating the filter capacity and coverage of the network [21]. The filter capacity is a measure of a filter's ability to detect complex structures in an image. If the capacity is small, only local features in the image will be mapped to the next layer. Conversely, if the capacity is large, the filter will find complex structures made up of elements that are not neighbors in the input image. The filter capacity is calculated as the ratio between the real filter size and the receptive field [22]:

$$\text{Capacity} = \frac{\text{real filter size}}{\text{receptive field}}$$

where the real filter size is the size of the kernel after accounting for the downsampling (striding or pooling) of previous layers. If no downsampling is applied, the real filter size is the same as the kernel size. For example, if the input to a layer with kernel size n × n has been downsampled by a factor k, the real filter size is kn × kn. In this network, there are two 3 × 3 max-pooling layers and one 2 × 2 max-pooling layer. After the first max-pooling layer (3 × 3), the real filter size is 3n × 3n. After the second max-pooling layer (2 × 2), it is 6n × 6n, and after the third max-pooling layer (3 × 3), it is 18n × 18n. The receptive field is defined as the region in the original image that a particular CNN feature is focused on [22]. Increasing the size of the filters in the convolution layers or using pooling can increase the receptive field and thus the filter capacity. According to Cao [21], the network is meaningless if the capacity is smaller than 1/6. For this network, the filter capacity is between 20.4% and 100%, and thereby well above the lower limit of 1/6.
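The real filter size arithmetic above (3n, 6n, 18n) can be reproduced with a short sketch. The downsampling factors 3, 2, and 3 follow the text; the function names and the receptive field value of 36 in the capacity example are purely hypothetical and are not taken from the paper's network.

```python
def real_filter_sizes(kernel_size, downsample_factors):
    """Real filter size after each successive downsampling step."""
    sizes = []
    factor = 1
    for f in downsample_factors:
        factor *= f                      # cumulative downsampling
        sizes.append(kernel_size * factor)
    return sizes


def capacity(real_filter_size, receptive_field):
    """Filter capacity: real filter size divided by receptive field."""
    return real_filter_size / receptive_field


# Multipliers of n after the three pooling layers described in the text.
print(real_filter_sizes(1, [3, 2, 3]))

# Hypothetical example: a real filter size of 18 with a receptive field of 36.
c = capacity(18, 36)
print(c, c >= 1 / 6)
```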

Coverage
Coverage measures how much of the input image a layer in a CNN can "see". Adding convolution or pooling layers can increase coverage. The coverage at the end of the network should not exceed 100%. If coverage exceeds 100%, network computation is wasted, because the network could then process images larger than the input image. For this network, the convolution filters covered 55.9% of the input image and never exceeded the size of the image. Table 2 shows the coverage and capacity of the network.
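One way to estimate coverage is to compute the receptive field of the last layer and divide it by the input size. The sketch below assumes this definition and uses the usual recurrence for the receptive field of stacked convolution and pooling layers; the layer list is a hypothetical example and is not the architecture of this paper.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of layers.

    layers is a list of (kernel_size, stride) pairs for convolution
    or pooling layers, listed from input to output.
    """
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump   # each layer widens the field
        jump *= stride                 # strides compound across layers
    return field


def coverage(layers, input_size):
    """Fraction of the input side length seen by the last layer."""
    return receptive_field(layers) / input_size


# Hypothetical stack: 7x7 conv (stride 2), 3x3 pool (stride 2), 5x5 conv (stride 1)
example = [(7, 2), (3, 2), (5, 1)]
print(receptive_field(example), coverage(example, 227))
```

A coverage value above 1.0 from such a calculation would correspond to the wasteful case described above.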

Results and Discussion
The operating system used in the experiments is CentOS. Figure 4 shows the classification accuracy and cross entropy loss of the experiment on the homemade database. To achieve the highest accuracy possible without overfitting the network, the training was set to 100 epochs. The average classification accuracy on the test set was 92.1%. Setting appropriate learning rates can improve the learning efficiency of the network and therefore the classification accuracy. The learning rate was reduced three times before the experiment was stopped. At the beginning, we set the learning rate to 0.001. The test accuracy rapidly increased and the test loss rapidly declined. Judging from the decline of the training loss curve, this learning rate was relatively high. After 10 epochs, the test accuracy increased only slowly, and at times decreased, while the test loss showed an upward trend. We therefore set the learning rate to 0.0005. The test accuracy of the network then increased again and the test loss slowly decreased. After 20 epochs, the test accuracy and test loss were not stable, so we set the learning rate to 0.0001. The test accuracy remained high and the training loss continued to decline, stabilizing after 30 epochs.
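The learning rate changes described above can be approximated by a fixed step schedule. This is only a sketch: in the experiment the rate was lowered by hand in response to the accuracy and loss curves, whereas the hypothetical function below hard-codes the reported epochs and values.

```python
def learning_rate(epoch):
    """Step schedule matching the values reported for the deep network."""
    if epoch < 10:
        return 0.001
    if epoch < 20:
        return 0.0005
    return 0.0001


for epoch in (0, 9, 10, 19, 20, 99):
    print(epoch, learning_rate(epoch))
```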
Overall, most categories achieved high classification accuracies. This is because the aim of the training was to classify as many product images correctly as possible, without taking into account how these products are distributed among the categories. The classification accuracy of garments was 93.4%, and the accuracy of shoes was lower at 88.2%. This was because the shoe samples that we chose were similar to one another, so their features could not be extracted as well.
As shown in Figure 6, we chose one image from each of three categories (short skirt, trousers, and basketball shoes) to visualize the feature images of each convolution layer. A horizontal comparison of the feature images of each category shows that the first convolution layer (conv1) captures the edges, shapes, and colors of the product, and conv2 captures its texture. After conv3, the feature images become more ambiguous and have no specific meaning. The classification accuracies of short skirts, trousers, and basketball shoes were 97%, 90%, and 82.5%, respectively. A vertical comparison of the feature images shows that, after conv3, the edge sharpness of the skirt is higher than that of the trousers, which in turn is higher than that of the basketball shoes. This is also supported by Table 3, which shows the mean and standard deviation of the features extracted by each convolution layer. From conv1 to conv5, the mean and standard deviation of each category gradually decrease, which means that the feature information of the images is extracted in a stable fashion. From conv3 to conv5, the standard deviation of the short skirt is smaller than that of the trousers, which is smaller than that of the basketball shoes. The smaller the standard deviation, the better the feature extraction and the more stable the image features.

Comparative Experiment Based on Shallow Convolutional Neural Network
In applications of modern technology, time cost and resource usage are important aspects that cannot be ignored. If a shallow convolutional neural network can accomplish a relatively simple task well, such as classifying a small collection of images, there is little reason to design a complex network with a higher time cost.

Image Preprocessing
There were 4000 images in our database, and each product has 200 images, of which 150 were used for training and 50 for testing. The image sizes are not the same. To facilitate the experiment, all images were normalized to 256 × 256 = 65,536 pixels. Because of the small number of samples and the shallow network, this paper focuses on image preprocessing. To eliminate the influence of complex backgrounds on the network, an intuitive approach is to extract the object to be recognized from the image and then use the extracted region for training. This requires detecting the target object in the image, and the RCNN algorithm is the classical deep learning algorithm for object detection. The RCNN algorithm was proposed by Girshick [23] in 2014 and achieved great success: the detection rate on the PASCAL VOC database increased from 35.1% to 53.7%.
Although RCNN has achieved good results, it has some obvious shortcomings: the number of bounding boxes is too large, the training time is long, and many bounding boxes overlap each other, resulting in repeated computation. To solve these problems, an improved algorithm, Fast-RCNN [24], was proposed. The biggest difference between Fast-RCNN and RCNN is that Fast-RCNN maps all bounding regions onto the last convolution layer of the network and then uses an ROI pooling layer to unify the sizes of the different bounding regions. Feature extraction is thus performed only once per image rather than once per bounding region, which greatly improves computational efficiency.
Although Fast-RCNN is much faster than RCNN, the large number of bounding regions still needs to be optimized. In view of this, the Faster-RCNN [25] algorithm was proposed. Faster-RCNN extracts bounding regions from the feature maps after the convolution layers rather than from the original image; to this end, a Region Proposal Network (RPN) is added to Fast-RCNN to generate the bounding regions. This paper used Faster-RCNN to detect the location of the clothing in each image, and the detected region was then normalized to 64 × 64 as the input image. Finally, softmax is used to classify the extracted features.
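The last preprocessing step, cropping the detected region and normalizing it to 64 × 64, can be sketched as below. The detector itself is not reproduced: the bounding box is a hypothetical value standing in for a Faster-RCNN output, the box format (x0, y0, x1, y1) and nearest-neighbor resizing are assumptions, and the function names are illustrative.

```python
def crop(image, box):
    """Cut the region (x0, y0, x1, y1) out of an image stored as rows."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def resize_nearest(image, out_h, out_w):
    """Resize with nearest-neighbor sampling (an assumed method)."""
    in_h, in_w = len(image), len(image[0])
    return [[image[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]


# Toy 256 x 256 image whose pixels record their own (row, column) position.
image = [[(r, c) for c in range(256)] for r in range(256)]
detected_box = (32, 16, 160, 144)      # hypothetical detector output
patch = resize_nearest(crop(image, detected_box), 64, 64)
print(len(patch), len(patch[0]), patch[0][0])
```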

Shallow Convolutional Neural Network Model Architecture
Figure 7 shows the shallow convolutional neural network architecture and the trained parameters. The database was normalized to 64 × 64 RGB images after the preprocessing. The shallow network accepted 64 × 64 RGB images as input and output a vector for each block. It had one 8 × 8 convolution layer with a stride of 1, followed by a 3 × 3 max-pooling layer with a stride of 2. This was mapped into a 6 × 6 convolution layer, which increased the number of filters from 16 to 28. Next came a 3 × 3 max-pooling layer with a stride of 2 and 28 filters. Following this, there were three 4 × 4 convolution layers with strides of 1. Finally, the feature maps were mapped into a 3 × 3 max-pooling layer with a stride of 2. Softmax classifiers were then used to classify the product images into 20 categories. The network was trained using mini-batches of 25 images to speed up the gradient update, with the learning rate initially set to 0.001, and the training was set to 50 epochs. After 10 epochs, the test accuracy increased only slowly, and at times decreased, while the test loss showed an upward trend. We therefore set the learning rate to 0.0005. The test accuracy of the network then increased again and the test loss slowly decreased. After 30 epochs, the test accuracy and test loss were not stable.

Results and Discussion
We set the learning rate to 0.0001. The test accuracy remained high and the training loss continued to decline, stabilizing after 35 epochs.
Overall, the shallow convolutional neural network saves time and resources by reducing the number of network layers and training epochs. However, high classification accuracy cannot be achieved simply by reducing the number of network layers and iterations; it also requires careful image preprocessing and tuning of the initial network parameters.

Conclusions
In this study, we designed and trained a feature-based deep CNN for color image

Twenty products were selected, including garments and shoes (Figure 1). The garments consist of trousers, sweaters, jackets, outdoor jackets, dresses, short T-shirts, down jackets, fleeces, vests, Chinese dresses, shirts, short pants, short skirts, scarves, and socks. The shoes include skateboard shoes, basketball shoes, leather shoes, climbing shoes, and running shoes. Each product has 200 images, and after data augmentation there were 20,000 product images, of which 16,000 were used for training and 4000 for testing. The image sizes are not the same. To facilitate the experiment, all images were normalized to 256 × 256 = 65,536 pixels.
The non-linear tanh function converges faster than the sigmoid function. Its output is mapped into the interval between −1 and 1 and is centered at 0. Still, the tanh function (like the sigmoid function) cannot solve the vanishing gradient problem. The calculation of the tanh function is

$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Figure 4. Classification accuracy and cross entropy loss of the experiment. The red line represents the test accuracy, the blue line the test loss, and the green line the training loss.
As a deep learning network structure, a convolutional neural network requires a lot of training data and a deep structure in order to achieve good classification results. Training a shallow convolutional neural network on a small sample often gives unsatisfactory results. In view of this, this paper uses the ImageNet database, which consists of 1.2 million images in 1000 categories, to pre-train the shallow network. Network training is the process of updating the initial parameters toward the optimal parameters. When the ImageNet training is completed, the trained parameters are stored in the shallow convolutional neural network. The preprocessed product database is then used as input for further network training to obtain the optimal parameters. The network storing the optimal parameters serves as the new shallow network model for feature extraction.

Figure 8 shows the classification accuracy and cross entropy loss of the experiment on the homemade database. To achieve the highest accuracy possible without

Figure 8. Classification accuracy and cross entropy loss of the experiment. The red line represents the test accuracy, the blue line the test loss, and the green line the training loss.
classification in e-commerce domains, which comprises data augmentation pre-processing, feature extraction, and softmax classification. The feasibility and effectiveness of the proposed network were assessed by evaluating its filter capacity and coverage. To evaluate the classification performance of this technique, experiments were conducted using a homemade product image database of 20,000 color images taken from shopping websites, yielding an average accuracy of 92.1%. Empirical results on the image database show that the proposed feature-based deep CNN is very competitive with traditional content-based image classification in all experiments performed. To address the long training time and high resource usage of the deep convolutional neural network, this paper also designed a shallow convolutional neural network that achieves a classification accuracy of 90.6%. The proposed networks, whose parameters and architecture are fine-tuned on the basis of CNNs as reported in this study, can be readily integrated and implemented in other image recognition and classification domains. Potential future work involves developing new and deeper network architectures for product image classification, applying the CNN to other image databases, and improving the classification accuracy through transfer learning.

Table 1. Performance of AlexNet, VGGNet, and GoogLeNet.
convolution and deeper levels can obtain a better structure. Table 1 presents the performance of AlexNet, VGGNet, and GoogLeNet in the ImageNet image recognition competition. However, these pre-trained networks were built on ImageNet, the largest image recognition database in the world, whose images differ from those in this study. Therefore, a new architecture was built to better classify product images. Our CNN is sketched in Figure 2. The images in the database are 256 × 256 RGB images. MATLAB is used to augment the data and transform them into 227 × 227 RGB images. The network accepted 227 × 227 RGB images as input and output a vector for each block, as illustrated in Figure 2. The network had one 7 × 7 convolution layer with a stride

Table 2. Coverage and capacity of the network.
The classification accuracy of the deep convolutional neural network on the Caltech256 database reached 94.8%. This shows the effectiveness of the proposed deep network and its suitability for feature extraction from color images.
T. T. Liu et al. DOI: 10.4236/ami.2018.8400779 Advances in Molecular Imaging

Table 3. Mean and standard deviation of the features extracted by each convolution layer.