
Machine learning has become an integral technology in many areas of modern life worldwide. One prominent application is image classification, embraced across many spheres such as business, finance, and medicine to enhance products, processes, and efficiency. The demand for more accurate, detail-oriented classification drives modifications, adaptations, and innovations to deep learning algorithms. This article used Convolutional Neural Networks (CNNs) to classify scenes in the CIFAR-10 dataset and to detect emotions in the KDEF dataset. The proposed method converted the data to the wavelet domain to attain greater accuracy at a computational cost comparable to spatial-domain processing. By dividing image data into subbands, important feature learning occurred over differing frequencies, from low to high. Combining the learned low- and high-frequency features, and processing the fused feature mapping, advanced the detection accuracy. Compared with spatial-domain CNN and Stacked Denoising Autoencoder (SDA) baselines, experimental findings revealed a substantial increase in accuracy.

Machine learning has become an integral part of everyday life for many people around the world. The discovery and implementation of algorithms allowing computers to learn and predict patterns open up possibilities for computers to interface with and assist humans globally [

One powerful utilization of machine learning is image classification. Image classification categorizes the pixels in an image into one of numerous classes, based on the feature representations gathered during extraction. Many spheres such as business, finance, medicine, research, and technology use image classification to enhance their products, processes, and efficiency [

Deep learning, using multiple layers of nonlinear information processing, trains computers to differentiate patterns in data. Each layer builds upon the previous one, representing newly learned features of the data. At each depth, higher-level abstract features derive from the previous level, allowing for greater discrimination between classes the deeper the network goes. The result is the organization and classification of massive, messy, disorderly data in less time than shallower forms of machine learning require [

Normally, CNNs, SDAs, and similar networks perform image classification on the raw image pixels. In an effort to increase the image classification accuracy, we propose an algorithm that converts the data to the wavelet domain. The first-level subbands become inputs to their own CNNs, each producing individual classification results. We combine the subbands' classification results with the OR operator, surpassing the classification accuracy of a CNN on the spatial image data. We also apply our proposed algorithm to SDA and compare it with its spatial counterpart [

We organize the rest of this article as follows: Section 2 gives the background; Section 3 describes the proposed methods; Section 4 discusses the experimental results; and Section 5 gives the summary and conclusion.

Wavelets represent functions as simpler, fixed building blocks at different scales and positions. The Discrete Wavelet Transform (DWT) derives from and simplifies the continuous wavelet transform, representing a sequence of sampled numbers from a continuous function [

Let an image f(x, y) have dimensions M × N. We define the two-dimensional DWT transform pair as

W_{\varphi}(j_0, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, \varphi_{j_0, m, n}(x, y)   (1)

W_{\psi}^{i}(j, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, \psi_{j, m, n}^{i}(x, y)   (2)

We define the Inverse Discrete Wavelet Transform (IDWT) as

f(x, y) = \frac{1}{\sqrt{MN}} \sum_{m} \sum_{n} W_{\varphi}(j_0, m, n)\, \varphi_{j_0, m, n}(x, y) + \frac{1}{\sqrt{MN}} \sum_{i = H, V, D} \sum_{j = j_0}^{\infty} \sum_{m} \sum_{n} W_{\psi}^{i}(j, m, n)\, \psi_{j, m, n}^{i}(x, y)   (3)

where W_{\varphi} are the approximation coefficients, W_{\psi}^{i} are the detail coefficients, m and n are the subband dimensions, j is the resolution level, and i indexes the subband set {H, V, D}.
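To make the decomposition concrete, the following is a minimal NumPy sketch of a one-level 2D Haar DWT and its inverse. The function names `haar_dwt2`/`haar_idwt2` and the subband naming by (row filter, column filter) order are our own illustration, not the article's implementation:

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2D Haar DWT: filter and downsample along columns,
    then along rows. Returns LL, LH, HL, HH (half-size per dimension)."""
    lo = (img[:, 0::2] + img[:, 1::2]) / np.sqrt(2)  # low-pass across columns
    hi = (img[:, 0::2] - img[:, 1::2]) / np.sqrt(2)  # high-pass across columns
    LL = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2)
    LH = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2)
    HL = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2)
    HH = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2)
    return LL, LH, HL, HH

def haar_idwt2(LL, LH, HL, HH):
    """Inverse of haar_dwt2: reconstructs the image from its four subbands."""
    m, n = LL.shape
    lo = np.empty((2 * m, n))
    hi = np.empty((2 * m, n))
    lo[0::2, :], lo[1::2, :] = (LL + LH) / np.sqrt(2), (LL - LH) / np.sqrt(2)
    hi[0::2, :], hi[1::2, :] = (HL + HH) / np.sqrt(2), (HL - HH) / np.sqrt(2)
    img = np.empty((2 * m, 2 * n))
    img[:, 0::2] = (lo + hi) / np.sqrt(2)
    img[:, 1::2] = (lo - hi) / np.sqrt(2)
    return img

img = np.arange(64, dtype=float).reshape(8, 8)
LL, LH, HL, HH = haar_dwt2(img)
print(LL.shape)                                       # (4, 4): quarter the pixels
print(np.allclose(haar_idwt2(LL, LH, HL, HH), img))   # True: perfect reconstruction
```

Each subband has a quarter of the original pixels, which is what later lets the subband CNNs run on lower-dimensional inputs.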

The Fast Wavelet Transform (FWT) can be expressed as

W_{\psi}(j, k) = \sum_{m} h_{\psi}(m - 2k)\, W_{\varphi}(j + 1, m)   (4)

W_{\varphi}(j, k) = \sum_{m} h_{\varphi}(m - 2k)\, W_{\varphi}(j + 1, m)   (5)

where k is the position parameter. Equations (4) and (5) reveal the connection between DWT coefficients at adjacent scales. This algorithm is "fast" because it efficiently computes the next level of approximation and detail coefficients iteratively by convolving W_{\varphi}(j + 1, m) with the time-reversed scaling and wavelet vectors h_{\varphi}(−n) and h_{\psi}(−n), and subsampling the outcomes.
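One analysis step of Equations (4) and (5) can be sketched in NumPy; the Haar filters below are an assumed choice of basis, used for illustration:

```python
import numpy as np

# Haar scaling and wavelet vectors (an assumed choice of basis)
h_phi = np.array([1.0, 1.0]) / np.sqrt(2)   # low-pass
h_psi = np.array([1.0, -1.0]) / np.sqrt(2)  # high-pass

def fwt_step(w_next):
    """One FWT analysis step, per Equations (4) and (5): convolve the
    level-(j+1) approximation coefficients with the time-reversed filters,
    then keep every other sample (realizing the m - 2k shift)."""
    lo = np.convolve(w_next, h_phi[::-1], mode="full")[1::2]  # W_phi(j, k)
    hi = np.convolve(w_next, h_psi[::-1], mode="full")[1::2]  # W_psi(j, k)
    return lo, hi

x = np.array([4.0, 2.0, 6.0, 8.0])
lo, hi = fwt_step(x)
# For Haar, this matches pairwise sums and differences scaled by 1/sqrt(2)
print(np.allclose(lo, (x[0::2] + x[1::2]) / np.sqrt(2)))  # True
print(np.allclose(hi, (x[0::2] - x[1::2]) / np.sqrt(2)))  # True
```

Repeating `fwt_step` on the low-pass output yields the coarser scales, which is exactly the iteration the text describes.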

The two-dimensional FWT, like the one-dimensional FWT, filters the approximation coefficients at resolution level j + 1 to obtain the approximation and details at the j^{th} resolution level. Furthermore, for the two-dimensional case, the detail coefficients expand from one set to three (horizontal, vertical, and diagonal) [

The decomposition at each level j produces four subbands: Low-Low (LL_{j}), Low-High (LH_{j}), High-Low (HL_{j}), and High-High (HH_{j}), j = 1, 2, ⋯, J, where j is the scale and J denotes the largest or coarsest scale in the decomposition. The LH_{j}, HL_{j}, and HH_{j} subbands contain the detail coefficients, noted above, and the LL_{j} subband contains the approximation coefficients.

The independent nature of the subbands allows image processing applications to process each one optimally for its environment, if needed. After subband processing occurs, the IDWT reconstructs the image.

A stacked denoising autoencoder (SDA) is a deep neural network containing multiple denoising autoencoders (DAs) whose outputs connect to the inputs of the next DA [

Suppose an SDA has m layers. Designate l as the current layer. Let W^{(k,1)}, W^{(k,2)}, b^{(k,1)}, b^{(k,2)} represent the weights and biases for the k^{th} autoencoder. The SDA encodes by applying the following for each layer in a feedforward fashion [

a^{(l)} = f(z^{(l)})   (6)

z^{(l+1)} = W^{(l,1)} a^{(l)} + b^{(l,1)}   (7)

SDAs take in data, and by stacking multiple DAs in succession, they train and learn deeper features at each progressive layer. This process utilizes greedy layer-wise training for efficiency [
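A minimal NumPy sketch of the feedforward encoding in Equations (6) and (7); the random weights stand in for trained autoencoder parameters, and the sigmoid is an assumed choice of activation f:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sda_encode(x, weights, biases):
    """Feedforward encoding through stacked autoencoder layers, per
    Equations (6) and (7): a(l) = f(z(l)), z(l+1) = W^(l,1) a(l) + b^(l,1)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)  # each layer encodes the previous activation
    return a

# Toy example: a two-stage encoder shrinking 8 -> 4 -> 2 features
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 8)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
code = sda_encode(rng.standard_normal(8), weights, biases)
print(code.shape)  # (2,)
```

In a real SDA each (W, b) pair would come from greedily pre-training one denoising autoencoder at a time before stacking.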

Convolutional Neural Networks (CNNs) follow the path of their predecessor, the Neocognitron, in shape, structure, and learning philosophy [

classifying images, video, etc. [

Additionally, the structural differences between them change how CNNs learn and share weights, and how they reduce dimensionality at every layer.

Traditional neural networks have fully connected layers, where every node connects to each node in the subsequent layer. With CNNs, a region of nodes in the previous layer connects to one node in the subsequent layer. This region, better known as a local receptive field, operates like a sliding window over the whole layer. Within the field, each connection learns a weight value, and the field learns one overall bias value for the node in the subsequent hidden layer. Local receptive fields often take on a square shape of a predetermined size (e.g. 3 × 3, 5 × 5) [

These weights and biases remain the same for each node in the hidden layer. Unlike traditional neural networks, CNNs share the weights and bias across the entire mapping from input layer to hidden layer [

y_{j,k} = \sigma\left( b + \sum_{l=0}^{n-1} \sum_{m=0}^{n-1} W_{l,m}\, a_{j+l,\, k+m} \right)   (8)

where W_{l,m} represents the shared weights, b represents the shared bias, a_{j+l,k+m} is the input activation at the given position, and n is the window size of the convolution filter.
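Equation (8) can be sketched directly in NumPy for a single shared filter over the "valid" region, with the sigmoid as an assumed choice of σ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_layer(a, W, b):
    """Valid convolution with one shared n x n filter W and bias b, per
    Equation (8): y_jk = sigma(b + sum_l sum_m W_lm a_{j+l, k+m})."""
    n = W.shape[0]
    rows, cols = a.shape
    out = np.empty((rows - n + 1, cols - n + 1))
    for j in range(out.shape[0]):
        for k in range(out.shape[1]):
            # Every output node reuses the SAME weights and bias
            out[j, k] = sigmoid(b + np.sum(W * a[j:j + n, k:k + n]))
    return out

a = np.ones((5, 5))
y = conv_layer(a, W=np.full((3, 3), 0.1), b=-0.9)
print(y.shape)  # (3, 3): the window slides over 3 x 3 valid positions
```

Weight sharing is what keeps the parameter count at n² + 1 regardless of the input size, in contrast to a fully connected layer.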

CNNs adhere to a basic structure that takes after its forefather, Neocognitron, where the layers alternate between a convolutional layer and a pooling layer [

The pooling layer performs dimensionality reduction. This layer helps keep computational costs lower than they would be if learning occurred here. The subsampling condenses a neighborhood into one node value, and this process continues until it covers all nodes. Researchers primarily use max pooling and average pooling in this layer [

Average pooling calculates the average value of a region and uses it for the compressed feature map. Max pooling determines the maximum value of a region and uses it for the compressed feature map.

a_{k,i,j} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} a_{k,p,q}   (9)

a_{k,i,j} = \max_{(p,q) \in R_{ij}} a_{k,p,q}   (10)
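Both pooling rules of Equations (9) and (10), over non-overlapping 2 × 2 regions, can be sketched as:

```python
import numpy as np

def pool2x2(feature_map, mode="max"):
    """2x2 non-overlapping pooling per Equations (9) and (10)."""
    H, W = feature_map.shape
    # Group the map into 2x2 blocks, then reduce each block to one value
    blocks = feature_map[: H - H % 2, : W - W % 2].reshape(H // 2, 2, W // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))   # Equation (10)
    return blocks.mean(axis=(1, 3))      # Equation (9)

fm = np.array([[1.0, 2.0, 3.0, 0.0],
               [4.0, 5.0, 1.0, 2.0],
               [0.0, 1.0, 2.0, 2.0],
               [1.0, 0.0, 2.0, 2.0]])
print(pool2x2(fm, "max"))   # [[5. 3.] [1. 2.]]
print(pool2x2(fm, "mean"))  # [[3.  1.5] [0.5 2. ]]
```

Either way the feature map shrinks by a factor of four, which is the dimensionality reduction the text describes.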

In various arrangements, a complete CNN connects alternating convolutional layers to pooling layers. However, other auxiliary types of layers and processes exist to create more robust activations, regularize networks, and otherwise achieve optimal performance. A sample of these layers and processes includes dropout [

Traditionally, for image classification researchers execute CNN on the raw image pixels. This process yields accurate results, but oftentimes the efficiency of the algorithm decreases. This decrease in efficiency comes from the complexity and dimensions of the images in the spatial domain. We seek to remedy this issue by converting the data into the wavelet domain. This conversion allows us to process the images at lower dimensions and achieve faster execution times.

By exploiting the characteristics of the wavelet domain, we apply multiple CNNs onto the various frequency and time representations. This ensemble of CNNs on various subbands increases the classification accuracy of the data sets.

We outline the main steps below:

1) Convert images from spatial to the wavelet domain;

2) Apply Z-score normalization on subbands [

3) Normalize detail subbands [0, 1];

4) Perform CNN on subbands;

5) Combine subband results with the OR operator for final classification [
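The normalization and fusion steps above can be sketched in NumPy; this is a minimal illustration in which the per-subband CNNs are replaced by hypothetical prediction vectors:

```python
import numpy as np

def zscore(x):
    """Step 2: Z-score normalization of a subband."""
    return (x - x.mean()) / (x.std() + 1e-8)

def minmax01(x):
    """Step 3: scale a detail subband into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + 1e-8)

def or_combine(subband_predictions, ground_truth):
    """Step 5: OR-gate fusion -- a test sample counts as correctly
    classified if ANY subband network predicts its true label."""
    correct = np.zeros(ground_truth.shape, dtype=bool)
    for preds in subband_predictions:
        correct |= (preds == ground_truth)
    return correct.mean() * 100.0  # ensemble accuracy in percent

# Hypothetical per-subband predictions for six test samples
truth    = np.array([0, 1, 2, 0, 1, 2])
preds_LL = np.array([0, 1, 2, 1, 1, 0])  # 4/6 correct alone
preds_LH = np.array([0, 0, 2, 0, 2, 0])  # 3/6 correct alone
# Together, at least one subband is right on 5 of the 6 samples
print(or_combine([preds_LL, preds_LH], truth))
```

The OR fusion can only add correct detections, never remove them, which is why a subband that is weak overall can still raise the ensemble accuracy through its unique detections.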

We present our applications of this algorithm in two contrasting ways. The first approach (hereafter CNN-WAV2) combines the detail coefficients (LH, HL, HH) prior to processing the images as shown by this equation [

HF = \alpha \cdot LH + \beta \cdot HL + \gamma \cdot HH   (11)

where α, β, and γ are the weight parameters for each subband, whose values we calculate below [

\alpha = \frac{TA_{LH}}{TA_{LH} + TA_{HL} + TA_{HH}}   (12)

\beta = \frac{TA_{HL}}{TA_{LH} + TA_{HL} + TA_{HH}}   (13)

\gamma = \frac{TA_{HH}}{TA_{LH} + TA_{HL} + TA_{HH}}   (14)

where TA is the test accuracy for each subband after CNN testing, defined in the Results. We show CNN-WAV2 in
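Equations (11)-(14) amount to a convex combination of the detail subbands, weighted by each subband's share of the total test accuracy. A NumPy sketch, where the subband arrays and accuracies are hypothetical:

```python
import numpy as np

def fuse_details(LH, HL, HH, ta_lh, ta_hl, ta_hh):
    """CNN-WAV2 detail fusion per Equations (11)-(14): weight each
    detail subband by its share of the total test accuracy."""
    total = ta_lh + ta_hl + ta_hh
    alpha, beta, gamma = ta_lh / total, ta_hl / total, ta_hh / total
    # alpha + beta + gamma == 1, so HF stays on the subbands' scale
    return alpha * LH + beta * HL + gamma * HH

# Hypothetical 2x2 detail subbands and per-subband test accuracies
LH, HL, HH = np.ones((2, 2)), 2 * np.ones((2, 2)), 3 * np.ones((2, 2))
HF = fuse_details(LH, HL, HH, ta_lh=50.0, ta_hl=30.0, ta_hh=20.0)
print(HF)  # every entry is 0.5*1 + 0.3*2 + 0.2*3 = 1.7
```

Because the weights sum to one, a subband that tested poorly contributes proportionally less to the fused HF subband.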

We use MatConvNet [

We use the Haar wavelet basis when implementing our proposed methods in CNN and SDA. We compare multiple bases (

TA = \frac{\#\ \text{of correctly classified}}{\#\ \text{of tested samples}} \times 100\%   (15)

To test the strength of our OR gate technique, we compare it to two differing neural network techniques. The first technique connects the outputs of the individual subband CNNs to a multilayer perceptron (MLP), which merges the features of each CNN-WAV into one network. The other approach fuses the features using an Extreme Learning Machine (ELM) [

We base our CNN architecture on Zeiler’s stochastic pooling work [

The CIFAR-10 dataset contains natural scenes from ten classes of images. Like CIFAR-100, it is a labeled subset of the larger 80 Million Tiny Images dataset. For this dataset, we use the full training set containing 50,000 images, and the full testing set containing 10,000 images. During training, the test set also serves as the validation set.

Our CNN-WAV4 method generates the greatest image classification accuracy out of all other approaches. The four individual subbands give this method a robust structure that corrects the errors in one or more subband results. Furthermore, each of the subbands in the ensemble detect a medium-to-high number of scenes on their own, and pick up unique detections that the other subbands miss.

Our CNN-WAV2 method underperforms in terms of accuracy, achieving a classification accuracy lower than the traditional CNN method. However, it has the lowest computational cost. This speedup results from the fusion of the detail coefficients into a new subband, creating a two-subband ensemble with higher processing speed that loses accuracy through the information lost in the fusion.

The LL subband contributes the most towards the classification score of the whole network. The reason stems from the fact that this subband has the most similarity to the original image.

Our proposed ensemble methods bring together the results from each subband network using an OR gate. We purposely construct it this way to maximize the unique detections of each subband network. Because each subband represents the data differently, each network achieves varying results in scene detection. By combining the results of the networks after individual classification rather than before, our approach achieves greater accuracy than others do.

We compare our method to two approaches that combine the subbands during training. Combining them this way diminishes the accuracy and effectiveness of the unique detections.

We further explore the effectiveness of our proposed methods and their advantage regarding unique detections per subband network. The nature of our ensemble allows each subband network to act as an error corrector for the others. Since each network performs its own classification prior to the OR logic, we know whether a subband's decision passes or not. Conversely, we also know which one(s) predict the correct scenes. Therefore, we permit the network(s) that predict the ground truth correctly to override the incorrect decisions of the

Method | Accuracy (%) | O(N)
---|---|---
CNN | 81.95 | 8.21E10
CNN-WAV2 | 78.23 | 2.36E10
CNN-WAV4 | 86.11 | 4.73E10
SDA | 48.64 | 3.68E11
SDA-WAV2 | 50.65 | 9.02E10
SDA-WAV4 | 67.45 | 1.80E11

Method | Accuracy (%)
---|---
CNN | 81.95
CNN-WAV2 | 78.23
CNN-WAV4 | 86.11
CNN-MLP2 | 71.42
CNN-MLP4 | 72.64
CNN-ELM2 | 69.75
CNN-ELM4 | 70.85

others. This fact allows an ensemble to have all but one of the subband networks incorrect and still report a correct detection. It emphasizes the importance of multiple subband representations being a part of the network. It also explains why the CNN-WAV4 method outperforms the CNN-WAV2 method.

The unique detections for each subband network show the strength of each subband. Not surprisingly, for both methods, the LL subband has the most unique detections. This subband has the most resemblance to the spatial images, and therefore has the most information to extract for features. The rest of the results trend downward, as the number of unique detections decreases for every subsequent subband.

The HH subband for both methods records the least number of unique detections, as it contains mainly edge details, and very little information concerning texture, details, etc. As a whole, the CNN-WAV2 method detects 2991 unique scenes, and the CNN-WAV4 method detects 1723 unique scenes.

Like the CIFAR-10 dataset, our network architecture for KDEF draws from Zeiler’s stochastic pooling work [

KDEF contains 4900 images of 35 people modeling seven basic emotions (afraid, angry, disgusted, happy, neutral, sad, and surprised) with their facial expressions. The models pose in five different directions (full left/right, half left/right, straight) for each emotion. Since KDEF does not specify a training or testing set, we randomly sort the images and select 3,900 as training data, and 1,000 as test data. Due to memory and time constraints, we resize the data to 128 by 128.

The KDEF results follow a similar trend as the CIFAR-10 results concerning the proposed methods. According to

Method | Accuracy (%) | O(N)
---|---|---
CNN | 82.5 | 1.34E11
CNN-WAV2 | 86.1 | 5.58E10
CNN-WAV4 | 93.9 | 1.12E11
SDA | 14.5 | 3.87E11
SDA-WAV2 | 84.1 | 7.73E10
SDA-WAV4 | 85.1 | 1.55E11

Like the CIFAR-10 results, when we compare our OR fusion method to other methods like MLP and ELM, ours prevails in classification accuracy. Our proposed method maximizes the strength of each subband network within the ensemble by summing up the unique detections towards the total classification accuracy. The other fusion methods combine the weaker outputs with the stronger outputs as they both become inputs into MLP and ELM. This combination dilutes the strength of the stronger activations, and thus leads to a less accurate classification.

The unique detections contribute to the higher accuracies of the proposed methods versus their traditional counterparts. From analyzing the proposed CNN methods, we can discern the importance and influence of multiple subband networks. The CNN-WAV4 ensemble network can error correct more effectively than the CNN-WAV2 ensemble network, which explains why it performs better. These unique detections show the power in the diversity of each subband representation.

Method | Accuracy (%)
---|---
CNN | 82.5
CNN-WAV2 | 86.1
CNN-WAV4 | 93.9
CNN-MLP2 | 77.5
CNN-MLP4 | 73.4
CNN-ELM2 | 78.3
CNN-ELM4 | 73.0

unique detections for CNN-WAV2 and CNN-WAV4. Like the CIFAR-10 results, the unique detections trend downward, with the LL bands having the greatest number of unique detections, due to their similarity to the original spatial image.

The experiments and results solidify our initial claims that a wavelet-based ensemble network performs with greater accuracy, at a comparable or lower computational cost, than traditional deep neural network methods. Even with the emphasis on CNNs, the proposed methods also follow the aforementioned trends when applied to SDA.

We conclude that CNN-WAV2 has a smaller computational cost than the other methods, but sacrifices accuracy. This dilution in accuracy comes from combining the detail subbands prior to any learning, which loses much of the information and uniqueness of each detail subband.

We also conclude CNN-WAV4 has the greatest robustness, with its ability to correct the errors of other subbands, resulting in the greatest accuracy across all methods. However, due to all of the subbands contributing to the network, the computational cost is higher than the CNN-WAV2 method.

Even with the trade-off in accuracy vs. efficiency, the CNN-WAV4 method proves itself as superior to the traditional CNN and SDA methods. It performs better in both categories, and its higher accuracy and comparable efficiency prove its superiority over our CNN-WAV2 method.

Our proposed methods have limitations in their present structure. First, the CNN architecture applied to each subband is not variable. Second, we run the networks sequentially instead of in parallel. Third, we recognize that operations exist that could further reduce the computational complexity of the methods in all phases of calculation (preprocessing, training, post-processing, etc.). Lastly, we utilize only one wavelet basis (Haar), when others could possibly perform better.

This area, particularly the hybridization of wavelets and deep learning networks, has much room for growth and further contributions by researchers. Parallel computing with multiple GPUs can increase the computational efficiency of both proposed methods, especially CNN-WAV4. Creating subband-specific networks can improve the individual classification accuracies. Expanding the algorithm to multiple decomposition levels can further increase classification accuracy. Working with datasets of larger images can also strengthen the points of this article, especially concerning computational costs.

This research is supported by the Title III HBGI PhD Fellowship grant from the U.S. Department of Education.

Williams, T. and Li, R. (2018) An Ensemble of Convolutional Neural Networks Using Wavelets for Image Classification. Journal of Software Engineering and Applications, 11, 69-88. https://doi.org/10.4236/jsea.2018.112004

Basis | CNN-WAV2 Accuracy (%) | CNN-WAV4 Accuracy (%)
---|---|---
bior1.1 | 78.22 | 86.15
coif1 | 78.53 | 85.59
haar | 78.23 | 86.11
rbio1.1 | 78.21 | 85.82
sym2 | 78.85 | 85.62

Basis | CNN-WAV2 Accuracy (%) | CNN-WAV4 Accuracy (%)
---|---|---
bior1.1 | 88.2 | 92.3
coif1 | 84.8 | 90.4
haar | 86.1 | 93.9
rbio1.1 | 85.6 | 92.4
sym2 | 82.7 | 91.5