Dual Channel with Involution for Long-Tailed Visual Recognition

With the rapid increase of large-scale problems, the distributions of real-world datasets tend to be long-tailed. Existing solutions typically involve re-balancing strategies (i.e., re-sampling and re-weighting). Although these strategies can significantly promote classifier learning in deep networks, they also unexpectedly impair the representative ability of the learned deep features to a certain extent. Therefore, this paper proposes a dual-channel learning algorithm with involution neural networks (DC-Invo) to take care of representation learning and classifier learning concurrently. The key design is to combine ResNet and involution, whose kernels have wider coverage in the spatial dimension, to obtain higher classification accuracy. The paper conducts extensive experiments on several benchmark vision tasks, including CIFAR-LT, Imagenet-LT, and Places-LT, showing that DC-Invo achieves significant performance gains on long-tailed datasets.


Introduction
Visual recognition research has developed rapidly during the past few years, mainly driven by large image datasets [1] [2], deep convolutional neural networks (CNNs), and high-performance computing resources. In traditional classification and recognition tasks, the distribution of the training data is usually artificially balanced. Real-world visual data, however, are far more biased: they typically follow a long-tailed distribution [3] [4], on which many standard methods fail to fit correctly, resulting in a significant decrease in accuracy. Motivated by this, there have been some recent attempts to study long-tailed recognition, i.e., recognition in settings where the number of instances per class is highly variable and follows a long-tailed distribution.
When learning with long-tailed datasets, a common challenge is that instance-rich (or head) classes dominate the training process. The learned classification model performs well on these classes but significantly worse on instance-scarce (or tail) classes. To solve this problem and improve the performance of all classes, a prominent and effective approach is the class re-balancing strategy, which is proposed to mitigate the extreme imbalance of the training data. In general, class re-balancing methods can be roughly divided into two groups, i.e., re-sampling [5] [6] and re-weighting [7] [8]. These methods adjust network training by re-sampling instances or re-weighting the losses of samples within the SGD mini-batches, so that training is expected to be closer to the test distribution. Therefore, class re-balancing can directly and effectively affect the update of the deep network's classifier weights, i.e., it promotes classifier learning.
However, although re-balancing methods improve the final predictions, they still have adverse effects: they can also unexpectedly impair the representativeness of the learned deep features (i.e., representation learning) to some extent. Specifically, when the data imbalance is extreme, there are risks of over-fitting the tail data (by over-sampling) and under-fitting the whole data distribution (by under-sampling). Re-weighting, for its part, distorts the original distribution by directly changing or even reversing the frequency with which data are presented. To solve these problems, the BBN model [9] proposed a unified bilateral-branch network that carries out feature learning and classifier learning of the deep network simultaneously, together with a cumulative learning strategy that adjusts the bilateral learning to exhaustively improve the recognition performance on long-tailed tasks. Moreover, convolution has been a central component of modern neural networks, triggering the explosion of deep learning in vision. In 2021, Li et al. [10] reconsidered the inherent principles of standard convolution for visual tasks, namely that it is spatial-agnostic and channel-specific. They proposed a new neural network operator, named involution, by inverting these design principles: involution kernels are distinct over the spatial extent but shared across channels. Involution can summarize context over a broader spatial arrangement, thus overcoming the difficulty of modeling long-range interactions, and can adaptively assign weights at different locations to prioritize the most informative visual elements in the spatial domain. Based on the above, this paper proposes a dual-channel structure with involution neural networks (DC-Invo) for both representation learning and classifier learning. At the same time, a cumulative learning strategy is used during DC-Invo training to adjust the bilateral learning.
As shown in Figure 1, the DC-Invo model consists of two channels, called the "traditional learning channel" and the "re-balancing learning channel". As the name implies, the traditional learning channel adopts uniform sampling to maintain the original data distribution for representation learning, while the re-balancing learning channel uses a reversed sampler (i.e., small sampling weights for high-frequency classes) to model the tail data. The predicted outputs of the two channels are then aggregated in the cumulative learning part by an adaptive trade-off parameter α, which is automatically generated by the "Adapter" based on the number of training epochs. This adjusts the entire DC-Invo model to first learn general features from the original distribution and then gradually focus on the tail data. More importantly, in the backbone network, the involution neural network is combined with the ResNet residual network to obtain higher classification accuracy on long-tailed datasets, because the involution kernel has wider coverage in the spatial dimension (a wider receptive field).
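As a minimal illustration of the reversed sampler, one could compute per-class sampling probabilities inversely proportional to class frequency; the function name and normalization below are this sketch's own, not the paper's code:

```python
import numpy as np

def reversed_sampling_weights(class_counts):
    """Per-class sampling weights for the re-balancing channel: classes
    with many samples get small weights and tail classes get large ones.
    Here w_c is proportional to max_count / n_c, normalised so the
    weights form a probability distribution."""
    counts = np.asarray(class_counts, dtype=float)
    inv = counts.max() / counts          # inverse-frequency score per class
    return inv / inv.sum()               # normalise into sampling probabilities

# Head class (5000 images) vs. middle (500) vs. tail (50):
w = reversed_sampling_weights([5000, 500, 50])
```

In this scheme the tail class is drawn far more often than the head class, which is the behaviour described above for the re-balancing learning channel.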
To demonstrate the effectiveness of the proposed DC-Invo, the paper conducts extensive experiments on four benchmark long-tailed datasets: CIFAR-10-LT, CIFAR-100-LT, Imagenet-LT, Places-LT. Empirical results on these datasets show that the model obviously outperforms existing state-of-the-art methods.
In summary, the primary contributions of this paper are as follows: 1) The paper proposes a dual-channel learning algorithm with involution neural networks (DC-Invo) to handle representation learning and classifier learning for exhaustively enhancing long-tailed recognition; in addition, a cumulative learning strategy is used to adjust the bilateral learning. 2) The paper evaluates the DC-Invo model on four benchmark long-tailed visual recognition datasets, achieving higher accuracy than established state-of-the-art methods (different sampling strategies and new loss designs).

Re-Sampling
Re-sampling is a preprocessing technique for the problem of imbalanced data classification. A large number of sampling techniques have been proposed from different perspectives, mainly over-sampling, which simply repeats data from minority classes [11] [12] [13], and under-sampling, which abandons data from dominant classes [14] [15]. However, re-sampling is not a truly perfect solution: the tail data are often learned repeatedly, which lacks sufficient sample diversity and is not robust enough, while the head data are often not fully learned [16] [17].
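For concreteness, naive over-sampling and under-sampling can be sketched as follows; these are illustrative helpers in the simplest form, not any particular paper's implementation:

```python
import random

def oversample(indices_by_class, target):
    """Naive over-sampling: draw with replacement from each class's
    indices so that every class contributes exactly `target` examples
    (minority-class samples are simply repeated)."""
    balanced = []
    for cls, idxs in indices_by_class.items():
        balanced += [random.choice(idxs) for _ in range(target)]
    return balanced

def undersample(indices_by_class, target):
    """Naive under-sampling: keep at most `target` examples per class,
    discarding the surplus of the head classes."""
    balanced = []
    for cls, idxs in indices_by_class.items():
        balanced += random.sample(idxs, min(target, len(idxs)))
    return balanced
```

The drawbacks described above are visible directly in the code: `oversample` repeats tail indices verbatim (no new sample diversity), while `undersample` throws head data away unseen.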

Cost-Sensitive Learning
The cost-sensitive function is an effective method for dealing with imbalanced classification; its main idea is to make the model pay more attention to the minority samples during learning, so as to alleviate the model's bias towards the majority samples. Cost-sensitive methods mainly include the adjustment of sample weights, the design of various loss functions, and techniques that benefit the learning of minority classes. Ren et al. [18] proposed a meta-learning-based approach that automatically assigns weights to training samples according to their loss on a validation set. In terms of the loss function, various novel designs have emerged in recent years. In 2017, Lin et al. [19] designed Focal Loss, a loss function for online mining of difficult samples. In 2018, Dong et al. [20] added a corrective term on top of the Softmax loss function. Cui et al. [21] designed a weight-adjustment scheme that uses the effective number of samples of each class to adjust the weight of each class's loss, generating a class-balanced loss function. Cao et al. [22] proposed the LDAM (Label-Distribution-Aware Margin) loss, which encourages the learned decision boundary to stay as far away from the minority classes as possible, and rigorously proved the rationality of this loss function.
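The effective-number re-weighting of Cui et al. [21] can be sketched as below; the effective number of a class with n samples is (1 − β^n)/(1 − β), and the normalisation that rescales the weights to sum to the number of classes is a common convention assumed here:

```python
def class_balanced_weights(class_counts, beta=0.9999):
    """Class-balanced re-weighting: weight each class by the inverse of
    its 'effective number' of samples, (1 - beta^n) / (1 - beta), then
    rescale so the weights sum to the number of classes."""
    effective = [(1.0 - beta ** n) / (1.0 - beta) for n in class_counts]
    raw = [1.0 / e for e in effective]
    scale = len(class_counts) / sum(raw)
    return [w * scale for w in raw]

# Head, middle, and tail class from a hypothetical long-tailed dataset:
weights = class_balanced_weights([5000, 500, 50])
```

Because the effective number saturates as n grows, head classes receive markedly smaller loss weights than tail classes, which is exactly the re-balancing effect described above.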

Methodology
As shown in Figure 1, our DC-Invo mainly adds a new neural network operator to the backbone network structure of the BBN model [9] and includes three main components: the traditional learning channel, the re-balancing learning channel, and the cumulative learning strategy. The traditional learning channel obtains its input data from a uniform sampler and is responsible for learning the general patterns of the original distribution, while the re-balancing channel receives input data from a reversed sampler and is designed to model the tail data. The cumulative learning strategy aggregates the output feature vectors ϕ_t and ϕ_r of the two channels to compute the training loss.

Involution
Involution is a new neural network operator proposed by Li et al. in 2021 [11], which inverts the two inherent principles of convolution: spatial-agnostic becomes spatial-specific, and channel-specific becomes channel-agnostic. Based on these two design principles (i.e., spatial-specific and channel-agnostic), the new operator, called involution, was proposed. Compared with convolution, involution can aggregate context over a wider spatial extent, overcoming the difficulty of modeling long-range interactions, and can adaptively allocate weights at different positions so as to prioritize the visual elements with the most abundant information in the spatial domain. Let X ∈ ℝ^{H×W×C} denote the input feature map, where H and W represent its height and width and C enumerates the channels. The involution kernel ℋ ∈ ℝ^{H×W×K×K×G} indicates that all C channels share G kernels of size K × K.
So the involution can be formulated as:

Y_{i,j,k} = Σ_{(u,v)∈Δ_K} ℋ_{i,j,u+⌊K/2⌋,v+⌊K/2⌋,⌈kG/C⌉} · X_{i+u,j+v,k},

where ℋ_{i,j,·,·,g} ∈ ℝ^{K×K} is the involution kernel and Δ_K is the set of offsets in the K × K neighborhood of (i, j). The general form of kernel generation is ℋ_{i,j} = φ(X_{Ψ_{i,j}}), where Ψ_{i,j} is an index set over the neighborhood of (i, j). In the simplest instantiation, Ψ_{i,j} = {(i, j)}, so that ℋ_{i,j} = φ(X_{i,j}) = W_1 σ(W_0 X_{i,j}), where W_0 ∈ ℝ^{(C/γ)×C} and W_1 ∈ ℝ^{(K²·G)×(C/γ)} are linear transformation matrices, γ represents the reduction ratio, and σ implies Batch Normalization and a non-linear activation function that interleave the two linear projections.
As shown in Figure 2, under the above simple instantiation of involution kernel, a complete schematic diagram of involution can be obtained.
The schematic is from the literature [11]. For the feature vector at a point of the input feature map, it is first expanded into the shape of the kernel through the φ transformation (FC-BN-ReLU-FC) and a reshape (channel-to-space) operation, yielding the corresponding involution kernel at that coordinate; this kernel is then multiplied-and-added with the feature vectors in the neighborhood of that coordinate on the input feature map to obtain the final output feature map.
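As a rough illustration of the two equations above, the forward pass of a single-group involution (G = 1, stride 1, with a plain ReLU standing in for the full BN + activation σ) might be sketched in NumPy as follows; the shapes and helper names are this sketch's own, not the paper's code:

```python
import numpy as np

def involution_forward(x, w0, w1, K):
    """Minimal single-group involution (stride 1).  The K x K kernel is
    generated separately for each spatial position (spatial-specific)
    and shared across all C channels (channel-agnostic).
    x: (H, W, C); w0: (C, C_r) bottleneck; w1: (C_r, K*K) expansion."""
    H, W, C = x.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    y = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            # kernel generation phi: two linear maps with a ReLU between
            hidden = np.maximum(x[i, j] @ w0, 0.0)
            kernel = (hidden @ w1).reshape(K, K)
            # multiply-add the kernel over the K x K neighbourhood; the
            # same spatial weights are applied to every channel
            patch = xp[i:i + K, j:j + K, :]
            y[i, j] = np.tensordot(kernel, patch, axes=([0, 1], [0, 1]))
    return y
```

Note that, unlike convolution, no kernel weights are stored per output location: the kernel at (i, j) is computed on the fly from the input feature vector at (i, j).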

Modeling Process
At this point, Z is the predicted output, and then the softmax function is used to normalize Z to get the probability of each class:

p_i = exp(Z_i) / Σ_j exp(Z_j).
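A numerically stable version of this normalization can be written as below; the max-subtraction trick is standard practice (it leaves the result unchanged while avoiding overflow), not something specific to this paper:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max logit before
    exponentiating, then normalise into a probability distribution."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Example logits for a 3-class problem:
p = softmax([2.0, 1.0, 0.1])
```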

Proposed Cumulative Learning Strategy
A cumulative learning strategy is proposed to dynamically adjust the learning focus between the dual channels by controlling the feature weights of the two channels and the classification loss L. It is designed to learn the general patterns first and then gradually pay attention to the tail data. In the training phase, the feature ϕ_t of the traditional learning channel is multiplied by α and the feature ϕ_r of the re-balancing channel by 1 − α. As the number of training epochs increases, α gradually decreases. The motivation is that the learning focus of our DC-Invo should gradually shift from feature representation to the classifier, which can significantly improve the accuracy of long-tailed recognition.
In the experiment, we also provide this intuitive result by comparing different types of adapters, cf. Section 4.4.3.
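Under these definitions, the combined prediction and loss can be sketched as below, following the BBN-style formulation in which the two channels' outputs are mixed by α and the two labels' losses are weighted by α and 1 − α; the exact form used by DC-Invo may differ in detail:

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of a single example computed from raw logits."""
    e = np.exp(logits - logits.max())
    return -np.log(e[label] / e.sum())

def dual_channel_loss(z_t, z_r, y_t, y_r, alpha):
    """Cumulative-learning loss sketch: mix the traditional-channel and
    re-balancing-channel logits with the trade-off alpha, then weight
    the losses on the two channels' labels the same way."""
    z = alpha * z_t + (1.0 - alpha) * z_r
    return alpha * cross_entropy(z, y_t) + (1.0 - alpha) * cross_entropy(z, y_r)
```

At α = 1 the loss reduces to pure traditional-channel training, and at α = 0 to pure re-balancing training, matching the intended shift of focus over the epochs.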

Datasets and Empirical Settings
Long-tailed CIFAR-10 and CIFAR-100. According to the number of categories, CIFAR is divided into CIFAR-10 and CIFAR-100, containing 10 and 100 categories respectively. Each dataset contains 60,000 images, 50,000 for training and 10,000 for validation. The paper generated the long-tailed versions of CIFAR-10 and CIFAR-100 following [22], with controllable degrees of data imbalance. The test set remains unchanged, and the number of samples of each category in the training set is set according to the chosen imbalance ratio.
Long-tailed Imagenet. The paper constructed the long-tailed version of Imagenet following [23]. The validation and test sets remain unchanged, and the samples of each class in the training set are sampled following the Pareto distribution with power value α = 6. A total of 115,846 images are retained in the training set, with each category containing at most 1280 and at least 5 images.
Long-tailed Places. The structure of the long-tailed version of Places is similar to that of Imagenet-LT. Following the settings in [23], 20 images are sampled from each category of the validation set, 50 images from each category of the test set, and the samples of each category in the training set are sampled following the Pareto distribution with power value α = 6. The resulting training set has a total of 62,500 images, with each category containing at most 4980 and at least 5 images.
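The CIFAR-LT construction can be illustrated with the commonly used exponential class-size profile, where class i keeps n_max · (1/IR)^{i/(C−1)} samples for imbalance ratio IR; the helper below is illustrative, not the paper's dataset-generation script:

```python
def long_tailed_counts(n_max, num_classes, imbalance_ratio):
    """Exponential class-size profile commonly used to build CIFAR-LT:
    class i keeps n_max * (1/imbalance_ratio)^(i / (num_classes - 1))
    samples, so class 0 keeps n_max and the last class n_max / ratio."""
    counts = []
    for i in range(num_classes):
        frac = (1.0 / imbalance_ratio) ** (i / (num_classes - 1))
        counts.append(round(n_max * frac))
    return counts

# CIFAR-10-LT with imbalance ratio 50 (head: 5000 images, tail: 100):
counts = long_tailed_counts(5000, 10, 50)
```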

Implementation Details
Implementation details on CIFAR. For long-tailed CIFAR-10 and CIFAR-100, the paper followed the simple data augmentation proposed in [24] for training: a 32 × 32 crop is randomly sampled from the original image or its horizontal flip after 4 pixels are padded on each side. The paper trained the combination of ResNet-32 [24] and involution as the backbone network and used standard mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 2 × 10⁻⁴ for all experiments. All models were trained on a GeForce RTX 2080Ti GPU with a batch size of 128. For a fair comparison, the initial learning rate is set to 0.1 and decayed by 0.01 at the 120th epoch and again at the 160th epoch for our DC-Invo. A linear warm-up learning rate schedule [25] is used for the first 5 epochs.
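The CIFAR-LT learning-rate schedule described above (linear warm-up over the first 5 epochs, then decay by a factor of 0.01 at epochs 120 and 160) can be sketched as a simple function of the epoch index; the warm-up starting value is an assumption of this sketch:

```python
def learning_rate(epoch, base_lr=0.1, warmup_epochs=5):
    """Step schedule sketch for the CIFAR-LT runs: linear warm-up for
    the first `warmup_epochs` epochs, then multiply the rate by 0.01 at
    each of the milestone epochs 120 and 160."""
    if epoch < warmup_epochs:
        # linear warm-up from base_lr / warmup_epochs up to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    lr = base_lr
    for milestone in (120, 160):
        if epoch >= milestone:
            lr *= 0.01          # "decayed by 0.01" at each milestone
    return lr
```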
Implementation details on Imagenet-LT and Places-LT. For Imagenet-LT and Places-LT, all images are first resized to 256 × 256. During training, images are randomly cropped to 224 × 224 and then flipped horizontally with 50% probability. The paper used standard mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 to train for 60 epochs; the learning rate is initialized to 0.1 and decayed to 10% of its value at the 20th and 40th epochs, respectively.

Comparison Methods
In the experiments, this paper compared the DC-Invo model with several state-of-the-art methods, including different sampling strategies and new loss designs.

Experiment Results on Imagenet-LT and Places-LT
Table 2 shows the experimental results of the different algorithms on the Imagenet-LT and Places-LT datasets. Similar to the results on the CIFAR-LT datasets, the DC-Invo model outperforms the other algorithms; for example, its classification accuracy on Imagenet-LT and Places-LT is 2.1% and 1.3% higher, respectively, than that of the second-best algorithm. In conclusion, the comprehensive comparison of the different algorithms on several datasets shows that the DC-Invo model can model long-tailed datasets well.

Different Cumulative Learning Strategies
To verify the effectiveness of the proposed cumulative learning strategy, we explore a number of different strategies for generating the adaptive trade-off parameter α on CIFAR-10-IR50. In Figure 3, the abscissa represents the completion degree of model training, the ordinate represents the value of α used at that point in training, and each curve shows how α varies over the course of training. The paper tested both progress-relevant strategies, which adjust α with the number of training epochs (i.e., parabolic increment, cosine decay, linear decay, etc.), and progress-irrelevant strategies (i.e., equal weight, single weight, and β-distribution), cf. Table 3. As shown in Table 3, the parabolic decay adapter is the best among these adapters. The results of the three decay strategies are all better than the single-weight strategy, which uses only the traditional learning channel. Keeping the two channels at equal weight throughout gives results slightly lower than the single-weight strategy, and the parabolic increment strategy and the randomly generated β-distribution strategy give the worst results. These phenomena indicate that the model should emphasize representation learning first and then classifier learning. At the same time, compared with segment weight, the parabolic decay does not step directly from 1 to 0 but decreases gradually, so that the two channels can maintain their learning simultaneously during the whole training process and the model attends to the tail data at the end of training without damaging the learned features.
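The adapter strategies compared above can be written as simple functions of the training progress t ∈ [0, 1]; the strategy names follow Table 3, but the exact formulas here are illustrative guesses rather than the paper's definitions:

```python
import math

def adapter_alpha(progress, strategy="parabolic_decay"):
    """Trade-off parameter alpha as a function of training progress
    t in [0, 1] for several adapter strategies.  Decay strategies start
    near alpha = 1 (representation learning) and end near alpha = 0
    (classifier learning on tail data)."""
    t = progress
    if strategy == "parabolic_decay":
        return 1.0 - t ** 2          # slow start, late focus on tails
    if strategy == "linear_decay":
        return 1.0 - t
    if strategy == "cosine_decay":
        return math.cos(t * math.pi / 2.0)
    if strategy == "parabolic_increment":
        return t ** 2                # reversed emphasis (worst in Table 3)
    if strategy == "equal_weight":
        return 0.5                   # both channels weighted equally throughout
    raise ValueError(strategy)
```

Plotting these functions reproduces the qualitative shapes of the curves in Figure 3: all decay variants move smoothly from 1 to 0 instead of stepping abruptly.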

Conclusion
For long-tailed problems, the literature reveals that class re-balancing strategies, while significantly promoting classifier learning, also damage representation learning to some extent. Motivated by this, this paper proposed a dual-channel structure with involution neural networks (DC-Invo) that addresses both representation learning and classifier learning, to effectively improve the recognition performance of long-tailed classification tasks. Through comparisons with state-of-the-art methods and extensive ablation studies, this paper verified that DC-Invo achieves the best results on long-tailed benchmarks.

Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.