Action Recognition Using Multi-Scale Temporal Shift Module and Temporal Feature Difference Extraction Based on 2D CNN

Convolutional neural networks, which have achieved outstanding performance in image recognition, have been extensively applied to action recognition. The mainstream approaches to video understanding can be categorized into two-dimensional and three-dimensional convolutional neural networks. Although three-dimensional convolutional filters can learn the temporal correlation between different frames by extracting the features of multiple frames simultaneously, they result in an explosive number of parameters and a high calculation cost. Methods based on two-dimensional convolutional neural networks use fewer parameters; they often incorporate optical flow to compensate for their inability to learn temporal relationships. However, calculating the corresponding optical flow incurs additional calculation cost; further, it necessitates the use of another model to learn the features of the optical flow. We propose an action recognition framework based on the two-dimensional convolutional neural network; therefore, it was necessary to resolve the lack of temporal relationships. To expand the temporal receptive field, we propose a multi-scale temporal shift module, which is combined with a temporal feature difference extraction module to extract the difference between the features of different frames. Finally, the model was compressed to make it more compact. We evaluated our method on two major action recognition benchmarks: the HMDB51 and UCF-101 datasets. Before compression, the proposed method achieved an accuracy of 72.83% on the HMDB51 dataset and 96.25% on the UCF-101 dataset. After compression, the accuracy remained impressive, at 95.57% on UCF-101 and 72.19% on HMDB51. The final model was more compact than those of most related works.


Introduction
Human action recognition has recently attracted increasing research interest in the field of computer vision, and with the development of technology it now has wide applications. Deep ConvNets, such as Inception-V1 [1], ResNet [2], and their variations [3] [4] [5] [6], have already achieved outstanding performance in image classification. Several studies on action recognition directly inflated the filters of these models from two-dimensional (2D) to three-dimensional (3D) to obtain inflated 3D ConvNets (I3D) [7], Res3D [8], and ResNeXt3D [9], among other models. Currently, there are two main approaches to action recognition: 2D CNN (convolutional neural network) and 3D CNN. The 2D CNN method performs convolution on one frame at a time, without temporal fusion. Conversely, the 3D CNN method performs convolution on multiple frames using 3D convolutional filters to achieve spatio-temporal learning.
In contrast to image recognition, video understanding requires learning the relevance of frames; therefore, the disadvantage of the 2D CNN method is its relatively limited performance when only RGB images are used for recognition. To improve their accuracy, most 2D CNN mainstream approaches, such as two-stream [10] and its variations [11] [12] [13], incorporate the optical flow field [14]; however, this leads to additional computational costs. Conversely, C3D [15], the mainstream method based on 3D CNN leverages the advantages of 3D convolutional kernels to perform convolution on multiple frames and effectively learn the correlation of adjacent sampled frames. However, compared with the 2D CNN, its architecture causes an explosion of parameters and calculations.
In this study, we aimed to further improve the performance of the traditional 2D CNN architecture for action recognition by expanding the temporal receptive fields. Although TSM [16] increases the temporal receptive field to three frames, this is still limited compared with methods [11] [12] that generally stack five optical flow frames as inputs with each RGB frame. We propose the multi-scale temporal shift module (MSTSM), which can learn spatio-temporal information more effectively: owing to shift blocks with different scales, the entire model has larger receptive fields in the time dimension. Furthermore, many actions are prone to misprediction owing to the similarity of the movements that constitute them. The temporal feature difference extraction module of the proposed framework subtracts the features of different frames to learn the unique details of each action. Figure 1 is a schematic diagram of the proposed model. Finally, we filtered out similar kernels to make the model more compact.

Related Works
Recently, many approaches to video understanding and action recognition have been proposed. We discuss some mainstream works in this section. Compared with the traditional methods [17] [18], these works can be broadly categorized into two classes: 2D ConvNets-based methods and 3D ConvNets-based methods.

2D CNN
It is difficult to capture temporal relationships, which are crucial in video recognition, using methods based on 2D CNN; hence, most works incorporate other streams, such as optical flow or motion vectors [19] [20] [21] [22], to compensate for this deficiency.
Simonyan et al. [10] designed a two-stream ConvNet framework that contains spatial and temporal streams. The input for the spatial stream is a still RGB image sampled from the source video, whereas that of the temporal stream is in the form of stacked dense optical flow. The outputs from these two streams are combined through late fusion to obtain the final prediction.
Wang et al. [11] proposed the TSN based on the aforementioned two-stream method. In this approach, long-range temporal information is captured from the sampled frames using a sparse sampling strategy. First, the given input videos are sliced into several segments of equal length; then, one frame is sampled from each segment. Fusing the extracted features from these sampled snippets enables the framework to effectively learn the long-range relationships in the temporal dimension.
Lin et al. [16] proposed TSM, which propels the channels forward and backward along the temporal dimension; thus, the features of adjacent sampled frames are fused with the current frame after processing. It can be applied to any 2D ConvNet backbone to achieve a similar effect as a 3D ConvNet without extra costs.
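As an illustration, the shift operation of TSM can be sketched as follows. This is a minimal PyTorch sketch assuming feature maps of shape (N·T, C, H, W) and a 1/8 shift ratio in each direction; the function name and exact ratio are our own choices for illustration:

```python
import torch

def temporal_shift(x, n_segments, shift_ratio=8):
    """Shift a fraction of channels one frame forward/backward in time.

    x: features of shape (N*T, C, H, W), where T = n_segments frames per video.
    """
    nt, c, h, w = x.size()
    x = x.view(nt // n_segments, n_segments, c, h, w)   # (N, T, C, H, W)
    fold = c // shift_ratio
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                 # shift backward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold] # shift forward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]            # remaining channels unchanged
    return out.view(nt, c, h, w)
```

Because only a memory re-indexing is performed, the fusion of adjacent frames comes at essentially zero extra FLOPs, which is the key appeal of TSM.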

3D CNN
Carreira et al. [7] proposed a framework called I3D, which inflates all the convolutional filters and pooling layers of the Inception-V1 model [1] from 2D kernels to 3D kernels. Because of the design of Inception-V1 and the pre-trained weights on the Kinetics dataset [23], this framework has fewer parameters than C3D [15] and thus circumvents the overfitting problem. However, I3D samples frames from the whole video at the inference stage, which causes a heavy computational cost.
Wang et al. [24] proposed an I3D-based framework that incorporated long short-term memory (LSTM) [25] for improved accuracy to model the high-level temporal features extracted by the Kinetics-pretrained I3D model. However, similar to the I3D, this approach experienced parameter explosion due to the LSTM.
T-C3D, a framework proposed by Liu et al. [26], first divides the given input video into three clips and samples eight frames from each clip. Thus, it can capture short-term features from frames in the same clips using 3D kernels and long-term features when fusing the prediction from each clip. Furthermore, Liu et al. employed compression methods [27] [28] to reduce the model size.
Although the compression technique reduces the model size, T-C3D still requires several frames for inference, which still causes explosive FLOPs.

Multi-Scale Temporal Shift Module and Temporal Feature Difference Extraction Based on 2D CNN for Action Recognition
We combined the two proposed modules, the MSTSM and the temporal feature difference extraction module (TFDEM), on ResNet-50 [2], as shown in Figure 2. First, we adopted the sparse sampling strategy [11] and fed the sampled frames to our model. In the MSTSM, after shifting the feature maps along the temporal dimension, we replaced and concatenated the features of different frames in the two temporal shift blocks, thereby increasing the temporal receptive fields. In the TFDEM, to highlight the difference between the frames, we subtracted the feature maps of the current frame and the next frame. Then, we integrated the cross-entropy loss [29] values of the two paths to update the weights of the entire model. Finally, to make the model more compact, we pruned the kernels in the layers that had passed through the MSTSM.
Figure 2. Overall architecture of the proposed method. The backbone is a 2D ResNet-50; convolution is performed on different frames at different times. Kernels of the layers that passed through the MSTSM were pruned; these layers are denoted "MSTSM-p".

Sampling Strategy
Similar to Wang et al. [11], we adopted a sparse sampling strategy. The input video was sliced into several segments and one frame from each segment was sampled. This made it possible to understand the information conveyed by the entire video.
For example, given an input video with N frames, we divided the N frames into n parts of equal length; thus, each part was composed of k = N/n frames.
Then, the sampled frames from each part formed a set, denoted as S = {F_1, F_2, ..., F_n}, where the frame number F_i for the training stage was a random number in the interval [1, k] plus (i − 1)k. For the testing stage, the frame number F_i was the median of the interval [1, k] plus (i − 1)k. In our experiments, n is 8, unless otherwise specified.
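The sampling strategy above can be sketched as a short helper (the function name and signature are our own; 1-indexed frame numbers as in the text, and the integer midpoint is used as the "median" of [1, k]):

```python
import random

def sample_indices(num_frames, n=8, training=True):
    """Sparse sampling [11]: one frame per segment of k = num_frames // n frames."""
    k = num_frames // n
    if training:
        # random offset in [1, k], plus the segment start (i - 1) * k
        return [(i - 1) * k + random.randint(1, k) for i in range(1, n + 1)]
    # testing: deterministic midpoint of [1, k] per segment
    return [(i - 1) * k + (k + 1) // 2 for i in range(1, n + 1)]
```

For an 80-frame video with n = 8, testing yields frames 5, 15, 25, ..., 75, so the samples cover the entire video regardless of its length.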

Multi-Scale Temporal Shift Module
1) One-unit Temporal Shift Block: Parts of the feature maps are shifted from the frames before and after the current frame; this is also the relationship that needs to be learned most in action recognition. We therefore selected a higher ratio of input channels to be shifted. The intuitive approach is to replace the original feature maps with the shifted ones, as shown on the left side of Figure 3. However, the replaced feature maps may contain very important channels of the original frame, in which case the current frame cannot be learned effectively. With this block, spatio-temporal learning can be achieved without incurring additional costs.
2) Two-unit Temporal Shift Block: After one-unit temporal shifting, to further increase the model's temporal receptive fields, we shifted the features by two units. Thus, we shifted the features of the frame before the previous frame and after the next frame to the current frame. Hence, as illustrated in Figure 4, we merged the information from the other four frames with that of the current frame, except for boundary cases.
To circumvent the aforementioned risk, we concatenated the shifted features with the original feature maps, as shown on the right side of Figure 3.
Thus, the features with two-unit shifts could be considered extra information.
However, although the information is critical, shifting a large number of channels with two units may confuse the model about the features that are close to the current frame; this may interfere with the learning of temporal order. Therefore, we selected the number of channels with a lower ratio in this case; this can also reduce computational costs.
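A minimal sketch of the two-unit shift with concatenation is given below. Here a small fraction of channels (the 1/16 ratio is our assumption, reflecting the "lower ratio" mentioned above) is shifted by two frames in each direction and appended to the original features, so no original channels are lost:

```python
import torch

def two_unit_shift_concat(x, n_segments, ratio=16):
    """Concatenate two-unit-shifted channels to the original features.

    x: (N*T, C, H, W). Returns (N*T, C + 2 * (C // ratio), H, W).
    """
    nt, c, h, w = x.size()
    x5 = x.view(nt // n_segments, n_segments, c, h, w)   # (N, T, C, H, W)
    fold = c // ratio
    back = torch.zeros_like(x5[:, :, :fold])
    fwd = torch.zeros_like(x5[:, :, :fold])
    back[:, :-2] = x5[:, 2:, :fold]   # features from frame t + 2 (zeros at boundary)
    fwd[:, 2:] = x5[:, :-2, :fold]    # features from frame t - 2 (zeros at boundary)
    out = torch.cat([x5, back, fwd], dim=2)  # concat keeps the original channels intact
    return out.view(nt, c + 2 * fold, h, w)
```

Zero-padding the boundary frames corresponds to the boundary cases noted above, where fewer than four neighboring frames exist.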

Temporal Feature Difference Extraction Module
In several cases, the difference between frames was subtle even though we adopted the sparse sampling strategy. Another problem was that the movements that constitute some actions differ only slightly. The proposed TFDEM was designed to address these problems. Figure 5 shows the detailed structure of the TFDEM. To maintain efficiency and enlarge the receptive fields during feature extraction, the stride of the convolutional layers before subtraction was set to 2, which shrinks the size of the feature maps. However, subtracting the low-resolution features did not have a significant impact; therefore, a bilinear upsampling layer was inserted before feature subtraction. Then, convolutional, global average pooling [30], and fully connected layers extracted and aggregated the subtracted features.
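The data flow described above can be sketched as a small PyTorch module. This is illustrative only: the layer widths, the averaging over temporal differences, and the module/parameter names are our assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TFDEMSketch(nn.Module):
    """Sketch of TFDEM: stride-2 conv -> bilinear upsample -> frame-wise
    feature subtraction -> conv -> global average pooling -> FC."""

    def __init__(self, in_ch, mid_ch, num_classes, n_segments):
        super().__init__()
        self.n = n_segments
        self.down = nn.Conv2d(in_ch, mid_ch, 3, stride=2, padding=1)
        self.conv = nn.Conv2d(mid_ch, mid_ch, 3, padding=1)
        self.fc = nn.Linear(mid_ch, num_classes)

    def forward(self, x):  # x: (N*T, C, H, W)
        h, w = x.shape[-2:]
        f = self.down(x)                                   # stride-2 shrinks maps
        f = F.interpolate(f, size=(h, w), mode='bilinear',
                          align_corners=False)             # restore resolution
        f = f.view(-1, self.n, *f.shape[1:])               # (N, T, C', H, W)
        diff = f[:, 1:] - f[:, :-1]                        # next-frame feature difference
        d = self.conv(diff.flatten(0, 1))
        d = d.mean(dim=(2, 3))                             # global average pooling
        d = d.view(-1, self.n - 1, d.size(1)).mean(dim=1)  # fuse over time
        return self.fc(d)
```

The bilinear upsampling before subtraction is the key point: differencing at full resolution preserves the subtle motion details that the stride-2 features would otherwise lose.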

Objective Function
We had two objective functions during the training stage: $L_{\mathrm{main}}$ and $L_{\mathrm{TFDEM}}$. The first objective function $L_{\mathrm{main}}$ was calculated based on the definition of the cross entropy with the output probability $p_{\mathrm{main}}$ from the main path and the ground-truth label $y_{\mathrm{true}}$. The equation can be written as follows:

$$\arg\min_{W_{\mathrm{main}}} L_{\mathrm{main}} = \arg\min_{W_{\mathrm{main}}} -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{\mathrm{true}}^{(i,c)} \ln p_{\mathrm{main}}^{(i,c)},$$

where $N$ is the total number of training videos, $W_{\mathrm{main}}$ is the learnable weight of the main path, $y_{\mathrm{true}}$ is the ground-truth label, $C$ is the total number of categories, and $p_{\mathrm{main}}^{(i,c)}$, produced by the main path, is the probability of the $i$-th video belonging to the $c$-th category.
Similarly, the equation of the second objective function $L_{\mathrm{TFDEM}}$ can be written as follows:

$$\arg\min_{W} L_{\mathrm{TFDEM}} = \arg\min_{W} -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{\mathrm{true}}^{(i,c)} \ln p_{\mathrm{TFDEM}}^{(i,c)},$$

where $W$ is the learnable weight of the entire model and $p_{\mathrm{TFDEM}}^{(i,c)}$ is the probability produced by the TFDEM path.
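In PyTorch, integrating the two cross-entropy losses can be sketched as below. Equal weighting of the two terms is our assumption; the text only states that the two loss values are integrated:

```python
import torch
import torch.nn.functional as F

def total_loss(logits_main, logits_tfdem, y_true):
    """Sum of cross-entropy losses from the main path and the TFDEM path,
    so one backward pass updates the weights of both paths."""
    return F.cross_entropy(logits_main, y_true) + F.cross_entropy(logits_tfdem, y_true)
```

`F.cross_entropy` applies the softmax internally, so the paths can output raw logits rather than probabilities.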

Pruning Method
The kernel space of each convolutional layer contains some "similar" kernels. Performing convolution using similar kernels will result in similar outputs. Highly similar outputs may be redundant in the model; hence, some of them can be removed with no significant effect.
1) Selecting Similar Kernels: For each target layer, we computed the geometric median $K_{GM}$ of its kernels; the "similar kernels" referred to hereafter are kernels with a smaller distance from $K_{GM}$. Then, we selected kernels for removal according to the pruning ratio.
2) Target Layers: The target layers in our method were the layers with the MSTSM because the kernel similarity may either result in redundant spatial or temporal features, following shifting. Pruning this layer can filter out redundant spatial and temporal features, as illustrated in Figure 7.
3) Averaging Selected Input Feature Maps: It is known that each input feature map of the layer $L_i$ is the output feature map of the preceding layer $L_{i-1}$.
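The kernel-selection step can be sketched as follows, assuming a geometric-median criterion in the style of FPGM. For brevity, the geometric median is approximated here by the kernel minimizing the summed distance to all other kernels; the function name and the 30% default ratio are our own:

```python
import torch

def kernels_near_geometric_median(weight, prune_ratio=0.3):
    """Return indices of conv kernels closest to the (approximate) geometric
    median of the kernel space; these are the "similar" kernels to prune.

    weight: conv weight of shape (out_ch, in_ch, k, k).
    """
    flat = weight.flatten(1)            # (out_ch, D): one row per kernel
    dist = torch.cdist(flat, flat)      # pairwise Euclidean distances
    gm_idx = dist.sum(dim=1).argmin()   # kernel nearest to all others
    n_prune = int(weight.size(0) * prune_ratio)
    return dist[gm_idx].argsort()[:n_prune]  # smallest distance to the median first
```

Because such kernels produce outputs that the remaining kernels can largely reproduce, removing them compacts the model with little accuracy loss.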

Results
In this section, we first describe our experimental environment and datasets. Then, we present the experimental results to demonstrate that the proposed modules can improve the performance. Further, the ablation studies are described to show how we determined the optimal settings. At the end of this section, we present the comparison results of the proposed method and several state-of-the-art methods.

Experimental Environment
We implemented our proposed method using the PyTorch [32] framework. Owing to the benefits of transfer learning, we pre-trained our model on the ImageNet [33] and Kinetics [23] datasets. During the fine-tuning process, we froze the batch normalization [3] layers. We trained our model with a weight decay of 0.0005 and an initial learning rate of 0.00025, which was divided by ten every ten epochs; the batch size was 16, and the optimizer was stochastic gradient descent (SGD) [34]. The CPU was an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.1 GHz, and the GPU was an NVIDIA TITAN V.
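The training hyper-parameters above map directly onto a standard PyTorch setup, sketched below; the momentum value is our assumption, as the text does not specify it, and the `Linear` layer is just a stand-in for the actual network:

```python
import torch

# Stand-in for the actual ResNet-50-based network.
model = torch.nn.Linear(10, 5)

# SGD with lr 0.00025 and weight decay 0.0005, as stated in the text
# (momentum=0.9 is assumed here).
optimizer = torch.optim.SGD(model.parameters(), lr=2.5e-4,
                            momentum=0.9, weight_decay=5e-4)

# Divide the learning rate by ten every ten epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
```

Calling `scheduler.step()` once per epoch reproduces the stated schedule: 2.5e-4 for epochs 0-9, 2.5e-5 for epochs 10-19, and so on.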

Datasets
We mainly evaluated our framework on the UCF-101 dataset provided by [35] and the HMDB51 dataset provided by [36]. We briefly introduce these two datasets in the following sections.
1) UCF-101 [35]: UCF-101 is one of the most popular datasets for action recognition. It contains 13,320 realistic action videos from 101 categories in five groups; most of the videos were collected from YouTube. The video frame rate in this dataset is 25 frames per second, and the resolution is 320 × 240. The dataset is composed of three training and testing splits; we report the average accuracy over the three splits, as in the majority of existing works.
2) HMDB51 [36]: HMDB51 contains 6,766 video clips from 51 action categories, collected mostly from movies. Like UCF-101, it provides three training and testing splits, and we report the average accuracy over the three splits.

Data Augmentation
To improve the performance of the model and avoid overfitting, we applied data augmentation similar to [16] during training. First, we resized the raw image such that its shorter side equaled 256 pixels. Then, we performed corner and center cropping with the height and width sampled from {256, 224, 192, 168}.
Subsequently, we resized the cropped image to 224 × 224. In the last step, random horizontal flipping with probability p = 0.5 was performed.
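The crop-selection logic can be sketched as follows (a simplified sketch; the real pipeline would then crop the image, resize to 224 × 224, and flip, and the function name is our own):

```python
import random

def sample_crop(h, w, sizes=(256, 224, 192, 168)):
    """Pick a crop scale and one of the four corners or the center,
    for an image of height h and width w (shorter side already 256)."""
    size = random.choice(sizes)
    positions = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
                 ((h - size) // 2, (w - size) // 2)]
    top, left = random.choice(positions)
    return top, left, size
```

Restricting crops to corners and the center (rather than arbitrary positions) keeps the augmentation distribution close to the one used by [16], which eases comparison.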

Experimental Results and Ablation Studies
In this section, we present the results of the proposed methods step by step. Unless otherwise specified, the results were evaluated on the UCF-101 dataset.
We show the various settings of the proposed modules and discuss their influence from different aspects. Furthermore, we applied our proposed method to a different backbone and obtained competitive results, and we demonstrate the possibility of incorporating the optical flow into our method. The influence of each module is summarized in Table 1. Regardless of whether the TFDEM and MSTSM were used individually or combined, the accuracy was significantly higher than that of the baseline; the highest accuracy was achieved by the combination of both modules.
1) Incorporation with Optical Flow: Similar to other works, we also attempted to incorporate an optical flow stream into the proposed module to achieve higher accuracy.
2) Other Scales of the Temporal Shift Block: In addition to the one-unit and two-unit shift blocks, we also experimented with other scales, as shown in Figure 9. Figure 9(a) illustrates a variation of the two-unit shift, which pads the features of the next frame into the vacancy bi-directionally. Figure 9(b) is the schematic of an MSTSM with three scales, ranging from a one-unit shift to a three-unit shift. Table 3 compares these variants with our two-scale MSTSM. From the experimental results shown in Table 3, the value of the temporal receptive field is not a fixed number because there are boundary cases. Furthermore, as the temporal receptive field increased, the corresponding accuracy did not necessarily increase. This is because some input videos have a large number of frames, causing the interval between our sampled frames to be so large that distant frames provided less relevant features. Therefore, we adopted the two-scale MSTSM as our proposed module.
5) Effect of Minimizing $L_{\mathrm{TFDEM}}$: Although subtracting the feature maps of different frames can highlight the difference between them, our TFDEM can still extract features inaccurately. Hence, we also minimized the loss value of the TFDEM path to update the weights in it. We present the comparison between the performance of the TFDEM with and without minimizing this loss value in Table 4.

Comparison with Other Works
We evaluated our framework on the UCF-101 and HMDB51 datasets and compared its performance with other methods in this section. As shown in Table 5, the proposed method outperformed the others while using only the RGB modality and eight frames.

Conclusion
In this study, we designed an action recognition framework based on a 2D ConvNet. When an RGB image alone is used as the 2D ConvNet input, there is no information regarding the temporal relationship. To expand the temporal receptive fields without increasing the number of parameters, we proposed the MSTSM, which uses shifts of multiple scales to learn the features of other frames. We also proposed the TFDEM to avoid mispredictions in the case of similar actions. Further, our pruning method made it possible to filter out similar kernels and obtain a compact model. Experimental results show that both the MSTSM and TFDEM are effective and that our modules can be effortlessly applied to other backbones. With both proposed modules, we achieved an accuracy of 96.25% on the UCF-101 dataset, a 1.1% improvement over I3D-LSTM [24]. On the HMDB51 dataset, we achieved 72.83% accuracy, an improvement of 1.8% over LVR [39]. After compression, the number of parameters was reduced by approximately 2M, while accuracies of 95.57% and 72.19% were retained on the UCF-101 and HMDB51 datasets, respectively.