Deep Learning Based Target Tracking and Classification for Infrared Videos Using Compressive Measurements

Although compressive measurements save data storage and bandwidth, they are difficult to use directly for target tracking and classification without pixel reconstruction. This is because the Gaussian random matrix destroys the target location information in the original video frames. This paper summarizes our research effort on target tracking and classification directly in the compressive measurement domain. We focus on one particular type of compressive measurement based on pixel subsampling; that is, the original pixels in video frames are randomly subsampled. Even in this special compressive sensing setting, conventional trackers do not perform satisfactorily. We propose a deep learning approach that integrates YOLO (You Only Look Once) and ResNet (residual network) for multiple target tracking and classification: YOLO is used for multiple target tracking, and ResNet is used for target classification. Extensive experiments using short-wave infrared (SWIR), mid-wave infrared (MWIR), and long-wave infrared (LWIR) videos demonstrated the efficacy of the proposed approach even though the training data are very scarce.


Introduction
There are many applications, such as traffic monitoring, surveillance, and security monitoring, that use optical and infrared videos [1]-[6]. Object features can be seen much more clearly in optical and infrared videos than with radar-based trackers [7] [8].
Compressive measurements [9] [10] are normally collected by multiplying the original vectorized image with a Gaussian random matrix. Each measurement is a scalar value, and the measurement is repeated M times, where M is much smaller than N (the number of pixels). To track a target using compressive measurements, one normally reconstructs the image scene first and then applies conventional trackers. There are two drawbacks to this conventional approach. First, the reconstruction process using L0 [11] or L1 [12] [13] [14] based methods is time consuming, which makes real-time tracking and classification impossible. Second, there may be information loss in the reconstruction process [15].
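The measurement process described above can be sketched as follows. This is a minimal illustration in pure Python, not the authors' implementation: each of the M measurements is the inner product of one Gaussian random row with the vectorized image, so every pixel contributes to every scalar and the target's spatial position cannot be read off from the measurements.

```python
import random

def gaussian_measurements(x, M, seed=0):
    """Collect M compressive measurements y = Phi @ x, where Phi is an
    M x N Gaussian random matrix and x is the vectorized image (M << N)."""
    rng = random.Random(seed)
    N = len(x)
    y = []
    for _ in range(M):
        row = [rng.gauss(0.0, 1.0) for _ in range(N)]  # one random row of Phi
        y.append(sum(r * v for r, v in zip(row, x)))   # one scalar measurement
    return y

# A toy 4x4 "image" flattened to N = 16 pixels, compressed to M = 4 scalars.
x = [float(i) for i in range(16)]
y = gaussian_measurements(x, M=4)
print(len(y))  # 4 measurements instead of 16 pixels
```

Because each scalar in `y` mixes all N pixels, locating a target requires first solving the (time-consuming) inverse problem.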
In the literature, there are some trackers, such as [23], that use the term compressive tracking. However, those trackers do not use compressive measurements directly. There are several advantages to performing target tracking and classification directly on compressive measurements. First, because reconstruction of video frames from compressive measurements using Orthogonal Matching Pursuit (OMP) or the Augmented Lagrangian Method with L1 (ALM-L1) is time consuming, direct tracking and classification in the compressive measurement domain enables near real-time processing. Second, it is well known that reconstruction tends to lose information [15]. Working directly with compressive measurements therefore generates more accurate tracking and classification results [15]-[22].
Recently, we developed a residual network (ResNet) [24] based tracking and classification framework using compressive measurements [10]. The compressive measurements are obtained by using pixel subsampling, which can be considered as a special case of compressive sensing. ResNet was used in both target detection and classification. The tracking is done by detection. Although the performance in [10] is much better than conventional trackers, there is still room for further improvement. The key area is to improve the tracking part, which has a significant impact on the classification performance. That is, if the target area is not correctly located, the classification performance will degrade.
In this paper, we propose an alternative approach that aims to improve the tracking performance. The idea is to deploy a high-performance detector known as YOLO [25] for target tracking. YOLO is fast, accurate, and has performance comparable to other detectors such as Faster R-CNN [26]. It should be noted that YOLO is designed for object detection, not object tracking; we use it for tracking by detection. That is, we custom train YOLO to detect certain vehicles, and the detection results (target location information) from each frame are recorded and then tracked. The detection results (bounding boxes of objects) are fed into a classifier. We use ResNet for classification because it performs better than the default classifier in YOLO.
Journal of Signal and Information Processing
It is emphasized that a preliminary version of this paper was presented at an SPIE conference [27], in which we focused only on SWIR videos. Here, we have significantly expanded the earlier paper to include additional experiments using MWIR and LWIR videos. The experiments clearly demonstrate that the proposed approach is accurate and applicable to different types of infrared videos. Moreover, another contribution of this paper is that our study is the first comprehensive study of vehicle tracking and classification for several types of infrared videos directly in the compressive measurement domain (subsampling).
This paper is organized as follows. Section 2 describes the idea of compressive sensing via subsampling, YOLO detector, and ResNet. Section 3 presents the tracking and classification results directly in the compressive measurement domain using SWIR videos. Section 4 focuses on tracking and classification of vehicles in MWIR videos. Section 5 repeats the studies for LWIR videos. In all cases, a comparative study of YOLO and ResNet for classification is also presented.
Finally, some concluding remarks and future research directions are included in Section 6.

Compressive Sensing via Subsampling
Using a Gaussian random matrix to generate compressive measurements makes target tracking very difficult, because the targets can be anywhere in a frame and the target location information is lost in the measurements. To resolve this issue, we propose a new approach in which, instead of a Gaussian random sensing matrix, we use a random subsampling operator (i.e., keeping only a certain percentage of pixels at random from the original data) to perform compressive sensing. This is similar to using a sensing matrix obtained by randomly zeroing out certain elements of the diagonal of an identity matrix.

Figure 1. (a) Visualization of the sensing matrix for a random subsampling operator with a compression factor of 2. The subsampling operator is applied to a vectorized image. This is equivalent to applying the random mask shown in (b) to an image.
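A minimal sketch of the subsampling operator follows. Unlike the Gaussian case, kept pixels remain at their original positions, so the target's location survives compression; the mask corresponds to an identity sensing matrix with random diagonal entries zeroed.

```python
import random

def subsample_mask(n_pixels, keep_ratio, seed=0):
    """Random subsampling operator: keep a random fraction of pixel indices.
    Equivalent to an identity sensing matrix with randomly zeroed
    diagonal entries."""
    rng = random.Random(seed)
    kept = rng.sample(range(n_pixels), int(n_pixels * keep_ratio))
    mask = [0] * n_pixels
    for i in kept:
        mask[i] = 1
    return mask

def apply_mask(frame, mask):
    """Missing pixels are zeroed; kept pixels stay in place, so target
    location information is preserved."""
    return [v if m else 0 for v, m in zip(frame, mask)]

frame = list(range(1, 17))                  # a vectorized 4x4 frame
mask = subsample_mask(len(frame), 0.25)     # the 75% missing case
measured = apply_mask(frame, mask)
assert sum(mask) == 4                       # only 25% of pixels retained
```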

YOLO
We used the so-called tracking-by-detection approach. In the target tracking literature, there are several ways to carry out tracking. Some trackers, such as STAPLE [28] or GMM [29], require an operator to put a bounding box on a specific target, and the tracker then tries to follow this initial target in subsequent frames. The limitation of this type of tracker is that it can only follow one target at a time and hence cannot handle multiple targets simultaneously. Other trackers, such as YOLO and Faster R-CNN, do not require initial bounding boxes and can detect multiple objects simultaneously. We call this second type tracking by detection: based on the detection results, we determine the vehicle locations in all frames.
The YOLO tracker [25] is fast and has performance similar to Faster R-CNN [26].
We picked YOLO because it is easy to install and compatible with our hardware, whereas we had a hard time installing and running Faster R-CNN. YOLO also comes with a built-in classification module. However, based on our evaluations, the classification accuracy of YOLO is not good, as can be seen in Sections 3-5. This is perhaps due to a lack of training data.
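The tracking-by-detection idea can be sketched as follows. This is an illustrative association scheme only (greedy nearest-centroid linking), not the paper's actual YOLO pipeline: per-frame detections, here hypothetical (x, y, w, h) boxes, are linked across frames to form tracks.

```python
def centroid(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def associate(tracks, detections, max_dist=50.0):
    """Greedily link each detection to the nearest existing track
    (by centroid distance); unmatched detections start new tracks."""
    for det in detections:
        cx, cy = centroid(det)
        best, best_d = None, max_dist
        for tid, boxes in tracks.items():
            px, py = centroid(boxes[-1])
            d = ((cx - px) ** 2 + (cy - py) ** 2) ** 0.5
            if d < best_d:
                best, best_d = tid, d
        if best is None:
            best = len(tracks)        # start a new track
            tracks[best] = []
        tracks[best].append(det)
    return tracks

# Two frames of hypothetical detector output: two vehicles moving slightly.
frames = [[(10, 10, 20, 20), (200, 50, 30, 20)],
          [(14, 12, 20, 20), (205, 48, 30, 20)]]
tracks = {}
for dets in frames:
    associate(tracks, dets)
assert len(tracks) == 2               # two vehicles, two tracks
```

A greedy scheme like this can mis-assign crossing targets; it is only meant to show how detections become tracks.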

ResNet Classifier
The ResNet-18 model is an 18-layer convolutional neural network (CNN) that avoids the performance saturation and/or degradation that commonly occur in other CNN architectures as deeper layers are trained. It does so by implementing an identity shortcut connection, which skips one or more layers and learns the residual mapping of the layer rather than the original mapping.
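The identity shortcut can be illustrated with a toy numeric sketch (not the actual convolutional block): the block outputs F(x) + x, so when the learned residual F is zero the block reduces to the identity map, which is what makes very deep stacks trainable.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def layer(v, w):
    """Toy stand-in for a conv layer: elementwise scaling by weight w."""
    return [w * x for x in v]

def residual_block(x, w1, w2):
    """Identity shortcut: the branch learns the residual F(x) and the
    block outputs F(x) + x."""
    fx = layer(relu(layer(x, w1)), w2)
    return [a + b for a, b in zip(fx, x)]

x = [1.0, -2.0, 3.0]
# With zero weights the residual branch vanishes and the block
# passes its input through unchanged.
assert residual_block(x, 0.0, 0.0) == x
```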
Training ResNet requires target patches. The targets are cropped from training videos, and mirror images are then created. We then perform data augmentation using scaling (larger and smaller), rotation (every 45 degrees), and illumination change (brighter and dimmer) to create more training data. For each cropped target, we are able to create a data set of 64 additional images.
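One factorization consistent with the 64 variants per crop is 2 mirror states x 8 rotations (every 45 degrees) x 2 scales x 2 illumination levels; this grid is an assumption that reproduces the stated count, not the authors' exact recipe.

```python
from itertools import product

# Augmentation parameter grid (assumed factorization yielding 64).
mirrors = ["original", "mirrored"]           # 2
rotations = [k * 45 for k in range(8)]       # every 45 degrees: 8
scales = ["larger", "smaller"]               # 2
illuminations = ["brighter", "dimmer"]       # 2

variants = list(product(mirrors, rotations, scales, illuminations))
print(len(variants))  # 2 * 8 * 2 * 2 = 64 augmented images per crop
```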

Tracking and Classification Results Using SWIR Videos
Our research objective is to perform tracking and classification of three trucks using the sponsor-provided SWIR videos. One video (Video 4) starts with the vehicles (Ram, Frontier, and Silverado) leaving a parking lot and driving to a remote location. Another video (Video 5) is just the opposite. These videos are challenging for several reasons. First, the target sizes vary a lot from near field to far field. Second, the target orientations change drastically from top view to side view. Third, the illumination differs between videos. Here, the compressive measurements are collected via direct subsampling; that is, 50% or 75% of the pixels are thrown away during the data collection process.
In our earlier paper [10], we included some tracking results where conventional trackers such as GMM [29] and STAPLE [28] were used. The tracking performance was poor when there were missing data.

Tracking Results
We experimented with a YOLO tracker, which we found to perform better tracking than our earlier ResNet-based tracker [10]. We used the following metrics for evaluating tracker performance: the center location error (CLE), the distance precision (DP), and EinGT.
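A sketch of the tracking metrics referenced later in the paper (CLE, DP, EinGT) is given below. The precise definitions used by the authors are not restated in this section, so these are standard interpretations offered as assumptions: CLE as the center-to-center distance, DP as the fraction of frames with CLE below a threshold, and EinGT as whether the estimated center falls inside the ground-truth box.

```python
def center(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def cle(est, gt):
    """Center location error: Euclidean distance between box centers."""
    (ex, ey), (gx, gy) = center(est), center(gt)
    return ((ex - gx) ** 2 + (ey - gy) ** 2) ** 0.5

def dp(estimates, gts, thresh=20.0):
    """Distance precision: fraction of frames with CLE below a threshold
    (threshold value here is illustrative)."""
    hits = sum(1 for e, g in zip(estimates, gts) if cle(e, g) < thresh)
    return hits / len(gts)

def ein_gt(est, gt):
    """EinGT (assumed): estimated center lies inside the ground-truth box."""
    ex, ey = center(est)
    x, y, w, h = gt
    return x <= ex <= x + w and y <= ey <= y + h

est, gt = (12, 10, 20, 20), (10, 10, 20, 20)
print(cle(est, gt))        # 2.0
print(ein_gt(est, gt))     # True
```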

Conventional Tracker Results
We applied the GMM tracker to one of our videos. From the results shown in Figure 2, it can be seen that the tracking results are not satisfactory even when there are no missing pixels; in some frames, the GMM tracker simply lost the targets. STAPLE [28] is one of the high-performing trackers of recent years. In this algorithm, histogram of oriented gradients (HOG) features are extracted from the most recent estimated target location and used to update the models of the tracker. A template response is then calculated using the updated models and the features extracted from the next frame. To estimate the location of the target, a histogram response is needed along with the template response. The histogram response is calculated by updating the weights in the current frame; the per-pixel score is then computed using the next frame. This score and the previously calculated weights are used to determine the integral image and, ultimately, the histogram response. Together, the template and histogram responses allow the tracker to estimate the location of the target.
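The final step of the STAPLE description above, combining the two response maps and picking the peak, can be sketched as follows. The convex-combination form and the merge factor `alpha` are illustrative assumptions, not STAPLE's exact parameterization.

```python
def fuse_responses(template, histogram, alpha=0.3):
    """STAPLE-style fusion (sketch): the final response map is a convex
    combination of the template (HOG correlation) response and the
    per-pixel colour-histogram response."""
    return [(1 - alpha) * t + alpha * h for t, h in zip(template, histogram)]

def estimate_location(response):
    """The target is placed at the peak of the fused response map."""
    return max(range(len(response)), key=lambda i: response[i])

template = [0.1, 0.9, 0.3]    # hypothetical per-location scores
histogram = [0.2, 0.6, 0.8]
fused = fuse_responses(template, histogram)
print(estimate_location(fused))  # 1: both cues agree on the middle location
```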

Classification Results
To illustrate the difficulty of classifying the three trucks, we include their pictures in Figure 11. From a distance, it is quite difficult to recognize them correctly. For vehicle classification, we deployed two approaches: YOLO and ResNet. YOLO comes with a default classifier. For the ResNet classifier, we performed customized training in which the training data were augmented with rotation, scaling, and illumination variations.
Classification Results Using Video 4 for Training and Video 5 for Testing
Classification is only applied to frames with detection of targets from the tracker. Tables 7-9 summarize the comparison between YOLO and ResNet classifiers for 0%, 50%, and 75% missing cases, respectively. We have two observations.

Discussions
We are interested in the tracking and classification performance in the 75% missing data case because only 25% of the pixels need to be stored and transmitted. At this missing rate, using the numbers shown in Table 13, the averaged percentages of frames being detected are 58% for testing using Video 5 and 82% for testing using Video 4, respectively. From Table 14, the averaged classification percentages are 60% for testing using Video 5 and 78% for testing using Video 4, respectively.

Tracking and Classification Results Using MWIR Videos
Similar to the SWIR case, we also have two MWIR videos from our sponsor. In Section 4.1, we present the conventional and our proposed tracking results. Section 4.2 shows the classification results.

Conventional Tracking Results
Here, we only include the STAPLE results because the GMM tracker did not work at all. STAPLE appears to work reasonably well for the 0% and 50% missing-rate cases (Figure 12 and Figure 13). When the missing rate increases to 75%, the STAPLE tracker fails completely, as shown in Figure 14. One issue with STAPLE is that it has difficulty tracking multiple vehicles simultaneously.
MWIR Results: Train Using Video 4 and Test Using Video 5
Here, we used Video 4 for training and Video 5 for testing. Tables 15-17 show the performance metrics. Our first observation is that the number of frames with detection decreases when we have more missing pixels. This is reasonable.

MWIR Results: Train Using Video 5 and Test Using Video 4
Tables 18-20 show the metrics when we used Video 5 for training and Video 4 for testing. We can see that the numbers of frames with detection are high for low missing rates. For frames with detection, the CLE values generally increase whereas the DP and EinGT values are relatively stable. Figures 18-20 show the tracking results visually. It can be seen that we have some false detections in the parking lot area. However, when the targets are far away, the tracking appears to

MWIR Classification Results Using Video 4 for Training and Video 5 for Testing
Classification is only applied to frames with detection of targets from the tracker. Tables 21-23 summarize the comparison between YOLO and ResNet classifiers for 0%, 50%, and 75% missing cases, respectively. We have two observations. First, the YOLO classifier outputs are worse than those of the ResNet.

Discussions
Similar to the SWIR study, we are interested in the tracking and classification performance in the 75% missing data case where one can have fewer pixels to save and transmit. At this missing rate, using the numbers shown in Table 27, the averaged percentages of frames being detected are 63% for testing using Video 5 and 60% for testing using Video 4, respectively. From Table 28, the

Tracking and Classification Results Using LWIR Videos
In this section, we summarize the tracking and classification results using LWIR videos.

Conventional Tracker Results
We first present tracking results using STAPLE. Similar to the SWIR and MWIR cases, STAPLE did not perform well for the various cases as shown in

Discussions
Similar to the SWIR study, we are interested in the tracking and classification performance in the 75% missing data case where one can have fewer pixels to save and transmit. At this missing rate, using the numbers shown in Table 41, the averaged percentages of frames being detected are 43% for testing using Video 5 and 16% for testing using Video 4, respectively. The detection percentages appear

Conclusions
We present a deep learning approach for multiple target tracking and classification using infrared videos (SWIR, MWIR, and LWIR) directly in the compressive measurement domain. Key advantages include fast processing without time-consuming reconstruction. One future direction is to integrate the proposed approach with video cameras and perform real-time tracking and classification.