Target Tracking and Classification Using Compressive Measurements of MWIR and LWIR Coded Aperture Cameras
Journal of Signal and Information Processing

The Pixel-wise Code Exposure (PCE) camera is a type of compressive sensing camera with low power consumption and a high compression ratio. Moreover, a PCE camera can control the exposure time of individual pixels, which enables high dynamic range. Conventional approaches to using a PCE camera involve a time-consuming and lossy process to reconstruct the original frames, which are then used for target tracking and classification. In this paper, we present a deep learning approach that performs target tracking and classification directly in the compressive measurement domain without any frame reconstruction. Our approach has two parts: tracking and classification. Tracking is done using YOLO (You Only Look Once) and classification is achieved using a Residual Network (ResNet). Extensive experiments using mid-wave infrared (MWIR) and long-wave infrared (LWIR) videos demonstrated the efficacy of our proposed approach.


Introduction
There are many applications, such as traffic monitoring, surveillance, and security monitoring, that use optical and infrared videos [1]-[5]. Object features in optical and infrared videos can be seen clearly, in contrast to radar imagery. Conventional approaches to using compressive measurements first reconstruct the original frames using L0 [13] [14] [15] or L1 [16] sparsity-based algorithms. One problem with this reconstruction-based approach is that reconstructing the original frames is extremely time consuming, which may prohibit real-time applications.
Moreover, information may be lost in the reconstruction process [17]. For target tracking and classification applications, it will be ideal if one can carry out target tracking and classification directly in the compressive measurement domain.
Although there are some tracking papers [18] in the literature that appear to be using compressive measurements, they are actually still using the original video frames for tracking.
In our earlier paper [19], we presented a deep learning approach that directly incorporates the PCE measurements. In that work, we focused only on shortwave infrared (SWIR) videos. It is well-known that there are several key differences between SWIR, MWIR, and LWIR videos. First, SWIR cameras require external illuminations whereas MWIR and LWIR do not need external illumination sources because MWIR and LWIR are sensitive to heat radiation from objects. Second, the image characteristics are very different. Target shadows can affect the target detection performance in SWIR videos. However, there are no shadows in MWIR and LWIR videos. Third, atmospheric obscurants cause much less scattering in the MWIR and LWIR bands than in the SWIR band.
Consequently, MWIR and LWIR cameras are tolerant of smoke, dust and fog.
Because of the different characteristics of SWIR, MWIR, and LWIR videos, it is necessary to study the performance of the previously proposed deep learning approach [19] on MWIR and LWIR videos. In this paper, we propose a target tracking and classification approach in the compressive measurement domain for MWIR and LWIR images. First, a YOLO detector [20] is used for target tracking. This is called tracking by detection. The training of the YOLO tracker is very simple, requiring only image frames with known target locations. Although YOLO can also perform classification, its performance is not good because we have a very limited number of video frames for training. As a result, in the second step, target classification, we decided to use ResNet [21]. We chose ResNet because it allows us to perform customized training by augmenting the data from the limited video frames. Our proposed approach was demonstrated using MWIR and LWIR videos with about 3000 frames in each video.
The tracking and classification results are reasonable. This is a big improvement over conventional trackers [22] [23], which do not work well in the compressive measurement domain. This paper is organized as follows. In Section 2, we describe some background materials, including the PCE camera, YOLO, ResNet, video data, and performance metrics. In Section 3, we summarize the tracking and classification results using MWIR and LWIR videos. Finally, we conclude our paper with some remarks for future research.

PCE Imaging and Coded Aperture
In this paper, we employ a sensing scheme based on PCE, also known as Coded Aperture (CA) video frames, as described in [12]. Figure 1 illustrates the differences between a conventional video sensing scheme and PCE, where random spatial pixel activation is combined with a fixed temporal exposure duration. First, conventional cameras capture frames at a certain frame rate, such as 30 frames per second. In contrast, a PCE camera captures a single compressed frame, called a motion coded image, over a fixed period of time (Tv). For example, a user can compress 30 conventional frames into a single motion coded frame, which yields a significant data compression ratio. Second, the PCE camera allows a user to apply different exposure times at different pixel locations: longer exposures can be used for low-light regions and shorter exposures for strongly lit areas. This allows high dynamic range. Moreover, power can also be saved via a low sampling rate in the data acquisition process. As shown in Figure 1, one conventional approach to using the motion coded images is to apply sparse reconstruction to recover the original frames, a process that may be very time consuming.

Figure 1. Conventional camera vs. Pixel-wise Coded Exposure (PCE) Compressed Image/Video Sensor [12].

Suppose the video scene is contained in a data cube X ∈ R^(M×N×T), where M × N is the image size and T is the number of frames. A sensing data cube S ∈ R^(M×N×T) contains the exposure times for the pixel located at (m, n, t).
The value of S(m, n, t) is 1 for frames t ∈ [t_start, t_end] and 0 otherwise, where [t_start, t_end] denotes the start and end frame numbers for a particular pixel.
The measured coded aperture image Y is obtained by summing the exposed pixels over time, Y(m, n) = Σ_t S(m, n, t)·X(m, n, t), where X(m, n, t) denotes the original video scene. The original video scene can be reconstructed from Y via sparsity methods (L1 or L0). Details can be found in [12].
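As a concrete sketch, the coded measurement can be simulated in a few lines of NumPy. The 4 × 4 image size, 30 frames, and 4-frame exposure below are illustrative values chosen for this example, not the actual sensor parameters from [12]:

```python
import numpy as np

def pce_measure(X, S):
    """Collapse a video cube X (M x N x T) into a single coded image
    by summing each pixel over the frames where its exposure mask S is 1."""
    assert X.shape == S.shape
    return (X * S).sum(axis=2)

# Toy example: 4x4 video, 30 frames, every pixel exposed for 4 frames
# starting at a random offset (hypothetical parameters for illustration).
rng = np.random.default_rng(0)
M, N, T, L = 4, 4, 30, 4
X = rng.random((M, N, T))
S = np.zeros((M, N, T))
for m in range(M):
    for n in range(N):
        t0 = rng.integers(0, T - L + 1)
        S[m, n, t0:t0 + L] = 1.0

Y = pce_measure(X, S)  # one coded frame replaces 30 original frames
print(Y.shape)         # (4, 4)
```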
Instead of doing sparse reconstruction on PCE images or frames, our scheme directly acts on the PCE or Coded Aperture images, which contain raw sensing measurements, without any reconstruction effort. Utilizing raw measurements has several challenges. First, moving targets may be smeared if the exposure times are long. Second, there are missing pixels in the raw measurements because not all pixels are activated during the data collection process. Third, there are far fewer frames in the raw video because many original frames are compressed into a single coded frame. Consequently, training data may be scarce.
In this study, we have focused our effort on simulating the measurements that would be produced by the PCE-based compressive sensing (CS) sensor. We then proceed to show that detecting, tracking, and even classifying moving objects of interest in the scene is feasible. We carried out multiple experiments with three diverse sensing models: PCE/CA Full, PCE/CA 50%, and PCE/CA 25%.
PCE full refers to the compression of 30 frames to 1 with no missing pixels. PCE 50 is the case where we compress 30 frames to 1 and at the same time, only 50% of pixels are activated for a length of 4/30 seconds. PCE 25 is similar to PCE 50 except that only 25% of the pixels are activated for 4/30 seconds. Table 1 below summarizes the comparison between the three sensing models.
Details can be found in [19].
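A minimal sketch of how the three sensing masks could be simulated follows. The 64 × 64 size and the random placement of contiguous 4-frame exposures are our assumptions for illustration; the actual sensor design is described in [12] and [19]:

```python
import numpy as np

def make_sensing_cube(M, N, T, active_frac, exposure_len, rng):
    """Build a random sensing cube S: a fraction of pixels are activated,
    each exposed for a contiguous run of `exposure_len` frames."""
    S = np.zeros((M, N, T))
    n_active = int(round(active_frac * M * N))
    chosen = rng.choice(M * N, size=n_active, replace=False)
    for idx in chosen:
        m, n = divmod(idx, N)
        t0 = rng.integers(0, T - exposure_len + 1)
        S[m, n, t0:t0 + exposure_len] = 1.0
    return S

rng = np.random.default_rng(1)
# PCE full: all pixels exposed for all 30 frames.
# PCE 50 / PCE 25: 50% / 25% of pixels exposed for 4 frames (4/30 s at 30 fps).
S_full = np.ones((64, 64, 30))
S_50 = make_sensing_cube(64, 64, 30, 0.50, 4, rng)
S_25 = make_sensing_cube(64, 64, 30, 0.25, 4, rng)
print(S_50.any(axis=2).mean())  # fraction of activated pixels: 0.5
```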

YOLO
Strictly speaking, YOLO is a detector rather than a tracker; here, tracking is done via detection. That is, we apply YOLO to detect multiple targets and extract the target locations in every frame. Collecting the location information across frames then creates target trajectories. YOLO [20] is fast and has performance similar to Faster R-CNN [24]. We picked YOLO because it is easy to install and is compatible with our hardware, on which Faster R-CNN proved difficult to install and run. The training of YOLO is quite simple: only images with ground-truth target locations are needed.
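The paper does not detail how detections are linked across frames. One simple, hypothetical way to assemble trajectories from per-frame detections is greedy nearest-centroid association; the 50-pixel gating distance below is an assumed parameter:

```python
import math

def associate(tracks, detections, max_dist=50.0):
    """tracks: dict id -> last (x, y) center; detections: list of (x, y).
    Greedily matches each existing track to its nearest detection within
    max_dist; unmatched detections start new tracks."""
    next_id = max(tracks, default=-1) + 1
    unmatched = list(detections)
    for tid, (px, py) in list(tracks.items()):
        if not unmatched:
            break
        d, best = min(
            (math.hypot(x - px, y - py), (x, y)) for (x, y) in unmatched
        )
        if d <= max_dist:
            tracks[tid] = best
            unmatched.remove(best)
    for det in unmatched:
        tracks[next_id] = det
        next_id += 1
    return tracks

tracks = {}
for frame_dets in [[(10, 10)], [(14, 12)], [(18, 15), (100, 100)]]:
    tracks = associate(tracks, frame_dets)
print(tracks)  # track 0 follows the moving target; track 1 is new
```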
YOLO has 24 convolutional layers followed by 2 fully connected layers; details can be found in [20]. The input images are resized to 448 × 448. YOLO has some built-in capability to deal with different target sizes and illuminations. However, it was found that histogram matching is essential to make the tracker more robust to illumination changes.
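The histogram matching implementation is not specified in the paper; a minimal CDF-based (histogram specification) sketch in NumPy, which maps a frame's gray-level distribution onto that of a reference frame, could look like this:

```python
import numpy as np

def match_histogram(src, ref):
    """Map the gray levels of `src` so its distribution matches `ref`
    (classic CDF-based histogram specification via quantile mapping)."""
    s_vals, s_idx, s_cnt = np.unique(src.ravel(), return_inverse=True,
                                     return_counts=True)
    r_vals, r_cnt = np.unique(ref.ravel(), return_counts=True)
    s_cdf = np.cumsum(s_cnt) / src.size
    r_cdf = np.cumsum(r_cnt) / ref.size
    matched = np.interp(s_cdf, r_cdf, r_vals)  # quantile mapping
    return matched[s_idx].reshape(src.shape)

rng = np.random.default_rng(2)
dark = rng.normal(60, 10, (32, 32))     # dim frame
bright = rng.normal(160, 25, (32, 32))  # reference illumination
out = match_histogram(dark, bright)
# the matched frame's statistics are now close to the reference's
print(out.mean(), bright.mean())
```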
YOLO also comes with a classification module. However, based on our evaluations, the classification accuracy of YOLO is not as good as that of ResNet (see Section 3). This is perhaps due to a lack of training data.

ResNet Classifier
The ResNet-18 model is an 18-layer convolutional neural network (CNN) that has the advantage of avoiding performance saturation and/or degradation when training deeper layers, which is a common problem among other CNN architectures. The ResNet-18 model avoids the performance saturation by implementing an identity shortcut connection, which skips one or more layers and learns the residual mapping of the layer rather than the original mapping.
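The identity shortcut can be illustrated with a toy fully connected residual block (a simplification of the convolutional blocks in ResNet-18). With zero weights the block passes its input through unchanged apart from the final ReLU, which is the intuition behind why stacking such blocks does not degrade an already-good shallower model:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Toy fully connected residual block: the shortcut adds the input
    back, so the weighted layers only need to learn the residual
    F(x) = H(x) - x rather than the full mapping H(x)."""
    return relu(x + W2 @ relu(W1 @ x))

# With zero weights the residual branch contributes nothing, so the
# block acts as (a ReLU of) the identity.
x = np.array([1.0, -2.0, 3.0])
W_zero = np.zeros((3, 3))
print(residual_block(x, W_zero, W_zero))  # [1. 0. 3.]
```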
Training of ResNet requires target patches. The targets are cropped from the training videos and mirror images are created. We then perform data augmentation using scaling (larger and smaller), rotation (every 45 degrees), and illumination changes (brighter and dimmer) to create more training data. For each cropped target, we are able to create 64 additional images.
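A sketch of such an augmentation pipeline follows. The brightness gains of 0.8 and 1.25 are our assumptions, and the scaling step is omitted for brevity, so this toy version yields 48 variants per crop rather than the paper's 64:

```python
import numpy as np
from scipy.ndimage import rotate

def augment(patch):
    """Expand one cropped target (values in [0, 1]) into many variants:
    mirroring, rotation every 45 degrees, and brightness changes.
    (Scaling, used in the paper, is omitted in this sketch.)"""
    out = []
    for img in (patch, np.fliplr(patch)):       # original + mirror
        for angle in range(0, 360, 45):         # 8 rotations
            r = rotate(img, angle, reshape=False, mode="nearest")
            for gain in (0.8, 1.0, 1.25):       # dimmer / as-is / brighter
                out.append(np.clip(r * gain, 0.0, 1.0))
    return out

patch = np.random.default_rng(3).random((16, 16))
variants = augment(patch)
print(len(variants))  # 2 * 8 * 3 = 48 variants per crop
```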

Data
We have mid-wave infrared (MWIR) and long-wave infrared (LWIR) videos from our sponsor. There are two videos from each imager: Video 4 and Video 5. Vehicles in Video 4 start from a parking lot and then travel to a remote location. Video 5 is just the opposite. Each frame contains up to three vehicles (Ram, Silverado, and Frontier), which are shown below in Figure 2.
Tracking and classifying targets in the above videos is challenging for several reasons. First, the target orientation changes from the top view to side views. Second, the target size varies a lot across frames. Third, the illumination also varies. Fourth, the vehicles look very similar to one another, as can be seen in Figure 2.
Here, we also briefly mention the image characteristics of SWIR, MWIR, and LWIR. From Figure 3 [25], one can see the bands are different: SWIR lies in the range of 0.9 to 1.7 microns, MWIR in the range of 3 to 5 microns, and LWIR within the range of 8 to 14 microns. Because of those different wavelength ranges, the image characteristics are very different, as can be seen in Figure 4 and Figure 5. The daytime and nighttime behaviors are also different.

Performance Metrics
We used the following metrics for evaluating the YOLO tracker performance: center location error (CLE), distance precision (DP), EinGT, and the number of frames with detection. For classification, we used the confusion matrix and classification accuracy as performance metrics.
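CLE and DP can be computed as follows. The 20-pixel DP threshold is a common convention in the tracking literature, not necessarily the value used in this paper:

```python
import numpy as np

def cle(est, gt):
    """Center location error: Euclidean distance between estimated and
    ground-truth target centers, averaged over the frames."""
    est, gt = np.asarray(est, float), np.asarray(gt, float)
    return np.linalg.norm(est - gt, axis=1).mean()

def distance_precision(est, gt, thresh=20.0):
    """Distance precision: fraction of frames whose center location
    error is within `thresh` pixels."""
    est, gt = np.asarray(est, float), np.asarray(gt, float)
    d = np.linalg.norm(est - gt, axis=1)
    return (d <= thresh).mean()

gt = [(100, 100), (110, 105), (120, 110)]
est = [(102, 101), (140, 130), (121, 111)]
print(cle(est, gt))                 # mean center error in pixels
print(distance_precision(est, gt))  # fraction of frames within 20 px
```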

Tracking and Classification Results Using MWIR Videos
In a companion paper [19], we applied the YOLO + ResNet framework to SWIR videos directly in the compressive measurement domain. Since the image characteristics of SWIR, MWIR, and LWIR are very different, it is necessary to carry out a new study of the deep learning-based framework in [19]. Here, we focus on tracking and classification using a combination of YOLO and ResNet for MWIR and LWIR videos. There are three sensing cases: PCE full, PCE 50, and PCE 25.
We have two MWIR videos, each with close to 3000 frames. One video (Video 4) starts with the vehicles (Ram, Frontier, and Silverado) leaving a parking lot and moving to a remote location. The other video (Video 5) is just the opposite. In addition to the aforementioned challenges, the two videos are difficult for tracking and classification because the cameras also move in order to follow the targets.

Conventional Tracker Results
We first present some tracking results using a conventional tracker known as STAPLE [22]. STAPLE requires the target location to be known in the first frame. After that, STAPLE learns the target model online and tracks the target.
However, even in the PCE full case, as shown in Figure 6 for MWIR videos and in Figure 7 for LWIR videos, STAPLE was not able to track any targets in subsequent frames. This shows the difficulty of target tracking using PCE cameras. The YOLO tracking results are shown in Figures 8-10, where more incorrect labels can be seen in the high compression cases. It should be noted that the labels came from the YOLO tracker, whose classification performance is inferior to that of ResNet. We will see more classification results in later sections.

Classification Results
Here, we applied two classifiers: YOLO and ResNet. The results in Tables 11-13 are similar to the earlier case; that is, ResNet is better than YOLO, and classification performance drops at high compression rates.

Tracking and Classification Results Using LWIR Videos
Here, we summarize the studies for LWIR videos.

Tracking Results
LWIR: train using Video 4 and test using Video 5. For the PCE full case (Table 14), the CLE, DP, and EinGT metrics all look normal. The numbers of frames with detection are lower than those of MWIR, and Frontier has more detections than Ram and Silverado. For the PCE 50 (Table 15) and PCE 25 (Table 16) cases, the tracking performance degrades further as the compression increases.

Conclusions
In this paper, we presented a high-performance approach to target tracking and classification directly in the compressive sensing domain for MWIR and LWIR videos. Skipping the time-consuming reconstruction step allows real-time target tracking and classification. The proposed approach is based on a combination of two deep learning schemes: YOLO for tracking and ResNet for classification. The approach is suitable for applications where limited training data are available. Experiments using MWIR and LWIR videos clearly demonstrated the performance of the proposed approach. One key observation is that MWIR yields better tracking and classification performance than LWIR. Another is that ResNet performs much better than the built-in classifier in YOLO.
One potential direction is to integrate our proposed approach with real hardware to perform real-time target tracking and classification directly in the compressive sensing domain.