Optimal Features Selection for Human Activity Recognition (HAR) System Using Deep Learning Architectures
1. Introduction
In recent years, the e-vision community has directed its attention toward human activity recognition, which is crucial for numerous applications of digital surveillance, including human-computer interaction [1], anti-terrorism [2], traffic surveillance [3], vehicle safety [4], pedestrian detection [5], video surveillance [6], real-time tracking [7], rescue operations [8], and human-robot interaction [9]. The efficient recognition of human activity in recorded videos is the focus of this study. Developing a cost-effective algorithm to recognize a person from a video or image is difficult owing to changes in appearance, color, and movement; variations in background and lighting complicate the task further. Numerous approaches, including feature extraction, segmentation strategies, and classifiers, have been developed to detect humans [10]. Unfortunately, the presence of several persons in a scene or image remains a challenge for current techniques, which may not always produce optimal results. Moreover, many methods are used to identify humans, including the Histogram of Oriented Gradients (HOG) [11], Haar-like features [12], adaptive contour features (ACF) [13], Hybrid Wind Farm (HWF) [14], Image Source Method (ISM) [15], edge detection [16], and movement characteristics [17]. When people appear unclear or show notable positional variations, these extraction techniques may fail to capture the details reliably. However, properly selecting pertinent features can greatly improve the ability to identify human activity. In this research, we propose a deep learning technique that addresses the accuracy challenges of human activity recognition by selecting optimal features. This is accomplished by enhancing the quality of the frames extracted from videos and then classifying the regions based on specified feature vectors. The proposed method comprises five main phases: (a) data normalization, (b) image feature extraction, (c) best feature selection, (d) feature fusion, and (e) classification of the target class. Several preprocessing steps, including background subtraction, noise reduction, and object extraction, are applied during the normalization stage. We extract three types of features: Histogram of Oriented Gradients (HOG), Gabor, and chromatic color features. Principal Component Analysis (PCA) is applied to each feature set to select the most informative components, and the selected features are then fused. Finally, five different classifiers are compared to identify the most accurate result.
Major Contributions
Ineffective and lengthy preprocessing procedures degrade both the efficiency and the accuracy of any algorithm. This work focuses on the efficient and accurate use of preprocessing and feature extraction steps. Thus, the main contributions of this work are as follows:
1) After removing the background, morphological operations are used to identify and define the specific area of interest precisely.
2) Feature subsets are selected through independent scoring based on principal components.
3) The optimal results are achieved by using a variety of classification methods.
2. Related Works
The following section provides a detailed examination of significant studies in the field of human activity recognition. In computer vision, action recognition, a subset of gesture recognition, is one of the primary fields of study [18]. Researchers have employed a variety of methods to develop action-recognition systems, such as artificial intelligence (AI), hand-crafted features combined with traditional machine learning algorithms, and diverse deep learning approaches [19]. Most existing human activity recognition (HAR) systems were built using traditional machine learning algorithms together with a range of manual feature extraction methods [20]. Conventional machine learning techniques for action recognition typically follow a three-step procedure. In the first phase, features are extracted using manually created descriptors. Then, a particular algorithm is used to encode these features. In the final stage, the encoded features are classified using a suitable machine-learning method [21]. Two distinct techniques are employed in different tasks: local feature-based approaches and global feature-based approaches. Local feature-based techniques mainly characterize features as separate patches, interest points, and gesture information; the learned cues relevant to the current task are aligned with these features. Global features, on the other hand, encompass the entire region of interest.
Table 1. List of literature review and their performances.
SN | Authors | Method(s) | Dataset | Accuracy (%)
1 | Simonyan et al. [21] | Two-Stream CNN | JHMDB | Not specified
2 | Wensel et al. [22] | ViT-ReT | JHMDB | 85.20
3 | Feichtenhofer et al. [23] | Spatial-Motion | UCF101 | Not specified
4 | Tu et al. [24] | Multi-Stream CNN | JHMDB | 71.17
5 | Gammulle et al. [25] | LSTM-based fusion | UCF Sports | 92.20
6 | Ijjina et al. [26] | Hybrid Technique | UCF11 | 69.00
7 | Meng et al. [27] | LSTM | Not specified | 93.20
8 | Xu et al. [28] | Deep Learning | Not specified | Not specified
9 | Najmul et al. [29] | BiLSTM | UCF11 | 85.30
10 | Riahi et al. [30] | Dilated CNN + BiLSTM + RB | UCF11 | 79.40
11 | Rama et al. [31] | Deep Learning Architecture | JHMDB | 67.24
12 | Gammulle et al. [32] | Two-Stream Long Short-Term Memory (LSTM) | JHMDB | 55.70
13 | Yang et al. [33] | Various DL Algorithms | JHMDB | 65.00
As noted in [22], background removal and tracking techniques are frequently used to accomplish this. Yasin et al. [23] presented a fundamental method for identifying actions in video sequences by utilizing keyframe selection and applying it to human activity recognition (HAR). Zhao et al. [24] presented an HAR system that leverages keyframes and employs conventional machine learning techniques for multi-feature fusion. Yala et al. [25] introduced a novel activity recognition system that relies on streaming data; the strategy shows a remarkable level of accuracy in recognizing important human activities. Nunes et al. [26] introduced a framework aimed at daily-life human activity recognition. Many features are first extracted by the suggested method; then, two successive automatically recognized key positions are placed around each human activity frame, from which the maximal static and dynamic features are retrieved. Kantorov and Laptev [27] demonstrated the application of Fisher vectors for feature encoding and effectively used linear classifiers to obtain accurate action identification. In [28], Lan et al. presented a process for improving the operational strategies of action recognition systems by switching from data-driven to data-independent approaches. While standard machine learning algorithms have made significant progress in the last 10 years, they still have limits imposed by human cognition: they are labor intensive, time consuming, and require difficult feature engineering [29]. Deep learning is a move toward more automated and adaptive techniques that can overcome the limitations of handcrafted features. Deep learning departs from the conventional three-step machine learning architecture by introducing a contemporary end-to-end framework, in which classification tasks are completed simultaneously with the learning and representation of highly discriminative visual features. Table 1 summarizes previous publications on human activity recognition (HAR) using deep learning techniques, along with their datasets and reported accuracies.
From this literature review, it is evident that deep learning models combined with effective feature extraction and classification techniques can improve the accuracy of human action recognition.
3. Proposed Method
The proposed method introduces a new technique for human activity recognition. This innovative approach involves five fundamental steps, which include: (a) identifying moving objects within the video sequence; (b) extracting the HOG, Gabor, and color features of the moving object; (c) selecting the most effective characteristics; (d) merging the selected features sequentially; and (e) classifying the moving object. Figure 1 illustrates the entire process of the proposed technique.
3.1. Data Preprocessing Steps
During the preprocessing step, a region-wise sliding window approach is implemented to account for variation between consecutive frames. This approach helps in ignoring unnecessary regions, such as the background. The result of this process is a binary image extracted through background subtraction, which is then subjected to a noise removal technique. To enhance the image, the binary image is first converted to RGB color, and the RGB image is then transformed into the Hue Saturation Intensity (HSI) format. The next stage draws a bounding box around a person to identify them. This step aims to improve the quality of the video-extracted frames by enhancing the foreground features. The detailed steps of the preprocessing stage are presented in Figure 2.
Figure 1. Proposed model block diagram.
Figure 2. Preprocessing stages: (i) original image data; (ii) background-subtracted images; (iii) image enhancement; (iv) object detection; (v) binary-to-RGB conversion; (vi) image cropping.
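The preprocessing chain described above can be illustrated with the following minimal sketch; this is an assumed OpenCV implementation (the function choices, kernel sizes, and the use of HSV as a stand-in for HSI are ours, not the paper's exact configuration).

import cv2

def preprocess_frame(frame_bgr, bg_subtractor):
    # (i)-(ii) background subtraction followed by morphological noise removal
    fg_mask = bg_subtractor.apply(frame_bgr)
    _, fg_mask = cv2.threshold(fg_mask, 127, 255, cv2.THRESH_BINARY)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, kernel)
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_CLOSE, kernel)
    # (iii)-(iv) locate the largest foreground blob and its bounding box
    contours, _ = cv2.findContours(fg_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    # (v)-(vi) colour-space conversion and cropping of the person region
    hsi_like = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)  # HSV used here as a stand-in for HSI
    return hsi_like[y:y + h, x:x + w]

# Usage sketch: bg = cv2.createBackgroundSubtractorMOG2(); roi = preprocess_frame(frame, bg)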
3.2. Feature Extraction
In this stage, three types of feature extractors, namely Histogram of Oriented Gradients (HOG), Gabor, and chromatic features, are used to analyze each frame. These three extractors are considered in our study because the goal is to accurately identify and classify various activities, and optimal feature selection is pivotal for creating models that are both accurate and practical for real-world deployment [34]. The resulting feature vectors have dimensions of 1 × 3780 for HOG, 1 × 60 for Gabor, and 1 × 9 for the co-occurrence matrices and chromatic features. These features capture aspects such as gradient distributions, textures, spatial relationships, and color characteristics in the frames, contributing to a comprehensive set for further analysis and classification.
3.2.1. HOG Features
For feature extraction using the HOG method, the image is initially divided into smaller segments, which are processed individually and then combined back together. To compute the directional gradients $G_x$ and $G_y$, the Sobel kernel function is applied to the processed images, and the gradient magnitude and orientation are obtained according to the following equations [35]:

$M(i,j)=\sqrt{G_x(i,j)^{2}+G_y(i,j)^{2}}$ (1)

$\theta(i,j)=\tan^{-1}\left(\frac{G_y(i,j)}{G_x(i,j)}\right)$ (2)

where $M$ denotes the magnitude, $\theta$ is the gradient angle, and $i$ and $j$ stand for rows and columns, respectively. Based on the gradient angle, the votes of each cell are divided into bins. Later, each block of the histogram is used to create the standardized vector.
Figure 3. The HOG features and their graphical representation.
The following equation represents the block normalization over the eight-bin cells used to implement the HOG feature descriptor on the segmented image:

$V'=\frac{V}{\sqrt{\lVert V\rVert_{2}^{2}+\epsilon^{2}}}$ (3)

where $V$ is the non-normalized vector that contains all of the histograms in a block, and $\epsilon$ is a small constant that prevents division by zero. A single block containing all of these vectors forms the HOG feature vector. Each feature's range and mean variance are also measured. Figure 3 shows the HOG features and their graphical representation.
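As a concrete illustration, OpenCV's default HOGDescriptor (64 × 128 window, 16 × 16 blocks, 8 × 8 cells, 9 bins) produces exactly the 1 × 3780 vector quoted in Section 3.2; the resizing step and parameter choices below are our assumptions, not the paper's exact setup.

import cv2

def hog_features(gray_roi):
    roi = cv2.resize(gray_roi, (64, 128))   # fixed detection window
    hog = cv2.HOGDescriptor()               # defaults: 9 bins, 8x8 cells, 16x16 blocks
    vec = hog.compute(roi).ravel()          # 3780-dimensional descriptor
    return vec, vec.mean(), vec.var()       # descriptor plus its mean and variance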
3.2.2. Gabor Features
The following equation illustrates how a complex sinusoidal wave modulates the Gaussian kernel of a 2D Gabor filter in the spatial domain:

$G(p,q)=\exp\left(-\frac{p'^{2}+\gamma^{2}q'^{2}}{2\sigma^{2}}\right)\exp\left(i\left(2\pi f_{s}p'+\phi\right)\right)$ (4)

with $p'=p\cos\theta+q\sin\theta$ and $q'=-p\sin\theta+q\cos\theta$. Here, $\gamma$ represents the spatial aspect ratio that defines the elliptical support of the Gabor function, $f_s$ indicates the sinusoidal frequency, $\theta$ represents the orientation of the band described by the Gabor filter, $\phi$ indicates the phase offset, and $\sigma$ indicates the standard deviation (SD) of the Gaussian envelope [36]. The Gabor filter bank is implemented at five scales and six orientations, giving a 1 × 30 bank of responses, and the mean and variance of each response are measured. Figure 3(c) shows the graphical representation of the Gabor features.
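A minimal sketch of such a filter bank is given below; the kernel size, wavelength, aspect ratio, and specific scale values are illustrative assumptions, while the five-scale, six-orientation layout and the per-response mean and variance follow the description above (yielding 5 × 6 × 2 = 60 values, matching the 1 × 60 Gabor vector of Section 3.2).

import cv2
import numpy as np

def gabor_features(gray_roi):
    feats = []
    img = gray_roi.astype(np.float32)
    for sigma in (1, 2, 3, 4, 5):                        # five scales
        for theta in np.arange(0, np.pi, np.pi / 6):     # six orientations
            kernel = cv2.getGaborKernel((21, 21), sigma, theta, 10.0, 0.5, 0)
            resp = cv2.filter2D(img, -1, kernel)
            feats.extend([resp.mean(), resp.var()])      # mean and variance per response
    return np.asarray(feats)                             # 60 values in total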
3.2.3. Chromatic Features
Because of its extensive use, this co-occurrence-based method has become a standard. Other studies have relied on a limited set of functions, including entropy (H), correlation (COR), energy (E), and local homogeneity (LH). To calculate the chromatic features, we use Equations (5)-(9) [37].
(5)
(6)
(7)
(8)
(9)
where $\mu_y$ and $\sigma_y$ are the vertical statistics (mean and variance), $\mu_x$ is the horizontal mean, and $\sigma_x^{2}$ is the variance of the co-occurrence matrix. The quantification of color information in the images is demonstrated in Figure 3(d).
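These co-occurrence statistics can be computed as sketched below with scikit-image; the distance, angle, and single-channel usage are our assumptions, and entropy is computed manually because graycoprops does not provide it (older scikit-image releases spell the functions greycomatrix/greycoprops).

import numpy as np
from skimage.feature import graycomatrix, graycoprops

def cooccurrence_features(gray_roi_uint8):
    glcm = graycomatrix(gray_roi_uint8, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    p = glcm[:, :, 0, 0]
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))           # H
    return np.array([entropy,
                     graycoprops(glcm, 'correlation')[0, 0],  # COR
                     graycoprops(glcm, 'energy')[0, 0],       # E
                     graycoprops(glcm, 'homogeneity')[0, 0]]) # LH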
3.3. Feature Selection
In the presented technique, Principal Component Analysis (PCA) is employed for feature selection. This allows the identification and selection of the most significant features from the outcomes of the various descriptors, namely the Histogram of Oriented Gradients (HOG), Gabor, and chromatic feature vectors. In general, the PCA method transforms a set of n vectors from a d-dimensional space to another space with d' dimensions [39], producing the resulting vectors $\tilde{x}_1, \ldots, \tilde{x}_n$ according to the following equation:

$\tilde{x}_{n}=\sum_{r=1}^{d'}a_{n,r}e_{r}$ (10)

where $e_r$ denotes the eigenvectors corresponding to the $d'$ greatest eigenvalues of the distribution, and $a_{n,r}$ represents the projections of the vectors $x_n$ onto those eigenvectors.
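As an illustrative sketch, this per-descriptor selection can be performed with scikit-learn's PCA; the library choice and the name d_prime (corresponding to d' in Equation (10)) are assumptions made for illustration only.

from sklearn.decomposition import PCA

def select_features(X, d_prime):
    # Project an (n_samples, d) feature matrix onto its d' leading principal components.
    pca = PCA(n_components=d_prime)
    return pca.fit_transform(X), pca

# Applied independently to the HOG, Gabor, and chromatic feature matrices before fusion.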
3.4. Feature Fusion
Feature fusion makes the action recognition method efficient and effective, and it improves the human action classification rate in complex settings. Compared with the original Gabor and HOG features alone, feature fusion in this method yields significantly better results in both high-brightness environments and against dark backgrounds. The feature vectors have sizes of 1 × 3780, 1 × 60, and 1 × 9 for the HOG, Gabor, and chromatic features, respectively. Let $C_1, C_2, C_3, \cdots, C_n$ be the human activity classes that require classification in order to perform feature fusion, let $N$ denote the total number of model training samples, and let $FV_1$, $FV_2$, and $FV_3$ be the three extracted feature vectors. Their sizes are defined as:
(11)
The sizes of the feature vectors are denoted by $FV_1$, $FV_2$, and $FV_3$, representing the HOG, Gabor, and co-occurrence matrices with chromatic features, respectively. These feature vector sizes can be described using the set k, where k ∈ {3780, 60, 9}. The sizes of the extracted feature sets are $\Upsilon_{HOG} \rightarrow 1 \times 3780$, $\Upsilon_{Gab} \rightarrow 1 \times 60$, and $\Upsilon_{Chrom} \rightarrow 1 \times 9$. The final fused feature vector is indicated as:
(12)
(13)
(14)
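Assuming the serial fusion amounts to a per-sample concatenation of the selected vectors, a minimal sketch is:

import numpy as np

def fuse(hog_vec, gabor_vec, chrom_vec):
    # Serial (horizontal) concatenation of the three descriptors for one sample;
    # before selection this gives 1 x (3780 + 60 + 9) = 1 x 3849 values.
    return np.concatenate([hog_vec, gabor_vec, chrom_vec])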
3.5. Classification
We compared five different classifiers: linear-SVM (LS), cubic-SVM (CS), complex tree (CT), fine-KNN (FK), and subspace-KNN (SK). Figure 4 shows how the features were selected and combined to obtain the best results.
Figure 4. Summary of selecting feature vectors, combining them, and classifying it.
Subspace-KNN performed best on the KTH dataset, while cubic-SVM outperformed the other classifiers on the Weizmann dataset.
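These learner names match the presets of MATLAB's Classification Learner; for illustration, a rough scikit-learn analogue is sketched below, where mapping cubic-SVM to a degree-3 polynomial SVC, complex tree to a decision tree, and subspace-KNN to KNN bagged over random feature subspaces is our approximation rather than the paper's exact configuration.

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

classifiers = {
    "LS": SVC(kernel="linear"),
    "CS": SVC(kernel="poly", degree=3),            # cubic-SVM analogue
    "CT": DecisionTreeClassifier(),                # complex-tree analogue
    "FK": KNeighborsClassifier(n_neighbors=1),     # fine-KNN analogue
    "SK": BaggingClassifier(KNeighborsClassifier(), n_estimators=30,
                            max_features=0.5, bootstrap_features=True),  # subspace-KNN analogue
}

def evaluate(X_fused, y, folds=5):
    # Mean k-fold cross-validated accuracy for each classifier.
    return {name: cross_val_score(clf, X_fused, y, cv=folds).mean()
            for name, clf in classifiers.items()}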
4. Results and Analysis of Experiment
In this section, we describe the datasets used in our experiments and the results obtained under different performance measures.
4.1. Datasets
The overall success of any machine learning model, including the proposed network, is heavily influenced by the quality and applicability of the dataset. In this study, we consider the Weizmann and KTH datasets.
4.1.1. Weizmann Public Dataset
The Weizmann dataset comprises 2513 images depicting various human activities, performed by nine different actors and covering five types of human behavior. After feature selection and fusion, the classification techniques are applied to assess the results. Figure 5 provides sample images from the Weizmann dataset, which consists of five classes: Hand Waving, Running, Jumping, Walking, and Bending [39].
Figure 5. Images from the Weizmann dataset.
4.1.2. KTH Public Datasets
The KTH dataset comprises 1628 images showcasing six distinct types of human activities. Figure 6 displays sample images from the KTH dataset, which encompasses boxing, clapping, hand waving, jogging, running, and walking [40]. Table 2 presents a combined summary of both datasets (Cn denotes the class label, n = 1, 2, 3, ...).
Figure 6. Image from the KTH dataset.
Table 2. Datasets summary.
KTH Dataset | | | Weizmann Dataset | |
Classes | Activity | Images | Classes | Activity | Images
C1 | Clapping | 312 | C1 | Hand Waving | 624
C2 | Jogging | 191 | C2 | Running | 206
C3 | Hand Waving | 581 | C3 | Jumping | 421
C4 | Running | 109 | C4 | Walking | 271
C5 | Walking | 27 | C5 | Bending | 375
C6 | Boxing | 408 | | |
Total images | | 1628 | Total images | | 2513
4.2. Performance Measures
To evaluate the proposed algorithm, the following performance measures are used:
Specificity (SPE): the proportion of actual negative samples that the algorithm correctly rejects.
Area Under the Curve (AUC): a measure of how well the algorithm separates the target class from the remaining classes.
Precision (PRE): the proportion of samples identified as positive that are truly positive.
Sensitivity (SEN): the proportion of actual positive samples that the algorithm correctly identifies.
Accuracy (ACU): the overall proportion of correct predictions.
These measurements allow the overall classification performance to be quantified.
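For reference, the standard one-vs-rest definitions of these measures can be computed from a confusion matrix as sketched below; this is the conventional formulation, not code taken from the paper.

import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_metrics(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - (tp + fp + fn)
    return {
        "sensitivity": tp / (tp + fn),        # SEN = TP / (TP + FN)
        "specificity": tn / (tn + fp),        # SPE = TN / (TN + FP)
        "precision":   tp / (tp + fp),        # PRE = TP / (TP + FP)
        "accuracy":    (tp + tn) / cm.sum(),  # ACU per class (one-vs-rest)
    }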
4.3. Experimental Result and Discussion
To quantify the results, three distinct experiments are conducted, each involving a different number of features. Table 3 provides a detailed description of all experiments, specifying the number of classes, folds, and features. When assessing the performance of a machine learning model, a technique called “k-fold cross-validation” is commonly used. This involves dividing the dataset into subsets (folds), training the model on some folds, and evaluating it on others. The process is repeated multiple times. Using different values for k helps in obtaining a more reliable estimate of how well the model performs. After all the runs, the results are averaged to get a comprehensive assessment of the model’s effectiveness. This approach provides a more robust evaluation compared to a single train-test split. Each experiment uses five classification methods and calculates sensitivity, specificity, precision, and AUC (Area Under Curve) for the Weizmann and KTH datasets. This helps us see which classification method is the best fit for these specific datasets.
In Experiment 1, we observed that cubic-SVM achieved the highest specificity of 99.80% on the Weizmann dataset among all the algorithms, while Fine-KNN attained the highest precision of 99.93% on the KTH dataset, as indicated in Table 4.
In Experiment 2, we observed that cubic-SVM achieved the highest specificity of 99.84% on the Weizmann dataset among all the algorithms, while Fine-KNN attained the highest specificity of 99.94% on the KTH dataset, as indicated in Table 4.
In Experiment 3, we observed that cubic-SVM achieved the highest specificity of 99.84% on the Weizmann dataset among all the algorithms, while Fine-KNN attained the highest specificity of 99.93% on the KTH dataset, as indicated in Table 4.
Table 3. Number of images, extracted features, and validation folds for the KTH and Weizmann datasets.
Experiment no. | KTH Classes | KTH Images | Weizmann Classes | Weizmann Images | Shape Features | Texture Features | Color Features | Validation Folds
1 | 6 | 1628 | 5 | 2513 | 100 | 60 | 9 | 5
2 | 6 | 1628 | 5 | 2513 | 300 | 60 | 9 | 10
3 | 6 | 1628 | 5 | 2513 | 800 | 58 | 9 | 5
Table 4. Classification results for KTH and Weizmann datasets.
Experiment no. | Method | Weizmann Sensitivity (%) | Weizmann Specificity (%) | Weizmann Precision (%) | Weizmann Accuracy (%) | KTH Sensitivity (%) | KTH Specificity (%) | KTH Precision (%) | KTH Accuracy (%)
1 | LS | 98.18 | 99.17 | 98.55 | 98.18 | 99.83 | 99.92 | 99.04 | 99.81
1 | CS | 98.83 | 99.80 | 98.96 | 99.31 | 99.81 | 99.90 | 99.19 | 99.71
1 | CT | 85.96 | 97.35 | 86.16 | 89.01 | 99.68 | 98.28 | 97.75 | 98.41
1 | FK | 98.98 | 99.79 | 99.33 | 99.03 | 99.79 | 99.77 | 99.93 | 99.61
1 | SK | 90.38 | 98.36 | 91.75 | 93.38 | 99.88 | 99.87 | 99.75 | 99.83
2 | LS | 98.83 | 99.76 | 98.53 | 98.87 | 99.84 | 99.92 | 99.23 | 99.82
2 | CS | 98.85 | 99.84 | 98.97 | 99.23 | 99.82 | 99.90 | 99.20 | 99.73
2 | CT | 85.93 | 97.36 | 86.07 | 89.03 | 98.46 | 99.69 | 97.78 | 98.42
2 | FK | 98.97 | 99.79 | 99.25 | 99.02 | 99.76 | 99.94 | 99.77 | 99.62
2 | SK | 90.33 | 98.36 | 91.74 | 93.37 | 99.85 | 99.93 | 99.76 | 99.82
3 | LS | 98.87 | 99.73 | 98.56 | 98.86 | 99.83 | 99.92 | 99.23 | 99.82
3 | CS | 98.86 | 99.84 | 98.99 | 99.36 | 99.82 | 99.89 | 99.24 | 99.72
3 | CT | 85.91 | 97.43 | 86.16 | 89.54 | 98.01 | 99.47 | 97.67 | 98.44
3 | FK | 99.67 | 99.79 | 99.20 | 99.05 | 99.76 | 99.93 | 99.77 | 99.64
3 | SK | 90.34 | 98.32 | 91.76 | 93.38 | 99.65 | 99.91 | 99.73 | 99.82
LS = Linear-SVM, CS = Cubic-SVM, CT = Complex Tree, FK = Fine-KNN, SK = Subspace-KNN.
4.4. Result Comparison
Table 5 compares the previously implemented algorithms and the proposed algorithm. The table provides a clearer understanding of the performance metrics and highlights the proposed algorithm’s superiority. The basis for this conclusion is derived from the discussion that follows.
The proposed algorithm distinguishes itself by incorporating three distinct feature extractors. This combination yields a notable improvement in accuracy over previously implemented algorithms, because the extractors capture complementary information and together produce a more robust representation. In summary, the proposed algorithm surpasses its predecessors in accuracy, and the use of multiple feature extractors enhances its ability to capture and leverage diverse information, reinforcing its potential for applications that require high precision and reliable results.
Table 5. Comparison of action recognition results.
Dataset Name | Reference Paper | Publication Year | Accuracy (%)
Weizmann | Li et al. [41] | 2013 | 95.43
Weizmann | JPaul et al. [42] | 2014 | 95.54
Weizmann | Candès et al. [43] | 2016 | 88.16
Weizmann | Imran et al. [44] | 2016 | 90.43
Weizmann | Ahemed Sharif et al. [38] | 2017 | 95.88
Weizmann | S. Aly et al. [45] | 2019 | 99.02
Weizmann | D. K. Vishwakarma et al. [46] | 2020 | 96.06
Weizmann | Our Proposed Method | 2023 | 99.80
KTH | Le. Shao et al. [47] | 2014 | 95.09
KTH | Jain et al. [48] | 2015 | 95.23
KTH | J. Yang et al. [49] | 2015 | 96.55
KTH | H. Liu et al. [50] | 2016 | 97.17
KTH | Ribeiro et al. [51] | 2017 | 94.93
KTH | M. Sharif et al. [45] | 2017 | 96.36
KTH | Ibrahim et al. [53] | 2019 | 91.63
KTH | Kong et al. [52] | 2020 | 94.82
KTH | Our Proposed Method | 2024 | 99.94
5. Conclusion
This study presents a novel method for identifying and detecting human activity in multimedia frames and videos. Preprocessing, feature extraction, feature selection, serial feature fusion, and classification are the five main steps of the algorithm. Through a series of experiments using the KTH and Weizmann datasets, the algorithm demonstrates superior performance in recognizing the targeted activities. The study emphasizes the importance of shape features for accurate classification and identifies texture and color features as crucial for detecting various human activities. Additionally, the integration of feature selection and fusion significantly enhances the system's accuracy and sensitivity. The proposed algorithm is considerably more accurate than existing methods, achieving a 99.94% accuracy rate on the KTH dataset and 99.80% on the Weizmann dataset, which demonstrates its effectiveness at recognizing and classifying human activities. Overall, the research contributes a robust approach to activity detection and classification in multimedia, outperforming current methods.
Author Contributions
Conceptualization: Subrata Kumer Paul, Md. Atikur Rahman, Md. Ekramul Hamid, Rakhi Rani Paul.
Methodology: Subrata Kumer Paul, Md. Momenul Haque.
Data collection and preprocessing: Rakhi Rani Paul, Md. Atikur Rahman.
Used Software: Subrata Kumer Paul, Md. Ekramul Hamid.
Writing Original Draft and final copy: Subrata Kumer Paul, Md. Atikur Rahman.
Overall Supervision: Throughout the project, Md. Ekramul Hamid provided overall supervision, ensuring coherence and adherence to project goals.
Data Availability Statement
The data are available in a publicly accessible repository. The data presented in this study are openly available.
Acknowledgements
I extend my sincere thanks to the Information and Communication Technology Division of the Ministry of Posts, Telecommunication, and Information Technology, People's Republic of Bangladesh, for their invaluable support and for funding my ICT fellowship during the MPhil program. Additionally, I would like to express my gratitude to my supervisor and co-authors for their guidance and contributions to this research.