Point cloud classification is a key technology for extracting information from point cloud data and for 3D reconstruction, and it has a wide range of applications. However, existing point cloud classification methods have shortcomings in feature extraction: they extract local information insufficiently, overlook the information carried by other neighborhood features in the point cloud, and do not attend to the channel and spatial information of the point cloud. To address these problems, a point cloud classification network based on graph convolution and a fusion attention mechanism is proposed. First, each point of the point cloud is regarded as a node of a graph, the k-nearest neighbor algorithm is used to construct the graph, and the information between points is captured dynamically by stacking multiple graph convolution layers. Then, drawing on experience with attention mechanisms in 2D vision, an attention mechanism that attends to both the spatial and the channel information of the point cloud is introduced to enrich the point cloud features, aggregate useful local features and suppress useless ones. Classification experiments on the ModelNet40 dataset show that, compared with PointNet, which does not consider the local feature information of the point cloud, the proposed model improves the average classification accuracy by 4.4 percentage points and the overall classification accuracy by 3.8 percentage points. The classification accuracy of the proposed model is also higher than that of the other networks compared.
In recent years, with the rapid development of 3D laser scanning technology, acquiring 3D point cloud data has become increasingly convenient. Like image data, lidar point cloud data has gradually become basic data for deep learning. 3D point cloud data is not affected by light, temperature or other external factors, and it carries rich spatial information such as geometry, scale and shape. Point cloud classification is a key technology for extracting information from point cloud data and for 3D model reconstruction; it divides unordered point cloud data into multiple categories. It has broad application prospects in fields such as autonomous driving [
With the development of deep learning technology, the use of deep learning methods to study 3D point clouds has become a major trend [
However, due to the disorder, sparsity and non-structural characteristics of point clouds [
In summary, existing deep learning networks for point cloud classification have several defects: converting the point cloud to another representation loses part of its feature information, the local feature information of the point cloud is not attended to, and the connections between points are not taken into account. These defects limit the performance of the classification network, which therefore cannot achieve better results on the point cloud classification task.
Since graph convolution has achieved good results in 2D image processing [
The PointNet network consists of three main parts: point cloud alignment transformation, feature extraction, and max pooling to obtain global features. It contains two T-Net transform networks and two multi-layer perceptrons with shared weights. The T-Net networks align the point cloud to standardize its features, the multi-layer perceptrons extract the point cloud features, and max pooling fuses the per-point features into a 1024-dimensional global feature for the final classification. Extracting features with a multi-layer perceptron alone does not attend to the local features of the point cloud and ignores the connections between points. To address these problems, a point cloud classification network based on graph convolution and a fusion attention mechanism is proposed. Its structure is composed mainly of a dynamic graph convolution module and a fusion attention module.
The overall network structure is shown in
The proposed network model contains four Graph Conv modules and two F-Attention fusion attention modules. From left to right, the four Graph Conv modules have 64, 64, 128 and 256 convolution kernels, and the two F-Attention modules follow the Graph Conv modules with 128 and 256 convolution kernels, respectively. The point cloud input has dimension N × D, where N denotes the number of points sampled from the point cloud and D denotes the dimension of each point. In the most common case, D = 3, each point carries only its 3D spatial coordinates.
In the Graph Conv module, the input is an N × f feature matrix, where N denotes the number of input points and f the input feature dimension of each point. k denotes the number of neighbors of each center point in the graph constructed with the KNN algorithm. The edge features of each point are extracted by n weight-sharing multi-layer perceptrons (mlp{L1, L2, ..., Ln}), max pooling over the extracted edge features updates the node features, and the output features have dimension N × Ln.
The entire network processes the input as follows. The N × D point cloud first passes through two Graph Conv modules with 64 convolution kernels, which yield N × 64 point cloud features. After feature extraction by the Graph Conv module with 128 convolution kernels, F-Attention (the fusion attention module) aggregates local neighborhood features and outputs N × 128 point cloud features. After feature extraction by the Graph Conv module with 256 convolution kernels, a second F-Attention module aggregates local neighborhood features and outputs N × 256 point cloud features. Finally, the 64-, 128- and 256-dimensional features are concatenated, and the N × 1024-dimensional point cloud features are obtained with a pooling scheme that combines max pooling and average pooling. This feature contains both the global and the local features of the point cloud and attends to its spatial and channel information. It then passes through three fully connected layers (512, 256, C) to obtain the final C-dimensional classification score.
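The fusion and pooling step can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the per-point outputs of the four stages (64, 64, 128 and 256 channels) are concatenated, and the max-pooled and average-pooled vectors are concatenated without any further projection. The text does not spell out this arithmetic, and the function name `fuse_and_pool` is hypothetical.

```python
def fuse_and_pool(stage_features):
    """Concatenate per-point features from several stages, then pool over points.

    stage_features: list of stages, each a list of N per-point feature vectors.
    Returns one global vector: [max over points] + [mean over points].
    """
    n_points = len(stage_features[0])
    # Splice the stage outputs channel-wise for every point.
    fused = []
    for p in range(n_points):
        row = []
        for stage in stage_features:
            row.extend(stage[p])
        fused.append(row)
    n_channels = len(fused[0])
    # Max pooling keeps the strongest response per channel.
    max_pool = [max(fused[p][c] for p in range(n_points)) for c in range(n_channels)]
    # Average pooling keeps the mean response per channel.
    avg_pool = [sum(fused[p][c] for p in range(n_points)) / n_points
                for c in range(n_channels)]
    return max_pool + avg_pool


# Toy input: 4 points, stage widths as in the text (assumed order 64, 64, 128, 256).
widths = [64, 64, 128, 256]
stages = [[[float(p + w)] * w for p in range(4)] for w in widths]
global_feature = fuse_and_pool(stages)
print(len(global_feature))  # 2 * (64 + 64 + 128 + 256) = 1024
```

Under these assumptions the pooled vector has 1024 entries, which matches the size of the global feature stated in the text; a network that inserts an extra projection before pooling would reach the same size by a different route.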
The proposed model uses graph convolution in place of the multi-layer perceptron that PointNet uses to extract point cloud features, so the network can extract local as well as global features of the point cloud. Because PointNet++ showed that the T-Net network does not improve classification performance, T-Net is removed from the proposed model, and the local feature extraction ability of the network is strengthened by stacking multiple graph convolutions. An attention mechanism module is also introduced. To limit network complexity, only two fusion attention modules are added, which let the network attend to both the spatial information and the channel information of the point cloud. When computing the global feature, multi-layer feature fusion is used so that the model better integrates the high-level and low-level features of the point cloud and retains as much of the original point cloud information as possible. A combination of max pooling and average pooling replaces the max pooling used in PointNet, so that global and local features are fused better, and the fused 1024-dimensional global feature is used for the final classification. The final model size is 44.7 MB. Although this is 3.1 MB larger than the PointNet model, the experimental analysis shows that the proposed model achieves better classification results than PointNet.
In two-dimensional images, graph convolution is a convolutional neural network that can directly act on the graph and use its structural information. The main operation of graph convolution is to first construct the data into a graph with vertices and edges, and then convolve on the graph data. For graph convolution, it is divided into spatial domain graph convolution [
In the proposed network model, the point cloud input can be expressed as:
$$ X = \{x_1, \cdots, x_n\} \subseteq \mathbb{R}^{D} \tag{1} $$
Here $X$ denotes the point cloud as a set and $x_i$ denotes a point in it. Every point has a D-dimensional feature; in the most common case, D = 3, each point carries only its 3D spatial coordinates.
For the local point cloud structure, it can be represented by a directed graph: G = (V, E), where V is the set of N local nodes:
$$ V = \{x_1, \cdots, x_n\} \tag{2} $$
E denotes the set of edges between nodes:
$$ E = \{e_{ij}\}_{i,j=1}^{N} \tag{3} $$
Here $e_{ij}$ denotes the edge between node i and node j. The local directed graph G in the model is constructed with the K-nearest neighbor (KNN) algorithm. In the local directed graph G, if point i is taken as the central node, its K nearest neighbors j, including node i itself, are found by the KNN algorithm. The edge feature $e_{ij}$ of two adjacent nodes can be described as:
$$ e_{ij} = h_{\theta}(x_i, x_j) \tag{4} $$
In Formula (4), $\theta$ denotes the set of learnable parameters of the model, such as the weights, $h_{\theta}$ is a nonlinear function parameterized by $\theta$, and $x_i$ and $x_j$ are the features of node i and its neighbor node j, respectively.
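The graph construction step can be sketched in plain Python as follows. This is a brute-force illustration and not the implementation used in the experiments; the function name `knn_graph` and the `include_self` flag are hypothetical, with the default chosen to follow the statement that the neighbors include node i.

```python
def knn_graph(points, k, include_self=True):
    """Return, for every point i, the indices of its k nearest neighbors.

    Distances are squared Euclidean distances. With include_self=True the
    point itself (distance 0) counts as its own first neighbor.
    """
    edges = []
    for i, xi in enumerate(points):
        # Rank all candidate neighbors j by their distance to point i.
        ranked = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(points)
            if include_self or j != i
        )
        edges.append([j for _, j in ranked[:k]])
    return edges


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
neighbors = knn_graph(cloud, k=3)
print(neighbors[0])  # [0, 1, 2]
```

Each list `neighbors[i]` holds the end points j of the directed edges (i, j) that leave node i. The brute-force search costs O(N²) distance evaluations per graph, which is acceptable for 1024 points but is usually replaced by a batched matrix computation in practice.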
In the Graph Conv module, the edge function and the aggregation operation play an important role in extracting local features of the point cloud. The PointNet network can be regarded as a special case of a graph convolutional neural network in which there is no edge information between points. Its edge function is:
$$ h_{\theta}(x_i, x_j) = h_{\theta}(x_i) \tag{5} $$
However, the edge function (5) considers only the global information of the point cloud in the local directed graph and ignores the local information. In the proposed network model, the following edge function is defined, which considers both the global information ($x_i$) and the local information ($x_j - x_i$) of the point cloud:
$$ h_{\theta}(x_i, x_j) = h_{\theta}(x_i, x_j - x_i) \tag{6} $$
When aggregating features, for the central node $x_i$ of the directed graph, $x'_i$ is defined as the aggregation of the edge features of the k points around it:
$$ x'_i = \sum_{j:(i,j) \in E} h_{\theta}(x_i, x_j - x_i) \tag{7} $$
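Formulas (6) and (7) can be illustrated with a small plain-Python sketch. It assumes that $h_{\theta}$ is a single linear layer followed by ReLU applied to the concatenation of $x_i$ and $x_j - x_i$, whereas the Graph Conv module uses stacked weight-sharing perceptrons. The function name `edge_conv` and the `reduce` switch are hypothetical: `"sum"` follows Formula (7), and `"max"` corresponds to the max pooling mentioned in the module description.

```python
def edge_conv(features, neighbors, weight, bias, reduce="sum"):
    """One graph convolution step: x'_i = AGG_j h(x_i, x_j - x_i).

    features:  N feature vectors of length in_dim.
    neighbors: neighbors[i] lists the indices j with an edge (i, j).
    weight:    out_dim rows of length 2 * in_dim; bias: length out_dim.
    """
    agg = sum if reduce == "sum" else max
    out = []
    for i, xi in enumerate(features):
        edge_feats = []
        for j in neighbors[i]:
            xj = features[j]
            # Global part x_i concatenated with local part x_j - x_i.
            inp = list(xi) + [b - a for a, b in zip(xi, xj)]
            # Assumed edge function: linear layer followed by ReLU.
            edge = [max(0.0, sum(w * v for w, v in zip(row, inp)) + b0)
                    for row, b0 in zip(weight, bias)]
            edge_feats.append(edge)
        # Aggregate the edge features channel by channel.
        out.append([agg(e[c] for e in edge_feats) for c in range(len(bias))])
    return out


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
graph = [[1, 2], [0, 2], [0, 1]]
# Channel 0 looks only at x_i, channel 1 only at the offset x_j - x_i.
weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
bias = [0.0, 0.0]
print(edge_conv(points, graph, weight, bias))         # [[0.0, 2.0], [2.0, 0.0], [0.0, 0.0]]
print(edge_conv(points, graph, weight, bias, "max"))  # [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
```

The first output channel depends only on the center point, which is all that the PointNet-style edge function (5) would see, while the second channel depends only on the relative offsets to the neighbors.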
Firstly, the KNN algorithm is used to construct the graph structure from the point cloud to the point cloud set and the process of learning the aggregated edge features through the Graph Conv module is shown in
The above process is the graph convolution process. Its main steps are to select a point in the point cloud set as the center point, find its neighbor points with the K-nearest neighbor algorithm (five neighbor points are shown in the figure), learn the edge features between the center point and its neighbor points through the Graph Conv module, and finally aggregate the edge features. To increase the feature extraction ability of the network and expand the receptive field of the model, the proposed model stacks multiple Graph Conv modules so that the graph structure is updated dynamically, which forms a dynamic graph convolution. The input and the directed graph structure of layer l can be expressed as:
$$ X^{(l)} = \{x_1^{(l)}, x_2^{(l)}, \cdots, x_n^{(l)}\} \subseteq \mathbb{R}^{D_l} \tag{8} $$
$$ G^{(l)} = (V^{(l)}, E^{(l)}) \tag{9} $$
The output of layer l, which serves as the input of layer l + 1, is updated as:
$$ x_i^{(l+1)} = \sum_{j:(i,j) \in E^{(l)}} h_{\theta}^{(l)}\left(x_i^{(l)}, x_j^{(l)} - x_i^{(l)}\right) \tag{10} $$
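The dynamic update of Formulas (8) to (10) can be sketched as follows: each layer rebuilds the KNN graph from its own input features before applying the edge function. This is a toy illustration in plain Python; the function names `knn` and `graph_conv` are hypothetical, the edge function is assumed to be one linear layer with ReLU, and each point is kept as its own neighbor.

```python
def knn(features, k):
    """Indices of the k nearest rows (squared Euclidean) for every row, self included."""
    result = []
    for xi in features:
        ranked = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(features)
        )
        result.append([j for _, j in ranked[:k]])
    return result


def graph_conv(features, k, weight, bias):
    """Rebuild the graph from the current features, then apply Formula (10)."""
    graph = knn(features, k)
    out = []
    for i, xi in enumerate(features):
        acc = [0.0] * len(bias)
        for j in graph[i]:
            inp = list(xi) + [b - a for a, b in zip(xi, features[j])]
            for c, (row, b0) in enumerate(zip(weight, bias)):
                # Assumed edge function: linear layer followed by ReLU, summed over j.
                acc[c] += max(0.0, sum(w * v for w, v in zip(row, inp)) + b0)
        out.append(acc)
    return out, graph


x0 = [[0.0], [1.0], [3.0], [4.0]]                       # four points on a line
x1, g0 = graph_conv(x0, 2, [[0.0, 1.0]], [0.0])   # layer 0: graph in coordinate space
x2, g1 = graph_conv(x1, 2, [[1.0, 0.0]], [0.0])   # layer 1: graph in feature space
print(g0)  # [[0, 1], [1, 0], [2, 3], [3, 2]]
print(g1)  # [[0, 2], [1, 3], [0, 2], [1, 3]]
```

In the first layer, points 0 and 2 are not neighbors because they are far apart in space. After one layer they have the same feature value, so the second graph connects them, which is how stacking layers enlarges the receptive field beyond the spatial neighborhood.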
The attention mechanism [
The entire fusion attention module, shown in
In the spatial attention module (Spatial attention), the input point cloud feature matrix is denoted A, with dimension B × N × C. Two new feature matrices A1 and A2 of dimension B × N × C, which carry more spatial information, are obtained from A by linear transformations. The matrix A1 is transposed and multiplied with the matrix A2, and the softmax function is applied to the product to obtain the spatial attention coefficient matrix E of size N × N, with one coefficient for each pair of points. The calculation is shown in Formula (11):
$$ a_{ji} = \frac{\exp(A_{1,i} \cdot A_{2,j})}{\sum_{i=1}^{N} \exp(A_{1,i} \cdot A_{2,j})} \tag{11} $$
Here $a_{ji}$ is the output of the softmax function and measures the influence of spatial position i on position j in matrix E. The feature matrix A is also fed into a 1 × 1 convolution layer to obtain a new feature matrix A3 of dimension B × N × C. Multiplying the attention coefficient matrix E with A3 gives an output feature of dimension B × N × C, which is scaled by a learnable parameter λ so that its weight can be adjusted during training. Finally, this attention-updated feature is added element-wise to the original feature matrix A to give the final output M, as shown in Formula (12):
$$ M_j = \lambda \sum_{i=1}^{N} \left(a_{ji} A_{3,i}\right) + A_j \tag{12} $$
The parameter λ in Formula (12) is initialized to 0 and gradually receives more weight during training. The final feature M obtained from this module contains both the features of the original point cloud and its spatial position features, and therefore aggregates global context information better.
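A minimal sketch of the spatial attention computation for a single sample (the batch dimension B is dropped) is given below. It assumes that A1, A2 and A3 are all produced by plain C × C linear maps without bias; the names `matmul`, `spatial_attention`, `W1`, `W2`, `W3` and `lam` are hypothetical.

```python
import math


def matmul(x, w):
    """Multiply an (n x c) matrix by a (c x d) matrix, both given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def spatial_attention(A, W1, W2, W3, lam):
    """M_j = lam * sum_i a_ji * A3_i + A_j, with a_ji a softmax over positions i."""
    A1, A2, A3 = matmul(A, W1), matmul(A, W2), matmul(A, W3)
    n, c = len(A), len(A[0])
    M = []
    for j in range(n):
        # Score of every position i with respect to position j.
        scores = [sum(p * q for p, q in zip(A1[i], A2[j])) for i in range(n)]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        a_j = [e / total for e in exps]  # row j of E, sums to 1
        # Attention-weighted sum of A3 plus the residual A_j.
        M.append([lam * sum(a_j[i] * A3[i][ch] for i in range(n)) + A[j][ch]
                  for ch in range(c)])
    return M


A = [[1.0, 0.0], [0.0, 1.0]]            # N = 2 points, C = 2 channels
identity = [[1.0, 0.0], [0.0, 1.0]]
print(spatial_attention(A, identity, identity, identity, 0.0) == A)  # True
M = spatial_attention(A, identity, identity, identity, 1.0)
```

With `lam = 0.0` the module returns its input unchanged, which is why initializing λ to 0 lets training start from the plain graph convolution features and add attention gradually.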
The channel attention module (Channel attention), which attends to the channel information of the point cloud, is similar to the spatial attention module above. The input point cloud feature matrix is again denoted A, with size B × N × C. The matrix A is transposed and multiplied with the original matrix, and the softmax function is applied to obtain the channel attention coefficient matrix F of size C × C. The calculation is shown in Formula (13):
$$ b_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)} \tag{13} $$
where $b_{ji}$ measures the effect of channel i on channel j, and $A_i$ denotes the i-th channel of A. The feature matrix A is multiplied with the attention coefficient matrix F to obtain an output feature of dimension B × N × C, which is scaled by a learnable parameter χ so that its weight can be adjusted during training. Finally, this feature updated by channel attention is added element-wise to the original feature matrix A, which restores the information of the input feature and gives the final output W, as shown in Formula (14):
$$ W_j = \chi \sum_{i=1}^{C} \left(b_{ji} A_i\right) + A_j \tag{14} $$
Similarly, the parameter χ is initialized to 0 and trained to assign weights. As shown in
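The channel attention of Formulas (13) and (14) can be sketched in the same way for a single sample. The function name `channel_attention` and the argument `chi` are hypothetical, and the softmax is taken over the C channels, in line with the C × C size of F.

```python
import math


def channel_attention(A, chi):
    """W_j = chi * sum_i b_ji * A_i + A_j, where A_i is the i-th channel of A.

    A: N x C features for one sample; returns N x C features.
    """
    n, c = len(A), len(A[0])
    # Channel vectors: cols[i] holds channel i over all N points.
    cols = [[A[p][ch] for p in range(n)] for ch in range(c)]
    out_cols = []
    for j in range(c):
        # Similarity of every channel i to channel j.
        scores = [sum(u * v for u, v in zip(cols[i], cols[j])) for i in range(c)]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        b_j = [e / total for e in exps]  # row j of F, sums to 1
        out_cols.append([chi * sum(b_j[i] * cols[i][p] for i in range(c)) + cols[j][p]
                         for p in range(n)])
    # Transpose back to N x C.
    return [[out_cols[ch][p] for ch in range(c)] for p in range(n)]


A = [[1.0, 0.0], [0.0, 1.0]]
print(channel_attention(A, 0.0) == A)  # True
W = channel_attention(A, 1.0)
```

No extra linear transformations appear here because the channel branch, as described, computes its coefficients directly from A; only the scalar χ is learned.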
The point cloud classification experiments use ModelNet40, the standard public dataset from Princeton University. It contains 40 categories of man-made objects and 12311 CAD models in total, of which 9843 are used for training and 2468 for testing. From each input point cloud, 1024 points are sampled; the dimension D of each sampled point is 3, so the sampled points contain only 3D coordinate information.
The model was run on Ubuntu 20.04.2 LTS with CUDA 10.1, PyTorch 1.6 and Python 3.7. The learning rate is 0.001, the number of iterations is 250, the batch size is 32, and the Adam optimizer is used.
The experiments assess the classification accuracy of the model by comparing the average classification accuracy (mAcc) and the overall classification accuracy (OA) of different network models, which are computed as:
$$ \mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15} $$
$$ \mathrm{mAcc} = \frac{\mathrm{OA}}{TP + TN + FP + FN} \tag{16} $$
TP and TN denote the numbers of samples correctly predicted as positive and negative, respectively, and FP and FN denote the numbers of false positive and false negative samples, respectively.
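For a multi-class task such as ModelNet40, the two metrics are commonly computed directly from the predicted labels: OA as the fraction of all samples classified correctly, and mAcc as the mean of the per-class accuracies. The sketch below follows that convention, which is an assumption about how the reported numbers were obtained rather than a transcription of Formula (16); the function name `overall_and_mean_accuracy` is hypothetical.

```python
def overall_and_mean_accuracy(y_true, y_pred, num_classes):
    """Return (OA, mAcc) for integer class labels in [0, num_classes)."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    # Overall accuracy: every sample has the same weight.
    oa = sum(correct) / len(y_true)
    # Mean class accuracy: every class that occurs has the same weight.
    per_class = [c / n for c, n in zip(correct, total) if n > 0]
    macc = sum(per_class) / len(per_class)
    return oa, macc


y_true = [0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 2]
oa, macc = overall_and_mean_accuracy(y_true, y_pred, 3)
print(oa)    # 0.875
print(macc)  # (1 + 0 + 1) / 3, about 0.667
```

The toy labels show why both numbers are reported: a frequent class that is classified well raises OA, while mAcc drops as soon as a small class is missed entirely.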
To get better experimental comparison results, different classical point cloud classification network models are selected on the ModelNet40 dataset. The selected network models are VoxNet, MVCNN, ECC, PointNet, PointNet++, LDGCNN, DGCNN. The data from different network model experiments are shown in
In the data achieved in
Method | Model Input (size) | Model Size/MB | Average Classification Accuracy/% | Overall Classification Accuracy/% |
---|---|---|---|---|
VoxNet [ | Voxels (12) | — | 82.8 | 85.7 |
MVCNN [ | Views (80) | — | 89.1 | — |
ECC [ | Points (1024) | — | 83.2 | 87.4 |
PointNet [ | Points (1024) | 41.6 | 85.8 | 88.7 |
PointNet++ [ | Points + normal (1024) | 24.2 | — | 90.5 |
DGCNN [ | Points (1024) | 21.0 | 89.1 | 91.3 |
Ours | Points (1024) | 44.7 | 90.2 | 92.5 |
the spatial information characteristics of the point cloud, so its classification accuracy is higher.
Like PointNet, the proposed model is a classification network that takes point cloud data directly as input without converting it, and it was constructed with reference to the PointNet model. The proposed model is therefore compared with PointNet on the ModelNet40 dataset category by category. The results achieved by classifying each category separately are shown in
It can be seen from
Category | PointNet | Ours | Category | PointNet | Ours |
---|---|---|---|---|---|
Airplane | 1.000 | 1.000 | Laptop | 1.000 | 1.000 |
Bathtub | 0.870 | 0.920 | Mantel | 0.930 | 0.950 |
Bed | 0.960 | 0.970 | Monitor | 0.950 | 0.980 |
Bench | 0.700 | 0.750 | Night stand | 0.742 | 0.776 |
Bookshelf | 0.910 | 0.930 | Person | 0.920 | 0.950 |
Bottle | 0.940 | 0.960 | Piano | 0.900 | 0.930 |
Bowl | 0.900 | 0.940 | Range hood | 0.920 | 0.952 |
Car | 0.960 | 0.980 | Sink | 0.780 | 0.850 |
Chair | 0.970 | 0.980 | Sofa | 0.960 | 0.970 |
Cone | 0.950 | 1.000 | Stairs | 0.800 | 0.900 |
Cup | 0.780 | 0.800 | Stool | 0.850 | 0.800 |
Curtain | 0.900 | 0.920 | Table | 0.800 | 0.870 |
Desk | 0.800 | 0.900 | Tent | 0.950 | 0.953 |
Door | 0.800 | 0.920 | Toilet | 0.980 | 0.970 |
Dresser | 0.696 | 0.726 | Television stand | 0.800 | 0.860 |
Flower pot | 0.220 | 0.250 | Vase | 0.820 | 0.830 |
Glass box | 0.950 | 0.970 | Wardrobe | 0.750 | 0.790 |
Guitar | 1.000 | 0.980 | Xbox | 0.650 | 0.750 |
Keyboard | 1.000 | 1.000 | Plant | 0.760 | 0.780 |
Lamp | 0.950 | 0.963 | Radio | 0.750 | 0.800 |
the proposed model considers both the local and the global information characteristics of the point cloud, and attends to its spatial and channel information characteristics. Therefore, the recognition rate of the proposed model improves in most categories, both those with distinctive features and those without.
In the proposed network model, the K value used when constructing the graph with the K-nearest neighbor algorithm determines how much local geometric information each node covers, and therefore affects the final classification result. Among the K values compared, the classification accuracy is highest when K is 20, as shown in
The proposed network model contains four Graph Conv layers and two F-Attention modules, which follow the Graph Conv layers with 128 and 256 convolution kernels, respectively. The effectiveness of the fusion attention mechanism for the classification task is verified by an ablation in which F-Attention modules are removed.
It can be seen from
Metric | K = 10 | K = 20 | K = 30 | K = 40 |
---|---|---|---|---|
Average Classification Accuracy/% | 89.3 | 90.2 | 89.7 | 88.7 |
Overall Classification Accuracy/% | 91.2 | 92.5 | 91.8 | 90.5 |

F-Attention modules introduced | Average Classification Accuracy/% | Overall Classification Accuracy/% |
---|---|---|
None | 89.1 | 91.3 |
F-Attention (128) | 89.7 | 92.0 |
F-Attention (256) | 89.3 | 91.8 |
F-Attention (128) + (256) | 90.2 | 92.5 |
classification accuracy improved by 0.2% and 0.5%, respectively, when the F-Attention module was introduced after the Graph Conv layer with 256 convolution kernels. After introducing both F-Attention modules, the average and the overall classification accuracy improved by 1.1% and 1.2%, respectively, compared with the model without attention. Adding the fusion attention mechanism lets the model attend to both the spatial and the channel information features of the point cloud, emphasize the features that are useful for classification and suppress the useless ones, which strengthens the feature extraction ability of the network and yields better classification results. The experimental results show that the F-Attention module clearly improves the classification performance of the proposed model.
A point cloud classification network based on dynamic graph convolution and a fusion attention mechanism is proposed to address the shortcomings of existing point cloud classification networks, namely inadequate extraction of local information, neglect of the information in other neighborhood features, and lack of attention to the channel and spatial information of the point cloud. Graph convolution is introduced to extract local features of the point cloud so that local and global features are fused better, and an attention mechanism that attends to both the spatial and the channel information of the point cloud is introduced to enhance the feature extraction ability of the network. The experimental results show that, compared with several classical point cloud classification networks, the proposed model achieves higher classification accuracy. However, the model does not attend to the direction information of the point cloud, which remains a limitation. Future work will consider how to integrate the direction information of point clouds into the model to further improve the classification results.
This work is partially supported by the Development Program of Youth Innovation Teams in Colleges and Universities of Shandong Province (2019KJN048).
The authors declare no conflicts of interest regarding the publication of this paper.
Song, T.T., Li, Z., Liu, Z.G. and He, Y.Z. (2022) Point Cloud Classification Network Based on Graph Convolution and Fusion Attention Mechanism. Journal of Computer and Communications, 10, 81-95. https://doi.org/10.4236/jcc.2022.109006