WADE-Net: Weighted Aggregation with Density Estimation for Point Cloud Place Recognition

Point cloud based place recognition plays an important role in mobile robotics. In this paper, we propose a method that adaptively aggregates weighted structure information for point cloud place recognition. Firstly, to preserve the prior distributions and local geometric structures, we fuse learned hidden features with handcrafted features at the beginning. Secondly, we further extract and adaptively aggregate weighted features concerning density and relative spatial information from these fused features, in a module named Weighted Aggregation with Density Estimation (WADE). Then, we apply the WADE block iteratively to group the latent manifold structures. Finally, comparison results on two public datasets, Oxford Robotcar and KITTI, show that the proposed approach exceeds the comparison approaches in recall rate by 7% - 8% on average.


Introduction
Large-scale place recognition plays a significant role in robotics and automatic driving, since it usually enhances localization and mapping optimization [1]-[6]. Vision based large-scale place recognition has been investigated, and some successful solutions are presented in several surveys [7] [8] [9]. However, these methods are sensitive to season and illumination variations. Meanwhile, with the help of spatially aware feature information, 3D point based methods are relatively robust to these changes [10] [11] [12] [13] [14]. The outdoor scene in Figure 1 shows that a 3D point cloud can better describe the different spatial distributions of different local regions (Figure 1 depicts different point distributions in an outdoor scene; the three enlarged dashed circles are an advertising board, the rear of a car, and afforestation, respectively). As a consequence, place recognition from point cloud data is becoming an increasingly attractive research topic. The main challenge of point cloud recognition lies in how to extract effective features and generate a discriminative representation. Various feature extraction strategies based on handcrafted or deep learning methods for 3D point clouds have emerged gradually [15] [16].
Traditionally, some 3D recognition methods pay attention to handcrafted local feature extraction, including normal orientation, curvature and distribution histograms [17] [18] [19] [20] [21]. Specifically, [19] [20] generate histograms based on geometric attributes to obtain local information. These methods, however, often consume much time and considerable computational resources. By contrast, some works [18] [21] try to improve the efficiency of local feature extraction; they can extract local point cloud features at lower computational cost. However, they do not perform well at sparse spatial locations. Moreover, these approaches only concentrate on local views, not on a global perspective.
Therefore, methods for extracting global descriptors have been proposed gradually [22] [23] [24] [25] [26]. Ensemble of Shape Functions (ESF) [22], Normal Aligned Radial Feature (NARF) [23] and Viewpoint Feature Histogram (VFH) [24] are able to filter out local locations with a sparse distribution of points. The idea of projecting a 3D point cloud to a 2D image has also been exploited [5] [27] [28]. He et al. propose a global descriptor, Multiple 2D Planes (M2DP) [28], for place recognition and loop detection. It projects the 3D point cloud onto multiple 2D planes and generates a descriptor vector for place representation. Kim et al. propose Scan-Context [5] for place recognition on the basis of a 3D point database. Scan-Context separates the whole point cloud into many bins by radius and azimuth, and defines the maximum height of the points in each bin as the feature value. In general, these traditional works, based on prior knowledge, obtain handcrafted spatial features of 3D data and have contributed to many tasks. However, some latent features may be neglected due to the limitations of these handcrafted methods.
Fortunately, deep learning based methods have powerful feature extraction capability relying on numerous data fittings. They have received great attention and have been widely utilized to extract high-dimensional features from order-less point clouds [29]-[38]. Generally speaking, there are three ways of extracting features from a point cloud. Firstly, some works [29] [30] [31] [32] [33] represent the input point cloud as a regular 3D grid or voxels, but this operation may cause complex pre-processing and high computational cost [12]. Secondly, motivated by CNNs on 2D images, some works [34] [35] project 3D points into 2D images and use multiple views to analyze point clouds comprehensively. Finally, PointNet [36] and PointNet++ [37] make it possible to feed a raw point cloud into a network directly, but they are designed to handle small object classification and indoor scene segmentation. PointNet extracts point-wise learned features, while PointNet++ enriches them by grouping neighbor points for local information. However, it does not consider features from a global perspective.
Furthermore, PointNetVLAD [10] uses the NetVLAD [39] block to generate global descriptors. Recently, various modifications of PointNetVLAD emerge [11] [12]. Specifically, PCAN [11] introduces an attention mechanism into the NetVLAD block. However, these methods may ignore the prior information of input data. LPD-Net [12] considers using traditional features to enrich input data, and adds a graph-based neighborhood aggregation module to improve the feature extraction of the network. However, it does not consider the structure information such as density and normal in local regions.
Overall, the existing methods have two main disadvantages: they consider neither the prior structure nor the latent manifold structure of the data. Therefore, in this paper we propose a traditional feature fusion module for prior structure extraction and a Weighted Aggregation with Density Estimation (WADE) module for iteratively extracting the latent structure. Our contributions include the following three aspects: • We fuse point coordinates with handcrafted features and neural-network-learned features to enrich the input information of the deep network.
• We provide an iterative WADE module for local structure encoding. Specifically, the WADE introduces a weighted density into local points relative relationships.
• We conduct experiments on two benchmark datasets, Oxford Robotcar [40] and KITTI [41], to demonstrate the superiority of WADE-Net over other state-of-the-art methods. Our approach exceeds most of the comparison methods in recall rate by at least 10% at TOP 1. The rest of this paper is organised as follows. In Section 2, we introduce the two most related methods, on whose framework the proposed method is based. In Section 3, our WADE-Net approach is explained in detail. In Section 4, we report the comparison results and some ablation experiments. In Section 5, we draw the conclusions.

Related Works
Traditional feature based methods. Usually, handcrafted feature extraction is designed according to human prior knowledge. There are several works on traditional point cloud feature extraction. Spin Image (SI) [27] projects the 3D points within a cylinder onto a 2D spin image. The 3D shape context [17], Point Feature Histogram (PFH) [20] and Signature of Histograms of OrienTations (SHOT) [19] leverage geometric attributes to obtain local features. Fast Point Feature Histogram (FPFH) [21] and 3D Scale-Invariant Feature Transform (SIFT) [18] extract local point cloud features at lower computational cost. Subsequently, Ensemble of Shape Functions (ESF) [22], Normal Aligned Radial Feature (NARF) [23] and Viewpoint Feature Histogram (VFH) [24] are proposed to generate a global descriptor for point cloud representation. Multiple 2D Planes (M2DP) [28] projects the 3D point cloud onto multiple 2D planes and finally generates a descriptor vector. Kim et al. propose Scan-Context [5] to separate the whole point cloud into many bins by radius and azimuth, obtaining a global feature map. Yan et al. [42] propose a sparse semantic map building method and utilize the semantic map to generate special texture features for scene recognition. LiDAR-Iris [25] generates a global descriptor based on a binary signature image obtained from the point cloud. DELIGHT [26] leverages LiDAR intensity information and encodes it into a representative descriptor. In conclusion, the traditional feature based methods have contributed much to point cloud recognition, but few works fuse them into a learning framework.
Furthermore, PointNetVLAD [10] proposes a new point cloud place recognition method via a global descriptor module. Recently, LPD-Net [12] and PCAN [11] have improved PointNetVLAD to recognize places efficiently. However, PCAN may ignore the prior information of the input data, which leads to high cost in its attention module. SeqLPD [4] and LPD-AE [47] utilize LPD-Net as a place recognition module to implement environment construction. Moreover, [15] projects the input point cloud into cylindrical coordinates and converts the 3D point cloud into a 2D image for place recognition. MinkLoc3D [48] uses a 3D feature pyramid network [49] to extract local features, and then introduces Generalized-Mean (GeM) pooling [50] for global descriptor generation. Locus [51] fuses segmentation, topological and temporal information for point cloud representation. In this paper, we try to enhance the important structure information, including density and spatial relationships.

Methodology
To fuse and aggregate meaningful structure and features from the point cloud, our network framework is composed of three modules: the prior feature fusion module (green-dashed block), the iterative WADE module (yellow-dashed block), and the global descriptor generation module (red-dashed block), as shown in Figure 2. The network maps the input raw point cloud into a high-dimensional feature space for place representation.

Traditional Feature Fusion Module
In this part, we fuse the point coordinates with handcrafted features and learned features for prior information enhancement (green-dashed block in Figure 2). The extracted handcrafted features, including the range value, the density feature and the normal description, are shown in Figure 3.
The range value has the capability to record the relative distance between the target point and its neighbors, and the normal description is obtained via the cross product × of neighboring vectors, where C_j denotes the neighboring point.
We concatenate these three handcrafted features to obtain the local prior features of size N × 5 in this module, which is different from the local feature extraction block of LPD-Net [12]. Cross-comparison experiments of the two handcrafted feature sets are shown in Section 4.
Simultaneously, we use a two-layer MultiLayer Perceptron (MLP) [52] to extract learned point-wise features. After concatenating the point coordinates with the traditional features and the learned features, we obtain the high-dimensional fused features. This feature fusion block makes good use of both latent and structural features. However, due to the non-uniform distribution of a point cloud, the significance of the local structure may differ from point to point. We therefore need adaptive sampling and weighting during feature integration.
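As a concrete illustration, the fusion step can be sketched in NumPy as follows. The function name, layer widths and plain ReLU activations are our assumptions; the paper only specifies a two-layer MLP and the concatenation of coordinates, handcrafted features and learned features.

```python
import numpy as np

def fuse_features(points, handcrafted, mlp_weights):
    """Concatenate raw coordinates (N, 3), handcrafted features (N, 5),
    and learned point-wise features from a tiny two-layer MLP."""
    h = np.maximum(points @ mlp_weights[0], 0.0)   # layer 1 + ReLU
    learned = np.maximum(h @ mlp_weights[1], 0.0)  # layer 2 + ReLU
    return np.concatenate([points, handcrafted, learned], axis=1)

rng = np.random.default_rng(0)
pts = rng.normal(size=(1024, 3))        # raw point coordinates
hand = rng.normal(size=(1024, 5))       # range, density, normal (N x 5)
w = [rng.normal(size=(3, 16)), rng.normal(size=(16, 32))]
fused = fuse_features(pts, hand, w)     # fused.shape == (1024, 40)
```

The output width is simply 3 + 5 + 32; in the real network the learned branch would of course be trained rather than randomly initialized.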

WADE Module
In the iterative WADE module, we further consider the weighted density distribution adaptively for feature extraction and aggregation. As shown in Figure 4, one WADE module consists of a Sampling and Grouping (SG) operation followed by the feature encoding steps described below.
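A common realization of such a sampling-and-grouping step, as popularized by PointNet++, is farthest point sampling followed by a ball query. The minimal NumPy sketch below illustrates that idea; the paper's exact SG implementation may differ in detail.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Greedily pick m point indices that maximise mutual distance."""
    n = points.shape[0]
    chosen = [0]
    dist = np.full(n, np.inf)
    for _ in range(m - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(np.argmax(dist)))
    return np.array(chosen)

def ball_query(points, centers, radius, k):
    """For each center, gather up to k neighbour indices within radius,
    padding with the last found neighbour when fewer than k exist."""
    groups = []
    for c in centers:
        idx = np.where(np.linalg.norm(points - c, axis=1) <= radius)[0]
        idx = idx[:k] if len(idx) >= k else np.pad(idx, (0, k - len(idx)), mode="edge")
        groups.append(idx)
    return np.array(groups)   # shape (M, k)
```

The sampled centers define P_G, and the grouped indices select the corresponding grouped features F_G from the fused feature map.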
To aggregate features concerning density and relative spatial information, the grouped point set P_G and its corresponding grouped features F_G are fed into the following three branches: the D-Branch, the W-Branch and feature aggregation, as shown in Figure 4.
D-Branch. This branch generates a density factor, since the local density represents the important structure of the distribution and is proportional to the significance of the sampled point. As shown in Figure 6, for each sampled point the Gaussian kernel function is used to estimate the density from the raw point cloud P:

$$\rho(p_i) = \frac{1}{K} \sum_{p_j \in \mathcal{N}_K(p_i)} \exp\!\left(-\frac{\|p_i - p_j\|^2}{2h^2}\right), \qquad (2)$$

where h is the bandwidth of the Gaussian kernel and N_K(p_i) denotes the K neighbor points of p_i in P. W-Branch. Considering that the relative spatial relationships can reflect the contribution of one point to the surrounding structure, we learn a position relation of the grouped points, combined with the grouped features through the point-wise product ⊙ within each group. The details are depicted in Figure 6.
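As a hedged illustration, a Gaussian-kernel density factor of this kind (bandwidth h, K neighbors, consistent with the parameter settings reported in Section 4) could be computed as follows; the exact normalization used by the paper is an assumption.

```python
import numpy as np

def gaussian_density(points, samples, h, k):
    """Kernel-density factor for each sampled point: the average Gaussian
    weight over its k nearest neighbours in the raw cloud (bandwidth h)."""
    dens = []
    for s in samples:
        d2 = np.sum((points - s) ** 2, axis=1)  # squared distances to all points
        nn = np.sort(d2)[:k]                     # k nearest squared distances
        dens.append(np.mean(np.exp(-nn / (2 * h * h))))
    return np.array(dens)
```

Points sitting in dense regions receive a factor close to 1, while isolated points receive a factor close to 1/k (only their own zero distance contributes), which matches the intuition that density is proportional to significance.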
At the beginning of feature encoding, the grouped features F_G are fed into a shared MLP to obtain point-wise features, as shown in Figure 4.
Then, we conduct the feature aggregation via a matrix multiplication between the weighted density ratio R(P_G) and the grouped features F_G, i.e., F_A = R(P_G) ⊗ F_G, where ⊗ denotes matrix multiplication. Furthermore, the output of the feature encoding, MLP(F_G), is generated by one MLP with C_out output channels.
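A minimal sketch of this aggregation step is given below. The max-pooling over each neighbourhood is our assumption of a conventional reduction; the paper only specifies the multiplication between the weighted density ratio and the grouped features.

```python
import numpy as np

def weighted_aggregate(ratio, grouped_feats):
    """Scale each group's features by its weighted-density ratio R(P_G)
    and pool over the neighbourhood dimension (max-pool, a common choice).

    ratio:         (M, K, 1) per-neighbour weights
    grouped_feats: (M, K, C) grouped point features F_G
    returns:       (M, C)    one aggregated feature per sampled point
    """
    weighted = ratio * grouped_feats   # point-wise weighting within each group
    return weighted.max(axis=1)        # collapse the K neighbours
```

In the real network this output would then pass through the MLP with C_out channels mentioned above.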
To obtain an efficient, structure-constrained local feature aggregation, we conduct the aforementioned WADE module iteratively, and finally obtain the output of the iterative WADE module.

Global Descriptor and Metric Learning
Applying the NetVLAD block [39], we aggregate the local features into a discriminative global descriptor for each point cloud. The NetVLAD block learns K_c cluster centers {c_k} and aggregates the residuals between the local features and these centers. Therefore, a more discriminative metric constraint pushes similar descriptors closer, and away from dissimilar ones.
Generally, a set of triplet tuples is obtained from the training dataset with supervised position information (GPS). We introduce a traditional triplet constraint [54] [55] in an intuitive way:

$$\mathcal{L} = \max\!\left(\delta_{pos}^{max} - \delta_{neg}^{min} + \alpha,\; 0\right), \qquad (8)$$

where $\delta_{pos}^{max}$ is the maximum in $\delta_{pos}$, $\delta_{neg}^{min}$ is the minimum in $\delta_{neg}$, $\delta_{pos}$ is the Euclidean distance between the descriptor of the query (the current descriptor) and that of a similar place, and $\delta_{neg}$ is the distance between the current descriptor and a dissimilar one. This loss learns a more discriminative and robust mapping in order to optimize the network parameters.
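The hardest-pair constraint described here can be sketched as follows; reducing over the farthest positive and the closest negative is an assumption consistent with the text, and the function name is illustrative.

```python
import numpy as np

def hardest_triplet_loss(query, positives, negatives, alpha=0.5):
    """Hinge loss on the hardest pair: the farthest positive descriptor
    versus the closest negative descriptor, with margin alpha."""
    d_pos = np.linalg.norm(positives - query, axis=1).max()  # most dissimilar positive
    d_neg = np.linalg.norm(negatives - query, axis=1).min()  # most similar negative
    return max(d_pos - d_neg + alpha, 0.0)
```

When all positives are already much closer than all negatives, the hinge clamps the loss to zero and the tuple contributes no gradient.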
The cluster center number of VLAD, K_c, is set to 32, and the dimension of the output global descriptor, D, is set to 512. All comparison experiments are run on the same machine, and the input point cloud size is uniformly set to 1024 points, so fairness is guaranteed.

Benchmark Datasets
The comparison experiments are conducted on two public outdoor large-scale datasets, the Oxford Robotcar dataset [40] and the KITTI dataset [41]. The processing of these two datasets is described as follows and shown in Figure 8.
Oxford Robotcar dataset [40]: KITTI dataset [41]: It captures real-world traffic situations, ranging from freeways over rural areas to urban scenes with many static and dynamic objects. We choose the 11 scenes named KITTI 00 to KITTI 10 for training and testing, since they supply accurate odometry ground truth. For each scene, we use the reduplicative frames of places that are passed more than twice as testing samples, and the other frames for training. The limits for positive pairs and negative pairs are set to 5 m and 50 m, respectively. At the evaluation stage, the relative distance for a correct match is 5 m, and we choose the 4 scenarios primarily used by researchers for evaluation. The ground points are removed using the method in [57].
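The 5 m / 50 m pair construction described above can be sketched as follows; the function name and the use of 2D GPS coordinates are illustrative assumptions.

```python
import numpy as np

def split_pairs(query_xy, frames_xy, pos_thresh=5.0, neg_thresh=50.0):
    """Indices of positive (< 5 m) and negative (> 50 m) frames for a
    query position; frames in between are ignored during training."""
    d = np.linalg.norm(frames_xy - query_xy, axis=1)
    return np.where(d < pos_thresh)[0], np.where(d > neg_thresh)[0]
```

Frames falling in the 5 m - 50 m band are neither positives nor negatives, which avoids training on ambiguous place labels.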

Evaluation Results
The evaluation results are given in Figure 9 and Table 1. TOP 1 (@1) means that the similar place of the current frame is recognized as the first candidate among all candidate places. TOP 1% (@1%) means that the correct area is retrieved within the top 1% of the frame number of the current scene. Figure 9 shows that the proposed approach performs better than the other networks on the different datasets. The evaluation curve generated by our approach is, on the whole, numerically higher than the comparison ones. In KITTI 06 and 07, the advantage of the proposed method cannot be fully reflected because of the simplicity of these scenes. Table 1 shows that our approach exceeds most of the comparison methods in recall rate by at least 10% at TOP 1 and TOP 1% on the Oxford dataset. Compared with LPD-Net, we achieve an increase of almost 2% - 3% in retrieval results at both TOP 1 and TOP 1%. In the comparison experiment on the KITTI dataset, our network performs much better than the best of the other comparison methods at TOP 1, which means it is more likely to recognize the passing place at once. What is more, at TOP 1%, our network improves by at least 1% - 2% over the best of the others. Considering that the TOP 1% candidate number is related to the frame number of the outdoor scene, there may be little difference in the results at TOP 1%.
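The two recall metrics can be computed as in the following sketch, where TOP 1% corresponds to setting top_n to 1% of the database size; the brute-force nearest-neighbour search is an illustrative simplification.

```python
import numpy as np

def recall_at(query_desc, db_desc, gt_matches, top_n):
    """Fraction of queries whose ground-truth match index appears among
    the top_n nearest database descriptors (Euclidean distance)."""
    hits = 0
    for q, gt in zip(query_desc, gt_matches):
        order = np.argsort(np.linalg.norm(db_desc - q, axis=1))
        hits += int(gt in order[:top_n])
    return hits / len(query_desc)
```

For example, recall_at(queries, database, gt, 1) gives TOP 1 (@1), and recall_at(queries, database, gt, max(1, len(database) // 100)) gives TOP 1% (@1%).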
Additionally, Table 1 shows that the first two traditional methods, VFH and ESF, cannot perform as well as the other, learning based approaches. Empirically, traditional methods rely on prior knowledge and may have little ability to view the surroundings comprehensively, especially in outdoor environments. Traditional place loop detection algorithms, e.g., Scan-Context and M2DP, also do not perform as well as some other methods. Analytically speaking, the point number of each point cloud affects the performance of these two methods.

Analysis and Discussion
Iteration number of WADE module. The proposed WADE module serves as a feature extraction layer, and WADE-Net iterates it to obtain multi-scale features. Because the point number changes, the parameters of the WADE module have different settings in each iteration. Table 2 shows the settings of the iterative WADE modules. Parameter N denotes the output point number of each iteration of the WADE module, r and h denote the radius of the ball query in the grouping step of the SG operation and the bandwidth of the Gaussian kernel function in Equation (2), respectively, and K is the number of neighbor points in Equation (2). As Table 2 shows, r, h and C_out increase gradually as the point number decreases in each iteration. Moreover, Table 3 illustrates that WADE-Net performs best when the iteration number is set to 3; as the iteration number increases further, the evaluation recall rates decrease.
Ablation results of different modules. The effectiveness of the different modules used in our network, i.e., the 3D point coordinates, the traditional features (TF) and the iterative WADE module, is described in Table 4. As we can see from rows 1 - 2, PNVLAD + WADE performs much better than the baseline method, with an average 9% increase in recall rate for most of the data. The remaining rows reflect that taking the traditional local features and point coordinates into consideration is reasonable.
Ablation results of different handcrafted features. Table 5 shows the effectiveness of the handcrafted feature extraction strategies used in LPD-Net and in the proposed method. If our traditional features are replaced by those of LPD-Net, the recognition results become worse than ours. Ablation studies on hyper-parameters. Table 6 shows the ablation experiments for the hyper-parameters K_c and D. The table illustrates that WADE-Net performs best when K_c is set to 32 and D is set to 512.
Moreover, Equation (8) represents the constraint condition used to discriminate the relationship of descriptors in positive pairs and negative pairs. To balance the distances of positive and negative pairs, the margin α is introduced, and its ablation results are depicted in Figure 10. It shows that a moderately better result is obtained when α = 0.5.
In this paper, a strict mechanism is adopted that focuses on the most dissimilar positive sample and the most similar negative sample, so the margin value should intuitively be decreased. The ablation experiment testifies to this idea.
Time and resource consumption. In Table 7, we list the average inference time and the computational resources of the deep learning based methods. The inference time covers the process in which the input point cloud is fed into the network and a global descriptor is generated. Parameters in Table 7 means the learned parameters w and b of the network framework. GFLOPs means billions of floating-point operations. The smaller the value, the more efficient the approach.
From Table 7, we can see that the parameters and GFLOPs of WADE-Net are smaller than those of the others. However, it does not perform as well in inference time, because the TF and feature fusion module is conducted online and runs on the CPU. Figure 11 shows the sampled points at each iteration stage; the first iterative stage (stage 1) has the same point number as the input point cloud. It illustrates that the sampling algorithm can preserve the scene structure, and the retained points can be considered significant or as carrying local relation information. Figure 12 gives the low-dimensional manifold visualization of the place descriptors along a road trajectory from the KITTI dataset. Each point of the sub-figure depicts a global descriptor, and different colors represent different places. Figure 12 illustrates that the proposed method generates more discriminative descriptors and retains a similar topological structure of the road trajectory. Figure 13 depicts the retrieved trajectory maps of the comparison approaches mentioned in Figure 12 on the Oxford data. Each trajectory point in the test region is colored, and the brightness of the color corresponds to the TOP N candidate number of the correct place. For each colored point, the darker the color, the better the recognition result. This visualization illustrates that our approach achieves a more accurate result than most of the other comparison methods. Compared with LPD-Net, we mark out one local region in LPD-Net and in ours; the enlarged regions in Figure 13 show that our approach outperforms LPD-Net.

Conclusion
In this paper, we have proposed a new point cloud representation framework via an iterative weighted density aggregation method. It enhances the prior input information through the traditional feature fusion module. Then the network extracts the important structure information, including density and spatial relationships, via the iterative WADE module. Finally, we compare our approach with some off-the-shelf methods on two public datasets covering different kinds of outdoor scenes. Experiments and visualization results show that our network is competitive and performs better than the others.