WADE-Net: Weighted Aggregation with Density Estimation for Point Cloud Place Recognition ()
1. Introduction
Large-scale place recognition plays a significant role in robotics and automatic driving, since it usually enhances localization and mapping optimization [1] - [6]. Vision data based large-scale place recognition has been investigated, and some successful solutions are presented in several surveys [7] [8] [9]. However, they are sensitive to season and illumination variations. Meanwhile, with the help of spatial-aware feature information, 3D points based methods are relatively robust to these changes [10] [11] [12] [13] [14]. In Figure 1, an outdoor scene shows that 3D point cloud could better describe the different spatial distributions of
Figure 1. Different point distributions in an outdoor scene. Three enlarged dashed circles are an advertising board, the rear of one car, and afforestation, respectively.
different local regions. As a consequence, place recognition from point cloud data is becoming an increasingly attractive research topic. The main challenge of point cloud recognition lies in how to extract effective features and generate a discriminative representation. Various feature extraction strategies based on handcrafted or deep learning methods for 3D point clouds emerge gradually [15] [16].
Traditionally, some 3D recognition methods pay attention to handcrafted local feature extraction, including normal orientation, curvature and distribution histogram [17] [18] [19] [20] [21]. Specifically, [19] [20] generate histograms based on geometric attribution to obtain local information. These methods, however, often consume much time and more computational resources. By contrast, some works [18] [21] try to improve the efficiency of local feature extraction methods. They can extract local point cloud features with lower computation cost. However, they do not perform well at sparse space locations. Moreover, these approaches only concentrate on some local views, but not from a global perspective.
Therefore, methods for extracting global descriptors are proposed gradually [22] [23] [24] [25] [26]. Ensemble of Shape Functions (ESF) [22], Normal Aligned Radial Feature (NARF) [23] and Viewpoint Feature Histogram (VFH) [24] are able to filter out some local locations with sparse distribution of point cloud. The conception of projecting 3D point cloud to 2D image is utilized [5] [27] [28]. He et al. propose a global descriptor, Multiple 2D Planes (M2DP) [28], for place recognition and loop detection. It projects the 3D point cloud into multiple 2D planes and generates a descriptor vector for place representation. Following the M2DP, Kim et al. propose Scan-Context [5] to handle place recognition on the basis of 3D point database. Scan-Context separates the whole point cloud into many bins by radius and azimuth, and defines the maximum height of points in each bin as the feature value. In general, these traditional works, based on prior knowledge, obtain the handcrafted spatial feature of 3D data and have made contributions on many tasks. However, some latent features may be neglected due to the disadvantage of these handcrafted methods.
Fortunately, deep learning based methods have powerful feature extraction capability relying on numerous data fittings. They have received high attention and have been widely utilized to extracting high-dimension features from order-less point clouds [29] - [38]. Generally speaking, there are three ways of feature extraction for point cloud. Firstly, some works [29] [30] [31] [32] [33] represent the input point cloud as regular 3D gridding or voxel, but this operation may cause complex pre-process and high computational cost [12]. Secondly, motivated by CNN in 2D images, some works [34] [35] consider projecting 3D points into 2D images and using multi-view to analyze point clouds roundly. Finally, PointNet [36] and PointNet++ [37] make it possible to input raw point cloud into a network directly, but they are designed to handle small object classification and indoor scene segmentation. PointNet extracts point-wise learning features, while PointNet++ enriches them by grouping neighbor points for local information. However, it does not consider features in global perspective.
Furthermore, PointNetVLAD [10] uses the NetVLAD [39] block to generate global descriptors. Recently, various modifications of PointNetVLAD emerge [11] [12]. Specifically, PCAN [11] introduces an attention mechanism into the NetVLAD block. However, these methods may ignore the prior information of input data. LPD-Net [12] considers using traditional features to enrich input data, and adds a graph-based neighborhood aggregation module to improve the feature extraction of the network. However, it does not consider the structure information such as density and normal in local regions.
Overall, the existing methods have two main disadvantages: not considering both prior structure and latent manifold structure from data manifold. Therefore, in this paper we propose a traditional feature fusion module for prior structure extraction and a Weighted Aggregation with Density Estimation (WADE) module for iteratively extracting latent structure, respectively. Our contributions include the following three aspects:
· We fuse point coordinates with handcrafted features and neural network learned features to enrich the input information of deep network.
· We provide an iterative WADE module for local structure encoding. Specifically, the WADE introduces a weighted density into local points relative relationships.
· We conduct experiments on two benchmark datasets Oxford Robotcar [40] and KITTI [41] to demonstrate the superiority of WADE-Net over other state-of-the-art methods. Our approach exceeds most of the comparison methods on the recall rate at least 10% at TOP 1.
The rest of this paper is organised as follows. In Section 2, we introduce two mostly related methods, and the proposed method is based on their framework. In Section 3, our WADE-Net approach is explained in detail. In Section 4, we report the comparison results and some ablation experiments. In Section 5, we draw the conclusions.
2. Related Works
Traditional feature based methods. Usually, handcrafted traditional feature extraction is designed according to prior knowledge of human beings. There are several works for traditional point cloud feature extraction. Spin image (SI) [27] projects 3D points within a cylinder onto a 2D spin image. The 3D shape context [17], Point Feature Histogram (PFH) [20] and Signature Histogram of OrienTation (SHOT) [19] leverage geometric attribution to obtain local features. Fast Point Feature Histogram (FPFH) [21] and 3D Scale-Invariant Feature Transform (SIFT) [18] are proposed to extract local point cloud features with lower computation cost. Subsequently, Ensemble of Shape Functions (ESF) [22], Normal Aligned Radial Feature (NARF) [23] and Viewpoint Feature Histogram (VFH) [24] are proposed to generate a global descriptor for point cloud representation. Multiple 2D Planes (M2DP) [28] projects the 3D point cloud into multiple 2D planes and finally generates a descriptor vector. Kim et al. propose Scan-Context [5] to separate the whole point cloud into many bins by radius and azimuth, and a global feature map is obtained. Yan et al. [42] propose a sparse semantic map building method and utilize the semantic map to generate special texture features for scene recognition. LiDAR-Iris [25] generates a global descriptor based on a binary signature image obtained from the point cloud. DELIGHT [26] leverages LiDAR intensity information and encodes the information into a representative descriptor. In conclusion, the traditional feature based methods make many contributions to point cloud recognition, but few works fuse them into a learning framework.
Learning feature based methods. Several deep learning based point cloud feature extraction methods have been proposed in recent years. 3D ShapeNets [29], Vote3Deep [30], VoxelNet [31], 3D Generative Adversarial Network (3D-GAN) [32], and Volumetric CNN [33] transform point cloud inputs into regular 3D gridding or voxel representations, which may cause complex pre-processing operations and high computational cost [12]. Multi-View based Convolutional Neural Network (MVCNN) [34] and Group-View CNN (GVCNN) [35] project 3D points into 2D images and use multi-view to analyze point cloud roundly.
PointNet [36] and PointNet++ [37] have the ability to extract point-wise features from a raw point cloud. Inspired by them, networks such as PointCNN [43], Frustum Pointnet [44], SO-Net [45] and Splatnet [46] are proposed for point cloud feature extraction.
Furthermore, PointNetVLAD [10] proposes a new point cloud place recognition method via a global descriptor module. Recently, LPD-Net [12] and PCAN [11] improve PointNetVLAD to recognize places efficiently. However, PCAN may ignore the prior information of input data and it leads to high cost in the proposed attention module. SeqLPD [4] and LPD-AE [47] utilize LPD-Net as a place recognition module to implement environment construction. Moreover, [15] projects input point cloud into cylindrical coordinates and converts 3D point cloud to 2D image for place recognition. MinkLoc3D [48] uses a 3D feature pyramid network [49] to extract local features, and then it introduces Generalized-Mean (GeM) [50] pooling for global descriptor generation. Locus [51] considers fusing the segmentation, topological and temporal information for point cloud representation. In this paper, we try to enhance the important structure information, including density and spatial relationship.
3. Methodology
For fusing and aggregating meaningful structure and features from point cloud, our network framework is composed of three modules: the prior feature fusion
(green-dashed block), the iterative WADE
(yellow-dashed block), and the global descriptor generation module
(red-dashed block), as shown in Figure 2. The network maps the input raw point cloud
into a high-dimension feature space for place representation. For a certain place
, a descriptor
is generated by
(1)
3.1. Traditional Feature Fusion Module
In this part, we fuse the point coordinates with handcrafted features and the learned features for prior information enhancement (green-dashed block in Figure 2).
The extracted handcrafted features including range value, density feature and normal description are shown in Figure 3.
- Range value
has the capability to record the relative distance between the target point
and the original of coordinates.
- Density value can indicate some local distribution information of each point, and can be formulated by
, where
represents the
Figure 2. Network architecture of the proposed method.
Figure 3. Traditional handcrafted feature extraction. It is composed of range, density and normal vector.
set of neighbor points of the target point, and
is the Euclidean distance between neighborhood
and the target point B (the red point). The nearer neighbor points are, the larger density value is.
- Normal vector
is depicted in Figure 3(c), which can be approximated via
, where × is the cross product of vector, and
is the neighboring point.
We concatenate these three handcrafted features to get the local prior features with size
in module
, which is different from the local feature extraction block of LPD-Net [12]. The cross contrast experiments of the two handcrafted features are shown in Section 4.
Simultaneously, we use a two-layer MultiLayer Perceptron (MLP) [52] to extract learned point-wise features. After concatenating the point coordinate with traditional features and the learned features, we get the high dimension features
. This feature fusion block makes good use of both latent and structure features. However, due to the non-uniform distribution in a point cloud, the significance of the local structure of different points may be different. We need some adaptive sampling and weighting during feature integration.
3.2. WADE Module
In the iterative WADE module
, we further consider the weighted density distribution adaptively for feature extraction and aggregation. Figure 4, the following Sampling and Grouping (SG) operation and Feature Encoding steps describe the one WADE module.
1) Sampling and Grouping (SG) Operation. Assuming that the inputs of the WADE module are a point cloud
with size
and its corresponding features
with size
. Notice that the input features of the first WADE module is
.
By utilizing Farthest Point Sampling (FPS) [53] we get a sampling subset
with size
, as shown in Figure 5.
Then, for each sampled point
, the ball query is used to find its
neighbor points
from the input point set
. For observing their relative spatial locations and local geometric patterns, the x-y-z coordinates and features of all neighbouring points are grouped as
Figure 4. Flow chart of one WADE module.
means the number of sampled points.
represents the neighbor number for grouping.
and
are dimensions of input and output features.
Figure 5. Schematic diagram of Sampling and Grouping (SG).
is the sampled points set, and
is the grouped 3D point set.
and
respectively.
is the index set of neighbourhood point. So, the outputs of SG operation include the grouped 3D point set
with size
and its corresponding grouped features
with size
.
2) Feature Encoding.
To aggregate the feature concerning density and relative spatial information, the grouped point set
and its corresponding grouped features
are put into the following three branches: D-branch, W-Branch and feature aggregation, as shown in Figure 4.
D-Branch. This branch is about the generation of density factor, since it can represent the important structure of distribution and is proportional to the significance of the sampled point. In Figure 6, for each sampled point
, the Gaussian kernel function is used to calculate the density from the raw point cloud P, which is formulated by
(2)
Figure 6. Detailed description of D-Branch and W-Branch.
where h is the bandwidth of the kernel function, and
. Then we normalize the estimated density by
.
After the MLP encoding, we choose sigmoid as the nonlinear activation function of last layer to compute the density value
for each grouped point
. The reason for choosing sigmoid is that the generated density factor should be close to the binary choice mechanism. Moreover, the reason to use the nonlinear transform is for deciding adaptively whether to use the density value. Mathematically, The
is formulated as
where
is the concatenation operation. So, we obtain the density factor
with the size
for the grouped 3D point set
.
W-Branch. Considering that the relative spatial relationships can reflect contributions of one point to the surroundings structure, we learn a position relation of grouped points. In Figure 6, for every grouped points
, we transform neighbor grouped
points
into the local coordinate system of
, to get relative local coordinates
. Moreover, weights
are generated by MLP with Relu activation, which is the same as other MLPs of the proposed network. The reason for choosing Relu is that it helps stabilize the output of network, and it screens parameters of the network. Mathematically, for the
neighbor points
, we compute the
as
where
is the concatenation operation. Finally, we can obtain the relative spatial factor
with size
, where
is the concatenation operation for all
.
Feature Aggregation. We propose a weighted density based feature aggregation in this block, as shown in the black dashed box of Figure 4. Before feature aggregation, a weighted density ratio
is acquired by integrating the density factor
and spatial relation factor
via point-wise product
(3)
where
is the point-wise product in each group, and the size of
is
. The details are depicted in Figure 6.
In the beginning of feature encoding, the grouped feature points
is fed into the shared MLP to obtain the point-wise feature extraction
with size
, as shown in Figure 4.
Then, we conduct the feature aggregation via a matrix multiplication between the weighted density ratio
and features
as follows
(4)
where
is matrix multiplication. Furthermore, the output of feature encoding
is generated by one MLP with
output channels.
To get an efficient structure constrained local feature aggregation, we conduct the aforementioned WADE module iteratively. Finally, we obtain the output from the iterative WADE module
with size
.
3.3. Global Descriptor and Metric Learning
Applying NetVLAD block [39], we aggregate the local features into a discriminative global descriptor for each point cloud. The NetVLAD block will learn
cluster centers
and get their correlations
to feature points
by softmax operation. The representation of each cluster center
can be expressed as the following feature vector
(5)
where
and
are learned weights and biases that determine the contribution of the correlation between feature points and each cluster center
.
Then we do intra-normalization of each vector
firstly, and concatenate them, followed by
normalization to get the global feature vector
with size
. At last, a fully connected layer is utilized to reduce the dimension of global descriptor to 512, and then
normalization guarantees that the learned global descriptor
is unit length.
The learned global descriptor
represents one point cloud in the descriptor space. For recognition assignment, the similarity or dissimilarity of two point clouds should be considered, and the metric constraint is chosen. The metric constraint can balance the similarity between intra-class and inter-class via triplet constraint. The process and purpose of metric constraint can be explained in Figure 7. Each mark means the global descriptor of a specific place. Blue disks
Figure 7. A simple example showing the goal of metric constraint of the whole framework. Each mark means the global descriptor of a specific place. The blue disc shape belong to one place, and the green square and yellow triangle respectively represent another place.
are similar descriptors, and the green square and yellow triangle are both dissimilar with blue disks. Intuitively, a proper metric constraint should make blue disks closer and maintain a margin between blue disk and other categories. Therefore, a more discriminative metric constraint pushes similar descriptors closer, and away from dissimilar ones.
Generally, a set of triplet tuples from the training dataset is obtained with supervised position information (GPS). We introduce a traditional triplet constraint [54] [55] in an intuitive way, denoted:
(6)
where
is similar to
, and dissimilar to
.
denotes the distance between similar samples
and
, and
denotes the distance between dissimilar samples
and
. It means that the distance of samples representing different places should be as large as possible, while as small as possible in the same place. Traditionally, hinge loss function [56] mainly be used for the triplet constraint
(7)
where
is the hinge loss, which means
if
and
otherwise.
is the margin value to clear the edge between similar ones and dissimilar ones.
Moreover, due to the complexity of outdoor environment, we introduce Lazy Triplet Loss [10] to augment the relation of similarity and dissimilarity. Mathematically, the Lazy Triplet constraint is calculated as
(8)
where
means the maximum in
,
means the minimum in
,
is the Euclidean distance between descriptor of query (current descriptor) and that of similar place, and
means the distance between current and dissimilar one. This loss can learn a more discriminative and robust mapping in order to optimize parameters in the network.
4. Experiments
In this section, we evaluate the proposed approach with some traditional and deep learning based methods as follows: VFH [24], ESF [22], Scan-Context (SC) [5], M2DP [28], PNMAX, PNSTD [36], PNVLAD [10], PCAN [11] and LPD-Net [12]. The computer is equipped with an NVIDIA GTX 1080Ti GPU with 11GB memory, Xeon(R) CPU E5-2650-v4, 64 RAM. The code is implemented with Ubuntu 16.04 operating system with TensorFlow. For the super parameters of our network, we set batch size = 2,
,
, margin
and maximum iteration = 15, which follow the comparison methods. The cluster center numbers of VLAD, Kc is set to 32, and the dimension of the output global descriptor D is set to 512. All the comparison experiments is tested in the same machine, and the input point cloud number is set to 1024 uniformly, so the fairness is guaranteed.
4.1. Benchmark Datasets
The comparison experiment conducts on two public outdoor large-scale datasets Oxford Robotcar dataset [40] and KITTI dataset [41]. The processing of these two datasets is described as follows and showed in Figure 8.
Oxford Robotcar dataset [40]: It includes 21,711 3D point cloud submaps made up of point clouds within the car’s 20 m (meters) trajectory for training, and 3030 submaps for testing. The data were collected in different seasons, times and weathers. For evaluation details, each submap is tagged with a Universal Transverse Mercartor (UTM) coordinate. Point clouds are defined as positive pairs if they are at most 10 m apart and negative pairs if they are at least 50 m apart. In the evaluation process, the retrieved point cloud is regarded as a correct match if the distance is within 25 m between the retrieved point cloud and the query point cloud.
KITTI dataset [41]: It captures real-world traffic situations and ranges from freeways over rural areas to urban scenes with many static and dynamic objects. We choose 11 scenes named KITTI 00 to KITTI 10 for training and testing, since they supply accurate odometry ground truth information. For each scene, we utilize the reduplicative frames of places that are passed more than twice as testing samples, and other frames as training. Limitation of positive pairs and negative pairs are set to 5 m and 50 m respectively. On evaluation stage, the relative distance of correct match is 5 m and we choose 4 scenarios primarily used by researchers for evaluation. The ground points are removed using the method in [57].
4.2. Evaluation Results
The evaluation results are given in Figure 9 and Table 1. TOP 1 (@1) represents that the similar place of current frame is recognized the first time among candidate places. TOP 1% (@1%) means that the correct area is retrieved within 1% frame number of current scene.
Figure 9 shows that the proposed approach performs better than other networks in different datasets. The evaluation curve generated by our approach is, on the whole, numerically higher than these comparison ones. In KITTI 06 an 07, the advantage of the proposed method cannot be reflected fully because of simplicity of this scene.
Figure 8. Datasets visualization and pre-processing.
Figure 9. Comparison of the recall rate for different methods. (a) Oxford dataset. (b)-(f) different scenes in KITTI dataset.
Table 1. Comparison of retrieval recall rate (%) at TOP 1(@1) and TOP1% (@1%) among different networks.
Table 1 shows that our approach exceeds most of the comparison ones on the recall rate at least 10% at TOP 1 and TOP 1% on Oxford dataset. Compared with LPD-Net, we have almost 2% - 3% increase in retrieval results at both TOP 1 and TOP 1%. At the comparison experiment in KITTI dataset, our network performs much better than the best of the other comparison methods at TOP 1, which means that it is more possible to recognize the passing place all at once. What is more, at TOP 1%, our network has at least 1% - 2% increase to the best of others. Considering that the TOP 1% candidates number has relation with the frame number of the outdoor scene, there may be little difference in results at TOP 1%.
Additionally, Table 1 shows that the first two traditional methods, VFH and ESF, cannot perform as well as other learning based approaches. Empirically, traditional methods rely on prior knowledge and they may have little ability to view surroundings roundly, especially in outdoor environment. For traditional place loop detection algorithms, e.g., Scan-Context and M2DP, they do not perform well as some other methods. Analytically speaking, the point number of each point cloud affects the performance of these two methods.
4.3. Analysis and Discussion
Iteration number of WADE module. The proposed WADE module is considered as a feature extraction layer, and the proposed WADE-Net iterates it for obtaining multi-scale features. On account of point cloud number difference, parameters of WADE module have different settings in each iteration.
Table 2 shows the settings in the iterative WADE modules. Parameter N denotes the output point number of each iteration in WADE module, r and h denote the radius of ball query in grouping operation of SG operation and the bandwidth of Gaussian kernel function in Equation (2), respectively, and K is the number of neighbor points of
in Equation (2). As Table 2 shows, r, h and Cout increase gradually with the point number decreases during each iteration. It is natural that the query ball radius r should enlarge as the sampled number decrease. The change of h aims to keep balance in terms of the range of density kernel. The Cout is related to features from different scales as N decreases.
Moreover, Table 3 illustrates that the WADE-Net can perform the best when the iteration number is set to 3. Obviously, as the iteration number increase, the evaluation recall rates will decrease.
Ablation results of different modules. The effectiveness of different modules used in our network, i.e., 3D point coordinates, traditional features (TF) and iterative WADE module, is described in Table 4. As we can see from rows 1 - 2, PNVLAD + WADE performs much better than the baseline method, having average 9% recall rate increase for most of the data. The rest of rows reflects that taking traditional local feature and point coordinates into consideration is reasonable.
Ablation results of different handcrafted features. Table 5 shows the effectiveness of handcrafted feature extraction strategies used in LPD-Net and the proposed method. If the traditional features is replaced by those in LPD-Net, the recognition results are worse than ours.
Table 2. Parameters of iterative WADE module in the proposed network.
Table 3. The comparison results of the iteration number about WADE module.
Table 4. Ablation results of different modules by retrieval recall rate (%).
Table 5. Comparison on the effectiveness of handcrafted traditional features (TF) extraction part between LPD-net and our method.
Ablation studies about hyper-parameters.
Table 6 shows the ablation experiments for the hyper-parameters Kc and D. The table illustrates that if the Kc is set 32, and D is set to 512, the WADE-Net performs better.
Moreover, Equation (8) represents the constraint condition to discriminate the relationship of descriptors in positive pairs and negative pairs. In order to balance the distances of positive pairs and negative pairs,
is put forward and its ablation experiment results are depicted in Figure 10. It shows that it is able to get a moderately better result as
. In this paper, a strict mechanism is considered to focus on the most dissimilar positive sample and the most similar sample, so the margin value should be intuitively decreased. The ablation experiment testifies this idea.
Time and resources consumption. In Table 7, we list the average inference time and computational resources among the deep learning based methods. The process of inference time represents that the input point cloud in inputted into the network and a global descriptor is generated. Parameters in Table 7 means the learned parameters w and b in network framework. GFLOPs means 1 billion floating-point numbers. The smaller the result value is, the more efficient the approach is.
From Table 7, we can analysis that the parameters and GFLOPs of WADE-Net are smaller than the others. However, it does not perform well at the inference time because the TF and feature fusion module is conducted online and is based CPU.
4.4. Visualization Results
Figure 11 shows the sampled points of each iteration stage, and the first iterative stage (stage 1) has the same point number with the input point cloud. It illustrates that the sampling algorithm can keep the scene structure, and the maintained points can be considered significant or presenting local relation information.
Figure 12 gives the low dimension manifold visualization of place descriptors in the road trajectory from KITTI dataset. Each point of the sub-figure describes the global descriptor, and different colors represent different places. Figure 12 illustrates that the proposed method can generate more discriminative descriptor and retains similar topology structure of the road trajectory.
Table 6. Experiment recall results about cluster center number (Kc) and output dimension of global descriptor (D) in Feature Aggregation Module.
Table 7. The comparison results of inference time (ms) and computational resources. In the comparison methods, the deep learning based approaches are chosen.
Figure 10. The ablation experiment results in Oxford dataset for the margin value
of loss function.
Figure 11. Visualization of the extracted sampled points of each iteration in WADE module in one scene. (a) is for one scene in Oxford Dataset. The shown scene id is 2014-11-14-16-34-33 (b) is for one scene in KITTI Dataset. The frame id of shown scene is KITTI-02-900.
Figure 12. t-SNE visualization of descriptors generated by deep learning methods for KITTI dataset. Every color point in each subfigure represents the global descriptor of one frame. Row 1-4 means that visualization of different methods is conducted on KITTI 00, 02, 05, 06 respectively. Column 1 is the trajectory of each scene, which is colored by the accurate position of frames. Columns 2-7 represent different approaches including ours.
Figure 13. Retrieved trajectory map of comparison methods on Oxford dataset. Each point of the retrieved trajectory is colored. The darker the color is, the better the recognition result is.
Figure 13 depicts the retrieved trajectory map of comparison approaches mentioned in Figure 12 in Oxford data. Each trajectory point in test region is colored and the brightness of the color corresponds to TOP N candidates number of correct place. For each colored point, the darker the color is, the better the recognition result is. This visualization illustrates that our approach gets a more accurate result than most of other comparison methods. When compared with LPD-Net, we mark out one local region in LPD-Net and ours. The enlarged regions from Figure 13 show that our approach outperforms LPD-Net.
5. Conclusion
In this paper, we have proposed new point cloud representation framework via an iterative weighted density aggregation method. It enhances the input prior information for traditional feature fusion module. Then the network extracts the important structure information, including density and spacial relationship, via iterative WADE module. At last, we compare our approach with some off-the-shelf methods on two public datasets with different kinds of outdoor scenes. Experiments and visualization results show that our network has competitiveness and performs better than the others.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 11771276, 12026416 and 11871327, and in part by Shanghai Science and Technology Innovation Action Plan 18441909000. Moreover, this work is supported by shanghai science and technology innovation action plan (18441909000) and the capacity construction project of local universities in Shanghai (No. 18010500600).
NOTES
*Corresponding author.