Point Reg Net: Invariant Features for Point Cloud Registration Using in Image-Guided Radiation Therapy ()
1. Introduction
1.1. Overview
A point cloud is a set of unorganized irregular 3D points in a unified coordinate system, capturing 3D spatial information of an object or scenery. 3D point cloud registration refers to a research problem in which two isomorphic point cloud sets are known by coordinates respectively, and an optimal homomorphous transformation between them is remained to be solved. It is a fundamental problem in 3D computer vision. Many applications have been developed including 3D modeling [1] [2], object recognition [3], surface alignment [4] [5] and antonymous driving [6].
There is no overwhelming solution for this challenging task due to unknown initial positions, noisy raw data, varying point densities, partial loss or overlap, and more unknown difficulties. Generally, current point cloud registration methods can be classified into two categories: iteration-based [7] and feature-based [8] [9] [10]. Iteration-based methods inevitably carry heavy computational load which is unwanted in the scenery of real-time surgery and image-guided radiation therapy.
In this paper, we focus on the feature-based method in which registration is achieved by extracting features, matching features and generating point-to-point correspondences, which is independent of initial proximity information between coordinate systems. For feature-based methods, local feature descriptors play a core role in feature matching [11]. In general, a good feature descriptor should be highly descriptive, unambiguous, compact and computationally efficient to enable a good matching performance. And the descriptors should also be robust to common noises and adaptive to various modalities, which is a challenging task due to the varieties of nuisances and the point cloud’s irregular format.
At present, numerous local feature descriptors have been demonstrated [9] [12] [13] [14]. All of these feature descriptors either make use of local geometric statistics (e.g., normal, curvature), or spatial distribution of the neighboring points. For almost planar or spherical surfaces, they either suffer from low descriptiveness, or sensitiveness for surface noise and single outliers, as a result of lack of global topology information. In addition, descriptors derived from mathematical formula doubtfully make fully use of common statistical characteristics of a given dataset.
In this context, inspired by [15], we conduct a novel type of deep learning network that extracts feature descriptors aiming at dataset of a specific category, which makes it possible to take advantage of characteristics of medical data and its unique statistics distribution. Softmax layer and Max Pooling layer are innovatively modified to be Weighted Extraction Layer (WEL), and are applied to point-wise information. Local features are weighted, synthesized and aggregated into global descriptors as well as coordinates of control points, which are mean coordinates across local subsets of points. Utilizing control points instead of key-points allows the available geometrical data to be better exploited [16], and improves the robustness of registration. The global descriptors and coordinates of control points are internally promising sorted, omitting the time-consuming and error-prone feature matching procedure.
The contributions of this work are as follows:
1) We improve the architecture of voxel feature encoding (VFE) layer based on Voxel Net [6], which is suitable for consuming unordered point sets in 3D;
2) We propose Weighted Extraction Layer (WEL) which selectively synthesizes local features into global descriptors and coordinates of control points;
3) We show how CPN can be trained to perform 3D registration without point matching procedure;
1.2. Related Works
Registration of pre- and intra-interventional data is one of the enabling technologies for image-guided radiation therapy, radio surgery, and interventional radiology. Different combinations of modalities and dimensionalities, either 3D/2D or 3D/3D, have been studied [17] [18]. However, seldom has application adopted point cloud as format of input data. [19] presents an 2-D/3-D registration based on CNN regression using for image-guided interventions. [16] explores the use of 3D point cloud sensors in medical augmented reality applications.
On the other side, many studies on general point cloud registration have been conducted, e.g. spin images [12], fast point feature histograms (FPFH) [9], signature of histograms of orientations (SHOT) [13], and rotational projection statistics (RoPS) [14]. [20] proposes local feature statistics histogram (LFSH)feature descriptor, and gives a performance comparison between these features.
Recent days, point cloud processing based on deep learning has become research focus. [21] provides a classic unified architecture for unordered point cloud classification, segmentation, and semantic parsing. [6] unifies feature extraction and bounding box prediction into a single end-to-end trainable deep network. [15] represents latest progress of deep-learning-based algorithm for point cloud registration. It encodes local 3D geometric structures into super-points using unsupervised auto-encoder. However, a matching procedure and a fine-tuning stage must be performed before transformation estimation.
2. Algorithm Principle
The CPN takes as input two point clouds which are similar in shape by different in local coordinates, extracting local features per voxel, then synthesizes them into coordinates of control points. One group of control points paired with one point cloud, representing its position and pose. The global descriptors and coordinates of control points are internally promising sorted, therefore transformation estimation with RANSAC can be immediately applied without feature matching procedure. The number of control points M is fixed regardless of point cloud size. Currently we set
.
2.1. Control Point Net Architecture
The proposed CPN consists the following layers or functional blocks:
1) Data Grouping, in which the 3D space is equally subdivided into cubic voxels, points are grouped according to the voxel they reside in;
2) Random Sampling, in which fixed number inside each voxel are randomly selected;
3) Voxel Feature Encoding (VFE) Layers, in which point-wise features and locally aggregated feature are combined to learn descriptive shape;
4) Multiple Fully Connected Layers, in which each point-wise feature are aggregated within a middle layer, then packed up into a fixed-length feature vector;
5) Weighted Extraction Layer, in which feature vectors in a same voxel are first normalized by softmax arithmetic along elements, then synthesized to be the feature vector of this voxel. Same operations apply to coordinate. Finally, features and coordinates extracted from all voxels are maxpooled along elements and the global descriptors and super-point coordinates are generated. The architecture of CPN is illustrated in Figure 1.
2.2. Data Grouping and VFE
These blocks are designed based on operations in [6], aiming to extract and learn descriptive shape information from local point distribution. Points are subdivided and grouped along Z, Y, X respectively into voxels. Points with highly variable density throughout different voxels are randomly sampled to keep a fixed number T.
As shown in Figure 2, the architecture of VFE Layer consists of a point-wise shared fully connected layer, an element-wise maxpooling layer, and a doubled point-wise concatenated feature vector as output.
The inputs of VFE-1 is
where the centroid of points in each voxel is denoted as
.
The inputs of VFE-2 are the outputs of VFE-1.
2.3. Multiple Fully Connected Layers
We design a 3-layer fully connected network, with Batch Normalization and ReLU applied sequentially. All layers are operated on point-wise feature generated from VFE-2 layer. Layers share same weights for all points.
2.4. Weighted Extraction Layer
In order to weight and synthesize point-wise information, softmax arithmetic on point-wise features and coordinates in same voxel is first applied.
Let
be the inputs of softmax block, where T is the number of points in a voxel, K is the number of non-empty voxels.
is the feature vector generated from Fully Connected Layers, where M is output dimension.
Let
be the coordinates of points in point cloud.
(1)
(2)
are the voxel-wise softmax feature vectors and coordinate vectors, where the definition of softmax arithmetic is
(3)
Then feature vectors and coordinate vectors are extracted from all voxels are maxpooled into global descriptors and control points
(4)
(5)
where
in (5) is paired with k in (4) (Figure 3).
2.5. Loss Function
CPN is designed to extract features for registration. In order to achieve minimum registration error, the coordinates of control points should be invariant under the point cloud coordinate system. Thus for every sample in dataset, we define loss function as
(5)
where
is rigid transformation matrix that projects corresponding points from one point cloud onto another.
are coordinates of control points generated from CPN. The L1-Norm is preferred to utilize robustness upon existence of outlier points.
3. Experimental
3.1. Dataset
We’re mostly interested in the application on medical scenery such as real-time surgery or image-guided radiation therapy. Thus we establish a medical point cloud dataset using for both training and evaluating. The raw data are download from The Cancer Imaging Archive (TCIA) [22]. Collections of CT scans for lung cancer (National Lung Screening Trial, LCTSC, NSCLC-Radiomics) are adopted. All point clouds are extracted from CT images by edge detection along Z-axis positive direction. The intensity threshold is set to −250, which effectively detects the surface of skin. On average, 40,000 points can be extracted from a series of CT images and aggregated into a point cloud sample. Samples that have too few points are dropped. Samples that have too many points are down-sampled. All samples are cropped into a same 3D size, 40 cm × 40 cm × 40 cm, and the mass centers are translated into origin of coordinates. In total, we prepared 1000 samples for training and 100 for evaluating.
Labels are homograph transforms
whose rotation part is generated as a uniform distribution in
, and translation part is a uniform distribution between (−5 cm, 5 cm). Every training sample is composed of two identical point clouds, and can be transformed from one to another by homograph
(Figure 4).
With only 1000 training sample, training our network from scratch will undoubtedly suffer from overfitting. We introduce three kinds of data augmentation techniques to relieve this tendency, including jitter on points, scaling on points, and jitter on transform
. The augmentations are all done on-the-fly by GPU without being stored on disk.
3.2. Experimental Setup
All experiments are conducted under Intel Corel i7-5960X, 32.0GB RAM, NVIDIA GeForce GTX1080Ti, Windows 7 SP1. The development environment is Tensorflow 1.2.0, Python 3.6.0.
The voxel size is a tradeoff parameter between precision and speed. It’s also a
bound between local details and global coverage. Currently we adopt 10 × 10 × 10 as the size of grid.
3.3. Evaluation and Analysis
For quantitative analysis of the correspondences, we use percentage of success correspondences [22]. When the transformation error
is below a predefined threshold (we used 1 cm), the registration result is marked “success”. The smaller the RMSE, the better the two point clouds are aligned. The output of CPN can be used to calculate success correspondences directly. In contrast, for classic feature extractors, a matching algorithm should be first applied to obtain feasible correspondences between
and
. We evaluated the performance on testing dataset with Gaussian noise (standard deviation = 1 mm). The comparison results are shown in Table 1. Parameters for classic feature descriptors are set as Table 2.
Table 1 suggests that CPN outperforms all classic descriptors in the percentage of success correspondences (71.4%) and achieves a promising matching result.
The time efficiency comparison is illustrated in Table 3. All operations in CPN except the data grouping and sampling blocks are regular and dense, enabling the acceleration by GPU in a parallel manner. One can see that our CPN is superior to other descriptors in time efficiency. FPFH and SHOT descriptors are faster than the PFH descriptor by approximately an order of magnitude. In addition, the time for RANSAC matching should not be neglected.
To verify the effectiveness and robustness of the registration algorithm under nuisances, we apply modifications on test data including: randomly moving 5% - 30% of the points a uniformly distributed distance of up to 5 mm; randomly removing 5% - 30% of the points to test sensitivity to the density change.
Figure 5(a) and Figure 5(b) summarize the results. All descriptors can keep a relative high percentage of success correspondences under low-level noise or tiny change of density. The CPN always performs better compared with classic descriptors. However, the performance of CPN drops quickly when more points are removed. Even in challenging case with 30% points removed, our network is still workable and gives an acceptable percentage of success.
(a)(a)
Figure 5. (a) Success percentage under moving; (b) Success percentage under removing.
Table 1. Performance comparison (Averaged).
a. NC denotes the number of correspondences. For CPN, all super-point coordinates are regarded as correspondences; b. PSC represents the percentage of success correspondences.
Table 3. Time efficiency (Averaged).
a. CPN is accelerated by GPU. Classic features and RANSAC matching run on single CPU.
4. Conclusions
This paper presented CPN, an innovative network that extracts feature descriptors and control points for medical point cloud registration. The CPN deals with the challenges of extracting robust and unambiguous point cloud features and achieve great improvement over state-of-art algorithms.
A possible explanation why CPN outperforms classic feature descriptors is that the training stage makes it possible for network to utilize a certain assumption on input data statistical characteristics and distribution.
A possible explanation why CPN is much faster than classic feature descriptors is that the architecture of CPN is easily accelerated by parallel computational architecture. While it includes an offline training stage, the online stages can be implemented efficiently in parallel, making it suitable for serving real-time applications
In future work, we intend to adapt this approach for scalable size of voxel. Another interesting direction is to design a multi-scale version of CPN, similarly as Point Net++ [23].
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61601012).