Monocular visual odometry (VO) is the process of determining a user’s trajectory through a series of consecutive images taken by a single camera. A major problem that affects the accuracy of monocular VO, however, is scale ambiguity. This research proposes an innovative augmentation technique that resolves the scale ambiguity problem of monocular visual odometry. The proposed technique augments the camera images with range measurements taken by an ultra-low-cost laser device known as the Spike. The Spike laser rangefinder is small enough to be mounted on a smartphone. Two datasets were collected along precisely surveyed tracks, one outdoor and one indoor, to assess the effectiveness of the proposed technique. The coordinates of both tracks were determined using a total station to serve as ground truth. To calibrate the smartphone’s camera, seven images of a checkerboard were taken from different positions and angles and then processed using a MATLAB-based camera calibration toolbox. Subsequently, the speeded-up robust features (SURF) method was used for image feature detection and matching. The random sample consensus (RANSAC) algorithm was then used to remove outliers from the matched points between sequential images. The relative orientation and translation between the frames were computed and then scaled using the Spike measurements to obtain the scaled trajectory. The scaled trajectory was subsequently used to reconstruct the surrounding scene using the structure from motion (SfM) technique. Finally, both the computed camera trajectory and the reconstructed scene were compared with the ground truth. It is shown that the proposed technique achieves centimeter-level accuracy in monocular VO scale recovery, which in turn leads to enhanced mapping accuracy.

Visual odometry (VO) is the process of estimating camera poses from a series of successive images [

The scale ambiguity can be resolved by introducing additional information, including known initial conditions, additional constraints, and the addition of other sensors. Klein and Murray [

Additional constraints, such as known camera height above the ground, have also been proposed to resolve scale ambiguity. Kitt et al. [

The addition of other sensors was considered by some researchers to resolve the VO scale ambiguity, either through direct or indirect measurement. As an example, Scaramuzza et al. [

Some recent studies have suggested scale estimation methods, but these were dedicated to pedestrian applications. For example, [

This paper introduces a novel scale recovery approach using the Spike rangefinder measurements. Our approach estimates the translation scale from the distances to the scene measured by the Spike at the sequential images, which results in an accurate VO solution. Through such an ultra-low-cost sensor, our visual odometry approach can recover the scale with centimeter-level accuracy, which makes it attractive for a number of applications such as pedestrian navigation and augmented reality. This paper is structured as follows. Section 2 provides background information about the Spike device used in this work. Section 3 introduces the VO method used in this paper. In Section 4, the data acquisition is presented. The obtained results and some discussion are presented in Section 5. Some concluding remarks are presented in Section 6.

The Spike is a small, low-cost laser-based rangefinder device. It is typically attached to a smartphone to measure the distance to an object and then localizes it by making use of the smartphone’s photo (

The Spike laser rangefinder measures ranges from 2 to 200 m, with an accuracy of ±5 cm [

The captured measurements are stored in the Spike App and can be exported as a Spike file (XML format).

The VO technique used in this paper was carried out using the MATLAB computer vision toolbox. The workflow of the VO approach is presented in

In order to estimate the relative pose between sequential images, feature points are extracted through the Speeded-Up Robust Features (SURF) approach [
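The outlier rejection step mentioned in the abstract (RANSAC over the matched feature points) follows a generic hypothesize-and-verify loop. The numpy-only sketch below illustrates that loop on a deliberately simplified motion model, a pure 2-D translation between matched points, rather than the fundamental matrix fitted in the actual pipeline; the function name and parameters are illustrative, not part of the original implementation.

```python
import numpy as np

def ransac_translation(pts1, pts2, iters=200, tol=1.0, seed=None):
    """Illustrative RANSAC: estimate a 2-D translation between matched
    points and flag outliers. The real pipeline fits a fundamental
    matrix per sample instead, but the loop structure is identical."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(pts1), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(pts1))            # minimal sample: one match
        t = pts2[i] - pts1[i]                  # hypothesized model
        resid = np.linalg.norm(pts2 - (pts1 + t), axis=1)
        inliers = resid < tol                  # consensus check
        if inliers.sum() > best.sum():
            best = inliers
    # Refit the model on the largest consensus set.
    t = (pts2[best] - pts1[best]).mean(axis=0)
    return t, best
```

For the epipolar case, the minimal sample is eight correspondences and the residual is the (Sampson or algebraic) epipolar error instead of a Euclidean distance.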

$$ P_{i+1}^{T} \, F \, P_i = 0 \tag{1} $$

$$ \begin{bmatrix} x'_i & y'_i & 1 \end{bmatrix} \begin{pmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \end{pmatrix} \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix} = 0 \tag{2} $$

where P_{i} and P_{(i+1)} are homogeneous vectors containing the coordinates of a detected point in image frame (i) and of its correspondence in image frame (i+1), respectively, and F is the fundamental matrix. Since F is defined only up to a scale factor, the eight-point algorithm, which requires a minimum of eight point correspondences, can be used to estimate it. When n matched points are available, with n > 8, the fundamental matrix is computed by least squares. In this case, Equation (2) can be re-written as:

$$ \begin{bmatrix} x'_1 x_1 & x'_1 y_1 & x'_1 & y'_1 x_1 & y'_1 y_1 & y'_1 & x_1 & y_1 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x'_n x_n & x'_n y_n & x'_n & y'_n x_n & y'_n y_n & y'_n & x_n & y_n & 1 \end{bmatrix} \begin{bmatrix} f_{11} \\ f_{12} \\ f_{13} \\ f_{21} \\ f_{22} \\ f_{23} \\ f_{31} \\ f_{32} \\ f_{33} \end{bmatrix} = 0 \tag{3} $$
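The least-squares solution of Equation (3) is the right singular vector of the design matrix associated with its smallest singular value. The following is a minimal numpy sketch of the eight-point estimate (unnormalized for brevity; practical implementations first apply Hartley's coordinate normalization, which this sketch omits):

```python
import numpy as np

def eight_point(pts1, pts2):
    """Least-squares estimate of the fundamental matrix F from n >= 8
    matched points: each correspondence contributes one row of the
    design matrix acting on the stacked entries of F (Equation (3))."""
    x,  y  = pts1[:, 0], pts1[:, 1]        # frame i
    xp, yp = pts2[:, 0], pts2[:, 1]        # frame i+1 (primed)
    A = np.column_stack([xp*x, xp*y, xp,
                         yp*x, yp*y, yp,
                         x, y, np.ones_like(x)])
    # Null-space (least-squares) solution: right singular vector of the
    # smallest singular value of A, reshaped to 3x3.
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Project onto rank 2: a valid fundamental matrix is singular.
    U, s, Vt = np.linalg.svd(F)
    s[2] = 0.0
    return U @ np.diag(s) @ Vt
```

With noise-free correspondences, the estimated F satisfies the epipolar constraint of Equation (2) to machine precision.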

When the intrinsic camera parameters are known, the essential matrix (E) is used, which is related to the fundamental matrix through Equation (4) [

$$ E = K^{T} F K \tag{4} $$

where (K) is the calibration matrix of the camera system. The essential matrix has five degrees of freedom and can be decomposed using singular value decomposition to yield the relative rotation matrix (R) and the normalized translation (T) between the frames. The decomposition process is given in detail in [
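The SVD-based decomposition referenced above can be sketched as follows. This is the standard construction yielding four candidate poses, of which the physically valid one is selected in practice by checking that triangulated points lie in front of both cameras (the cheirality check); the function name is illustrative.

```python
import numpy as np

def decompose_essential(E):
    """Decompose an essential matrix into the four candidate (R, t)
    pairs, with the translation recovered only up to scale as a unit
    vector (hence the need for an external scale such as Spike)."""
    U, _, Vt = np.linalg.svd(E)
    # Force proper rotations (det = +1); E is defined only up to sign.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0., -1., 0.],
                  [1.,  0., 0.],
                  [0.,  0., 1.]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]          # unit baseline direction (sign ambiguous)
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```

Note that the returned translation is a unit vector: its metric length is exactly the quantity the Spike range difference supplies in the next section.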

The range measurements acquired by the Spike are used to compute the scale factor of the relative pose. When moving along a straight line, the baseline between two frames, S, equals the difference between the range measured at the previous frame (r_{p}) and that measured at the current frame (r_{c}), i.e.,

$$ S = r_p - r_c \tag{5} $$

The current scaled location and orientation of the system relative to the first frame can be obtained using Equations (6) and (7), where C is the current location, P is the previous location, R_{p} is the previous orientation, and R_{c} is the current orientation. The coordinates of each subsequent frame can then be computed relative to the previous frame.

$$ C = P + S \, R_p \, T \tag{6} $$

$$ R_c = R \, R_p \tag{7} $$

Two datasets were collected along precisely surveyed tracks in both outdoor and indoor environments, as shown in

The proposed approach was tested in both outdoor and indoor environments, and the data were processed as explained in Section 3. The outdoor images were processed using Pix4D mapper software [

were geolocated using the camera poses estimated from VO after correcting for the scale factor using the Spike measurements. The point cloud generated from this scenario will be referred to as the Spike-based point cloud in the sequel. In the second scenario, the images were geolocated using the iPhone GPS coordinates. The point cloud generated in this case will be referred to as the iPhone-based point cloud. The camera calibration parameters were also estimated through the MATLAB camera calibration tool, as shown in

The matching results between sequential frames before removing the outliers are shown in

images are shown in

The essential matrix can be computed using the matched points between the two frames. Then, the essential matrix is decomposed to obtain the normalized relative translation (T) and rotation matrix (R), which represents the rotation between the two frames. The following numerical example shows the mathematical steps to calculate the second frame pose relative to the first frame, assuming that the first frame coordinates are (0, 0, 0) and the first rotation matrix is the identity matrix.

The measured ranges for the first and second frames are:

$$ r_p = 13.12 \text{ m}, \qquad r_c = 11.10 \text{ m} $$

Consequently, from Equation (5), the baseline can be computed as follows:

$$ S = r_p - r_c = 2.02 \text{ m} $$

The normalized relative translation (T) and rotation matrix (R) obtained through the decomposition of the essential matrix are:

$$ T = \begin{bmatrix} -0.00307 \\ -0.05695 \\ 0.99837 \end{bmatrix}, \qquad R = \begin{bmatrix} 0.99998 & 0.00417 & 0.00267 \\ -0.00410 & 0.99998 & -0.00376 \\ -0.00268 & 0.00375 & 0.99998 \end{bmatrix} $$

Using Equations (6) and (7), the current location and orientation can be obtained as:

$$ C = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} + 2.02 \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} -0.00307 \\ -0.05695 \\ 0.99837 \end{bmatrix} = \begin{bmatrix} -0.0062 \\ -0.1150 \\ 2.0167 \end{bmatrix} $$

$$ R_c = \begin{bmatrix} 0.99998 & 0.00417 & 0.00267 \\ -0.00410 & 0.99998 & -0.00376 \\ -0.00268 & 0.00375 & 0.99998 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0.99998 & 0.00417 & 0.00267 \\ -0.00410 & 0.99998 & -0.00376 \\ -0.00268 & 0.00375 & 0.99998 \end{bmatrix} $$

By repeating the previous steps, the estimated camera poses relative to the first frame can be obtained.
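The worked example above can be reproduced with a few lines of numpy (the paper's processing used MATLAB; this is an equivalent sketch implementing Equations (5)-(7)):

```python
import numpy as np

# Spike ranges to the scene at the first and second frames (Section 5 example).
r_p, r_c = 13.12, 11.10
S = r_p - r_c                       # Equation (5): baseline = 2.02 m

# Normalized translation and relative rotation from the essential matrix.
T = np.array([-0.00307, -0.05695, 0.99837])
R = np.array([[ 0.99998, 0.00417,  0.00267],
              [-0.00410, 0.99998, -0.00376],
              [-0.00268, 0.00375,  0.99998]])

P = np.zeros(3)                     # first frame at the origin
R_p = np.eye(3)                     # first orientation = identity

C = P + S * (R_p @ T)               # Equation (6): scaled location
R_c = R @ R_p                       # Equation (7): current orientation

print(C)                            # ~ [-0.0062, -0.1150, 2.0167]
```

Since the first orientation is the identity, the chained orientation simply equals the relative rotation for this first step.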

| | X_RMSE (m) | Y_RMSE (m) | Total_RMSE (m) |
|---|---|---|---|
| Outdoor dataset | 0.22 | 0.65 | 0.69 |
| Indoor dataset | 0.19 | 0.54 | 0.58 |

The total RMSE is about 60 cm. However, the estimated trajectory using Spike remains close to the reference trajectory. This shows that the proposed approach using the Spike measurements allows for scale recovery of the monocular VO and precise localization of the camera.
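The per-axis and total RMSE values reported above can be computed with a short helper. This is a minimal numpy sketch, assuming the total RMSE is the root-sum-square of the per-axis values, which matches the tabulated figures (e.g., sqrt(0.22² + 0.65²) ≈ 0.69); the function name is illustrative.

```python
import numpy as np

def trajectory_rmse(est, ref):
    """Per-axis and combined RMSE between an estimated and a reference
    planar trajectory, given as (n, 2) arrays of (X, Y) coordinates."""
    err = np.asarray(est) - np.asarray(ref)
    x_rmse = np.sqrt(np.mean(err[:, 0] ** 2))
    y_rmse = np.sqrt(np.mean(err[:, 1] ** 2))
    return x_rmse, y_rmse, np.hypot(x_rmse, y_rmse)
```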

| Ground-truth distance (m) | VO-with-Spike distance (m) | Scale error (m) |
|---|---|---|
| 2.005 | 2.017 | 0.012 |
| 2.001 | 2.008 | 0.007 |
| 1.992 | 2.027 | 0.035 |
| 2.001 | 2.018 | 0.017 |
| 1.999 | 2.060 | 0.061 |
| 2.010 | 1.987 | −0.023 |
| 2.009 | 1.969 | −0.041 |
| 1.989 | 1.985 | −0.003 |
| 1.991 | 2.019 | 0.028 |
| 2.007 | 2.010 | 0.003 |

| Ground-truth distance (m) | VO-with-Spike distance (m) | Scale error (m) |
|---|---|---|
| 2.00 | 2.01 | 0.01 |
| 2.00 | 1.97 | −0.03 |
| 2.00 | 2.02 | 0.02 |
| 2.00 | 1.98 | −0.02 |
| 2.00 | 1.99 | −0.01 |
| 2.00 | 1.98 | −0.02 |
| 1.00 | 1.03 | 0.03 |
| 1.07 | 1.09 | 0.02 |
| 2.08 | 1.99 | −0.09 |
| 2.05 | 2.01 | −0.04 |
| 2.05 | 1.98 | −0.07 |
| 2.03 | 2.02 | −0.01 |
| 2.04 | 1.98 | −0.06 |
| 2.08 | 1.99 | −0.09 |
| 2.04 | 2.01 | −0.03 |

To further assess the effectiveness of the proposed approach, the point clouds of the two datasets were generated using the Pix4D mapper. Figures 13-16 show the results of comparing the Spike-based and the iPhone-based point clouds. The dimensions of different features in both point clouds were compared, using CloudCompare software, with the ground truth measured in the field using a tape with 2 mm precision (Figures 13-16). It was found that the Spike-based point cloud is more accurate than its iPhone-based counterpart. Note that the points were picked manually, so a manual measurement error also contributes to the estimated distances in Figures 13-16. This shows that the use of Spike measurements to resolve the scale ambiguity is an efficient and cost-effective approach, which significantly improves the mapping accuracy.

In this paper, we presented a novel approach that takes advantage of the low-cost Spike laser rangefinder to resolve the scale ambiguity in monocular visual odometry. The proposed approach was tested in both outdoor and indoor scenarios, and the results were compared to a ground truth measured by a high-end total station. It was shown that the proposed solution achieves centimeter-level accuracy in monocular VO scale recovery, which leads to an enhanced mapping accuracy.

This research is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and Ryerson University. The first author would like to thank Mr. Nader Abdelaziz, a Ph.D. candidate at Ryerson University, for his help in the field data collection.

The authors declare no conflicts of interest regarding the publication of this paper.

El Amin, A. and El-Rabbany, A. (2020) Monocular VO Scale Ambiguity Resolution Using an Ultra Low-Cost Spike Rangefinder. Positioning, 11, 45-60. https://doi.org/10.4236/pos.2020.114004