Imagine that hundreds of video streams, taken by mobile phones during a rock concert, are uploaded to a server. One attractive application of such a rich dataset is to allow users to create their own video with a deliberately chosen but virtual camera trajectory. In this paper we present algorithms for the main sub-tasks (spatial calibration, image interpolation) related to this problem. Calibration: Spatial calibration of individual video streams is one of the most basic tasks related to creating such a video. At its core, this requires estimating the pairwise relative geometry of images taken by different cameras. This is also known as the relative pose problem [1], and is fundamental to many computer vision algorithms. In practice, efficiency and robustness are of highest relevance for big data applications such as the ones addressed in the EU-FET_SME project SceneNet. In this paper, we present an improved algorithm that exploits additional data from inertial sensors, such as accelerometers, magnetometers or gyroscopes, which by now are available in most mobile phones. Experimental results on synthetic and real data demonstrate the accuracy and efficiency of our algorithm. Interpolation: Given the calibrated cameras, we present a second algorithm that generates novel synthetic images along a predefined camera trajectory. Each frame is produced from two “neighboring” video streams that are selected from the database. The interpolation algorithm is based on the point cloud reconstructed in the spatial calibration phase and iteratively projects triangular patches from the existing images into the new view. We present convincing images synthesized with the proposed algorithm.

If you visited a rock concert recently, or any other event that attracts crowds, you probably noticed how many people take videos of the scene using their mobile phone cameras. Combining these video streams potentially allows viewing the scene from arbitrary angles or creating a new video with an artificially designed camera trajectory. This is one of the challenges of SceneNet^{1}, which aims to develop software for aggregating such audio-visual recordings of public events, in order to create multi-view high quality video sequences. The general setup of the SceneNet computational infrastructure is depicted in

In order to achieve this goal, there are several challenges that require an efficient solution. The first challenge is the mobile infrastructure: the individual user needs to be related to the event, including time and location tags. Then, large amounts of audio-visual data need to be transferred via the cellular network to a server. To this end, we have developed a framework which reduces the transmitted data. In this framework the server performs spatial registration based on image features and sensor measurements that are computed on the devices. From this registration and from video quality measurements (also done on the devices), the server chooses a small subset of videos that are transferred to the server for the multi-view video generation. The bandwidth reduction is therefore a function of the minimal number of features and auxiliary data that is needed for an accurate spatial registration.

The second challenge is spatial registration, i.e. the task of determining the relative position and orientation of two video streams. This is the classical task of epipolar geometry which describes the relative geometry of two images depicting the same scene. It is encoded in a 3 × 3 singular matrix known as the fundamental matrix [

A common practice, for estimating the fundamental matrix, is to use matching invariant features, e.g. SIFT, SURF, etc. followed by a robust model-fitting algorithm. Typically, a RANSAC [

These days, it is common to use mobile cameras that have built-in sensors such as an accelerometer, a compass and gyroscopes. These can be used for measuring the motion and orientation of the mobile camera. In this paper we present an algorithm for estimating the epipolar geometry given the relative orientation of the cameras and the intrinsic parameters of the cameras. This allows computing an improved, accurate registration with fewer feature points than traditional algorithms [

Given the spatially calibrated cameras, the ability to interactively control the viewpoint while watching a video is an exciting application. This poses an additional challenge: an efficient image interpolation algorithm is required in order to obtain free viewpoint video.

Novel view interpolation, also known in the literature as image-based rendering (IBR), is a classic problem in computer vision and graphics [

The main challenge of novel view interpolation is how to robustly estimate pixel correspondences between two given camera frames and to interpolate the pixel motion into the novel image in a coherent way. In the proposed approach each rendered image is constructed from two existing images of the two most similar cameras, by transferring triangular patches one at a time. The triangles are defined by the correspondences of the two existing images, projected onto the novel view according to the estimated 3D structure. We do not impose strong assumptions on the scene structure or the camera movement, allowing for arbitrary input and even for wide baseline setups.

The rest of the paper is organized as follows: Section 2 analyzes the complexity of the problem; Section 3 reviews the state-of-the-art works that are related to this presentation; and Section 4 presents the spatial registration algorithm. In Section 5 we present the virtual image generating algorithm; Section 6 presents experimental results that validate the presented algorithms; and Section 7 concludes the presentation.

In this section we give a theoretical and practical analysis of the computational complexity of the spatial registration algorithm for the specific problem of numerous users capturing the same scene.

The amount of data that needs to be transferred via the network is illustrated in the following example. Consider a Samsung Galaxy S6 device that records UHD video (3840 × 2160) with a bit-rate of 48 Mbps, and Full-HD video (1920 × 1080) with a bit-rate of 17 Mbps. In a moderate scene with 50 devices filming UHD and 50 devices filming Full-HD, the total bit-rate accumulates to 50 × 48 + 50 × 17 = 3250 Mbps, i.e., 3.25 Gbps. Hence, real-time transfer of all the videos to a common server is not feasible due to the bandwidth limits of the wireless network, which cannot exceed 500 Mbps for an individual user. The practical approach would be to submit only the detected features, which can be computed on the device for each frame. Even this approach may be insufficient, since the typical number of detected features can reach several thousand. For example, let n = 5000 be the number of features, and assume that each detected feature is described by a SIFT descriptor [

The core of the registration algorithm relies on comparisons between images. However, only images taken from nearby positions will give valuable results. Since in the first frame we have no initial guess of the camera locations, we need to compare all the image pairs, which results in complexity of

As an example, consider the N = 50 cameras scenario described above. In order to initialize the spatial registration, on the order of $N^2 \approx 10^3$ image comparisons are required. Hence, the registration algorithm described here requires the transmission of a smaller number of features, which results in a lower bandwidth consumption with the additional advantage of efficient computation time.
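The figures quoted in this section can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names are illustrative, and the pair count assumes that every unordered pair of cameras is compared exactly once:

```python
from math import comb

def total_bitrate_mbps(groups):
    """Sum the bit-rates over (device_count, mbps_per_device) groups."""
    return sum(count * mbps for count, mbps in groups)

def unordered_pairs(n_cameras):
    """Number of distinct image pairs among n_cameras cameras."""
    return comb(n_cameras, 2)

# 50 devices filming UHD at 48 Mbps and 50 devices filming Full-HD at 17 Mbps
total = total_bitrate_mbps([(50, 48), (50, 17)])
print(total / 1000)          # aggregate bit-rate in Gbps: 3.25
print(unordered_pairs(50))   # 1225 pairs, i.e. on the order of 10^3
```

The quadratic growth of the pair count is what makes the per-pair cost (number of features transmitted and matched) the dominant factor.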

The focus of the paper is on the spatial registration task and on synthesis of virtual camera images. Hence the description of the state-of-the-art on these topics is given in the following subsections.

In this section we review the relative pose estimation algorithms that are relevant to this work. In order to determine the relative pose between two cameras, one needs to estimate the fundamental or the essential matrix. These matrices represent the epipolar geometry that describes the relative geometry of two cameras.

The fundamental matrix, F, encapsulates the two cameras’ intrinsic parameters, which are the focal lengths and the principal points, and the extrinsic parameters, which include the relative orientation and the translation vector from the first to the second camera. The intrinsic parameters are usually publicly available through the camera manufacturer or the operating system of the mobile device. Given the intrinsic parameters, the problem reduces to estimating the essential matrix, E, which encodes only the relative pose parameters.

Numerous methods have been proposed for estimating the fundamental matrix, which can be classified as linear, iterative and robust methods. There is a set of “n-point” feature-based algorithms to compute the fundamental matrix, or the essential matrix in case of known intrinsic parameters. The fundamental matrix can be estimated by the normalized-8-point algorithm [

The performance of the “n-point” algorithms depends significantly on the quality of the feature correspondences detected between the images. There are two main sources of degradation in the quality of the correspondences: (a) bad point localization due to image noise and (b) outliers caused by wrong matching between corresponding feature points. The “n-point” algorithms can compensate for feature localization errors by adding redundant points and solving a least-squares minimization problem. Alternatively, iterative methods are in general more accurate than the linear “n-point” methods. They use sophisticated computational approaches to solve a nonlinear optimization problem [

Robust methods aim to tolerate both image noise and outliers. Robust parameter estimation in the presence of outliers is a general problem in computer vision and thorough reviews can be found in [

The weak point of the sampling-based algorithms is the necessity to sample an outlier-free set. The number of random samples needed to obtain an outlier-free set depends exponentially on the number of elements required for the estimation and on the inlier fraction. Thus, reducing the size of the sampling set is of utmost importance when applying a RANSAC scheme. For example, to get an outlier-free sample with 99% certainty from a data set with a 50% outlier ratio, one needs to sample 146 times using the 5-point algorithm, 1177 times using the 8-point algorithm, and only 35 times using the proposed 3-point algorithm, i.e., a speedup by a factor of about 4 and 33, respectively. Hence, the proposed 3-point algorithm will be much more efficient, which might be very important for devices with limited computational power such as smartphones.
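The sample counts quoted above follow from the standard RANSAC stopping criterion: the smallest number of draws N for which the probability of having drawn at least one all-inlier sample reaches the requested confidence. A minimal sketch (the function name is illustrative, and the outlier ratio is assumed to lie strictly between 0 and 1):

```python
import math

def ransac_iterations(sample_size, outlier_ratio, confidence=0.99):
    """Smallest N with 1 - (1 - w**s)**N >= confidence, where w is the inlier ratio."""
    inlier_ratio = 1.0 - outlier_ratio
    p_clean_sample = inlier_ratio ** sample_size  # probability that one sample is outlier free
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_clean_sample))

for s in (3, 5, 8):
    print(s, ransac_iterations(s, outlier_ratio=0.5))
```

With a 50% outlier ratio this gives 35, 146 and 1177 iterations for sample sizes 3, 5 and 8, which matches the numbers in the text.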

In addition, the RANSAC scheme can yield inaccurate hypothesis estimates due to image noise. In order to improve the performance of the RANSAC scheme many algorithms have been developed, such as LO-RANSAC [

Recently, many researchers have aimed at exploiting auxiliary information, either visual, like vanishing points, or from external sensors attached to the camera, like an Inertial Measurement Unit (IMU). For example, in [

The developed approach was inspired by several state-of-the-art works on image-based rendering and image stabilization. A well established technique for image morphing is based on feature matching. For a survey of such methods see [

Another class of approaches for novel image synthesis is based on 3D structure information, usually reconstructed using a standard structure-from-motion (SFM) algorithm [

Recently, Kopf et al. [

Nowadays, cameras are often equipped with high-quality sensors, such as an accelerometer, a magnetometer (compass) and a gyroscope. In addition, for vision-based robot navigation, low-cost inertial measurement units (IMUs) based on micro-electro-mechanical systems (MEMS) devices can be used to estimate the relative pose.

In this section we present an effective algorithm that exploits sensors’ measurements for estimating the essential matrix and thus solves the relative pose problem. A short version of the proposed method has been presented at GAMM 2015, see [

In this presentation we follow the standard notations and mathematical foundation presented in Hartley and Zisserman [

The pinhole camera model is given by a projection matrix

$$P = K R \, [\, I \mid -C \,],$$

where K is the calibration matrix, R is the rotation matrix that rotates a vector in the world coordinate system to the camera’s reference frame, and C is the camera’s center of projection in the world coordinate system.
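To make the roles of K, R and C concrete, the following sketch assembles a projection matrix and projects a world point. It assumes the Hartley and Zisserman form P = K R [I | -C], which matches the definitions given here; the function names and the numeric calibration values are illustrative only:

```python
import numpy as np

def projection_matrix(K, R, C):
    """Assemble the 3x4 matrix P = K R [I | -C] (assumed Hartley-Zisserman form)."""
    K = np.asarray(K, dtype=float)
    R = np.asarray(R, dtype=float)
    C = np.asarray(C, dtype=float).reshape(3, 1)
    return K @ R @ np.hstack([np.eye(3), -C])

def project(P, X):
    """Project a 3D world point X to inhomogeneous pixel coordinates."""
    x = P @ np.append(np.asarray(X, dtype=float), 1.0)
    return x[:2] / x[2]

# Hypothetical calibration: focal length 800 px, principal point (320, 240)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P = projection_matrix(K, np.eye(3), [0.0, 0.0, -5.0])
print(project(P, [0.0, 0.0, 0.0]))  # a point on the optical axis maps to the principal point
```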

The calibration matrix is a 3 × 3 matrix that includes the focal length and the principal point of projection. In practice, the image’s center point is commonly regarded as the principal point. The focal length can be estimated using dedicated algorithms, e.g., [

The geometry that describes the relative pose of two cameras is known as epipolar geometry [

The fundamental matrix is composed of the extrinsic parameters that describe the relative pose between the two views, and the intrinsic parameters that include the focal length and the principal point (center of projection) of the cameras. As stated above, the intrinsic parameters may be regarded as known and can be extracted from the camera API, which reduces the fundamental matrix to the essential matrix:

$$E = K'^{\top} F K,$$

where any primed symbol represents an entity of the 2^{nd} camera. The essential matrix E is composed of the relative rotation and translation,

$$E = [t]_{\times} R,$$

where $R$ is the relative rotation matrix and $[t]_{\times}$ is the skew-symmetric matrix of the translation vector $t$, i.e., $[t]_{\times} v = t \times v$ for any 3-vector $v$.

It has 5 degrees of freedom (DOF): 3 for the 3 rotation angles, and 2 for the normalized 3-vector that represents the translation between the two views.

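In order to estimate the essential matrix and thus solve the relative pose problem, one uses the epipolar constraint equation:

$$x'^{\top} E \, x = 0,$$

where $x$ and $x'$ are corresponding points in the first and the second image, represented by homogeneous vectors in normalized image coordinates. The essential matrix is estimated by imposing the epipolar constraint on multiple correspondence pairs.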

It is common to represent the rotation matrix by its Euler-angles:

Given the relative rotation matrix, a set of points from the first image

ponding to a set of points from the other image

Using the Euler-angle representation, each corresponding pair

Hence, the optimization problem (7) is reduced to a 3-dimensional linear optimization problem:

A valid solution is possible with only two equations because t has only 2 degrees of freedom. However, to avoid degenerate configurations, we use a minimal set of 3 correspondences.
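Once the rotation is known, the epipolar constraint becomes linear in the translation: writing the constraint as $x'^{\top} [t]_{\times} R x = 0$ and using the scalar triple product gives $t \cdot \big((R x) \times x'\big) = 0$ for every correspondence. The sketch below solves this homogeneous system for a unit-norm $t$. It is a minimal illustration and not necessarily the authors' implementation: it assumes the convention $E = [t]_{\times} R$, image points in normalized (calibrated) homogeneous coordinates, and an SVD-based null-space solver; the function name is hypothetical.

```python
import numpy as np

def estimate_translation(R, x1, x2):
    """Estimate the unit translation t (up to sign) from known rotation R.

    x1, x2: (n, 3) arrays of corresponding homogeneous points, n >= 2,
    in normalized image coordinates of the first and second camera.
    Each pair contributes one linear equation ((R x1_i) x x2_i) . t = 0.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    rotated = x1 @ np.asarray(R, dtype=float).T   # rows are R @ x1_i
    A = np.cross(rotated, x2)                     # one constraint row per pair
    _, _, vt = np.linalg.svd(A)
    t = vt[-1]                                    # right singular vector of the smallest singular value
    return t / np.linalg.norm(t)

# Synthetic check: second camera frame is X2 = R @ X1 + t_true
angle = 0.1
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t_true = np.array([1.0, 0.2, -0.1])
t_true /= np.linalg.norm(t_true)
X1 = np.array([[0.5, 0.2, 4.0], [-1.0, 0.3, 5.0], [0.2, -0.7, 6.0]])
X2 = X1 @ R.T + t_true
t_est = estimate_translation(R, X1 / X1[:, 2:3], X2 / X2[:, 2:3])
print(abs(t_est @ t_true))  # close to 1 when the direction is recovered
```

Inside a RANSAC loop, such a solver would be called on random triplets of correspondences, and the sign of $t$ would be fixed afterwards, for example by requiring triangulated points to lie in front of both cameras.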

Similar to the 8-point algorithm [

The rotation matrix in Equation (3) represents the rotation from the second camera to the first, where the first camera is located at the origin and is aligned with the major axes of the coordinate system. In case the rotation matrices of both cameras are given by sensor measurements with respect to some global reference frame, the relative orientation needs to be computed for the proposed algorithm. Let

Then,

A schematic illustration of the virtual camera problem is illustrated in

After identifying the two most suitable cameras, our approach for novel view image synthesis consists of the following steps:

1) Projection of feature correspondences from two nearest views into new image.

2) Delaunay triangulation of projected points.

3) Warp triangles into new view.

4) Fill holes and background from a background model.

The following subsections detail our approach with respect to the processing steps.

For a new camera pose, i.e., 3D rotation and spatial location, we commence by identifying the two closest cameras, and the subset of 3D points that corresponds to matching feature pairs. The two cameras are the two that have the most similar orientation out of the set of nearby cameras. This stage utilizes the 3D background model that has been constructed in the spatial calibration task.
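The view selection described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: "nearby" is modeled by a hypothetical search radius around the virtual camera center, and orientation similarity is measured by the angle of the relative rotation between the virtual camera and each candidate.

```python
import numpy as np

def rotation_angle(Ra, Rb):
    """Angle (radians) of the relative rotation between two orientations."""
    cos_angle = (np.trace(np.asarray(Ra).T @ np.asarray(Rb)) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def select_two_views(R_new, C_new, rotations, centers, radius):
    """Indices of the two cameras within `radius` of C_new whose
    orientation is most similar to R_new (radius is an assumed parameter)."""
    centers = np.asarray(centers, dtype=float)
    dist = np.linalg.norm(centers - np.asarray(C_new, dtype=float), axis=1)
    nearby = np.flatnonzero(dist <= radius)
    if nearby.size < 2:
        raise ValueError("fewer than two cameras within the search radius")
    angles = np.array([rotation_angle(R_new, rotations[i]) for i in nearby])
    best = nearby[np.argsort(angles)[:2]]
    return tuple(int(i) for i in best)
```

A distance-weighted score would be an equally plausible choice; the two-stage filter is used here only because it mirrors the wording of the text.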

Using the epipolar constraint and guided matching, more feature pairs are added and triangulated, where the triangulation procedure is the standard Delaunay triangulation. An example of the pairwise matching and the projection on the new view is given in

The triangulation on the new view induces triangular meshes on the two existing images. In the next phase we sequentially go over the triangles in the new view and render them one at a time.
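Rendering a triangle requires a mapping between a triangle in an existing image and its counterpart in the new view. The text does not state which warp is used; a common choice, assumed here purely for illustration, is the affine map determined by the three vertex correspondences. The function names are hypothetical.

```python
import numpy as np

def triangle_affine(src_tri, dst_tri):
    """2x3 affine matrix M that maps the three src vertices onto the dst vertices."""
    src = np.asarray(src_tri, dtype=float)
    dst = np.asarray(dst_tri, dtype=float)
    S = np.hstack([src, np.ones((3, 1))])   # rows [x, y, 1]
    return np.linalg.solve(S, dst).T        # solves S @ M.T = dst

def apply_affine(M, points):
    """Apply the affine map to an (n, 2) array of points."""
    pts = np.asarray(points, dtype=float)
    return pts @ M[:, :2].T + M[:, 2]

src = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
dst = [[2.0, 1.0], [4.0, 1.0], [2.0, 5.0]]
M = triangle_affine(src, dst)
print(apply_affine(M, src))  # reproduces the dst vertices
```

In practice the inverse map is usually applied, so that every pixel inside the destination triangle samples a color from the source image.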

In order to choose the best “looking” triangle we compute the Procrustes distance [

A normalization procedure precedes the computation of the Procrustes distance. The centroid of the configuration is translated to the origin and the norm of the configuration is rescaled to unity. The distance is defined as:

where

We add another condition on the appearance of the triangles: if the sum of squared differences (SSD) of the pixels from the two triangles is above a predefined threshold, we reject this specific triangle. This condition has two purposes. First, it handles matching errors, which reduces appearance artifacts. Second, it reduces the artifacts caused by a triangle whose pixels cover segments at multiple depths.
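The two triangle tests above can be sketched as follows. Since the exact distance formula is not reproduced in this text, the sketch assumes the standard orthogonal Procrustes distance between the normalized configurations (centroid at the origin, unit norm, optimal alignment over rotations and reflections). The function names and the threshold argument are illustrative.

```python
import numpy as np

def normalize_shape(tri):
    """Translate the centroid to the origin and rescale to unit norm."""
    pts = np.asarray(tri, dtype=float)
    pts = pts - pts.mean(axis=0)
    norm = np.linalg.norm(pts)
    if norm == 0.0:
        raise ValueError("degenerate triangle: all vertices coincide")
    return pts / norm

def procrustes_distance(tri_a, tri_b):
    """Orthogonal Procrustes distance between two normalized triangles.

    min over orthogonal Q of ||A - B Q||_F equals sqrt(2 - 2 * sum(s)),
    where s are the singular values of A.T @ B.
    """
    A = normalize_shape(tri_a)
    B = normalize_shape(tri_b)
    s = np.linalg.svd(A.T @ B, compute_uv=False)
    return float(np.sqrt(max(0.0, 2.0 - 2.0 * s.sum())))

def passes_ssd_check(pixels_a, pixels_b, threshold):
    """Accept the triangle only if the SSD of its pixels does not exceed the threshold."""
    diff = np.asarray(pixels_a, dtype=float) - np.asarray(pixels_b, dtype=float)
    return float(np.sum(diff ** 2)) <= threshold
```

The distance is zero for triangles that differ only by translation, scale and rotation, so a small value indicates that a source triangle is only mildly distorted when transferred to the new view.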

Each new rendered triangle is expanded by a constant band that overlaps the already rendered pixels. Similar to image quilting [

In this section we provide a quantitative evaluation of the proposed approach on synthetic generated data as well as on several real-world datasets. These evaluations show that the presented algorithm leads to estimators that are competitive with state-of-the-art algorithms.

In this experiment we synthesized 3D data points and two cameras with known poses and intrinsic parameters. We report two kinds of experiments. The first examines the resilience of the estimation algorithm to additive spatial noise that perturbs the feature points in the image domain. The second examines the behavior of the estimation algorithm in the presence of mismatched points, i.e., in the presence of outliers. For the synthetic experiment we compared our algorithm to the standard normalized-8-points algorithm (N8P) [

For each experiment we test 3 camera configurations. First, two cameras with the same orientation looking forward located along the x-axis. Second, two cameras located along the x-axis looking forward with different orientations. Third, the two cameras located one in front of the other, with the same orientation.

We report two evaluation criteria. First, the root mean square error (RMSE) on the basis of the Sampson distance, which is defined as:

where

The results of the experiments are illustrated in

We evaluate the proposed calibration approach on two benchmark datasets that are extensively used for evaluating structure-from-motion algorithms [

Both datasets are available online^{3}, and include accurate camera matrices from which we extract the rotation matrices. The first dataset is Fountain-P11, which contains eleven images of a fountain. The other dataset is Herz-Jesu-P25, which contains 25 images of a building. Feature points were detected using a standard corner detector and were described by SIFT descriptors [

The estimated camera locations and the estimated 3D structure of both datasets are illustrated in

In addition, we compare the camera location estimates of the proposed algorithm to the results reported by Dalalyan and Keriven [^{4}. The accuracy and timing performance of our algorithm compared to their algorithm, both implemented in Matlab and given the same input, is reported in

These experiments on real datasets provide both qualitative and quantitative results that support the validity of the proposed algorithm for solving spatial calibration problems.

| Method | Estimation error | Time (s) |
|---|---|---|
| Proposed 3-pt. alg. | 0.000128 | 15.3 |
| Dalalyan & Keriven [ | 0.000342 | 265.0 |

| Method | Estimation error | Time (s) |
|---|---|---|
| Proposed 3-pt. alg. | 0.000968 | 11.7 |
| Dalalyan & Keriven [ | 0.040161 | 269.23 |

In this section we demonstrate the validity of the proposed virtual camera scheme. The reported results are based on the Fountain-P11 dataset; more results are available on the project’s website^{5}. We applied the spatial registration algorithm (Section 4) and reconstructed the 3D point cloud as illustrated in

Synthetic image is presented in

These days, given the number of images and videos captured of the same scene at the same time, it is natural to ask how this huge volume of information can be utilized to give the user a different and better experience. In this paper we present two algorithms that can be the first building blocks for such an ambitious goal.

The first algorithm is a spatial calibration algorithm which, given the captured images and the measurements of the device’s sensors, accurately and efficiently estimates the pose of the cameras. We validate the proposed algorithm on synthetically generated data and on well-established benchmark datasets, and compare it to state-of-the-art algorithms.

The second algorithm is a virtual image generator that takes the calibrated camera poses and the reconstructed 3D point cloud, and generates visually appealing images of a virtual camera moving along a specified trajectory. The proposed algorithm generates each virtual image from two images of cameras that have a similar viewing angle and are not far from the virtual camera. We present several results that demonstrate the quality of the algorithm.

The authors gratefully acknowledge the support by the European Union under the 7th Research Framework, programme FET-Open SME “SceneNet” (GA 309169).

Amir Egozi, Dov Eilot, Peter Maass, Chen Sagiv (2015) A Robust Estimation Method for Camera Calibration with Known Rotation. Applied Mathematics, 06, 1538-1552. doi: 10.4236/am.2015.69137