Modeling Camera Image Formation Using a Feedforward Neural Network

One fundamental problem in computer vision and image processing is modeling the image formation of a camera, i.e., mapping a point in three-dimensional space to its projected position on the camera’s image plane. If the relationship between the space and the image plane is assumed to be linear, it can be expressed as a transformation matrix, which is often identified by regression. In this paper, we show that the space-to-image relationship in a camera can be modeled by a simple neural network. Unlike most other applications of neural networks, the structure of the network is chosen so that each link between neurons has a physical meaning. This makes it possible to initialize the link weights effectively and to train the network quickly.


Introduction
A camera can be considered a device that records objects in three-dimensional (3D) space in the form of their two-dimensional (2D) images. In technical fields where a camera is required, such as computer vision and image processing, accurate and efficient modeling of the camera's image formation process is a basic problem that must be solved.
For a camera installed for a certain task, the image formation in the camera is characterized by the internal and external parameters of the camera [1]. The internal parameters include the focal length, the optical image center, and the lens distortion coefficients, whereas the external parameters specify the geometric position and orientation of the camera. The process of determining the camera model parameters is called camera calibration [2]. Once a camera is calibrated, it is possible to computationally relate objects in the 3D world to their projections on the camera's image plane.
Camera modeling and calibration have received great attention in the photogrammetry, computer vision, machine vision, and image processing communities, particularly since the 1980s, as cameras and computers became smaller, cheaper, more powerful, and easier to use thanks to rapid technical advances in electronics. The most widely used method is to mathematically estimate the parameters of a camera model that best relate control points in the 3D world to their corresponding 2D image points [3,4,5]. To increase the accuracy of camera calibration, control points must be collected evenly from the space viewed by the camera. However, it is difficult to measure the positions of the 3D points accurately. Automatic calibration methods [6,7] and methods using planar points [7,8] have been proposed to overcome this difficulty. Existing camera modeling and calibration techniques are well reviewed in [9,10].
The relationship between the coordinates of a 3D point and the coordinates of its corresponding 2D image point is expressed in terms of a 3×4 matrix when the relationship in a camera is assumed to be linear. The elements of this transformation matrix can be determined by a regression technique using six or more control points and their image points.
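The regression step can be illustrated with a minimal sketch. It assumes the homogeneous least-squares formulation solved by singular value decomposition, which is one common choice and not necessarily the technique used in the cited work. The function names `estimate_projection_matrix` and `project` are hypothetical.

```python
import numpy as np

def estimate_projection_matrix(points_3d, points_2d):
    """Estimate the 3x4 matrix M in s*[u, v, 1]^T = M*[x, y, z, 1]^T."""
    points_3d = np.asarray(points_3d, dtype=float)
    points_2d = np.asarray(points_2d, dtype=float)
    if len(points_3d) < 6 or len(points_3d) != len(points_2d):
        raise ValueError("need six or more matched control points")
    rows = []
    for (x, y, z), (u, v) in zip(points_3d, points_2d):
        p = [x, y, z, 1.0]
        # Eliminating the scale s gives two linear equations per control point.
        rows.append(p + [0.0] * 4 + [-u * c for c in p])
        rows.append([0.0] * 4 + p + [-v * c for c in p])
    A = np.array(rows)
    # Least-squares solution of A m = 0 subject to ||m|| = 1:
    # the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 4)

def project(M, point_3d):
    """Map a 3D point to image coordinates (u, v) with the 3x4 matrix M."""
    o = M @ np.append(np.asarray(point_3d, dtype=float), 1.0)
    return o[:2] / o[2]
```

The matrix is recovered only up to a common scale factor, which cancels in `project`. With noisy measurements, normalizing the point coordinates before building the system usually improves conditioning.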
In this paper, we show that the relationship between 3D points and their 2D images can be expressed by a neural network (NN). The model parameters can then be learned by training the NN. The proposed method is quite different from most existing NN-based methods for camera calibration, in which NNs are usually used to identify unknown parts that are not accommodated in a camera model. For example, in [11], an NN is used to learn the camera's nonlinearity after linear parameter estimation. The nonlinearity is mostly due to lens distortion [12]. If the linear NN model of this paper is combined with an existing NN for learning the nonlinearity, a complete camera model can be constructed with NNs only.

Pin-hole Model
The pin-hole camera model is widely used to relate the image coordinates of an object point visible to a camera to the coordinates of the point in the world coordinate system through a distortion-free linear mapping [1,2]. In the model, all rays of sight from 3D points in a scene are assumed to pass through one particular spatial point, the pin-hole. Figure 1 shows the pin-hole camera model and the relationships assumed in it, where f is the focal length. Combining these relationships leads to Equation (4).
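For orientation, the pin-hole relationships are commonly written in the following standard textbook form. It is given here as a hedged sketch and not as the paper's own Equations. It assumes a world point P = [x y z]^T, its camera-frame representation P_C = [x_C y_C z_C]^T obtained with a rotation matrix R and a translation vector T, an optical image center (i_O, j_O), and pixel dimensions d_u and d_v. The symbols d_u and d_v, the axis directions, and the sign conventions are assumptions.

```latex
\begin{align*}
P_C &= R\,P + T, \\
u &= i_O + \frac{f}{d_u}\,\frac{x_C}{z_C}, \qquad
v = j_O + \frac{f}{d_v}\,\frac{y_C}{z_C}, \\
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
&= \begin{bmatrix} f/d_u & 0 & i_O \\ 0 & f/d_v & j_O \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} R & T \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix},
\qquad s = z_C .
\end{align*}
```

The last line is linear in the homogeneous world coordinates, which is what allows the mapping to be collected into a single 3×4 matrix.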

Neural Network Implementation
A feedforward neural network computes output values from given input values by propagating weighted values through the links between neurons. We want to design an NN, as shown in Figure 2, that can represent the image formation process described in Section 2.1. However, it is not possible to build a network with this structure directly from Equation (4) because of the scale factor s, which is the coordinate z_C of a 3D point. Instead, Equation (4) leads to the structure shown in Figure 3.
As with most other NNs and their applications, the key issue for the NN implementation presented in Figure 4 is determining the weight of each link between neurons. From Equation (4), the physical meaning of w_nm, the link weight from neuron m to neuron n, can be specified as in Equation (5), where r_pq are the elements of the rotation matrix R and t_p are the elements of the translation vector T, with 1 ≤ p, q ≤ 3.
The network shown in Figure 4 has a quite simple structure. However, it is not simple to train, because the scale factor s is not known for a given 3D point P; only the projected image coordinates u and v are known for a control point P. If the desired output is not available, the network cannot be trained with a supervised learning algorithm such as gradient descent optimization [13]. We thus need to develop a method to train a network with the structure of Figure 4.
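The role of the scale factor can be made concrete with a minimal sketch of the forward pass. It assumes four inputs (x, y, z, and a constant 1) and three linear outputs interpreted as (s·u, s·v, s). The function names and the example weight values are hypothetical and are not taken from Figure 4.

```python
import numpy as np

def forward(W, point_w):
    """Linear network with 4 inputs and 3 outputs; returns (s*u, s*v, s)."""
    x_in = np.append(np.asarray(point_w, dtype=float), 1.0)  # inputs x, y, z, 1
    return np.asarray(W, dtype=float) @ x_in

def image_coordinates(outputs):
    """Recover (u, v) by dividing the first two outputs by the third one."""
    su, sv, s = outputs
    return su / s, sv / s

# Hypothetical weights: camera frame aligned with the world frame,
# scale 100 pixels per unit of x/z, image center at (50, 40).
W = np.array([[100.0, 0.0, 50.0, 0.0],
              [0.0, 100.0, 40.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
outputs = forward(W, [1.0, 2.0, 10.0])
u, v = image_coordinates(outputs)
```

A control point supplies only u and v, so the three raw outputs have no directly measured targets; this is the training difficulty described above.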
An error function E is first defined for the network. Note that the error term of the third output neuron, e_3, is derived from Equation (7). The weights are then trained by gradient descent. For a weight w_nm with n = 1 or 2, as shown in Figure 5, the chain rule is applied to the error E, assuming a linear activation function for the output neurons. For the case of n = 3, on the other hand, a separate update equation is obtained by gradient descent.
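As a hedged illustration of the chain-rule step, one possible formulation is sketched below. It assumes network inputs x_m (x, y, z, 1) and linear outputs o_n, and, because s is unknown, it uses the third output o_3 in place of s, so that the targets of the first two outputs become u·o_3 and v·o_3. The error function E, the learning rate η, and the resulting form of e_3 are assumptions of this sketch and are not necessarily the authors' equations.

```latex
\begin{align*}
o_n &= \sum_{m=1}^{4} w_{nm}\,x_m, \qquad
e_1 = u\,o_3 - o_1, \qquad e_2 = v\,o_3 - o_2, \qquad
E = \tfrac{1}{2}\left(e_1^2 + e_2^2\right), \\
\frac{\partial E}{\partial w_{nm}}
&= \frac{\partial E}{\partial o_n}\,\frac{\partial o_n}{\partial w_{nm}}
 = -e_n\,x_m
 \quad\Rightarrow\quad \Delta w_{nm} = \eta\,e_n\,x_m, \qquad n = 1, 2, \\
\frac{\partial E}{\partial w_{3m}}
&= \left(u\,e_1 + v\,e_2\right) x_m
 \quad\Rightarrow\quad \Delta w_{3m} = \eta\,e_3\,x_m, \qquad
 e_3 = -\left(u\,e_1 + v\,e_2\right).
\end{align*}
```

This error is also minimized by all-zero weights, so a scheme of this kind depends on a good initial guess of the weights or on a normalization constraint.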

Numerical Example
A camera is assumed to be located at x = −200, y = 500, z = 2000 and oriented by Z-Y-X Euler angles of θ_z = 45°, θ_y = −30°, and θ_x = 120° in the world coordinate system {W}.
It is also assumed that the focal length is f = 25, the coordinates of the optical image center are (258, 204), and the pixel size is 0.023 × 0.023. This camera setup is drawn in Figure 6. An NN can then be built to express the image formation process of the camera, as presented in Figure 7.
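The setup can be turned into a 3×4 weight matrix with a short sketch. Several conventions are not stated in the text and are assumed here: the Euler angles are composed as Rz·Ry·Rx and describe the camera axes in {W}, the focal length and the pixel size share the same length unit, and the image center (258, 204) offsets u and v in that order. The resulting numbers are therefore illustrative and need not match Figure 7.

```python
import numpy as np

def rot_zyx(theta_z, theta_y, theta_x):
    """Rotation matrix Rz @ Ry @ Rx for angles given in degrees."""
    az, ay, ax = np.radians([theta_z, theta_y, theta_x])
    cz, sz = np.cos(az), np.sin(az)
    cy, sy = np.cos(ay), np.sin(ay)
    cx, sx = np.cos(ax), np.sin(ax)
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    return Rz @ Ry @ Rx

camera_position = np.array([-200.0, 500.0, 2000.0])
R_cw = rot_zyx(45.0, -30.0, 120.0)  # ASSUMPTION: camera axes expressed in {W}
R = R_cw.T                           # world-to-camera rotation
T = -R @ camera_position             # world-to-camera translation

f, pixel = 25.0, 0.023               # ASSUMPTION: same length unit for both
i_o, j_o = 258.0, 204.0              # ASSUMPTION: offsets of u and v, in that order
K = np.array([[f / pixel, 0.0, i_o],
              [0.0, f / pixel, j_o],
              [0.0, 0.0, 1.0]])
W = K @ np.hstack([R, T.reshape(3, 1)])  # 3x4 matrix of link weights

def project(W, point_w):
    """Image coordinates (u, v) of a world point under the weight matrix W."""
    o = W @ np.append(point_w, 1.0)
    return o[:2] / o[2]

# A point on the camera's optical axis lands on the image center (258, 204),
# up to floating-point rounding.
on_axis = camera_position + 1000.0 * R_cw[:, 2]
print(project(W, on_axis))
```

Weights computed this way from approximate physical parameters could also serve as initial values for training, which is the benefit of giving each link a physical meaning.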

Concluding Remarks
We have shown that a feedforward neural network can be constructed to express the image formation process of a camera. The network constructed in this paper has a quite simple structure, with four input neurons and three output neurons with linear activation functions. Although most existing applications of NNs to camera modeling have focused on the nonlinear lens distortion problem, the network of this paper models the linear perspective transformation. A method to learn the link weights between the neurons of the proposed network is also described. The entire image formation of a camera may be modeled accurately if the proposed network is combined with an existing NN-based method developed for correcting lens distortion.

Acknowledgement
The pin-hole relationships are assumed for a 3D point P = [x y z]^T in the world coordinate system {W}, its corresponding representation P_C = [x_C y_C z_C]^T in the 3D camera coordinate system {C}, the projected point at [u v]^T on the 2D image plane, and the optical image center at [i_O j_O]^T in the row-column image frame {U}. A 3D point in {W} can be transformed to its representation in {C} by a 3×3 rotation matrix R and a translation vector T.

Figure 1. Pin-hole camera model. The coordinates of an image point are computed in the model from the 3D coordinates in {C}.

Figure 4 is a practical network implementation of Figure 3.

Figure 2. Image formation model by a neural network.

Figure 3. NN built from pinhole camera model.

Figure 4. Implementation of the NN of Figure 3.

Figure 5. Connection between an input neuron m and an output neuron n.

Figure 6. Camera setup assumed as an example.

Figure 7. Neural network resulting from the camera setup.