Depth Based View Synthesis Using Graph Cuts for 3DTV

In three-dimensional television (3DTV), interactive free-viewpoint selection has received increasing attention. This paper presents a novel method that synthesizes a free-viewpoint view from multiple textures and depth maps in a multi-view camera configuration. The method solves the problem of cracks and holes caused by the sampling rate by performing an inverse warping to retrieve texture images. This step allows a simple and accurate resampling of synthetic pixels. To enforce the spatial consistency of color and to remove pixels that are warped incorrectly because of inaccurate depth maps, we propose several processing steps. The warped depth and warped texture images are used to classify pixels as stable, unstable, or disoccluded. The stable pixels are used to create an initial new view by weighted interpolation. To refine the new view, Graph cuts are used to select the best candidate for each unstable pixel. Finally, the remaining disoccluded regions are filled by our inpainting method, which is based on depth information and the texture values of neighboring pixels. Our experiments on several multi-view data sets are encouraging in both subjective and objective results. Furthermore, our proposal can flexibly use more than two views of a multi-view system to create a new view of higher quality.


Introduction
Recently, 3D-TV applications and systems have been growing rapidly. With the growing capability of capturing devices, multi-view capture systems with dense or sparse camera arrays can be built with ease, and free-viewpoint television (FTV) [1] has attracted increasing attention. In an FTV system, users can freely select the viewpoint from which they watch a dynamic real-world scene. The chosen viewpoint can be selected not only from the available camera views but also from any viewpoint between these cameras. Such a system requires a smart synthesis algorithm that allows free-viewpoint view rendering. To render a high-quality image at an arbitrary viewpoint, one has to manage three main challenges, as pointed out in [2]. First, empty pixels and holes due to the sampling of the reference image have to be closed. Second, pixels at the borders of high discontinuities cause contour artifacts. The third challenge involves inpainting the disocclusions that remain after blending the projected images (these regions are invisible from all of the surrounding cameras). In [3] it is shown that an improved rendering quality can be obtained by using the geometry of the scene. When depth information is used, a well-known rendering technique is Depth Image Based Rendering (DIBR), which involves the 3D projection or 2D warping from one viewpoint into another view.
In this paragraph, we briefly describe some recent research on free-viewpoint DIBR algorithms. In [2], the authors developed a free-viewpoint rendering algorithm based on a layered representation. For texture mapping, 3D meshes are created and the rendering is implemented on a Graphics Processing Unit (GPU). Although the results look good, the method is complex and requires a considerable amount of pre- and post-processing. This work is extended in [4], where the depth map is decomposed into three layers that are warped separately. The warped results of the layers are then merged. To deal with artifacts, three post-processing algorithms are introduced. In [5], a new viewpoint is rendered in several steps. First, the depth maps of the reference cameras are warped to the new viewpoint. Then, the empty pixels are filled with a median filter. Afterwards, the depth maps are processed with a bilateral filter. Next, the textures are retrieved by performing an inverse warping from the projected depth maps back to the reference cameras. Ghost contours are removed by dilating the disocclusions. Finally, the texture images are blended and the remaining disocclusions are inpainted using the method proposed by Telea [6]. Although the results look good, some issues remain: the median filter does not remove all holes, and a non-zero value is assigned to some pixels in disoccluded regions. This work is improved in [7] by introducing three enhancing techniques. First, resampling artifacts are filled in by a combination of median filtering and inverse warping. Second, contour artifacts are handled by omitting the warping of edges at high discontinuities. Third, disoccluded regions are inpainted using depth information. The quality of this method is higher than that of [5], but it still has disadvantages. For example, the labels of pixels at high discontinuities have to be defined, and color consistency is not verified during blending, which can leave jagged edges along straight lines. The work in [8] combines depth-based hole filling and inpainting to restore the disoccluded pixels more accurately than inpainting without depth information. However, this method produces a notable blur and can be computationally inefficient when the disoccluded region in the new view is large.
In this paper, we introduce a new free-viewpoint rendering algorithm that uses multiple color and depth images. First, the depth maps for the virtual view are created by warping the depth maps of the reference cameras. We process the warped depth maps with a median filter. Depth maps consist of smooth regions with sharp edges, so median filtering does not degrade their quality. Then, the textures are retrieved by performing an inverse warping from the warped depth maps to the reference cameras. This allows a simple and accurate resampling of synthetic pixels. After that, all warped depth and warped texture images are used to classify pixels as stable, unstable, or disoccluded. An initial virtual view is created by weighted interpolation of the stable pixels. To refine the synthetic view, the best candidates for the unstable pixels are optimally selected by Graph cuts. By defining the types of pixels and using Graph cuts, the color becomes consistent and the pixels that were warped incorrectly because of inaccurate depth maps are removed from the refined view. The remaining disoccluded pixels are inpainted using the depth and texture values of neighboring pixels. Because depth information is considered during inpainting, blurring between foreground and background textures is reduced.
The rest of this paper is organized as follows: Section 2 presents the proposed view synthesis algorithm, Section 3 shows experimental results, and Section 4 concludes this paper.

Proposed Synthesis Method
Our proposal is shown in Figure 1 and consists of six steps, which are explained below.

3D Warping the Depth Maps
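3D warping makes it possible to synthesize a new view from a reference view as follows.

Let $P_w = [X_w, Y_w, Z_w, 1]^T$ be the world point, and let $p_1 = [u_1, v_1, 1]^T$ and $p_2 = [u_2, v_2, 1]^T$ be its projections onto the reference and synthetic image planes, respectively:

$$s_1 p_1 = K_1 [R_1 \mid t_1] P_w \tag{1}$$

$$s_2 p_2 = K_2 [R_2 \mid t_2] P_w \tag{2}$$

where $s_i$ is the projective scale factor and $K_i$ is a $3 \times 3$ upper triangular matrix that represents the inner structure of camera $i$ and is called the intrinsic matrix. The $3 \times 3$ orthogonal matrix $R_i$ represents the orientation and $t_i$ represents the position. The matrix $[R_i \mid t_i]$ is called the extrinsic matrix and indicates the relationship between the world coordinates and the camera coordinates.
Rearranging (1), we can derive the 3D coordinates of the scene point:

$$[X_w, Y_w, Z_w]^T = R_1^{-1} \left( s_1 K_1^{-1} p_1 - t_1 \right) \tag{3}$$

Substituting (3) into (2), we obtain the synthetic pixel position:

$$s_2 p_2 = K_2 R_2 R_1^{-1} \left( s_1 K_1^{-1} p_1 - t_1 \right) + K_2 t_2 \tag{4}$$

Assuming that the world coordinate system coincides with the reference camera coordinate system and that the camera looks along the $Z$ direction, i.e., $R_1 = I$, $t_1 = 0$, and $s_1 = Z_w$, (4) can be rewritten as follows:

$$s_2 p_2 = K_2 R_2 K_1^{-1} Z_w p_1 + K_2 t_2 \tag{5}$$

where $Z_w$ is defined by the pixel value at coordinate point $p_1$ in the reference depth image.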
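As an illustration of the warping relation, the following minimal Python sketch maps one reference pixel with known depth into the synthetic view, assuming the world frame coincides with the reference camera frame so that $s_2 p_2 = K_2 (R_2 Z_w K_1^{-1} p_1 + t_2)$. The function names (`mat_vec`, `back_project`, `warp_pixel`) and the camera values are hypothetical and chosen only for this example; they are not taken from the paper.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[r][k] * v[k] for k in range(3)) for r in range(3)]


def back_project(K, u, v, z):
    """Return z * K^{-1} [u, v, 1]^T for an upper-triangular intrinsic matrix K."""
    fx, skew, cx = K[0]
    _, fy, cy = K[1]
    y = (v - cy) / fy
    x = (u - cx - skew * y) / fx
    return [x * z, y * z, z]


def warp_pixel(K1, K2, R2, t2, u1, v1, z_w):
    """Map pixel (u1, v1) with depth z_w from the reference to the synthetic view.

    Assumes R1 = I and t1 = 0 (world frame = reference camera frame).
    Returns (u2, v2, s2), where s2 is the depth seen from the synthetic camera.
    """
    point = back_project(K1, u1, v1, z_w)                    # 3D point in the reference frame
    cam2 = [a + b for a, b in zip(mat_vec(R2, point), t2)]   # same point in the synthetic camera frame
    s2_p2 = mat_vec(K2, cam2)                                # homogeneous pixel scaled by s2
    s2 = s2_p2[2]
    return s2_p2[0] / s2, s2_p2[1] / s2, s2


# Hypothetical cameras: identical intrinsics, synthetic camera shifted 0.1 units along +X.
K = [[1000.0, 0.0, 512.0], [0.0, 1000.0, 384.0], [0.0, 0.0, 1.0]]
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u2, v2, z2 = warp_pixel(K, K, IDENTITY, [-0.1, 0.0, 0.0], 600.0, 400.0, 2.0)
```

With these assumed values the pixel shifts horizontally by the disparity $f B / Z = 1000 \cdot 0.1 / 2 = 50$ pixels, which gives a quick sanity check of the relation.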
In our method, only the depth maps of the reference cameras are projected onto the virtual image plane. The warping is specified by:

Median Filter the Warped Depth Map
In this step, we consider the blank points that appear in the projected depth map. These blank points are caused by round-off errors of the image coordinates computed by (6) and by depth discontinuities, which can produce blank regions that are one pixel wide. Such a blank region can be filled by a median filter applied over a window of pixels. Depth maps consist of smooth regions with sharp edges, so median filtering does not degrade their quality.

Retrieve Texture Image by Inverse Warping
In this step, the textures are retrieved by performing an inverse warping from the warped depth maps back to the respective reference texture images by employing (5), such that the color of the synthetic destination pixel $p_2$ is interpolated from the pixels surrounding $p_1$ in the reference color image. Figure 4 illustrates the image rendering process using inverse warping.
The advantage of an inverse warping operation is that all pixels of the destination image are correctly defined, and the color of disoccluded pixels can be inferred by back-projecting the projected 3D point $P_{2w}$ onto multiple source image planes, covering all regions of the video scene.
Figure 5 shows the color images retrieved by inverse warping using the depth maps in Figure 3.
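The two steps above, filling blank depth points with a median and then retrieving texture by inverse warping, can be sketched as follows. This is a minimal illustration under several assumptions that are not from the paper: blank depth is encoded as 0, the window is 3 × 3, only blank pixels are replaced (using the median of their non-blank neighbors), images are single-channel nested lists, and `to_reference` is a hypothetical callback that maps a synthetic pixel and its depth to a real-valued position in the reference image.

```python
from statistics import median_low


def fill_blank_depth(depth, blank=0):
    """Replace blank pixels by the median of the non-blank values in their 3x3 window."""
    h, w = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    for y in range(h):
        for x in range(w):
            if depth[y][x] != blank:
                continue
            window = [depth[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))
                      if depth[j][i] != blank]
            if window:
                out[y][x] = median_low(window)  # median_low keeps an existing depth value
    return out


def bilinear_sample(image, x, y):
    """Interpolate a single-channel image at a real-valued position inside the image."""
    h, w = len(image), len(image[0])
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * image[y0][x0] + ax * image[y0][x1]
    bottom = (1 - ax) * image[y1][x0] + ax * image[y1][x1]
    return (1 - ay) * top + ay * bottom


def inverse_warp(ref_image, warped_depth, to_reference, blank=0):
    """For every synthetic pixel with a depth, fetch its color from the reference image."""
    h, w = len(warped_depth), len(warped_depth[0])
    out = [[None] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            z = warped_depth[v][u]
            if z == blank:
                continue  # no depth at this pixel, so no texture can be retrieved
            x, y = to_reference(u, v, z)
            if 0 <= x <= len(ref_image[0]) - 1 and 0 <= y <= len(ref_image) - 1:
                out[v][u] = bilinear_sample(ref_image, x, y)
    return out


depth = [[5, 5, 5], [5, 0, 5], [9, 9, 9]]
filled = fill_blank_depth(depth)
reference = [[0, 10, 20], [0, 10, 20], [0, 10, 20]]
# Hypothetical mapping: the reference view is shifted by half a pixel horizontally.
texture = inverse_warp(reference, filled, lambda u, v, z: (u + 0.5, float(v)))
```

Because every destination pixel pulls its color from the reference image, no cracks appear in the retrieved texture; pixels whose source position falls outside the reference image stay undefined (`None`) in this sketch.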

Pixel Classification and Initial New View Creation
Formally, suppose that we have a set of warped texture images and warped depth maps, one pair per reference view. If the depth value of a pixel $p$ is higher than the depth threshold $t_Z$ in only one input image and lower than $t_Z$ in all remaining images, we classify $p$ as a stable pixel. In this case, $p$ is visible in only one view, and its values in the synthetic view are simply copied from the visible view.
If the depth value of a pixel is higher than $t_Z$ in more than one view, the pixel is classified from its color and depth values in those views: if they satisfy the classification thresholds, we classify the pixel as a stable pixel. Otherwise, the pixel is classified as an unstable pixel. The value of an unstable pixel can be set to −1 so that it can be easily identified.
In the weighted interpolation of stable pixels, $i$ is the view index, $\theta_i$ is the angular distance of view $i$, and $w_i$ is the weight factor assigned to view $i$ at that pixel. The constant $c$ controls the fall-off as the angular distance increases.
The new view is produced by the procedure InitialView, which denotes the pixel classification and initial new view creation described above.
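The classification rule can be illustrated with a short Python sketch. It is only one possible reading of the rule: the exact agreement test between views is not recoverable from the text, so the sketch assumes that a pixel visible in several views is stable when all its candidates agree with the first visible view within a per-channel color threshold and a depth tolerance. The function name `classify_pixel` and the parameters `t_c` and `depth_tol` are hypothetical.

```python
def classify_pixel(depths, colors, t_z, t_c=15, depth_tol=5):
    """Classify one synthetic pixel from its per-view warped depth and color.

    depths: warped depth value of the pixel in each view (higher than t_z = visible).
    colors: (R, G, B) value of the pixel in each view.
    Returns (label, visible_views) with label in
    {"stable", "unstable", "disoccluded"}.
    """
    visible = [i for i, z in enumerate(depths) if z > t_z]
    if not visible:
        return "disoccluded", visible      # seen by no reference view
    if len(visible) == 1:
        return "stable", visible           # seen by exactly one view: copy its value
    first = visible[0]
    for i in visible[1:]:
        color_ok = all(abs(a - b) <= t_c for a, b in zip(colors[first], colors[i]))
        depth_ok = abs(depths[first] - depths[i]) <= depth_tol
        if not (color_ok and depth_ok):    # assumed agreement test, see lead-in
            return "unstable", visible
    return "stable", visible


only_one = classify_pixel([0, 120], [(0, 0, 0), (10, 20, 30)], t_z=1)
hidden = classify_pixel([0, 0], [(0, 0, 0), (0, 0, 0)], t_z=1)
agree = classify_pixel([120, 122], [(100, 100, 100), (105, 95, 110)], t_z=1)
conflict = classify_pixel([120, 122], [(100, 100, 100), (200, 95, 110)], t_z=1)
```

Unstable pixels keep all their visible candidates, so a later optimization step can choose among them instead of blending conflicting colors.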

Find the Best Candidate for Unstable Pixel by Graph Cuts
In this step, we focus on refining the initial synthetic view, which still contains unstable pixels. Each unstable pixel has multiple candidate pixels, and we want to select the best candidate, namely the one that minimizes the energy function described below.
We denote the labeling space by $L$. With this definition, our problem is to find the labeling $f^*$ that fills the unstable region at minimum cost.
We define our energy function based on the Markov Random Field (MRF) formulation:

$$E(f) = \sum_{p \in P} D_p(f_p) + \lambda \sum_{(p,q) \in N} V_{p,q}(f_p, f_q)$$

where $f$ is the labeling field, $P$ is the set of unstable pixels, and $N$ is the pixel neighborhood system.
$D_p(f_p)$ is called the data term, which defines the cost of assigning label $f_p$ to pixel $p$. $V_{p,q}(f_p, f_q)$ denotes the smoothness term, which evaluates the cost of disagreement between neighboring pixels $p$ and $q$ that are assigned labels $f_p$ and $f_q$, respectively. $\lambda$ is a parameter that weighs the relative importance of these two terms.
Data term. The first part of the data term enforces agreement between the selected candidate pixel and its neighboring pixels. A neighboring pixel that is disoccluded does not influence the candidate selection. In addition, selecting a candidate pixel with a smaller depth value $Z$ is penalized less, because the pixel with the smallest depth value is closest to the camera and is therefore more likely to define the color of the synthetic pixel. The second part of (16) is the stationary cost, which is defined by the color similarity at pixel $p$ across all input images: the more input images in which the pixel has a similar color, the smaller the stationary cost.
Smoothness term. $V_{p,q}(f_p, f_q)$ measures the penalty for two neighboring pixels $p$ and $q$ with different labels and is defined using the Euclidean distance in RGB color space. The smoothness term gives a higher cost if $f_p$ and $f_q$ do not match well. By incorporating this smoothness term, we obtain a visually smooth synthetic image.
We apply the graph cuts optimization that is publicly available in [9] to minimize our energy function $E(f)$. More detail about energy minimization with graph cuts can be found in [10,11].
The refinement of the image in Figure 7, obtained by using graph cuts to select the best candidate for each unstable pixel, is shown in Figure 8.
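To make the labeling problem concrete, the sketch below evaluates an MRF energy of the form "data cost plus weighted smoothness cost" and minimizes it by exhaustive search over a three-pixel toy problem. Exhaustive search is swapped in for graph cuts only to keep the example dependency-free; it is feasible for a handful of pixels and nothing more. The data costs, candidate colors, the weight `lam`, and the particular smoothness function (zero for equal labels, RGB Euclidean distance between the chosen candidates otherwise) are assumptions made for illustration, not the paper's definitions.

```python
import math
from itertools import product


def energy(labels, data_cost, neighbors, smooth_cost, lam):
    """E(f) = sum_p D_p(f_p) + lam * sum_{(p,q) in N} V_pq(f_p, f_q)."""
    data = sum(data_cost[p][labels[p]] for p in labels)
    smooth = sum(smooth_cost(p, q, labels[p], labels[q]) for p, q in neighbors)
    return data + lam * smooth


def minimize_exhaustively(pixels, num_labels, data_cost, neighbors, smooth_cost, lam):
    """Try every labeling; a stand-in for graph cuts on tiny problems only."""
    best_labels, best_energy = None, math.inf
    for assignment in product(range(num_labels), repeat=len(pixels)):
        labels = dict(zip(pixels, assignment))
        e = energy(labels, data_cost, neighbors, smooth_cost, lam)
        if e < best_energy:
            best_labels, best_energy = labels, e
    return best_labels, best_energy


# Three unstable pixels in a row; label k means "take the candidate from view k".
candidates = {
    0: [(100, 100, 100), (180, 180, 180)],
    1: [(102, 100, 100), (180, 180, 180)],
    2: [(100, 102, 100), (180, 180, 180)],
}
data_cost = {0: [1.0, 5.0], 1: [3.0, 2.5], 2: [1.0, 5.0]}
neighbors = [(0, 1), (1, 2)]


def smooth(p, q, lp, lq):
    """Assumed smoothness cost: no penalty for equal labels, color distance otherwise."""
    return 0.0 if lp == lq else math.dist(candidates[p][lp], candidates[q][lq])


best_labels, best_energy = minimize_exhaustively(
    [0, 1, 2], 2, data_cost, neighbors, smooth, lam=0.1)
data_only_choice = min(range(2), key=data_cost[1].__getitem__)
```

In this toy problem the data term alone would pick the candidate from view 1 for the middle pixel, while the full energy prefers the labeling that is consistent with both neighbors. That is the behavior the smoothness term is meant to encourage.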

Inpainting Disocclusion Pixels Based on the Depth and Color Values of Neighboring Pixels
At this point, only the disoccluded regions remain. To deal with these disoccluded pixels, many papers such as [5,8] have developed algorithms based on the inpainting method proposed by Telea [6]. Inpainting is the process of reconstructing lost or corrupted parts of an image using the values of neighboring pixels. Although these algorithms work sufficiently well, the resulting inpainted regions contain a notable blur because background and foreground colors are mixed at the edges of the disoccluded regions. In this paper, we develop a technique based on inpainting with depth information. We assume that the disoccluded pixels belong only to the background, and we employ depth information to accurately select background pixels at the edges of the disoccluded regions so that the blur can be avoided. Our method consists of the following steps. First, to reduce processing time, we find the small disoccluded regions by defining a window of size $3 \times 3$ centered at the disoccluded pixel $p$ and counting the visible pixels inside this window. If the number of visible pixels $M$ inside this window is higher than 50%, then the disoccluded pixel is inpainted by a weighted interpolation from the visible pixels, which is specified by (19), where $M$ is the number of visible pixels inside the window, $O$ is the disoccluded region, and the interpolation uses the distance from the disoccluded pixel to each visible pixel $i$.
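The first inpainting step for small disoccluded regions can be sketched as below. The interpolation formula itself is not recoverable from the text, so the sketch assumes inverse-distance weights, which is only one plausible choice. The function name `inpaint_small_hole`, the single-channel image layout, and the boolean `visible` mask are also assumptions for this example.

```python
import math


def inpaint_small_hole(color, visible, y, x):
    """Interpolate a disoccluded pixel (y, x) from the visible pixels of its 3x3 window.

    Returns None when 50% or fewer of the window pixels are visible, so that the
    pixel is left for the later depth-guided step.
    """
    h, w = len(color), len(color[0])
    window = [(j, i)
              for j in range(y - 1, y + 2)
              for i in range(x - 1, x + 2)
              if 0 <= j < h and 0 <= i < w]
    vis = [(j, i) for j, i in window if visible[j][i] and (j, i) != (y, x)]
    if len(vis) <= 0.5 * len(window):
        return None
    # Assumed weighting: each visible pixel contributes with weight 1 / distance.
    weights = [1.0 / math.hypot(j - y, i - x) for j, i in vis]
    total = sum(wt * color[j][i] for wt, (j, i) in zip(weights, vis))
    return total / sum(weights)


color = [[20, 10, 20], [10, 0, 10], [20, 10, 20]]
visible = [[True, True, True], [True, False, True], [True, True, True]]
value = inpaint_small_hole(color, visible, 1, 1)

sparse = [[True, False, False], [True, False, False], [True, False, False]]
skipped = inpaint_small_hole(color, sparse, 1, 1)
```

Handling these isolated holes first keeps the more expensive directional search for the larger disoccluded regions.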

Experimental Results
We quantify the performance of the proposed method using the Peak Signal-to-Noise Ratio (PSNR) and the structural similarity (SSIM) index between a reference image $r$ and a synthetic image $s$. The SSIM index measures the similarity between two images [12]. An SSIM value of 1 is reachable only when the two images are identical, and a higher PSNR normally indicates a higher-quality synthetic image. Before computing the PSNR, the images are converted from the RGB color space to the YUV color space, and the Y channel is used for the calculation. The PSNR can be calculated by

$$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\frac{1}{w h} \sum_{x=1}^{w} \sum_{y=1}^{h} \left( Y_r(x, y) - Y_s(x, y) \right)^2}$$

where $w$ and $h$ are the image width and height, and $Y_r$ and $Y_s$ are the Y channels of the reference image and the synthetic image, respectively.
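The PSNR computation on the Y channel can be sketched in a few lines of Python. The luma coefficients below are the common ITU-R BT.601 weights and the peak value of 255 assumes 8-bit images; both are assumptions, since the exact conversion used in the experiments is not given here. The function names `luma` and `psnr_y` are hypothetical.

```python
import math


def luma(rgb):
    """Y channel from an (R, G, B) triple, using BT.601 weights (an assumption)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def psnr_y(reference, synthetic, peak=255.0):
    """PSNR in dB between the Y channels of two equally sized RGB images."""
    h, w = len(reference), len(reference[0])
    mse = sum((luma(reference[y][x]) - luma(synthetic[y][x])) ** 2
              for y in range(h) for x in range(w)) / (w * h)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


ref_img = [[(100, 100, 100)] * 2 for _ in range(2)]
syn_img = [[(110, 110, 110)] * 2 for _ in range(2)]
score = psnr_y(ref_img, syn_img)       # mean squared error of 100 on the Y channel
perfect = psnr_y(ref_img, ref_img)     # identical images give an infinite PSNR
```

A uniform error of 10 gray levels already drops the score to roughly 28 dB, which gives a feel for the scale of the values reported in this section.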
The proposed view synthesis has been tested on the "Break-dancer" and "Ballet" sequences, which were generated and distributed by the Interactive Visual Media Group at Microsoft Research [13]. Each dataset includes a sequence of 100 images of 1024 × 768 pixels captured from 8 cameras, together with the calibration parameters. Figure 9 shows the camera arrangement of these two sequences. Depth maps for each view are also provided. For more detail about the generation of these depth maps, please refer to [2].
In our experiments, the synthetic view is set to coincide with an actual camera. Views 3 and 5 are used with their depth maps to synthesize view 4. Figure 10 shows an example of the view synthesis results. The experimental results show that the proposed method achieves on average over 34 dB in PSNR and an SSIM index of 0.93 on the two sequences "Break-dancer" and "Ballet".
Figure 11 compares our PSNR and SSIM with those of Sohl et al. [14] over 100 frames for the "Break-dancer" and "Ballet" sequences.
Because the number of cameras is usually limited, the camera arrangement is very important for obtaining a good quality of the synthesized view. Figure 12 shows our synthesis quality for varying distances between the two reference cameras, compared with the method presented by Mori et al. in [5]; our measurements correspond to an average over 100 frames.
The measured synthesis qualities are compared with other methods and summarized in Table 1. From the results, the average PSNR of our proposal is superior to that of other methods such as Mori et al. [5] and Sohl et al. [14], with a gain of 3.0 dB. The structural similarity (SSIM) of our method is higher than that of the method of Sohl et al. Moreover, in a multi-view configuration, several cameras capture the scene from different positions. In our experimental case, there are 8 cameras. Thus, instead of using only two neighboring views as in the conventional methods above, we can use more than two images to synthesize a new view, which our proposal supports easily. Our experiment shows that using four reference views (two views on the left side and two on the right side) to synthesize a new view yields a higher PSNR (by about 0.5 to 1 dB) and a higher SSIM than using two reference views.

Conclusions
In this paper, we propose a novel synthesis method that makes it possible to render a free viewpoint from multiple existing cameras. The proposed method solves the main problems of depth-based synthesis by performing pixel classification to generate an initial new view from stable pixels and by using Graph cuts to select the best candidate for unstable pixels. By defining the types of pixels and using Graph cuts, the color becomes consistent and the pixels that were warped incorrectly because of inaccurate depth maps are removed. The remaining disoccluded pixels are inpainted using the depth and texture values of neighboring pixels. Because depth information is considered during inpainting, blurring between foreground and background textures is reduced. Experimental results show that the proposed method is strong at artifact reduction. In addition, our smoothness term makes the result visually smooth. Objective evaluation has shown that our method achieves a significant gain in PSNR and SSIM compared with some other existing methods. Another advantage of our method is that it can use a set of un-rectified images from a multi-view system to create a new view of higher quality.
The drawback of our method is the use of Graph cuts, which is time-consuming. However, we apply Graph cuts only to unstable pixels, which form a small fraction of the whole image, so the time spent on Graph cuts is limited.
Future work will focus on further improving the synthesis quality by utilizing temporal information in successive video frames.

Figure 1. Proposed new view synthesis algorithm.

The projected depth maps in Figure 2 can be processed with a median filter to obtain the images in Figure 3.

Figure 2. The projected depth maps from two reference cameras (from the left side and from the right side).

Figure 4. Image synthesis process using inverse warping.

Figure 5. Color images obtained by inverse warping.

The color and depth values of a stable pixel in the synthetic view are rendered by blending the candidate pixels with a weighted interpolation over the color value and depth value of the pixel at each view $i$. The weight assigned to each view should reflect its proximity to the view being synthesized: views that are closer to the synthetic view should have a larger weight. In the general case, the weight $w_i^p$ can be set based on the baseline spacing. However, for more precise weighting, we use the angular distance determined by the 3D point and the camera positions, as shown in Figure 6. Input views for which $\theta_i > \pi/2$ are eliminated, as they view the scene from the other side. In practice, a fall-off constant of $c = 1$ or $c = 2$ has been found to work well.

Figure 6. Weighted interpolation based on angular distances.

In the data term, the color value and the disocclusion indicator $O_q$ of a neighboring pixel $q$ are used, the two parts of the term are balanced by weight factors, and $I_i(p)$ is the color value of pixel $p$ in input image $i$.
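The angular weighting can be illustrated with the following Python sketch. The angle between the rays from the 3D point to a reference camera and to the synthetic camera follows directly from the description, but the fall-off function is not recoverable from the text: the sketch assumes a weight of $\cos^c(\theta_i)$, which is only one function that decreases with the angle and vanishes at $\pi/2$. The names `angular_distance` and `view_weights` and the camera positions are hypothetical.

```python
import math


def angular_distance(point, cam_i, cam_virtual):
    """Angle at the 3D point between the rays to camera i and to the synthetic camera."""
    a = [c - p for c, p in zip(cam_i, point)]
    b = [c - p for c, p in zip(cam_virtual, point)]
    cos_t = sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))
    return math.acos(max(-1.0, min(1.0, cos_t)))


def view_weights(point, cams, cam_virtual, c=2):
    """Normalized per-view weights; views with an angle of pi/2 or more get weight 0."""
    raw = []
    for cam in cams:
        theta = angular_distance(point, cam, cam_virtual)
        # Assumed fall-off: cos(theta) ** c, see the lead-in.
        raw.append(0.0 if theta >= math.pi / 2 else math.cos(theta) ** c)
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else raw


point = (0.0, 0.0, 10.0)
virtual_cam = (0.0, 0.0, 0.0)
# Hypothetical cameras: one close to the synthetic view, one farther away,
# and one on the opposite side of the scene point.
weights = view_weights(point, [(1.0, 0.0, 0.0), (-3.0, 0.0, 0.0), (0.0, 0.0, 20.0)], virtual_cam)
```

The closest camera receives the largest weight, and the camera that sees the point from the other side is excluded, which matches the qualitative behavior described for the interpolation.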

Figure 7. Initial synthesized view with three types of pixels (white pixels are unstable pixels, red pixels are disoccluded pixels, and the remaining pixels are stable pixels).

Figure 8. Refinement of the initial synthesized image (the image in Figure 7) using graph cuts (red pixels are disoccluded pixels).

The weighted interpolation for small disoccluded regions uses the color and depth values of each visible pixel $i$. Second, for each pixel $o$ in the remaining disoccluded regions, we search in eight directions to find the pixel $u$ that has the smallest depth value $Z_{\min}$ at the edge of the disoccluded region, together with the distance from this point to $o$. We then define a window and count the visible pixels inside it whose depth value $Z$ lies within 5 of $Z_{\min}$. If fewer than 50% of the pixels inside the window are such visible pixels, we increase the size of the window. Finally, the disoccluded pixels are inpainted by a weighted interpolation from the visible pixels according to (19), which completes the inpainting procedure of this step.
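The eight-direction search of the second inpainting step can be sketched as follows. The sketch walks from a disoccluded pixel along each of the eight directions until it meets the first visible pixel, then keeps the edge pixel with the smallest depth value together with its distance. The function name `nearest_edge_min_depth`, the boolean `visible` mask, and the example depth values are assumptions made for illustration.

```python
import math

DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def nearest_edge_min_depth(depth, visible, y, x):
    """Search eight directions from disoccluded pixel (y, x).

    Returns (z_min, distance, (row, col)) for the edge pixel with the smallest
    depth value, or None if no visible pixel is reached in any direction.
    """
    h, w = len(depth), len(depth[0])
    best = None
    for dy, dx in DIRECTIONS:
        j, i = y + dy, x + dx
        while 0 <= j < h and 0 <= i < w and not visible[j][i]:
            j, i = j + dy, i + dx          # keep walking through the hole
        if 0 <= j < h and 0 <= i < w:      # stopped on a visible edge pixel
            candidate = (depth[j][i], math.hypot(j - y, i - x), (j, i))
            if best is None or candidate[0] < best[0]:
                best = candidate
    return best


depth_map = [
    [120, 120, 120, 120, 120],
    [50, 0, 0, 0, 200],
    [120, 120, 120, 120, 120],
]
visible_mask = [
    [True, True, True, True, True],
    [True, False, False, False, True],
    [True, True, True, True, True],
]
edge = nearest_edge_min_depth(depth_map, visible_mask, 1, 2)
```

The returned depth and distance can then seed the window used to collect visible pixels with similar depth for the final weighted interpolation.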

For each color channel, the color threshold is set to 15 in our case. The depth threshold is expressed as a brightness value in the depth map. When the depth value of a pixel is higher than the depth threshold $t_Z$ in more than one view, we examine both the color and depth values of the pixel to determine its type.

Table 1. Measured synthesis quality of the proposed method compared with other methods.