A Two-Stage Algorithm of High Resolution Image Alignment for Mobile Applications

Global motion estimation (GME) algorithms are widely applied to computer vision and video processing. In the previous works, the image resolutions are usually low for the real-time requirement (e.g. video stabilization). However, in some mobile devices applications (e.g. image sequence panoramic stitching), the high resolution is necessary to obtain satisfactory quality of panoramic image. However, the computational cost will become too expensive to be suitable for the low power consumption requirement of mobile device. The full search algorithm can obtain the global minimum with extremely computational cost, while the typical fast algorithms may suffer from the local minimum problem. This paper proposed a fast algorithm to deal with 2560 × 1920 high-resolution (HR) image sequences. The proposed method estimates the motion vector by a two-level coarse-to-fine scheme which only exploits sparse reference blocks (25 blocks in this paper) in each level to determine the global motion vector, thus the computational costs are significantly decreased. In order to increase the effective search range and robustness, the predictive motion vector (PMV) technique is used in this work. By the comparisons of computational complexity, the proposed algorithm costs less addition operations than the typical Three-Step Search algorithm (TSS) for estimating the global motion of the HR images without the local minimum problem. The quantitative evaluations show that our method is comparable to the full search algorithm (FSA) which is considered to be the golden baseline.


Introduction
Global motion estimation (GME) had been widely applied to video processing and computer vision in decades.
For example, video stabilization, motion compensation, and the popular image panoramic stitching which can make a photograph with wide field of view.There are two major tasks to stitch a sequence of images: 1) find the global motions of the images with respect to the previous ones and then 2) stitch the sequence to produce a panoramic image.After all the images have been aligned, there are various algorithms to stitch the images.For instance, one may refer to the efficient ways which are to find the optimal seams [1]- [4], or the more complicated ones [5] [6].This study will focus on the global motion estimation, thus the discussions about the stitching algorithms are beyond the scope of this paper.If the readers are interested in the stitching algorithms, please refer to the related works.
Global motion estimation algorithms can be classified into two categories: direct methods [7]- [18] and feature-based methods [19]- [24].The direct methods aim to obtain a global motion through global minimization of some cost functions by using the image pixels directly.On the contrary, the feature-based methods first locate sparse set of keypoints in the images and then obtain the global motion parameters by matching the feature correspondences.In this paper, we focus on the direct methods since the feature-based methods are computational expensive in feature description and feature matching, which makes feature-based methods unsuitable for the low power consumption requirement in mobile devices.
The direct methods can be classified into two subgroups: full search kind algorithms [7]- [12] and fast algorithms [13]- [18].The original full search algorithm compares all the positions in the search windows, which makes the computations extremely expensive.There are some accelerated versions of full search algorithm, e.g.projection-based [9] [10], or skipping some checking points by some criteria of mathematical inequality [11] [12].Unfortunately, the computations still significantly increase as the search range and block size increase.On the contrary, the fast algorithms such as three-step search (TSS) [13] [14], four-step search [15], special patterns (diamond, hexagon, etc.) search [16] [17], and gradient-descent [18], are well-known for their quick convergence property.However, the fast algorithms usually suffer from the local minimum problem.For the applications of panoramic stitching, especially when the user captures the sequence with large motions, misalignments may produce apparently discontinuities on the seam lines.Therefore, the fast algorithms are not suitable for the panoramic stitching purposes.
Another issue is power consumption for mobile devices.The power consumption will be reduced if 1) the computation costs and 2) the memory access are as low as possible.Therefore, we aimed at developing an algorithm which is fast and low-memory-access while keeping the accuracy as comparable to the full search algorithm as possible.
The remainder of this paper is organized as follows.Section 2 gives the brief review of related works.The overall details of the proposed algorithm are described in Section 3. The comparisons of computational complexity are presented in Section 4. Accuracy verifications are presented in Section 5. Finally, conclusions are summarized in Section 6.

Related Works
The full search kind algorithms usually consider all the positions in a search window and determine the motion vectors by minimizing some cost functions.The traditional full search algorithm (FSA) computes the sum of absolute differences (SAD) between a reference block and candidate blocks by block matching to determine the motion vector of the reference block.The FSA is able to find a global minimum in search window while the computational cost is extremely high.There are some accelerated versions of FSA: Tu et al. [9] proposed the projection method to accelerate the SAD computations of block matching; Puglisi et al. [10] proposed an modified version of projection method [9], which is more efficient since only the candidate blocks satisfied the conditions are further used to compute the SAD; the other kind of accelerated version is to skip the computation of SAD by some mathematical inequality [11] [12].Although these accelerated versions significantly reduced the computational costs in contrast to original FSA, the computational costs still greatly increase as the search range and block size increase with image resolution.
On the other hand, the fast algorithms assume the energy functions (e.g.SAD) are unimodal and the optimal solution is quickly converge to the minimum using different optimization algorithms.The typical fast algorithms, for example, three-step search (TSS) algorithms [13] [14], four-step search (4SS) [15], special patterns (e.g., diamond, hexagon) search [16] [17], and gradient-descent [18], are well-known for their quick convergence property.The TSS algorithm determines motion vector for one block in three iterations, and the computation of SAD are significantly less than FSA.The 4SS algorithm is a modified version of TSS which is more robust and accurate than TSS.The special pattern search algorithms are also well-known for their fast convergence and compatible peak signal-to-noise ratio (PSNR) quality in video coding.However, the real world scenes are complex and the SAD is not unimodal in general, hence there might be local minimum problem.Figure 1 shows the illustration of local minimum problem for 1-D case.The optimal solution may converge to local minimum instead of global minimum if the initial position is close to the local minimum.

The Proposed Algorithm
The proposed algorithm is a two-level scheme which processes the QVGA versions of the original HR images followed by a motion vector refinement in HR images.Although the down-sampled image will lose some detail informations, the major structures and edges in the image are usually preserved.The QVGA global motion multiplied by 8 will approach to the HR global motion, hence we only need to search for small range in HR domain to refine the HR global motion vector.This significantly reduce the computation costs in HR full search.
The overall flow chart is shown in Figure 2. We first down-sample all the HR images to obtained the QVGA ones, and only select QVGA N reference blocks uniformly distributed in the center of image in order to reduce the  computational costs.Every block is predicted by the global motion vector of previous image or the local motion vector of the processed neighbor block, in order to extend the effective search range.Once the motion vectors of all blocks are obtained, we perform the simplest clustering algorithm to reject the "outliers" of motion vectors and obtain the global motion vector which is equal to the center of the largest cluster.Finally, we apply the global motion vectors of QVGA images to predict the motion of HR images and refine the motion vectors of HR N blocks in a small search range followed by the same motion vectors clustering to obtain the final HR global motion vectors of all images.
In the following subsections, we will describe all the details of the proposed algorithm.Section 3.1 first briefly reviews the traditional full search and then describes the global motion estimation in the QVGA resolution, including the details of the predictive motion estimation and clustering algorithm.Section 3.2 describes the HR global motion refinement.

Reference Blocks Reduction
The block-based motion estimation algorithms usually divide the image of size H W × into N N × pixels sub-blocks, as shown in Figure 3.The full search algorithm computes the sum of absolute differences (SAD) between reference image 1 t I + and all the candidate blocks within a search window in the previous image t I , the local motion of one reference block is determined by the criteria of minimizing the SAD: arg min , , After all the local motion vectors of reference blocks are estimated, the global motion vector is estimated by averaging all the local motion vectors or by some statistical method, e.g.histogram, least-square, etc.
Figure 4 shows the full search algorithm within a ( ) ( ) Assume the total number of blocks is B N , the number of addition operations for a global motion estimation is  ( ) For a QVGA(320 × 240) image, suppose 16 N = , then there are 300 B N = reference blocks for global motion estimation.In fact, the reference blocks in neighbor usually have similar motion vectors, hence only sparse reference blocks are needed.The reference blocks should have the following properties: • Repeatability: The reference blocks in 1 t I + are supposed to find their true correspondences in t I .This means that the blocks near the four boundaries of image are not suitable.
• Independence: The reference blocks in image 1 t I + should be as uncorrelated to each other as possible, hence the estimated global motion will approach to the true camera motion rather than the motions of moving objects.
Taking the above properties into considerations, we uniformly take 25 sample blocks in the center of image, as shown in Figure 5.We only select the samples from the center of image with size ( ) ( ) The central part of image is uniformly divided into 5 5 × large blocks, then the reference blocks of size N N × are sampled from the center of these large blocks.This sampling scheme significantly reduces the number of reference blocks.The comparisons of the accuracy between the proposed method and the traditional full search algorithm are shown in Section 4.
Although the number of reference blocks B N is reduced to 25, the computational costs significantly increase as the search range R and block size N increase.The block size N can not be small since the reference blocks have to contain representative and enough informations.In this paper, we set 16 N = for QVGA global motion estimation, which is a common size in many related works.For the sequence panoramic stitching purposes, users may move the camera in large motions between two successive images, hence the search range R should be as large as possible.However, according to (2), if R increases by a factor of M, the total number of addition operations will increase by a factor of 2 M .Hence we need a solution to increase the search range while keeping the computation costs unchanged.This paper utilized the idea of predictive motion vector (PMV) which is widely applied to video coding [25].

Predictive Motion Estimation
Intra/inter frame motion prediction has been widely used in video coding for local motion vector predictions.The intra-frame prediction provides possible motion to adjacent reference blocks hence the blocks can find the global minimum beyond the search range R, i.e. the search range is effectively "extended" without any additional computation.Figure 6 shows the illustration of intra-frame motion prediction.The inter-frame prediction assumes the motions of camera between two successive frames (or photoshots) are similar, hence the motion of previous frame can be exploited to predict the possible motion of current frame.Figure 7 shows the inter-frame motion prediction.Figure 8 gives an example for a pair of images in an outdoor sequence.In this example, the global motion vector is greater than the search range in horizontal direction.It is clear that the global motion vector estimated with prediction is more reliable since there is exactly a dominant motion in the histogram.In the proposed algorithm, the the reference blocks of the first row apply the inter-frame prediction and the rest of rows apply the intra-frame prediction, as shown in Figure 9.In this paper, we proposed a PMV selection scheme for the intra-frame prediction, as shown in Figure 10.We only consider three processed blocks adjacent to the current block since the neighborhood blocks give reliable prediction.In order to increase robustness, the PMV is estimated by choosing the median of the neighbor MV: The motion vectors in the first row blocks and first column blocks are less precision since the blocks select PMV from GMV of previous frame or the motion vectors of blocks which are not adjacent to them.Considering this problem, the motion vectors of blocks in the first row and first column are only for intra-frame prediction purposes, hence only the rest of 16 motion vectors are selected for further processing.
The simplest way to estimate the GMV is to averaging all the motion vectors of reference blocks, i.e., ∑ .However, averaged motion vector suffered from the "outliers" problem since all motion vectors have equivalent weights 1 . The outliers values will significantly influence the averaged value  especially when the number of blocks is small.To alleviate this problem, we applied a simple clustering algorithm to automatically cluster the motion vectors into several (at least one) clusters, and the final global motion vector of a QVGA image is obtained by averaging all the motion vectors in the largest cluster.

Threshold-Order-Dependent Clustering
The threshold-order-dependent (TOD) clustering [26] is the simplest algorithm in data clustering.The clustering result depends on the order of input data and the distance threshold θ (i.e.radius of a cluster).Figure 11 shows an example of 2D data clustering.The only one parameter is the distance threshold, which controls the final number of clusters.If the threshold is small, there might be a lot of isolated clusters; on the contrary, if the threshold is large, the largest cluster probably contains the "outliers" which decrease the precision in our global motion estimation.In this paper, we set 5 θ = by experience.The details of TOD algorithm are shown in Algorithm 1.After motion clustering, the global motion vector is equal to the center of the largest cluster, for example, the red cluster shown in Figure 11.By motion vectors clustering, the outliers problem is avoided.Note that the computational complexity of TOD is ( ) 2 O N , however, the number of data (motion vectors) in this paper is only 16, hence the computation cost of TOD is negligible (compared with the computation cost of motion estimation).

Global Motion Refinement for HR(2560 × 1920) Images
After the GMV of QVGA image is obtained, we only need to refine the global motion of HR image by exploiting the fact that the image contents are similar to the QVGA version.The division of image is the same as the QVGA version.We multiply the GMV of QVGA by 8 as the PMV of HR image, and simply perform the traditional full search in a small search range.Note that the search range in HR domain is reduced to   The proposed global motion estimation algorithm applied the idea of predictive motion vectors, hence the search range is effectively enlarged without additional computations.The comparisons of computational complexity are given in next section.

Computational Complexity
In this section, the proposed algorithm is compared with four algorithms, which are full search algorithm (FSA), three-step search (TSS) algorithm [13], and projection-based method [9] [10], respectively.The factor N F is set to 89.74 in [10].We only compare the addition (subtraction) operations in the block matching since other computations are rather minor.
Table 1 lists the computational costs of motion estimation for one block.The proposed algorithm is a modified version of FSA, hence the computational complexity for one block is the same as FSA.The proposed method only computes 25 blocks in QVGA and 16 blocks in HR while the others computes all the blocks to determine the global motion vector.We first compare the computation costs under QVGA resolution with block size 16 N = and search range 32 R = .Table 2 shows that our method is much faster than FSA since we only consider 25 blocks, but is still slower than others especially the TSS algorithm.Consider the HR motion vectors, our method provide the effective search range: Hence, in our case, the effective search range of HR motion vectors is 272 With the block size 128 N = , Table 3 shows that our method is much faster than the algorithms proposed in [9] [10] and even faster than the fast algorithm TSS [13].Table 4 lists the computational speed ratio with respect to FSA for HR global motion estimation.The proposed method is 4671.58times faster than FSA and 18.05 times faster than the projection-based method [10].
Moreover, the effective search range in our method is more than 272 since we applied the PMVs to extend the QVGA search range.The effective search range in QVGA is 1 . Note that we only consider 1E R as twice the 1 R for conservative estimation.In fact, 1E R could be more than 1 2R since we select the PMV from adjacent blocks which were predicted by their neighbors.Consider 1 , which is usually enough for ordinary usage of HR panoramic stitching.The computational costs are shown in Table 5, and the Table 6 lists the speed ratio with respect to FSA in HR motion estimation with 528 E R = .There is no additional computations since we applied the PMVs to extend the overall search range.
The proposed algorithm can be further accelerated for sequence panoramic stitching purposes since the users usually capture an image sequence in horizontal camera motions with only little vertical jitters.In such kind of applications, we can reduce the Y-direction search range to accelerate the proposed algorithm.

Accuracy Verifications
The proposed algorithm is a fast version of FSA which determines the global motion vector with only 25 reference blocks, hence we need to verify the accuracy of our method with the original FSA, which is considered as the golden baseline.In this section, we apply two different comparisons to verify the accuracy of our    method.Section 5.1 gives the comparisons of global motion vectors obtained by FSA and our method.Another comparison is PSNR in the overlapped region of every two successive images.For panoramic stitching applications, the overlapped regions of two successive images should be as similar as possible, such that there will be less visual discontinuities.In Section 5.2, we apply the PSNR value as the quantitative index to show that the global motion vector errors do not affect the stitching quality since the PSNR differences are below 0.5 dB, which is usually treated as noise-level difference and negligible.

Estimation Errors
If we disregard the expensive computations of FSA, the GMV obtained by FSA can be the ground truth since FSA takes all the candidate blocks in search windows into considerations.Therefore, we should compare the GMV obtained by our method with the GMV obtained by FSA, the GMV error is defined as follows: We tested our method with six real-world panoramic sequences, including three indoor scenes and three outdoor scenes.There are at least 19 images in each sequence, and there are total 148 pairs of successive images used to estimate the global motion vectors.All the sequences were captured by hand-held camera and the major motion was horizontal.Figure 13 shows the sample images of the six sequences.
Figure 14 shows the histograms of GMV error calculate by (5).The first column represents X-direction error histograms and the second column represents Y-direction error histograms.The horizontal axis of histograms are the GMV error (in pixels) and vertical axis are the number of images.We can observe that the number of images with great GMV errors in the indoor sequences are more than outdoor ones since there are more homogeneous regions (e.g.ceiling, walls, etc) which will degrade the accuracy of motion estimation and large disparities which made local motion vectors in large variance.The outdoor scenes usually contain more textures which are helpful to the block matching and the disparities are usually small, hence there are small number of images with large GMV errors in contrast to the indoor scenes.Figure 14 shows that the accuracy of our method is comparable to FSA.Even in the indoor sequences with large motions, which are challenging for global motion estimation, our method performed a satisfactory accuracy (GMV errors within 5 pixels) in most of images.

Quantitative Evaluation
The GMV errors is not enough to describe "how accurate our method is".Another way to evaluate the accuracy of the GMV is to measure the "similarity" of the overlapped region of two images, i.e., the more accurate GMV, the less difference between the two images in the overlapped region.A well-known index to evaluate the "similarity" of two images is peak signal-to-noise ratio (PSNR) [27], which is defined as follows:

PSNR I i j I i j N
where Ω is the overlapped region as shown in Figure 15 and N Ω is the number of pixels in the overlapped region.The PSNR value computed in the overlapped region of two images using the GMV obtained by FSA (3) (4) (5) should be the highest one.The only thing we need to prove is that the difference between the PSNR of two images using GMV of FSA and the PSNR of two images using the GMV of our method is small enough.Therefore, we used the same sequences in Section 5.1 to compare the PSNR differences, which is defined as follows:  histograms of PSNR differences of outdoor sequences.The horizontal axis are PSNR ∆ (in dB) and the vertical axis are the number of image pairs.Note that most of the PSNR ∆ values are less than 0.5 dB, which is usually considered as the noise-level difference, i.e., the similarity of the overlapped region in two images based on our GMV is comparable to the one based on GMV of FSA.

Conclusions
This paper proposed a fast global motion estimation algorithm for HR (2560 × 1920) image alignment of mobile applications.The proposed method is a modified version of full search algorithm which only considers 25 reference blocks uniformly distributed in the center of image.By applying the predictive motion vector scheme, our method is able to deal with the large camera motions and even faster than the typical three-step search (TSS) algorithm.The local minimum problem is avoided since the proposed method is a kind of full search algorithm.Six real-world sequences with total 148 pairs of successive images are used to verify our method by comparing the GMV errors and similarity with FSA.The first comparison shows that the GMV differences between our method and FSA are less than 5 pixels in both X and Y directions for most cases.The second comparison shows that the similarity of the overlapped region in two images using our GMVs is comparable with the one using GMVs of FSA.
In the future, we will focus on solving the challenging problems which make the block matching task difficult.For example, illumination differences between two successive images, large disparity in the scenes.These problems exist in real world image sequences especially in indoor scenes.We will aim to improve the algorithm to alleviate the problems and increase the accuracy.

Figure 1 .
Figure 1.The illustration of local minimum problem for 1-D case.

Figure 2 .
Figure 2. The overview of the proposed algorithm.

Figure 3 .
Figure 3.The traditional block based motion estimation algorithms divide the image into subblocks with size of N × N.

Figure 4 .
Figure 4.The full search algorithm compares all the positions in the search window to determine the motion vector.

Figure 5 .
Figure 5.The proposed reference blocks selection scheme.

Figure 9 .
Figure 9. PMV selection for all the blocks.The reference blocks in the 1st row are predicted by the inter-frame prediction.The blocks in the rest of rows are predicted by the intra-frame prediction.

Figure 10 .
Figure 10.The proposed intra-frame PMV selection scheme.The motion vectors of green blocks are exploited to predict the motion of white block.

Figure 11 .
Figure 11.An example of clustering.
16 R = , and the size N of reference blocks is multiplied by 8, i.e. 128 N = .We only compute the motion vectors of 16 HR N = reference blocks corresponding to the 16 blocks in QVGA image, i.e., we ignored the unreliable motion vectors of the first row blocks and first column blocks.Finally, the refined global motion vector is determined by TOD clustering of the motion vectors of the 16 blocks.

Figure 12
shows the HR global motion vector refinement.The proposed algorithm is completely described in Algorithm 2.

Figure 12 .
Figure 12.Illustration of the motion estimation in high resolution domain.

Figure 16 (
Figure 16(a) shows the histograms of PSNR differences of indoor sequences and Figure 16(b) shows the

Figure 15 .Figure 16 .
Figure 15.Only the overlapped regions in two successive images are used to compute the PSNR value.

Table 2 .
Computation costs for whole QVGA image with

Table 3 .
Computation costs for whole QVGA image with

Table 4 .
Speed ratio of 272 E R =with respect to FSA for HR global motion estimation.

Table 5 .
Computation costs for whole HR image with 528

Table 6 .
Speed ratio of 528