A Simple Model for On-Sensor Phase-Detection Autofocusing Algorithm

A simple model of a phase-detection autofocus device based on partially masked sensor pixels is described. The cross-correlation function of the half-images registered by the masked pixels is proposed as a focus function. It is shown that, in such a setting, focusing is equivalent to searching for the maximum of the cross-correlation function. The application of stochastic approximation algorithms to unimodal and non-unimodal focus functions is briefly discussed.


Introduction
In imaging, focusing can be defined as seeking the image that best approximates the captured scene. The proposed autofocusing algorithm is of a stochastic optimization type. Within the stochastic framework we model the scene as a random process (continuous, stationary and fourth-order) of an unknown distribution. Assuming that the dimensions of the lens and sensors are far larger than the wavelength of light, we can use the first-order (geometric/linear) approximation of the laws of optics [1,2]. In particular, we can model the lens as a linear low-pass filter with a symmetric (box) impulse response centered at the origin [3]. The width of the box is proportional to the distance between the sensor and the image plane. The scene is "in focus" when the sensor is in the image plane, that is, in the plane where all rays from a single point of the scene converge into a single point (and the corresponding impulse response of the lens is the Dirac delta function).
A popular approach to this problem in digital imaging is to use sequentially collected images with their variance serving as a focus function. Such an approach is referred to as contrast-detection auto-focusing (we will use the common acronym CD AF for brevity), which also includes algorithms based on the analysis of an image histogram or its gradient. It does not require any additional equipment and hence can be implemented in virtually all digital cameras. Its well-known issue, however, is that a single image does not provide information about either:
• the distance between the sensor and the image plane, or
• the direction in which the sensor should be shifted in order to attain a focused image.
Consequently, CD algorithms seek the focus iteratively, in a back-and-forth manner (shifting the lens accordingly), and require capturing an image at each position determined by the algorithm. The CD AF algorithms are usually derivatives of standard search routines, e.g. the golden-section search (if the noise is negligible) or the Kiefer-Wolfowitz stochastic approximation algorithm (if the noise impact cannot be ignored) [4,5]. As a result, they are rather slow and not directly applicable in e.g. object tracking or video applications.
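The back-and-forth search described above can be illustrated with a minimal Python sketch of the golden-section search for the noise-free case. The function `toy_contrast` is a hypothetical stand-in for "capture an image at a given lens position and compute its variance"; it, the search interval and the tolerance are assumptions made only for this illustration and do not come from the paper.

```python
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0  # 1/phi, approximately 0.618


def golden_section_max(focus, lo, hi, tol=1e-4):
    """Return (argmax estimate, number of evaluations) for a unimodal focus function."""
    c = hi - INV_PHI * (hi - lo)
    d = lo + INV_PHI * (hi - lo)
    fc, fd = focus(c), focus(d)
    evaluations = 2
    while hi - lo > tol:
        if fc > fd:
            # The maximum lies in [lo, d]; the old c becomes the new d.
            hi, d, fd = d, c, fc
            c = hi - INV_PHI * (hi - lo)
            fc = focus(c)
        else:
            # The maximum lies in [c, hi]; the old d becomes the new c.
            lo, c, fc = c, d, fd
            d = lo + INV_PHI * (hi - lo)
            fd = focus(d)
        evaluations += 1  # one new "image capture" per iteration
    return (lo + hi) / 2.0, evaluations


def toy_contrast(position, in_focus=1.3):
    """Hypothetical unimodal contrast curve peaking at the in-focus position."""
    return 1.0 / (1.0 + (position - in_focus) ** 2)


best_position, captures = golden_section_max(toy_contrast, -5.0, 5.0)
print(best_position, captures)
```

Each iteration costs one new evaluation, i.e. one new captured image, which is the reason why such routines are slow when used for focusing.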
In order to overcome these deficiencies one can use algorithms based on the phase-detection auto-focusing (PD AF) principle, in which a single image is split into two (left- and right-hand side) halves. Typically, image splitting is achieved with the help of a separate optical path consisting of semi-transparent/pellicle mirrors and dedicated line sensors; such an implementation is often found in digital SLRs (see e.g. [3]). The half-images, if the scene is out of focus, are shifted with respect to each other. Such a shift is traditionally referred to as a "phase shift" and carries information about both:
• the distance between the sensor and the image plane, and
• the direction in which the sensor should be moved.
This property makes the PD AF algorithms faster than the CD AF ones since, in principle, a single (but split) image suffices to determine the correct (in-focus) sensor position. Technological progress in image sensor fabrication has recently made it possible to partially mask the microlenses and thus to implement PD AF on the sensor. Masking makes it possible to split a single image registered by the sensor without the aforementioned additional optical equipment. The on-sensor PD AF approach (at the cost of a more complicated sensor fabrication) can therefore speed up on-sensor focusing and make it suitable for e.g. focus tracking applications. It can also be considered an interesting alternative to the CD AF-based shape-from-focus algorithms used in 3D scene restitution; cf. [6-9].

Assumptions
We propose a simple model of a sensor with masked pixels and a corresponding focus function. We also consider several stochastic approximation algorithms searching for the location of the maximum of the focus function (which corresponds to the location of the image sensor in the image plane).
Our analysis can also be adapted to sensors in which e.g. every second green pixel of the Bayer CFA is replaced by a phase-detection pixel (such an approach allows for pixel-level autofocusing precision and implies only minor modifications to existing sensors); see Figure 1. Several leading manufacturers, e.g. Aptina, Canon, Fuji, Olympus and Sony, offer CMOS sensors equipped with PD AF circuits.
Remark 1: Recently, Canon introduced an alternative "dual-pixel" approach in which a single pixel consists of two photosensors coupled under a single microlens. It can also be approximated by the proposed model since the half-images are registered there by the left-hand and right-hand side photosensors of each pixel. Canon's implementation makes masking the microlenses unnecessary; nevertheless, it results in a sensor with twice as many pixels.
In focusing problems, it is usually assumed that the impulse response of the lens is of a rectangular shape (1), cf. e.g. [3,10,11], where the width parameter $a$ is proportional to the distance between the sensor and the image planes ($a \sim |s - v|$; see Figure 2). In our PD AF problem, approximations of the impulse responses for the left- and right-hand side masked pixel sensors are proposed (see Figure 3).
Collecting separately the images from the left- and right-hand side masked pixels, we obtain a pair of half-images (that is, the convolutions of the scene $S(x)$ with either of the impulse responses). About the scene $S(x)$ we assume that it is a wide-sense fourth-order stationary and ergodic process and that its autocorrelation function $\rho(x)$ is continuous and bounded. Such a process admits a particularly important class of piecewise-smooth images; see [12, p. 529]. Analogous assumptions hold for the additive noise $Z(x)$ corrupting the half-images; cf. [2].
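The formation of the two half-images can be sketched in Python as follows. The exact impulse responses of the model are not reproduced here; the mirrored linear ramps returned by `ramp_kernels` are an assumption chosen only to be "triangular" and to satisfy the mirror symmetry between the left and right responses. All function names are hypothetical.

```python
def ramp_kernels(half_width):
    """Return mirrored ramp kernels (h_left, h_right) with 2*half_width+1 taps.

    Assumed shapes: h_left rises linearly, h_right(x) = h_left(-x).
    """
    taps = 2 * half_width + 1
    raw = [k + 1 for k in range(taps)]      # linearly increasing weights
    total = float(sum(raw))
    h_left = [w / total for w in raw]       # normalized to unit sum
    h_right = h_left[::-1]                  # mirror image of h_left
    return h_left, h_right


def convolve_same(signal, kernel):
    """Zero-padded convolution; the kernel is centred on its middle tap."""
    n, m = len(signal), len(kernel)
    half = m // 2
    out = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k in range(m):
            j = i - (k - half)              # sample hit by kernel offset k - half
            if 0 <= j < n:
                acc += kernel[k] * signal[j]
        out[i] = acc
    return out


def centroid(values):
    """Centre of mass of a non-negative sequence."""
    return sum(i * v for i, v in enumerate(values)) / sum(values)


scene = [0.0] * 21
scene[10] = 1.0                             # a single bright point
h_left, h_right = ramp_kernels(3)           # out of focus: 7-tap kernels
left = convolve_same(scene, h_left)
right = convolve_same(scene, h_right)
print(centroid(left), centroid(right))      # the two half-images are displaced
```

For a point scene the two half-images are mirror copies of each other whose centres of mass move apart as the kernels widen; with `ramp_kernels(0)` (the in-focus case of this toy model) both half-images coincide with the scene.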

Focus Function
In order to propose the focus function, we need the following lemma.
Lemma 1: The symmetry property (3) implies that the convolution of the scene with the right-hand side impulse response equals the cross-correlation of the scene with the left-hand side one (note that here we use the term "cross-correlation" in the signal processing sense).
Proof: Indeed, the claim follows by exploiting the shift invariance of the convolution operation.
Consider now the stochastic cross-correlation between the left- and right-half images, expressed through the autocorrelation function $\rho(x)$ of the scene process $S(x)$. Due to stationarity, it depends only on the shift between the half-images. We thus have the following proposition.
Proposition 2: The phase-detection focus function, $f(\eta)$, is the cross-correlation product (4).
Proof: To verify both the unimodality and the symmetry property of $f(\eta)$, observe that the cross-correlation of $L(x)$ with itself is the autocorrelation of $L(x)$ and is a symmetric function w.r.t. $x$. Moreover, $\rho(x)$ is known to be symmetric with a maximum at $x = 0$. So their cross-correlation has a maximum at $x = 0$ and is symmetric w.r.t. $x$. Note finally that the noise $Z(x)$ is stationary and independent of the image process $S(x)$. Hence it has a constant variance, which only adds to the correlations of the half-images. Consequently, its presence alters neither the unimodality of the half-image correlation nor the position of the maximum of the correlation function. □
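As an illustration of how a cross-correlation of two half-images can be evaluated empirically, consider the following Python sketch. It is a generic example, not the lens model of the paper: the right half-image is assumed to be a circularly shifted copy of the left one, the correlation is circular, and all names are hypothetical.

```python
import random


def circular_cross_correlation(left, right, max_lag):
    """Empirical c(lag) = mean over x of left[x] * right[x + lag] (circular)."""
    n = len(left)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        result[lag] = sum(left[x] * right[(x + lag) % n] for x in range(n)) / n
    return result


def estimated_shift(left, right, max_lag):
    """Lag at which the empirical cross-correlation attains its maximum."""
    corr = circular_cross_correlation(left, right, max_lag)
    return max(corr, key=corr.get)


rng = random.Random(0)
scene = [rng.gauss(0.0, 1.0) for _ in range(256)]   # one realization of a random scene
true_shift = 5
left = scene
right = [scene[(x - true_shift) % 256] for x in range(256)]  # right[x] = left[x - 5]

print(estimated_shift(left, right, max_lag=10))     # lag of the correlation maximum
print(estimated_shift(left, left, max_lag=10))      # identical half-images
```

By the Cauchy-Schwarz inequality the circular correlation of a signal with its shifted copy is largest at the lag equal to the shift, so the maximum sits at lag 0 exactly when the two half-images coincide.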

Focusing Algorithms
Because of the random character of the scene process $S(x)$, the focus function $f(\eta)$ in (4) needs to be estimated from its realizations (captured images). The resulting estimate (the empirical correlation function) can clearly differ from the actual correlation function and, in particular, it can have false local maxima [16]. One can consider two approaches to this problem:
• In the first, we neglect the randomness and treat the empirical correlation function as the genuine focus function. This approach is called stochastic counterpart optimization [17]. It can be justified by the observation that the amount of data used in the calculations is large (the number $n$ of points in sensors can be counted in thousands). Thus, the impact of the random noise is averaged out (for the convergence rate of the covariance estimates in the MISE sense see [16]), and the unimodality and the position of the maximum of the correlation function are maintained. In such a scenario one can use the well-known golden-section search algorithm [4,18].
• In the second, examined below, we search for the actual maximum of the focus function using the noisy data. To this end, we apply the standard Kiefer-Wolfowitz algorithm, see [19] and cf. e.g. [5,20,21]. Then we take the version of the K-W algorithm operating on the smoothed functional as in [22,23] (cf. [24,25]) in order to apply the algorithm to the case when the correlation function is not unimodal.
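The second approach can be sketched in Python as a basic Kiefer-Wolfowitz iteration. The gain sequences $a_n = a/n$ and $c_n = c/n^{1/3}$ are a common textbook choice and are an assumption here, as are the quadratic toy focus function, its peak position and the noise level; none of them is taken from the paper.

```python
import random


def kiefer_wolfowitz(noisy_focus, x0, iterations, a=1.0, c=1.0):
    """Maximize a function observed with noise using finite-difference ascent."""
    x = x0
    for n in range(1, iterations + 1):
        a_n = a / n                      # step size (assumed sequence)
        c_n = c / n ** (1.0 / 3.0)       # probing half-width (assumed sequence)
        # Two noisy observations give a finite-difference gradient estimate.
        gradient = (noisy_focus(x + c_n) - noisy_focus(x - c_n)) / (2.0 * c_n)
        x += a_n * gradient              # ascent step, since we seek a maximum
    return x


rng = random.Random(0)


def noisy_focus(x, peak=2.0, noise=0.1):
    """Hypothetical unimodal focus function plus uniform white noise."""
    return -(x - peak) ** 2 + rng.uniform(-noise, noise)


estimate = kiefer_wolfowitz(noisy_focus, x0=0.0, iterations=2000)
print(estimate)
```

Each iteration requires two noisy evaluations of the focus function, which in the focusing context means two correlation evaluations per step.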

Unimodal Case
Since the focus function $f(x)$ in (4) is unimodal by assumption, to apply the Kiefer-Wolfowitz stochastic approximation algorithm we merely need to ensure that $f(x)$ is also sufficiently smooth. Recalling that $f(x)$ is itself the correlation of a continuous and bounded function with the continuous and bounded autocorrelation function $\rho(x)$, we infer that $f(x)$ has at least one bounded derivative, that is, it satisfies the Kiefer-Wolfowitz convergence conditions.

Multimodal Case
Let the autocorrelation function of the scene process $S(x)$ be multimodal. Then the focus function $f(x)$ in (4) is no longer unimodal, and the standard stochastic approximation algorithms fail in general, finding only local maxima. We show that by convolving $f(x)$ with a rectangular kernel (box) function, we obtain a smoothed version of $f(x)$ which gains the unimodality property and maintains the position of the maximum of $f(x)$; cf. [22]. The following lemma gives sufficient conditions for the focus and kernel functions, $f(x)$ and $h(x)$, under which the convolution is unimodal (since it also is symmetric). Its proof uses the fact that the support of $f(x)$ is at most $[-r, r]$ and considers any $x, y > 0$.
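The effect of box smoothing can be checked numerically with the following Python sketch. The function `focus` is a hypothetical multimodal example (symmetric, non-negative, supported on $[-r, r]$ with $r = 1$, global maximum at 0); it and the grid parameters are assumptions made for illustration only.

```python
import math

R = 1.0                                        # half-width of the support and of the box
DX = 0.01                                      # grid step
GRID = [(i - 200) * DX for i in range(401)]    # x in [-2, 2]


def focus(x):
    """Hypothetical multimodal focus function with side lobes near x = +/- 0.5."""
    if abs(x) > R:
        return 0.0
    return (0.6 + 0.4 * math.cos(4.0 * math.pi * x / R)) * (1.0 - abs(x) / R)


def box_smooth(values, half_window):
    """Riemann-sum convolution with a unit-height box of 2*half_window+1 samples."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        out.append(sum(values[lo:hi]) * DX)
    return out


def is_unimodal(values, tol=1e-12):
    """True if values do not decrease before the peak nor increase after it."""
    peak = max(range(len(values)), key=values.__getitem__)
    rising = all(values[i + 1] >= values[i] - tol for i in range(peak))
    falling = all(values[i + 1] <= values[i] + tol
                  for i in range(peak, len(values) - 1))
    return rising and falling


f_values = [focus(x) for x in GRID]
g_values = box_smooth(f_values, half_window=100)   # box of support [-R, R]
print(is_unimodal(f_values), is_unimodal(g_values))
```

In this example the original function has several local maxima, while its box-smoothed version is monotone on either side of the origin and still attains its maximum at 0.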

Numerical Simulations
The hardware equipped with on-sensor PD pixels was not available to the authors at the time of the paper's preparation. Therefore, we performed a simple numerical experiment illustrating the approach, based on a stylized model and implemented in Mathematica and C++. A sample scene $S(x)$ is presented in Figure 4 together with its half-images. Figure 5 shows the shape of the resulting focus function $f(\eta)$. Figure 6 shows the results of applying the Kiefer-Wolfowitz algorithms with the parameter sequences chosen as in the original algorithm in [19]. White noise with a uniform distribution was added to the focus function.

Final Remarks
In the classic paper by Krotkov [26], several criteria of "a good focus function" are given. Using these criteria, we briefly discuss the properties of the considered approach.

Unimodality
The unimodality property has been formally shown for filters with linear impulse responses, as in (2) and (3). Both early experiments and formal investigations suggest that the symmetry condition ((3) or (6)) is crucial, while the shapes of the filters can, for instance, resemble square (or higher monomial) functions. It should be noted, however, that real images may not be stationary, and hence our assumption that the correlation function is symmetric (i.e. depends only on the shift between the half-images) can be violated. The search for the maximum of the correlation function can then result in an improper focus distance selection, as the maximum may no longer correspond to the actual (or desired) focus position. In this case one should consider applying global random search algorithms; see e.g. [17,27-29].

Accuracy and Reproducibility
The accuracy and reproducibility of the PD AF algorithms are affected by the presence of noise; the range of admissible noises is very broad and encompasses virtually all instances found in practice, see e.g. [30]. Observe further that the proposed approach is of an open-loop control type. That is, the focus function maximum, once set, is not further refined. A natural extension of the approach is to exploit the fact that, as the sensor moves toward the focus plane, the width of the lens impulse response (the parameter $a$ in (1)) vanishes, and the new images captured during this movement can be used to re-evaluate the maximum. From the formal viewpoint (under our correlation function symmetry assumptions), these additional measurements are not necessary when the image plane is fixed; nevertheless, they can be used in closed-loop control algorithms, e.g. to track the focus when the image plane shifts.

General Applicability
PD AF algorithms are less general than CD AF ones as they require additional modifications to the sensor (at the cost of image quality: in some implementations the masked pixels are put in place of standard pixels). However, in contrast to the standard PD AF algorithms, which require a separate optical path, the on-sensor PD AF needs merely a new sensor. Moreover, the case we examine is based on the assumption that the scene is a 1D (or 2D) process (random field), while in many situations it is in fact a 3D one. Extending the analysis of the algorithm to this case is a subject of our current study.

Video Signal Compatibility
As in the CD AF case, the video signal is registered by the same sensor which collects the half-images for the PD AF algorithm. Thus, the calibration of the separate optical path, which is often necessary in the standard algorithms based on a mirror/splitter, is not required here. Nevertheless, the pixels are masked and part of the light is lost (approximately $-1$ EV per pixel for the considered half-masked pixels). This is clearly a drawback in low-light applications. In the above-mentioned Canon "dual-pixel" implementation, all the available light is captured in the final image; however, the number of pixels to be processed is twice as large.

Fast (Software) Implementations
Correlation functions can be efficiently computed using the standard routine in which both signals are transformed using the FFT and then multiplied (with one of the transforms complex-conjugated). The correlation function is then obtained by the inverse FFT. The cost of a single correlation evaluation is thus log-linear, $O(n \log n)$, where $n$ is the number of pixels; see e.g. [18]. In the special case when the golden-section search algorithm is used, the maximum number of test points is guaranteed to be $O(\log n)$; see e.g. [4,18]. Hence, the overall complexity is then $O(n \log^2 n)$. When, in turn, the Kiefer-Wolfowitz algorithm is used to determine the focus position, the number of test points at which the correlation is computed is usually fixed (and slightly larger than $O(\log n)$).
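The FFT-based routine can be sketched in pure Python as follows. This is a minimal illustration assuming a circular correlation and a signal length that is a power of two; the recursive radix-2 transform and all function names are hypothetical choices, and a production implementation would rely on an optimized FFT library instead.

```python
import cmath
import random


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_cross_correlation(a, b):
    """Circular c[lag] = sum_x a[x] * b[x + lag] for real a, b, in O(n log n)."""
    n = len(a)
    fa, fb = fft(a), fft(b)
    product = [x.conjugate() * y for x, y in zip(fa, fb)]  # conj(A) * B
    return [v.real / n for v in fft(product, inverse=True)]


def direct_cross_correlation(a, b):
    """The same circular correlation computed directly in O(n^2)."""
    n = len(a)
    return [sum(a[x] * b[(x + lag) % n] for x in range(n)) for lag in range(n)]


rng = random.Random(0)
n = 128
a = [rng.gauss(0.0, 1.0) for _ in range(n)]
b = [a[(x - 3) % n] for x in range(n)]          # b is a shifted by 3 samples
fast = fft_cross_correlation(a, b)
slow = direct_cross_correlation(a, b)
print(max(range(n), key=fast.__getitem__))      # lag of the correlation maximum
```

Both routines return the same values up to rounding; only the FFT-based one attains the log-linear cost quoted above.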

Image Readout Issues
Using the image sensor for focusing is clearly beneficial from the video compatibility point of view. However, it also means that the algorithm speed is limited by the sensor frame rate. Clearly, this problem is more significant in CD algorithms than in PD ones (especially in the single-image open-loop version of the latter), but in either case it can be further alleviated when the sensor at hand offers random access to pixels and one is interested in focusing on a selected region of the scene.

Figure 1. (a) A standard image sensor (with a Bayer CFA); (b) the interleaved left- and right-half masked pixels (the PD sensors); (c) an image sensor with embedded PD sensors.

Figure 2. The block diagram of the on-sensor phase detection autofocus (PD AF) system model.

Figure 3. Half-masked pixels split the rectangular impulse response of the lens into a pair of triangular ones. The collected "shifted" half-images are used in the on-sensor phase-detection algorithms (note that both this figure and Figure 7 are presented for illustrative purposes and are not exact schemes).
Let $f(x)$ be a function (viz. the correlation function $f(\eta)$) with the global maximum at 0 and with a support included in $[-r, r]$. Then, the convolution $(f * h)(x)$ is unimodal with the maximum at 0. By assumption, $f(x)$ has a global maximum at 0. Let $F(x)$ denote its primitive function. The convolution of $f(x)$ with a rectangular kernel $h(x)$ of support $[-r, r]$ equals, up to the normalizing constant of the kernel, $F(x + r) - F(x - r)$.

Figure 4. The sample scene (black line), and its left (brown) and right (red) half-images.

Figure 5.The focus function of the scene from Figure 4.

Figure 6.Mean squared error of the Kiefer-Wolfowitz algorithm vs. the number n of the algorithm test points.

Figure 7. POV-Ray simulation (clockwise): the scene (the white square is in focus, the red one is closer and the yellow one is further from the focus), the image seen by a 33 × 33 sensor with non-masked pixels, and the images seen by the right- and left-hand side half-masked microlenses.