Deriving CDF of Kolmogorov-Smirnov Test Statistic

In this review article, we revisit the derivation of the cumulative distribution function (CDF) of the test statistic of the one-sample Kolmogorov-Smirnov test. Even though several such proofs already exist, they often leave out essential details necessary for a proper understanding of the individual steps. Our goal is to fill in these gaps, making our presentation accessible to advanced undergraduates. We also propose a simple formula capable of approximating the exact distribution to sufficient accuracy for any practical sample size.


Introduction
The article's goal is to present a comprehensive summary of deriving the distribution of the usual Kolmogorov-Smirnov test statistic, both in its exact and approximate form. We concentrate on practical aspects of this exercise, meaning that
• reaching a modest (three significant digit) accuracy is usually considered quite adequate,
• computing critical and P-values of the test is the primary objective, implying that it is the upper tail of the distribution which is most important,
• methods capable of producing practically instantaneous results are preferable to those taking several seconds, minutes, or more,
• simple, easy to understand (and to code) techniques have a great conceptual advantage over complex, black-box type algorithms.
This is the reason why our review excludes some existing results (however deep and mathematically interesting they may be); we concentrate only on the most relevant techniques (this is also the reason why our bibliography is deliberately far from complete).

Test Statistic
The Kolmogorov-Smirnov one-sample test works like this: the null hypothesis states that a random independent sample of size $n$ has been drawn from a specific continuous distribution ("specific" including the value of each of its parameters, if any). The test statistic (denoted $D_n$) is the largest (in the limit-superior sense) absolute-value difference between the corresponding empirical cumulative distribution function (CDF), i.e. the proportion of observations smaller than $x$, and the theoretical CDF, denoted $F(x)$:
$$D_n := \sup_x \left| F_n^{\mathrm{emp}}(x) - F(x) \right|$$
To complete the test, we need to know the CDF of $D_n$ under the assumption that the null hypothesis is correct. Deriving this CDF is a difficult task; there are several exact techniques for doing that, and in this article we expound only the major ones. We then derive the $n \to \infty$ limit of the resulting distribution, to serve as an approximation when $n$ is relatively large. Since the accuracy of this limit is not very impressive (unless $n$ is extremely large), we show how to remove the $\frac{1}{\sqrt{n}}$-proportional, $\frac{1}{n}$-proportional, etc. error of this approximation, making it sufficiently accurate for samples of practically any size.
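To make the definition concrete, here is a minimal sketch (our own illustration, not code from the article) of computing $D_n$ in Mathematica; ksStatistic is a hypothetical helper name.

  (* Compute the Kolmogorov-Smirnov statistic Dn for a sample,
     given a hypothesized continuous CDF. *)
  ksStatistic[sample_List, cdf_] := Module[{n = Length[sample], u},
    u = Sort[cdf /@ sample];
    (* the empirical CDF jumps from (i-1)/n to i/n at the i-th sorted point *)
    Max[Table[Max[i/n - u[[i]], u[[i]] - (i - 1)/n], {i, n}]]]

  ksStatistic[RandomVariate[ExponentialDistribution[1], 20],
    CDF[ExponentialDistribution[1]]]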

Transforming to Uniform(0, 1)
The first thing we do is to define
$$U_i := F(X_i), \qquad i = 1, 2, \ldots, n$$
where $F(x)$ is the CDF of the hypothesized distribution; $U_1, U_2, \ldots, U_n$ then constitute (under the null hypothesis) a random independent sample from the uniform distribution over the $(0, 1)$ interval, whose theoretical CDF is then simply
$$F(u) = u, \qquad 0 < u < 1$$
It is important to realize that doing this does not change the vertical distances between the empirical and theoretical CDFs; it transforms only the corresponding horizontal scale, as Figure 1 and Figure 2 demonstrate (the original sample is from an Exponential distribution).
This implies that the resulting value of $D_n$ (and consequently, its distribution) remains the same. We can then conveniently assume (from now on) that our sample has been drawn from the Uniform(0, 1) distribution; yet the results apply to any hypothesized distribution.
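This invariance is easy to check numerically; the following sketch reuses the hypothetical ksStatistic helper from above (for the uniform distribution over (0,1), the CDF is the identity function).

  (* Transforming the sample by F leaves Dn unchanged. *)
  sample = RandomVariate[ExponentialDistribution[1], 20];
  f = CDF[ExponentialDistribution[1]];
  ksStatistic[sample, f] == ksStatistic[f /@ sample, Identity]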

Discretization
In this article, we aim to find the CDF of $D_n$, namely $\Pr(D_n < d)$, only for a discrete set of $n$ values of $d$, namely for $d = \frac{1}{n}, \frac{2}{n}, \ldots, \frac{n}{n}$, even though $D_n$ is a continuous random variable whose support is the $\left(\frac{1}{2n}, 1\right)$ interval. This proves to be sufficient for all but extremely small $n$, since our discrete results can be easily extended to all values of $d$ by a sensible interpolation. There are techniques capable of yielding exact results for any value of $d$ (see [1] or [2]), but they have some of the disadvantages mentioned above and will not be discussed here in any detail; nevertheless, for completeness, we present Mathematica code for Durbin's algorithm in the Appendix.
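For a self-contained reference point, here is a sketch of the closely related Marsaglia-Tsang-Wang formulation of Durbin's matrix method (our own rendering, not the article's Appendix code); durbinCDF is a hypothetical name, and the result is exact whenever d is rational.

  durbinCDF[n_Integer, d_] := Module[{k, h, m, H},
    k = Ceiling[n d]; h = k - n d; m = 2 k - 1;
    (* the basic m x m matrix of the method *)
    H = Table[If[i - j + 1 < 0, 0, 1], {i, m}, {j, m}];
    (* adjust the first column and the last row for the fractional part h *)
    Do[H[[i, 1]] -= h^i, {i, m}];
    Do[H[[m, j]] -= h^(m - j + 1), {j, m}];
    If[2 h - 1 > 0, H[[m, 1]] += (2 h - 1)^m];
    (* divide each entry by the appropriate factorial *)
    H = Table[If[i - j + 1 >= 0, H[[i, j]]/(i - j + 1)!, 0], {i, m}, {j, m}];
    (n!/n^n) MatrixPower[H, n][[k, k]]]

  N[durbinCDF[10, 1/4]]   (* Pr[D10 < 0.25] *)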

Linear-Algebra Solution
This and the next two sections are all based mainly on [3], later summarized by [4]. Let $C_k$ be the number of observations smaller than $\frac{k}{n}$, and define $T_k := C_k - k$; Claim 1 then states that $D_n \geq \frac{j}{n}$ occurs if and only if the $T$ sequence reaches either $j$ or $-j$. To prove the reverse implication, we must first realize that no one-step decrease in the $T$ sequence can be bigger than 1; therefore, if the sequence has exceeded the value of $j$ at some point, or has reached a value smaller than $-j$, it then follows that at least one $T_i$ has to be equal to either $j$ or $-j$, respectively.
We know that, given $C_k$, where $1 \leq k \leq n - J$, the relevant probabilities can be expanded accordingly, with the understanding that an empty sum (lower limit exceeding the upper limit) equals 0. From (4) it is obvious that $T_k = J$ is equivalent to having exactly ("exactly" to be understood from now on) $k + J$ observations smaller than $\frac{k}{n}$. The corresponding probability is the same as that of getting $k + J$ successes in a binomial-type experiment with $n$ trials and a single-success probability of $\frac{k}{n}$; we will denote it $\mathbb{B}^{\,n}_{\,k+J}$.
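In Mathematica, this binomial probability is a one-liner (bProb is our own hypothetical helper, reused further below):

  (* Probability of exactly k+J of the n observations falling below k/n. *)
  bProb[n_, k_, J_] := Binomial[n, k + J] (k/n)^(k + J) (1 - k/n)^(n - k - J)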

Resulting Equations
We can thus simplify (6) to (7). By the same kind of reasoning we can show that, for any $k$ between $J$ and $n - 1$, (8) holds (note that the $T$ sequence needs at least $2J$ steps to reach $-J$ at $T_k$ after leaving $J$ at $T_i$). Combining (7) and (9), we end up with a total of $2(n - J)$ linear equations for the same number of unknowns. Furthermore, these equations have a "doubly triangular" form, meaning that, proceeding in the right order, we are always solving for only a single unknown at a time (this is made obvious by the next Mathematica code).
Having found the solution, we can then compute, based on Claim 1, the desired probability (10). The whole algorithm can be summarized by the following Mathematica code (note that instead of superscripts, interpreted by Mathematica as powers, we have to use "overscripts").
(for improved efficiency, we use only the relevant range of J values).
The program takes over one minute to execute; the results are displayed in Figure 3.
We can easily interpolate values of the corresponding table to convert it into a continuous function, thereby finding any desired value to a sufficient accuracy.
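A one-line sketch of this interpolation step (tab is assumed to hold the {d, Pr[Dn < d]} pairs just computed):

  cdfDn = Interpolation[tab];   (* continuous approximation of the CDF *)
  cdfDn[0.05]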
The main problem with this algorithm lies in its execution time, which increases (like that of most matrix-based computations) with roughly the third power of $n$.
This makes the current approach rather prohibitive when dealing with samples consisting of thousands of observations. In this context it is fair to mention that none of our programs have been optimized for run-time efficiency; even though some improvement in this regard is definitely possible, we do not believe that it would substantially change our general conclusions.

Generating-Function Solution
We now present an alternate way of building the same (discretized, but otherwise exact) solution. We start by defining the following function of two integer arguments
$$p^{\,i}_{\,j} := \frac{i^{\,i+j}}{(i+j)!} \qquad (11)$$
Note that, when $i + j$ is negative ($i$ is always positive), $p^{\,i}_{\,j}$ is equal to 0.
The binomial probability just introduced can be expressed in terms of three such $p$ functions, as follows
$$\mathbb{B}^{\,n}_{\,k+J} = \frac{p^{\,k}_{\,J} \cdot p^{\,n-k}_{\,-J}}{p^{\,n}_{\,0}} \qquad (12)$$
Note that $\mathbb{B}$ has the value of 0 whenever the number of successes (the subscript) is either negative or bigger than $n$ (the superscript). Similarly, $\mathbb{B}^{\,0}_{\,0}$ is always equal to 1.
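Under this reading of the definition, the identity is easy to confirm numerically (p follows the formula above; bProb is the hypothetical helper defined earlier):

  p[i_, j_] := If[i + j < 0, 0, i^(i + j)/(i + j)!]
  bProb[20, 7, 3] == p[7, 3] p[13, -3]/p[20, 0]   (* True *)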

Modified Equations
The new function (11) enables us to express (7) and (9) as (14) and (15), respectively. Substituting (12) into each term of (14) and multiplying by $p^{\,n}_{\,0}$ yields an equation for two new sequences, $a_i$ and $b_i$, which no longer depend on $n$; recovering the original probabilities for any specific value of $n$ is then a simple task, meaning that this approach effectively deals with all $n$ at the same time! Similarly modifying (15) results in an analogous equation (valid for any $k > J$), utilizing the previous definition of $a_i$ and $b_i$. These equations, together with (17), constitute an infinite set of linear equations for the elements of the two sequences. To find the corresponding solution, we reach for a different mathematical tool.

Generating Functions
Let us introduce the following generating functions
$$G_j(t) := \delta_{j,0} + \sum_{k=1}^{\infty} p^{\,k}_{\,j}\, t^k$$
where $j$ is a non-negative integer, and $\delta_{j,0}$ (Kronecker's $\delta$) is equal to 1 when $j = 0$, equal to 0 otherwise. Multiplying (17) by $t^k$ and summing over $k$ from 1 to $\infty$ yields an equation whose terms are built by summing products of the corresponding pairs of coefficients; combining two sequences in this manner is called their convolution. Note the importance (for correctness of this step) of $p^{\,i}_{\,j}$ being equal to 0 whenever $i + j$ is negative.
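As a sketch (under our assumed form of the definition, and with the hypothetical p from above), the G functions truncated at the $t^n$ term are simple polynomials in a global symbol t:

  g[j_, n_] := KroneckerDelta[j, 0] + Sum[p[k, j] t^k, {k, 1, n}]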

Resulting Solution
The last two (simple, linear) equations can then be easily solved for the two generating functions. Going back to a specific sample size $n$, we now need to find the value of (10); the numerator of the resulting expression is clearly (by the same convolution argument) the coefficient of $t^n$ in the expansion of the corresponding combination of $G$ functions. An important point is that, in actual computation, the $G$ functions need to be expanded only up to and including the $t^n$ term, making them long but otherwise simple polynomials.
The algorithm to find this probability, obtained by substituting the solution into (22), is implemented by the following Mathematica code. It produces results identical to those of the matrix-algebra algorithm, but has several advantages: the coding is somewhat easier, it (almost) automatically yields results for any $n \leq 300$ (something we do not exploit in our code), and it executes faster (taking about 17 seconds). Nevertheless, its run-time still increases with roughly the third power of $n$, thus preventing us from using it with a much larger value of $n$.
We now proceed to find several approximate solutions of increasing accuracy, all based on (25).

Asymptotic Solution
As we have seen, neither of the previous two solutions is very practical (and ultimately not even feasible) as the sample size increases. In that case, we have to switch to using an approximate (also referred to as asymptotic) solution.

Large-n Formulas
First, we must replace the old definition of $p^{\,i}_{\,j}$, namely (11), by
$$p^{\,i}_{\,j} := \frac{i^{\,i+j}\, e^{-i}}{(i+j)!}$$
Note that this does not affect (12), nor any of the subsequent formulas up to and including (25), since the various $e^{-i}$ factors always cancel out. Also note that the definition can be easily extended to real (not just integer) arguments by using
$$p^{\,i}_{\,j} := \frac{i^{\,i+j}\, e^{-i}}{\Gamma(i+j+1)}$$
where $\Gamma$ denotes the usual gamma function.
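A sketch of this replacement (pAsym is our own hypothetical name, mirroring the earlier p):

  (* Zero for negative integer i+j, matching the original convention. *)
  pAsym[i_, j_] := If[i + j < 0, 0, i^(i + j) Exp[-i]/Gamma[i + j + 1]]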

1) Laplace representation
Note that, from now on, the summations defining the G functions in (21) stay infinite (no longer truncated to the first n terms only).

Consider a (rather general) generating function
$$P(t) := \sum_{k=0}^{\infty} p_k\, t^k$$
and an integer $n$ ($p_k$ may be implicitly a function of $n$ as well as of $k$); our goal is to find an approximation for $p_n$ as $n$ increases. After replacing $k$ and $t$ with two new variables $x$ and $s$, thus
$$k = n \cdot x, \qquad t = e^{-s/n}$$
the sum defining $P(t)$ turns (in the $n \to \infty$ limit) into $n$ times the Laplace transform of $p_{n x}$, seen as a function of $x$; this is the content of (32). To be able to reach a finite answer, $j$ itself needs to be replaced by $z\sqrt{n}$; note that doing that with our $J$ changes $\Pr(D_n < \frac{J}{n})$ into $\Pr(\sqrt{n}\, D_n < z)$. It happens to be easier to take the limit of the natural logarithm of (33) instead.
With the help of the following version of Stirling's formula (ignore its last term for the time being)
$$\ln n! \simeq n \ln n - n + \frac{\ln (2\pi n)}{2} + \frac{1}{12 n}$$
we get (this kind of tedious algebra is usually delegated to a computer)
$$p^{\,n x}_{\,z\sqrt{n}} \simeq \frac{e^{-\frac{z^2}{2x}}}{\sqrt{2\pi n x}}$$
We thus end up with
$$G_{z\sqrt{n}}\left(e^{-s/n}\right) \simeq \sqrt{n}\, \frac{e^{-z\sqrt{2s}}}{\sqrt{2s}}$$
where $j = z\sqrt{n}$; this follows from (32) and the following result: when $v$ and $s$ are positive
$$\int_0^{\infty} \frac{e^{-\frac{v^2}{2x}}}{\sqrt{2\pi x}}\, e^{-s x}\, \mathrm{d}x = \frac{e^{-v\sqrt{2s}}}{\sqrt{2s}} \qquad (38)$$
Proof. Differentiating the left-hand side with respect to $v$ and applying a simple substitution shows that it satisfies the differential equation $f'(v) = -\sqrt{2s}\, f(v)$. Solving this simple differential equation yields $f(v) = c\, e^{-v\sqrt{2s}}$, where $c$ is equal to
$$f(0) = \int_0^{\infty} \frac{e^{-s x}}{\sqrt{2\pi x}}\, \mathrm{d}x = \frac{1}{\sqrt{2s}}$$
the last being a well-known integral (related to the Normal distribution).

■
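The identity is easy to confirm numerically:

  (* Both numbers should agree to NIntegrate's working precision. *)
  With[{v = 1.3, s = 0.7},
   {NIntegrate[Exp[-v^2/(2 x) - s x]/Sqrt[2 Pi x], {x, 0, Infinity}],
    Exp[-v Sqrt[2 s]]/Sqrt[2 s]}]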
To find the $n \to \infty$ limit of (25), we first evaluate the right-hand side of (38) with $v = 2z, 4z, \ldots$ and $0$. This is based on substituting the right-hand sides of (44) into (25), and on the following result: since the ILT of $\frac{e^{-2kz\sqrt{2s}}}{\sqrt{2s}}$ (where $k$ is a positive integer) is equal to $\frac{e^{-\frac{2k^2z^2}{x}}}{\sqrt{2\pi x}}$ (this follows from (38)), setting $x$ equal to 1 (see (32)) results in Kolmogorov's classic formula
$$\Pr(\sqrt{n}\, D_n > z) \simeq 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 z^2}$$
Note that the error of this approximation is of the $\frac{1}{\sqrt{n}}$-proportional type. The last formula has several advantages over the approach of the previous two sections: firstly, it is easy and practically instantaneous to evaluate (the infinite series converges rather quickly; only between 2 and 10 terms are required to reach a sufficient accuracy when $z > 0.3$, the CDF being practically zero otherwise), secondly, it is automatically a continuous function of $z$ (no need to interpolate), and finally, it provides an approximate distribution of $\sqrt{n}\, D_n$ for all values of $n$ (the larger the $n$, the better the approximation).
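A sketch of evaluating this limit (kolmogorovTail is a hypothetical name; ten terms are more than enough for machine accuracy over the relevant range of z):

  kolmogorovTail[z_?NumericQ] :=
   2 Sum[(-1)^(k - 1) Exp[-2 k^2 z^2], {k, 1, 10}]

  kolmogorovTail[0.8]   (* approximates Pr[Sqrt[n] Dn > 0.8] *)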
But a big disappointment is the formula's accuracy, which becomes adequate only when the sample size $n$ reaches thousands of observations; for smaller samples, an improvement is clearly necessary. To demonstrate this, we have computed the difference between the exact and the approximate CDF when $n = 300$; see Figure 4, which is in agreement with a similar graph of [2].
We can see that the maximum possible error of the approximation is over 1.5% (when computing the probability of $D_{300} > 0.046$); errors of this size are generally not considered acceptable.
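The size of this error can be reproduced at a single point with the two hypothetical helpers defined earlier (the exact matrix computation may take a little while):

  With[{n = 300, d = 23/500},   (* d = 0.046 *)
   kolmogorovTail[N[Sqrt[n] d]] - (1 - N[durbinCDF[n, d]])]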

High-Accuracy Solution
Results of this section were obtained (in a slightly different form, and building on previously published results) by [5] and further expounded by the more accessible [6]; their method is based on expanding, in powers of $\frac{1}{\sqrt{n}}$, each of our $G$ functions. Having achieved a more accurate approximation for all our $G$ functions, and with the following extension of (46), we can convert the remaining terms of (65) to their contribution to $\Pr(\sqrt{n}\, D_n > z)$: we must first expand them in powers of $E$, then take the ILT of the individual terms of these expansions, and finally set $x$ equal to 1; the following table helps with the last two steps (the first row has already been proven; the remaining three follow by differentiating both of its sides with respect to $zk$, taken as a single variable, up to three times).
This results in the following replacement, whose error can never exceed 0.20%; such accuracy would normally be considered quite adequate (approximating Student's $t_{30}$ by the Normal distribution can yield an error almost as large as 1%).
As mentioned already, the approximation of $\Pr(\sqrt{n}\, D_n > z)$ can be made even more accurate by adding, to the current expansion, the following extra terms: