Numerical Procedures for Calculating the Probabilities of Recurrent Runs

Run count statistics serve a central role in tests of non-randomness of stochastic processes of interest to a wide range of disciplines within the physical sciences, social sciences, business and finance, and other endeavors involving intrinsic uncertainty. To carry out such tests, it is often necessary to calculate two kinds of run count probabilities: 1) the probability that a certain number of trials results in a specified multiple occurrence of an event, or 2) the probability that a specified number of occurrences of an event take place within a fixed number of trials. The use of appropriate generating functions provides a systematic procedure for obtaining the distribution functions of these probabilities. This paper examines relationships among the generating functions applicable to recurrent runs and discusses methods, employing symbolic mathematical software, for implementing numerical extraction of probabilities. In addition, the asymptotic form of the cumulative distribution function is derived, which allows accurate runs statistics to be obtained for sequences of trials so large that computation times for extraction of this information from the generating functions could be impractically long.


Introduction 1.Runs Tests for Non-Randomness
A stochastic process generates random outcomes in time or space.Such processes occur widely in the physical and social sciences, as well as in purely practical human activities such as finance, manufacturing, and commerce.Despite their random occurrence-indeed, precisely because of it-the outcomes of a stochastic process will display ordered patterns which a statistically naïve observer may mistakenly interpret as predictively useful information.In recent years, controversial issues over the information content of time series have arisen in a variety of disciplines such as nuclear physics [1] and econophysics (i.e.dynamics of the stock market) [2].Although it is not possible to prove with certainty that a particular process is random, there are various statistical tests to demonstrate within specified confidence limits that it is not random.Among these, nonparametric runs tests are especially useful, in part because of their ease of implementation and statistical power [3].

Exclusive Runs
An exclusive run is an unbroken sequence of similar events, ordinarily of a binary nature.For example, a sequence of symbols aabbbaa comprises 2 runs of a's of length 2 and 1 run of b's of length 3.If the events a and b occur with fixed probabilities throughout the sequence, the stochastic process is of the Bernoulli kind, and the distribution theory of binary runs [4] can be used to test for non-randomness in permutational ordering of any such empirical sequence of outcomes.
It is not necessary for a stochastic process to generate binary events in order to be analyzed for runs.For example, a sequence of n different observations 1 2 , , , n x x x  of a continuous random variable will yield 1 n − sequential differences that are either positive (+) or negative (-) and therefore again subject to binary runs analysis.The binary elements, however, are not Bernoulli variates since the probability of obtaining an element decreases with its position in the unbroken sequence.Nevertheless, one can test for non-randomness in permutational ordering with a different distribution theory [5].
Although developed initially for testing quality control in manufacturing, exclusive runs and up-down runs have been employed in analysis of a variety of experiments to test the fundamental prediction of quantum mechanics that transitions between quantum states occur randomly [6,7].A problematic issue in the counting of exclusive or up-down runs is that the length of a run can be changed by future events.Thus, in the succession aabbbaa, the second run of 2 a could change to a run of 3 a or 4 a if the next two trials resulted in ab or aa respectively.

Runs of Recurrent Events
A third kind of runs analysis, based on Feller's theory of recurrent events [8], was recently employed to examine certain quantum optical processes for evidence of non-random behavior [9].A recurrent run of length t, as defined by Feller, is a sequence of non-overlapping, uninterrupted successions of exactly t elements of the same kind.It is distinguished from the other two kinds of runs in that the concept of run length is so defined as to be independent of subsequent trials.For example, in the sequence aaaabaaaaaa, there are two runs of length 4 aa aa b aa aa aa .(Analyzed in terms of exclusive runs, there would have been 1 run of a of length 4 and 1 run of a of length 6, provided the sequence ended at the 11 th trial).In a sequence of Bernoulli trials, a recurrent run of length t occurs at the th n trial if the th n trial adds a new run to the sequence.Thus, the recurrent runs of length 4 occur at positions 4, 9, and the recurrent runs of length 3 occur at positions 3, 8, 11.
The advantage of Feller's definition is that runs of a fixed length become recurrent events, and the statistical theory of recurrent events can then be applied to testing empirical data sequences for permutational invariance over a much wider variety of patterns than just those of unbroken sequences of identical binary elements.For example, one may be interested in testing the recurrence of a pattern abab, which, in a quantum optics experiment, might correspond to a sequence of alternate detections of left and right circularly polarized photons, or, in a series of stock price variations, might correspond to a sequence of alternative observations of rising and falling closing prices.Besides the application to runs, the same theoretical foundation may be applied to recurrent events in other forms such as return-to-origin problems, ladder-point problems (instances where a sum of random variables exceeds all preceding sums), and waiting-time problems.
The theory of recurrent runs, the relevant parts of which are examined in the following section, leads to generating functions from which the probability of a run of defined events of specified length can in principle be calculated exactly.As a practical matter, the extraction of these probabilities requires geometrically longer computation times with increasing sequence length.The availability of fast lap-top computers with large random access memory and of symbolic mathematical software of hitherto unparalleled ability to execute series expansions and perform differentiation and integration provides the analyst with computational power unimaginable to the creators of the statistical theory of runs.I report here mathematical strategies for reducing significantly the computation time for the probability of the widely applicable case of k occurrences of runs of length t in a Bernoulli sequence of length n.

Probability Generating Functions
Following Feller, I define the recurrent event E to be a run of successes of length t in a sequence of Bermoulli trials with p the probability of a single successful outcome and 1 q p = − the probability of failure.Consider the following random variables: number of trials "waiting time" between 1 and occurrence of E 1 1 number of trials up to and including occurrence of E The distribution of the variable T is defined by where n f is the probability that E occurs for the first time at the th n trial.The generating function of the probabilities of first occurrence is expressed by ( ) The number of trials to the th k occurrence of E is then characterized by the random variable k S in (2), which is a sum of the waiting times of k independent trials, from which it is follows that the associated generating function takes the form where is the probability that the th k occurrence of E first takes place at the th n trial.I leave to the cited literature the proof that the generating function ( 5) for runs of length t with individual probability of success p is given by from which the mean and standard deviation of the recurrence times follow by differentiation ( ) ( ) ( ) For economy of expression, the parameters p, t will be suppressed in the arguments of ( ) F s p t , ( ) and ( ) unless needed to avoid ambiguity.In general, these parameters will be chosen and fixed at the outset of any illustrative applications.Note, too, that to obtain a statistical moment from a probability generating function (pgf), the derivatives are evaluated at 1 s = , which leads to a sum of terms, whereas to obtain a probability the derivatives are evaluated at 0 s = , which leads to a single term.For many applications the analyst's interest is not necessarily in the recurrence time (i.e.number of trials) to the th k occurrence of E, but in the probability that E occurs k times in a fixed number n of trials.The relation connecting the two variates is The probability , n k p that k events E occur in n trials is then expressible as Pr Pr Pr and serves in the construction of two pgf's Note that the summation in ( 13) is over the number of occurrences, whereas the summation in ( 14) is over the number of trials.The second equality in (14) follows directly from Equation (11).Multiplying both sides of (13) by n s and summing over n leads to the bivariate generating function from which the probabilities , n k p are calculable by series expansion of both sides of the equality.A sense of the structure of the formalism can be obtained by considering the case of recurrent runs of length 3 t = for a stochastic process with 1 2 p = Substitution of these conditions into Equation ( 8) for ( ) F s yields the following rational expression for the right side of Equation ( 15) and its corresponding Taylor-series expansion to order 6  s .Recall that the powers of s designate the number of trials, and the powers of z designate the number of recurrences of runs of length 3.For a fixed power of s, the sum of the coefficients of the powers of z within each bracketed expression equal unity, as they must by the completeness relation for the probability of mutually exclusive outcomes.Note that the first three terms ( ) are independent of z-i.e.contain only powers 0 z -since there cannot be runs of length 3 in a sequence of no more than 2 trials.For 3 trials, the probabili- ty of 0 runs of length 3 is 7/8 and the probability of 1 run of length 3 is 1/8.For 5 trials, however, the probability of 0 runs is 1/4 and the probability of 1 run is 3/4.This pattern persists: (a) to obtain a run of length t, the sequence of trials must be of length s t ≥ , and (b) the greater the number of trials, the higher is the probability of obtaining longer runs.

It is not necessary to know the individual
Starting from the relation and following the same procedure that led to (18) yields the generating function for the distribution From the generating functions (18) and (20) one can deduce the asymptotic relations for mean and variance ( )

Numerical Procedures
The statistics (probabilities and expectation values) for any physically meaningful choice of probability of success p, run length t, and number of trials n are deducible exactly from expressions (15) and (18) in the manner previously illustrated.For many applications, however, particularly where it is possible to accumulate long sequences of data as is often the case in atomic, nuclear and elementary particle physics experiments or investigations of stock market time series, the tests for evidence of non-random behavior are best made by examining long runs.Suppose, for example, one wanted the probability of obtaining the number of occurrences of runs of length 50 in a sequence of 100 trials.One approach, leading directly to all non-vanishing probabilities, would be to extract the 100 th term ( ) and similarly where C is the unit circle and the generating function ( ) F s , given by Equation ( 8), specifies the single-event probability p and run length t.Contrary to first impression, however, the execution of expressions ( 23 Because the generator ( ) ] is univariate, one can take advantage of the rapidity with which symbolic mathematical software executes a series expansion to obtain exact values n N for very long sequence lengths by the following simple procedure: 1) For given p and t, express ( ) M s as a rational function ( ) ( ) 2) Convert ( ) ( ) where the ellipsis ( )  represents either the expression to be converted or the Maple equation number of that expression.
3) Extract the single desired term n N .In Maple this can be done by filtering the sum: Alternatively, one can convert the series generated to order ( ) to a polynomial of degree n, and use the command lcoeff to extract the leading coefficient of the polynomial.
As an illustration, consider the calculation (by means of Maple) of the exact mean number of runs of length 4 t = in a sequence of 1 million trials with probability of success 0.5 p = .Following the foregoing steps, we have 1) where ( ) j  is to be replaced by the actual j th numerical element in the sum obtained in step (2).The arrow, inserted by the author, symbolically points to the form of the output.Extraction of this element for the specified conditions led to 1,000,000 33,333.258N = in a fraction of a second.In Maple, the command evalf calls for numerical evaluation of expressions; omitting this command results in an exact fraction, which for a sequence length of 1 million is too unwieldy be useful.Use of the command lcoeff yields the same result, but executes more slowly for large n.
The procedure described above for converting the rational expression of s into a formal power series in s did not work with the bivariate generator ( ) , H s z , which involved a product of z with a power of s (depending on run length t) in the denominator, and required, for the conversion, solution of the roots of a high-order (>2) algebraic equation.Maple did return, however, the recursion relation for the coefficients of the formal power series.An alternative procedure to isolate the values , n k p for fixed n, which still relied on the computational speed of series expansion and worked well for sequence lengths in the thousands, is the following: 1) For given p and t, express ( ) H s z as a rational function of s and z. 2) Generate a series expansion of ( ) , H s z to order n and 1 n + .
3) Convert the series expansions into polynomials ( ) 4) Subtract one polynomial from the other to obtain an expression of the form , for the probability of k occurrences of runs of length 4 in a sequence of 1000 trials.The calculations described in this section were performed with a laptop computer (Intel-based Mac Powerbook) running Maple 16.
One final procedure, particularly suitable when only selected probabilities of the full set { } , n k p are desired, is to obtain these probabilities directly from the generating function ( ) , Q s k in (14).As in the previous examples, this can be accomplished in either of two ways: (1) by evaluating the leading coefficient of the polynomial in the desired degree n, or (2) by filtering the power series for the th n term.As an example, consider the probability 100,4 p for obtaining 4 runs of length 4 t = in a series of 100 trials with probability of success 0.5 p = .
The generating function then takes the form where the ellipsis is to be replaced by the th k term produced in step (1).In using Maple to execute method (2), one proceeds in a single step once the rational expression ( )

Conclusions
The theory of recurrent runs provides a statistical basis for rejecting the hypothesis that a series of observations (in time or space) are random.This is a matter that often arises in experimental investigations in atomic, optical, nuclear, and elementary particle physics, as well as in other sciences, finance, and commerce, which may entail a very large number-perhaps in the thousands to millions-of trials or observations.In this paper theoretical and numerical methods based on different generating functions were derived and investigated to determine (a) the probability nk p for k recurrence runs of length t in n Bernoulli trials, (b) the complementary cumulative probability The methods reported here can be implemented on modern laptop computers running commercially available symbolic mathematical software, such as Maple (which was the application used by the author).Computation times for application of these methods to data sequences up to millions of trials could range from seconds to minutes.
To compute runs statistics for sequences of intermediate to very long trial numbers, the asymptotic distribution for the number of trials up to and including the th k occurrence ( ) of a specified run length t was derived and found to be a Gamma distribution aa , three runs of length 3 [ ] | | | aaa ab aaa aaa , and five runs of length 2 [ ] many applications in the physical sciences and elsewhere, is the experimentally observed quantity of interest.Multiplying both sides of Equation (17) by n s and summing n over the range ( ) 1, ∞ leads to the generating function for the distribution of n differentiation, instead of by a series expansion of the corresponding generators up to the order that yields the desired , n k p or n N , is not computationally economic.The computer, in fact, executes the series expansion of the generator much more rapidly than it performs symbolic differentiation.

g s h s to a power
point numbers, if desired.As an example, the procedure led in under 10 seconds to the full set By virtue of equivalence(11), Equation (28) also yields a closed-form asymptotic relation for the sought-for cumulative probability distribution ( )Pr n N k≥ .Consider, for example, the probability of 1 or more runs of length 20 in 2000 trials with individual probability of success 0.5.A comparison of the results of (a) Equation (28), (b) the cumulative Gaussian distribution with mean and variance given by relations (21) and (22), and (c) the exact calculation obtained from the generating function (26) the distribution (28).If the number of trials is increased to 10 6 , the Gamma and Gaussian asymptotic distributions lead to comparable results is still superior to the latter when compared to the probability calculated from the exact generating function.The relation (28), by which one can calculate the cumulative probability ( )Pr n N k ≥for large values of n, also allows one to calculate the individual probabilities nk p as a function of number of occurrences k through the identity

of
approximation.In the limit of very large (technically, infinite) n and k, the Central Limit Theorem (CLT) predicts an asymptotic distribution Gaussian form.Both the Gamma and Gaussian asymptotic distributions give comparable results for ( ) Maple or Mathematica permits one to do this up to a certain order limited by the speed and memory of one's computer, but these calculational tools may become insufficient when one is seeking exact probabilities of runs in data sequences of thousands to millions of bits.Using complex variable theory, one can extract expressions for