Trinity College Digital Repository

A composite random variable is a product (or sum of products) of statistically distributed quantities. Such a variable can represent the solution to a multi-factor quantitative problem submitted to a large, diverse, independent, anonymous group of non-expert respondents (the “crowd”). The objective of this research is to examine the statistical distribution of solutions from a large crowd to a quantitative problem involving image analysis and object counting. Theoretical analysis by the author, covering a range of conditions and types of factor variables, predicts that composite random variables are distributed log-normally to an excellent approximation. If the factors in a problem are themselves distributed log-normally, then their product is rigorously log-normal. A crowdsourcing experiment devised by the author and implemented with the assistance of a BBC (British Broadcasting Corporation) television show yielded a sample of approximately 2000 responses consistent with a log-normal distribution. The sample mean was within ~12% of the true count. However, a Monte Carlo simulation (MCS) of the experiment, employing either normal or log-normal random variables as factors to model the processes by which a crowd of 1 million might arrive at their estimates, resulted in a visually perfect log-normal distribution with a mean response within ~5% of the true count. The results of this research suggest that a well-modeled MCS, by simulating a sample of responses from a large, rational, and incentivized crowd, can provide a more accurate solution to a quantitative problem than might be attainable by direct sampling of a smaller crowd or an uninformed crowd, irrespective of size, that guesses randomly.


Runs Tests for Non-Randomness
A stochastic process generates random outcomes in time or space. Such processes occur widely in the physical and social sciences, as well as in purely practical human activities such as finance, manufacturing, and commerce. Despite their random occurrence (indeed, precisely because of it), the outcomes of a stochastic process will display ordered patterns which a statistically naïve observer may mistakenly interpret as predictively useful information. In recent years, controversial issues over the information content of time series have arisen in a variety of disciplines such as nuclear physics [1] and econophysics (i.e., the dynamics of the stock market) [2]. Although it is not possible to prove with certainty that a particular process is random, there are various statistical tests to demonstrate within specified confidence limits that it is not random. Among these, nonparametric runs tests are especially useful, in part because of their ease of implementation and statistical power [3].
It is not necessary for a stochastic process to generate binary events in order to be analyzed for runs. For example, a sequence of $n$ different observations $x_1, x_2, \ldots, x_n$ of a continuous random variable will yield $n - 1$ sequential differences that are either positive (+) or negative (-) and therefore again subject to binary runs analysis. The binary elements, however, are not Bernoulli variates since the probability of obtaining an element decreases with its position in the unbroken sequence. Nevertheless, one can test for non-randomness in permutational ordering with a different distribution theory [5].
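The reduction of a continuous series to up-down signs can be sketched in a few lines of Python. This is only an illustration: the function names and the sample data are made up for the example and do not come from the paper.

```python
from itertools import groupby

def up_down_signs(observations):
    """Map n distinct observations to n - 1 signs: '+' for a rise, '-' for a fall."""
    return ["+" if b > a else "-" for a, b in zip(observations, observations[1:])]

def run_lengths(symbols):
    """Return (symbol, length) for each maximal unbroken run, in order of appearance."""
    return [(sym, len(list(group))) for sym, group in groupby(symbols)]

# Hypothetical data, chosen only to show the mechanics.
data = [2.1, 3.4, 3.9, 1.2, 0.7, 5.0, 4.8]
signs = up_down_signs(data)
runs = run_lengths(signs)

print("".join(signs))  # ++--+-
print(runs)            # [('+', 2), ('-', 2), ('+', 1), ('-', 1)]
```

The resulting run lengths are the raw material for an up-down runs test; the distribution theory needed to judge them is the one cited in the text, not something this sketch supplies.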
Although developed initially for testing quality control in manufacturing, exclusive runs and up-down runs have been employed in analysis of a variety of experiments to test the fundamental prediction of quantum mechanics that transitions between quantum states occur randomly [6,7]. A problematic issue in the counting of exclusive or up-down runs is that the length of a run can be changed by future events. Thus, in the succession aabbbaa, the second run of two a's could change to a run of three or four a's if the next two trials resulted in ab or aa, respectively.

Runs of Recurrent Events
A third kind of runs analysis, based on Feller's theory of recurrent events [8], was recently employed to examine certain quantum optical processes for evidence of non-random behavior [9]. A recurrent run of length $t$, as defined by Feller, is a sequence of non-overlapping, uninterrupted successions of exactly $t$ elements of the same kind. It is distinguished from the other two kinds of runs in that the concept of run length is so defined as to be independent of subsequent trials. The advantage of Feller's definition is that runs of a fixed length become recurrent events, and the statistical theory of recurrent events can then be applied to testing empirical data sequences for permutational invariance over a much wider variety of patterns than just those of unbroken sequences of identical binary elements. For example, one may be interested in testing the recurrence of a pattern abab, which, in a quantum optics experiment, might correspond to a sequence of alternating detections of left and right circularly polarized photons, or, in a series of stock price variations, might correspond to a sequence of alternating observations of rising and falling closing prices. Besides the application to runs, the same theoretical foundation may be applied to recurrent events in other forms such as return-to-origin problems, ladder-point problems (instances where a sum of random variables exceeds all preceding sums), and waiting-time problems.
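Feller's counting convention is easy to state in code: the streak counter restarts each time a run of length $t$ is completed, so later trials can never alter a run already counted. The following is a minimal sketch; the function name and test strings are illustrative, not taken from the paper.

```python
def count_recurrent_runs(sequence, symbol, t):
    """Count non-overlapping, uninterrupted successions of exactly t copies of symbol."""
    count = 0
    streak = 0
    for item in sequence:
        if item == symbol:
            streak += 1
            if streak == t:   # a run is completed at this trial
                count += 1
                streak = 0    # counting starts afresh (Feller's convention)
        else:
            streak = 0
    return count

print(count_recurrent_runs("aabbbaa", "a", 2))    # 2
print(count_recurrent_runs("aabbbaaaa", "a", 2))  # 3: the earlier runs are unchanged
print(count_recurrent_runs("aaaaaaa", "a", 3))    # 2: completed at trials 3 and 6
```

Extending the succession aabbbaa by aa adds a new run of length 2 rather than lengthening the second one, which is the independence from subsequent trials that the definition is designed to secure.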
The theory of recurrent runs, the relevant parts of which are examined in the following section, leads to generating functions from which the probability of a run of defined events of specified length can in principle be calculated exactly. As a practical matter, the extraction of these probabilities requires geometrically longer computation times with increasing sequence length. The availability of fast laptop computers with large random access memory and of symbolic mathematical software of hitherto unparalleled ability to execute series expansions and perform differentiation and integration provides the analyst with computational power unimaginable to the creators of the statistical theory of runs. I report here mathematical strategies for significantly reducing the computation time for the probability of the widely applicable case of $k$ occurrences of runs of length $t$ in a Bernoulli sequence of length $n$.

Probability Generating Functions
Following Feller, I define the recurrent event E to be a run of successes of length $t$ in a sequence of Bernoulli trials, with $p$ the probability of a single successful outcome and $q = 1 - p$ the probability of failure. The distribution of the variable $T$ is defined by

$$\Pr(T = n) = f_n,$$

where $f_n$ is the probability that E occurs for the first time at the $n$th trial. The generating function of the probabilities of first occurrence is expressed by

$$F(s) = \sum_{n=1}^{\infty} f_n s^n. \qquad (5)$$

The number of trials to the $k$th occurrence of E is then characterized by the random variable $S_k$ in (2), which is a sum of the waiting times of $k$ independent trials, from which it follows that the associated generating function takes the form

$$[F(s)]^k = \sum_{n=1}^{\infty} f_n^{(k)} s^n,$$

where $f_n^{(k)}$ is the probability that the $k$th occurrence of E first takes place at the $n$th trial. I leave to the cited literature the proof that the generating function (5) for runs of length $t$ with individual probability of success $p$ is given by

$$F(s, p, t) = \frac{p^t s^t (1 - ps)}{1 - s + q p^t s^{t+1}}, \qquad (8)$$

from which the mean $\mu$ and standard deviation $\sigma$ of the recurrence times follow by differentiation:

$$\mu = F'(1) = \frac{1 - p^t}{q p^t}, \qquad \sigma^2 = F''(1) + F'(1) - [F'(1)]^2 = \frac{1}{(q p^t)^2} - \frac{2t + 1}{q p^t} - \frac{p}{q^2}.$$

For economy of expression, the parameters $p$, $t$ will be suppressed in the arguments of $F(s, p, t)$ unless needed to avoid ambiguity. In general, these parameters will be chosen and fixed at the outset of any illustrative applications. Note, too, that to obtain a statistical moment from a probability generating function (pgf), the derivatives are evaluated at $s = 1$, which leads to a sum of terms, whereas to obtain a probability the derivatives are evaluated at $s = 0$, which leads to a single term. For many applications the analyst's interest is not necessarily in the recurrence time (i.e., number of trials) to the $k$th occurrence of E, but in the probability that E occurs $k$ times in a fixed number $n$ of trials. The relation connecting the two variates is

$$\Pr(N_n \ge k) = \Pr(S_k \le n), \qquad (11)$$

where $N_n$ is the number of occurrences of E in $n$ trials. The probability $p_{n,k}$ that $k$ events E occur in $n$ trials is then expressible as

$$p_{n,k} = \Pr(N_n \ge k) - \Pr(N_n \ge k + 1) = \Pr(S_k \le n) - \Pr(S_{k+1} \le n) = \sum_{m=1}^{n} \left[ f_m^{(k)} - f_m^{(k+1)} \right].$$

M. P. SILVERMAN
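As a hedged numerical illustration of the first-occurrence probabilities, the sketch below assumes the standard closed form from Feller's treatment of success runs, $F(s) = p^t s^t (1 - ps) / (1 - s + q p^t s^{t+1})$, and reads off the coefficients $f_n$ by matching powers of $s$. The function names are invented for the example, and the approach is a plain-Python stand-in, not the author's Maple procedure.

```python
from fractions import Fraction

def first_occurrence_probs(p, t, n_max):
    """Exact f_n, n = 0..n_max, from D(s) F(s) = N(s) with
    N(s) = p^t s^t - p^(t+1) s^(t+1) and D(s) = 1 - s + q p^t s^(t+1)."""
    p = Fraction(p)
    q = 1 - p
    numerator = {t: p ** t, t + 1: -(p ** (t + 1))}
    f = [Fraction(0)] * (n_max + 1)
    for n in range(1, n_max + 1):
        f[n] = numerator.get(n, Fraction(0)) + f[n - 1]
        if n - t - 1 >= 0:
            f[n] -= q * p ** t * f[n - t - 1]
    return f

def recurrence_mean_var(p, t):
    """Closed-form mean and variance of the recurrence time (assumed Feller formulas)."""
    p = Fraction(p)
    q = 1 - p
    a = q * p ** t
    return (1 - p ** t) / a, 1 / a ** 2 - (2 * t + 1) / a - p / q ** 2

f = first_occurrence_probs(Fraction(1, 2), 3, 400)
mean, var = recurrence_mean_var(Fraction(1, 2), 3)
numeric_mean = float(sum(n * fn for n, fn in enumerate(f)))

print(f[3], f[4], f[7])   # 1/8 1/16 7/128
print(mean, var)          # 14 142
print(numeric_mean)       # close to 14 (the tail beyond n = 400 is negligible)
```

Exact rational arithmetic is used so that the small-$n$ values can be checked by hand; for long sequences floating-point numbers would be the practical choice.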

Note that the summation in (13) is over the number of occurrences, whereas the summation in (14) is over the number of trials. The second equality in (14) follows directly from Equation (11). Multiplying both sides of (13) by $s^n$ and summing over $n$ leads to the bivariate generating function

$$\sum_{n=0}^{\infty} \sum_{k=0}^{\infty} p_{n,k}\, z^k s^n = \frac{1 - F(s)}{(1 - s)\,[1 - z F(s)]}, \qquad (15)$$

from which the probabilities $p_{n,k}$ are calculable by series expansion of both sides of the equality. A sense of the structure of the formalism can be obtained by considering the case of recurrent runs of length $t = 3$ for a stochastic process with $p = 1/2$. Substitution of these conditions into Equation (8), and of the result into (15), yields the expansion

$$1 + s + s^2 + \left(\tfrac{7}{8} + \tfrac{1}{8} z\right) s^3 + \left(\tfrac{13}{16} + \tfrac{3}{16} z\right) s^4 + \left(\tfrac{3}{4} + \tfrac{1}{4} z\right) s^5 + \left(\tfrac{11}{16} + \tfrac{19}{64} z + \tfrac{1}{64} z^2\right) s^6 + \cdots$$

to order $s^6$. Recall that the powers of $s$ designate the number of trials, and the powers of $z$ designate the number of recurrences of runs of length 3. For a fixed power of $s$, the sum of the coefficients of the powers of $z$ within each bracketed expression equals unity, as it must by the completeness relation for the probability of mutually exclusive outcomes. Note that the first three terms (of order $s^0$, $s^1$, $s^2$) contain only the power $z^0$, since there cannot be runs of length 3 in a sequence of no more than 2 trials. For 3 trials, the probability of 0 runs of length 3 is 7/8 and the probability of 1 run of length 3 is 1/8. For 5 trials, the probability of 0 runs falls to 3/4 and the probability of 1 run rises to 1/4. This pattern persists: (a) to obtain a run of length $t$, the sequence of trials must be of length $n \ge t$, and (b) the greater the number of trials, the higher is the probability of obtaining longer runs.
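The coefficients of such an expansion can be cross-checked without any generating function by propagating probabilities trial by trial. The sketch below is a brute-force dynamic program over the pair (current streak, runs completed), using Feller's convention that the streak resets when a run is completed. It is an independent check under that assumption, not the series-expansion method of the text, and the function name is hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def run_count_distribution(n, p, t):
    """Exact {k: p_{n,k}} for the number of Feller runs of length t in n trials."""
    p = Fraction(p)
    q = 1 - p
    states = {(0, 0): Fraction(1)}          # (streak, runs so far) -> probability
    for _ in range(n):
        nxt = defaultdict(Fraction)
        for (streak, k), prob in states.items():
            nxt[(0, k)] += q * prob          # failure breaks the streak
            if streak + 1 == t:
                nxt[(0, k + 1)] += p * prob  # success completes a run; start afresh
            else:
                nxt[(streak + 1, k)] += p * prob
        states = nxt
    dist = defaultdict(Fraction)
    for (_, k), prob in states.items():
        dist[k] += prob
    return dict(dist)

for n in (2, 3, 6):
    dist = run_count_distribution(n, Fraction(1, 2), 3)
    print(n, {k: str(v) for k, v in sorted(dist.items())})
```

Each distribution sums to one, which is the completeness relation mentioned in the text. The cost grows as $n \times t \times k_{\max}$, so this check is practical only for short to moderate sequences.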

Moment Generating Functions
It is not necessary to know the individual

Numerical Procedures
The statistics (probabilities and expectation values) for any physically meaningful choice of probability of success $p$, run length $t$, and number of trials $n$ are deducible exactly from expressions (15) and (18) in the manner previously illustrated. For many applications, however, particularly where it is possible to accumulate long sequences of data, as is often the case in atomic, nuclear, and elementary particle physics experiments or investigations of stock market time series, the tests for evidence of non-random behavior are best made by examining long runs. Suppose, for example, one wanted the probability of obtaining the number of occurrences of runs of length 50 in a sequence of 100 trials. One approach, leading directly to all non-vanishing probabilities, would be to extract the 100th term [...] and similarly [...] where $C$ is the unit circle and the generating function $F(s)$, given by Equation (8), specifies the single-event probability $p$ and run length $t$. Contrary to first impression, however, the execution of expressions (23) or (24) for [...] is to be replaced by the actual $j$th numerical element in the sum obtained in step (2). The arrow, inserted by the author, symbolically points to the form of the output. Extraction of this element for the specified conditions led to $N_{1{,}000{,}000} = 33{,}333.258$ in a fraction of a second. In Maple, the command evalf calls for numerical evaluation of expressions; omitting this command results in an exact fraction, which for a sequence length of 1 million is too unwieldy to be useful. Use of the command lcoeff yields the same result, but executes more slowly for large $n$.
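A mean run count for a sequence of a million trials can also be obtained without symbolic software. The sketch below sums $u_m$, the probability that a run is completed at trial $m$, using Feller's recurrence $u_m + p\,u_{m-1} + \cdots + p^{t-1} u_{m-t+1} = p^t$ for $m \ge t$. The parameter values $p = 1/2$ and $t = 4$ are an assumption made for illustration, since the conditions are not stated in the surviving text; with them the result should land near the figure 33,333.258 quoted above. The function name is hypothetical.

```python
from collections import deque

def mean_number_of_runs(n, p, t):
    """Expected number of Feller runs of length t in n Bernoulli(p) trials."""
    powers = [p ** j for j in range(t)]            # p^0 .. p^(t-1)
    recent = deque([0.0] * (t - 1), maxlen=t - 1)  # u_{m-t+1} .. u_{m-1}
    total = 0.0
    for m in range(1, n + 1):
        if m < t:
            u = 0.0                                # too few trials for a run
        else:
            u = p ** t - sum(powers[j] * recent[-j] for j in range(1, t))
        recent.append(u)
        total += u                                 # E[N_n] = u_1 + ... + u_n
    return total

# Assumed parameters (not given in the text): p = 1/2, t = 4.
mean_million = mean_number_of_runs(1_000_000, 0.5, 4)
print(round(mean_million, 3))
```

The loop is linear in $n$, so a pure-Python run takes on the order of a second; the symbolic route in the text additionally yields the exact rational value.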
The procedure described above for converting the rational expression of $s$ into a formal power series in $s$ did not work with the bivariate generator [...] where the ellipsis is to be replaced by the $k$th term produced in step (1). In using Maple to execute method (2), one proceeds in a single step once the rational expression [...]

Note that the series must be expanded to $n + 1$ terms in order to obtain the coefficient of $s^n$, since the summation index begins at 0.

Generating Function of Cumulative Probability
For many applications in the physical sciences and elsewhere, the full set of probabilities $\{p_{n,k}\}$ provides more information than is desirable or usable. Moreover, because $p_{n,k}$ for large $k$ may be very small and the variance relatively large, the more observationally stable statistic is the probability of obtaining $k$ or more occurrences of the specified event, or in other words, the complementary cumulative probability (ccp) $\Pr(N_n \ge k)$ introduced in Equation (11). Experimental situations calling for preferential usage of a cumulative probability distribution over a probability function abound in the physical sciences, as, for example, in the analysis of fragmentation [10] and other stochastic processes leading to a power-law distribution.
The generating function for the ccp is derivable from ( )
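One plausible route to such a generating function, offered here as a sketch rather than as the author's derivation, starts from relation (11) and the fact that $[F(s)]^k$ generates the probabilities $f_m^{(k)}$ of the $k$th occurrence at trial $m$. Summing a coefficient sequence cumulatively corresponds to dividing its generating function by $1 - s$:

```latex
\begin{align}
\sum_{n=0}^{\infty} \Pr(N_n \ge k)\, s^n
  &= \sum_{n=0}^{\infty} \Pr(S_k \le n)\, s^n
   = \sum_{n=0}^{\infty} s^n \sum_{m=1}^{n} f_m^{(k)} \\
  &= \frac{[F(s)]^k}{1 - s}.
\end{align}
```

As a quick consistency check with $t = 3$, $p = 1/2$, and $k = 1$, the coefficient of $s^5$ is $f_3 + f_4 + f_5 = 1/8 + 1/16 + 1/16 = 1/4$, the probability of at least one run of length 3 in 5 trials.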

Asymptotic Distributions
One can show by application of the Central Limit Theorem (CLT) to relation (11) that, for a sufficiently large number of trials $n$ and number of occurrences $k$, the number $N_n$ of runs of length $t$ produced in $n$ trials is approximately normally distributed with mean and variance given by relations (21) and (22). The approximation, whose relative accuracy improves in the limit of increasing $n$, is actually quite good even for modest values of $n$, as shown in Table 1 for $n = 100$. Expansion of the generating function $M_1(s)$ yielded the exact mean value as an integer or fraction, which was then expressed as a floating-point number to three significant figures for comparison with the Gaussian approximation. It is to be noted from the [...] $F(s, p, t)$, are respectively the mean number of trials to the first occurrence of event E, which is a run of length $t$. The equivalence of the mean and standard deviation suggests that the asymptotic distribution of the random variable $T$ in Equation (1) is exponential [11] [...]
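The Gaussian approximation can be sketched numerically under the assumption that relations (21) and (22) take the standard renewal-theory form, mean $n/\mu$ and variance $n\sigma^2/\mu^3$, with $\mu$ and $\sigma^2$ the mean and variance of the recurrence time. The function names and the half-unit continuity correction are choices made for this example, not details taken from the paper.

```python
import math

def recurrence_moments(p, t):
    """Mean and variance of the recurrence time of a success run of length t."""
    q = 1.0 - p
    a = q * p ** t
    mu = (1.0 - p ** t) / a
    var = 1.0 / a ** 2 - (2 * t + 1) / a - p / q ** 2
    return mu, var

def gaussian_run_count(n, p, t):
    """Assumed asymptotic mean and variance of N_n: n/mu and n*sigma^2/mu^3."""
    mu, var = recurrence_moments(p, t)
    return n / mu, n * var / mu ** 3

def gaussian_ccp(n, k, p, t):
    """Normal approximation to Pr(N_n >= k), with a continuity correction."""
    mean, variance = gaussian_run_count(n, p, t)
    z = (k - 0.5 - mean) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

mean_100, var_100 = gaussian_run_count(100, 0.5, 3)
print(round(mean_100, 3), round(var_100, 3))   # 7.143 5.175
print(gaussian_ccp(100, 10, 0.5, 3))

# Ratio sigma/mu of the recurrence time approaches 1 as t grows,
# consistent with an exponential limiting form for T.
for t in (3, 10):
    mu, var = recurrence_moments(0.5, t)
    print(t, math.sqrt(var) / mu)
```

For $t = 3$ the ratio is about 0.85, while for $t = 10$ it is already within half a percent of unity, which illustrates why long runs are the regime in which the exponential description of $T$ becomes reasonable.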

Conclusions
The theory of recurrent runs provides a statistical basis for rejecting the hypothesis that a series of observations (in time or space) is random. This is a matter that often arises in experimental investigations in atomic, optical, nuclear, and elementary particle physics, as well as in other sciences, finance, and commerce, which may entail a very large number (perhaps thousands to millions) of trials or observations. In this paper theoretical and numerical methods based on different generating functions were derived and investigated to determine (a) the probability $p_{n,k}$ for $k$ recurrent runs of length $t$ in $n$ Bernoulli trials, (b) the complementary cumulative probability [...] The methods reported here can be implemented on modern laptop computers running commercially available symbolic mathematical software, such as Maple (which was the application used by the author). Computation times for application of these methods to data sequences up to millions of trials could range from seconds to minutes.
To compute runs statistics for sequences of intermediate to very long trial numbers, the asymptotic distribution for the number of trials up to and including the $k$th occurrence [...] of a specified run length $t$ was derived and found to be a Gamma distribution under these circumstances, but the Gaussian approximation is less accurate and fails entirely for values of $n$ and $k$ for which the CLT does not apply.