Crowdsourced Sampling of a Composite Random Variable: Analysis, Simulation, and Experimental Test

A composite random variable is a product (or sum of products) of statistically distributed quantities. Such a variable can represent the solution to a multi-factor quantitative problem submitted to a large, diverse, independent, anonymous group of non-expert respondents (the “crowd”). The objective of this research is to examine the statistical distribution of solutions from a large crowd to a quantitative problem involving image analysis and object counting. Theoretical analysis by the author, covering a range of conditions and types of factor variables, predicts that composite random variables are distributed log-normally to an excellent approximation. If the factors in a problem are themselves distributed log-normally, then their product is rigorously log-normal. A crowdsourcing experiment devised by the author and implemented with the assistance of a BBC (British Broadcasting Corporation) television show yielded a sample of approximately 2000 responses consistent with a log-normal distribution. The sample mean was within ~12% of the true count. However, a Monte Carlo simulation (MCS) of the experiment, employing either normal or log-normal random variables as factors to model the processes by which a crowd of 1 million might arrive at their estimates, resulted in a visually perfect log-normal distribution with a mean response within ~5% of the true count. The results of this research suggest that a well-modeled MCS, by simulating a sample of responses from a large, rational, and incentivized crowd, can provide a more accurate solution to a quantitative problem than might be attainable by direct sampling of a smaller crowd or an uninformed crowd, irrespective of size, that guesses randomly.


Introduction: Estimation of an Unknown Composite Quantity by Large-Scale Sampling
The global reach of telecommunications media, including radio, television, and in particular the social media sites of the internet, makes possible an ease and scale of statistical sampling hitherto inconceivable. Through use of these media, almost any question can, at least in principle, be posed to a large, anonymous, diverse, independent population of respondents, referred to in both technical and non-technical literature as the "crowd" [1]. This paper reports a comprehensive 1) analytical investigation, 2) Monte-Carlo simulation, and 3) experimental test of the distribution of a composite random variable (RV) representing a crowdsourced response to a question calling for a numerical answer. A composite RV is a product of two or more factor RVs. In the following sections it is shown that: 1) the most useful characteristic of a crowdsourced sample is its distribution function and not just a single statistic, 2) under conditions to be specified, a product of RVs is distributed log-normally to an excellent approximation, irrespective of the type or number or correlation of factor RVs, 3) computer simulation methods can model the response of a hypothetical rational crowd orders of magnitude larger than what actually might be practically attainable.

Background
To the author's knowledge, the first quantitative experiment in what today would be considered crowdsourcing was published by the English polymath and statistical innovator Sir Francis Galton in 1907 [2] [3]. Galton [...]
The idea underlying crowdsourcing (a term introduced in 2006) is that a large group of non-experts can collectively arrive at a more accurate estimate of some physical quantity, or at a better decision regarding some policy, strategy, or treatment, than a small group of experts [4]. This idea is a hypothesis to be examined experimentally, not a mathematical theorem subject to rigorous proof, like Condorcet's jury theorem [5]. Central among crowdsourcing issues in [...] i.e. the number of objects in a unit volume. However, none of the needed numbers is known; all are representable by random variables whose realizations (i.e. estimates) by respondents in the crowd would be different. The sought-for RV would, in general, be a product (or sum of products) of 3 RVs relating to geometry and 1 RV characterizing the numerical density, or in all a product (or sum of products) of 4 RVs. The analyst is then faced with three general questions:
1) How are the basis RVs distributed?
2) What will be the distribution of the composite RV?
3) Which statistic of the composite RV should be taken to represent the physical value of the sought-for quantity?
By examining this archetypical question a) theoretically, b) computationally by Monte Carlo simulation, and c) experimentally, this paper addresses the preceding three questions.

Organization
The remainder of this paper is organized in the following way: Section 2 investigates analytically the distribution of a composite random variable comprising independent basis RVs. Of particular interest are the cases in which the basis is either normally or log-normally distributed.

Section 3 investigates numerically by MCS the distributions of a composite variable comprising basis RVs whose distributions differ widely in shape parameters (skewness, kurtosis) for fixed location and scale parameters (mean, variance).
Section 4 reports 1) an experiment, implemented with the collaboration of a British national television show, to employ crowdsourcing as a means to estimate the number of opaque objects in a transparent receptacle, and 2) the use of MCS to predict the statistical results for a hypothetical much larger crowd incentivized to estimate rationally rather than guess randomly.
Section 5 concludes the paper with a summary of principal findings.
For the reader's convenience, the statistical abbreviations used in the paper [...] Table 1 [...] is the Heaviside function, also known as the step function, which we define here as [...] (There are different definitions of $H(x)$ depending on the value assigned to $H(0)$ [17].) A statistical convention followed in this paper is to represent a random variable by an upper case letter, e.g. X, and a variate (i.e. sample or realization of the random variable) by a corresponding lower case letter, e.g. x.
Reciprocally, one can write $Z = e^{Y}$ (4).
The strategy of the analysis in this section is to calculate the moment-generating function (MGF) of Y defined by the expectation operation [...] (5), where $p_Y(y)$ is the PDF of Y, and t is a dummy variable the differentiation of which generates the statistical moments of order $k = 0, 1, 2, \ldots$ in the following way: [...] (6)
If the MGF of a random variable does not exist, one can always use the characteristic function (CF) defined by [...] (7), where Equation (7) is recognized as the Fourier transform of $p_Y(y)$ [18]. Each random variable is uniquely characterized by its MGF (if it exists) and CF [19].
By identifying the MGF or CF of Y, it may then be possible to determine the distribution of the sought-for composite variable Z.
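For reference, the standard textbook definitions that this strategy relies on can be written as follows. This is a sketch only: the symbols $g_Y(t)$ and $h_Y(t)$ for the MGF and CF, and the angle brackets for expectation, are notation assumed here for illustration and may differ from the paper's own displayed equations.

```latex
\begin{align}
g_Y(t) &= \left\langle e^{Yt} \right\rangle
        = \int_{-\infty}^{\infty} p_Y(y)\, e^{yt}\, dy, \\
\left\langle Y^k \right\rangle
       &= \left. \frac{d^k g_Y(t)}{dt^k} \right|_{t=0},
          \qquad k = 0, 1, 2, \ldots, \\
h_Y(t) &= \left\langle e^{iYt} \right\rangle
        = \int_{-\infty}^{\infty} p_Y(y)\, e^{iyt}\, dy .
\end{align}
```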

Substitution of Equation (3) into Equation (5) leads to [...] (8) in which the last step (expectation of a product equals the product of expectations) is justified if the basis RVs are independent, as assumed to be the case in this section. This point will be revisited in Section 4.
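The factorization step can be checked numerically. The sketch below assumes two independent, strictly positive factors with arbitrarily chosen distributions and parameters (none of these values come from the paper) and compares the sampled expectation of exp(tY), with Y = ln(X1 X2), against the product of the separate expectations of X1^t and X2^t.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical independent positive factors (illustrative parameters only).
x1 = rng.normal(5.0, 0.5, n)     # mean/sd ratio of 10, so negatives are negligible
x2 = rng.uniform(2.0, 4.0, n)

t = 1.5                          # arbitrary value of the MGF argument
y = np.log(x1 * x2)              # Y = ln Z with Z = X1 * X2

mgf_direct = np.mean(np.exp(t * y))               # <exp(tY)> = <(X1 X2)^t>
mgf_factored = np.mean(x1**t) * np.mean(x2**t)    # <X1^t> <X2^t>

print(mgf_direct, mgf_factored)
```

For independent factors the two numbers should agree to within sampling error; introducing a dependence between `x1` and `x2` would generally break the agreement.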
From the form of Equation (1), a further condition of the analysis is that the basis RVs have well-defined means and variances. This is the same requirement as for the Central Limit Theorem (CLT) (see [19], pp. 193-195). Re-express each [...] by the identity [...] (9), which defines the variable $\beta_i$, and substitute Equation (9) into Equation (8). This leads to the approximate MGF [...] (12), where [...] respectively define the mean, variance, and skewness parameter $\lambda_Y$ of Y. Under the conditions assumed in the foregoing analysis, MGF (12) shows that the distributions of Y, and therefore also Z, are not symmetric about the mean.
The author has been unable to find any source that identifies MGF (12) with a named distribution. However, upon neglect of skewness, Equation (12) takes the form $\exp\left(\mu_Y t + \tfrac{1}{2}\sigma_Y^2 t^2\right)$ (16) of the MGF of a normal RV [20]. By definition, if Y, as defined by Equation (3), is a normal RV denoted by [...] Table 1. Note that the parameters defining the log-normal RV are the mean and variance of the associated normal RV and not the mean and variance of the log-normal RV itself.
For comparison, Figure 1 shows plots of the PDF of a normal and log-normal distribution, as well as the PDFs of a uniform and Laplace distribution (which will be used in Section 3), all of the same mean ($\mu_X = 5$) and standard deviation [...]. The figure illustrates that the significance of the standard deviation as a measure of statistical uncertainty (i.e. the width of the PDF) can vary markedly for different distributions, as summarized quantitatively in Table 2, which records the cumulative probability of a variable X.
where the error function is defined by [...] The first column of Table 2 shows the values of the distribution parameters that lead to the fixed mean [...]
In the analyses and experiments of this paper, it will be adequate to neglect the skewness of Y and adopt MGF (16), which identifies Y as a normal RV. In that case, it follows that Z takes the form
$Z = e^{Y} = e^{\mu_Y + \sigma_Y W}$ (20)
in which W is a standard normal RV. The justification of Equation (20) is that an arbitrary normal RV $N(\mu, \sigma^2)$ can be written in the form [21] $N(\mu, \sigma^2) = \mu + \sigma N(0, 1)$ (21). Equation ([...]) [...] where the PDF of W is given in Table 1 and the inverse relations, which will be needed later, can be shown to be [...]
Although the RV Y is distributed symmetrically about its mean, the distribution of Z itself is skewed. From Equation (22) the third moment about the mean, to which skewness is proportional, can be shown to be [...]
The principal findings of this section may be summarized as follows:
1) A random variable Z composed of the product of 2 or more factor RVs for which the ratio of standard deviation to mean is <1 is distributed log-normally to the extent that the skewness (and higher order moments) of $\ln Z$ can be neglected.
2) To find the parameters of the distribution of a log-normal RV Z, one first transforms the data (e.g. sample or simulation) by $Y = \ln Z$ to obtain the distribution of the associated normal RV Y, which is symmetric about its mean.
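The forward and inverse relations between the parameters of Y and the mean and variance of Z = exp(Y) can be sketched in a few lines. The formulas below are the standard textbook relations for a log-normal variable; the function names are hypothetical, and the correspondence with the paper's own equation numbers is not asserted.

```python
import math

def lognormal_moments(mu_y, sigma_y):
    """Mean and variance of Z = exp(Y) when Y is normal with mean mu_y, sd sigma_y."""
    mean_z = math.exp(mu_y + 0.5 * sigma_y**2)
    var_z = (math.exp(sigma_y**2) - 1.0) * mean_z**2
    return mean_z, var_z

def normal_params(mean_z, var_z):
    """Inverse relations: recover (mu_y, sigma_y) from the mean and variance of Z."""
    sigma_y2 = math.log(1.0 + var_z / mean_z**2)
    mu_y = math.log(mean_z) - 0.5 * sigma_y2
    return mu_y, math.sqrt(sigma_y2)

# Round trip with illustrative values.
mean_z, var_z = lognormal_moments(1.0, 0.25)
mu_back, sigma_back = normal_params(mean_z, var_z)
print(mean_z, var_z, mu_back, sigma_back)
```

The round trip illustrates the earlier remark that the parameters of a log-normal RV are those of the associated normal RV, not the mean and variance of Z itself.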
In concluding this section, a point of comparison is in order regarding the CLT for the sum of independent RVs and relation (20) for the product of independent RVs. In brief, the CLT holds that the sum (e.g. mean) of a sufficiently large number N of identically distributed, independent RVs converges to a normal RV irrespective of the distribution of X, provided that the $X_i$ have a well-defined mean and variance [22] [23]. In theory, the number N is infinitely large, but in practice it can be well below 10; see Ref. [10], pp. 36-38. In contrast, the foregoing demonstration that a product of RVs is distributed approximately log-normally

Special Case: Product of Normal RVs
[...] for the CF of Y, where the summation index has been changed from i to j so as not to be confounded with the unit imaginary $i = \sqrt{-1}$. The inverse Fourier transform of Equation (27) then yields the PDF of Y [...] in which the second equality of Equation (27) [...] Substitution of relation (21) for each normal factor [...] which can be re-expressed in the form [...] Equation (31) [...] Substitution of CF (33) into Equations (28) and (29) provides a more accurate PDF of Y and Z than the PDF of log-normal (26).
[...] in Equation (27), then one can approximate [...] which, substituted into the integral in Equation (28), leads to the Gaussian distribution [...] and associated log-product [...]
In Figure 2 are plotted the real part [...] given by Equation (33) as a function of t. Although t serves in the MGF as a dummy variable for computation of statistical moments by differentiation, in the CF t is equivalent to a spatial or temporal frequency [27] [28]. [...] (1) and (2) are seen to be nearly indistinguishable, and both are well approximated by the Gaussian profile (3). Figure 4 shows plots of $p_Z(z)$ as calculated by (1) [...]

Special Case: Product of Log-Normal RVs
The ubiquity of the normal distribution is primarily a consequence of the CLT, which is a limiting theorem for the sum of a large (in theory, infinite) number of random variables. Moreover, the distributed variable can take (or, as a matter of practicality, be thought to take) both positive and negative values, since the Gaussian PDF is normalized to unity only when integrated over the entire real axis. The log-normal distribution also occurs widely, particularly in reference to activities that involve counting, measuring, or observing the attributes of real physical things. Such activities underlie many kinds of problems for which crowdsourced solutions can be sought. The distributed variable then takes on only non-negative real values and is expected to be intrinsically skewed, since its least value cannot be below zero, whereas its upper limit is open.
Consider, therefore, a composite variable Z composed of log-normal factors with PDF of the form (see [21], pp. 131-134) [...] (39) which shows that Y is a Gaussian RV of mean m and variance $s^2$, i.e. [...]
Thus, taking the log of Equation (37) leads to the chain of relations [...] from which it follows that Z, itself, is a log-normal RV [...]
Stated formally: The product of log-normal RVs is a log-normal RV with parameters given by Equation (42). Note that the preceding result, Equation (41), is exact; no approximations regarding either the number of factor RVs or the relative magnitudes of parameters $m_i$ and $s_i$ have been made.
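A quick numerical check of this statement is sketched below, assuming independent log-normal factors. For such factors the logarithm of the product is a sum of independent normal variables, so its mean and variance should equal the sums of the $m_i$ and $s_i^2$. The three parameter pairs are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical parameters (m_i, s_i) of the normal RVs associated with each factor.
m = np.array([0.2, 1.0, -0.5])
s = np.array([0.3, 0.1, 0.4])

x = rng.lognormal(mean=m, sigma=s, size=(n, 3))   # one column per independent factor
z = x.prod(axis=1)                                # composite variable Z
y = np.log(z)                                     # Y = ln Z

print("mean of Y:", y.mean(), "expected:", m.sum())
print("var of Y: ", y.var(), "expected:", (s**2).sum())
```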
From Equation (23) the mean and variance of Z, defined by Equation (37), are

Monte-Carlo Simulations of a Composite Random Variable
In [...] to characterize the 3-dimensional receptacle geometry. The physical quantity for which an estimate is sought is then represented by the variable $Z = X_1 X_2 X_3 X_4$ [...] If Z is satisfactorily described by a log-normal RV, then [...]
• Each simulation, although generated with a different type of basis variable X, should lead within statistical uncertainties to identical histograms for Z and Y.
The preceding prediction follows from the fact that the means and variances of Z and Y depend only on the means and variances (44) of the basis variables $X_i$, and not on the type of RV symbolized by X.
From the ungrouped variates of each MCS one can calculate the sample mean and sample variance of Z by two different approaches, both employing relations deduced from the method of maximum likelihood (ML) [29]. The first approach is to calculate the sample mean [...] Agreement of statistics (47) and (49) would indicate that the variates of Z were distributed log-normally.
Comparison of sample statistics with theory for each of the simulations to follow is summarized in Table 3.
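The two routes to the sample mean and variance of Z might be sketched as follows. Because the estimator equations themselves are not reproduced in this text, the forms below are assumptions: the first function uses the plain sample moments of Z, the second uses the moments of ln Z converted through the standard log-normal relations. The function names and the test sample are hypothetical.

```python
import numpy as np

def direct_stats(z):
    """Sample mean and (1/n) sample variance computed from the variates of Z."""
    z = np.asarray(z, dtype=float)
    return z.mean(), z.var()

def lognormal_stats(z):
    """Mean and variance of Z implied by fitting a normal distribution to ln Z."""
    y = np.log(np.asarray(z, dtype=float))
    mu_y, var_y = y.mean(), y.var()
    mean_z = np.exp(mu_y + 0.5 * var_y)
    var_z = (np.exp(var_y) - 1.0) * mean_z**2
    return mean_z, var_z

# Illustrative check on variates that are log-normal by construction.
rng = np.random.default_rng(2)
z = rng.lognormal(mean=1.0, sigma=0.25, size=100_000)
print(direct_stats(z))
print(lognormal_stats(z))
```

Close agreement of the two pairs is what one would expect for log-normally distributed variates; a marked disagreement would suggest that the log-normal model is a poor description of the sample.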

Normal Basis X = N
The normal distribution is defined by its mean and variance (see Table 1) [...] Skewness (50) is a measure of the asymmetry of the PDF with respect to the mean.
Kurtosis (51) is a measure of the shape of the tails of the PDF. A distribution with "fat tails" (leptokurtic) has a higher probability than normal of extreme events, in contrast to a distribution with "thin tails" (platykurtic) for which the probability of extreme events is lower than normal.
Table 3. Statistics of Monte Carlo simulations of [...]
Figure 5 shows a panoramic plot of the histograms of $X_1$ (green), $X_2$, $X_3$, $X_4$ (gray), and [...] As expected, all the histograms in the figure appear to be Gaussian, and the histogram of Y lies between the histograms of $X_2$ and $X_3$.
Panels A and B of Figure 6 respectively show in greater detail the histograms of Z and Y, bordered by the profiles of the corresponding log-normal and normal PDFs. In panel A, the right tail of the histogram is marginally less skewed than predicted by the log-normal model. In panel B, the left tail of the histogram is marginally more skewed than the symmetric profile of the Gaussian PDF.
Nevertheless, in both panels, the theoretical profiles satisfactorily match the peak and overall shape of the histograms.
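Sample skewness and kurtosis of simulated variates can be computed as in the sketch below. It assumes the common moment-ratio definitions, with the non-excess convention in which a normal RV has kurtosis 3; the paper's own definitions are not reproduced here and may use a different normalization. The sample parameters are illustrative only.

```python
import numpy as np

def skewness(x):
    """Third central moment divided by the 3/2 power of the variance."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return np.mean(d**3) / np.mean(d**2) ** 1.5

def kurtosis(x):
    """Fourth central moment divided by the squared variance (normal -> 3)."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return np.mean(d**4) / np.mean(d**2) ** 2

rng = np.random.default_rng(3)
sample = rng.normal(5.0, 1.0, 200_000)   # hypothetical normal basis variable
print(skewness(sample), kurtosis(sample))
```

Under this convention, values above 3 indicate fat tails (leptokurtic) and values below 3 indicate thin tails (platykurtic).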

Uniform Basis
[...] is symbolized by its upper and lower boundaries [...] From Table 2 it follows that the mean and standard deviation of X are related to the boundary parameters by [...] Each histogram is enveloped by its associated Gaussian PDF (red).

Laplace Basis X = La
[...] is symbolized by a location parameter $\mu$ corresponding to the mean of X and a scale parameter $\beta$ related to the standard deviation of X by [...] (see Table 2). The four basis variables of the simulation, which have the same means and variances as the basis RVs of Section 3.1, are then respectively [...]
Figure [...]. [...] enveloped by PDF of Gaussian variable (35). Sample size, symbolic notation, and color coding are the same as in Figure 5.
[...] (59)
Figure 9 shows a panoramic plot of the histograms of $X_i$, which have sharp cusps and fat tails in comparison to the Gaussian histograms of Figure 5. Equation (59) establishes quantitatively that a Laplace RV is leptokurtic, as is apparent from Figure 1. Nevertheless, the histogram of [...]
Figure [...]. [...] Histogram Y is enveloped by the Gaussian PDF of Figure 5. Sample size, symbolic notation, and color coding are the same as in Figure 5.
[...] $Z = X_1 X_2 X_3 X_4$ are seen to be precisely normal and log-normal, respectively, as predicted in Section 2.3 and shown in detail in Figure 12.
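The comparison across basis types can be sketched as a small Monte Carlo experiment. The code assumes the standard parametrizations (a uniform variable on an interval of half-width sqrt(3) times the standard deviation, and a Laplace variable with scale equal to the standard deviation divided by sqrt(2)) and uses hypothetical means and standard deviations, not the values of the paper.

```python
import numpy as np

def draw_basis(kind, mean, sd, size, rng):
    """Draw variates of the requested type with the given mean and standard deviation."""
    if kind == "normal":
        return rng.normal(mean, sd, size)
    if kind == "uniform":
        half_width = np.sqrt(3.0) * sd            # sd of U(a, b) is (b - a) / (2 sqrt 3)
        return rng.uniform(mean - half_width, mean + half_width, size)
    if kind == "laplace":
        return rng.laplace(mean, sd / np.sqrt(2.0), size)   # variance is 2 * scale**2
    raise ValueError(f"unknown basis type: {kind}")

rng = np.random.default_rng(4)
n = 200_000
means = [10.0, 4.0, 6.0, 8.0]    # hypothetical means of the four basis variables
sds = [0.5, 0.2, 0.3, 0.4]       # hypothetical standard deviations (5% of the mean)

results = {}
for kind in ("normal", "uniform", "laplace"):
    z = np.ones(n)
    for mu, sd in zip(means, sds):
        z *= draw_basis(kind, mu, sd, n, rng)     # independent factors
    y = np.log(z)
    results[kind] = (z.mean(), z.std(), y.mean(), y.std())
    print(kind, results[kind])
```

With matched means and variances, the mean and standard deviation of Z should agree across the three bases to within sampling error, even though the shapes of the individual factor distributions differ widely.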

Commentary
The set of variates (45) comprises the response of a crowd to a problem for which the sought-for solution is a composite random variable Z. The information, or so-called "wisdom of the crowd" [1], lies in the distribution of Z from which the full population statistics can be determined. In comparing the MCS histograms [...]
Figure [...]. [...] PDFs (red) (41). Histogram Y is enveloped by the Gaussian PDF (40). Sample size, symbolic notation, and color coding are the same as in Figure 5.
[...] there is reason to believe that the basis variables $X_i$ comprising the composite variable Z are distributed log-normally, then Z itself should be rigorously log-normal, and a goodness-of-fit test may then be appropriate. This point will be illuminated further in Section 4, which reports a crowdsourcing experiment and MCS to estimate the number of identical objects in a receptacle.
The preceding comments notwithstanding, Figures 5-12 illustrate how well the predicted log-normal distribution fits the histograms of Z generated by basis variables of widely differing distribution shapes, as distinguished by their skewness and kurtosis. Simulations using normal or log-normal basis variables yielded the visually closest matches to the log-normal model. In the case of a log-normal basis, theory predicted, and MCS sustained, an exact log-normal distribution of Z.

Test of Crowdsourced Estimation
In a collaborative effort with the BBC The One Show (nearly exactly 100 years after Galton's pioneering statistical experiment), the author was able to obtain,

The Coin-Estimation Experiment
The [...] Actually, the maximum value submitted was 25 million, which was about 15% of the entire BBC One network annual budget in the form of £1 coins in a small glass tumbler. The submission was rejected on the grounds that it was so preposterous as to be intended to undermine the experiment. [...] Table 4.
Figure 14. Comparison of the histogram (blue) of 1706 crowdsourced estimates with the histogram (gray) of $10^6$ Monte Carlo simulated responses employing log-normal basis variables for coin density and tumbler geometry. The crowdsourced mean estimate was 982; the MCS mean was 1057; the true count was 1111. Relevant statistics are given in Table 4. Enveloping the histograms are the profiles of the log-normal PDFs for the sample (dashed blue) and simulation (solid red).
[...] The gray histogram with red border in Figure 14 will be discussed in Section 4.2.
Despite the caution about goodness-of-fit tests in Section 3.5, it is noteworthy that the fit of the log-normal PDF with parameters (65)

Monte Carlo Simulation of the Coin Estimation Experiment
Passing a goodness-of-fit test does not necessarily prove that a hypothesized theory is correct. Rather, it signifies that the theory should not be rejected on the basis of the tested data. The statistical significance of the experiment described in Section 4.1 is that the distribution of estimates of the number of coins (a composite RV) is consistent with a log-normal distribution for the given sample.
Nevertheless, the implication of this result is of far-reaching practical importance: If it is indeed the case that the estimates from a crowd of given size are distributed log-normally, then one should be able to simulate the estimates of a much larger crowd by constructing the appropriate basis variables that form the factors of the sought-for composite variable.
In other words, the analyst may be able to avoid sampling an impractically large crowd, yet still obtain reliable statistical information by a Monte Carlo simulation (MCS). In this section the responses from a hypothetical crowd of 1 million were simulated by applying the underlying reasoning and mathematical procedure described in Section 2.
Responses from a large crowd to a question that calls for a quantitative answer will presumably include some random guesses as well as reasoned estimates. As the author has emphasized elsewhere [10], a key principle for increasing the proportion of reliable estimates in crowdsourcing is to provide participants with a personal incentive to respond thoughtfully. Broadly speaking, there are two types of incentives. The first is to reward all respondents in some way for participating. For example, the author has used this method to provide extra credit toward the final course grade of all students in the class who executed certain tasks designed to measure the randomization of shuffled playing cards [31].
Another example of this reward structure is the internet-based Amazon Mechanical Turk which, according to Amazon, leverages "the skills of distributed Workers on a pay-per-task model" [32]. The second kind of incentive, which has [...] in the experiment to estimate weight, is to reward only the respondent(s) whose estimate(s) comes closest to the true (or best) answer to the problem, once the answer becomes known. In this second approach, the members of the crowd are effectively in a competition where skill matters, unlike the case of a lottery where success depends primarily on probability and luck.
Let us assume, then, that members of the hypothetical crowd represented by the MCS are incentivized to deduce the number of coins as described in Section 2. A likely approach entails multiplying the numerical density of coins by the volume of the receptacle, computed from its geometrical dimensions. The televised image of the tumbler showed it to have the shape of an inverted truncated right circular cone, or frustum, such as illustrated in Figure 15. The number Z of coins in the tumbler could then be calculated from the expression [33]
$Z = \left(\tfrac{1}{3}\pi C H\right)\left(R_1^2 + R_1 R_2 + R_2^2\right)$ (69)
in which $R_1$ is the lower radius, $R_2$ is the upper radius, H is the height, and C is the numerical density of the coins. Because the upper and lower radii, height, and numerical density of coins are quantities unknown to the crowd, they must be treated as random variables. The author, himself, did not know the true numerical values, but, judging from the same image presented to the viewers, assigned random variables with the following estimated means and standard deviations (in units of cm) [...] (70)
Figure 16 shows a panoramic plot of the distributions of variables $X_i$, $i = 1, 2, 3, 4$, for both normal (dashed) and log-normal (solid) bases.
Although the former (normal) are symmetric about the mean and the latter (log-normal) exhibit skewness, the difference in visual appearance of the two PDF profiles for each variable is relatively insignificant for the parameters shown in relations (70).
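A simulation of this kind might be sketched as follows, assuming the standard frustum volume formula multiplied by a numerical density and log-normal basis variables specified by their means and standard deviations. All numerical values below are hypothetical placeholders, not the parameters assigned by the author, and the function names are illustrative.

```python
import numpy as np

def frustum_count(r1, r2, h, c):
    """Number of objects: density c times the volume of a frustum with radii r1, r2 and height h."""
    return (np.pi * c * h / 3.0) * (r1**2 + r1 * r2 + r2**2)

def lognormal_from_mean_sd(mean, sd, size, rng):
    """Log-normal variates whose own mean and standard deviation are `mean` and `sd`."""
    var_y = np.log(1.0 + (sd / mean) ** 2)
    mu_y = np.log(mean) - 0.5 * var_y
    return rng.lognormal(mu_y, np.sqrt(var_y), size)

rng = np.random.default_rng(5)
n = 100_000

# Hypothetical (mean, sd) pairs; NOT the values used in the paper.
r1 = lognormal_from_mean_sd(3.0, 0.3, n, rng)    # lower radius (cm)
r2 = lognormal_from_mean_sd(4.0, 0.4, n, rng)    # upper radius (cm)
h = lognormal_from_mean_sd(10.0, 1.0, n, rng)    # height (cm)
c = lognormal_from_mean_sd(2.0, 0.3, n, rng)     # objects per cubic cm

z = frustum_count(r1, r2, h, c)                  # simulated crowd estimates
y = np.log(z)
print("mean estimate:", z.mean())
print("mean and sd of ln Z:", y.mean(), y.std())
```

A histogram of `z` can then be compared with the log-normal PDF whose parameters are the mean and standard deviation of `y`.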
The gray histogram marked "Simulation" in Figure 14 shows the outcome of an MCS comprising [...] as summarized in Table 4. Thus, the MCS estimate was considerably closer to the true value $N_c = 1111$ than the mean estimate of 982 by the crowd. The histogram obtained from the MCS with normal basis variables is nearly identical to that in Figure 14, and therefore not shown. The match with the corresponding theoretical log-normal PDF is marginally less close, but the higher [...] The most significant statistical outcome, however, is that the MCS predicted the number of coins in the tumbler much more closely than did the actual crowd.
Results of the experiment and simulations are summarized in detail in Table 4.

Commentary on the Experiment and Simulations
The coin estimation study raises several issues worth clarifying if the investigation is to provide a useful general methodology for seeking solutions to other quantitative problems by crowdsourcing.
1) Although sample size matters, the MCS did much better than the BBC crowd in estimating the number of coins in the tumbler not primarily because of sample size. The populations sampled by crowdsourcing and by MCS were different not only in size but principally in their effective information content. This was seen by running the MCS with the same parameters (44) as before, but for a sample size comparable to that of the coin experiment, i.e. ~2000. The result was a 24-bin histogram that produced a sample mean of ~1048 and a shape that effectively overlapped the MCS histogram of Figure 14 [...]
3) It is especially noteworthy that the MCS estimates Z, defined in Equation (69), resulted in a virtually perfect log-normal distribution, as shown by Figure 14. This outcome suggests that the validity of the log-normal hypothesis of composite variables applies beyond what was explicitly demonstrated in the analysis of Section 2. In contrast to a composite RV like (26) which is formed by products of independent basis RVs, the products forming the variable Z in Equation (69) are not all independent. In particular, the product $R_1 R_2$ is correlated with both $R_1^2$ and $R_2^2$. In the case of two correlated variables, call them U and V, one cannot assume, as was done in the last step of Equation (8), [...] can range between −1 and +1. At the upper limit +1, V varies in the same direction and in perfect linearity with U; at the lower limit −1, V varies in the opposite direction in perfect linearity with U. If two random variables are independent, then $\rho_{U,V} = 0$, but the converse is not true; $\rho_{U,V} = 0$ does not prove that U and V are independent. Various interpretations have been given to [...] [36]. Perhaps the most useful quantitative interpretation is this [37]: The square of the correlation coefficient is equal to the fraction of the variance of variable V that is accounted for by a linear relationship with variable U. Other, more general, methods of testing for nonlinear dependence of two random variables are also known [38] [39].
To estimate the degree of correlation of terms in Equation (69) [...] for normal (N) and log-normal ($\Lambda$) radius variables, respectively. The analysis is given in Appendix 2.
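Both points made above can be illustrated by simulation rather than by the analysis of Appendix 2. The sketch assumes independent normal radius variables with hypothetical parameters: it first shows a pair of variables that are fully dependent yet nearly uncorrelated, and then estimates the sample correlation between the cross term R1 R2 and each of the squared terms.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

# Zero correlation does not imply independence: V = U**2 is a function of U.
u = rng.normal(0.0, 1.0, n)
v = u**2
rho_uv = np.corrcoef(u, v)[0, 1]

# Hypothetical independent radius variables (illustrative parameters only).
r1 = rng.normal(3.0, 0.3, n)
r2 = rng.normal(4.0, 0.4, n)
rho_cross_r1 = np.corrcoef(r1 * r2, r1**2)[0, 1]   # R1*R2 versus R1^2
rho_cross_r2 = np.corrcoef(r1 * r2, r2**2)[0, 1]   # R1*R2 versus R2^2

print("corr(U, U^2):      ", rho_uv)
print("corr(R1 R2, R1^2): ", rho_cross_r1)
print("corr(R1 R2, R2^2): ", rho_cross_r2)
```

The squared values of the last two coefficients give, under the interpretation quoted above, the fraction of the variance of each squared term accounted for by a linear relationship with the cross term.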
The author is unaware of any closed-form expression for the PDF or CDF of a sum of correlated or uncorrelated log-normal RVs, although it is known that the resulting RV is not rigorously log-normal [40]. Various approaches exist to approximating the sum of log-normal RVs under special circumstances (such as independent identically distributed terms), or to achieve accuracy in selected parts of the distribution profile (e.g. the tails), or to match the lowest moments

Conclusions
This paper examined analytically, numerically, and experimentally the distribution of crowdsourced estimates of the solution to a problem seeking the number of objects in a partially revealed three-dimensional volume. Experimentally, the mean response of the crowd, which comprised approximately 2000 viewers of a BBC television show, was within ~12% of the true count. More significantly, the distribution of viewer responses was satisfactorily accounted for by a log-normal distribution.
Theoretical analyses of the product of independent random variables of low standard deviation-to-mean ratios showed that the product was distributed log-normally to an excellent approximation irrespective of the number of factors and their individual distributions. Monte Carlo tests of the theory were made with normal, uniform, Laplace, and log-normal factor variables, all of the same mean and variance, but differing widely in the shape statistics skewness and kurtosis. For independent factors of the log-normal type, the product was rigorously (not approximately) log-normal.
Monte Carlo simulations of the coin estimation experiment, employing basis variables of either the normal or log-normal type and a sample size of 1 million, resulted in mean estimates that were within ~5% of the true count. Particularly noteworthy is the fact that the sought-for composite variable comprised terms that were not independent, but linearly correlated. Nevertheless, the histogram of the product variable was, to all visual appearances, rigorously log-normal.
for Case 2.
The correlation coefficients are virtually the same for the normal and log-normal bases, as one might have anticipated from the close match of the individual distribution functions displayed in Figure 16.