
A composite random variable is a product (or sum of products) of statistically distributed quantities. Such a variable can represent the solution to a multi-factor quantitative problem submitted to a large, diverse, independent, anonymous group of non-expert respondents (the “crowd”). The objective of this research is to examine the statistical distribution of solutions from a large crowd to a quantitative problem involving image analysis and object counting. Theoretical analysis by the author, covering a range of conditions and types of factor variables, predicts that composite random variables are distributed log-normally to an excellent approximation. If the factors in a problem are themselves distributed log-normally, then their product is rigorously log-normal. A crowdsourcing experiment devised by the author and implemented with the assistance of a BBC (British Broadcasting Corporation) television show yielded a sample of approximately 2000 responses consistent with a log-normal distribution. The sample mean was within ~12% of the true count. However, a Monte Carlo simulation (MCS) of the experiment, employing either normal or log-normal random variables as factors to model the processes by which a crowd of 1 million might arrive at their estimates, resulted in a visually perfect log-normal distribution with a mean response within ~5% of the true count. The results of this research suggest that a well-modeled MCS, by simulating a sample of responses from a large, rational, and incentivized crowd, can provide a more accurate solution to a quantitative problem than might be attainable by direct sampling of a smaller crowd or an uninformed crowd, irrespective of size, that guesses randomly.

The global reach of telecommunications media, including radio, television, and in particular the social media sites of the internet, makes possible an ease and scale of statistical sampling hitherto inconceivable. Through use of these media, almost any question can, at least in principle, be posed to a large, anonymous, diverse, independent population of respondents, referred to in both technical and non-technical literature as the “crowd” [

1) the most useful characteristic of a crowdsourced sample is its distribution function and not just a single statistic,

2) under conditions to be specified, a product of RVs is distributed log-normally to an excellent approximation, irrespective of the type, number, or correlation of the factor RVs,

3) computer simulation methods can model the response of a hypothetical rational crowd orders of magnitude larger than what actually might be practically attainable.

To the author’s knowledge, the first quantitative experiment in what today would be considered crowdsourcing was published by the English polymath and statistical innovator Sir Francis Galton in 1907 [

The idea underlying crowdsourcing—a term introduced in 2006—is that a large group of non-experts can collectively arrive at a more accurate estimate of some physical quantity or at a better decision regarding some policy, strategy, or treatment than a small group of experts [

This paper addresses a different aspect of crowdsourcing closer in nature to the kind of experiment first performed by Galton. Questions whose responses can be represented numerically are especially suitable for statistical analysis. In this regard, the most useful statistical information to obtain from a crowdsourced sample is its distribution—i.e. the probability function for a discrete random variable (RV), or probability density function (PDF) for a continuous RV, or cumulative distribution function (CDF) for either kind of RV. For simplicity of discussion, the designation PDF will apply here to both discrete and continuous RVs. The importance of knowing the PDF or CDF of a distribution is that one can calculate from it, either theoretically or numerically, the exact population moments, which, depending on the size of an actual sample, can be significantly different from the sample moments. The population moments are estimates of the statistics that would result from a hypothetical infinitely large population of independent respondents. A virtually infinite sample size is what the internet and mass media have the potential to provide; it is also what computer-based Monte Carlo simulation (MCS) methods are already able to provide.

Throughout the past two decades, the author has conducted an array of experiments with students in his physics courses to investigate the validity of the crowdsourcing hypothesis [

This paper reports a comprehensive study of the distribution of responses to a class of questions that calls for estimation of a composite random variable. A composite RV is formed by the product of two or more basis RVs. (The term “composite” is adopted from the designation of a “composite number” [

An archetypical example of this class might be a question like the following: How many objects are contained within some partially disclosed geometric region? There are countless contexts in which such a question might arise and for which turning to a crowd for the answer may be good strategy. For example, high-energy physicists may enlist a crowd to count events recorded in a complex bubble-chamber image; astronomers may enlist a crowd to search a deep-space image for some extraordinary astrophysical event or object; intelligence services may enlist a crowd to search reconnaissance images for locations or objects of military interest; archaeologists may enlist a crowd to search satellite images for structures associated with some cultural sites, and so on [

The specific problem examined in this paper is mathematically simple, but statistically informative: How many identical opaque objects are contained within a certain 3-dimensional volume of space seen only as a 2-dimensional image? The problem involves image analysis and object counting. A reasonable procedure to answer that question might entail the following: 1) Depending on the shape of the region, multiply together the appropriate geometric factors to obtain the volume, and then 2) multiply that volume by the numerical density, i.e. the number of objects in a unit volume. However, none of the needed numbers is known; all are representable by random variables whose realizations (i.e. estimates) by respondents in the crowd would be different. The sought-for RV would, in general, be a product (or sum of products) of 3 RVs relating to geometry and 1 RV characterizing the numerical density—or in all a product (or sum of products) of 4 RVs. The analyst is then faced with three general questions:

1) How are the basis RVs distributed?

2) What will be the distribution of the composite RV?

3) Which statistic of the composite RV should be taken to represent the physical value of the sought-for quantity?

By examining this archetypical question a) theoretically, b) computationally by Monte Carlo simulation, and c) experimentally, this paper addresses the preceding three questions.

The remainder of this paper is organized in the following way:

Section 2 investigates analytically the distribution of a composite random variable comprising independent basis RVs. Of particular interest are the cases in which the basis is either normally or log-normally distributed.

Section 3 investigates numerically by MCS the distributions of a composite variable comprising basis RVs whose distributions differ widely in shape parameters (skewness, kurtosis) for fixed location and scale parameters (mean, variance).

Section 4 reports 1) an experiment, implemented with the collaboration of a British national television show, to employ crowdsourcing as a means to estimate the number of opaque objects in a transparent receptacle, and 2) the use of MCS to predict the statistical results for a hypothetical much larger crowd incentivized to estimate rationally rather than guess randomly.

Section 5 concludes the paper with a summary of principal findings.

For the reader’s convenience, the statistical abbreviations used in the paper are listed below in alphabetical order.

BBC = British Broadcasting Corporation

CDF = cumulative distribution function

CF = characteristic function

CLT = central limit theorem

MCS = Monte Carlo simulation(s)

MGF = moment generating function

PDF = probability density function

RNG = random number generator

RV = random variable

Consider a random variable Z defined by the product

$$Z = \prod_{i=1}^{N} X_i(\mu_i, \sigma_i) \tag{1}$$

where each basis variable $X_i(\mu_i, \sigma_i)$ in Equation (1) is characterized by its mean $\mu_i$ and standard deviation $\sigma_i$. At this point, the symbol X represents an arbitrary RV, and the parameters $(\mu_i, \sigma_i)$ for defining $X_i$ were chosen to simplify the notation and analysis in sections to follow. Conventional statistical labeling of specific RVs that are relevant to this paper may include parameters different from the mean and standard deviation, as summarized in

$$H(x) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases} \tag{2}$$

(There are different definitions of H ( x ) depending on the value assigned to H ( 0 ) [

| Distribution of RV X | Symbolic Representation | Significance of Parameters | PDF $p_X(x)$ |
|---|---|---|---|
| normal or Gaussian | $N(\mu, \sigma^2)$ | $\mu$ = mean of $X$; $\sigma$ = standard deviation of $X$ | $\dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right)$ |
| log-normal | $\Lambda(m, s^2)$ | $m$ = mean of $Y = \ln(X)$; $s$ = standard deviation of $Y$ | $\dfrac{1}{\sqrt{2\pi}\,s\,x}\exp\!\left(-\dfrac{(\ln(x)-m)^2}{2s^2}\right)$ |
| uniform | $U(a, b)$ | $a$ = lower boundary; $b$ = upper boundary | $\dfrac{1}{b-a}\left[H(x-a) - H(x-b)\right]$ |
| Laplace | $La(\mu, \beta)$ | $\mu$ = location parameter; $\beta$ = scale parameter | $\dfrac{1}{2\beta}\exp\!\left(-\dfrac{|x-\mu|}{\beta}\right)$ |

The natural logarithm of Z, which is a more convenient RV to work with, takes the form

$$Y = \ln(Z) = \sum_{i=1}^{N}\ln(X_i). \tag{3}$$

Reciprocally, one can write

$$Z = \exp(Y). \tag{4}$$

The strategy of the analysis in this section is to calculate the moment-generating function (MGF) of Y defined by the expectation operation

$$g_Y(t) \equiv \langle \exp(Yt) \rangle = \int e^{yt}\, p_Y(y)\,dy \tag{5}$$

in which $p_Y(y)$ is the PDF of Y, and t is a dummy variable, differentiation with respect to which generates the statistical moments of order $k = 0, 1, 2, \cdots$ in the following way:

$$\langle Y^k \rangle = \left[\frac{d^k g_Y(t)}{dt^k}\right]_{t=0}. \tag{6}$$
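The moment-generating relation (6) is easy to check numerically. The sketch below (Python) differentiates the Gaussian-form MGF of Eq. (16) by central finite differences at $t = 0$ and recovers the first two moments; the parameter values $\mu_Y = 5.48$ and $\sigma_Y = 0.33$ are purely illustrative, not prescribed by the analysis at this point.

```python
import math

# Normal-form MGF, Eq. (16): g_Y(t) = exp(mu*t + 0.5*sigma^2*t^2)
mu, sigma = 5.48, 0.33   # illustrative values only

def g(t: float) -> float:
    return math.exp(mu * t + 0.5 * sigma**2 * t**2)

h = 1e-4
# Eq. (6), k = 1: first derivative at t = 0 gives <Y> = mu
m1 = (g(h) - g(-h)) / (2 * h)
# Eq. (6), k = 2: second derivative at t = 0 gives <Y^2> = mu^2 + sigma^2
m2 = (g(h) - 2 * g(0.0) + g(-h)) / h**2

assert abs(m1 - mu) < 1e-4
assert abs(m2 - (mu**2 + sigma**2)) < 1e-2
```

The same finite-difference check works for any MGF that exists in a neighborhood of $t = 0$.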

If the MGF of a random variable does not exist, one can always use the characteristic function (CF) defined by

$$h_Y(t) \equiv \langle \exp(iYt) \rangle = \int e^{iyt}\, p_Y(y)\,dy \tag{7}$$

where Equation (7) is recognized as the Fourier transform of p Y ( y ) [

Substitution of Equation (3) into Equation (5) leads to

$$g_Y(t) = \left\langle \exp\!\left(t\sum_{i=1}^{N}\ln(X_i)\right)\right\rangle = \left\langle \prod_{i=1}^{N} X_i^{\,t} \right\rangle = \prod_{i=1}^{N}\left\langle X_i^{\,t}\right\rangle \tag{8}$$

in which the last step—expectation of product equals product of expectations—is justified if the basis RVs are independent, as assumed to be the case in this section. This point will be revisited in Section 4.

From the form of Equation (1), a further condition of the analysis is that the basis RVs have well-defined means and variances. This is the same requirement as for the Central Limit Theorem (CLT) (see [

$$X_i = \mu_i\left(1 + \frac{X_i - \mu_i}{\mu_i}\right) \equiv \mu_i(1 + \beta_i), \tag{9}$$

which defines the variable β i , and substitute Equation (9) into Equation (8) to obtain

$$g_Y(t) = \prod_{i=1}^{N}\mu_i^{\,t}\,\left\langle (1+\beta_i)^t \right\rangle. \tag{10}$$

If the basis variables $X_i$ are to describe reasoned estimates rather than unrestricted random guesses, then it can be assumed that representative values of $\beta_i$ are less than 1—i.e. that the expectations $\langle (X_i - \mu_i)^k \rangle$ are small compared to $\mu_i^k$ for integer $k \ge 1$.

Expansion of the binomial factor in Equation (10) to order O ( β i 3 ) , followed by insertion of the expectation values

$$\langle \beta_i \rangle = 0, \qquad \langle \beta_i^2 \rangle = (\sigma_i/\mu_i)^2 \ \ \text{where}\ \ \sigma_i^2 = \langle (X_i - \mu_i)^2 \rangle, \qquad \langle \beta_i^3 \rangle = (\lambda_i/\mu_i)^3 \ \ \text{where}\ \ \lambda_i^3 = \langle (X_i - \mu_i)^3 \rangle \tag{11}$$

leads to the approximate MGF

$$g_Y(t) \approx \exp\!\left(\mu_Y t + \tfrac{1}{2}\sigma_Y^2 t^2 + \tfrac{1}{6}\lambda_Y^3 t^3\right) \tag{12}$$

where

$$\mu_Y = \sum_{i=1}^{N}\left(\ln(\mu_i) - \tfrac{1}{2}(\sigma_i/\mu_i)^2 + \tfrac{1}{3}(\lambda_i/\mu_i)^3\right) \tag{13}$$

$$\sigma_Y^2 = \sum_{i=1}^{N}\left((\sigma_i/\mu_i)^2 - (\lambda_i/\mu_i)^3\right) \tag{14}$$

$$\lambda_Y^3 = \sum_{i=1}^{N}(\lambda_i/\mu_i)^3 \tag{15}$$

respectively define the mean, variance, and skewness parameter λ Y of Y. Under the conditions assumed in the foregoing analysis, MGF (12) shows that the distributions of Y, and therefore also Z, are not symmetric about the mean.

The author has been unable to find any source that identifies MGF (12) with a named distribution. However, upon neglect of skewness, Equation (12) takes the form

$$g_Y(t) \approx \exp\!\left(\mu_Y t + \tfrac{1}{2}\sigma_Y^2 t^2\right) \tag{16}$$

of the MGF of a normal RV [

For comparison, the table below lists, for four distributions X sharing the same mean and variance, the probability

$$\Delta_X \equiv \int_{\mu_X - \sigma_X}^{\mu_X + \sigma_X} p_X(x)\,dx \tag{17}$$

that a sample of the variable X falls within one standard deviation of its mean.

| Distribution $X(a,b)$ | Mean ($\mu_X = 5$) | Variance ($\sigma_X^2 = 1$) | Probability $\Delta_X = P(|x-\mu_X| \le \sigma_X)$ |
|---|---|---|---|
| Normal $N(\mu_X, \sigma_X^2)$ | $\mu_X$ | $\sigma_X^2$ | $\operatorname{erf}(2^{-1/2}) \simeq 0.6827$ |
| Log-Normal $\Lambda(m, s^2)$: $m = 1.5898$, $s = 0.1980$ | $\exp(m + \tfrac{1}{2}s^2)$ | $e^{2m}\left[e^{2s^2} - e^{s^2}\right]$ | $0.6940$ |
| Uniform $U(a,b)$: $a = 3.2679$, $b = 6.7321$ | $\tfrac{1}{2}(a+b)$ | $\tfrac{1}{12}(b-a)^2$ | $1/\sqrt{3} \simeq 0.5774$ |
| Laplace $La(\mu, \beta)$: $\mu = 5$, $\beta = 2^{-1/2} \simeq 0.7071$ | $\mu$ | $2\beta^2$ | $1 - e^{-\sqrt{2}} \simeq 0.7569$ |

Note that for variables N, U, and La the probability Δ X that a sample falls within ± 1 standard deviation of the mean is a constant dependent on the type of distribution, but independent of the parameters of the distribution. For the log-normal variable, however, Δ Λ has a complicated dependence on μ Λ and σ Λ

$$\Delta_\Lambda = \frac{1}{2}\left[\operatorname{erf}\!\left(\frac{\ln(\mu_\Lambda + \sigma_\Lambda) + \tfrac{1}{2}\ln(\mu_\Lambda^2 + \sigma_\Lambda^2) - 2\ln(\mu_\Lambda)}{\sqrt{2\ln(\mu_\Lambda^2 + \sigma_\Lambda^2) - 4\ln(\mu_\Lambda)}}\right) - \operatorname{erf}\!\left(\frac{\ln(\mu_\Lambda - \sigma_\Lambda) + \tfrac{1}{2}\ln(\mu_\Lambda^2 + \sigma_\Lambda^2) - 2\ln(\mu_\Lambda)}{\sqrt{2\ln(\mu_\Lambda^2 + \sigma_\Lambda^2) - 4\ln(\mu_\Lambda)}}\right)\right] \tag{18}$$

where the error function is defined by

$$\operatorname{erf}(x) \equiv \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt. \tag{19}$$
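The probabilities in the table above can be reproduced with the standard library's `math.erf`. For the log-normal entry, $\Delta_\Lambda$ is evaluated equivalently through the associated normal variable $Y = \ln(X)$, i.e. $\Delta_\Lambda = \tfrac{1}{2}[\operatorname{erf}((\ln(\mu+\sigma)-m)/(s\sqrt{2})) - \operatorname{erf}((\ln(\mu-\sigma)-m)/(s\sqrt{2}))]$, with $m$ and $s$ obtained from $(\mu, \sigma)$ as in Eq. (24). A sketch:

```python
import math

mu, sigma = 5.0, 1.0   # common mean and standard deviation, as in the table

# Normal: erf(2^(-1/2))
d_normal = math.erf(1 / math.sqrt(2))
# Uniform: 1/sqrt(3)
d_uniform = 1 / math.sqrt(3)
# Laplace: 1 - exp(-sqrt(2))
d_laplace = 1 - math.exp(-math.sqrt(2))

# Log-normal via the associated normal variable, parameters from Eq. (24)
m = math.log(mu**2 / math.sqrt(mu**2 + sigma**2))    # 1.5898
s = math.sqrt(math.log((mu**2 + sigma**2) / mu**2))  # 0.1980
d_lognormal = 0.5 * (math.erf((math.log(mu + sigma) - m) / (s * math.sqrt(2)))
                     - math.erf((math.log(mu - sigma) - m) / (s * math.sqrt(2))))

assert round(d_normal, 4) == 0.6827
assert round(d_uniform, 4) == 0.5774
assert round(d_laplace, 4) == 0.7569
assert round(d_lognormal, 3) == 0.694
```

Only `math.erf` and elementary functions are needed; no numerical integration of Eq. (17) is required for these four cases.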

The first column of

In the analyses and experiments of this paper, it will be adequate to neglect the skewness of Y and adopt MGF (16), which identifies Y as a normal RV. In that case, it follows that Z takes the form

$$Z = e^{Y} = e^{\mu_Y + \sigma_Y W} \tag{20}$$

in which W ≡ N ( 0 , 1 ) is a standard normal RV. The justification of Equation (20) is that an arbitrary normal RV N ( μ , σ 2 ) can be written in the form [

$$N(\mu, \sigma^2) = \mu + \sigma W. \tag{21}$$

Equation (20) leads directly by integration to the expectation values of Z

$$\langle Z^k \rangle = \left\langle e^{k\mu_Y + k\sigma_Y W} \right\rangle = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} \exp(k\mu_Y + k\sigma_Y w)\, e^{-w^2/2}\,dw = \exp\!\left(k\mu_Y + \tfrac{1}{2}k^2\sigma_Y^2\right) \tag{22}$$

where the PDF of W is given in

From Equation (22), the mean and variance of the log-normal RV are then

$$\mu_Z = \exp\!\left(\mu_Y + \tfrac{1}{2}\sigma_Y^2\right), \qquad \sigma_Z^2 = \exp(2\mu_Y)\left(\exp(2\sigma_Y^2) - \exp(\sigma_Y^2)\right) \tag{23}$$

and the inverse relations, which will be needed later, can be shown to be

$$\mu_Y = \ln\!\left(\mu_Z^2\Big/\sqrt{\mu_Z^2 + \sigma_Z^2}\right), \qquad \sigma_Y^2 = \ln\!\left((\mu_Z^2 + \sigma_Z^2)/\mu_Z^2\right) \tag{24}$$

Although the RV Y is distributed symmetrically about its mean, the distribution of Z itself is skewed. From Equation (22) the third moment about the mean, to which skewness is proportional, can be shown to be

$$\langle (Z - \mu_Z)^3 \rangle = \langle Z^3 \rangle - 3\langle Z^2 \rangle \mu_Z + 2\mu_Z^3 = e^{3\mu_Y}\left(e^{\frac{9}{2}\sigma_Y^2} - 3e^{\frac{5}{2}\sigma_Y^2} + 2e^{\frac{3}{2}\sigma_Y^2}\right) \tag{25}$$

It is useful to note that Equation (20) provides an even more direct way than integration of the PDF of arriving at Equation (22) for the moments of Z, since $\langle Z^k \rangle = \langle e^{Yk} \rangle$ takes the form of the MGF (16) of a normal RV upon replacing the dummy variable t with the moment order k.
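The chain (22) → (23) → (24) can be verified numerically. The sketch below uses arbitrary illustrative parameters ($\mu_Y = 1.2$, $\sigma_Y = 0.4$, chosen only for the check): the Gaussian integral in Eq. (22) is evaluated by simple quadrature, and the inverse relations (24) recover the parameters of Y from the moments of Z.

```python
import math

mu_Y, sigma_Y = 1.2, 0.4   # illustrative parameters of the normal RV Y

def moment(k: int) -> float:
    """<Z^k> from Eq. (22): exp(k*mu_Y + 0.5*k^2*sigma_Y^2)."""
    return math.exp(k * mu_Y + 0.5 * k**2 * sigma_Y**2)

# Check the Gaussian integral in Eq. (22) for k = 1 by quadrature
n, lo, hi = 4000, -10.0, 10.0
dw = (hi - lo) / n
integral = sum(math.exp(mu_Y + sigma_Y * (lo + j * dw) - 0.5 * (lo + j * dw)**2)
               for j in range(n + 1)) * dw / math.sqrt(2 * math.pi)
assert math.isclose(integral, moment(1), rel_tol=1e-6)

# Mean and variance of Z, Eq. (23)
mu_Z = moment(1)
var_Z = math.exp(2 * mu_Y) * (math.exp(2 * sigma_Y**2) - math.exp(sigma_Y**2))
assert math.isclose(var_Z, moment(2) - moment(1)**2, rel_tol=1e-9)

# Inverse relations, Eq. (24), recover (mu_Y, sigma_Y) from (mu_Z, var_Z)
mu_Y_back = math.log(mu_Z**2 / math.sqrt(mu_Z**2 + var_Z))
sigma_Y_back = math.sqrt(math.log((mu_Z**2 + var_Z) / mu_Z**2))
assert math.isclose(mu_Y_back, mu_Y, rel_tol=1e-9)
assert math.isclose(sigma_Y_back, sigma_Y, rel_tol=1e-9)
```

The round trip (23) followed by (24) is algebraically exact, so the final assertions hold to machine precision.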

The seminal findings of this section may be summarized as follows:

1) A random variable Z composed of the product of 2 or more factor RVs for which the ratio of standard deviation to mean is <1 is distributed log-normally to the extent that the skewness (and higher order moments) of ln ( Z ) can be neglected.

2) To find the parameters of the distribution of a log-normal RV Z, one first transforms the data (e.g. sample or simulation) by y i = ln ( z i ) to obtain the distribution of the associated normal RV Y which is symmetric about its mean.

In concluding this section, a point of comparison is in order regarding the CLT for the sum of independent RVs and relation (20) for the product of independent RVs. In brief, the CLT holds that the sum (e.g. mean) of a sufficiently large number N of identically distributed, independent RVs X i ( μ X , σ X ) , i = 1 , ⋯ , N converges to a normal RV irrespective of the distribution of X, provided that the X i have a well-defined mean and variance [

$$Z = \prod_{i=1}^{N} X_i(\mu_i, \sigma_i) \ \rightarrow\ \Lambda\!\left(\sum_{i=1}^{N}\ln(\mu_i),\ \sum_{i=1}^{N}(\sigma_i/\mu_i)^2\right) \tag{26}$$

holds for any number of factors N ≥ 2 under the previously specified conditions. Moreover, the individual independent factors X i ( μ i , σ i ) need not have identical distribution parameters, nor even all be the same type of variable X. The parameters of Λ shown in Equation (26) are from Equations (13), (14), (15) with neglect of the skewness parameter and terms of order ( σ i / μ i ) 2 in the mean. This reduction has been found satisfactory in accounting for the Monte Carlo simulations and experimental results discussed in later sections.
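A minimal Monte Carlo check of the convergence (26) can be sketched as follows, assuming normal factors with the parameter values that appear later in Eq. (44); the sample mean and standard deviation of $\ln(Z)$ should land near the values predicted by Eqs. (13)-(14) with the skewness terms neglected ($\approx 5.43$ and $\approx 0.33$).

```python
import math
import random

random.seed(12345)
params = [(1.0, 0.2), (4.0, 0.5), (6.0, 1.0), (10.0, 1.5)]  # Eq. (44)

n = 100_000
ys = []
for _ in range(n):
    z = 1.0
    for mu, sigma in params:
        z *= random.gauss(mu, sigma)
    if z > 0:                      # guard against the very rare negative draw
        ys.append(math.log(z))

m_Y = sum(ys) / len(ys)
s_Y = math.sqrt(sum((y - m_Y)**2 for y in ys) / len(ys))

# Theory, Eqs. (13)-(14) with skewness terms set to zero
mu_Y = sum(math.log(mu) - 0.5 * (sigma / mu)**2 for mu, sigma in params)  # ~5.43
sigma_Y = math.sqrt(sum((sigma / mu)**2 for mu, sigma in params))         # ~0.33

assert abs(m_Y - mu_Y) < 0.02
assert abs(s_Y - sigma_Y) < 0.03
```

Swapping `random.gauss` for a uniform or Laplace generator with the same $(\mu_i, \sigma_i)$ leaves the predicted parameters of $\ln(Z)$ unchanged, which is the content of the prediction highlighted in Section 3.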

The log-normal distribution of a composite RV derived in the previous section is an approximate relation valid to the extent that certain conditions are fulfilled. In the special case where the factors X i of the product (26) defining Z are normal RVs, an alternative expression for the PDF of Y = ln ( Z ) can be derived by means of the CF. This is an important case because the normal distribution satisfactorily describes measurements or estimates of many biomedical variables, physical variables, and variables relating to business management and finance, among others [

From Equation (8) for the MGF of Y and the definition (7) for the CF, one can write

$$h_Y(t) = \prod_{j=1}^{N}\left\langle X_j^{\,it} \right\rangle = \prod_{j=1}^{N}\int x_j^{\,it}\, p_{X_j}(x_j)\,dx_j \equiv \int e^{iyt}\, p_Y(y)\,dy \tag{27}$$

for the CF of Y, where the summation index has been changed from i to j so as not to be confounded with the unit imaginary $i = \sqrt{-1}$. The inverse Fourier transform of Equation (27) then yields the PDF of Y

$$p_Y(y) = (2\pi)^{-1}\int_{-\infty}^{\infty} e^{-iyt}\, h_Y(t)\,dt = (2\pi)^{-1}\int_{-\infty}^{\infty} e^{-iyt}\left[\prod_{j=1}^{N}\int x_j^{\,it}\, p_{X_j}(x_j)\,dx_j\right] dt \tag{28}$$

in which the second equality of Equation (27) was substituted for h Y ( t ) in the first line of Equation (28). The PDF of Z is calculable from the PDF of Y by the following transformation (see Appendix 1):

$$p_Z(z) = z^{-1}\, p_Y(\ln(z)). \tag{29}$$

Substitution of relation (21) for each normal factor X j into (27) leads to

$$h_Y(t) = \prod_{j=1}^{N}(2\pi)^{-\frac{1}{2}}\int_{-\mu_j/\sigma_j}^{\infty}(\mu_j + \sigma_j x)^{it}\, e^{-x^2/2}\,dx, \tag{30}$$

which can be re-expressed in the form

$$h_Y(t) = \prod_{j=1}^{N}(2\pi)^{-\frac{1}{2}}\, e^{it\ln(\mu_j)}\int_{-\alpha_j^{-1}}^{\infty}\exp\!\left(it\ln(1 + \alpha_j x) - x^2/2\right) dx \tag{31}$$

where

$$\alpha_j = \sigma_j/\mu_j. \tag{32}$$

Equation (31) is an exact expression for the CF of Y, but, to the author’s knowledge, cannot be integrated in closed form. However, for α j < 1 , expansion of the logarithm in a Taylor series to order α j 2 results in the closed form expression

$$h_Y(t) = \prod_{j=1}^{N}\left[\mu_j^{\,it}\,\frac{\exp\!\left(-\tfrac{1}{2}\alpha_j^2 t^2\big/(1 + i\alpha_j^2 t)\right)}{\sqrt{1 + i\alpha_j^2 t}}\right]. \tag{33}$$

Substitution of CF (33) into Equations (28) and (29) provides a more accurate PDF of Y and Z than the PDF of log-normal (26).

If $\alpha_j^2 \ll 1$ for each factor $X_j$ in Equation (27), then one can approximate $h_Y(t)$ in Equation (33) by

$$h_Y(t) \simeq \prod_{j=1}^{N}\mu_j^{\,it}\exp\!\left(-\tfrac{1}{2}\alpha_j^2 t^2\right) \tag{34}$$

which, substituted into the integral in Equation (28), leads to the Gaussian distribution

$$Y = \ln\!\left(\prod_{i=1}^{N} N_i(\mu_i, \sigma_i^2)\right) \rightarrow N\!\left(\sum_{i=1}^{N}\ln(\mu_i),\ \sum_{i=1}^{N}(\sigma_i/\mu_i)^2\right) \tag{35}$$

for Y and the log-normal distribution (26) for Z.
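The accuracy of the reduction from (33) to (34) can be gauged directly by evaluating one factor of each expression with `cmath`; the values $\mu = 5$, $\alpha = 0.2$, $t = 1$ below are illustrative. The two complex factors agree in magnitude to a few parts in $10^4$, with only a small residual phase shift.

```python
import cmath

# One CF factor, illustrative values: mu = 5, alpha = sigma/mu = 0.2, t = 1
mu, alpha, t = 5.0, 0.2, 1.0

# Factor of the closed-form CF, Eq. (33)
h_exact = (mu**(1j * t)
           * cmath.exp(-0.5 * alpha**2 * t**2 / (1 + 1j * alpha**2 * t))
           / cmath.sqrt(1 + 1j * alpha**2 * t))

# Factor of the Gaussian approximation, Eq. (34)
h_approx = mu**(1j * t) * cmath.exp(-0.5 * alpha**2 * t**2)

assert abs(abs(h_exact) - abs(h_approx)) < 0.002   # magnitudes nearly equal
assert abs(h_exact - h_approx) < 0.05              # small residual phase shift
```

Since the discrepancy scales with $\alpha_j^2 t$, the approximation degrades for large $|t|$, but large $|t|$ contributes little to the inverse transform (28) because both CFs decay like $\exp(-\tfrac{1}{2}\alpha_j^2 t^2)$.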

As an example to illustrate the stages of the analysis, consider the composite RV

$$Z = \prod_{i=1}^{4} X_i(\mu_i, \sigma_i) = N_1(1.0, (0.2)^2)\, N_2(4.0, (0.5)^2)\, N_3(6.0, (1.0)^2)\, N_4(10.0, (1.5)^2) \tag{36}$$

and associated log-product Y = ln ( Z ) . In

The ubiquity of the normal distribution is primarily a consequence of the CLT, which is a limiting theorem for the sum of a large (in theory, infinite) number of random variables. Moreover, the distributed variable can take—or, as a matter of practicality, be thought to take—both positive and negative values, since the Gaussian PDF is normalized to unity only when integrated over the entire real axis. The log-normal distribution also occurs widely, particularly in reference to activities that involve counting, measuring, or observing the attributes of real physical things. Such activities underlie many kinds of problems for which crowdsourced solutions can be sought. The distributed variable then takes on only non-negative real values and is expected to be intrinsically skewed, since its least value cannot be below zero, whereas its upper limit is open.

Consider, therefore, a composite variable Z comprised of log-normal factors

$$Z = \prod_{i=1}^{N}\Lambda_i(m_i, s_i^2) \tag{37}$$

with PDF of the form (see [

$$p_Z(z\,|\,m,s) = \frac{1}{\sqrt{2\pi}\,s\,z}\exp\!\left(-(\ln(z) - m)^2/2s^2\right). \tag{38}$$

It then readily follows from the inverse of Equation (29) (see Appendix 1) that the PDF of the variable Y = ln ( Z ) has the form

$$p_Y(y\,|\,m,s) = \frac{1}{\sqrt{2\pi}\,s}\exp\!\left(-(y - m)^2/2s^2\right) \tag{39}$$

which shows that Y is a Gaussian RV of mean m and variance s 2 , i.e. Y = N ( m , s 2 ) .

Thus, taking the log of Equation (37) leads to the chain of relations

$$Y = \ln(Z) = \sum_{i=1}^{N}\ln\!\left(\Lambda_i(m_i, s_i^2)\right) = \sum_{i=1}^{N} N_i(m_i, s_i^2) = N\!\left(\sum_{i=1}^{N} m_i,\ \sum_{i=1}^{N} s_i^2\right) \tag{40}$$

from which it follows that Z, itself, is a log-normal RV

$$Z = \Lambda(m, s^2) \tag{41}$$

with

$$m = \sum_{i=1}^{N} m_i, \qquad s^2 = \sum_{i=1}^{N} s_i^2 \tag{42}$$

Stated formally: The product of log-normal RVs is a log-normal RV with parameters given by Equation (42). Note that the preceding result, Equation (41), is exact; no approximations regarding either the number of factor RVs or the relative magnitudes of parameters m i and s i have been made.

From Equation (23) the mean and variance of Z, defined by Equation (37), are then

$$\mu_Z = \exp\!\left(\sum_{i=1}^{N} m_i + \tfrac{1}{2}\sum_{i=1}^{N} s_i^2\right), \qquad \sigma_Z^2 = \exp\!\left(2\sum_{i=1}^{N} m_i\right)\left(\exp\!\left(2\sum_{i=1}^{N} s_i^2\right) - \exp\!\left(\sum_{i=1}^{N} s_i^2\right)\right) \tag{43}$$
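The exactness of the closure (40)-(43) is easy to confirm by simulation with the standard library's `random.lognormvariate`; the two factor parameter pairs below are illustrative, not values taken from the text.

```python
import math
import random

random.seed(7)
factors = [(0.5, 0.3), (1.1, 0.4)]   # illustrative (m_i, s_i) pairs

n = 200_000
ys = []
for _ in range(n):
    z = 1.0
    for m_i, s_i in factors:
        z *= random.lognormvariate(m_i, s_i)
    ys.append(math.log(z))

m_Y = sum(ys) / n
s_Y = math.sqrt(sum((y - m_Y)**2 for y in ys) / n)

# Eq. (42): exact parameters of the log-normal product
m_sum = sum(m_i for m_i, s_i in factors)                # 1.6
s_sum = math.sqrt(sum(s_i**2 for m_i, s_i in factors))  # 0.5

assert abs(m_Y - m_sum) < 0.01
assert abs(s_Y - s_sum) < 0.01

# Eq. (43): mean of Z implied by the summed parameters
mean_Z = sum(math.exp(y) for y in ys) / n
assert abs(mean_Z - math.exp(m_sum + 0.5 * s_sum**2)) < 0.1
```

Unlike the approximate convergence (26), no tolerance beyond sampling noise is needed here: the product of log-normals is itself log-normal for any N and any parameter values.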

In this section the distribution of responses to the kind of archetypical problem posed at the end of Section 1.1 is examined numerically by means of Monte Carlo simulations (MCS) employing four basic types of two-parameter RVs $X_i(\mu_i, \sigma_i)$: 1) normal, 2) uniform, 3) Laplace, and 4) log-normal. The means $\mu_i$ and standard deviations $\sigma_i$ of the factor RVs are respectively those of the arguments of the four RVs in Equation (36):

$$(\mu_1, \sigma_1) = (1.0, 0.2), \quad (\mu_2, \sigma_2) = (4.0, 0.5), \quad (\mu_3, \sigma_3) = (6.0, 1.0), \quad (\mu_4, \sigma_4) = (10.0, 1.5) \tag{44}$$

The four types of RVs differ markedly, however, in skewness and kurtosis, which characterize the shape of the PDF, as shown in

Each of the four simulations of the composite variable Z reported in the subsections to follow comprises $n = 10^6$ independent samples from a random number generator (RNG) corresponding to one of the four basis RVs listed above. The simulated variates $\{x_{i,j}\}$ ($i = 1, 2, 3, 4$; $j = 1, \cdots, n$) are partitioned into uniform bins of width $\Delta x = 0.1$; the resulting variates $\{y_j\}$, $\{z_j\}$ are partitioned into uniform bins of width $\Delta y = 0.1$ and $\Delta z = 10.0$ (if $X = N, U, La$) or $\Delta z = 15.0$ (if $X = \Lambda$). To get a sense of scale, note that the product of the four means in Equation (44) is 240 and that $\ln(240) \approx 5.48$. It is to be expected, therefore, that, neglecting skewness, the histogram of Z should be centered at a point near 240, whereas the symmetric histogram of Y should be centered close to 5.48, which lies between the centers of histograms $X_2$ and $X_3$.

Superposed on each of the generated histograms in the figures to follow will be the relevant theoretical PDF (solid red): 1) the PDF of the corresponding RNG for the basis variables $\{X_i\}$, 2) the log-normal PDF (26) (if $X = N, La, U$) or (41) (if $X = \Lambda$) for Z, and 3) the normal PDF (35) (if $X = N, La, U$) or (40) (if $X = \Lambda$) for Y. The analysis of Section 2.1 leads to an important prediction concerning the four Monte Carlo simulations:

· Each simulation, although generated with a different type of basis variable X, should lead within statistical uncertainties to identical histograms for Z and Y.

The preceding prediction follows from the fact that the means and variances of Z and Y depend only on the means and variances (44) of the basis variables X i , and not on the type of RV symbolized by X.

From the ungrouped variates of each MCS

$$z_j = x_{1,j}\, x_{2,j}\, x_{3,j}\, x_{4,j} \tag{45}$$

$$y_j = \ln(x_{1,j}\, x_{2,j}\, x_{3,j}\, x_{4,j}), \tag{46}$$

one can calculate the sample mean and sample variance of Z by two different approaches, both employing relations deduced from the method of maximum likelihood (ML) [

$$\text{SAMPLE}\ Z: \qquad m_Z = \frac{1}{n}\sum_{j=1}^{n} z_j, \qquad s_Z^2 = \frac{1}{n}\sum_{j=1}^{n}\left(z_j - m_Z\right)^2 \tag{47}$$

The second approach is to calculate the sample mean ( m Y ) and sample variance ( s Y 2 ) from the set of Gaussian variates (46)

$$\text{SAMPLE}\ Y: \qquad m_Y = \frac{1}{n}\sum_{j=1}^{n} y_j, \qquad s_Y^2 = \frac{1}{n}\sum_{j=1}^{n}\left(y_j - m_Y\right)^2 \tag{48}$$

and use the statistics (48) to deduce the sample mean ($M_Z$) and sample variance ($S_Z^2$), as follows from Equation (23):

$$\text{SAMPLE}\ Z(Y): \qquad M_Z = \exp\!\left(m_Y + \tfrac{1}{2}s_Y^2\right), \qquad S_Z^2 = \exp(2m_Y)\left(\exp(2s_Y^2) - \exp(s_Y^2)\right) \tag{49}$$

Agreement of statistics (47) and (49) would be indicative that the variates of Z were distributed log-normally.
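The agreement test between statistics (47) and (49) can be sketched as follows, using synthetic log-normal variates (the parameters $m = 5.48$, $s = 0.33$ are chosen only to mimic the scale of the simulations).

```python
import math
import random

random.seed(3)
m, s, n = 5.48, 0.33, 100_000
zs = [random.lognormvariate(m, s) for _ in range(n)]

# Approach 1, Eq. (47): statistics directly from the variates z_j
m_Z = sum(zs) / n
s_Z = math.sqrt(sum((z - m_Z)**2 for z in zs) / n)

# Approach 2, Eqs. (48)-(49): statistics via the log-transformed variates
ys = [math.log(z) for z in zs]
m_Y = sum(ys) / n
s_Y2 = sum((y - m_Y)**2 for y in ys) / n
M_Z = math.exp(m_Y + 0.5 * s_Y2)
S_Z = math.sqrt(math.exp(2 * m_Y) * (math.exp(2 * s_Y2) - math.exp(s_Y2)))

# For log-normally distributed data the two approaches should agree closely
assert abs(M_Z - m_Z) / m_Z < 0.02
assert abs(S_Z - s_Z) / s_Z < 0.05
```

If the variates were drawn from a distribution that is not log-normal, the two approaches would in general disagree, which is precisely why their agreement serves as a diagnostic.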

Comparisons of sample statistics with theory for each of the simulations to follow are summarized in

The normal distribution is defined by its mean and variance (see

$$Sk_X(N) \equiv \left\langle \left((X - \mu_X)/\sigma_X\right)^3 \right\rangle = 0 \tag{50}$$

$$K_X(N) \equiv \left\langle \left((X - \mu_X)/\sigma_X\right)^4 \right\rangle = 3. \tag{51}$$

Skewness (50) is a measure of the symmetry of the PDF with respect to the mean. Kurtosis (51) is a measure of the shape of the tails of the PDF. A distribution with “fat tails” (leptokurtic) has a higher probability than normal of extreme events, in contrast to a distribution with “thin tails” (platykurtic), for which the probability of extreme events is lower than normal.

Basis parameters: $X_i(\mu_i, \sigma_i)$, $i = 1, 2, 3, 4$, with $(\mu_i, \sigma_i) = (1, 0.2), (4, 0.5), (6, 1.0), (10, 1.5)$; sample size $n = 1{,}000{,}000$; theory: $Z = \Lambda(\mu_Y, \sigma_Y^2)$.

| Basis Variables $X_i$ | Sample $Z$ | Sample $Y$ | Sample $Z(Y)$ | Theory $Y$ | Theory $Z$ |
|---|---|---|---|---|---|
| Normal | $m_Z = 240.01$, $s_Z = 79.54$ | $m_Y = 5.43$, $s_Y = 0.34$ | $M_Z = 240.48$, $S_Z = 83.91$ | $\mu_Y = 5.48$, $\sigma_Y = 0.33$ | $\mu_Z = 253.05$, $\sigma_Z = 84.58$ |
| Uniform | $m_Z = 240.02$, $s_Z = 79.53$ | $m_Y = 5.43$, $s_Y = 0.33$ | $M_Z = 240.23$, $S_Z = 82.11$ | — | — |
| Laplace | $m_Z = 240.01$, $s_Z = 79.48$ | $m_Y = 5.42$, $s_Y = 0.38$ | $M_Z = 243.01$, $S_Z = 96.58$ | — | — |
| Log-Normal | $m_Z = 239.97$, $s_Z = 79.50$ | $m_Y = 5.43$, $s_Y = 0.32$ | $M_Z = 239.97$, $S_Z = 79.49$ | $\mu_Y = 5.43$, $\sigma_Y = 0.32$ | $\mu_Z = 240.00$, $\sigma_Z = 79.60$ |

Panels A and B of

A uniform RV $X(\mu, \sigma) = U(a, b)$ is symbolized by its lower and upper boundaries ($b > a$). From

$$a = \mu - \sqrt{3}\,\sigma, \qquad b = \mu + \sqrt{3}\,\sigma \tag{52}$$

The basis RVs X i ( μ i , σ i ) , i = 1 , 2 , 3 , 4 of the simulation, which have the same means and variances as the basis RVs of Section 3.1, are then respectively

$$X_1 = U_1(0.6536, 1.3464), \quad X_2 = U_2(3.1340, 4.8660), \quad X_3 = U_3(4.2679, 7.7321), \quad X_4 = U_4(7.4019, 12.5981) \tag{53}$$

The skewness and kurtosis of a uniformly distributed RV are

$$Sk_X(U) = 0 \tag{54}$$

$$K_X(U) = 9/5 = 1.8. \tag{55}$$

A Laplace RV X ( μ , σ ) = L a ( μ , β ) is symbolized by a location parameter μ corresponding to the mean of X and a scale parameter β related to the standard deviation of X by

$$\beta = 2^{-\frac{1}{2}}\sigma \tag{56}$$

(see

$$X_1 = La_1(1.0, 0.1414), \quad X_2 = La_2(4.0, 0.3536), \quad X_3 = La_3(6.0, 0.7071), \quad X_4 = La_4(10.0, 1.0607) \tag{57}$$

The skewness and kurtosis of a Laplace distributed RV are

$$Sk_X(La) = 0 \tag{58}$$

$$K_X(La) = 6. \tag{59}$$

A log-normal RV X ( μ , σ ) = Λ ( m , s 2 ) is symbolized by the mean and variance of the normal variable Y = N ( m , s 2 ) = ln ( X ) . From Equation (24), re-expressed below for convenience,

$$m = \ln\!\left(\mu^2\Big/\sqrt{\mu^2 + \sigma^2}\right), \qquad s^2 = \ln\!\left((\mu^2 + \sigma^2)/\mu^2\right) \tag{60}$$

it follows that the four log-normal basis variables with properties (44) are respectively

$$X_1 = \Lambda_1(-0.0196, (0.1980)^2), \quad X_2 = \Lambda_2(1.3785, (0.1245)^2), \quad X_3 = \Lambda_3(1.7781, (0.1655)^2), \quad X_4 = \Lambda_4(2.2915, (0.1492)^2) \tag{61}$$

The skewness and kurtosis of a log-normal RV

$$Sk_X(\Lambda) = \left(\exp(s^2) + 2\right)\sqrt{\exp(s^2) - 1} \tag{62}$$

$$K_X(\Lambda) = \exp(4s^2) + 2\exp(3s^2) + 3\exp(2s^2) - 3 \tag{63}$$

are not constants, but depend on the scale parameter s. Skewness (62) is greater than 0 for all values of s > 0 ; kurtosis (63) is greater than 3 for all values of s > 0 .
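The parameter conversions (52), (56), and (60), which map a common $(\mu, \sigma)$ onto the uniform, Laplace, and log-normal bases, can be collected in a few lines; the spot checks reproduce the leading entries of Eqs. (53), (57), and (61).

```python
import math

params = [(1.0, 0.2), (4.0, 0.5), (6.0, 1.0), (10.0, 1.5)]  # Eq. (44)

for mu, sigma in params:
    a, b = mu - math.sqrt(3) * sigma, mu + math.sqrt(3) * sigma  # Eq. (52)
    beta = sigma / math.sqrt(2)                                  # Eq. (56)
    m = math.log(mu**2 / math.sqrt(mu**2 + sigma**2))            # Eq. (60)
    s = math.sqrt(math.log((mu**2 + sigma**2) / mu**2))
    print(f"mu={mu}: U({a:.4f}, {b:.4f}), La({mu}, {beta:.4f}), "
          f"Lambda({m:.4f}, {s:.4f})")

# Spot checks against the first entries of Eqs. (53), (57), (61)
assert round(1.0 - math.sqrt(3) * 0.2, 4) == 0.6536
assert round(0.2 / math.sqrt(2), 4) == 0.1414
assert round(math.log(1.0 / math.sqrt(1.04)), 4) == -0.0196
assert round(math.sqrt(math.log(1.04)), 4) == 0.198
```

Each conversion preserves the mean and variance exactly, which is the sole requirement for the prediction that all four simulations yield statistically identical histograms of Z and Y.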

The set of variates (45) comprises the response of a crowd to a problem for which the sought-for solution is a composite random variable Z. The information, or so-called “wisdom of the crowd” [

In comparing the histograms of Y and Z to the profiles of their respective PDFs, one should bear in mind that in general there is no underlying fundamental theory of crowd response. The log-normal model is not a fundamental theory such as one encounters in physics, and therefore the MCS histograms in Section 3 were not subjected to a chi-square goodness-of-fit test, as is often done in physics to compare experiment and theory.

The validity of the analytical model developed in this paper lies in how well it enables the analyst to predict an unknown quantity represented by the sampled variable Z, and not necessarily in how closely the complete distribution of the sample (i.e. histogram of Z) is matched by a log-normal distribution. However, if there is reason to believe that the basis variables X i comprising the composite variable Z are distributed log-normally, then Z itself should be rigorously log-normal, and a goodness-of-fit test may then be appropriate. This point will be illuminated further in Section 4, which reports a crowdsourcing experiment and MCS to estimate the number of identical objects in a receptacle.

The preceding comments notwithstanding, Figures 5-12 illustrate how well the predicted log-normal distribution fits the histograms of Z generated by basis variables of widely differing distribution shapes, as distinguished by their skewness and kurtosis. Simulations using normal or log-normal basis variables yielded the visually closest matches to the log-normal model. In the case of a log-normal basis, theory predicted, and MCS sustained, an exact log-normal distribution of Z.

In a collaborative effort with the BBC The One Show (almost exactly 100 years after Galton’s pioneering statistical experiment), the author was able to obtain, using the wide reach of national television, a crowdsourced sample sufficiently large to test the log-normal hypothesis, namely, that under appropriate conditions composite random variables are distributed log-normally. Two kinds of experiments were performed entailing crowdsourced estimates of 1) the weight of a tangible local object, and 2) the quantity of a remotely viewed object. (See Ref. [

The experiment analyzed in detail here is of the second kind. Viewers of The One Show were shown on their televisions a transparent tumbler filled with opaque £1 coins. The tumbler rested on a table adjacent to two ordinary cylindrical glasses of water to provide clues to scale. No explicit dimensions of any objects were given. The challenge posed to viewers (i.e. the crowd) was to estimate the number of coins in the vessel.

The experimental estimates $z_k^{(\exp)}$, $k = 1, \cdots, n_0$, were transmitted to the show by email, and the author subsequently received the full set of $n_0 = 1706$ anonymous responses, which ranged from a low of 42 to a high of 43,200.^{1} The mean and median of the estimates were respectively $\bar{Z}^{(\exp)} = 982$ and $\tilde{Z}^{(\exp)} = 695$. The true count was $N_c = 1111$. If the mean is taken as the measure of crowd response—a standard statistical practice—the fractional error of the crowd was

$$\Delta N_c^{(\exp)} = \left(\bar{Z}^{(\exp)} - N_c\right)\big/N_c = -11.6\%. \tag{64}$$

Although the result is not bad, it calls into question—at least to the author—how Galton’s crowd of just 800 members (less than half the BBC sample size) could guess the weight of an ox to within a fractional error of less than 0.1%. One explanation might be that the participants at the fair comprised a crowd of experts familiar with livestock. The respondents to The One Show apparently had no special expertise in the estimation of quantity.

^{1}Actually, the maximum value submitted was 25 million, which was about 15% of the entire BBC One network annual budget in the form of £1 coins in a small glass tumbler. The submission was rejected on the grounds that it was so preposterous as to be intended to undermine the experiment.

The sample statistics of the log-transformed estimates are

$m_0 = \frac{1}{n_0} \sum_{k=1}^{n_0} y_k^{(\exp)} = 6.565$ (65)

$s_0 = \sqrt{ \frac{1}{n_0} \sum_{k=1}^{n_0} \left( y_k^{(\exp)} - m_0 \right)^2 } = 0.719$ (66)

where the variates $y_k^{(\exp)}$ are defined by

$y_k^{(\exp)} = \ln\left( z_k^{(\exp)} \right).$ (67)

Parameters $m_0$ and $s_0$ in Equations (65) and (66) are respectively the mean and standard deviation of $Y^{(\exp)}$. The gray histogram with red border in
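The arithmetic of Equations (64)-(67) can be sketched in a few lines of Python. The list `estimates` below is a small hypothetical stand-in, since the actual BBC responses are not reproduced here:

```python
import math

# Hypothetical stand-in for the n0 = 1706 BBC responses (not the real data).
estimates = [150, 420, 695, 700, 980, 1200, 1500, 2300, 5000]
true_count = 1111  # N_c

# Equation (67): log-transform each estimate.
logs = [math.log(z) for z in estimates]

# Equations (65) and (66): mean and standard deviation of the logs.
n0 = len(logs)
m0 = sum(logs) / n0
s0 = math.sqrt(sum((y - m0) ** 2 for y in logs) / n0)

# Equation (64): fractional error of the sample mean relative to the true count.
z_bar = sum(estimates) / n0
frac_error = (z_bar - true_count) / true_count
```

Applied to the full set of 1706 responses, the same computation returns $m_0 = 6.565$, $s_0 = 0.719$, and a fractional error of −11.6%.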

Despite the caution about goodness-of-fit tests in Section 3.5, it is noteworthy that the fit of the log-normal PDF with parameters (65) and (66) to the histogram of experimental estimates actually does exceed the 5% acceptance threshold of a chi-square test for $\nu = 21$ degrees of freedom, yielding a p-value $P(\chi^2_{21}) = 6.4\%$. The number $\nu$ of degrees of freedom is given by

$\nu = K - 1 - p$ (68)

where K = 24 is the number of distribution categories (bins), p = 2 is the number of parameters ( m 0 , s 0 ) determined from the data, and the numeral 1 refers to the fact that the histogram is normalized to unit area, in which case knowledge of the values of K − 1 bins determines the value of the remaining bin.
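As a sketch, the chi-square statistic and the degrees of freedom of Equation (68) can be computed as follows; the bin counts here are invented for illustration and are not the actual histogram data:

```python
def chi_square_stat(observed, expected):
    """Pearson chi-square statistic summed over the K bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Invented bin counts for illustration (not the experimental histogram).
observed = [5, 20, 48, 90, 120, 110, 80, 45, 22, 10]
expected = [6, 22, 50, 85, 118, 112, 82, 47, 20, 8]

K = len(observed)   # number of bins
p = 2               # parameters (m0, s0) estimated from the data
nu = K - 1 - p      # Equation (68)

chi2 = chi_square_stat(observed, expected)
```

The p-value would then follow from the chi-square survival function, e.g. `scipy.stats.chi2.sf(chi2, nu)`.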

Passing a goodness-of-fit test does not necessarily prove that a hypothesized theory is correct. Rather, it signifies that the theory should not be rejected on the basis of the tested data. The statistical significance of the experiment described in Section 4.1 is that the distribution of estimates of the number of coins (a composite RV) is consistent with a log-normal distribution for the given sample. Nevertheless, the implication of this result is of far-reaching practical importance:

If it is indeed the case that the estimates from a crowd of given size are distributed log-normally, then one should be able to simulate the estimates of a much larger crowd by constructing the appropriate basis variables that form the factors of the sought-for composite variable.

In other words, the analyst may be able to avoid sampling an impractically large crowd, yet still obtain reliable statistical information by a Monte Carlo simulation (MCS). In this section the responses from a hypothetical crowd of 1 million were simulated by applying the underlying reasoning and mathematical procedure described in Section 2.

Responses from a large crowd to a question that calls for a quantitative answer will presumably include some random guesses as well as reasoned estimates. As the author has emphasized elsewhere [

Let us assume, then, that members of the hypothetical crowd represented by the MCS are incentivized to deduce the number of coins as described in Section 2. A likely approach entails multiplying the numerical density of coins by the geometrical dimensions of the volume of the receptacle. The televised image of the tumbler showed it to have the shape of an inverted truncated right circular cone, or frustum, such as illustrated in

$Z = (\pi/3)\left( R_1^2 + R_1 R_2 + R_2^2 \right) H C$ (69)

in which $R_1$ is the lower radius, $R_2$ the upper radius, $H$ the height, and $C$ the numerical density of the coins. Because the upper and lower radii, the height, and the numerical density of coins are quantities unknown to the crowd, they must be treated as random variables. The author himself did not know the true numerical values but, judging from the same image presented to the viewers, assigned random variables with the following estimated means and standard deviations (in units of cm):

$R_1 = X(3, 0.7)$, $R_2 = X(5, 1.0)$, $H = X(20, 2.0)$, $C = X(1, 0.2)$ (70)

Monte Carlo simulations were then implemented for both normal variables $X = N$ and log-normal variables $X = \Lambda$.
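A minimal sketch of such a simulation with log-normal basis variables, using Equations (69) and (70) and the standard moment-matching relations between the target $(\mu, \sigma)$ and the log-scale parameters $(a, b)$; the reduced sample size is an assumption made purely for speed:

```python
import math
import random

random.seed(1)

def lognormal_params(mean, sd):
    """Log-scale parameters (a, b) such that exp(N(a, b^2))
    has the given mean and standard deviation."""
    b2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - b2 / 2.0, math.sqrt(b2)

# Basis variables of Equation (70): R1(3,0.7), R2(5,1.0), H(20,2.0), C(1,0.2).
params = [lognormal_params(m, s)
          for m, s in [(3, 0.7), (5, 1.0), (20, 2.0), (1, 0.2)]]

n = 200_000  # reduced from the paper's 1,000,000 for speed
z_sum = 0.0
for _ in range(n):
    r1, r2, h, c = (random.lognormvariate(a, b) for a, b in params)
    # Equation (69): frustum volume times numerical coin density.
    z_sum += (math.pi / 3.0) * (r1 * r1 + r1 * r2 + r2 * r2) * h * c

z_mean = z_sum / n
```

With these parameters the expected mean is $(\pi/3)(\langle R_1^2\rangle + \langle R_1\rangle\langle R_2\rangle + \langle R_2^2\rangle)\langle H\rangle\langle C\rangle \approx 1057$, consistent with the value reported in the table of results.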

The gray histogram marked “Simulation” in

$m_s = \frac{1}{n_s} \sum_{k=1}^{n_s} y_k^{(\mathrm{sim})} = 6.892$ (71)

$s_s = \sqrt{ \frac{1}{n_s} \sum_{k=1}^{n_s} \left( y_k^{(\mathrm{sim})} - m_s \right)^2 } = 0.378$ (72)

where the variates $z_k^{(\mathrm{sim})}$, $k = 1, \cdots, n_s$, are the simulated values of $Z$ in Equation (69) and

$y_k^{(\mathrm{sim})} = \ln\left( z_k^{(\mathrm{sim})} \right).$ (73)

For the log-normal basis and sample size of 1 million, the match of theory and simulation in

$\Delta N_c^{(\mathrm{sim})} = \left( \bar{Z}^{(\mathrm{sim})} - N_c \right) / N_c = -4.86\%$ (74)

as summarized in

Simulation Variables

Basis RVs $X(\mu_i, \sigma_i)$: $R_1(3, 0.7)$, $R_2(5, 1.0)$, $H(20, 2.0)$, $C(1, 0.2)$

Composite RVs: $Z = (\pi/3)(R_1^2 + R_1 R_2 + R_2^2) H C$, $Y = \ln(Z)$

| | Mean $\bar{Z}$ or $\langle Z \rangle$ | S.E. $S_Z$ or $\Sigma_Z$ | Mean $\bar{Y}$ or $\langle Y \rangle$ | S.D. $s_Y$ or $\sigma_Y$ | S.E. $S_Y$ or $\Sigma_Y$ |
|---|---|---|---|---|---|
| Sample (n = 1706) | 982 | 38.56 | 6.57 | 0.72 | 0.017 |
| Simulation, log-normal (n = 1,000,000) | 1057 | 0.42 | 6.89 | 0.38 | 0.00038 |
| Theoretical expectation (log-normal) | 1057 | 0.41 | 6.89 | 0.38 | 0.00038 |
| Simulation, normal (n = 1,000,000) | 1057 | 0.40 | 6.89 | 0.39 | 0.00039 |
| Theoretical expectation (normal) | 1061 | 0.43 | 6.89 | 0.39 | 0.00039 |
| True count | 1111 | | | | |

Fractional error: experiment (n = 1706), −11.61%; simulation (n = 1,000,000), −4.86%; theory (n = 1,000,000), −4.50%.

The histogram obtained from the MCS with normal basis variables is nearly identical to that in

The most significant statistical outcome, however, is that the MCS predicted the number of coins in the tumbler much more closely than did the actual crowd. Results of the experiment and simulations are summarized in detail in

The coin estimation study raises several issues worth clarifying if the investigation is to provide a useful general methodology for seeking solutions to other quantitative problems by crowdsourcing.

1) Although sample size matters, the reason that the MCS did much better than the BBC crowd in estimating the number of coins in the tumbler was not primarily due to sample size. The populations sampled by crowdsourcing and by MCS were different not only in size but principally in their effective information content. This was seen by running the MCS with the same parameters (44) as before, but for a sample size comparable to that of the coin experiment, i.e. ~2000. The result was a 24-bin histogram that produced a sample mean of ~1048 and a shape that effectively overlapped the MCS histogram of

2) Although the MCS of Section 4.2 estimated the number of coins by calculating the volume of a conical frustum, it is unlikely that respondents to The One Show arrived at their estimates in precisely the same way. Quite possibly, very few of the members of the crowd would have known what a frustum is or how to calculate its volume. It is not this geometrical detail that is important in determining the distribution of estimates, but only the act of estimating a volume and multiplying it by a numerical density. The crowd could have treated the glass tumbler simply as a rectangular solid. The independent variations of height, length, and width assumed by different respondents would have again generated estimates distributed log-normally to an excellent approximation, as demonstrated in Section 3. The fact that the sample mean of the crowd was reasonably accurate indicates that most respondents probably applied some kind of valid reasoning to obtain their answers. How closely the MCS estimate matches the true value of a composite variable depends on how well the analyst can model the statistical uncertainties in the factors upon which the sought-for variable depends.
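This point is easy to check numerically. The sketch below models the tumbler as a rectangular solid with independent normal factors; the box dimensions are purely hypothetical assumptions, not data from the experiment. The skewness of $\ln Z$ comes out much smaller than that of $Z$ itself, as expected if $Z$ is approximately log-normal:

```python
import math
import random

random.seed(2)

def skewness(vals):
    """Sample skewness (third standardized central moment)."""
    n = len(vals)
    m = sum(vals) / n
    s = math.sqrt(sum((v - m) ** 2 for v in vals) / n)
    return sum((v - m) ** 3 for v in vals) / (n * s ** 3)

n = 100_000
zs = []
for _ in range(n):
    # Hypothetical box model (dimensions in cm are illustrative assumptions).
    L = random.gauss(8, 1.0)    # length
    W = random.gauss(8, 1.0)    # width
    H = random.gauss(20, 2.0)   # height
    C = random.gauss(1, 0.2)    # coins per unit volume
    z = L * W * H * C
    if z > 0:                   # discard rare non-physical negative draws
        zs.append(z)

skew_z = skewness(zs)                             # right-skewed, log-normal-like
skew_log = skewness([math.log(z) for z in zs])    # near zero if Z ~ log-normal
```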

3) It is especially noteworthy that the MCS estimates Z, defined in Equation (69), resulted in a virtually perfect log-normal distribution, as shown by

One widely used measure of the degree of correlation between two random variables U, V is provided by the Pearson correlation coefficient ρ U , V defined by [

$\rho_{U,V} \equiv \dfrac{\operatorname{cov}(U,V)}{\sigma_U \sigma_V} = \dfrac{\langle (U - \mu_U)(V - \mu_V) \rangle}{\sqrt{\langle (U - \mu_U)^2 \rangle \langle (V - \mu_V)^2 \rangle}} .$ (75)

$\rho_{U,V}$ can range between −1 and +1. At the upper limit +1, V varies in the same direction and in perfect linearity with U; at the lower limit −1, V varies in the opposite direction in perfect linearity with U. If two random variables are independent, then $\rho_{U,V} = 0$, but the converse is not true; $\rho_{U,V} = 0$ does not prove that U and V are independent. Various interpretations have been given to $\rho_{U,V}$ [
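The Pearson coefficient of Equation (75) is straightforward to estimate by sampling. The sketch below draws the normal radius variables of Equation (70) and correlates $U = R_1^2$ with $V = R_1 R_2$, reproducing the value of roughly 0.74 quoted in Equation (77):

```python
import math
import random

random.seed(3)

n = 200_000
u_vals, v_vals = [], []
for _ in range(n):
    r1 = random.gauss(3, 0.7)   # R1 = N(3, 0.7^2), Equation (70)
    r2 = random.gauss(5, 1.0)   # R2 = N(5, 1.0^2)
    u_vals.append(r1 * r1)      # U = R1^2
    v_vals.append(r1 * r2)      # V = R1 * R2

def pearson(u, v):
    """Sample version of Equation (75)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    su = math.sqrt(sum((a - mu) ** 2 for a in u) / n)
    sv = math.sqrt(sum((b - mv) ** 2 for b in v) / n)
    return cov / (su * sv)

rho = pearson(u_vals, v_vals)
```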

To estimate the degree of correlation of the terms in Equation (69) for the volume of the tumbler, the Pearson correlation coefficient was used. Substitution of

$U \equiv R_1^2$, $V \equiv R_1 R_2$ (76)

into Equation (75), where the radii R 1 and R 2 are given in Equation (70), resulted in correlation coefficients

$\rho_{U,V}^{(N)} = 0.741$, $\rho_{U,V}^{(\Lambda)} = 0.740$ (77)

for normal (N) and log-normal ( Λ ) radius variables, respectively. The analysis is given in Appendix 2.

The author is unaware of any closed-form expression for the PDF or CDF of a sum of correlated or uncorrelated log-normal RVs, although it is known that the resulting RV is not rigorously log-normal [

Nevertheless, the Monte Carlo simulations executed in the present study of crowdsourcing have shown by computational and graphical means that composite random variables are distributed log-normally to an excellent approximation for large sample size and log-normal basis RVs of low variance $(\sigma_i / \mu_i < 1)$, even if the composite variable comprises correlated terms.

This paper examined analytically, numerically, and experimentally the distribution of crowdsourced estimates of the solution to a problem seeking the number of objects in a partially revealed three-dimensional volume. Experimentally, the mean response of the crowd, which comprised approximately 2000 viewers of a BBC television show, was within ~12% of the true count. More significantly, the distribution of viewer responses was satisfactorily accounted for by a log-normal distribution.

Theoretical analyses of the product of independent random variables of low standard deviation-to-mean ratios showed that the product was distributed log-normally to an excellent approximation irrespective of the number of factors and their individual distributions. Monte Carlo tests of the theory were made with normal, uniform, Laplace, and log-normal factor variables, all of the same mean and variance, but differing widely in the shape statistics skewness and kurtosis. For independent factors of the log-normal type, the product was rigorously (not approximately) log-normal.

Monte Carlo simulations of the coin estimation experiment, employing basis variables of either the normal or log-normal type and a sample size of 1 million, resulted in mean estimates that were within ~5% of the true count. Particularly noteworthy is the fact that the sought-for composite variable comprised terms that were not independent, but linearly correlated. Nevertheless, the histogram of the product variable was, to all visual appearances, rigorously log-normal.

Telecommunications media and the internet have the potential to make possible large-scale crowdsourcing of problems like the archetype investigated here, which involved image analysis and object counting. However, the robustness of the log-normal distribution as a kind of universal distribution of composite random variables suggests that crowdsourcing can likewise be accomplished accurately by computer simulations of sufficiently large sample size, provided the underlying statistical model accurately accounts for the uncertainties of the factor variables.

The author thanks reporter Alexandra Freeman of the BBC The One Show for initiating contact regarding the planning of crowdsourcing experiments and providing the author with the resulting data files. The author also thanks Trinity College for partial support through the research fund associated with the George A. Jarvis Chair of Physics.

The author declares no conflicts of interest regarding the publication of this paper.

Silverman, M.P. (2019) Crowdsourced Sampling of a Composite Random Variable: Analysis, Simulation, and Experimental Test. Open Journal of Statistics, 9, 494-529. https://doi.org/10.4236/ojs.2019.94034

Probability Density Function of Z = exp(Y)

Consider random variables Y and Z related by

$Z = \exp(Y).$ (78)

The cumulative probability function (CPF) of Z is defined by the relation

$F_Z(z) = \Pr(Z \le z) = \int_{z_0}^{z} p_Z(z')\, \mathrm{d}z'$ (79)

where z 0 is some constant reference point. The probability density function (PDF) of Z can be calculated from the CPF by differentiation (see Ref [

$p_Z(z) = \mathrm{d}F_Z(z) / \mathrm{d}z.$ (80)

Substitution of Equation (78) into (79) leads to the chain of deductions

$F_Z(z) = \Pr\left( \mathrm{e}^Y \le z \right) = \Pr(Y \le \ln(z)) = \int_{-\infty}^{\ln(z)} p_Y(y)\, \mathrm{d}y.$ (81)

Substitution of Equation (81) into (80) leads by the Leibniz integral formula (see Ref [

$p_Z(z) = z^{-1} p_Y(\ln(z)).$ (82)
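Equation (81), from which (82) follows, can be checked numerically: for normal $Y$, the empirical CDF of $Z = \mathrm{e}^Y$ should equal the normal CDF evaluated at $\ln z$. The parameters below are illustrative assumptions, chosen near the simulation values $m_s$, $s_s$:

```python
import math
import random

random.seed(4)

m, s, z0 = 6.89, 0.38, 1000.0  # illustrative parameters, not fitted values

# Empirical CDF of Z = exp(Y) at z0, from samples of Y ~ N(m, s^2).
n = 200_000
hits = sum(1 for _ in range(n) if math.exp(random.gauss(m, s)) <= z0)
empirical = hits / n

# Equation (81): F_Z(z0) = F_Y(ln z0); the normal CDF expressed via erf.
analytic = 0.5 * (1.0 + math.erf((math.log(z0) - m) / (s * math.sqrt(2.0))))
```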

Correlation of X^{2} and XY

Consider the two composite variables

$Z_1(\mu_1, \sigma_1) = X(m_1, s_1)^2$, $Z_2(\mu_2, \sigma_2) = X(m_1, s_1)\, Y(m_2, s_2)$ (83)

where X ( m 1 , s 1 ) and Y ( m 2 , s 2 ) are independent RVs with respective means ( m i ) and standard deviations ( s i ) , i = 1 , 2 . The correlation coefficient defined by Equation (75) then takes the form

$\rho_{Z_1 Z_2} = \dfrac{\langle X^3 Y \rangle - \langle X^2 \rangle \langle X \rangle \langle Y \rangle}{\sqrt{\left( \langle X^4 \rangle - \langle X^2 \rangle^2 \right) \left( \langle X^2 \rangle \langle Y^2 \rangle - \langle X \rangle^2 \langle Y \rangle^2 \right)}}$ (84)

Equation (84) will be evaluated for the two basis distributions of Section 4.

Case 1: X and Y are normal RVs

Substitution of the variables

$X(m_1, s_1) = N(m_1, s_1^2)$, $Y(m_2, s_2) = N(m_2, s_2^2)$ (85)

into Equations (83) and (84) leads to expectation values

$\mu_1 = m_1^2 + s_1^2$, $\sigma_1^2 = 4 m_1^2 s_1^2 + 2 s_1^4$; $\mu_2 = m_1 m_2$, $\sigma_2^2 = m_1^2 s_2^2 + m_2^2 s_1^2 + s_1^2 s_2^2$ (86)

and the correlation coefficient

$\rho_{Z_1 Z_2}^{(N)} = \dfrac{\sqrt{2}\, m_1 m_2 s_1}{\sqrt{\left( 2 m_1^2 + s_1^2 \right) \left( m_1^2 s_2^2 + m_2^2 s_1^2 + s_1^2 s_2^2 \right)}}$ (87)
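Evaluated at the parameters of Equation (70), the closed form (87) reproduces the Monte Carlo value of Equation (77). A sketch (the factor of $\sqrt{2}$ follows from inserting the normal moments $\langle X^3 \rangle = m^3 + 3 m s^2$ and $\langle X^4 \rangle = m^4 + 6 m^2 s^2 + 3 s^4$ into Equation (84)):

```python
import math

def rho_normal(m1, s1, m2, s2):
    """Correlation of X^2 and X*Y for independent normals
    X = N(m1, s1^2), Y = N(m2, s2^2), Equation (87)."""
    num = math.sqrt(2.0) * m1 * m2 * s1
    den = math.sqrt((2 * m1**2 + s1**2)
                    * (m1**2 * s2**2 + m2**2 * s1**2 + s1**2 * s2**2))
    return num / den

rho_n = rho_normal(3, 0.7, 5, 1.0)   # R1, R2 of Equation (70)
```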

Case 2: X and Y are log-normal RVs

Substitution of the variables (with parameters related by Equation (60))

$X(m_1, s_1) = \Lambda(a_1, b_1^2)$, $Y(m_2, s_2) = \Lambda(a_2, b_2^2)$ (88)

into Equations (83) and (84) leads to expectation values

$\mu_1 = \exp\left( 2 a_1 + 2 b_1^2 \right)$, $\sigma_1^2 = \exp(4 a_1) \left( \exp(8 b_1^2) - \exp(4 b_1^2) \right)$ (89)

$\mu_2 = \exp\left( a_1 + a_2 + \tfrac{1}{2} b_1^2 + \tfrac{1}{2} b_2^2 \right)$, $\sigma_2^2 = \exp(2 a_1 + 2 a_2) \left( \exp(2 b_1^2 + 2 b_2^2) - \exp(b_1^2 + b_2^2) \right)$ (90)

and the correlation coefficient

$\rho_{Z_1 Z_2}^{(\Lambda)} = \dfrac{\exp(2 b_1^2) - 1}{\sqrt{\exp(5 b_1^2 + b_2^2) - \exp(4 b_1^2) - \exp(b_1^2 + b_2^2) + 1}} .$ (91)
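Evaluated at the log-scale standard deviations of Equation (94), $b_1 = 0.2302$ and $b_2 = 0.1980$, Equation (91) reproduces the value 0.740 of Equation (77); note that the scale parameters $a_1, a_2$ cancel out of the ratio:

```python
import math

def rho_lognormal(b1, b2):
    """Correlation of X^2 and X*Y for independent log-normals with
    log-scale standard deviations b1, b2, Equation (91)."""
    num = math.exp(2 * b1**2) - 1.0
    den = math.sqrt(math.exp(5 * b1**2 + b2**2) - math.exp(4 * b1**2)
                    - math.exp(b1**2 + b2**2) + 1.0)
    return num / den

# Log-scale parameters matched to R1(3, 0.7) and R2(5, 1.0), Equation (94).
rho_l = rho_lognormal(0.2302, 0.1980)
```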

With regard to the random variables representing the geometry of the tumbler in Section 4, application of the foregoing relations leads to

$Z_1 \equiv R_1^2 = N_1(3, 0.7^2)^2$, $Z_2 \equiv R_1 R_2 = N_1(3, 0.7^2)\, N_2(5, 1.0^2)$ (92)

and

$\rho_{Z_1 Z_2}^{(N)} = 0.741$ (93)

for Case 1, and to

$Z_1 \equiv R_1^2 = \Lambda_1(1.0721, 0.2302^2)^2$, $Z_2 \equiv R_1 R_2 = \Lambda_1(1.0721, 0.2302^2)\, \Lambda_2(1.5898, 0.1980^2)$ (94)

and

$\rho_{Z_1 Z_2}^{(\Lambda)} = 0.740$ (95)

for Case 2.

The correlation coefficients are virtually the same for the normal and log-normal bases, as one might have anticipated from the close match of the individual distribution functions displayed in