Crowdsourced Sampling of a Composite Random Variable: Analysis, Simulation, and Experimental Test
1. Introduction: Estimation of an Unknown Composite Quantity by Large-Scale Sampling
The global reach of telecommunications media, including radio, television, and in particular the social media sites of the internet, makes possible an ease and scale of statistical sampling hitherto inconceivable. Through use of these media, almost any question can, at least in principle, be posed to a large, anonymous, diverse, independent population of respondents, referred to in both technical and non-technical literature as the “crowd” [1]. This paper reports a comprehensive 1) analytical investigation, 2) Monte-Carlo simulation, and 3) experimental test of the distribution of a composite random variable (RV) representing a crowdsourced response to a question calling for a numerical answer. A composite RV is a product of two or more factor RVs. In the following sections it is shown that:
1) the most useful characteristic of a crowdsourced sample is its distribution function and not just a single statistic,
2) under conditions to be specified, a product of RVs is distributed log-normally to an excellent approximation, irrespective of the type or number or correlation of factor RVs,
3) computer simulation methods can model the response of a hypothetical rational crowd orders of magnitude larger than what actually might be practically attainable.
1.1. Background
To the author’s knowledge, the first quantitative experiment in what today would be considered crowdsourcing was published by the English polymath and statistical innovator Sir Francis Galton in 1907 [2] [3] . Galton collected all the estimates of the weight of a dressed ox (i.e. the carcass weight) submitted by contestants at the annual West of England Fat Stock and Poultry Exhibition. To his surprise, he found that the sample median of 1207 pounds differed from the measured weight of 1198 pounds by a mere +0.8% and that the sample mean of 1197 pounds differed by an even smaller fractional error of −0.08%. The sample size was reported to be about 800. There was no mention of the sample distribution.
The idea underlying crowdsourcing—a term introduced in 2006—is that a large group of non-experts can collectively arrive at a more accurate estimate of some physical quantity or at a better decision regarding some policy, strategy, or treatment than a small group of experts [4] . This idea is a hypothesis to be examined experimentally, not a mathematical theorem, like Condorcet’s jury theorem [5] , subject to rigorous proof. Central among crowdsourcing issues investigated recently are questions regarding methods of sampling, quality control, bias elimination, and effectiveness [6] [7] [8] [9] .
This paper addresses a different aspect of crowdsourcing closer in nature to the kind of experiment first performed by Galton. Questions whose responses can be represented numerically are especially suitable for statistical analysis. In this regard, the most useful statistical information to obtain from a crowdsourced sample is its distribution—i.e. the probability function for a discrete random variable (RV), or probability density function (PDF) for a continuous RV, or cumulative distribution function (CDF) for either kind of RV. For simplicity of discussion, the designation PDF will apply here to both discrete and continuous RVs. The importance of knowing the PDF or CDF of a distribution is that one can calculate from it, either theoretically or numerically, the exact population moments, which, depending on the size of an actual sample, can be significantly different from the sample moments. The population moments are estimates of the statistics that would result from a hypothetical infinitely large population of independent respondents. A virtually infinite sample size is what the internet and mass media have the potential to provide; it is also what computer-based Monte Carlo simulation (MCS) methods are already able to provide.
Throughout the past two decades, the author has conducted an array of experiments with students in his physics courses to investigate the validity of the crowdsourcing hypothesis [10] . In particular, tests were designed to examine whether groups of non-experts excelled over specialists in exercises relating to estimation, prediction, and deduction. Because sample sizes were relatively small (below 100), histograms of responses showed significant fluctuations, and the results did not appear to be accounted for by a universally applicable distribution. However, a larger-scale experiment (discussed in Section 4) to test crowdsourced sampling, implemented with the collaboration of a BBC One television show, yielded preliminary results that strongly suggested a log-normal distribution of estimates [10] . The present paper is the outcome of a more general and thorough analysis to extract information contained in a crowdsourced sample.
This paper reports a comprehensive study of the distribution of responses to a class of questions that calls for estimation of a composite random variable. A composite RV is formed by the product of two or more basis RVs. (The term “composite” is adopted from the designation of a “composite number” [11] as an integer expressible as the product of two or more integers, in contrast to a prime number.) This type of question is widely applicable to problems involving mathematics, statistics, physical sciences, engineering, bio-medical sciences, forensics, business and finance, military science, political science, archaeology, and other fields dependent upon quantitative reasoning.
An archetypical example of this class might be a question like the following: How many objects are contained within some partially disclosed geometric region? There are countless contexts in which such a question might arise and for which turning to a crowd for the answer may be good strategy. For example, high-energy physicists may enlist a crowd to count events recorded in a complex bubble-chamber image; astronomers may enlist a crowd to search a deep-space image for some extraordinary astrophysical event or object; intelligence services may enlist a crowd to search reconnaissance images for locations or objects of military interest; archaeologists may enlist a crowd to search satellite images for structures associated with some cultural sites, and so on [12] [13] [14] [15] [16].
The specific problem examined in this paper is mathematically simple, but statistically informative: How many identical opaque objects are contained within a certain 3-dimensional volume of space seen only as a 2-dimensional image? The problem involves image analysis and object counting. A reasonable procedure to answer that question might entail the following: 1) Depending on the shape of the region, multiply together the appropriate geometric factors to obtain the volume, and then 2) multiply that volume by the numerical density, i.e. the number of objects in a unit volume. However, none of the needed numbers is known; all are representable by random variables whose realizations (i.e. estimates) by respondents in the crowd would be different. The sought-for RV would, in general, be a product (or sum of products) of 3 RVs relating to geometry and 1 RV characterizing the numerical density—or in all a product (or sum of products) of 4 RVs. The analyst is then faced with three general questions:
1) How are the basis RVs distributed?
2) What will be the distribution of the composite RV?
3) Which statistic of the composite RV should be taken to represent the physical value of the sought-for quantity?
By examining this archetypical question a) theoretically, b) computationally by Monte Carlo simulation, and c) experimentally, this paper addresses the preceding three questions.
1.2. Organization
The remainder of this paper is organized in the following way:
Section 2 investigates analytically the distribution of a composite random variable comprising independent basis RVs. Of particular interest are the cases in which the basis is either normally or log-normally distributed.
Section 3 investigates numerically by MCS the distributions of a composite variable comprising basis RVs whose distributions differ widely in shape parameters (skewness, kurtosis) for fixed location and scale parameters (mean, variance).
Section 4 reports 1) an experiment, implemented with the collaboration of a British national television show, to employ crowdsourcing as a means to estimate the number of opaque objects in a transparent receptacle, and 2) the use of MCS to predict the statistical results for a hypothetical much larger crowd incentivized to estimate rationally rather than guess randomly.
Section 5 concludes the paper with a summary of principal findings.
For the reader’s convenience, the statistical abbreviations used in the paper are listed below in alphabetical order.
BBC = British Broadcasting Corporation
CDF = cumulative distribution function
CF = characteristic function
CLT = central limit theorem
MCS = Monte Carlo simulation(s)
MGF = moment generating function
PDF = probability density function
RNG = random number generator
RV = random variable
2. Distribution of a Composite Random Variable
2.1. General Case
Consider a random variable Z defined by the product
$$Z = \prod_{i=1}^{n} X_i(\mu_i, \sigma_i) \tag{1}$$
where each basis variable $X_i$ in Equation (1) is characterized by its mean $\mu_i$ and standard deviation $\sigma_i$. At this point, the symbol X represents an arbitrary RV, and the parameters $(\mu_i, \sigma_i)$ for defining $X_i$ were chosen to simplify the notation and analysis in sections to follow. Conventional statistical labeling of specific RVs that are relevant to this paper may include parameters different from the mean and standard deviation, as summarized in Table 1. The symbol
employed in Table 1 is the Heaviside function, also known as the step function, which we define here as
(2)
(There are different definitions of
depending on the value assigned to
[17] .) A statistical convention followed in this paper is to represent a random variable by an upper case letter, e.g. X, and a variate (i.e. sample or realization of the random variable) by a corresponding lower case letter, e.g. x.
Table 1. Representation of relevant random variables.
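The natural logarithm of Z, which is a more convenient RV to work with, takes the form
$$Y \equiv \ln Z = \sum_{i=1}^{n} \ln X_i . \tag{3}$$
Reciprocally, one can write
$$Z = e^{Y} . \tag{4}$$
The strategy of the analysis in this section is to calculate the moment-generating function (MGF) of Y defined by the expectation operation
$$g_Y(t) \equiv \langle e^{Yt} \rangle = \int_{-\infty}^{\infty} e^{yt}\, p_Y(y)\, dy \tag{5}$$
in which $p_Y(y)$ is the PDF of Y, and t is a dummy variable the differentiation of which generates the statistical moments $\langle Y^k \rangle$ in the following way:
$$\langle Y^k \rangle = \left. \frac{d^k g_Y(t)}{dt^k} \right|_{t=0} . \tag{6}$$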
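If the MGF of a random variable does not exist, one can always use the characteristic function (CF) defined by
$$h_Y(t) \equiv \langle e^{iYt} \rangle = \int_{-\infty}^{\infty} e^{iyt}\, p_Y(y)\, dy \tag{7}$$
where Equation (7) is recognized as the Fourier transform of $p_Y(y)$ [18]. Each random variable is uniquely characterized by its MGF (if it exists) and CF [19]. By identifying the MGF or CF of Y, it may then be possible to determine the distribution of the sought-for composite variable Z.
Substitution of Equation (3) into Equation (5) leads to
$$g_Y(t) = \left\langle \exp\!\Big( t \sum_{i=1}^{n} \ln X_i \Big) \right\rangle = \left\langle \prod_{i=1}^{n} X_i^{\,t} \right\rangle = \prod_{i=1}^{n} \left\langle X_i^{\,t} \right\rangle \tag{8}$$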
in which the last step—expectation of product equals product of expectations—is justified if the basis RVs are independent, as assumed to be the case in this section. This point will be revisited in Section 4.
From the form of Equation (1), a further condition of the analysis is that the basis RVs have well-defined means and variances. This is the same requirement as for the Central Limit Theorem (CLT) (see [19] , 193-195). Re-express each
by the identity
, (9)
which defines the variable
, and substitute Equation (9) into Equation (8) to obtain
. (10)
If the basis variables
are to describe reasoned estimates rather than unrestricted random guesses, then it can be assumed that representative values of
are less than 1—i.e. that the expectations
are small compared to
for integer
.
Expansion of the binomial factor in Equation (10) to order
, followed by insertion of the expectation values
(11)
leads to the approximate MGF
(12)
where
(13)
(14)
(15)
respectively define the mean, variance, and skewness parameter
of Y. Under the conditions assumed in the foregoing analysis, MGF (12) shows that the distributions of Y, and therefore also Z, are not symmetric about the mean.
The author has been unable to find any source that identifies MGF (12) with a named distribution. However, upon neglect of skewness, Equation (12) takes the form
$$g_Y(t) = \exp\!\left( m_Y t + \tfrac{1}{2} s_Y^2 t^2 \right) \tag{16}$$
of the MGF of a normal RV [20]. By definition, if Y, as defined by Equation (3), is a normal RV denoted by $N(m_Y, s_Y^2)$, then Z is a log-normal RV denoted by $\Lambda(m_Y, s_Y^2)$; see Table 1. Note that the parameters defining the log-normal RV are the mean and variance of the associated normal RV and not the mean and variance of the log-normal RV itself.
For comparison, Figure 1 shows plots of the PDF of a normal and log-normal distribution, as well as the PDFs of a uniform and Laplace distribution (which will be used in Section 3), all of the same mean
and standard deviation
. The figure illustrates that the significance of the standard deviation as a measure of statistical uncertainty (i.e. the width of the PDF) can vary markedly for different distributions, as summarized quantitatively in Table 2, which records the cumulative probability
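$$P_X \equiv \Pr(\mu - \sigma \le X \le \mu + \sigma) = \int_{\mu-\sigma}^{\mu+\sigma} p_X(x)\, dx \tag{17}$$
of a variable X.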
Figure 1. Graphical comparison of selected distributions of fixed mean
and fixed standard deviation
: (a) Gaussian (red), (b) log-normal (black), (c) uniform (blue), (d) Laplace (green).
Table 2. Comparative significance of 1 standard deviation uncertainty.
Note that for variables N, U, and La the probability
that a sample falls within
standard deviation of the mean is a constant dependent on the type of distribution, but independent of the parameters of the distribution. For the log-normal variable, however,
has a complicated dependence on
and
(18)
where the error function is defined by
$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-u^2}\, du . \tag{19}$$
The first column of Table 2 shows the values of the distribution parameters that lead to the fixed mean (
) and variance (
) specified in the first row. The second and third columns of the table provide the theoretical relations connecting the parameters of each distribution to the mean and variance of the associated RVs.
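As a rough numerical check of the one-standard-deviation probabilities discussed above, the following sketch evaluates the probability that a variate falls within one standard deviation of the mean for each of the four distributions. It is an illustration, not the author's computation: the function names are hypothetical, and the log-normal case assumes the standard relations $s^2 = \ln(1 + \sigma^2/\mu^2)$ and $m = \ln\mu - s^2/2$ between the mean and standard deviation $(\mu, \sigma)$ of the log-normal RV and the parameters $(m, s)$ of its associated normal RV.

```python
import math


def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def one_sigma_normal():
    # P(|X - mu| <= sigma) for a normal RV; independent of mu and sigma.
    return math.erf(1.0 / math.sqrt(2.0))


def one_sigma_uniform():
    # A uniform RV with standard deviation sigma has half-width sqrt(3)*sigma,
    # so the band of half-width sigma covers a fraction 1/sqrt(3) of the range.
    return 1.0 / math.sqrt(3.0)


def one_sigma_laplace():
    # A Laplace RV with standard deviation sigma has scale b = sigma/sqrt(2),
    # and P(|X - mu| <= x) = 1 - exp(-x/b).
    return 1.0 - math.exp(-math.sqrt(2.0))


def one_sigma_lognormal(mu, sigma):
    # (m, s) are the mean and standard deviation of ln X (assumed relations).
    s2 = math.log(1.0 + (sigma / mu) ** 2)
    m = math.log(mu) - 0.5 * s2
    s = math.sqrt(s2)
    upper = normal_cdf((math.log(mu + sigma) - m) / s)
    # If sigma >= mu the lower limit is not positive and contributes nothing.
    lower = normal_cdf((math.log(mu - sigma) - m) / s) if mu > sigma else 0.0
    return upper - lower


if __name__ == "__main__":
    print("normal    ", one_sigma_normal())
    print("uniform   ", one_sigma_uniform())
    print("Laplace   ", one_sigma_laplace())
    for ratio in (0.1, 0.5, 1.0):  # hypothetical sigma/mu ratios
        print("log-normal", ratio, one_sigma_lognormal(1.0, ratio))
```

In this sketch the log-normal probability depends on the parameters only through the ratio $\sigma/\mu$, and it approaches the normal value as that ratio becomes small.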
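In the analyses and experiments of this paper, it will be adequate to neglect the skewness of Y and adopt MGF (16), which identifies Y as a normal RV. In that case, it follows that Z takes the form
$$Z = e^{Y} = e^{m_Y + s_Y W} \tag{20}$$
in which $W \equiv N(0, 1)$ is a standard normal RV. The justification of Equation (20) is that an arbitrary normal RV $N(m, s^2)$ can be written in the form [21]
$$N(m, s^2) = m + s\, N(0, 1) . \tag{21}$$
Equation (20) leads directly by integration to the expectation values of Z
$$\langle Z^k \rangle = \int_{-\infty}^{\infty} e^{k (m_Y + s_Y w)}\, p_W(w)\, dw = \exp\!\left( k m_Y + \tfrac{1}{2} k^2 s_Y^2 \right) \tag{22}$$
where the PDF of W is given in Table 1 by setting $m = 0$ and $s = 1$ in the PDF of $N(m, s^2)$.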
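From Equation (22), the mean and variance of the log-normal RV are then
$$\mu_Z = \langle Z \rangle = e^{m_Y + s_Y^2/2}, \qquad \sigma_Z^2 = \langle Z^2 \rangle - \langle Z \rangle^2 = e^{2 m_Y + s_Y^2} \left( e^{s_Y^2} - 1 \right) \tag{23}$$
and the inverse relations, which will be needed later, can be shown to be
$$m_Y = \ln \mu_Z - \tfrac{1}{2} \ln\!\left( 1 + \frac{\sigma_Z^2}{\mu_Z^2} \right), \qquad s_Y^2 = \ln\!\left( 1 + \frac{\sigma_Z^2}{\mu_Z^2} \right) . \tag{24}$$
Although the RV Y is distributed symmetrically about its mean, the distribution of Z itself is skewed. From Equation (22) the third moment about the mean, to which skewness is proportional, can be shown to be
$$\left\langle (Z - \mu_Z)^3 \right\rangle = e^{3 m_Y + 3 s_Y^2/2} \left( e^{s_Y^2} - 1 \right)^2 \left( e^{s_Y^2} + 2 \right) . \tag{25}$$
It is useful to note that Equation (20) provides an even more direct way than integration of the PDF of arriving at Equation (22) for the moments of Z, since $\langle Z^k \rangle = \langle e^{kY} \rangle$ takes the form of the MGF (16) of a normal RV, upon replacing the dummy variable t with the moment order k.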
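The relations between the parameters $(m, s)$ of the associated normal RV and the mean and variance of the log-normal RV can be exercised numerically. The sketch below is a minimal illustration under the standard log-normal results $\langle Z^k \rangle = \exp(km + k^2 s^2/2)$; the function names are hypothetical, and the numerical values used in the round trip are arbitrary.

```python
import math


def lognormal_raw_moment(m, s, k):
    """k-th raw moment <Z^k> of Z = exp(m + s*W), with W standard normal."""
    return math.exp(k * m + 0.5 * (k * s) ** 2)


def moments_from_params(m, s):
    """Mean and variance of the log-normal RV from the normal parameters."""
    mean = lognormal_raw_moment(m, s, 1)
    variance = lognormal_raw_moment(m, s, 2) - mean ** 2
    return mean, variance


def params_from_moments(mean, variance):
    """Inverse relations: normal parameters (m, s) from mean and variance."""
    s2 = math.log(1.0 + variance / mean ** 2)
    m = math.log(mean) - 0.5 * s2
    return m, math.sqrt(s2)


def third_central_moment(m, s):
    """Third moment about the mean, built from the raw moments."""
    m1, m2, m3 = (lognormal_raw_moment(m, s, k) for k in (1, 2, 3))
    return m3 - 3.0 * m1 * m2 + 2.0 * m1 ** 3


if __name__ == "__main__":
    # Arbitrary illustrative values: mean 240, standard deviation 60.
    m, s = params_from_moments(240.0, 60.0 ** 2)
    print("m, s =", m, s)
    print("round trip:", moments_from_params(m, s))
    print("third central moment:", third_central_moment(m, s))
```

Building the third central moment from raw moments, rather than from a closed form, keeps the sketch independent of any particular algebraic arrangement of the result.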
The principal findings of this section may be summarized as follows:
1) A random variable Z composed of the product of 2 or more factor RVs for which the ratio of standard deviation to mean is <1 is distributed log-normally to the extent that the skewness (and higher order moments) of $Y = \ln Z$ can be neglected.
2) To find the parameters of the distribution of a log-normal RV Z, one first transforms the data (e.g. sample or simulation) by $Y = \ln Z$ to obtain the distribution of the associated normal RV Y, which is symmetric about its mean.
In concluding this section, a point of comparison is in order regarding the CLT for the sum of independent RVs and relation (20) for the product of independent RVs. In brief, the CLT holds that the sum (e.g. mean) of a sufficiently large number N of identically distributed, independent RVs $X_i$ converges to a normal RV irrespective of the distribution of X, provided that the $X_i$ have a well-defined mean and variance [22] [23]. In theory, the number N is infinitely large, but in practice it can be well below 10; see Ref [10], pp. 36-38. In contrast, the foregoing demonstration that a product of RVs is distributed approximately log-normally
$$Z = \prod_{i=1}^{n} X_i(\mu_i, \sigma_i) \approx \Lambda\!\left( \sum_{i=1}^{n} \ln \mu_i ,\ \sum_{i=1}^{n} \frac{\sigma_i^2}{\mu_i^2} \right) \tag{26}$$
holds for any number of factors $n \ge 2$ under the previously specified conditions. Moreover, the individual independent factors $X_i$ need not have identical distribution parameters, nor even all be the same type of variable X. The parameters of $\Lambda$ shown in Equation (26) are from Equations (13), (14), (15) with neglect of the skewness parameter and terms of order $\sigma_i^2/\mu_i^2$ in the mean. This reduction has been found satisfactory in accounting for the Monte Carlo simulations and experimental results discussed in later sections.
2.2. Special Case: Product of Normal RVs
The log-normal distribution of a composite RV derived in the previous section is an approximate relation valid to the extent that certain conditions are fulfilled. In the special case where the factors
of the product (26) defining Z are normal RVs, an alternative expression for the PDF of
can be derived by means of the CF. This is an important case because the normal distribution satisfactorily describes measurements or estimates of many biomedical variables, physical variables, and variables relating to business management and finance, among others [24] [25] [26] .
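From Equation (8) for the MGF of Y and the definition (7) for the CF, one can write
$$h_Y(t) = \langle e^{iYt} \rangle = \prod_{j=1}^{n} \left\langle X_j^{\,it} \right\rangle \tag{27}$$
for the CF of Y, where the index has been changed from i to j so as not to be confounded with the unit imaginary $i = \sqrt{-1}$. The inverse Fourier transform of Equation (27) then yields the PDF of Y
$$p_Y(y) = \frac{1}{2\pi} \int_{-\infty}^{\infty} h_Y(t)\, e^{-iyt}\, dt = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-iyt} \prod_{j=1}^{n} \left\langle X_j^{\,it} \right\rangle dt \tag{28}$$
in which the second equality of Equation (27) was substituted for $h_Y(t)$ in the first line of Equation (28). The PDF of Z is calculable from the PDF of Y by the following transformation (see Appendix 1):
$$p_Z(z) = \frac{1}{z}\, p_Y(\ln z) . \tag{29}$$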
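The change of variable from the PDF of $Y = \ln Z$ to the PDF of Z can be checked numerically. The following is a minimal sketch assuming the standard Jacobian rule $p_Z(z) = p_Y(\ln z)/z$ for $z > 0$; it uses a Gaussian $p_Y$ with hypothetical parameters purely for illustration and a hand-written trapezoidal rule in place of any particular numerical library.

```python
import math


def normal_pdf(y, m, s):
    """Gaussian PDF of mean m and standard deviation s."""
    return math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))


def pdf_of_exp(p_y):
    """Return the PDF of Z = exp(Y), given the PDF p_y of Y."""
    def p_z(z):
        # Change of variable y = ln z, with Jacobian dy/dz = 1/z.
        return p_y(math.log(z)) / z if z > 0.0 else 0.0
    return p_z


def trapezoid(f, a, b, steps):
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for k in range(1, steps):
        total += f(a + k * h)
    return total * h


# Hypothetical parameters of Y, chosen only for illustration.
M_Y, S_Y = math.log(240.0), 0.2
p_z = pdf_of_exp(lambda y: normal_pdf(y, M_Y, S_Y))

if __name__ == "__main__":
    print("normalization:", trapezoid(p_z, 40.0, 1200.0, 20000))
    print("mean of Z:    ", trapezoid(lambda z: z * p_z(z), 40.0, 1200.0, 20000))
```

The integration limits are chosen to span roughly eight standard deviations of Y on either side of its mean, so the truncated tails are negligible for these parameters.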
Substitution of relation (21) for each normal factor
into (27) leads to
, (30)
which can be re-expressed in the form
(31)
where
. (32)
Equation (31) is an exact expression for the CF of Y, but, to the author’s knowledge, cannot be integrated in closed form. However, for
, expansion of the logarithm in a Taylor series to order
results in the closed form expression
. (33)
Substitution of CF (33) into Equations (28) and (29) provides a more accurate PDF of Y and Z than the PDF of log-normal (26).
If
for each factor
in Equation (27), then one can approximate
in Equation (33) by
(34)
which, substituted into the integral in Equation (28), leads to the Gaussian distribution
(35)
for Y and the log-normal distribution (26) for Z.
As an example to illustrate the stages of the analysis, consider the composite RV
(36)
and associated log-product
. In Figure 2 are plotted the real part
(red), imaginary part
(blue), and magnitude
(dashed black) of the Fourier transform
given by Equation (33) as a function of t. Although t serves in the MGF as a dummy variable for computation of statistical moments by differentiation, in the CF t is equivalent to a spatial or temporal frequency [27] [28] .
and
are seen to be symmetric, and
antisymmetric, about t = 0, extending over a range
from −10 to +10. Figure 3 shows plots of
, Equation (28), as calculated by (1) numerical integration of the Fourier transform of the exact CF (31) (solid red), (2) the analytical approximation (33) to the CF (dashed blue), and (3) the PDF of the normal RV (35) (solid green). Profiles (1) and (2) are seen to be nearly indistinguishable, and both are well approximated by the Gaussian profile (3). Figure 4 shows plots of
as calculated by (1) numerical integration of the transformation (29) of the exact PDF of Y (solid red), and (2) the PDF of the approximate log-normal RV (26) (dashed blue). The exact and log-normal PDFs of Z closely match, apart from a slight forward shift of the peak of the log-normal profile.
2.3. Special Case: Product of Log-Normal RVs
The ubiquity of the normal distribution is primarily a consequence of the CLT, which is a limiting theorem for the sum of a large (in theory, infinite) number of random variables. Moreover, the distributed variable can take—or, as a matter of practicality, be thought to take—both positive and negative values, since the Gaussian PDF is normalized to unity only when integrated over the entire real axis. The log-normal distribution also occurs widely, particularly in reference to activities that involve counting, measuring, or observing the attributes of real physical things. Such activities underlie many kinds of problems for which crowdsourced solutions can be sought. The distributed variable then takes on only non-negative real values and is expected to be intrinsically skewed, since its least value cannot be below zero, whereas its upper limit is open.
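Consider, therefore, a composite variable Z composed of log-normal factors
$$Z = \prod_{i=1}^{n} \Lambda_i(m_i, s_i^2) \tag{37}$$
with PDF of the form (see [21], pp. 131-134)
$$p_\Lambda(z) = \frac{1}{\sqrt{2\pi}\, s\, z} \exp\!\left( -\frac{(\ln z - m)^2}{2 s^2} \right), \qquad z > 0 . \tag{38}$$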
Figure 2. Fourier transform
of the characteristic function of
, Equation (33), where
is defined by parameters
and
: (a) real part (solid red), (b) imaginary part (solid blue), (c) magnitude (dashed black).
Figure 3. PDF of
defined in Figure 2, as calculated from the Fourier transform of the exact CF Equation (31) (solid red), the Fourier transform of the analytical approximation Equation (33) (dashed blue), and the Gaussian Equation (35) (solid green).
Figure 4. PDF of Z defined in Figure 2, as calculated from the exact transformation relation (29) (solid red) and from the PDF of log-normal variable (26) (dashed blue).
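It then readily follows from the inverse of Equation (29) (see Appendix 1) that the PDF of the variable $Y = \ln \Lambda(m, s^2)$ has the form
$$p_Y(y) = \frac{1}{\sqrt{2\pi}\, s} \exp\!\left( -\frac{(y - m)^2}{2 s^2} \right) \tag{39}$$
which shows that Y is a Gaussian RV of mean m and variance $s^2$, i.e. $Y = N(m, s^2)$.
Thus, taking the log of Equation (37) leads to the chain of relations
$$Y = \ln Z = \sum_{i=1}^{n} \ln \Lambda_i(m_i, s_i^2) = \sum_{i=1}^{n} N_i(m_i, s_i^2) = N\!\left( \sum_{i=1}^{n} m_i ,\ \sum_{i=1}^{n} s_i^2 \right) \tag{40}$$
from which it follows that Z, itself, is a log-normal RV
$$Z = \Lambda(m_Z, s_Z^2) \tag{41}$$
with
$$m_Z = \sum_{i=1}^{n} m_i , \qquad s_Z^2 = \sum_{i=1}^{n} s_i^2 . \tag{42}$$
Stated formally: The product of independent log-normal RVs is a log-normal RV with parameters given by Equation (42). Note that the preceding result, Equation (41), is exact; no approximations regarding either the number of factor RVs or the relative magnitudes of parameters $m_i$ and $s_i$ have been made.
From Equation (23) the mean and variance of Z, defined by Equation (37), are then
$$\mu_Z = \exp\!\left( \sum_{i=1}^{n} m_i + \tfrac{1}{2} \sum_{i=1}^{n} s_i^2 \right), \qquad \sigma_Z^2 = \exp\!\left( 2 \sum_{i=1}^{n} m_i + \sum_{i=1}^{n} s_i^2 \right) \left[ \exp\!\left( \sum_{i=1}^{n} s_i^2 \right) - 1 \right] . \tag{43}$$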
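The additivity of the parameters for a product of independent log-normal factors lends itself to a quick simulation check. The sketch below is illustrative only: the factor parameters are hypothetical, the function names are not from the paper, and Python's standard `random.lognormvariate` stands in for whatever random number generator was actually used.

```python
import math
import random


def product_params(factors):
    """Parameters (m, s^2) of the product of independent log-normal factors.

    `factors` is a sequence of (m_i, s_i) pairs, the mean and standard
    deviation of ln X_i for each factor.
    """
    m = sum(m_i for m_i, _ in factors)
    s2 = sum(s_i ** 2 for _, s_i in factors)
    return m, s2


def product_mean_var(factors):
    """Mean and variance of the product Z implied by (m, s^2)."""
    m, s2 = product_params(factors)
    mean = math.exp(m + 0.5 * s2)
    variance = math.exp(2.0 * m + s2) * (math.exp(s2) - 1.0)
    return mean, variance


def simulate_log_product(factors, n, seed=0):
    """Draw n products of log-normal variates and return their logarithms."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        z = 1.0
        for m_i, s_i in factors:
            z *= rng.lognormvariate(m_i, s_i)
        logs.append(math.log(z))
    return logs


# Hypothetical factor parameters, chosen only for illustration.
FACTORS = [(0.5, 0.1), (1.0, 0.2), (1.5, 0.1), (2.0, 0.2)]

if __name__ == "__main__":
    y = simulate_log_product(FACTORS, 20000)
    y_mean = sum(y) / len(y)
    y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
    print("theory:", product_params(FACTORS))
    print("sample:", (y_mean, y_var))
```

The sample mean and variance of ln Z should agree with the summed parameters to within sampling fluctuations, whatever the number of factors.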
3. Monte-Carlo Simulations of a Composite Random Variable
In this section the distribution of responses to the kind of archetypical problem posed at the end of Section 1.1 is examined numerically by means of Monte-Carlo simulations (MCS) employing four basic types of two-parameter RVs
: 1) normal, 2) uniform, 3) Laplace, and 4) log-normal. The means
and standard deviations
of the factor RVs are respectively those of the arguments of the four RVs in Equation (36):
(44)
The four types of RVs differ markedly, however, in skewness and kurtosis, which characterize the shape of the PDF, as shown in Figure 1. Consider
to represent the numerical density of objects in a receptacle, and the variables
to characterize the 3-dimensional receptacle geometry. The physical quantity for which an estimate is sought is then represented by the variable
. If Z is satisfactorily described by a log-normal RV, then
should be well-approximated by a Gaussian RV.
Each of the four simulations of the composite variable Z reported in the subsections to follow comprises
independent samples from a random number generator (RNG) corresponding to one of the four basis RVs listed above. The simulated variates
are partitioned into uniform bins of width
; the resulting variates
,
are partitioned into uniform bins of width
,
(if
) or 15.0 (if
). To get a sense of scale, note that the product of the four means in Equation (44) is 240 and that $\ln 240 \approx 5.48$. It is to be expected, therefore, that, neglecting skewness, the histogram of Z should be centered at a point near 240, whereas the symmetric histogram of Y should be centered at close to 5.48, which lies between the centers of histograms
and
.
Superposed on each of the generated histograms in the figures to follow will be the relevant theoretical PDF (solid red): 1) PDF of the corresponding RNG for the basis variables
, 2) log-normal PDF (if
) or (41) (if
) for Z, and 3) normal PDF (35) (if
) or (if
) for Y. The analysis of Section 2.1 leads to an important prediction concerning the four Monte Carlo simulations:
· Each simulation, although generated with a different type of basis variable X, should lead within statistical uncertainties to identical histograms for Z and Y.
The preceding prediction follows from the fact that the means and variances of Z and Y depend only on the means and variances (44) of the basis variables
, and not on the type of RV symbolized by X.
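The prediction can be explored with a small simulation in which four basis types share the same means and standard deviations. The sketch below is a hedged illustration, not the author's simulation code: the means and standard deviations are hypothetical (chosen so that the product of the means is 240, and not taken from list (44)), the conversion of (mean, standard deviation) to each distribution's own parameters uses standard relations, and rare non-positive draws are discarded so that the logarithm exists.

```python
import math
import random

KINDS = ("normal", "uniform", "laplace", "lognormal")


def draw(kind, mu, sigma, rng):
    """One variate of the given type with mean mu and standard deviation sigma."""
    if kind == "normal":
        return rng.gauss(mu, sigma)
    if kind == "uniform":
        half_width = math.sqrt(3.0) * sigma
        return rng.uniform(mu - half_width, mu + half_width)
    if kind == "laplace":
        b = sigma / math.sqrt(2.0)
        # The difference of two unit exponentials is a standard Laplace variate.
        return mu + b * (rng.expovariate(1.0) - rng.expovariate(1.0))
    if kind == "lognormal":
        s2 = math.log(1.0 + (sigma / mu) ** 2)
        return rng.lognormvariate(math.log(mu) - 0.5 * s2, math.sqrt(s2))
    raise ValueError("unknown basis type: " + kind)


def simulate_log_product(kind, means, sigmas, n, seed=0):
    """Return n values of Y = ln Z, where Z is the product of the basis variates."""
    rng = random.Random(seed)
    ys = []
    while len(ys) < n:
        xs = [draw(kind, mu, sg, rng) for mu, sg in zip(means, sigmas)]
        if min(xs) <= 0.0:
            continue  # assumption: discard rare non-positive draws so ln Z exists
        ys.append(sum(math.log(x) for x in xs))
    return ys


def mean_and_variance(values):
    """Sample mean and divide-by-n sample variance."""
    mean = sum(values) / len(values)
    return mean, sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical means and standard deviations (NOT the values of list (44));
# the means are chosen so that their product is 240.
MEANS = (2.0, 4.0, 5.0, 6.0)
SIGMAS = tuple(0.1 * mu for mu in MEANS)

if __name__ == "__main__":
    print("leading-order mean:", math.log(240.0), " leading-order variance:", 0.04)
    for kind in KINDS:
        print(kind, mean_and_variance(simulate_log_product(kind, MEANS, SIGMAS, 20000)))
```

With these illustrative values the leading-order expectations for Y are a mean near ln 240 and a variance near the sum of the squared ratios of standard deviation to mean; small type-dependent differences at higher order are to be expected.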
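From the ungrouped variates of each MCS (with $n_s$ denoting the sample size)
$$\{ z_j \}, \qquad j = 1, \ldots, n_s \tag{45}$$
$$\{ y_j = \ln z_j \}, \qquad j = 1, \ldots, n_s , \tag{46}$$
one can calculate the sample mean and sample variance of Z by two different approaches, both employing relations deduced from the method of maximum likelihood (ML) [29]. The first approach is to calculate the sample mean $\bar{z}$ and sample variance $S_Z^2$ directly from the set of variates (45)
SAMPLE: Z
$$\bar{z} = \frac{1}{n_s} \sum_{j=1}^{n_s} z_j , \qquad S_Z^2 = \frac{1}{n_s} \sum_{j=1}^{n_s} (z_j - \bar{z})^2 . \tag{47}$$
The second approach is to calculate the sample mean $\bar{y}$ and sample variance $S_Y^2$ from the set of Gaussian variates (46)
SAMPLE: Y
$$\bar{y} = \frac{1}{n_s} \sum_{j=1}^{n_s} y_j , \qquad S_Y^2 = \frac{1}{n_s} \sum_{j=1}^{n_s} (y_j - \bar{y})^2 \tag{48}$$
and use relations (48) to deduce the sample mean $\bar{z}_{(Y)}$ and sample variance $S_{Z(Y)}^2$ as follows from Equation (23)
SAMPLE: Z(Y)
$$\bar{z}_{(Y)} = e^{\bar{y} + S_Y^2/2} , \qquad S_{Z(Y)}^2 = e^{2\bar{y} + S_Y^2} \left( e^{S_Y^2} - 1 \right) . \tag{49}$$
Agreement of statistics (47) and (49) would be indicative that the variates of Z were distributed log-normally.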
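The two approaches to the sample statistics of Z can be compared side by side in a few lines. The sketch below is an illustration under stated assumptions: the maximum-likelihood variance is taken to be the divide-by-n estimator, the mapping from the statistics of $Y = \ln Z$ back to Z uses the standard log-normal relations, and the test sample is a hypothetical log-normal sample rather than data from the paper.

```python
import math
import random


def ml_mean_var(values):
    """Sample mean and maximum-likelihood (divide-by-n) sample variance."""
    n = len(values)
    mean = sum(values) / n
    return mean, sum((v - mean) ** 2 for v in values) / n


def z_stats_direct(z):
    """First approach: statistics computed directly from the variates z_j."""
    return ml_mean_var(z)


def z_stats_via_log(z):
    """Second approach: statistics of y_j = ln z_j mapped back to Z,
    assuming Z is log-normal."""
    y_mean, y_var = ml_mean_var([math.log(v) for v in z])
    mean = math.exp(y_mean + 0.5 * y_var)
    variance = math.exp(2.0 * y_mean + y_var) * (math.exp(y_var) - 1.0)
    return mean, variance


def demo_sample(n, seed=0):
    """Hypothetical log-normal sample used only to exercise the functions."""
    rng = random.Random(seed)
    return [rng.lognormvariate(math.log(240.0), 0.2) for _ in range(n)]


if __name__ == "__main__":
    z = demo_sample(20000)
    print("direct :", z_stats_direct(z))
    print("via log:", z_stats_via_log(z))
```

For a sample that is not log-normal, the two sets of statistics would be expected to drift apart, which is what makes their agreement a useful diagnostic.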
Comparisons of sample statistics with theory for each of the simulations to follow are summarized in Table 3.
3.1. Normal Basis X = N
The normal distribution is defined by its mean and variance (see Table 1). The basis variables of the simulation are therefore $X_i = N(\mu_i, \sigma_i^2)$ for each of the four factors, as shown in Equation (36) with parameters as defined in list (44). For purposes of comparing histogram shapes, it is noted that the skewness and kurtosis of a normally distributed RV are respectively
$$Sk_N \equiv \frac{\langle (X - \mu)^3 \rangle}{\sigma^3} = 0 \tag{50}$$
$$K_N \equiv \frac{\langle (X - \mu)^4 \rangle}{\sigma^4} = 3 . \tag{51}$$
Skewness (50) is a measure of symmetry of the PDF with respect to the mean. Kurtosis (51) is a measure of the shape of the tails of the PDF. A distribution with “fat tails” (leptokurtic) has a higher probability than normal of extreme events, in contrast to a distribution with “thin tails” (platykurtic) for which the probability of extreme events is lower than normal.
Table 3. Statistics of Monte Carlo simulations of the composite variable Z.
Figure 5 shows a panoramic plot of the histograms of
(green),
,
,
(gray), and
(blue), where
. As expected, all the histograms in the figure appear to be Gaussian, and the histogram of Y lies between the histograms of
and
.
Panels A and B of Figure 6 respectively show in greater detail the histograms of Z and Y, bordered by the profiles of the corresponding log-normal and normal PDFs. In panel A, the right tail of the histogram is marginally less skewed than predicted by the log-normal model. In panel B, the left tail of the histogram is marginally more skewed than the symmetric profile of the Gaussian PDF. Nevertheless, in both panels, the theoretical profiles satisfactorily match the peak and overall shape of the histograms.
3.2. Uniform Basis X = U
A uniform RV $U(a, b)$ is symbolized by its lower and upper boundaries $a < b$. From Table 2 it follows that the mean and standard deviation of X are related to the boundary parameters by
$$\mu = \tfrac{1}{2}(a + b) , \qquad \sigma = \frac{b - a}{2\sqrt{3}} . \tag{52}$$
The basis RVs
of the simulation, which have the same means and variances as the basis RVs of Section 3.1, are then respectively
Figure 5. Monte-Carlo simulated histograms of normal variables
with means
and standard deviations
listed in (44), and
(blue).
(green) represents number density;
,
,
(gray) represent geometric dimensions. The sample size is
. Each histogram is enveloped by its associated Gaussian PDF (red).
Figure 6. Panel A: Histogram of Gaussian product Z of Figure 5 enveloped by PDF of log-normal variable (26) with values (44). Panel B: Histogram of
of Figure 5 enveloped by PDF of Gaussian variable (35).
(53)
The skewness and kurtosis of a uniformly distributed RV are
$$Sk_U = 0 \tag{54}$$
$$K_U = \tfrac{9}{5} . \tag{55}$$
Figure 7 shows a panoramic plot of the histograms $X_i$, which have tails that drop vertically in comparison to the Gaussian histograms of Figure 5. Equation (55) establishes that a uniform RV is platykurtic, as is apparent from Figure 1. Nevertheless, the histogram of $Y = \ln Z$ is again well represented by a Gaussian PDF, which indicates that Z should be reasonably well described by a log-normal RV, as shown in greater detail in Figure 8.
3.3. Laplace Basis X = La
A Laplace RV $La(\mu, b)$ is symbolized by a location parameter $\mu$ corresponding to the mean of X and a scale parameter $b$ related to the standard deviation of X by
$$\sigma = \sqrt{2}\, b \tag{56}$$
(see Table 2). The four basis variables of the simulation, which have the same means and variances as the basis RVs of Section 3.1, are then respectively
(57)
The skewness and kurtosis of a Laplace distributed RV are
$$Sk_{La} = 0 \tag{58}$$
$$K_{La} = 6 . \tag{59}$$
Figure 9 shows a panoramic plot of the histograms $X_i$, which have sharp cusps and fat tails in comparison to the Gaussian histograms of Figure 5. Equation (59) establishes quantitatively that a Laplace RV is leptokurtic, as is apparent from Figure 1. Nevertheless, the histogram of $Y = \ln Z$ is again well represented by a Gaussian PDF, which indicates that Z should again be a log-normal variable to good approximation, as shown in greater detail in Figure 10.
Figure 7. Monte-Carlo simulated histograms of uniform variables $X_i$ with means $\mu_i$ and standard deviations $\sigma_i$ listed in (44), and $Y = \ln Z$. Histograms $X_i$ are enveloped by their associated uniform PDFs (red). Histogram Y is enveloped by the Gaussian PDF of Figure 5. Sample size, symbolic notation, and color coding are the same as in Figure 5.
Figure 8. Panel A: Histogram of uniform product Z of Figure 7 enveloped by PDF of log-normal variable (26) with values (44). Panel B: Histogram of $Y = \ln Z$ of Figure 7 enveloped by PDF of Gaussian variable (35).
Figure 9. Monte-Carlo simulated histograms of Laplace variables $X_i$ with means $\mu_i$ and standard deviations $\sigma_i$ listed in (44), and $Y = \ln Z$. Histograms $X_i$ are enveloped by their associated Laplace PDFs (red). Histogram Y is enveloped by the Gaussian PDF of Figure 5. Sample size, symbolic notation, and color coding are the same as in Figure 5.
Figure 10. Panel A: Histogram of Laplace product Z of Figure 9 enveloped by PDF of log-normal variable (26) with values (44). Panel B: Histogram of $Y = \ln Z$ of Figure 9 enveloped by PDF of Gaussian variable (35).
3.4. Log-Normal Basis X = Λ
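A log-normal RV $\Lambda(m, s^2)$ is symbolized by the mean and variance of the normal variable $N(m, s^2) = \ln \Lambda(m, s^2)$. From Equation (24), re-expressed below for convenience,
$$m = \ln \mu - \tfrac{1}{2} \ln\!\left( 1 + \frac{\sigma^2}{\mu^2} \right), \qquad s^2 = \ln\!\left( 1 + \frac{\sigma^2}{\mu^2} \right) \tag{60}$$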
it follows that the four log-normal basis variables with properties (44) are respectively
(61)
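The skewness and kurtosis of a log-normal RV
$$Sk_\Lambda = \left( e^{s^2} + 2 \right) \sqrt{e^{s^2} - 1} \tag{62}$$
$$K_\Lambda = e^{4 s^2} + 2 e^{3 s^2} + 3 e^{2 s^2} - 3 \tag{63}$$
are not constants, but depend on the scale parameter s. Skewness (62) is greater than 0 for all values of $s > 0$; kurtosis (63) is greater than 3 for all values of $s > 0$.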
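The dependence of the log-normal shape parameters on s can be tabulated directly. The sketch below assumes the standard closed forms for the skewness, $(e^{s^2} + 2)\sqrt{e^{s^2} - 1}$, and the (non-excess) kurtosis, $e^{4s^2} + 2e^{3s^2} + 3e^{2s^2} - 3$, of a log-normal RV; the function names and the sampled values of s are illustrative only.

```python
import math


def lognormal_skewness(s):
    """Skewness of a log-normal RV whose logarithm has standard deviation s."""
    w = math.exp(s * s)
    return (w + 2.0) * math.sqrt(w - 1.0)


def lognormal_kurtosis(s):
    """Kurtosis (normal value = 3) of a log-normal RV with log-scale s."""
    w = math.exp(s * s)
    return w ** 4 + 2.0 * w ** 3 + 3.0 * w ** 2 - 3.0


if __name__ == "__main__":
    for s in (0.05, 0.1, 0.2, 0.5, 1.0):  # illustrative values of s
        print(s, lognormal_skewness(s), lognormal_kurtosis(s))
```

Both quantities tend to their normal values (0 and 3) as s approaches zero, which is consistent with a narrow log-normal distribution being nearly Gaussian.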
Figure 11 shows a panoramic plot of the log-normal histograms $X_i$, which skew to the right in comparison to the symmetric shapes of the Gaussian basis histograms of Figure 5. The histograms of Y and Z are seen to be precisely normal and log-normal, respectively, as predicted in Section 2.3 and shown in detail in Figure 12.
3.5. Commentary
The set of variates (45) comprises the response of a crowd to a problem for which the sought-for solution is a composite random variable Z. The information, or so-called “wisdom of the crowd” [1], lies in the distribution of Z from which the full population statistics can be determined. In comparing the MCS histograms of Y and Z to the profiles of their respective PDFs, one should bear in mind that in general there is no underlying fundamental theory of crowd response. The log-normal model is not a fundamental theory such as one encounters in physics, and therefore the MCS histograms in Section 3 were not subjected to a chi-square goodness-of-fit test, as is often done in physics to compare experiment and theory.
Figure 11. Monte-Carlo simulated histograms of log-normal variables $X_i$ with means $\mu_i$ and standard deviations $\sigma_i$ listed in (44), and $Y = \ln Z$. Histograms $X_i$ are enveloped by their associated log-normal PDFs (red) (41). Histogram Y is enveloped by the Gaussian PDF (40). Sample size, symbolic notation, and color coding are the same as in Figure 5.
Figure 12. Panel A: Histogram of log-normal product Z of Figure 11 enveloped by PDF of log-normal variable (41). Panel B: Histogram of $Y = \ln Z$ of Figure 11 enveloped by PDF of Gaussian variable (40).
The validity of the analytical model developed in this paper lies in how well it enables the analyst to predict an unknown quantity represented by the sampled variable Z, and not necessarily in how closely the complete distribution of the sample (i.e. histogram of Z) is matched by a log-normal distribution. However, if there is reason to believe that the basis variables
comprising the composite variable Z are distributed log-normally, then Z itself should be rigorously log-normal, and a goodness-of-fit test may then be appropriate. This point will be illuminated further in Section 4, which reports a crowdsourcing experiment and MCS to estimate the number of identical objects in a receptacle.
The preceding comments notwithstanding, Figures 5-12 illustrate how well the predicted log-normal distribution fits the histograms of Z generated by basis variables of widely differing distribution shapes, as distinguished by their skewness and kurtosis. Simulations using normal or log-normal basis variables yielded the visually closest matches to the log-normal model. In the case of a log-normal basis, theory predicted, and MCS sustained, an exact log-normal distribution of Z.
4. Test of Crowdsourced Estimation
In a collaborative effort with the BBC's The One Show (almost exactly 100 years after Galton’s pioneering statistical experiment), the author was able to obtain, using the wide reach of national television, a crowdsourced sample sufficiently large to test the log-normal hypothesis, namely, that under appropriate conditions composite random variables are distributed log-normally. Two kinds of experiments were performed entailing crowdsourced estimates of 1) the weight of a tangible local object, and 2) the quantity of a remotely viewed object. (See Ref. [10] for a popular account.) Experiments of these kinds were conducted by the author in various physics classes during the past two decades, but no single sample was large enough to permit reliable inference of the statistical distribution. Pooling of results from different sample populations was not feasible since the conditions of the experiments were not all identical.
4.1. The Coin-Estimation Experiment
The experiment analyzed in detail here is of the second kind. Viewers of The One Show were shown on their televisions a transparent tumbler filled with opaque £1 coins. The tumbler rested on a table adjacent to two ordinary cylindrical glasses of water to provide clues to scale. No explicit dimensions of any objects were given. The challenge posed to viewers (i.e. the crowd) was to estimate the number of coins in the vessel.
The experimental estimates
,
, were transmitted to the show by email, and the author subsequently received the full set of
anonymous responses, which ranged from a low of 42 to a high of 43,200.¹ The mean and median of the estimates were respectively
,
. The true count was 1111. If the mean of 982 is taken as the measure of crowd response (a standard statistical practice), the fractional error of the crowd was
$$\frac{982 - 1111}{1111} \approx -11.61\%. \qquad (64)$$
Although this result is not bad, it calls into question, at least to the author, how Galton’s crowd of just 800 members (less than half the BBC sample size) could guess the weight of an ox to within a fractional error of less than 0.1%. One explanation might be that the participants at the fair comprised a crowd of experts familiar with livestock. The respondents to The One Show apparently had no special expertise in the estimation of quantity.
¹ Actually, the maximum value submitted was 25 million, which was about 15% of the entire BBC One network annual budget in the form of £1 coins in a small glass tumbler. The submission was rejected on the grounds that it was so preposterous as to be intended to undermine the experiment.
Figure 13 shows a scatter diagram of the estimates as a function of sample number, i.e. the order in which the estimates were received. Estimates in the approximate range between 0 and 1000 form a dense band; estimates from about 2000 to 10,000 resemble a foam of points the density of which falls off with increasing ordinate. The blue histogram labeled “Experiment” in Figure 14 shows the distribution of estimates partitioned over K = 24 bins of equal width ranging from 0 to 4000. Points that extended beyond 4000 are not shown, since the main body of the histogram would then be severely compressed. Superposed on the histogram of experimental results is the profile (dashed blue) of the corresponding log-normal PDF with sample parameters obtained by application of the method of maximum likelihood (ML) to a Gaussian
[30] ,
Figure 13. Estimates, in order of receipt, of the number of £1 coins in a tumbler displayed on the BBC One Show in 2007. The true count was 1111 coins; the sample size was 1706. Statistics of the experiment are given in Table 4.
Figure 14. Comparison of the histogram (blue) of 1706 crowdsourced estimates with the histogram (gray) of $10^6$ Monte Carlo simulated responses employing log-normal basis variables for coin density and tumbler geometry. The crowdsourced mean estimate was 982; the MCS mean was 1057; the true count was 1111. Relevant statistics are given in Table 4. Enveloping the histograms are the profiles of the log-normal PDFs for the sample (dashed blue) and simulation (solid red).
(65)
(66)
where the variates
are defined by
. (67)
Parameters
and
in Equation (65) are respectively the mean and standard deviation of
. The gray histogram with red border in Figure 14 will be discussed in Section 4.2.
Despite the caution about goodness-of-fit tests in Section 3.5, it is noteworthy that the fit of the log-normal PDF with parameters (65) to the histogram of experimental estimates actually does exceed the 5% acceptance threshold of a chi-square test for
degrees of freedom:
. The number $d$ of degrees of freedom is given by
$$d = K - p - 1 = 24 - 2 - 1 = 21 \qquad (68)$$
where K = 24 is the number of distribution categories (bins), p = 2 is the number of parameters (the mean and standard deviation of the fitted Gaussian variable) determined from the data, and the numeral 1 refers to the fact that the histogram is normalized to unit area, in which case knowledge of the values of $K - 1$ bins determines the value of the remaining bin.
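The fitting and testing procedure described in this section can be sketched in a few lines. The example below is an illustration on synthetic data, not a reanalysis of the BBC sample: the Gaussian parameters `M_TRUE` and `S_TRUE` are hypothetical, the expected bin probabilities are renormalized to the plotted range 0 to 4000 (an assumption about how out-of-range points are handled), and 32.67 is the tabulated 5% critical value of chi-square for 21 degrees of freedom.

```python
import math
import random


def lognormal_cdf(z, m, s):
    """CDF of a log-normal RV whose logarithm is Gaussian with mean m and SD s."""
    if z <= 0.0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(z) - m) / (s * math.sqrt(2.0))))


random.seed(7)
M_TRUE, S_TRUE = 6.6, 0.7  # hypothetical parameters of ln Z, not the fitted values (65)
data = [random.lognormvariate(M_TRUE, S_TRUE) for _ in range(1706)]

# Maximum-likelihood estimates: mean and SD (divisor n) of the variates y = ln z
logs = [math.log(z) for z in data]
n = len(logs)
m_hat = sum(logs) / n
s_hat = math.sqrt(sum((y - m_hat) ** 2 for y in logs) / n)

# Histogram over K equal-width bins on [0, Z_MAX); larger values are left out
K, Z_MAX, P = 24, 4000.0, 2
width = Z_MAX / K
observed = [0] * K
for z in data:
    if z < Z_MAX:
        observed[int(z // width)] += 1
n_in = sum(observed)

# Chi-square statistic against the fitted log-normal, renormalized to [0, Z_MAX)
p_range = lognormal_cdf(Z_MAX, m_hat, s_hat)
chi2 = 0.0
for k in range(K):
    p_k = lognormal_cdf((k + 1) * width, m_hat, s_hat) - lognormal_cdf(k * width, m_hat, s_hat)
    expected = n_in * p_k / p_range
    chi2 += (observed[k] - expected) ** 2 / expected

dof = K - P - 1
CHI2_CRIT_5PCT = 32.67  # tabulated 5% critical value for 21 degrees of freedom
print(f"m_hat = {m_hat:.3f}, s_hat = {s_hat:.3f}, chi2 = {chi2:.2f}, dof = {dof}")
print("not rejected at 5%" if chi2 < CHI2_CRIT_5PCT else "rejected at 5%")
```

In practice, bins with very small expected counts in the tail are often merged before the statistic is computed; that refinement is omitted here for brevity.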
4.2. Monte Carlo Simulation of the Coin Estimation Experiment
Passing a goodness-of-fit test does not necessarily prove that a hypothesized theory is correct. Rather, it signifies that the theory should not be rejected on the basis of the tested data. The statistical significance of the experiment described in Section 4.1 is that the distribution of estimates of the number of coins (a composite RV) is consistent with a log-normal distribution for the given sample. Nevertheless, the implication of this result is of far-reaching practical importance:
If it is indeed the case that the estimates from a crowd of given size are distributed log-normally, then one should be able to simulate the estimates of a much larger crowd by constructing the appropriate basis variables that form the factors of the sought-for composite variable.
In other words, the analyst may be able to avoid sampling an impractically large crowd, yet still obtain reliable statistical information by a Monte Carlo simulation (MCS). In this section the responses from a hypothetical crowd of 1 million were simulated by applying the underlying reasoning and mathematical procedure described in Section 2.
Responses from a large crowd to a question that calls for a quantitative answer will presumably include some random guesses as well as reasoned estimates. As the author has emphasized elsewhere [10] , a key principle for increasing the proportion of reliable estimates in crowdsourcing is to provide participants with a personal incentive to respond thoughtfully. Broadly speaking, there are two types of incentives. The first is to reward all respondents in some way for participating. For example, the author has used this method to provide extra credit toward the final course grade of all students in the class who executed certain tasks designed to measure the randomization of shuffled playing cards [31] . Another example of this reward structure is the internet-based Amazon Mechanical Turk which, according to Amazon, leverages “the skills of distributed Workers on a pay-per-task model” [32] . The second kind of incentive, which has also been applied by the author in his physics classes as well as by The One Show in the experiment to estimate weight, is to reward only the respondent(s) whose estimate(s) comes closest to the true (or best) answer to the problem, once the answer becomes known. In this second approach, the members of the crowd are effectively in a competition where skill matters, unlike the case of a lottery where success depends primarily on probability and luck.
Let us assume, then, that members of the hypothetical crowd represented by the MCS are incentivized to deduce the number of coins as described in Section 2. A likely approach entails multiplying the numerical density of coins by the geometrical dimensions of the volume of the receptacle. The televised image of the tumbler showed it to have the shape of an inverted truncated right circular cone, or frustum, such as illustrated in Figure 15. The number Z of coins in the tumbler could then be calculated from the expression [33]
$$Z = \frac{\pi}{3}\, C H \left(R_1^2 + R_1 R_2 + R_2^2\right) \qquad (69)$$
in which $R_1$ is the lower radius, $R_2$ is the upper radius, H is the height, and C is the numerical density of the coins. Because the upper and lower radii, height, and numerical density of coins are quantities unknown to the crowd, they must be treated as random variables. The author himself did not know the true numerical values, but, judging from the same image presented to the viewers, assigned random variables with the following estimated means and standard deviations (in units of cm)
(70)
Monte Carlo simulations were then implemented for both normal variables
and log-normal variables
.
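A minimal sketch of such a simulation with log-normal basis variables is given below. The means and standard deviations in `BASIS` are hypothetical placeholders chosen only to give a count of plausible magnitude; they are not the values of relations (70). The symbols R1 and R2 for the lower and upper radii, and the helper names, are likewise assumptions of this sketch. The coin count is the frustum volume multiplied by the numerical density.

```python
import math
import random
import statistics


def lognormal_params(mean, sd):
    """Return (m, s), the mean and SD of ln X, for a log-normal X with given mean and SD."""
    s2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - 0.5 * s2, math.sqrt(s2)


# Hypothetical (mean, SD) pairs: lengths in cm, density in coins per cubic cm.
# These are NOT the values of relations (70).
BASIS = {"R1": (5.0, 0.5), "R2": (7.0, 0.7), "H": (20.0, 2.0), "C": (0.48, 0.05)}


def simulate_counts(n, basis=BASIS, seed=0):
    """Draw n simulated coin counts Z = (pi/3) C H (R1^2 + R1 R2 + R2^2)."""
    rng = random.Random(seed)
    params = {name: lognormal_params(*stats) for name, stats in basis.items()}
    counts = []
    for _ in range(n):
        r1 = rng.lognormvariate(*params["R1"])
        r2 = rng.lognormvariate(*params["R2"])
        h = rng.lognormvariate(*params["H"])
        c = rng.lognormvariate(*params["C"])
        counts.append(math.pi / 3.0 * c * h * (r1 * r1 + r1 * r2 + r2 * r2))
    return counts


def expected_count(basis=BASIS):
    """Exact mean of Z for independent factors, using <R^2> = mean^2 + SD^2."""
    (m1, s1), (m2, s2) = basis["R1"], basis["R2"]
    mh, mc = basis["H"][0], basis["C"][0]
    return math.pi / 3.0 * mc * mh * (m1**2 + s1**2 + m1 * m2 + m2**2 + s2**2)


counts = simulate_counts(100_000)
sample_mean = statistics.fmean(counts)
print(f"sample mean of Z: {sample_mean:.1f}")
print(f"theoretical mean: {expected_count():.1f}")
print(f"sample median:    {statistics.median(counts):.1f}")
```

Replacing `rng.lognormvariate` with `rng.gauss` and passing the means and standard deviations directly would give the corresponding normal-basis simulation.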
Figure 16 shows a panoramic plot of the distributions of variables
, for both normal (dashed) and log-normal (solid) bases. Although the former (normal) are symmetric about the mean and the latter (log-normal) exhibit skewness, the difference in visual appearance of the two PDF profiles for each variable is relatively insignificant for the parameters shown in relations (70).
The gray histogram marked “Simulation” in Figure 14 shows the outcome of an MCS comprising $n = 10^6$ samples from log-normal random number generators with parameters given by relations (70). The profile (solid red) of the histogram is the PDF of the log-normal variable
with Gaussian parameters
(71)
(72)
Figure 15. Geometry of the tumbler is a truncated right circular cone or frustum with dimensions given by independent random variables for height H, lower radius
and upper radius
.
Figure 16. Distributions of the numerical density (C) and geometrical attributes (
,
, H) represented by normal (dashed) or log-normal (solid) random variables, used in the Monte Carlo simulations of Figure 14. The sample size was $10^6$.
where variates
,
, are the simulated values of Z in Equation (69) and
. (73)
For the log-normal basis and sample size of 1 million, the match of theory and simulation in Figure 14 is visually perfect at the scale shown. The predicted number of coins, given by both the theoretical expectation
and sample mean
, Equation (47), is 1057, which represents a fractional error
$$\frac{1057 - 1111}{1111} \approx -4.86\% \qquad (74)$$
as summarized in Table 4. Thus, the MCS estimate was considerably closer to the true value of 1111 than the mean estimate of 982 by the crowd.
Table 4. Crowdsourced estimate of number of £1 coins in a tumbler.
Fractional Error: Experiment (n = 1706) −11.61%; Simulation (n = 1,000,000) −4.86%; Theory (n = 1,000,000) −4.50%.
The histogram obtained from the MCS with normal basis variables is nearly identical to that in Figure 14, and therefore not shown. The match with the corresponding theoretical log-normal PDF is marginally less close, but the higher mean of 1061 is marginally closer to the true count of 1111, yielding a fractional error of −4.50%. Since the standard error of the mean (i.e. the standard deviation divided by the square root of sample size) is
, the difference of means (1061 − 1057 = 4) is statistically significant in principle. In practical terms, however, the Monte Carlo simulation with either the log-normal or normal basis variables yielded effectively equivalent predictions. Since the individual estimates received from the respondents consisted solely of a single number of coins, it was not possible to conclude which of the two sets of basis variables more accurately described the crowd.
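The fractional errors quoted in this section follow directly from the reported means and the true count of 1111. The short check below reproduces the arithmetic; the function name `fractional_error` and the dictionary labels are illustrative.

```python
TRUE_COUNT = 1111

# Mean estimates reported in the text
MEANS = {
    "crowd (n = 1706)": 982,
    "MCS, log-normal basis": 1057,
    "MCS, normal basis": 1061,
}


def fractional_error(estimate, truth):
    """Signed fractional error of an estimate relative to the true value."""
    return (estimate - truth) / truth


for label, mean in MEANS.items():
    print(f"{label:24s} {100.0 * fractional_error(mean, TRUE_COUNT):+.2f}%")
```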
The most significant statistical outcome, however, is that the MCS predicted the number of coins in the tumbler much more closely than did the actual crowd. Results of the experiment and simulations are summarized in detail in Table 4. Theoretical means and sample means are distinguished respectively by expectation brackets like $\langle Z \rangle$ and overbars like $\bar{Z}$. Theoretical standard deviations (SD) and standard errors (SE) are symbolized by Greek letters (lower case and upper case sigma, respectively); sample SD and SE are symbolized by Roman letters (lower case and upper case s, respectively).
4.3. Commentary on the Experiment and Simulations
The coin estimation study raises several issues worth clarifying if the investigation is to provide a useful general methodology for seeking solutions to other quantitative problems by crowdsourcing.
1) Although sample size matters, the reason that the MCS did much better than the BBC crowd in estimating the number of coins in the tumbler was not primarily due to sample size. The populations sampled by crowdsourcing and by MCS were different not only in size but principally in their effective information content. This was seen by running the MCS with the same parameters (44) as before, but for a sample size comparable to that of the coin experiment, i.e. ~2000. The result was a 24-bin histogram that produced a sample mean of ~1048 and a shape that effectively overlapped the MCS histogram of Figure 14. The distinction between the two populations is that the BBC crowd contained a subpopulation of uninformed individuals who guessed randomly, whereas the random choices of the MCS were more tightly constrained by the variances assigned to the basis variables. In effect, the MCS population comprised a more rational crowd who used the visual cues better and made better use of a rudimentary knowledge of geometry.
2) Although the MCS of Section 4.2 estimated the number of coins by calculating the volume of a conical frustum, it is unlikely that respondents to The One Show arrived at their estimates in precisely the same way. Quite possibly, very few of the members of the crowd would have known what a frustum is or how to calculate its volume. It is not this geometrical detail that is important in determining the distribution of estimates, but only the act of estimating a volume and multiplying it by a numerical density. The crowd could have treated the glass tumbler simply as a rectangular solid. The independent variations of height, length, and width assumed by different respondents would have again generated estimates distributed log-normally to an excellent approximation, as demonstrated in Section 3. The fact that the sample mean of the crowd was reasonably accurate indicates that most respondents probably applied some kind of valid reasoning to obtain their answers. How closely the MCS estimate matches the true value of a composite variable depends on how well the analyst can model the statistical uncertainties in the factors upon which the sought-for variable depends.
3) It is especially noteworthy that the MCS estimates Z, defined in Equation (69), resulted in a virtually perfect log-normal distribution, as shown by Figure 14. This outcome suggests that the validity of the log-normal hypothesis of composite variables applies beyond what was explicitly demonstrated in the analysis of Section 2. In contrast to a composite RV like (26) which is formed by products of independent basis RVs, the products forming the variable Z in Equation (69) are not all independent. In particular, the product $R_1 R_2$ of the lower and upper radii is correlated with both $R_1^2$ and $R_2^2$. In the case of two correlated variables (call them U and V) one cannot assume, as was done in the last step of Equation (8), that the expectation operation factors; in other words, $\langle UV \rangle \neq \langle U \rangle \langle V \rangle$.
One widely used measure of the degree of correlation between two random variables U, V is provided by the Pearson correlation coefficient $\rho_{UV}$ defined by [34]
$$\rho_{UV} = \frac{\langle UV \rangle - \langle U \rangle \langle V \rangle}{\sigma_U \sigma_V}. \qquad (75)$$
$\rho_{UV}$ can range between −1 and +1. At the upper limit +1, V varies in the same direction and in perfect linearity with U; at the lower limit −1, V varies in the opposite direction in perfect linearity with U. If two random variables are independent, then $\rho_{UV} = 0$, but the converse is not true; $\rho_{UV} = 0$ does not prove that U and V are independent. Various interpretations have been given to $\rho_{UV}$ [35] [36] . Perhaps the most useful quantitative interpretation is this [37] : The square of the correlation coefficient is equal to the fraction of the variance of variable V that is accounted for by a linear relationship with variable U. Other, more general, methods of testing for nonlinear dependence of two random variables are also known [38] [39] .
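The statement that a vanishing correlation coefficient does not imply independence can be illustrated with a simple example. In the sketch below (an illustration with assumed names and sample size, not part of the original analysis), V is the square of a zero-mean Gaussian variable U: V is completely determined by U, yet the sample Pearson coefficient is close to zero. A linear function of U gives a coefficient of +1.

```python
import math
import random


def pearson(u, v):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / n
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u) / n)
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v) / n)
    return cov / (sd_u * sd_v)


random.seed(3)
u = [random.gauss(0.0, 1.0) for _ in range(100_000)]
v_dependent = [a * a for a in u]       # fully determined by U, but not linearly
v_linear = [2.0 * a + 1.0 for a in u]  # perfectly linear in U

rho_dependent = pearson(u, v_dependent)
rho_linear = pearson(u, v_linear)
print(f"rho(U, U^2)    = {rho_dependent:+.4f}")
print(f"rho(U, 2U + 1) = {rho_linear:+.4f}")
```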
To estimate the degree of correlation of terms in Equation (69) for the volume of the tumbler, the Pearson correlation coefficient was used. Substitution of
(76)
into Equation (75), where the radii
and
are given in Equation (70), resulted in correlation coefficients
(77)
for normal (N) and log-normal ($\Lambda$) radius variables, respectively. The analysis is given in Appendix 2.
The author is unaware of any closed-form expression for the PDF or CDF of a sum of correlated or uncorrelated log-normal RVs, although it is known that the resulting RV is not rigorously log-normal [40] . Various approaches exist to approximating the sum of log-normal RVs under special circumstances (such as independent identically distributed terms), or to achieve accuracy in selected parts of the distribution profile (e.g. the tails), or to match the lowest moments (e.g. mean and variance) of an empirical distribution [40] [41] [42] [43] . No single analytical method appears to provide a satisfactory approximation for all conditions.
Nevertheless, the Monte Carlo simulations executed in the present study of crowdsourcing have shown by computational and graphical means that composite random variables are distributed log-normally to an excellent approximation for large sample size and log-normal basis RVs of low variance
, even if the composite variable comprises correlated terms.
5. Conclusions
This paper examined analytically, numerically, and experimentally the distribution of crowdsourced estimates of the solution to a problem seeking the number of objects in a partially revealed three-dimensional volume. Experimentally, the mean response of the crowd, which comprised 1706 viewers of a BBC television show, was within ~12% of the true count. More significantly, the distribution of viewer responses was satisfactorily accounted for by a log-normal distribution.
Theoretical analyses of the product of independent random variables of low standard deviation-to-mean ratios showed that the product was distributed log-normally to an excellent approximation irrespective of the number of factors and their individual distributions. Monte Carlo tests of the theory were made with normal, uniform, Laplace, and log-normal factor variables, all of the same mean and variance, but differing widely in the shape statistics skewness and kurtosis. For independent factors of the log-normal type, the product was rigorously (not approximately) log-normal.
Monte Carlo simulations of the coin estimation experiment, employing basis variables of either the normal or log-normal type and a sample size of 1 million, resulted in mean estimates that were within ~5% of the true count. Particularly noteworthy is the fact that the sought-for composite variable comprised terms that were not independent, but linearly correlated. Nevertheless, the histogram of the product variable was, to all visual appearances, rigorously log-normal.
Telecommunications media and the internet have the potential to make possible large-scale crowdsourcing of problems like the archetype investigated here, which involved image analysis and object counting. However, the robustness of the log-normal distribution as a kind of universal distribution of composite random variables suggests that crowdsourcing can likewise be accomplished accurately by computer simulations of sufficiently large sample size, provided the underlying statistical model accurately accounts for the uncertainties of the factor variables.
Acknowledgements
The author thanks reporter Alexandra Freeman of the BBC The One Show for initiating contact regarding the planning of crowdsourcing experiments and providing the author with the resulting data files. The author also thanks Trinity College for partial support through the research fund associated with the George A. Jarvis Chair of Physics.
Appendix 1
Probability Density Function of
Consider random variables Y and Z related by
. (78)
The cumulative probability function (CPF) of Z is defined by the relation
(79)
where
is some constant reference point. The probability density function (PDF) of Z can be calculated from the CPF by differentiation (see Ref. [20], pp. 60-62)
. (80)
Substitution of Equation (78) into (79) leads to the chain of deductions
. (81)
Substitution of Equation (81) into (80) leads by the Leibniz integral formula (see Ref. [17], p. 590) to
. (82)
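The chain of deductions can be sketched explicitly under the assumption that Equation (78) is the exponential map $Z = e^{Y}$, which is consistent with the main text, where Y is Gaussian and Z is log-normal. The symbols $F_Z$, $p_Z$, and $p_Y$ for the CPF and PDFs are notation assumed here for illustration.

```latex
\begin{align}
F_Z(z) &= \Pr(Z \le z) = \Pr\!\left(e^{Y} \le z\right) = \Pr(Y \le \ln z)
        = \int_{-\infty}^{\ln z} p_Y(y)\, dy , \\
p_Z(z) &= \frac{dF_Z(z)}{dz} = \frac{1}{z}\, p_Y(\ln z), \qquad z > 0 .
\end{align}
```

The factor $1/z$ arises from differentiating the upper limit $\ln z$ of the integral. If $p_Y$ is a Gaussian PDF, the resulting $p_Z$ is the familiar log-normal PDF.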
Appendix 2
Calculation of the Correlation Coefficient of Variables X2 and XY
Consider the two composite variables
(83)
where
and
are independent RVs with respective means
and standard deviations
,
. The correlation coefficient defined by Equation (75) then takes the form
(84)
Equation (84) will be evaluated for the two basis distributions of Section 4.
Case 1: X and Y are normal RVs
Substitution of the variables
(85)
into Equations (83) and (84) leads to expectation values
(86)
and the correlation coefficient
(87)
Case 2: X and Y are log-normal RVs
Substitution of the variables (with parameters related by Equation (60))
(88)
into Equations (83) and (84) leads to expectation values
(89)
(90)
and the correlation coefficient
. (91)
With regard to the random variables representing the geometry of the tumbler in Section 4, application of the foregoing relations leads to
(92)
and
(93)
for Case 1, and to
(94)
and
(95)
for Case 2.
The correlation coefficients are virtually the same for the normal and log-normal bases, as one might have anticipated from the close match of the individual distribution functions displayed in Figure 16.
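The calculation in this appendix can be cross-checked numerically. The sketch below evaluates the correlation coefficient of $X^2$ and $XY$ from the moments of independent X and Y, for both normal and log-normal bases, and compares each result with a Monte Carlo estimate. The means and standard deviations (5 ± 0.5 and 7 ± 0.7) are hypothetical and are not the values of Equation (70); the function names are assumptions of this sketch.

```python
import math
import random


def rho_from_moments(mx, my):
    """Correlation of U = X^2 and V = XY for independent X, Y.

    mx[k] = <X^k> for k = 1..4 and my[k] = <Y^k> for k = 1, 2.
    """
    cov = my[1] * (mx[3] - mx[2] * mx[1])
    var_u = mx[4] - mx[2] ** 2
    var_v = mx[2] * my[2] - (mx[1] * my[1]) ** 2
    return cov / math.sqrt(var_u * var_v)


def normal_moments(mu, sd):
    """Raw moments <X^k>, k = 1..4, of a normal RV."""
    return {1: mu, 2: mu**2 + sd**2, 3: mu**3 + 3 * mu * sd**2,
            4: mu**4 + 6 * mu**2 * sd**2 + 3 * sd**4}


def lognormal_params(mu, sd):
    """Mean m and SD s of ln X for a log-normal X with mean mu and SD sd."""
    s2 = math.log(1.0 + (sd / mu) ** 2)
    return math.log(mu) - 0.5 * s2, math.sqrt(s2)


def lognormal_moments(mu, sd):
    """Raw moments <X^k> = exp(k m + k^2 s^2 / 2), k = 1..4, of a log-normal RV."""
    m, s = lognormal_params(mu, sd)
    return {k: math.exp(k * m + 0.5 * k * k * s * s) for k in range(1, 5)}


def rho_monte_carlo(draw_x, draw_y, n=200_000):
    """Sample correlation of X^2 and XY from n independent draws of X and Y."""
    u, v = [], []
    for _ in range(n):
        x, y = draw_x(), draw_y()
        u.append(x * x)
        v.append(x * y)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / n
    var_u = sum((a - mean_u) ** 2 for a in u) / n
    var_v = sum((b - mean_v) ** 2 for b in v) / n
    return cov / math.sqrt(var_u * var_v)


# Hypothetical means and SDs, NOT the values of Equation (70)
MX, SX, MY, SY = 5.0, 0.5, 7.0, 0.7
rng = random.Random(11)

rho_n = rho_from_moments(normal_moments(MX, SX), normal_moments(MY, SY))
rho_n_mc = rho_monte_carlo(lambda: rng.gauss(MX, SX), lambda: rng.gauss(MY, SY))

px, py = lognormal_params(MX, SX), lognormal_params(MY, SY)
rho_l = rho_from_moments(lognormal_moments(MX, SX), lognormal_moments(MY, SY))
rho_l_mc = rho_monte_carlo(lambda: rng.lognormvariate(*px), lambda: rng.lognormvariate(*py))

print(f"normal basis:     analytic {rho_n:.4f}, Monte Carlo {rho_n_mc:.4f}")
print(f"log-normal basis: analytic {rho_l:.4f}, Monte Carlo {rho_l_mc:.4f}")
```

For these low ratios of standard deviation to mean, the two bases should give nearly the same coefficient, in line with the observation made for the tumbler variables.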