
In an antecedent paper (Part 1), estimates of the number of coins in a tumbler, submitted by viewers (the “crowd”) of a British Broadcasting Corporation (BBC) television show as part of a crowdsourcing experiment, were shown to follow a log-normal distribution $\Lambda(m, s^2)$. The coin-estimation experiment is an archetype of a broad class of image analysis and object counting problems suitable for solution by crowdsourcing.
The objective of the current paper (Part 2) is to determine the location and scale parameters $(m, s)$ of $\Lambda(m, s^2)$ by both Bayesian and maximum likelihood (ML) methods and to compare the results. One outcome of the analysis is the resolution, by means of Jeffreys’ rule, of questions regarding the appropriate Bayesian prior. It is shown that Bayesian and ML analyses lead to the same expression for the location parameter, but different expressions for the scale parameter, which become identical in the limit of an infinite sample size.
A second outcome of the analysis concerns use of the sample mean as the measure of information of the crowd in applications where the distribution of responses is not sought or known. In the coin-estimation experiment, the sample mean was found to differ widely from the mean number of coins calculated from $\Lambda(m, s^2)$. This discordance raises critical questions concerning whether, and under what conditions, the sample mean provides a reliable measure of the information of the crowd. This paper resolves that problem by use of the principle of maximum entropy (PME). The PME yields a set of equations for finding the most probable distribution consistent with given prior information and only that information. If there is no solution to the PME equations for a specified sample mean and sample variance, then the sample mean is an unreliable statistic, since no measure can be assigned to its uncertainty. Parts 1 and 2 together demonstrate that the information content of crowdsourcing resides in the distribution of responses (very often log-normal in form), which can be obtained empirically or by appropriate modeling.

In a previous paper [

The present paper, to be designated Part 2, extends the statistical analysis of crowdsourcing further. Whereas Part 1 was concerned primarily with the identity and universality of the distribution of crowd responses, Part 2 investigates the parameters by which this distribution is defined and discusses the procedure to be employed when the distribution of crowd responses is not known.

In contrast to impressions fostered by popularized accounts of crowdsourcing [

Part 1 focused primarily on identifying, and demonstrating the universality of, the distribution of crowdsourced responses to a large class of quantitative problems. This class includes problems whose solutions are representable by a composite random variable (RV), i.e. a variable expressible as a product (or sum of products) of other random variables. Statistical analysis showed that the crowdsourced responses follow a log-normal distribution. More generally, theoretical analysis and Monte Carlo simulation (MCS) demonstrated that, for a sufficiently large sample size, the distribution of any composite RV comprising factor variables of low relative uncertainty is log-normal to an excellent approximation. (Relative uncertainty is defined as the ratio of standard deviation to mean.) If the factor variables are themselves independent and log-normal, then the composite variable is itself rigorously (not approximately) log-normal.

The present article is concerned with estimation of the parameters that define the log-normal distribution. In general, statistical estimation can be classified into two methodologies: maximum likelihood and Bayesian [

The method of maximum likelihood (ML) is the simpler and easier to use; it was the method employed in Part 1 to extract information from both the crowdsourced sample and much larger MCS sample. The crux of the method, elaborated in the following section, is to compose from the data and known PDF a conditional probability density referred to as the likelihood function, and to solve for the parameters that maximize this function. The ML method is most successful when the likelihood function is unimodal and sharply peaked.

The Bayesian method is more complicated for several reasons. First, in general, it requires the analyst to assess the probability of the sought-for parameters prior to any experimental information about them. This prior probability function is referred to simply as “the prior”. Much of the past debate over Bayesian methods centered on the alleged subjectivity of the prior. Subsequent research, rooted in mathematical group theory (i.e. theory of invariants) has established a rigorous procedure for finding an objective prior for most well-behaved PDFs; see Ref [

The second complication to the Bayesian method, as applied in the present case, is that the distributions relevant to crowdsourcing (normal and log-normal) are defined by two parameters: a location parameter m and a scale parameter s. The crux of the Bayesian method, as elaborated in the following section, is to integrate over the likelihood and prior so as to obtain a posterior probability function (more simply referred to as “the posterior”) from which the statistics of the parameters can be calculated. However, for a two-parameter distribution there are two non-equivalent priors that may apply, depending on whether the analyst is interested in estimating only one or both of the parameters.

Despite the preceding complications, Bayesian methods afford a standardized procedure for incorporating new data by which to progressively update the posterior probability.

The remainder of this paper is organized in the following way.

Section 2 derives the likelihood function and estimation relations for an RV described by a log-normal distribution. Section 3 elaborates on the question of dual priors and derives the corresponding posterior probability densities for a log-normal RV. Section 4 applies the ML and Bayesian methods of parameter estimation to the image analysis and object counting problem of Part 1. Section 5 examines the problem of parameter estimation when the distribution of responses by the crowd is not known, and addresses the question of reliability when two different statistical methods yield significantly different results. Section 6 concludes the paper with a summary of principal findings.

As a matter of statistical terminology, the samples of a random variable are referred to as variates. In keeping with standard statistical notation, a random variable will be denoted by an upper-case letter (e.g. Z), and its variates will be denoted by a corresponding lower-case letter (e.g. z).

A random variable $Z$ is log-normal, as symbolized by

$$Z = \Lambda(m, s^2), \tag{1}$$

if the variable $Y$, defined and symbolized by

$$Y = \ln(Z) = N(m, s^2), \tag{2}$$

is described by a normal (also called Gaussian) distribution. Reciprocally, one can express $Z$ in the form

$$Z = \exp(Y). \tag{3}$$

The parameters $m$ and $s$ in Equations (1) and (2) are respectively the mean and standard deviation of the normal RV $Y$, whose PDF takes the familiar form

$$p_Y(y \mid m, s) = \frac{1}{\sqrt{2\pi}\, s} \exp\left( -\frac{(y - m)^2}{2 s^2} \right). \tag{4}$$

The PDF of the original log-normal variable $Z$, derived in Part 1 from Equation (2), is

$$p_Z(z \mid m, s) = \frac{1}{\sqrt{2\pi}\, s z} \exp\left( -\frac{(\ln(z) - m)^2}{2 s^2} \right). \tag{5}$$

The $q$th statistical moment of $Z$ for $q = 0, 1, 2, \cdots$, derived in Part 1, is given by

$$\langle Z^q \rangle = \exp\left( q m + \tfrac{1}{2} q^2 s^2 \right), \tag{6}$$

from which the mean $m_Z$ and variance $s_Z^2$ directly follow:

$$m_Z = \langle Z \rangle = \exp\left( m + \tfrac{1}{2} s^2 \right), \tag{7}$$

$$s_Z^2 = \langle Z^2 \rangle - \langle Z \rangle^2 = \exp(2m) \left( \exp(2 s^2) - \exp(s^2) \right). \tag{8}$$
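As a quick numerical illustration of Equations (6)-(8), the following minimal Python sketch evaluates the moments of a log-normal variable from $m$ and $s$. The function names are illustrative only and do not come from Part 1.

```python
import math

def lognormal_moment(q, m, s):
    """q-th moment of Z ~ Lambda(m, s^2), Equation (6)."""
    return math.exp(q * m + 0.5 * q**2 * s**2)

def lognormal_mean_var(m, s):
    """Mean and variance of Z, Equations (7) and (8)."""
    mean = lognormal_moment(1, m, s)
    var = lognormal_moment(2, m, s) - mean**2
    return mean, var

# Example with m = 0, s = 1: mean = exp(1/2), variance = e^2 - e
mean, var = lognormal_mean_var(0.0, 1.0)
print(mean, var)
```

Computing the variance as the difference of the first two moments reproduces the closed form on the right side of Equation (8).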

Given the set of variates $\{z_k\}$, $k = 1, \cdots, n$, obtained, for example, as solutions to a problem by crowdsourcing or by MCS in which the sought-for quantity $Z$ is taken to be log-normal, the likelihood function $L(\{z_k\} \mid m, s)$ is defined to be

$$L(\{z_k\} \mid m, s) = \prod_{k=1}^{n} p_Z(z_k \mid m, s), \tag{9}$$

where the factors on the right side are evaluations of PDF (5). Equation (9) quantifies the conditional probability of the data, given the distribution parameters $m$, $s$.

Since the extremum of a function and of its logarithm occur at the same point, it is more convenient to find the maximum of the log-likelihood $\mathcal{L} \equiv \ln L$,

$$\mathcal{L}(\{z_k\} \mid m, s) = \ln\left( \prod_{k=1}^{n} p_Z(z_k \mid m, s) \right) = \sum_{k=1}^{n} \ln\left( p_Z(z_k \mid m, s) \right), \tag{10}$$

which, upon substitution of Equation (5), takes the form

$$\mathcal{L}(\{z_k\} \mid m, s) = -n \ln(s) - \left[ \sum_{k=1}^{n} \frac{(m - \ln(z_k))^2}{2 s^2} \right] - \sum_{k=1}^{n} \ln(z_k) - \frac{n}{2} \ln(2\pi). \tag{11}$$

The last two terms of Equation (11) are independent of the parameters and could have been omitted. Solution of the maximization equations

$$\frac{\partial \mathcal{L}}{\partial m} = -s^{-2} \sum_{k=1}^{n} (m - \ln(z_k)) = 0, \tag{12}$$

$$\frac{\partial \mathcal{L}}{\partial s} = -n s^{-1} + s^{-3} \sum_{k=1}^{n} (m - \ln(z_k))^2 = 0, \tag{13}$$

leads to the ML parameters

$$\hat{m} = n^{-1} \sum_{k=1}^{n} y_k = n^{-1} \sum_{k=1}^{n} \ln(z_k), \tag{14}$$

$$\hat{s}^2 = n^{-1} \sum_{k=1}^{n} (y_k - \hat{m})^2 = n^{-1} \sum_{k=1}^{n} (\ln(z_k) - \hat{m})^2, \tag{15}$$

in which the ML solution $\hat{m}$ was substituted for the variable $m$ in Equation (13).
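The ML relations (14) and (15) translate directly into a few lines of code. The sketch below is a minimal illustration; the function name and the three-point sample are hypothetical and are not data from the experiment.

```python
import math

def ml_lognormal(z):
    """Return (m_hat, s_hat) from Equations (14) and (15) for positive variates z."""
    y = [math.log(zk) for zk in z]          # y_k = ln(z_k)
    n = len(y)
    m_hat = sum(y) / n                      # Equation (14)
    s2_hat = sum((yk - m_hat) ** 2 for yk in y) / n   # Equation (15): factor 1/n, not 1/(n-1)
    return m_hat, math.sqrt(s2_hat)

# Hypothetical sample whose logarithms are 1, 2, 3
z_demo = [math.exp(1.0), math.exp(2.0), math.exp(3.0)]
m_hat, s_hat = ml_lognormal(z_demo)
print(m_hat, s_hat**2)
```

For this sample the logarithms have mean 2 and ML variance 2/3, whereas the unbiased variance with normalization $(n-1)^{-1}$ would be 1.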

^{1}The term “unbiased” means that the expectation value of the sample variance equals the theoretical population variance. This is not the case for the ML variance. A heuristic justification for the factor $(n-1)^{-1}$ is that there can be no variance for a sample of size 1, and thus the unbiased variance should become indeterminate.

It is to be noted for use later that (a) the first equality of Equation (14) is precisely the form of the sample mean of $Y$ for a sample of size $n$, and (b) the first equality of Equation (15) differs from the unbiased sample variance of $Y$, for which the normalizing factor of a sample of size $n$ is $(n-1)^{-1}$ rather than $n^{-1}$.^{1}

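The variance and correlation of the ML parameters are elements of a 2-dimensional correlation matrix $C$ obtained from the Hessian matrix $H$ (i.e. the matrix of second derivatives of the log-likelihood $\mathcal{L}$) according to [

$$C = -H^{-1}, \tag{16}$$

in which

$$H = \begin{pmatrix} \dfrac{\partial^2 \mathcal{L}}{\partial m^2} & \dfrac{\partial^2 \mathcal{L}}{\partial m\, \partial s} \\ \dfrac{\partial^2 \mathcal{L}}{\partial s\, \partial m} & \dfrac{\partial^2 \mathcal{L}}{\partial s^2} \end{pmatrix}. \tag{17}$$

Upon differentiation of Equations (12) and (13) and use of Equations (14) and (15), the correlation matrix (16) reduces to

$$C = \begin{pmatrix} \sigma_m^2 & \rho_{ms} \\ \rho_{ms} & \sigma_s^2 \end{pmatrix} = \begin{pmatrix} \hat{s}^2 / n & 0 \\ 0 & \hat{s}^2 / 2n \end{pmatrix}. \tag{18}$$

One sees, therefore, that the ML parameters $\hat{m}$ and $\hat{s}$ are uncorrelated and that the standard error (i.e. standard deviation of the mean) of each is inversely proportional to the square root of the sample size, as expected.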
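The reduction to matrix (18) can be checked numerically. The sketch below, which is only an illustration under stated assumptions, builds the Hessian by central finite differences (a substitute for the analytic differentiation used in the text) and inverts it. The step size `h` and the four-point sample of logarithms are arbitrary choices.

```python
import math

def log_likelihood(m, s, y):
    """Equation (11) without the parameter-independent terms; y holds ln(z_k)."""
    return -len(y) * math.log(s) - sum((m - yk) ** 2 for yk in y) / (2 * s**2)

def covariance_from_hessian(m, s, y, h=1e-4):
    """C = -H^{-1}, Equation (16), with H estimated by central finite differences."""
    f = lambda a, b: log_likelihood(a, b, y)
    f0 = f(m, s)
    h_mm = (f(m + h, s) - 2 * f0 + f(m - h, s)) / h**2
    h_ss = (f(m, s + h) - 2 * f0 + f(m, s - h)) / h**2
    h_ms = (f(m + h, s + h) - f(m + h, s - h)
            - f(m - h, s + h) + f(m - h, s - h)) / (4 * h**2)
    det = h_mm * h_ss - h_ms**2
    # Inverse of a symmetric 2x2 matrix, with the overall sign flipped
    return [[-h_ss / det, h_ms / det], [h_ms / det, -h_mm / det]]

y_demo = [1.0, 2.0, 3.0, 4.0]                       # hypothetical logarithms
n = len(y_demo)
m_hat = sum(y_demo) / n
s_hat = math.sqrt(sum((yk - m_hat) ** 2 for yk in y_demo) / n)
C = covariance_from_hessian(m_hat, s_hat, y_demo)
print(C, s_hat**2 / n, s_hat**2 / (2 * n))
```

Evaluated at the ML point, the diagonal elements should agree with $\hat{s}^2/n$ and $\hat{s}^2/2n$ to within finite-difference error, and the off-diagonal element should be close to zero.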

Although past uses of Bayes’ theorem for estimation and prediction were at times controversial, the theorem itself is a fundamental part of the principles of statistics. Succinctly expressed in terms of hypotheses ($H$) and data ($D$), Bayes’ theorem takes the simple form

$$P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}, \tag{19}$$

where $P(H)$ is the prior, $P(D \mid H)$ is the likelihood, and $P(H \mid D)$ is the posterior; the denominator $P(D)$ is a normalization constant to be calculated, when needed, by summing or integrating over the full range of the numerator.

Applying Equation (19) in detail to the set of log-normal variates of Section 2 leads to the posterior probability

$$p_{(2)}^{(n)}(m, s \mid \{z_k\}) = \frac{p_{(2)}^{(n)}(\{z_k\} \mid m, s)\, \pi_{(2)}(m, s)}{\iint p_{(2)}^{(n)}(\{z_k\} \mid m, s)\, \pi_{(2)}(m, s)\, dm\, ds}, \tag{20}$$

in which $\pi_{(2)}(m, s)$ is the prior probability of parameters $m$, $s$ and the likelihood function is given by Equations (9) and (5). The subscript (2) in Equation (20) signifies that the parameter space is 2-dimensional; the superscript $(n)$ marks the total sample size with variates denoted individually by $k = 1, \cdots, n$. The range of $m$ extends from $-\infty$ to $\infty$; the range of $s$ extends from 0 to $\infty$. These ranges hold throughout the entire paper and will, therefore, be omitted from display so that equations will appear less cluttered.

The denominator in Equation (20) is an integral over a log-normal PDF, which is difficult to perform as such. However, since the posterior in Equation (20) is a conditional probability density for the parameters and not for the variates, a major simplification can be achieved by applying Bayes’ theorem to the associated normal variable $Y$, Equation (2). The expression for the posterior then takes the form

$$p_{(2)}^{(n)}(m, s \mid \{y_k\}) = \frac{p_{(2)}^{(n)}(\{y_k\} \mid m, s)\, \pi_{(2)}(m, s)}{\iint p_{(2)}^{(n)}(\{y_k\} \mid m, s)\, \pi_{(2)}(m, s)\, dm\, ds}, \tag{21}$$

where the likelihood function (the first factor in the numerator of (21)) is now taken to be

$$L(\{y_k\} \mid m, s) = \prod_{k=1}^{n} p_Y(y_k \mid m, s), \tag{22}$$

instead of Equation (9), and the set of variates $\{y_k\}$ is obtained from Equation (2),

$$y_k = \ln(z_k), \tag{23}$$

for $k = 1, \cdots, n$. The integral in the denominator of Equation (21) now involves the Gaussian density (4), instead of the log-normal density (5). Actually, had the calculation proceeded as originally formulated in Equation (20), a transformation of integration variable would have resulted in an expression equivalent to Equation (21). The product $\prod_{k=1}^{n} z_k$ in the denominator of the log-normal likelihood (see Equations (5) and (9)) would have canceled from both numerator and denominator of Equation (20), since it is not a function of the integration variables $m$, $s$. In view of the equivalence of posteriors (20) and (21), the same symbol $p_{(2)}^{(n)}$ is retained.

To evaluate the right side of Equation (21), one must have an appropriate expression for the prior $\pi(m, s)$. A general rule for determining the prior probability in a large class of estimation problems was developed by Jeffreys [

$$\pi(m, s) \propto \sqrt{\det\left( M(m, s) \right)}, \tag{24}$$

in which $\det(M(m, s))$ is the determinant of the Hessian matrix

$$M(m, s) = \left. \begin{pmatrix} \partial^2_{m m'} & \partial^2_{m s'} \\ \partial^2_{s m'} & \partial^2_{s s'} \end{pmatrix} \int \sqrt{p_Y(y \mid m, s)\, p_Y(y \mid m', s')}\; dy \, \right|_{m' = m,\; s' = s}, \tag{25}$$

where the differential operators ($\partial^2_{uv} \equiv \partial^2 / \partial u\, \partial v$) act on the integral to the right. Substitution of Equation (4) into Equation (25) results in the matrix

$$M(m, s) = \begin{pmatrix} (4 s^2)^{-1} & 0 \\ 0 & (2 s^2)^{-1} \end{pmatrix}. \tag{26}$$

Evaluation of the determinant in Equation (24) then yields the prior

$$\pi_{(2)}(m, s) \propto s^{-2}. \tag{27}$$

Constant factors in Equation (26) are unimportant since they cancel from the expression (21) for the posterior, and one can replace the proportionality in (27) with an equality.

Substitution of prior (27) into Equation (21) leads to the posterior probability density

$$p_{(2)}^{(n)}(m, s \mid \{y_k\}) = \frac{n^{(n+1)/2}\, S^n \exp\left( -\dfrac{n}{2 s^2} \left[ (m - \bar{Y})^2 + S^2 \right] \right)}{\sqrt{\pi}\, 2^{(n-1)/2}\, \Gamma(n/2)\, s^{n+2}}, \tag{28}$$

in which

$$\bar{Y} = n^{-1} \sum_{k=1}^{n} y_k = n^{-1} \sum_{k=1}^{n} \ln(z_k), \tag{29}$$

$$\overline{Y^2} = n^{-1} \sum_{k=1}^{n} y_k^2 = n^{-1} \sum_{k=1}^{n} (\ln z_k)^2, \tag{30}$$

$$S^2 = \overline{Y^2} - \bar{Y}^2, \tag{31}$$

and

$$\Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\, du \tag{32}$$

is the gamma function.

Since the set of variates $\{y_k\}$ obtained by sampling the crowd enters Equation (28) in the form of two statistics, a sample mean (29) and sample variance (31), the posterior probability function will be designated $p_{(2)}^{(n)}(m, s \mid \bar{Y}, S)$ in the remainder of the article.

Plots of solutions to the equation $p_{(2)}^{(n)}(m, s \mid \bar{Y}, S) = c$ for different values of the conditional probability $c$ and sample size $n$ form contours analogous to equipotential lines in electrostatics. Viewed as a topographical map, the peak (the point of highest probability density) provides a graphical means of estimating the best set of parameters $(\tilde{m}, \tilde{s})$ to be inferred from the sample statistics $\bar{Y}$, $S$. This is the Bayesian counterpart to the ML procedure of maximizing the likelihood function analytically.

Numerical estimates of the parameters $m$, $s$ are obtained from PDF (28) by calculating the expectation values

$$\bar{m} = \iint m\, p_{(2)}^{(n)}(m, s \mid \bar{Y}, S)\, ds\, dm = \int m\, p_{(2m)}^{(n)}(m \mid \bar{Y}, S)\, dm = \bar{Y}, \tag{33}$$

$$\bar{s} = \iint s\, p_{(2)}^{(n)}(m, s \mid \bar{Y}, S)\, dm\, ds = \int s\, p_{(2s)}^{(n)}(s \mid \bar{Y}, S)\, ds = \frac{\Gamma((n-1)/2)}{\Gamma(n/2)} \sqrt{\frac{n}{2}}\; S, \tag{34}$$

where the second equality in Equations (33) and (34) defines the marginal probability densities for $m$ and $s$ respectively, as indicated by subscripts $(2m)$ and $(2s)$:

$$p_{(2m)}^{(n)}(m \mid \bar{Y}, S) = \int p_{(2)}^{(n)}(m, s \mid \bar{Y}, S)\, ds = \frac{S^n\, \Gamma((n+1)/2)}{\sqrt{\pi}\, \Gamma(n/2) \left[ (m - \bar{Y})^2 + S^2 \right]^{(n+1)/2}}, \tag{35}$$

$$p_{(2s)}^{(n)}(s \mid \bar{Y}, S) = \int p_{(2)}^{(n)}(m, s \mid \bar{Y}, S)\, dm = \frac{n^{n/2}\, S^n \exp\left( -n S^2 / 2 s^2 \right)}{2^{\frac{n}{2} - 1}\, \Gamma(n/2)\, s^{n+1}}. \tag{36}$$

From Equations (33), (29), and (14), it is seen that the Bayesian mean $\bar{m}$ is identical to the maximum likelihood $\hat{m}$. However, the two estimates of $s$ given by Equations (34) and (15) differ, since expansion of Equation (15) yields

$$\hat{s}^2 = n^{-1} \sum_{k=1}^{n} (\hat{m} - \ln(z_k))^2 = n^{-1} \left( \sum_{k=1}^{n} (\ln z_k)^2 \right) - \hat{m}^2 = \overline{Y^2} - \bar{Y}^2 = S^2. \tag{37}$$

In the limit of an infinite sample size, the numerical coefficient of $S$ in Equation (34) reduces to

$$\lim_{n \to \infty} \left( \frac{\Gamma((n-1)/2)}{\Gamma(n/2)} \sqrt{\frac{n}{2}} \right) = 1, \tag{38}$$

in which case the Bayesian and ML scale parameters become identical. For finite sample sizes, a series expansion of the coefficient yields the Bayesian scale parameter to 4th order in $n^{-1}$,

$$\bar{s} = \left( 1 + \frac{3}{4} n^{-1} + \frac{25}{32} n^{-2} + \frac{105}{128} n^{-3} + \frac{1659}{2048} n^{-4} \right) S \tag{39}$$

in comparison with the ML scale parameter $\hat{s} = S$.
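The exact coefficient in Equation (34) and its series approximation (39) are easy to compare numerically. The following sketch is illustrative; it uses the log-gamma function from the Python standard library so that the ratio of gamma functions does not overflow for large $n$. The function names are hypothetical.

```python
import math

def bayes_scale_coefficient(n):
    """Exact coefficient of S in Equation (34), evaluated through log-gamma."""
    log_ratio = math.lgamma((n - 1) / 2) - math.lgamma(n / 2)
    return math.sqrt(n / 2) * math.exp(log_ratio)

def bayes_scale_series(n):
    """Series coefficient of Equation (39), through order n^-4."""
    x = 1.0 / n
    return 1 + 3/4 * x + 25/32 * x**2 + 105/128 * x**3 + 1659/2048 * x**4

for n in (10, 100, 1706):
    print(n, bayes_scale_coefficient(n), bayes_scale_series(n))
```

The two functions agree closely once $n$ is moderately large, and both tend to 1 as $n$ grows, in accord with Equation (38).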

The uncertainties in Bayesian estimates, $\Delta m^2$ and $\Delta s^2$, which can be compared with the ML uncertainties derived from the correlation matrix (18), are obtained again as expectation values of the marginal density functions (35) and (36) as follows:

$$\Delta m^2 \equiv \int (m - \bar{m})^2\, p_{(2m)}^{(n)}(m \mid \bar{Y}, S)\, dm = \frac{S^2}{n - 2}, \tag{40}$$

$$\Delta s^2 \equiv \int (s - \bar{s})^2\, p_{(2s)}^{(n)}(s \mid \bar{Y}, S)\, ds = n S^2 \left[ \frac{1}{n - 2} - \frac{\Gamma((n-1)/2)^2}{2\, \Gamma(n/2)^2} \right]. \tag{41}$$

In the limit of infinite sample size, Equations (40) and (41) respectively reduce to

$$\lim_{n \to \infty} \left( \Delta m^2 \big|_{\text{Bayes}} \right) = \frac{S^2}{n} = \sigma_m^2 \big|_{\text{ML}}, \tag{42}$$

$$\lim_{n \to \infty} \left( \Delta s^2 \big|_{\text{Bayes}} \right) = \frac{S^2}{2n} = \sigma_s^2 \big|_{\text{ML}}, \tag{43}$$

which again shows large-sample agreement between Bayesian and maximum likelihood statistics. For finite sample sizes, series expansion of the Bayesian uncertainty (41) to 4th order in $n^{-1}$ yields

$$\Delta s^2 = \left( 1 + \frac{15}{4} n^{-1} + \frac{83}{8} n^{-2} + \frac{1605}{64} n^{-3} \right) \frac{S^2}{2n}. \tag{44}$$
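To see how quickly the Bayesian uncertainties (40) and (41) approach the ML values of matrix (18), one can evaluate both directly. The sketch below is a minimal illustration with hypothetical function names. The bracket in Equation (41) is a difference of nearly equal quantities for large $n$, so the log-gamma form is used to limit round-off.

```python
import math

def bayes_variances(n, S):
    """Posterior variances of m and s, Equations (40) and (41); requires n > 2."""
    r = math.exp(math.lgamma((n - 1) / 2) - math.lgamma(n / 2))
    var_m = S**2 / (n - 2)
    var_s = n * S**2 * (1.0 / (n - 2) - r**2 / 2)
    return var_m, var_s

def ml_variances(n, S):
    """ML variances from the diagonal of matrix (18), with s_hat = S."""
    return S**2 / n, S**2 / (2 * n)

for n in (10, 100, 1706):
    (bm, bs), (mm, ms) = bayes_variances(n, 1.0), ml_variances(n, 1.0)
    print(n, bm / mm, bs / ms)   # both ratios tend to 1 as n increases
```

The printed ratio for the scale parameter can be compared with the parenthesized series factor in Equation (44).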

As expected on the basis of the Central Limit Theorem [

$$p_{(2m)}^{(n \gg 1)}(m \mid \bar{Y}, S) \to \frac{1}{\sqrt{2\pi \sigma_m^2}} \exp\left( -\frac{(m - \bar{Y})^2}{2 \sigma_m^2} \right) = \frac{1}{\sqrt{2\pi (S^2/n)}} \exp\left( -\frac{(m - \bar{Y})^2}{2 (S^2/n)} \right), \tag{45}$$

$$p_{(2s)}^{(n \gg 1)}(s \mid \bar{Y}, S) \to \frac{1}{\sqrt{2\pi \sigma_s^2}} \exp\left( -\frac{(s - S)^2}{2 \sigma_s^2} \right) = \frac{1}{\sqrt{2\pi (S^2/2n)}} \exp\left( -\frac{(s - S)^2}{(S^2/n)} \right), \tag{46}$$

signifying that $\bar{m} \sim N(\bar{Y}, S^2/n)$ and $\bar{s} \sim N(S, S^2/2n)$ in the large-sample approximation.

A summary of the means and variances of the log-normal parameters obtained by both ML and Bayesian methods is given in

| Statistic | ML symbol | ML value | Bayesian symbol | Bayesian expectation value | Limit $n \to \infty$ |
|---|---|---|---|---|---|
| location $m$ | $\hat{m}$ | $\bar{Y}$ | $\bar{m} = \langle m \rangle_{(2)}^{(n)}$ | $\bar{Y}$ | $\bar{Y}$ |
| scale $s$ | $\hat{s}$ | $S \equiv \sqrt{\overline{Y^2} - \bar{Y}^2}$ | $\bar{s} = \langle s \rangle_{(2)}^{(n)}$ | $\left( \dfrac{n}{2} \right)^{1/2} \dfrac{\Gamma((n-1)/2)}{\Gamma(n/2)}\, S$ | $S$ |
| $\mathrm{var}(m)$ | $\hat{\sigma}_m^2$ | $S^2/n$ | $\Delta m^2 = \langle (m - \bar{m})^2 \rangle_{(2)}^{(n)}$ | $\dfrac{S^2}{n-2}$ | $S^2/n$ |
| $\mathrm{var}(s)$ | $\hat{\sigma}_s^2$ | $S^2/2n$ | $\Delta s^2 = \langle (s - \bar{s})^2 \rangle_{(2)}^{(n)}$ | $\left[ \dfrac{n}{n-2} - \dfrac{n\, \Gamma((n-1)/2)^2}{2\, \Gamma(n/2)^2} \right] S^2$ | $S^2/2n$ |

Sample statistics: $\bar{Y} = \dfrac{1}{n} \sum_{k=1}^{n} \ln(z_k)$, $\overline{Y^2} = \dfrac{1}{n} \sum_{k=1}^{n} (\ln z_k)^2$.

The log-normal distribution $\Lambda(m, s^2)$, as demonstrated in Part 1, describes the distribution of crowdsourced estimates $\{z_k\}$ of the solution $Z$ to a quantitative problem involving products of random variables. Both parameters $(m, s)$ are needed to determine the population statistics of $Z$, as shown explicitly by Equation (6). It is to be recalled, however, that $m$ and $s$ are respectively the mean and standard deviation of a normal random variable $Y \equiv \ln(Z) = N(m, s^2)$. For the purpose of this paper and its antecedent, which is to extract information from sampling or simulating the responses of a crowd, $Z$ is the quantity of interest, and $Y$ is merely an intermediary for obtaining the parameters $m$ and $s$.

Under other circumstances, however, an analyst may be interested in the normal variable $Y$, but desire only to know its mean value, i.e. the location parameter $m$ and its distribution. In such a case, it may seem reasonable simply to follow the approach of Section 3.2, namely to use the marginal probability density $p_{(2m)}^{(n)}(m \mid \bar{Y}, S)$. Surprisingly, the matter of how to proceed in this case is controversial. Arguments against the preceding approach claim that it leads to “marginalization paradoxes” [

According to critics of using $p_{(2m)}^{(n)}(m \mid \bar{Y}, S)$, the correct Bayesian approach for estimating the posterior by which to calculate one parameter of a two-parameter distribution is to return to Jeffreys’ rule, Equation (24), and determine the prior $\pi_{(1)}(m, s)$ for a one-dimensional parameter space. Implementing this instruction leads to a matrix with the single element $M_{11}$ of matrix (26), whose substitution in Equation (24) then yields the prior

$$\pi_{(1)}(m, s) \propto s^{-1}. \tag{47}$$

Use of prior (47) in Equation (21) with subsequent integration over $s$ as in Equation (35) results in the posterior

$$p_{(1)}^{(n)}(m \mid \bar{Y}, S) = \frac{S^{n-1}\, \Gamma(n/2)}{\sqrt{\pi}\, \Gamma((n-1)/2) \left[ (m - \bar{Y})^2 + S^2 \right]^{n/2}}, \tag{48}$$

where the subscript (1) explicitly denotes a prior for a one-dimensional parameter space. Comparison of Equations (48) and (35) shows that $p_{(1)}^{(n)} = p_{(2m)}^{(n-1)}$.
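The relation between the two posteriors for $m$ can be verified numerically. The sketch below codes Equations (35) and (48) in logarithmic form; the function names and test values are hypothetical.

```python
import math

def p2m(m, n, ybar, S):
    """Marginal posterior of m under the two-parameter prior, Equation (35)."""
    log_c = (n * math.log(S) + math.lgamma((n + 1) / 2)
             - math.lgamma(n / 2) - 0.5 * math.log(math.pi))
    return math.exp(log_c - (n + 1) / 2 * math.log((m - ybar) ** 2 + S**2))

def p1(m, n, ybar, S):
    """Posterior of m under the one-parameter prior, Equation (48)."""
    log_c = ((n - 1) * math.log(S) + math.lgamma(n / 2)
             - math.lgamma((n - 1) / 2) - 0.5 * math.log(math.pi))
    return math.exp(log_c - n / 2 * math.log((m - ybar) ** 2 + S**2))

# Arbitrary illustrative values of the sample statistics
ybar, S, n = 0.3, 0.7, 10
for m in (-1.0, 0.0, 0.3, 1.5):
    print(m, p1(m, n, ybar, S), p2m(m, n - 1, ybar, S))
```

At every test point the one-parameter posterior for sample size $n$ coincides with the two-parameter marginal for sample size $n-1$, so the choice of prior amounts to the loss of one effective degree of freedom.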

The dashed curves in

Part 1 reported a crowdsourcing experiment devised by the author and implemented with the collaboration of a BBC television show. In brief, the experiment involved a transparent tumbler in the shape of a conical frustum filled with £1 coins. Viewers saw the 3-dimensional tumbler as a 2-dimensional projection on their television screens. Viewers were asked to submit by email their estimates of the number of coins in the tumbler, which were subsequently transmitted to the author for analysis. The number of participants was $n = 1706$. Objectives of the experiment were 1) to determine the statistical distribution of the viewers’ estimates and 2) to gauge how closely a statistical analysis of crowd responses matched the true count, which was $N_c = 1111$. The sample mean $\bar{Z}$, sample variance (biased $\hat{S}_Z^2$ or unbiased $S_Z^2$), and standard error $S_{\bar{Z}}$ of the responses from the BBC viewers were calculated to be

$$\bar{Z} = \frac{1}{n} \sum_{k=1}^{n} z_k = 982, \tag{49}$$

$$\hat{S}_Z^2 = \frac{1}{n} \sum_{k=1}^{n} (z_k - \bar{Z})^2 = 1592.65^2 \quad \text{or} \quad S_Z^2 = \frac{1}{n-1} \sum_{k=1}^{n} (z_k - \bar{Z})^2 = 1593.29^2, \tag{50}$$

$$S_{\bar{Z}} = \frac{S_Z}{\sqrt{n}} = 38.56, \tag{51}$$

where $Z$ is the random variable representing the estimated number of coins submitted by a participant in the crowd.

The sample of estimates was satisfactorily accounted for by a log-normal distribution as shown by the histogram (gray bars) in

From the relations of the previous section as summarized in

$$\bar{m} = \hat{m} = \bar{Y} = 6.5651, \tag{52}$$

and the Bayesian scale parameter $\bar{s}$, Equation (34), for a sample size $n = 1706$, is

$$\bar{s} = 1.0004399 \times S = 0.7189, \tag{53}$$

which differs from the ML parameter $\hat{s}$ only in the fourth decimal place. Thus, for a sample size of nearly 2000, the ML and Bayesian analyses lead to statistically equivalent log-normal parameters.

Substitution of the Bayesian (or ML) parameters, Equations (52) and (53), into the log-normal expectation values (7) and (8) for the mean, variance, and standard error of $Z$ results in an estimate of the number of coins significantly different from that of Equation (49):

$$\langle Z \rangle = \exp\left( \bar{m} + \tfrac{1}{2} \bar{s}^2 \right) = 919, \tag{54}$$

$$\sigma_Z^2 = \exp(2\bar{m}) \left( \exp(2\bar{s}^2) - \exp(\bar{s}^2) \right) = 756.26^2, \tag{55}$$

$$\sigma_{\langle Z \rangle} = \sigma_Z / \sqrt{n} = 18.31. \tag{56}$$

According to the Central Limit Theorem (CLT) [

$$\frac{\left| \bar{Z} - \langle Z \rangle \right|}{\sigma_{\langle Z \rangle}} \approx 3.5, \tag{57}$$

corresponding to a P-value(Ref [

$$\Pr\left( \left| \bar{Z} - \langle Z \rangle \right| / \sigma_{\langle Z \rangle} \ge 3.5 \right) = 4.7 \times 10^{-4}. \tag{58}$$

The low probability (58) signifies that it is very unlikely that the difference in the two means occurred as a matter of chance.
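The arithmetic behind Equations (57) and (58) can be reproduced in a few lines. The sketch below assumes a two-sided Gaussian tail probability, which is the interpretation consistent with the quoted value. The function name is hypothetical, and the inputs are the rounded values of Equations (49), (54), and (56).

```python
import math

def two_sided_p_value(z_score):
    """Pr(|X| >= z_score) for a standard normal variable X."""
    return math.erfc(abs(z_score) / math.sqrt(2))

sample_mean, lognormal_mean, std_err = 982.0, 919.0, 18.31
z_score = abs(sample_mean - lognormal_mean) / std_err
print(z_score, two_sided_p_value(3.5))
```

With the rounded inputs the ratio comes out near 3.44, which the text quotes as approximately 3.5, and the two-sided tail probability at 3.5 standard deviations is about $4.7 \times 10^{-4}$.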

Since the sample mean has been the statistic routinely used in numerous crowdsourcing applications, the large discrepancy between Equations (49) and (54) raises questions crucial to the extraction and interpretation of crowdsourced information:

1) Is there some fundamental statistical principle that justifies use of the sample mean as a measure of the collective response of the crowd?

2) Why does the sample mean differ so markedly from the Bayesian (or ML) estimate of the mean number of coins?

3) Which estimate of $Z$ more accurately reflects the information contained in the collective response of the crowd: (a) the sample mean (49) obtained directly from the variates $\{z_k\}$, or (b) the population mean (54) of the log-normal distribution $\Lambda(\bar{m}, \bar{s}^2)$?

These questions are resolved in Section 5.3 by first examining a third estimation procedure based on the principle of maximum entropy (PME).

When the probability distribution of a random variable is known, the maximum likelihood or Bayesian methods can be used to estimate the parameters of that distribution, as was done in previous sections. However, in numerous applications of crowdsourcing—starting with the original experiment of Sir Francis Galton in 1907 [

Given incomplete statistical information of a random variable, there is a procedure for finding the most objective probability distribution—i.e. the distribution least biased by unwarranted assumptions—consistent with the known information. This is the distribution that maximizes entropy subject to the constraints of prior information. The so-called principle of maximum entropy (PME) has a vast literature [

Suppose $p(z)$, $z = 0, \cdots, \infty$, is the probability for outcome $z$ of the random variable $Z$, which represents the possible estimates of the number of coins by the crowd. Given the discrete nature of the problem, $z$ should be a non-negative integer, but it is written as the argument of a function rather than as an index because, where summation is required, it will be treated as a continuous variable to be integrated. The practical justification for the continuum approximation is that it leads to useful closed-form expressions. The mathematical justification lies in the fact that the range is infinite, and the mean and variance of the system are assumed to be large compared to the unit interval. Thus, treatment of $p(z)$ as a continuous PDF is analogous to the well-known procedures for transforming a discrete distribution like the binomial or Poisson into a Gaussian.

The entropy $H_0$ of a system whose states (i.e. possible outcomes) $z$ occur with probability $p(z)$ is given by [

$$H_0 = -\sum_{z=0}^{\infty} p(z) \ln(p(z)), \tag{59}$$

and corresponds to the quantity designated by Shannon as “information” in communication theory [

Suppose further that all that is known of the system, in addition to the non-negative range of outcomes, are the first and second moments of $Z$, or equivalently the mean and variance. In other words, the prior information can be summarized as

$$1 = \sum_{z=0}^{\infty} p(z) = \int_0^\infty p(z)\, dz, \tag{60}$$

$$\langle Z \rangle = \sum_{z=0}^{\infty} z\, p(z) = \int_0^\infty z\, p(z)\, dz = \alpha_1, \tag{61}$$

$$\langle Z^2 \rangle = \sum_{z=0}^{\infty} z^2 p(z) = \int_0^\infty z^2 p(z)\, dz = \alpha_2, \tag{62}$$

in which Equation (60) is the completeness relation for $p(z)$ to be a probability (for discrete $z$) or probability density (for continuous $z$). Moments (61) and (62), respectively defined by the first equality and calculated by the second equality, take the known numerical values $(\alpha_1, \alpha_2)$ given by the third equality. Then, according to the PME, the least-biased distribution $p(z)$ can be obtained by maximizing the functional

$$H = -\sum_{z=0}^{\infty} p(z) \ln(p(z)) + \lambda_0 \left[ 1 - \sum_{z=0}^{\infty} p(z) \right] + \lambda_1 \left[ \alpha_1 - \sum_{z=0}^{\infty} z\, p(z) \right] + \lambda_2 \left[ \alpha_2 - \sum_{z=0}^{\infty} z^2 p(z) \right], \tag{63}$$

with respect to each independent probability $p(z')$, $z' = 0, \cdots, \infty$, where the three factors $\lambda_k$, $k = 0, 1, 2$, are Lagrange multipliers.

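Implementation of the maximization procedure

$$\partial H / \partial p(z') = 0, \tag{64}$$

by means of the orthonormality relation of independent probabilities

$$\partial p(z) / \partial p(z') = \delta_{z z'}, \tag{65}$$

in which $\delta_{z z'}$ is the Kronecker delta function [

$$p(z \mid \lambda_1, \lambda_2) = \frac{\exp\left( -\lambda_1 z - \lambda_2 z^2 \right)}{Q(\lambda_1, \lambda_2)}, \tag{66}$$

where the multiplier $\lambda_0$ has been absorbed into the partition function

$$Q(\lambda_1, \lambda_2) \equiv \int_0^\infty \exp\left( -\lambda_1 z - \lambda_2 z^2 \right) dz = \frac{1}{2} \sqrt{\frac{\pi}{\lambda_2}} \left( 1 - \mathrm{erf}\left( \frac{\lambda_1}{2\sqrt{\lambda_2}} \right) \right) \exp\left( \lambda_1^2 / 4\lambda_2 \right), \tag{67}$$

to yield

$$p(z \mid \lambda_1, \lambda_2) = 2 \sqrt{\frac{\lambda_2}{\pi}}\; \frac{\exp\left( -\lambda_1 z - \lambda_2 z^2 \right) e^{-\lambda_1^2 / 4\lambda_2}}{1 - \mathrm{erf}\left( \lambda_1 / 2\sqrt{\lambda_2} \right)}. \tag{68}$$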

The error function $\mathrm{erf}(x)$ is defined by the integral

$$\mathrm{erf}(x) \equiv \frac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2)\, dt, \tag{69}$$

which yields limiting values $\mathrm{erf}(0) = 0$, $\mathrm{erf}(\infty) = 1$, and has odd symmetry $\mathrm{erf}(-x) = -\mathrm{erf}(x)$.
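A closed form for the partition function $Q(\lambda_1, \lambda_2)$ is easily checked against direct numerical integration. The sketch below is illustrative only. It writes the closed form with the complementary error function $\mathrm{erfc}(x) = 1 - \mathrm{erf}(x)$, obtained by completing the square in the exponent, and compares it with a composite Simpson rule. The truncation point `z_max` and the number of steps are arbitrary assumptions suited to the test values used here.

```python
import math

def partition_closed_form(lam1, lam2):
    """Closed form of Q for lam2 > 0, written with erfc(x) = 1 - erf(x)."""
    u = lam1 / (2 * math.sqrt(lam2))
    return 0.5 * math.sqrt(math.pi / lam2) * math.erfc(u) * math.exp(u * u)

def partition_numeric(lam1, lam2, z_max=40.0, steps=40000):
    """Composite Simpson rule for the defining integral, truncated at z_max."""
    h = z_max / steps                       # steps must be even
    f = lambda z: math.exp(-lam1 * z - lam2 * z * z)
    total = f(0.0) + f(z_max)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3

for lam1, lam2 in ((0.5, 1.0), (-2.0, 0.5)):
    print(lam1, lam2, partition_closed_form(lam1, lam2), partition_numeric(lam1, lam2))
```

The second test pair has $\lambda_1 < 0$, which corresponds to a distribution peaked at positive $z$, the case of interest for crowd estimates.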

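PDF (68) satisfies the completeness integral (60). From the definition of the partition function in Equation (67), it follows that the first two moments of the distribution can be calculated from the derivatives

$$\langle Z \rangle = -\partial \ln\left( Q(\lambda_1, \lambda_2) \right) / \partial \lambda_1, \tag{70}$$

$$\langle Z^2 \rangle = -\partial \ln\left( Q(\lambda_1, \lambda_2) \right) / \partial \lambda_2, \tag{71}$$

which, when substituted into Equations (61) and (62), yield expressions for determining the Lagrange multipliers:

$$-\frac{\lambda_1}{2\lambda_2} + \frac{1}{\sqrt{\pi \lambda_2}}\; \frac{e^{-\lambda_1^2 / 4\lambda_2}}{1 - \mathrm{erf}\left( \lambda_1 / 2\sqrt{\lambda_2} \right)} = \alpha_1, \tag{72}$$

$$\frac{\lambda_1^2}{4\lambda_2^2} + \frac{1}{2\lambda_2} - \frac{\lambda_1}{2\lambda_2^{3/2}}\; \frac{e^{-\lambda_1^2 / 4\lambda_2}}{\sqrt{\pi} \left( 1 - \mathrm{erf}\left( \lambda_1 / 2\sqrt{\lambda_2} \right) \right)} = \alpha_2. \tag{73}$$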

^{2}To calculate moments of a distribution from the partition function, differentiation must be with respect to the Lagrange multipliers $(\lambda_1, \lambda_2)$ and not the transformed variables $(a, b)$.

At this point^{2} the analysis is considerably facilitated by a change of variables from $(\lambda_1, \lambda_2)$ to the variables $(a, b)$ defined by

$$\lambda_1 = -a / b^2, \tag{74}$$

$$\lambda_2 = 1 / 2b^2, \tag{75}$$

which, when substituted into Equation (68), result in the PDF

$$p(z \mid a, b) = \sqrt{\frac{2}{\pi}}\; \frac{\exp\left( -(z - a)^2 / 2b^2 \right)}{b \left( 1 + \mathrm{erf}\left( a / \sqrt{2}\, b \right) \right)}. \tag{76}$$

The form of PDF (76) gives the impression that $a$ is a location parameter (mean) and $b$ is a scale parameter (standard deviation). This is not strictly correct, as can be seen by substituting Equations (74) and (75) into Equations (72) and (73) to obtain

$$\langle Z \rangle = a + b\, q(a, b) = \alpha_1, \tag{77}$$

$$\langle Z^2 \rangle = a^2 + b^2 + a b\, q(a, b) = \alpha_2, \tag{78}$$

where

$$q(a, b) = \sqrt{\frac{2}{\pi}}\; \frac{e^{-a^2 / 2b^2}}{1 + \mathrm{erf}\left( a / \sqrt{2}\, b \right)}. \tag{79}$$

However, $q(a, b) \to 0$ in the limit $a/b \to \infty$, in which case $a = \langle Z \rangle = \alpha_1$ and $b^2 = \langle Z^2 \rangle - \langle Z \rangle^2 = \alpha_2 - \alpha_1^2$. Thus, if the distribution is sharply defined, then, for all practical purposes, the error function in Equation (76) is equal to 1, and $p(z \mid a, b)$ becomes a Gaussian PDF extending over the full real axis with mean $a > 0$ and standard deviation $b$.

To solve the set of PME Equations (77)-(79) for $a$ and $b$ one must supply the values of $\alpha_1$ and $\alpha_2$, which constitute prior information, but which in practice must be estimated from the sample whose theoretical distribution is not part of the prior information. The optimal estimation procedure is to use the sample averages

$$\alpha_1 \approx \bar{Z} = n^{-1} \sum_{i=1}^{n} z_i, \tag{80}$$

$$\alpha_2 \approx \overline{Z^2} = n^{-1} \sum_{i=1}^{n} z_i^2, \tag{81}$$

again symbolized by overbars to distinguish them from theoretical expectation values symbolized by angular brackets. Justification of (80) and (81) derives from a general result of probability theory that maximizing the entropy subject to constraints (61) and (62) is equivalent to maximizing the likelihood function over the manifold of sampling distributions selected by maximum entropy. (See Ref. [

Equations (77)-(79) are highly nonlinear in the variables a and b. One way to solve the set of equations is graphically by plotting the variation of b as a function of a subject to each of the two constraints

a + b q ( a , b ) − α 1 = 0 , (82)

and

$$a^2 + b^2 + ab\,q(a,b) - \alpha_2 = 0, \tag{83}$$

and finding the common point $(\hat{a}, \hat{b})$ of intersection. As an example that illuminates the present discussion, consider a hypothetical set of estimates with sample mean $\alpha_1 = 981$ and sample mean-square $\alpha_2 = 1156^2$. The mean $\alpha_1$ was chosen to be very close to the mean coin estimate Equation (49) of the BBC viewers, but the variance $\alpha_2 - \alpha_1^2 = 612^2$ is significantly lower than the sample variance Equation (50). The left panel of

The right panel of
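The graphical search for an intersection of constraints (82) and (83) can also be sketched numerically. The Python code below is a minimal illustration, not the procedure used in the paper: it rewrites both constraints in terms of the ratio $t = a/b$, so that the coefficient of variation $\sqrt{\alpha_2 - \alpha_1^2}/\alpha_1$ depends on $t$ alone, and then looks for a root with SciPy's `brentq`. The function names `q_ratio` and `solve_pme` and the bracket limits are assumptions introduced here.

```python
import math
from scipy.optimize import brentq
from scipy.special import log_ndtr

def q_ratio(t):
    """q(a, b) of Equation (79) in terms of t = a/b, in a numerically stable form."""
    return math.exp(-0.5 * t * t - 0.5 * math.log(2.0 * math.pi) - log_ndtr(t))

def solve_pme(alpha1, alpha2, t_lo=-30.0, t_hi=40.0):
    """Return (a, b) satisfying Equations (82) and (83), or None if no root is found.

    The search is limited to t_lo <= a/b <= t_hi (an assumption of this sketch).
    """
    variance = alpha2 - alpha1**2
    if alpha1 <= 0.0 or variance <= 0.0:
        return None
    cv_target = math.sqrt(variance) / alpha1
    t_hi = max(t_hi, 2.0 / cv_target)  # widen the bracket for very sharp distributions

    def cv_gap(t):
        qt = q_ratio(t)
        mean_over_b = t + qt                # <Z>/b from Equation (77)
        var_over_b2 = 1.0 - qt * (t + qt)   # (<Z^2> - <Z>^2)/b^2 from (77) and (78)
        return math.sqrt(var_over_b2) / mean_over_b - cv_target

    if cv_gap(t_lo) * cv_gap(t_hi) > 0.0:
        return None  # no sign change: the two constraint curves do not intersect
    t = brentq(cv_gap, t_lo, t_hi)
    qt = q_ratio(t)
    b = math.sqrt(variance / (1.0 - qt * (t + qt)))
    return t * b, b

# Hypothetical example from the text: alpha1 = 981, alpha2 = 1156^2.
hypothetical = solve_pme(981.0, 1156.0**2)
# Sample mean 982.17 and sample standard deviation 1593.65 of the coin experiment.
bbc = solve_pme(982.17, 982.17**2 + 1593.65**2)
print("hypothetical sample:", hypothetical)
print("coin-estimation sample:", bbc)
```

In this sketch a solution can exist only when the sample standard deviation is smaller than the sample mean, because the coefficient of variation of a Gaussian truncated at zero stays below 1. The hypothetical sample satisfies that condition, whereas the coin-estimation sample does not, so the second call returns `None`.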

So that the reader does not misinterpret these results, it is to be emphasized that the failure of the PME to yield a solution under some specified conditions is not a failure of the method. Rather, it is useful information signifying that the prior information was insufficient to provide a solution, and that additional or more consistent information is required. Thus, to persist in this approach to finding the mean response of the crowd in the absence of a known distribution, one might have to include in the prior information the sample mean-cube $\alpha_3$, the sample mean-quartic $\alpha_4$, and so on, until a satisfactory solution was obtained. However, constructing a solution incrementally by including higher-order sample moments is a very unsatisfactory way to proceed, since the mathematics soon becomes impractically complicated. Moreover, from a conceptual perspective, such an approach is entirely unnecessary because the actual distribution can be deduced or estimated from the crowdsourced sample.

Recall that the rationale for using the PME in the first place arose from ignorance of the distribution, and that under such circumstances the PME furnishes the least biased distribution by which to interpret the sample mean and variance. However, the distribution of a wide class of crowdsourced samples is knowable, if only the analyst were to extract it from the set of responses: it is the log-normal distribution [

It may be argued that the complexity of the analysis in Sections 5.1 and 5.2 could be avoided if one simply omitted from the prior information the requirement that z ≥ 0 . Permitting z to range over the entire real axis would then yield a PME distribution of pure Gaussian form

$$p(z|a,b) = \frac{1}{\sqrt{2\pi b^2}} \exp\left(-(z-a)^2/2b^2\right), \tag{84}$$

in which parameters $a$ and $b^2$ are unambiguously the population mean and variance. Thus, given the mean and variance as the only prior information, it follows from the PME that (1) the most objective distribution is Gaussian, and (2) the theoretical mean and variance can be estimated directly from the sample mean and sample variance. In other words, it may seem that reducing the prior information would lead unfailingly to a PME solution (i.e. Equation (84)) with easily obtainable parameters. However, although omission of known information may simplify the mathematics, it yields an unreliable solution, as discussed in the following section.

In regard to Question (1), consider a large set of crowdsourced responses to a problem for which the analyst receives just the sample mean and standard deviation, and not the full set of responses. Under these conditions, the resulting maximum entropy distribution is a normal distribution, Equation (84), and the maximum likelihood or Bayesian estimate of the mean of a normal distribution is precisely the sample mean, as expressed by Equation (49) for the coin-estimation experiment. Thus, use of the sample mean to estimate the population mean when the actual distribution is unknown is justified by the PME.

Moreover, the reverse logic also applies. To use the sample mean and standard deviation as the statistics representing the crowd’s collective answer to a problem is to assume implicitly that the responses received from the crowd were normally distributed. However, in the example of the coin-estimation experiment, that assumption is incorrect, as evidenced by the histogram of

There remains Question (3): Which statistic better represents the information of the crowd—the sample mean of a falsely presumed Gaussian distribution or the expectation value calculated from the appropriate log-normal distribution? The answer to this question is somewhat subjective, since it depends on how one views the process of crowdsourcing and what one expects to learn from it.

One way of thinking might be the following. Recall that the idea underlying crowdsourcing is to pose a problem to a large number of diverse, independent-minded people, who collectively represent a wide range of proficiencies and experiences, and see what answers they provide. It is assumed that the crowd will include some members who know enough to address the problem rationally, some members who will guess randomly, and most of the rest whose responses fall somewhere in between. Since the crowd is large and their responses anonymous, it is not possible to distinguish the experts from the random guessers, so one might just as well average all solutions with equal weighting, which is what the sample mean does. The fact that the sample mean 982 [Equation (49)] of the coin-estimation experiment was closer to the true number $N_c = 1111$ than the estimate 919 [Equation (54)] based on the log-normal distribution might seem to support this viewpoint.

There is, however, a different way to think about the question—but first examine

On theoretical grounds alone, the log-normal plot A manifests the most important statistical properties to be expected of the responses of a crowd to a problem calling for a positive numerical answer. The PDF $p_Z^{(\Lambda)}(z|m,s)$ must be 0 for all $z \le 0$, since every viewer could see that the tumbler had at least 1 coin (and, in fact, many more coins than 1). The shape of the plot, a main body of roughly Gaussian form coupled to a highly skewed right tail, graphically displays the distinction between informed respondents (main body) and random guessers (outliers under the heavy tail). Thus, without knowing which respondents submitted which estimates, the log-normal PDF appropriately weights each estimate depending on its value relative to the totality of estimates. If the most accurate estimate of $N_c$ should actually differ significantly from the mean of $\Lambda(\bar{m}, \bar{s}^2)$, that indicates that the crowd as a whole was not knowledgeable in regard to the posed problem.

A log-normal curve can be approximated by a normal curve, as carried out in detail in Appendix 2. The resulting Gaussian, which takes the form

$$p_Z^{(N)}(z|m,s) = \frac{1}{\sqrt{2\pi}\,(e^m s)} \exp\left(-\frac{(z-e^m)^2}{2(e^m s)^2}\right), \tag{85}$$

is shown as plot B in

A second, less accurate normal approximation to the log-normal plot A is obtained simply by substituting the log-normal mean and variance $(M_C, S_C^2) = (919, 756^2)$ of Equations (7) and (8) into a Gaussian PDF. The resulting distribution comprises plot C. The peak of plot C lies closer to $N_c$ than the peak of plot B, but plot C is wider, overlaps plot A less, ascribes higher probability than plot B to the outliers in the heavy tail of plot A, and extends more significantly into the unphysical negative region.

The final Gaussian, plot D, is the distribution predicted by the PME with sample mean and sample variance $(M_D, S_D^2) = (982, 1593^2)$ of the coin-estimation experiment with neglect of the non-negativity of the range of outcomes. Although its peak is the closest to $N_c$ of the four plots, plot D has the greatest width (and therefore greatest uncertainty), overlaps the true distribution (plot A) the least, gives the greatest weight to the outliers of plot A, and extends furthest into the domain of unphysical negative estimates. By weighting each estimate the same, the sample mean (center of plot D) ignores the distinction between informed respondents and wild guessers that is a critical part of the structure of plot A. In view of the adverse features of plot D, one must ask whether the fact that the mean of plot D, rather than the mean of plot A, lies closer to $N_c$ is in any way significant.
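The statement that plots B, C, and D extend progressively further into the unphysical negative region can be quantified with a short calculation. The Python sketch below evaluates the probability that each Gaussian assigns to $z < 0$, using the rounded parameters quoted in the text; the helper name `normal_cdf` and the dictionary layout are illustrative choices made here.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a Gaussian with the given mean and sd."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

m, s = 6.57, 0.72  # log-normal parameters of the coin-estimation experiment

# (mean, standard deviation) of each Gaussian plot
plots = {
    "B": (math.exp(m), math.exp(m) * s),  # Gaussian approximation, Equation (85)
    "C": (919.0, 756.0),                  # log-normal mean and standard deviation
    "D": (982.0, 1593.0),                 # sample mean and sample standard deviation
}

# Probability mass that each Gaussian places on negative numbers of coins
negative_mass = {name: normal_cdf(0.0, mean, sd) for name, (mean, sd) in plots.items()}
for name, mass in negative_mass.items():
    print(f"plot {name}: P(z < 0) = {mass:.3f}")
```

With these rounded inputs the negative-region probabilities are roughly 8%, 11%, and 27% for plots B, C, and D respectively, whereas the log-normal plot A assigns zero probability to $z \le 0$ by construction.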

The answer is “No”. Observe that the center of plot D can be displaced even further toward $N_c$ simply by increasing the number of outliers with values 3 or more times the value of $N_c$. In short, a statistic that can be made more accurate by the inclusion of estimates that are increasingly wrong is not reliable. Note that the effect of outliers on the theoretical mean of plot A is much weaker because (1) the exponential part of the log-normal PDF $p_Z^{(\Lambda)}(z|m,s)$ is a function of $\ln(z)$ rather than $z$, and (2) the non-exponential part of $p_Z^{(\Lambda)}(z|m,s)$ decreases inversely with increasing $z$.

Given the preceding observations regarding the plots of

The entropy of a distribution is a measure of its information content. Because the word “information” has different meanings in different fields of science and engineering that employ statistical reasoning, this section uses “information” as it is interpreted in physics—i.e. as a measure of uncertainty. The greater the entropy of a particular distribution, the greater is the uncertainty (and the lower is the reliability) of its predictive capability. The word “particular” is italicized above for emphasis so as to avoid misconstruing the objective of the method of maximum entropy.

When all one knows about a statistical system is partial prior information such as the mean and variance, the PME provides an inferential method to find the most probable distribution consistent with that information and only that information. This is the distribution that is consistent with the prior information in the greatest number of ways, i.e. the one that maximizes the entropy of the system. On the other hand, if an analyst has to choose between two known distributions for purposes of prediction, the better choice is the distribution for which the number of possible outcomes inconsistent with the observed properties of the system is fewer, i.e. the distribution with lower entropy.

The two distributions of relevance in this analysis of crowdsourcing are the log-normal and normal distributions whose entropies, given by Equation (59), are respectively evaluated to be

$$H_0^{(\Lambda)} = -\int_0^{\infty} p^{(\Lambda)}(z|m,s) \ln\left(p^{(\Lambda)}(z|m,s)\right) dz = \ln\left(\sqrt{2\pi e}\, s\right) + m, \tag{86}$$

$$H_0^{(N)} = -\int_{-\infty}^{\infty} p^{(N)}(z|a,b) \ln\left(p^{(N)}(z|a,b)\right) dz = \ln\left(\sqrt{2\pi e}\, b\right), \tag{87}$$

where the log-normal and Gaussian PDFs are respectively given by Equations (5) and (84). Substituting into Equations (86) and (87) the parameters obtained from the coin-estimation experiment (repeated below for convenience)

$$\text{Log-normal } \Lambda(m, s^2): \quad (m, s) = (6.57,\ 0.72), \tag{88}$$

$$\text{Normal } N(a_{\text{sample}}, b_{\text{sample}}^2): \quad (a, b) = (982.17,\ 1593.65), \tag{89}$$

yields entropies

$$H_0^{(\Lambda)} = 7.65, \tag{90}$$

and

$$H_0^{(N)} = 8.79, \tag{91}$$

in units of nats (i.e. natural entropy units), since the natural logarithm is used in the definition of entropy in physics. (In communication theory, the logarithm to base 2 is usually employed, in which case entropy is expressed in bits, i.e. binary digits).
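The entropies (90) and (91) can be reproduced approximately from the closed forms (86) and (87). The Python sketch below uses the rounded parameters of (88) and (89) and cross-checks the results against the differential entropies reported by SciPy's `lognorm` and `norm` distributions; the variable names are illustrative assumptions.

```python
import math
from scipy.stats import lognorm, norm

m, s = 6.57, 0.72        # log-normal parameters, Equation (88)
a, b = 982.17, 1593.65   # sample mean and standard deviation, Equation (89)

half_log_2pie = 0.5 * math.log(2.0 * math.pi * math.e)

H_lognormal = half_log_2pie + math.log(s) + m   # Equation (86)
H_normal = half_log_2pie + math.log(b)          # Equation (87)

# Independent cross-check: SciPy returns differential entropy in nats.
H_lognormal_scipy = float(lognorm(s=s, scale=math.exp(m)).entropy())
H_normal_scipy = float(norm(loc=a, scale=b).entropy())

print(f"log-normal: {H_lognormal:.3f} (closed form), {H_lognormal_scipy:.3f} (SciPy)")
print(f"normal:     {H_normal:.3f} (closed form), {H_normal_scipy:.3f} (SciPy)")
```

Because $m$ and $s$ are entered here to only two decimal places, the log-normal value can differ from (90) in the second decimal place.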

Although the numerical difference between relations (90) and (91), $H_0^{(N)} - H_0^{(\Lambda)} = 1.14$, may appear unremarkable, the micro-statistical implications are actually beyond imagining. The number $\Omega$ of possible samples of size $n$ consistent with the known prior information of a distribution formed from a particular sample, what in physics would be termed the multiplicity or number of accessible microstates [

$$\ln(\Omega) = n H_0. \tag{92}$$

The greater the entropy, the greater is the number of possible outcomes of any draw from the distribution describing the sample. It then follows from Equation (92) that the relative uncertainty—i.e. ratio of microstates—of the two distributions parameterized by (88) and (89) describing the BBC crowd of size n = 1706 is

$$\frac{\Omega^{(N)}}{\Omega^{(\Lambda)}} = \exp\left(n\left(H_0^{(N)} - H_0^{(\Lambda)}\right)\right) \approx 4.5 \times 10^{844}. \tag{93}$$

Numbers of the order of the ratio (93) rarely, if ever, occur even in physics on a cosmological scale. The import of (93) is that a vast number of Gaussian microstates—i.e. outcomes of the distribution (84) compatible with the prior information (89)—describe outcomes (e.g. negative numbers of coins) that are not compatible with the physical conditions of the experiment or the statistics of the crowd response as deducible from (88).
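The magnitude of the ratio (93) is too large to evaluate directly in double-precision arithmetic, so a short sketch of the calculation in base-10 logarithms may be useful. The Python code below uses the rounded entropy difference 1.14 quoted above; the variable names are illustrative.

```python
import math

n = 1706         # size of the BBC crowd
delta_H = 1.14   # H0(N) - H0(Lambda) in nats, from Equations (90) and (91)

# exp(n * delta_H) overflows a 64-bit float, so work with log10 of the ratio.
log10_ratio = n * delta_H / math.log(10.0)
exponent = math.floor(log10_ratio)
mantissa = 10.0 ** (log10_ratio - exponent)

print(f"Omega(N) / Omega(Lambda) ~ {mantissa:.1f} x 10^{exponent}")
```

With the rounded difference 1.14 the exponent is 844 and the mantissa is about 4.3; the prefactor 4.5 in (93) presumably reflects unrounded entropies, since a change of order $10^{-5}$ in the entropy difference is enough to shift the mantissa by that amount.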

Section 5.3 and the foregoing analysis of Section 5.4 call for revisiting

The answer again is “No”. In brief, all that the CLT tells us in regard to the coin-estimation experiment is this: if the experiment is run a large number of times $n$, then the variation (standard deviation) of the mean result will be narrower than the variation for a single run in proportion to $n^{-1/2}$. This is perfectly valid as applied to insert (a) since it derives from a legitimate single-run distribution function (of log-normal form) illustrated by the histogram A or plot B in

In sampling a large group of non-experts (a “crowd”) for the solution to a quantitative problem, there is no guarantee (e.g. by some principle of probability or statistics) that the answer provided by the crowd will be correct or accurate. What usable information the crowd may provide is encoded in the distribution of responses, which the analyst can observe empirically (e.g. as a histogram) or try to deduce theoretically (as in Part 1) by modeling the reasoning process of an informed and incentivized crowd. The distribution function provides the means for obtaining the mean, median, mode, variance, and higher-order moments of the hypothetical population of which the sampled crowd is an approximate representation. Without knowledge of the distribution, statistical measures of uncertainty cannot be interpreted probabilistically.

The antecedent paper [

In applications where the analyst receives only the mean response of the crowd and a measure of its uncertainty, the principle of maximum entropy shows that the most probable distribution compatible with this information is either a Gaussian (for outcomes that span the real axis) or a truncated Gaussian (for non-negative outcomes). It is possible, however, that the equations for the parameters of the maximum entropy distribution lead to no solution given the prior information. In such a case, as illustrated by the coin-estimation experiment, the sample mean of the crowd, irrespective of its value, is not a reliable statistic, since, without an underlying single-run distribution, no confidence limits can be assigned to the uncertainty of the sample mean.

The foregoing problem is in all cases avoidable if the analyst utilizes the complete set of responses from the crowd to obtain the sample distribution, either empirically or by appropriate modeling.

The author declares no conflicts of interest regarding the publication of this paper.

Silverman, M.P. (2019) Extraction of Information from Crowdsourcing: Experimental Test Employing Bayesian, Maximum Likelihood, and Maximum Entropy Methods. Open Journal of Statistics, 9, 571-600. https://doi.org/10.4236/ojs.2019.95038

Maximum Likelihood Solution to the Maximum Entropy Distribution of Coin Estimates

A general consequence of probability theory cited in Section 5.2 is that maximizing the entropy subject to constraints on the first and second moments is equivalent to maximizing the likelihood function over the manifold of sampling distributions selected by maximum entropy. The significance of this is that one can use the sample mean and sample mean square to obtain the first and second moments as prior information with which to derive the maximum entropy distribution. This equivalence is demonstrated below for the coin-estimation experiment, which is an archetype for problems whereby the outcomes are non-negative numbers.

The likelihood function for the set $\{z_k\}$, $k = 1, \ldots, n$, of estimates of the number of coins is given by

$$L = \prod_{k=1}^{n} p(z_k|a,b) = \frac{2^{n/2} \exp\left(-\sum_{k=1}^{n}(z_k-a)^2/2b^2\right)}{\pi^{n/2}\, b^n \left(1-\operatorname{erf}\left(-\frac{a}{\sqrt{2}\,b}\right)\right)^n}, \tag{94}$$

where the form of the maximum-entropy PDF derived on the basis of prior information (60)-(62) is given by Equation (76). The log-likelihood function is then

$$\mathcal{L}(\{z_k\}|a,b) = -n\ln(b) - n\ln\left(1-\operatorname{erf}\left(-\frac{a}{\sqrt{2}\,b}\right)\right) - \sum_{k=1}^{n}(z_k-a)^2/2b^2, \tag{95}$$

where only terms involving parameters $a$ and $b$ were included.

The ML equations for the parameters are

$$\frac{\partial \mathcal{L}}{\partial a} = 0 \;\Rightarrow\; a = \overline{Z} - b\,q(a,b), \tag{96}$$

$$\frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; b^2 = n^{-1}\sum_{k=1}^{n}(z_k-a)^2 + ab\,q(a,b), \tag{97}$$

where

$$q(a,b) = \sqrt{\frac{2}{\pi}}\, \frac{e^{-a^2/2b^2}}{1+\operatorname{erf}\left(\frac{a}{\sqrt{2}\,b}\right)} \tag{98}$$

was defined previously in Equation (79), and

$$\overline{Z} = n^{-1}\sum_{k=1}^{n} z_k \tag{99}$$

is the sample mean.

Comparison of Equations (96) and (77) shows that the two equations are equivalent if the expectation value $\langle Z \rangle$ is estimated by the sample mean (99). Furthermore, replacement of $a$ in Equation (97) by the right-hand side of Equation (96) and comparison with Equation (83) leads to the equivalence of the expectation value $\langle Z^2 \rangle$ and the sample mean-square

$$\overline{Z^2} = n^{-1}\sum_{k=1}^{n} z_k^2. \tag{100}$$

Thus, the ML and PME equations lead to the same distribution parameters when the first and second moments in the maximum entropy equations are estimated by the sample moments obtained by the method of maximum likelihood.
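The equivalence can be illustrated numerically. The Python sketch below is an illustration under stated assumptions rather than part of the original analysis: it generates a synthetic non-negative sample (the coin estimates themselves are not reproduced here), maximizes the log-likelihood (95) with SciPy's Nelder-Mead routine, and then evaluates the residuals of the moment constraints (82) and (83) at the ML solution, using the sample moments for $\alpha_1$ and $\alpha_2$. The sample parameters, seed, and function names are all hypothetical choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import log_ndtr

# Synthetic non-negative sample: draws from N(2, 1.5^2) with negatives rejected,
# which is a Gaussian truncated at zero (illustrative values, not the BBC data).
rng = np.random.default_rng(0)
draws = rng.normal(loc=2.0, scale=1.5, size=20000)
z = draws[draws >= 0.0]

def neg_log_likelihood(params):
    """Negative of Equation (95) divided by n, with additive constants dropped."""
    a, b = params
    if b <= 0.0:
        return np.inf
    # ln(1 - erf(-a / (sqrt(2) b))) = ln(2) + log_ndtr(a / b); ln(2) is a constant.
    return np.log(b) + log_ndtr(a / b) + np.mean((z - a) ** 2) / (2.0 * b**2)

start = np.array([z.mean(), z.std()])
fit = minimize(neg_log_likelihood, start, method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-12, "maxiter": 4000})
a_hat, b_hat = fit.x

def q(a, b):
    """q(a, b) of Equation (98), written with the normal log-CDF for stability."""
    t = a / b
    return np.exp(-0.5 * t**2 - 0.5 * np.log(2.0 * np.pi) - log_ndtr(t))

# Residuals of the PME constraints (82) and (83) at the ML solution.
res_mean = a_hat + b_hat * q(a_hat, b_hat) - z.mean()
res_mean_sq = (a_hat**2 + b_hat**2 + a_hat * b_hat * q(a_hat, b_hat)
               - np.mean(z**2))
print("ML estimates:", a_hat, b_hat)
print("residuals of (82) and (83):", res_mean, res_mean_sq)
```

Both residuals should vanish to within the tolerance of the optimizer, which is the numerical counterpart of the statement that the ML and PME equations select the same parameters.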

Gaussian Approximation to a Log-Normal Distribution

The PDF of a general log-normal as defined in Equation (5) is repeated below for convenience

$$p(z|m,s) = \frac{1}{\sqrt{2\pi}\, s z} \exp\left(-(\ln(z)-m)^2/2s^2\right). \tag{101}$$

Transformation of the location parameter m

$$\mu_0 = e^m, \tag{102}$$

and change of variable

where

Neglect of

with mean