
Complete prior statistical information is currently required in the majority of statistical evaluations of complex models. The principle of maximum entropy is often utilized in this context to fill in the missing pieces of available information and is normally claimed to be fair and objective. A rarely discussed aspect is that it relies upon *testable* information, which is never known but estimated, *i.e.* results from processing of raw data. The subjective choice of this processing strongly affects the result. Less conventional posterior completion of information is equally accurate but computationally superior to prior completion, as much less information enters the analysis. Our recently proposed methods of lean deterministic sampling are examples of the very few approaches that actively promote the use of minimal incomplete prior information. The inherent subjective character of maximum entropy distributions and the often critical implications of prior and posterior completion of information are here discussed and illustrated, from a novel perspective of consistency, rationality, computational efficiency and realism.

The principle of maximum entropy (PME) can be utilized to determine a probability density distribution function (pdf) from incomplete statistical information. The approach is not limited to determination of prior pdfs in Bayesian estimation, even though that is a common application. It is rather a general recipe for making known but incomplete statistical information complete with the most ‘fair’, or weakest possible, hypotheses. As such it fits very well into our practice of statistics, where the lack of complete information is the rule rather than the exception. For instance, knowing only the mean and the variance of an uncertain parameter, PME results in a normal distribution [

with estimators [
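This standard result can be verified numerically. The sketch below is our illustration, not part of the cited derivation: the PME pdf for the constraints $\langle x\rangle = \mu$ and $\langle x^2\rangle = \mu^2 + \sigma^2$ has the form $\exp(-\lambda_1 x - \lambda_2 x^2)$ up to normalization, and the multipliers can be found by Newton iteration on a grid. The grid, start values and tolerances are arbitrary choices.

```python
import numpy as np

# Sketch (our illustration): the PME pdf for the constraints <x> = mu and
# <x^2> = mu^2 + sigma^2 is p(x) ~ exp(-l1*x - l2*x^2). The multipliers are
# found by Newton iteration; grid, start values and tolerances are ad hoc.
mu, sigma = 1.0, 0.5
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 4001)
dx = x[1] - x[0]

def pdf(l1, l2):
    w = np.exp(-l1 * x - l2 * x * x)
    return w / (w.sum() * dx)          # lambda_0 is absorbed by normalization

l1, l2 = 0.0, 0.5                      # start from a standard normal shape
target = np.array([mu, mu**2 + sigma**2])
for _ in range(50):
    p = pdf(l1, l2)
    m1 = np.sum(x * p) * dx
    m2 = np.sum(x**2 * p) * dx
    g = np.array([m1, m2]) - target    # constraint residuals
    if np.abs(g).max() < 1e-12:
        break
    m3 = np.sum(x**3 * p) * dx
    m4 = np.sum(x**4 * p) * dx
    # d<g_i>/d lambda_j = -cov(g_i, g_j) under the current pdf
    J = -np.array([[m2 - m1**2, m3 - m1 * m2],
                   [m3 - m1 * m2, m4 - m2**2]])
    l1, l2 = np.array([l1, l2]) - np.linalg.solve(J, g)

normal = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
print(np.max(np.abs(pdf(l1, l2) - normal)))   # close to zero
```

The printed maximum deviation from the normal pdf is at the level of the quadrature error.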

Bayesian estimation generalizes traditional approaches limited to observations by inclusion of prior knowledge. Its claimed advantage or superiority relies heavily upon fair and truthful assignment of the prior pdf. If the method applied to determine the prior pdf (like PME) turns out to be subjective, the legitimacy of the approach is degraded. Indeed, PME is often motivated by its fairness or objectivity [

Given a set of observations it is not evident how they should be processed to provide testable information of the highest possible quality, i.e. with minimal residual uncertainty. For instance, which moments should be estimated, given that their quality (variance of the estimator) usually decreases with the order? As will be shown, the selection directly influences the PME distribution function. It is also not evident why PME should be restricted to prior application, as usually practiced. Posterior utilization might in fact simplify the analysis greatly. After all, PME is a general method with no explicit reference to what the distribution describes, the input or the output of the analysis.

For a continuous sample space, the entropy [
$$ S = -\int \bar{p}(\bar{x})\,\ln \bar{p}(\bar{x})\; d\bar{x}. \qquad (1.1) $$

The bar symbol restricts the integration to distinguishable outcomes. That accounts for our possible ignorance of not distinguishing distinct outcomes. As there is an obvious contradiction in being aware of ignorance, it is better translated into irrelevance (for our stated problem). The integration over $\bar{x}$ can be extended to all $x$ by locally measuring the relative density of $\bar{x}$ with the Lebesgue measure, $m(x)$. Consistency requires transformation invariance [

Utilizing $m(x)$, the integral in Equation (1.1) is converted into a Riemann integral,
$$ S = -\int p(x)\,\ln\!\left[\frac{p(x)}{m(x)}\right] dx. $$

The optimization is subject to all known testable information,
$$ \int g_k(x)\,p(x)\,dx = \langle g_k\rangle, \qquad k = 0, 1, \dots, N. $$

The functions $g_k(x)$ will for convenience be denoted test functions. The mandatory zeroth constraint (Equation (1.4)), with $g_0(x) = 1$ and $\langle g_0\rangle = 1$, is the normalization condition. As pointed out [

The Lagrange multipliers $\lambda_k$ of the optimization are implicitly given by the testable information,
$$ \int g_k(x)\, m(x)\exp\!\Big(-\sum\nolimits_{j=0}^{N}\lambda_j g_j(x)\Big)\, dx = \langle g_k\rangle, \qquad k = 0, 1, \dots, N. $$

The maximum entropy pdf is then given by,
$$ p(x) = m(x)\exp\!\Big(-\sum\nolimits_{k=0}^{N}\lambda_k g_k(x)\Big). $$

The PME solution (Equation (1.8)) does not specify the test functions. They are of major importance though, since the solution is directly expressed in them. As stated in the introduction, there is a convention of setting $g_1(x) = x$, $g_2(x) = x^2$, etc. That is a habit, not a prescription. The choice of test functions must consequently be considered to be at our free disposal.

The difficulty or accuracy of estimating $\langle g\rangle$ is dependent on the explicit form of $g(x)$. Also, the information contained in the observation set is to a variable degree transferred to the estimate of $\langle g\rangle$. For instance, if $g(x) = x$ all observations are equally weighted, but if $g(x) = x^n$,

the exponent $n$ determines how much different observations contribute. In the asymptotic limit $n \to \infty$, the observation with the largest deviation is much more important than any other (which then only contributes to the estimate of the mean). A lot of information is obviously disregarded. The estimator covariance will accordingly be large. Nevertheless, in many situations the range [

is of much larger interest than any other information. The confidence interval is a more general statistic allowing for sample spaces without bound. From the perspective of objectivity, it thus appears difficult to prescribe any specific set of test functions, or even their number $N$. Clearly, the larger the amount of data or independent information that is available, the larger $N$ is allowed without resulting in unacceptable estimator quality, as expressed by its bias and covariance.

To illustrate how subjectivity enters in practice, assume we have gathered a set of prior observations of a phenomenological constant contained in a computer model. To calibrate the model [

Since we are not aware of any ignorance, we set $m(x) = 1$. With a single test function $g(x) = x^{n}$, the condition Equation (1.6) on $\lambda_0$ implies,
$$ \exp(\lambda_0) = \int \exp\big(-\lambda_n x^{n}\big)\, dx. $$

A fair amount of subjective pragmatism now suggests the order $n$ to be limited to even numbers and the support to extend to the whole real axis, since that allows us to determine the multipliers with ease. The symmetry of the integrand then implies that all odd moments vanish. If negative values of the parameter are prohibited, the symmetric support is only an approximation. The current assignment then translates into an approximation of the support as well as of the normalization factor. The remaining Lagrange multiplier is obtained by re-scaling,
$$ \lambda_n = w^{-n}. $$

The resulting pdf,
$$ p(x) = \frac{\exp\big(-(x/w)^{n}\big)}{2\,w\,\Gamma(1 + 1/n)}, $$
where $w$ describes the order-$n$ “width” of the pdf, is for different $n$ displayed in
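The normalization of this family, $\int \exp(-(x/w)^n)\,dx = 2w\,\Gamma(1+1/n)$ for even $n$, can be checked numerically (our sketch; the width and the orders are arbitrary choices):

```python
import math

import numpy as np

# Sketch: the normalization of p(x) ~ exp(-(x/w)^n), even n, equals
# 2*w*Gamma(1 + 1/n); the width w and the orders are arbitrary choices.
w = 1.5
x = np.linspace(-12 * w, 12 * w, 200001)
dx = x[1] - x[0]
checks = {}
for n in (2, 4, 8):
    Z_num = np.sum(np.exp(-(x / w) ** n)) * dx   # Riemann sum over the grid
    Z_ana = 2.0 * w * math.gamma(1.0 + 1.0 / n)
    checks[n] = (Z_num, Z_ana)
    print(n, round(Z_num, 6), round(Z_ana, 6))
```

For growing $n$ the pdf approaches a uniform distribution on $[-w, w]$, as its normalization approaches $2w$.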

In practice, a testable piece of information is estimated from raw observations with a quality depending on the test function. The maximum entropy pdf can be directly formulated in our observations, using a specific estimator of the $n$th moment around the mean. An obvious estimator is given by,
$$ \widehat{m_n} = \frac{A}{M}\sum_{i=1}^{M}\big(x_i - \bar{x}\big)^{n}, \qquad \bar{x} = \frac{1}{M}\sum_{i=1}^{M} x_i, $$

where $A$ is an unknown constant which hopefully

can be selected to eliminate the bias, e.g. $A = M/(M-1)$ for $n = 2$. To express the bias of $\widehat{m_n}$ in terms of the true moments $m_k$, expand $(x_i - \bar{x})^{n}$ and calculate its expectation over observations,

On the other hand,

Clearly, the number of terms of $m_n$ is much less than for $\langle\widehat{m_n}\rangle$ when $n > 2$. Elimination of bias by proper selection of $A$ requires proportionality between $\langle\widehat{m_n}\rangle$ and $m_n$. That cannot in general be achieved since one of them has many more terms than the other. Thus, no scaling of $\widehat{m_n}$ can make it universally unbiased. Evaluating the first two coefficients of the expansion in Equation (1.12),

Surprisingly, these two terms satisfy the proportionality required to eliminate the bias. Bias can thus be eliminated by rescaling up to $n = 3$, but not for $n \ge 4$ (see estimation of kurtosis in [

While $A_2 = M/(M-1)$, $A_3 = M^{2}/\big[(M-1)(M-2)\big]$. Clearly, $A_3$ bears little resemblance to the conventional normalization with $M-1$, the number of degrees of freedom.
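A small simulation illustrates the point (our sketch; the exponential distribution, with central moments $m_2 = 1$, $m_3 = 2$, $m_4 = 9$, the sample size $M = 5$ and the repetition count are arbitrary choices):

```python
import numpy as np

# Sketch: Monte Carlo check of the bias of rescaled central-moment
# estimators. Exp(1) has central moments m2 = 1, m3 = 2, m4 = 9.
rng = np.random.default_rng(1)
M, reps = 5, 400000
xs = rng.exponential(1.0, size=(reps, M))
d = xs - xs.mean(axis=1, keepdims=True)     # deviations from the sample mean

A2 = M / (M - 1)                            # removes the bias for n = 2
A3 = M**2 / ((M - 1) * (M - 2))             # removes the bias for n = 3
m2 = A2 * np.mean(d**2)
m3 = A3 * np.mean(d**3)
m4 = np.mean(d**4)                          # no constant A can repair n = 4
print(m2, m3, m4 / 9)                       # ~1, ~2, clearly below 1
```

The fourth-moment estimate stays far from its target for every constant rescaling, since its expectation mixes $m_4$ and $m_2^2$ with different $M$-dependent weights.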

After failing to obtain an unbiased estimator of the moment around an unknown mean, we may lower the ambition by assuming the mean $\mu$ is known. The corresponding estimator reads,
$$ \widehat{m_n} = \frac{1}{M}\sum_{i=1}^{M}\big(x_i - \mu\big)^{n}. $$

This is a considerably simpler situation, for which the normalization makes the estimator unbiased for all $n$. The expected precision of any estimator, describing its typical relative variation, may be defined by,
$$ \varepsilon_n \equiv \frac{\sqrt{\operatorname{var}\big(\widehat{m_n}\big)}}{m_n}. \qquad (1.17) $$

The least possible variance of any unbiased estimator of $m_n$ is given by the Cramér-Rao lower bound [

Since the bound increases with $n$, it is indeed more difficult to determine higher order moments accurately on an absolute scale, with any estimator. The efficiency of our specific estimator measures its relative quality,
$$ e_n = \frac{\operatorname{var}_{\mathrm{CR}}\big(\widehat{m_n}\big)}{\operatorname{var}\big(\widehat{m_n}\big)}, $$

where $n$ is even and the precision is evaluated in Appendix B. For an efficient [

To conclude, the determination of the PME pdf involved several subjective choices of processing the observations. The shape was here restricted by using monomial test functions. The order was limited to be even, to simplify the calculation of the integrals. Likewise, an infinite symmetric sample space was assigned. The difficulty of estimating various statistical moments around the mean increases rapidly with the order. If there are not more than a few samples it is in most cases impossible to reliably estimate any other moments than the first and the second. The application of PME thus relied upon rational selection rather than objective deduction.

The PME constructs complete pdfs from incomplete testable information. It can be applied to the input (PME prior), or the output (PME post(-erior)) of the analysis. As no restriction or preference is stated in PME, both alternatives deserve to be considered and evaluated for efficiency and accuracy. The default method of linearization (LIN) [

in general propagate statistics like covariance, which can later be expanded to information related to entire pdfs, e.g. confidence intervals. That is, LIN, UKF, and DS promote PME post. Monte Carlo simulations [

An elementary example illustrates the principal differences between PME prior and PME post. Assume it is of interest to calculate the confidence interval of an uncertain model, dependent on one parameter. Let the model uncertainty [


Applying PME post to the resulting mean and standard uncertainty implies that the result is normally distributed, analogously to the input pdf in RS. That implies a coverage factor, resulting in a confidence interval,
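For a normally distributed result the coverage factor follows from the inverse error function. A minimal sketch (the 95 % confidence level is chosen purely for illustration):

```python
import math

# Sketch: two-sided coverage factor k for a normal pdf, found by bisection
# on erf; the 95 % confidence level below is an illustrative choice.
def coverage_factor(level):
    lo, hi = 0.0, 10.0
    while hi - lo > 1e-12:
        mid = 0.5 * (lo + hi)
        if math.erf(mid / math.sqrt(2.0)) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(coverage_factor(0.95), 3))   # 1.96
```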

For estimating confidence intervals, PME does not in fact need to be utilized at all. By using another DS technique of ours, sampling on confidence boundaries [
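The elementary example can be sketched numerically. Below, a hypothetical non-linear model $y = \exp(x)$ with $x \sim N(\mu, \sigma)$ is propagated with a symmetric two-point ensemble $\{\mu-\sigma, \mu+\sigma\}$, which reproduces the prescribed mean and variance, and compared with random sampling. The model, the sample count of 3000 and the two-point ensemble are our illustrative choices, not the specific DS ensembles of the cited works.

```python
import numpy as np

# Sketch (our illustration): propagate mean and standard deviation of
# y = exp(x), x ~ N(mu, s), with 2 deterministic samples versus 3000
# random samples. Model and sizes are arbitrary choices.
mu, s = 0.0, 0.1
f = np.exp

ens = np.array([mu - s, mu + s])       # two-point ensemble: exact mean/var
y = f(ens)
ds_mean = y.mean()                     # deterministic-sampling mean of y
ds_std = 0.5 * abs(y[1] - y[0])        # and its standard deviation

rng = np.random.default_rng(0)
ymc = f(rng.normal(mu, s, 3000))       # RS: 3000 model evaluations
mc_mean, mc_std = ymc.mean(), ymc.std(ddof=1)
print(ds_mean, mc_mean)                # both close to exp(s**2/2) = 1.005
print(ds_std, mc_std)                  # both close to 0.10
```

For this mildly non-linear model the two evaluations and the 3000 evaluations agree to well within the statistical scatter of the latter.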

Consistency in evaluating confidence intervals of models here leads to the related concept of confidence boundaries in parameter space. For multivariate models with non-linear dependencies on parameters, both these results need to be properly generalized [5,6]. Propagating a PME prior probability density function thus typically requires several thousands of model samples [13,14] or complex algorithms [

PME prior requires much more information to be analyzed than PME post. For the example above, it means 3000 or 2 model evaluations. Plain rationality thus strongly favors PME post. The reason for requesting complete prior information is likely an ambition to find unique, unquestionable answers, even if that in practice requires an extensive amount of blind assignments or assumptions. Unique is not equivalent to accurate, however, and the quality of any assignment should be critically judged. The loss in efficiency of PME prior compared to PME post is not compensated by superior accuracy. In a number of cases, DS produces the correct result without any error: any moment of any model given by a finite Taylor expansion can be exactly calculated with stratified DS [

all parameters, and is any non-linear monotonic function. For comparison, we are not aware of any general analytic method to propagate a univariate pdf determined by PME prior through any non-linear model. In this case, RS is a numerical method that yields arbitrarily small errors, but at very high computational cost. For “genuinely” multivariate problems RS ceases to be accurate. Multivariate refers to nontrivial finite dependencies of any order, as required for optimal modeling [
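The exactness for polynomial models can be illustrated with a symmetric three-point ensemble (our sketch; the cubic model, $\mu$ and $\sigma$ are arbitrary choices). The points $\mu \pm \sqrt{3}\,\sigma$ and $\mu$ with weights $1/6$, $1/6$ and $2/3$ match the moments of a normal parameter up to order five, so the mean of any polynomial of degree at most five is reproduced without error:

```python
import numpy as np

# Sketch: a 3-point deterministic ensemble for x ~ N(mu, s) matching
# moments up to order 5, so means of polynomials up to degree 5 are exact.
mu, s = 0.7, 0.3
pts = mu + s * np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
wts = np.array([1.0, 4.0, 1.0]) / 6.0

def model(x):                          # arbitrary cubic test model
    return x**3 - 2.0 * x**2 + x + 1.0

ds_mean = np.dot(wts, model(pts))
# analytic: E[x^2] = mu^2 + s^2, E[x^3] = mu^3 + 3*mu*s^2
exact = (mu**3 + 3 * mu * s**2) - 2 * (mu**2 + s**2) + mu + 1.0
print(ds_mean - exact)                 # zero to machine precision
```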

With these observations, the traditional preference to prior instead of posterior completion of information with PME appears biased and strongly subjective. It is highly questionable if PME post methods like DS [6,11,16,17] can be claimed less accurate than state-of-the-art RS relying upon PME prior per se. As both practice statistical sampling once defined by Enrico Fermi [

Bayesian estimation is formulated for PME prior. The combination of the maximum likelihood function and the prior distribution makes the Bayesian approach superior to traditional approaches limited to the former. In practice though, two functions (PME prior) or two discrete sets of statistics (PME post) derived from observations are fused. Indeed, the common assumption of Gaussian noise and a PME prior derived from the mean and covariance results in a combination of covariance matrices (numbers) and not pdfs [

PME post (DS) makes it evident that results are seldom unique, while PME prior (RS) conceals ambiguities in more or less dubious or blind assignments in the problem set-up. The indefiniteness of the result is an unavoidable consequence of starting with incomplete information, not of the analysis. For PME post analyses (DS) this can easily be illustrated, for instance using different ensembles [

The PME test functions should be selected according to the evaluation of the model, its behavior, and the quality of estimating the corresponding testable information. If we are interested in propagating the covariance of a signal processing model [

The ubiquitous sparsity of statistical information in virtually all practical problems needs to be seriously addressed. If the realism of two approaches is comparable but their efficiency distinctively different, the choice should be based on rationality rather than consistency. Realism is distinct from resolution. Any complete set of assumptions will result in arbitrarily high resolution, no matter how realistic the assumptions are. As the fidelity of the result can never be enhanced with blind assignments, it is questionable whether statistical analyses should be made with distribution functions, rather than with the testable information (statistics) these are derived from with PME. If desired, both approaches result in distribution functions. The completion of information is just made before (PME prior) or after (PME post) the analysis. The rational choice is PME post, as it propagates a minute fraction of the statistical information used in PME prior.

The character of maximum entropy (PME) distribution functions has been discussed. There are two principal innovations of our study. The PME solutions can to some extent be controlled by how primary data is processed. The prevailing preference for applying PME to the input (prior) and not the output (posterior) of statistical analyses is difficult to justify, as their accuracy is comparable but the latter is computationally superior. The choice appears governed by the method of analysis. Hence, subjectivity enters into the processing of data as well as into how the analysis is made.

A non-trivial selection enters when observations are processed into testable information. A simple example illustrated common subjective choices, giving different results for identical observations.

Redirecting the focus from the treatment of known information to the targeted evaluation of the analysis emphasizes consistency, rather than objectivity. When consistency is indecisive, rationality or efficiency of the analysis provides obvious guidance. PME can be applied to find prior as well as posterior distributions. Its unconventional posterior application deserves to be seriously considered, as the analysis involves much less statistical information and is correspondingly more efficient than the prior. Consistency and rationality thus fundamentally question the prevailing method of completing statistical information prior to the analysis, as in e.g. Monte Carlo simulations.

Maximal consistency and rationality are indeed primary goals of all our proposed methods of deterministic sampling. For complex models, such lean and customized approaches are often required to obtain any measure of modeling quality at all (within acceptable computational time). Without assessment of quality, any (modeling) result is of no value. These aspects are thus of paramount importance to our society where complex calculations (technical, physical, econometrical etc.) are rapidly increasing due to the fast development of computers.

Assume a random variable $x$ with zero mean is normally distributed, $x \sim N(0, \sigma)$. By integrating by parts it can be shown that $\langle x^{2k}\rangle = \sigma^{2k}(2k-1)!!$, where $!!$ denotes the semi-factorial function. The probability distribution function for $x$ can then be expressed in the moment $m_n$,
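The semi-factorial relation can be checked by direct numerical integration (our sketch; $\sigma$ and the orders are arbitrary choices):

```python
import math

import numpy as np

# Sketch: check <x^(2k)> = sigma^(2k) * (2k-1)!! for x ~ N(0, sigma).
sigma = 1.3
x = np.linspace(-12 * sigma, 12 * sigma, 400001)
dx = x[1] - x[0]
p = np.exp(-x**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

def semifactorial(m):                  # m!! = m * (m - 2) * ... * 1
    return 1 if m <= 0 else m * semifactorial(m - 2)

checks = {}
for k in (1, 2, 3, 4):
    num = np.sum(x**(2 * k) * p) * dx  # Riemann sum of the 2k-th moment
    ana = sigma**(2 * k) * semifactorial(2 * k - 1)
    checks[k] = (num, ana)
    print(k, num, ana)
```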

For independent observations, the likelihood function is given by,

The Cramér-Rao lower bound ([
$$ \operatorname{var}\big(\widehat{m_n}\big) \ge I^{-1}, $$

where $I$ is the Fisher information matrix (a scalar for one parameter),
$$ I = -\left\langle \frac{\partial^{2} \ln L}{\partial m_n^{2}} \right\rangle. $$

The expected precision (Equation (1.17)) of the estimator hence satisfies,

An estimator of the $n$th statistical moment of $x$ around a known mean $\mu$, from a set of $M$ independent observations, is given by,
$$ \widehat{m_n} = \frac{B}{M}\sum_{i=1}^{M}\big(x_i - \mu\big)^{n}, $$

where the normalization constant $B$ is chosen to minimize its bias. Since $\mu$ is a known constant it is trivially found that $B = 1$ eliminates all bias, for all values of $n$. Its variance is found to be,
$$ \operatorname{var}\big(\widehat{m_n}\big) = \frac{m_{2n} - m_n^{2}}{M}. $$

An explicit value can be found for a normally distributed parameter, $x \sim N(\mu, \sigma)$. Then, by recursively integrating by parts it is found that $m_{2k} = \sigma^{2k}(2k-1)!!$, where $!!$ is the semifactorial function, giving
$$ \operatorname{var}\big(\widehat{m_n}\big) = \frac{\sigma^{2n}\big[(2n-1)!! - \big((n-1)!!\big)^{2}\big]}{M}, \qquad n \text{ even}. $$
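This variance formula can be checked by simulation (our sketch; $\mu$, $\sigma$, $M$, $n$ and the repetition count are arbitrary choices):

```python
import numpy as np

# Sketch: Monte Carlo check of var(m_n) = sigma^(2n) * ((2n-1)!! -
# ((n-1)!!)^2) / M for the known-mean estimator, x ~ N(mu, sigma), n = 4.
rng = np.random.default_rng(2)
mu, sigma, M, n, reps = 2.0, 1.5, 10, 4, 200000

xs = rng.normal(mu, sigma, size=(reps, M))
est = np.mean((xs - mu)**n, axis=1)    # B = 1: unbiased for every n

def semifactorial(m):                  # m!! = m * (m - 2) * ... * 1
    return 1 if m <= 0 else m * semifactorial(m - 2)

var_mc = est.var(ddof=1)
var_th = sigma**(2 * n) * (semifactorial(2 * n - 1)
                           - semifactorial(n - 1)**2) / M
print(est.mean(), sigma**n * semifactorial(n - 1))   # unbiased: both equal
print(var_mc / var_th)                               # close to 1
```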