Fluctuation-Model-Based Discrete Probability Estimation for Small Samples

A robust method is proposed for estimating discrete probability functions for small samples. The proposed approach introduces and minimizes a parameterized objective function that is analogous to free energy functions in statistical physics. A key feature of the method is a model of the parameter that controls the trade-off between likelihood and robustness in response to the degree of fluctuation. The method thus does not require the value of the parameter to be manually selected. It is proved that the estimator approaches the maximum likelihood estimator at the asymptotic limit. The effectiveness of the method in terms of robustness is demonstrated by experimental studies on point estimation for probability distributions with various entropies.


Introduction
For categorical observational data analysis, it is often necessary to deal with multivariate systems, since the variables of such data generally depend on each other. Highly predictive statistical inference requires parameters that achieve low entropy, so it is preferable to use data with many variables, because the following relationship between the Shannon entropies H of random variables X and Y holds if X and Y depend on each other:

H(X|Y) < H(X),

where H(X|Y) denotes the conditional entropy of X given Y. These entropies are respectively defined with marginal, joint, and conditional probability mass functions P as follows [1]:

H(X) = -∑_i P(x_i) log P(x_i),
H(X, Y) = -∑_{i,j} P(x_i, y_j) log P(x_i, y_j),
H(X|Y) = -∑_{i,j} P(x_i, y_j) log P(x_i | y_j),

where i and j are indices of the discrete states of X and Y. When we estimate probabilities in discrete probabilistic models with many variables (e.g., Markov network and Bayesian network models [2]), statistical estimation of conditional and joint probabilities often needs exponentially large data because of the combinatorial explosion of joint states of the variables. Therefore, such models are often inferred from insufficient data. The maximum likelihood (ML) method provides an estimated probability function, which for X with k discrete states is expressed by

P̂(x_i) = n_i / n,

where n and n_i respectively denote the sample size and the frequency of occurrences of state i. The estimated probability is correct in the large-sample limit. However, ML methods suffer with small data, and, as far as we know, few robust methods have been investigated for such data in the estimation of discrete probability functions, although many robust methods for outliers, such as M-estimators, have been developed [3] [4] for parametric continuous distributions. The maximum entropy method [5], which may be applied to small datasets, was originally appropriate for data with missing information.
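The entropy relations above can be checked numerically. The following sketch (the joint table is made up for the example) computes H(X), H(X, Y), and H(X|Y) via the chain rule H(X|Y) = H(X, Y) - H(Y):

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum p log p in nats, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

# An illustrative joint probability mass function P(X, Y) with dependent
# X (rows) and Y (columns).
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)              # marginal P(X)
py = pxy.sum(axis=0)              # marginal P(Y)

h_x = entropy(px)                 # H(X)
h_xy = entropy(pxy.ravel())       # H(X, Y)
h_x_given_y = h_xy - entropy(py)  # chain rule: H(X|Y) = H(X,Y) - H(Y)

print(h_x_given_y < h_x)          # True: conditioning on a dependent Y lowers entropy
```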
In the present study, a new robust method is proposed for estimating discrete probability functions for small samples. The method uses a parameterized objective function based on the Kullback-Leibler divergence [6] and the Shannon entropy. The function has a form similar to that of the Helmholtz free energy function that appears in statistical physics [7] [8]. A key feature of the method is a model of the parameter that controls the trade-off between likelihood and robustness in response to the degree of data fluctuation. The method thus does not require the value of the parameter to be manually selected. This model is a modification of a preceding work [9], in which the parameter is represented by an artificial model containing a free hyperparameter.
In the domain of machine learning, several methods somewhat similar to ours have been proposed [10]-[14], but there is a critical distinction between these methods and ours. Many studies that have applied free energy to statistical inference have not included a comparable trade-off parameter, or have treated it as a fixed value, a manually controlled parameter, or a free parameter. We thus consider that the potential of free-energy-like functions has not been well exploited by the existing methods. Other similar methods have been developed in the context of robust estimation against outliers, in which a free parameter is introduced in an analogous fashion [3] [15] [16]. However, the problem of how to determine the values of these free parameters remains.
This paper is organized as follows. In the next section, an objective function with parameter β is introduced for robust estimation, and the probability functions obtained by the proposed method are shown to be formally equivalent to the canonical distributions that appear in statistical physics. In Section 3, a new representation of β is presented as a data-fluctuation model, and a preferable asymptotic property of β is proved. In Section 4, some characteristic properties of the quantities used in the proposed method are provided. In Section 5, we perform experiments using the proposed probability estimation method. In Section 6, conclusions regarding the estimation method are given.

Probability Estimation with Parameter β
Note that in this paper a capital letter (such as X) denotes a discrete random variable, a lower-case letter (such as x) denotes a specific state of that variable, a bold capital letter (such as Y) denotes a set of variables, and a bold lower-case letter (such as y) denotes a configuration of that set.
To construct a method for estimating finite discrete probability distributions of a random variable X from a sample set of finite size n, the following quantities are defined. P(X) denotes the discrete probability function estimated by the proposed method described below. A function U_0(X) is defined on the basis of the Kullback-Leibler (KL) divergence [6] between P(X) and the empirical function as follows:

U_0(X) := D_KL(P(X) ‖ P̃(X)) = ∑_i P(x_i) log [ P(x_i) / P̃(x_i) ],   (1)

where P̃(X) is an empirical distribution function. For non-parametric discrete distributions, the empirical distributions are equivalent to relative frequencies; i.e., they are equivalent to the maximum likelihood (ML) distributions. P̃(X) can thus be replaced by the ML distribution, denoted by P̂(X). An objective function F(X) is defined as follows:

F(X) := U_0(X) - (1/β_0(X)) H(X),   (2)

where U_0(X) is defined by Equation (1), β_0(X) > 0 is a parameter whose model is given in the next section, and H(X) is the Shannon entropy [1] of the estimated function:

H(X) = -∑_i P(x_i) log P(x_i).   (3)

For later convenience, the normalized parameter β(X) is introduced as

β(X) := β_0(X) / (1 + β_0(X)),   (4)

so that 0 < β(X) < 1. By defining

u(x_i) := -log P̂(x_i)   (6)

and U(X) := ⟨u⟩ = ∑_i P(x_i) u(x_i), where ⟨·⟩ denotes an expectation value with respect to P(X), F(X) can be rewritten as

F(X) = U(X) - (1/β(X)) H(X).   (5)

The estimator of the probability function, P(X), is defined so as to minimize a Lagrangian L consisting of F(X) and the normalization constraint. L is expressed as

L := F(X) + λ ( ∑_i P(x_i) - 1 ),   (7)

where λ is the Lagrange multiplier. Solving ∂L/∂P(x_i) = 0 yields

P(x_i) = (1/Z) exp(-β u(x_i)),   (8)

where Z := ∑_j exp(-β u(x_j)) is a normalization constant. Equation (8) is equivalent to a form known as the canonical distribution, which is also called the Gibbs distribution, in statistical physics. The following equivalent form is more convenient for practical use:

P(x_i) = P̂(x_i)^β / ∑_j P̂(x_j)^β.   (9)

For estimating conditional and joint probability functions, the conditional entropy and the conditional KL divergence are used in place of Equations (3) and (1). The formula for estimating conditional probabilities is therefore

P(x_i | y) = P̂(x_i | y)^β(X|y) / ∑_j P̂(x_j | y)^β(X|y).   (10)

Joint probabilities can be calculated by using Equations (8) and (10) and the definite relation P(x, y) = P(x | y) P(y). In general, they are calculated using decomposition rules such that P(x, y, z, …) = P(x | y, z, …) P(y | z, …) P(z | …) ⋯.
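In practical terms, Equation (9) says the estimate is the β-th power of the relative frequencies, renormalized. A minimal sketch (the function name is ours, not from the paper):

```python
import numpy as np

def tempered_estimate(counts, beta):
    """Equation (9): P(x_i) = Phat(x_i)**beta / sum_j Phat(x_j)**beta,
    where Phat is the ML (relative-frequency) estimate."""
    counts = np.asarray(counts, dtype=float)
    p_ml = counts / counts.sum()
    w = p_ml ** beta
    return w / w.sum()

counts = [8, 1, 1]                       # frequencies n_i over k = 3 states
print(tempered_estimate(counts, 1.0))    # beta = 1: the ML estimate itself
print(tempered_estimate(counts, 0.0))    # beta -> 0: the uniform distribution
```

Intermediate β values interpolate between the two extremes, flattening the ML estimate toward uniform.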

Model of β
P(X) approaches P̂(X) when β in Equation (9) approaches 1. On the other hand, P(X) approaches the uniform distribution if P̂(X) > 0 and β approaches 0. If β close to 1 represents a sufficiently large data size and β close to 0 represents a very small one, then β has favorable properties for accurate and robust estimation. This is because ML estimators generally have preferable consistency and asymptotic efficiency, while distributions close to the uniform can be regarded as robust for small data sizes. Before β is defined, the following quantity P_G^n(X) for data size n is defined by a geometric mean as

P_G^n(X) := (1/Z_G^n) ( ∏_{i=1}^{n} P_i(X) )^{1/n},   (11)

where Z_G^n denotes a normalization constant and P_i(X) denotes the estimated function obtained from Equation (9) with the initial i data. It is defined that P_G^0(X) := 1/|X|, where |X| denotes the number of states of variable X. The fluctuation δ of X with n data is given as

δ_n(X) := D_KL( P̂_n(X) ‖ P_G^{n-1}(X) ),   (12)

where n ≥ 1 and P̂_n(X) is the ML estimator function obtained from n data. It is assumed that β_0(X) := 1/δ_n(X). The normalized β_0, that is, β, is defined by Equation (4), which gives β = 1/(1 + δ_n(X)). Note that the canonical distribution P(x) expressed by Equation (9) can therefore be determined, without any free parameters, by using Equations (9), (11), (12), (4), and the uniform initialization P_G^0 for data size n = 0. The parameter β(X|y) in Equation (10) is defined from the data conditioned on each y in the same manner. The objective function F is rewritten in the same form as that in statistical mechanics as follows:

F = -(1/β) log Z,   (13)

where Z is the partition function, a function well known in statistical mechanics, defined for single or multivariate probabilities as

Z := ∑_i exp(-β u(x_i)) = ∑_i P̂(x_i)^β,   (14)

and for conditional probabilities as

Z_{X|y} := ∑_i exp(-β u(x_i | y)) = ∑_i P̂(x_i | y)^β.   (15)

Theorem 1. At the asymptotic limit n → ∞, β → 1.

Proof. By the consistency of the ML estimator, P̂_n(X) converges to the true distribution as n → ∞, and the estimates P_i(X), and hence their geometric mean P_G^{n-1}(X), converge to the same limit. The KL divergence in Equation (12) therefore approaches a definite value, namely 0, for P(x) > 0 and any state x. Since β = 1/(1 + δ_n), β → 0 does not satisfy this condition, while β → 1 does. Accordingly, β → 1 at the asymptotic limit. □ According to Theorem 1, the
more data are obtained, the more β approaches 1, and the more the estimator approaches the ML estimator. The estimator obtained by the proposed method therefore has the same preferable asymptotic properties, namely consistency and efficiency, as the ML estimator. For an insufficient data size, β_0 is probably small due to the influence of the uniform initial distribution used in the fluctuation model, so β is also small. The probability functions estimated by the proposed method are thus interpreted as adaptively tempered ML estimator functions that respond to the degree of data fluctuation. The proposed estimation method thereby does not require manually selecting the value of parameter β; it is called the "adaptively tempered ML" (ATML) method. ATML has an advantage of simplicity over methods that need complicated algorithms (e.g., [17]).
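Assuming (this is our reading of Equations (4), (11), and (12)) that the fluctuation model sets β = 1/(1 + δ_n(X)), with δ_n the KL divergence of the current ML estimate from the geometric mean of the previous estimates (uniform at n = 0), the whole procedure can be sketched as follows. The ε-flooring of zero probabilities is an implementation convenience of ours, not part of the method:

```python
import numpy as np

EPS = 1e-12

def kl(p, q):
    """KL divergence D(p || q) in nats; zeros are floored at EPS (our convenience)."""
    p = np.maximum(np.asarray(p, float), EPS)
    q = np.maximum(np.asarray(q, float), EPS)
    return float(np.sum(p * np.log(p / q)))

def tempered(p_ml, beta):
    w = np.maximum(p_ml, 0.0) ** beta
    return w / w.sum()

def atml(sample, k):
    """Sketch of the ATML estimator for a sample of state indices in range(k),
    under the assumption beta = 1 / (1 + delta_n)."""
    log_geo = np.zeros(k)           # running sum of log P_i for the geometric mean
    p_geo = np.full(k, 1.0 / k)     # P_G^0: uniform initialization
    counts = np.zeros(k)
    beta = 0.0
    for n, x in enumerate(sample, start=1):
        counts[x] += 1
        p_ml = counts / n
        delta = kl(p_ml, p_geo)     # fluctuation of the current ML estimate
        beta = 1.0 / (1.0 + delta)  # normalized trade-off parameter
        p_n = tempered(p_ml, beta)  # Equation (9) with the current beta
        log_geo += np.log(np.maximum(p_n, EPS))
        p_geo = np.exp(log_geo / n)
        p_geo /= p_geo.sum()        # normalized geometric mean of estimates
    return tempered(counts / len(sample), beta), beta

p, beta = atml([0, 0, 1, 0, 2, 0], k=3)
print(p, beta)  # tempered toward uniform relative to the ML estimate [4/6, 1/6, 1/6]
```

As the sample stabilizes, δ shrinks and β grows toward 1, so the estimate moves from near-uniform toward the ML estimate.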
The role of β can be seen as that of a trade-off parameter between likelihood and robustness by referring to another expression of Equation (2) as follows:

β F(X) = β D_KL( P(X) ‖ P̂(X) ) + (1 - β) D_KL( P(X) ‖ P_u(X) ) - (1 - β) log |X|,

where P_u(X) denotes the uniform distribution function, which contributes to the robustness, while P̂(X) contributes to the likelihood; minimizing F thus balances closeness to the ML distribution against closeness to the uniform one. Additionally, the objective function F can also be interpreted as a KL-based divergence measure, since β itself is determined by a KL divergence.
ATML has an analogy with statistical physics, since the canonical distribution for the estimator is obtained from Equation (8). Actually, U, H, β, and F respectively play roles similar to those of (internal) energy, entropy, (inverted) temperature, and Helmholtz free energy in statistical physics. Solving ∂L/∂P(X) = 0 to obtain P(X) mathematically corresponds to employing the minimum-free-energy (MFE) principle [7] of thermal physics.
ATML may seem analogous to Jaynes' maximum entropy (ME) methods [5], which are well known as least-biased inference methods. However, the constraints on which ME methods are based may not be reliable for small samples and may thus themselves be biased, although this kind of bias is not usually considered. ATML, on the other hand, is designed so that even this bias can be corrected by using parameter β.

Characteristic Properties of ATML
The canonical distribution expressed as Equation (8) gives ATML some characteristic properties that are similar to those in statistical physics. The following notations are defined for later convenience. Probability mass functions estimated by ATML, denoted by P_k, have discrete states denoted by the index k, and the corresponding ML estimator is denoted by P̂_k. In statistical physics, the (inverted) temperature β is usually defined as [7] [8]

β := ∂H/∂U.   (20)
If the canonical distribution of P_k, which takes the form of Equation (8), is used, Equation (20) can be verified by direct differentiation. The quantity ⟨u²⟩ - ⟨u⟩², which is called the energy fluctuation in statistical mechanics, is shown to have the following relation:

⟨u²⟩ - ⟨u⟩² = -∂U/∂β,   (24)

where ⟨·⟩ denotes an expectation value with respect to the canonical distribution. In regard to u defined in the proposed estimation method, namely Equation (6), the same relation is satisfied, since differentiating U = ∑_k u_k exp(-β u_k) / ∑_k exp(-β u_k) with respect to β gives -⟨u²⟩ + ⟨u⟩². Equation (24) is therefore proved.
Fisher information I(β) with a parameter β is defined in the usual way as

I(β) := E[ ( ∂ log f / ∂β )² ],

where f is the likelihood function. We define the tempered Fisher information I(β) as the Fisher information in which the likelihood function is replaced with the canonical distribution with parameter β. Since ∂ log P_k / ∂β = -u_k + ⟨u⟩, it is shown that

I(β) = ⟨u²⟩ - ⟨u⟩².

The tempered Fisher information is therefore identical to the energy fluctuation of Equation (24). It is noteworthy that ATML has other mathematical similarities with statistical physics; that is, the following relationships, which also appear in statistical physics, hold.
• The following relation is easily derived from the definition of the partition function Z:

∂ log Z / ∂β = -⟨u⟩ = -U.   (27)

• The following relation, known as the Gibbs-Helmholtz relation, is derived from Equations (13) and (27):

U = ∂(βF)/∂β.   (28)

• The following relation is simply obtained from Equations (5) and (28):

H = β (U - F).   (29)

• The tempered Fisher information is represented by the second-order differential of the partition function with respect to β:

I(β) = ∂² log Z / ∂β².   (30)
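These partition-function identities can be checked numerically by finite differences for any positive P̂; the distribution below is an arbitrary seeded example, not data from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
p_hat = rng.dirichlet(np.ones(4))  # an arbitrary (strictly positive) ML distribution
u = -np.log(p_hat)                 # energies u(x) = -log Phat(x)

def log_z(beta):
    return np.log(np.sum(np.exp(-beta * u)))  # log partition function, Eq. (14)

def expect(beta, f):
    p = np.exp(-beta * u - log_z(beta))       # canonical distribution, Eq. (8)
    return np.sum(p * f)

beta, h = 0.7, 1e-4
U = expect(beta, u)                # internal energy <u>
F = -log_z(beta) / beta           # free energy, Eq. (13)
H = beta * (U - F)                 # entropy via Eq. (29)
var_u = expect(beta, u**2) - U**2  # energy fluctuation = tempered Fisher information

# finite-difference check of Eq. (27): dlogZ/dbeta = -U (residual ~ 0)
print((log_z(beta + h) - log_z(beta - h)) / (2 * h) + U)
# finite-difference check of Eq. (30): d^2 logZ/dbeta^2 = <u^2> - <u>^2
print((log_z(beta + h) - 2 * log_z(beta) + log_z(beta - h)) / h**2 - var_u)
```

The same H can also be obtained directly as -∑ P log P of the canonical distribution, confirming Equation (29).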

Examples
Numerical experiments are performed to demonstrate the robustness of ATML for small samples, in comparison with the ML and ME methods [5].
X is assumed to have three internal states, and four probability mass functions with a variety of entropies H(X), measured in natural logarithms, are used as estimation targets. The ML method can, for example, estimate a zero-entropy (one-point) distribution from only one sample. On the other hand, ME methods tend to increase entropies and can thereby accurately estimate distributions with high entropies close to the uniform distribution. Hence, the ML method tends to overfit the data, and the ME method tends to underfit the data, in the sense of misestimation. Even so, ATML showed relative stability with respect to both sample sizes and distributions. This result indicates the effectiveness of ATML as a probability estimation method.
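The flavor of this comparison can be sketched as follows; the target distribution, sample size, and the fixed β are illustrative stand-ins chosen by us (ATML itself would set β from the fluctuation model), so this is not a reproduction of the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q, eps=1e-12):
    """D(p || q) in nats, with zeros floored at eps to keep it finite."""
    p = np.maximum(p, eps)
    q = np.maximum(q, eps)
    return float(np.sum(p * np.log(p / q)))

true_p = np.array([0.7, 0.2, 0.1])  # an illustrative low-entropy target (k = 3)
n = 5                               # small sample size
sample = rng.choice(3, size=n, p=true_p)
counts = np.bincount(sample, minlength=3)

p_ml = counts / n                   # ML estimate: may assign 0 to unseen states
beta = 0.5                          # fixed beta, for illustration only
w = np.maximum(p_ml, 0.0) ** beta
p_atml = w / w.sum()                # tempered estimate, flattened toward uniform

print("KL(true || ML):  ", kl(true_p, p_ml))
print("KL(true || ATML):", kl(true_p, p_atml))
```

With β < 1 the tempered estimate always has entropy at least that of the ML estimate, which is the sense in which it hedges against small-sample overfitting.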

Conclusion
A robust method for estimating discrete probability functions, called the "adaptively tempered maximum likelihood" (ATML) method, is proposed. The estimators obtained by this method minimize a parameterized objective function similar to the Helmholtz free energy that appears in statistical physics. The key feature of the proposed method is a model of the parameter based on the fluctuation of finite-size data. The modeled parameter plays an important role in determining the appropriate trade-off between likelihood and robustness in response to the degree of the fluctuations. ATML thereby does not require manually selecting the value of the parameter. It is also proved that the obtained estimator approaches the maximum likelihood estimator at the asymptotic limit. The effectiveness of ATML in terms of robustness was demonstrated by experimental studies on point estimation for probability distributions with various entropies.