1. Introduction
An extensive literature exists describing how to find an approximate sampling distribution of one or more maximum-likelihood estimators of parameters of a regular distribution. The corresponding procedure is based on the Cramér-Rao variance bound (in the one-parameter case) or the Fisher-information matrix (several parameters), resulting in a (univariate or multivariate, respectively) Normal distribution; in the former case, this can be further improved by including a few extra terms of a more accurate Edgeworth expansion (see, for example, [1] and [2]). Constructing such an approximation is quite routine, even though use of a computer may be necessary to find the required first four moments.
The purpose of this article is to show how much more difficult the same task becomes when the sampled distribution is not regular; we do this using the following one-parameter example: sampling a distribution whose probability density function is given by
(1)
(zero otherwise), and finding an approximation for the probability density function (PDF) of the maximum-likelihood estimator (MLE, denoted
) of the
(necessarily positive) parameter. We will also discuss several possibilities for constructing a confidence interval (CI) for the unknown value of
, aiming for good accuracy in terms of both the CI’s length (which needs to be small), and its level of confidence (which can be often only approximated).
Note that our example has two distinctive properties (beyond being a case of non-regular estimation), namely
· it involves what is called a scaling parameter; this simplifies the search for a good estimator,
· the corresponding MLE is at a local maximum of the likelihood function, not at its boundary (the latter case is easy to deal with; we leave it for another publication).
In what follows, we use the following terminology: analytic describes results expressed in formula form, while numeric implies using a computational algorithm. Similarly, regarding accuracy: exact is self-explanatory (only analytic results qualify); practically exact answers can be computed numerically to arbitrary precision; accurate implies relatively small errors; and adequate restricts errors to less than one percent.
2. Maximum Likelihood Estimator
To find the MLE of θ, we differentiate the natural logarithm of the corresponding likelihood function, namely
(2)
with respect to θ, thus getting
(3)
Setting the last expression equal to 0 results in
(4)
where the bar indicates the sample mean of the full expression. Solving the last equation for θ, subject to the relevant constraint on θ (this can be done only numerically), yields the resulting ML estimate, denoted θ̂. Note that the distribution of θ̂/θ is parameter free, which simplifies our task: we then need to find the sampling distribution of Θ only (or an adequate approximation to it); the corresponding transformation back to θ is then quite routine.
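To make the procedure above concrete, the sketch below makes a working assumption, since the displayed formulas (1)-(4) are not legible in our copy of the text: the sampled density is taken as f(x; θ) = 3(θ² − x²)/(4θ³) on (−θ, θ), i.e. X = θ(2T − 1) with T having the Beta (2, 2) distribution (consistent with the remark in Section 4); the likelihood equation (4) then becomes the sample mean of 1/(θ² − x²) equated to 3/(2θ²), to be solved subject to θ exceeding the largest |x|.

```python
import numpy as np

def likelihood_equation(theta, x):
    # LHS minus RHS of the (assumed) form of equation (4):
    # sample mean of 1/(theta^2 - x^2) equals 3/(2 theta^2)
    return np.mean(1.0 / (theta**2 - x**2)) - 1.5 / theta**2

def mle(x):
    """Solve the likelihood equation for theta > max|x| by bisection
    (a simple, safe substitute for the paper's Newton iteration)."""
    m = np.abs(x).max()
    lo, hi = m * (1 + 1e-9), 10.0 * m
    while likelihood_equation(hi, x) > 0:   # make sure the root is bracketed
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if likelihood_equation(mid, x) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# hypothetical data: a sample of size 1000 with theta = 2
rng = np.random.default_rng(0)
theta = 2.0
x = theta * (2.0 * rng.beta(2, 2, size=1000) - 1.0)
theta_ml = mle(x)
```

Bisection is used here purely for robustness of the sketch; the equation's left-hand side tends to +∞ as θ approaches the largest |x| and becomes negative for large θ, so a root is always bracketed.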
3. Sampling Distribution of Θ
Let us then investigate the distribution of Θ, a random variable defined as the unique admissible solution to
(5)
where Y stands for a sample of size n from a distribution whose PDF is
(6)
The usual (regular-case) approach would be to expand the LHS of (5) at the asymptotic mean of Θ, thus getting
(7)
and then solving for Θ; this yields the following approximate value:
(8)
When both the numerator and denominator have finite moments, this then leads, without much difficulty, to the usual (Normal or Edgeworth) asymptotic result.
Unfortunately, the numerator U has a finite expected value but an infinite variance, while the denominator V has both of these moments infinite. The best we can do at this point is to investigate the marginal distribution of the numerator and then, separately, of the denominator (deriving the bivariate distribution of U and V and proceeding in the manner of [3] or [4] proves too difficult).
3.1. The Numerator
Using (6), it is easy to derive the corresponding PDF of U, namely
(9)
This can be routinely converted into the corresponding characteristic function (CF) of U, namely
(10)
where two second-kind Bessel functions appear. When expanded in t, the ln of the last expression yields
(11)
implying that the ln of the characteristic function of (12) is (now expanded in terms which decrease with n)
(13)
In the n → ∞ limit, (12) has the Normal distribution with a mean of 0 and the corresponding variance, in agreement with [5] (yet, the actual variance of (12) remains infinite at any n). Unfortunately, the convergence to this limit is so slow that using it as an approximation is virtually useless; even when employing the full expression (9), an adequate approximation is achieved only at very large n. As an example, see Figure 1, which displays:
· an empirical histogram of one million values of (12), generated by Monte Carlo (MC) simulation (done using Mathematica) of the same number of samples of size n = 1000, each of them then converted to a value of (12);
Figure 1. Approximating the distribution of (12) when n = 1000.
· and the corresponding PDF, obtained by raising e to the power of (13) and applying the inverse Fourier transform (also delegated to Mathematica) to the resulting CF; note that (13) had to be further expanded, to higher-order terms, to facilitate proper convergence.
The two graphs support our claim of extremely slow convergence to a Normal distribution, which needed to be extended by the extra n-dependent terms of (13) to reach (still less than adequate) accuracy at n as large as 1000. Nevertheless, even this information (the normalizing constant of (12) in particular) is essential in our subsequent attempt to find a useful approximation for the PDF of the MLE's distribution.
3.2. The Denominator
By a similarly routine exercise, the PDF of V is
(14)
on its support (zero otherwise). This time, there is no simple analytic expression for the corresponding CF; nevertheless, we can still find its t expansion by dividing the required integral into two parts, thus:
(15)
To evaluate the first of these, we expand the integrand around a suitable point, which lets us proceed by numerical integration; in the second part, a different expansion enables us to evaluate the resulting integral (term by term) analytically. This yields, for the final sum,
(16)
implying that the ln of the CF of (17) is (when expanded as a function of n rather than t)
(18)
This readily implies that, in the n → ∞ limit, (17) converges to a constant (yet, its expected value and variance remain infinite at any n); but again, this asymptotic (degenerate) distribution is modified (at any finite n) by additional terms, which make the actual convergence extremely slow. Similarly to the results of the previous section, the empirical distribution of (17) agrees with the PDF constructed (via inverse Fourier transform) from (18) only at fairly large values of n; Figure 2 illustrates that, even when n = 1000, the accuracy of this approximation remains less than adequate.
Figure 2. Approximating the distribution of (17) when n = 1000.
3.3. And Their Ratio
Having such (rather imperfect) approximations for the distributions of both the numerator and denominator of (8) does not readily translate into a similar approximation for the ratio itself; attempting to find one would be extremely difficult and (considering the very limited accuracy achieved by such an approach so far) ultimately of very limited practical value. There is empirical evidence that the RHS of (8) cannot adequately approximate Θ unless additional terms of the expansion (7) are included, thus further complicating the matter.
Nevertheless, the previous two sections have given us a good indication of the normalizing factor by which the estimator's error needs to be multiplied to converge to an asymptotic distribution whose main part is Normal (with zero mean and the corresponding variance), but which needs additional n-dependent terms to reach adequate accuracy. To construct such an approximation, we can now rely only on MC simulation of a large number of random independent samples (RIS) from (6), each yielding one random value of Θ, as described earlier. Doing this, we discover that, unlike those of U and V, the moments of the W distribution are finite; this enables us to use the Edgeworth approximation (described shortly) for the corresponding PDF.
But first we go over some details of the MC simulation.
4. Monte-Carlo Simulation
Existing software (such as Mathematica) can easily generate random values from any of the commonly used distributions. Since X is obtained from T by a simple transformation, where T is a random variable having the Beta (2, 2) distribution, it is quite easy to generate a RIS of a specific size n from this distribution. Solving (5) for Θ is more difficult, as the procedure needs to be fast, accurate and reliable. The way we do it starts with
(19)
and continues by performing four iterations of the basic Newton technique, where a single iteration replaces the current value of Θ by
(20)
This results in a single random value of Θ; to get the corresponding empirical distribution, this is repeated as many times as feasible. The resulting mega-sample of the values of
(21)
can then be used to
· display the corresponding histogram;
· convert it into an empirical PDF function (we use Mathematica's SmoothKernelDistribution for this purpose; for the details of the corresponding algorithm, see [6]);
· find accurate estimates of the first few moments of the W distribution, of selected percentiles, etc.
Note that all our simulations use such large mega-samples (of one million RISs, to be specific) that their results can be considered practically exact.
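The mega-sample loop can be sketched as follows, again under the assumed Beta (2, 2)-based model discussed above (with the parameter set to 1). The paper's starting value (19), Newton step (20) and normalization (21) are not legible in our copy, so the sketch simply records the raw estimates, solves by bisection instead of Newton, and uses far fewer repetitions than one million:

```python
import numpy as np

def theta_hat(y):
    # solve the (assumed) likelihood equation mean(1/(th^2 - y^2)) = 3/(2 th^2)
    # for th > max|y|, by bisection (a safe substitute for the Newton steps)
    m = np.abs(y).max()
    lo, hi = m * (1 + 1e-9), 10.0
    g = lambda th: np.mean(1.0 / (th**2 - y**2)) - 1.5 / th**2
    while g(hi) > 0:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def mega_sample(n, reps, rng):
    """Empirical distribution of the scale-free estimate (parameter set to 1)."""
    return np.array([theta_hat(2.0 * rng.beta(2, 2, size=n) - 1.0)
                     for _ in range(reps)])

rng = np.random.default_rng(1)
sample = mega_sample(30, 2000, rng)      # a real mega-sample would use 10**6
# moments and critical values, as in the bullet list above
mean, var = sample.mean(), sample.var(ddof=1)
c025, c975 = np.quantile(sample, [0.025, 0.975])
```

With the full million repetitions, the resulting moments and percentiles become, in the article's terminology, practically exact.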
4.1. Example
As an example, we present the results of this procedure when n = 30, displaying the histogram of the W values together with the corresponding empirical PDF in Figure 3; note that the latter can be evaluated only numerically (it does not have an analytic form). Also note that both graphs have been constructed from the same set of data; this explains their perfect agreement.
The simulation has also yielded (practically exact) values of the mean, variance, skewness and excess kurtosis of the W distribution (the mean, skewness and excess kurtosis being −0.331, −0.571 and 0.391, respectively), while −2.598, −2.342, −1.978, −1.670, 0.733, 0.893, 1.078 and 1.203 are its empirical percentiles (also known as critical values), corresponding to 0.5%, 1%, 2.5%, 5%, 95%, 97.5%, 99% and 99.5%, respectively (these enable us to construct CIs for θ when n = 30, as explained below).
Figure 3. Empirical PDF of W when n = 30.
4.2. Edgeworth Approximation
When a reasonably accurate analytic formula is desired (for the PDF of W, at a specific value of n), it is natural to use the usual Edgeworth-series expansion, namely
(22)
where
(23)
and the mean, variance, skewness and excess kurtosis are taken from an MC simulation. This approximation will not be as accurate as the empirical PDF of Figure 3 (the actual difference, when n = 30, is displayed in Figure 4), but it can be utilized by those lacking the sophistication and computer resources to do their own simulation. Note that the maximum error of this approximation is never bigger than 1% (the shaded area of Figure 4), and is expected (as has been confirmed) to decrease when the sample size gets larger.
Figure 4. Error of Edgeworth approximation when n = 30.
Figure 5. Four characteristics of W distribution.
By extending the simulation (this will take several hours of CPU time) to get the same set of results for a whole range of sample sizes, we have computed and are displaying (in Figure 5) the mean (red), variance (blue), skewness (green) and excess kurtosis (brown) of the W distribution, for any practical sample size n (on the horizontal, log10 scale); the small dots represent the simulated values, while each of the four curves has the (sufficiently flexible, as established by numerical exploration) form of
(24)
where the coefficients have been found by a least-squares fit. The accuracy of the resulting (22)-based approximation is similar to what we have observed for n = 30 (improving, rather slowly, with increasing n).
Note that the mean (also known as the expected value) of W has such a substantial bias (the red line) that (unlike in a regular case) it cannot be ignored even when n gets extremely large (this goes for the other three quantities as well).
5. Confidence Intervals
We will now turn our attention to constructing CIs for θ, with the aim of making their expected length short and their claimed level of confidence as accurate as possible.
The CI construction critically depends on the sample statistic used as the point estimator of θ; we will consider several possibilities, starting with the obvious choice of the MLE (the only estimator discussed so far).
5.1. MLE-Based CI
Most of what is needed to construct a CI for the true value of θ has already been done in the previous section. Now, we just need to
· choose the level of confidence (denoted 1 − α), with α typically between 1% and 10% (5% being most common);
· find the two corresponding critical values of the W distribution (these are obtained either as a by-product of the MC simulation or, somewhat less accurately, computed from the Edgeworth approximation to the PDF of W);
· find the ML estimate, based on a real (not simulated) RIS of size n, and solve
(25)
for θ to get the CI's lower limit, and repeat with the other critical value to get the upper limit.
Note that the expected length of the resulting CI is, to a sufficient approximation
(26)
When n = 30 and the level of confidence is 95%, this yields a length implying that we can easily be off the true value of θ by 30%; increasing n to 30720 reduces the expected CI length to an impressively small value.
We still need to explain how to get a (practically exact) critical value for a given sample size n, based on a mega-sample of a million estimates: all we need to do is to take the kth smallest of these estimates, where k is the nearest integer to the corresponding tail probability multiplied by the mega-sample's size (something we have already done for several such combinations). By an extensive simulation (discussed previously) we have extended these results over a wide range of sample sizes; the resulting values of C0.975 and C0.05 are displayed in Figure 6. This time, the curves were fitted using (24) with an additional constant term.
Figure 6. Critical values C0.975 (top) and C0.05 (bottom).
An alternative (but more elaborate) way of constructing an MLE-based CI is to minimize the distance between two percentiles (with tail probabilities whose difference equals the desired level of confidence) while keeping the probability of the corresponding interval fixed; this is achieved (based on the same mega-sample) by the usual minimization of this distance, after displaying it as a function of the lower-tail probability (this can be done graphically). In the case of n = 30, this yields −1.855 (instead of −1.978) and 0.977 (instead of 1.078); the distance between the corresponding critical values has thus been shortened to 2.832 (from the old 3.056, i.e. by about 8%), thus improving the CI's predictive power.
The main disadvantage of the technique of this section rests in our inability to find, without extensive simulation, an analytic and sufficiently accurate expression for the PDF of W (even computing the estimate itself may prove difficult for some). Yet, it needs to be acknowledged that the MLE is more efficient than any other potential estimator of θ, especially at large values of n (due to an extra factor in its variance; most estimators have a variance decreasing with 1/n only). Therefore, in our subsequent search for a simpler, formula-based point estimator of θ, we must expect some reduction in the estimator's relative efficiency, and thus in the predictive power of the resulting CI.
The question now is: when abandoning the MLE, what other potential candidates are there for a good (i.e. relatively efficient and also unbiased, at least asymptotically) estimator? Furthermore, we would like to be able to find, analytically, the asymptotic form of its distribution, and have the exact distribution rapidly converge to this n → ∞ limit.
5.2. Method of Moments
The next traditional choice of an estimator (called the MME) uses the method of moments. It works by equating the expected value of a single X with the usual sample mean of n such values and solving for θ. Since, in our case, this expected value is not a function of θ, the method needs to be modified by using a function of X instead of X itself; in our case, a convenient choice is one whose PDF, namely
(27)
leads to the required expected value. The new (fully unbiased) estimator of θ is thus
(28)
(we keep the original notation for the new estimator).
We can now (rather routinely) compute the mean, variance, skewness and excess kurtosis of this estimator. The Central Limit Theorem (CLT) tells us that the sampling distribution of (28) is approximately Normal, having the above mean and variance. Unlike in the previous case of the MLE, this approximation is adequate even at relatively small values of n. Furthermore, we can substantially improve its accuracy by incorporating the corrections indicated by (22); Figure 7 displays (using the z scale) the corresponding error of both approximations relative to the (MC-generated but practically exact) PDF of
(29)
Figure 7. Error of CLT and Edgeworth (dashed) PDFs when n = 30.
Note that the simulation was required only for this comparison; it is not a part of the technique itself.
The largest possible error of the CLT approximation at n = 30 (the worst-case scenario) is about 1% (adequate), while the Edgeworth approximation is distinctly more accurate (indicated by the area under the curve, not by its height). Note that the CLT's error decreases with n substantially more slowly than that of the Edgeworth approximation.
The advantage of these approximations is that both have a simple analytic form; this enables us to directly compute the critical values required for the construction of CIs. Furthermore, these critical values (using the W scale) are, when using the CLT approximation, independent of n, being the usual Normal percentiles. Using MC simulation and n = 30, we have established the true confidence level of the corresponding CI to be 0.9508; the approximation is thus quite accurate even for such small samples (better accuracy yet can be achieved via the Edgeworth approximation; something we leave for the reader to try).
Using the appropriate analog of (26), namely
(30)
the expected length of a CI when n = 30, at the 95% level of confidence, comes out 59% longer than that of the MLE-based CI; when n increases to 30,720, the length becomes 0.0145 (2.43 times longer, which makes the technique rather non-competitive at large n).
In summary: using (28) as a basis of CI construction is computationally quite simple, as its distribution is approximately Normal, with reasonably accurate results; unfortunately, the predictive power of the resulting CI (measured by its length) is less than impressive.
5.3. Utilizing an Order Statistic
In this section, we explore yet another estimator of θ, namely the nth-order statistic (i.e. the largest) of n sample values; note that its exact sampling distribution and the corresponding n → ∞ limit are now both readily available in analytic form. The idea of using this estimator rests on a hint we get from our MLE (which was always only slightly bigger than this maximum).
Transforming this largest value to
(31)
we can easily compute
(32)
based on
(33)
which follows from (27), and the corresponding
(34)
Note that, when n is sufficiently large, we can approximate (32) by its n → ∞ limit, namely
(35)
Both formulas (exact and approximate) indicate that the new estimator is unbiased only asymptotically; the exact bias needs to be computed from (32); for sufficiently large n we get the asymptotic result of
(36)
obtained from (35). Similarly, (35) also yields the estimator's asymptotic variance.
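Under the working assumption that the underlying (θ = 1) density is 3(1 − y²)/4, so that (27) is the density 3(1 − t²)/2 of |Y| on (0, 1), the exact and asymptotic distributions of the largest value can be compared as follows; note that the exp(−1.5v²) limit below is our own derivation under this assumption, not a formula quoted from the text:

```python
import math

def cdf_abs(t):
    """CDF of |Y| under the assumed theta = 1 density 3(1 - y^2)/4:
    G(t) = (3 t - t^3)/2 on [0, 1]."""
    return (3.0 * t - t**3) / 2.0

def cdf_max(t, n):
    """Exact CDF of the largest of n independent |Y| values."""
    return cdf_abs(t) ** n

def cdf_limit(v):
    """Limit of P(sqrt(n)*(1 - max) > v) = cdf_max(1 - v/sqrt(n), n):
    since 1 - G(t) ~ (3/2)(1 - t)^2 near t = 1, this tends to exp(-1.5 v^2)."""
    return math.exp(-1.5 * v * v)

# the exact probability is already very close to its limit at large n
n, v = 100000, 1.0
exact = cdf_max(1.0 - v / math.sqrt(n), n)
```

This rapid convergence is what makes the asymptotic critical values usable at practically any n, as the next paragraph claims.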
To construct a CI for θ, one needs to find the two critical values by solving the corresponding percentile equations for w, and then solve these for θ to get the lower and upper boundaries. In Figure 8, we plot the two critical values (using the 95% level of confidence) as functions of n; note that there are only minute differences between these and their n → ∞ limits.
This implies that the asymptotic values can be used at practically any n; even when n = 30, the true confidence level thus achieved is 94.6% instead of 95% (a tolerable error).
To assess the predictive power of the new technique: the expected length of a CI (when n = 30, at the 95% level of confidence) is, based on (30), only 4% higher than when using the MLE; increasing n to 30720 changes this to about 39% higher (still quite tolerable).
Furthermore, we can make any such CI shorter by optimizing its length while keeping the confidence level at 95%, as done in the previous section; to demonstrate this, we use a very large n and the asymptotic formulas (practically exact at such large n). Skipping routine details, this yields new critical values and a resulting CI length only 32% higher than that of the MLE-based CI.
Figure 8. C0.975 (top) and C0.025 (bottom) of (31) distribution.
To summarize: with the last estimator, we have achieved a combination of simplicity, ease of computation and fast convergence to asymptotic formulas, without much loss in the predictive power of the corresponding CI.
6. Conclusions and Future Research
We have shown that finding an accurate and efficient CI for a single parameter is, in a case of non-regular estimation, a surprisingly difficult task; as a convincing simple example, we have presented a case study of estimating a scaling parameter of a distribution closely related to Beta (2, 2). We have delineated several possibilities for dealing with this problem, using the MLE, the MME, and the largest-order statistic as point estimators, and investigating their sampling distributions. We have discovered that this distribution converges to Normal for both the MLE and the MME, but that in the former case the convergence is too slow to be of any practical use (the CI construction then needs to rely on a CPU-intensive numerical approach). Using the MME and the largest-order statistic resulted in simple, analytic formulas for each of their sampling distributions and their asymptotic forms (reached relatively quickly as n increases). Unfortunately, the variance of the MME is too large to make the resulting CI competitive, but using the largest-order statistic proved to be both simple and relatively efficient.
Empirically (and rather surprisingly), we have also discovered that, similarly to regular cases, the distribution of
(37)
(where the likelihood function was defined in (2)) is approximately chi-squared with one degree of freedom. This is demonstrated in Figure 9 using a rather small n (at such small n, most approximations become visibly inaccurate; the practically perfect fit we see here is truly amazing). To construct a 95% CI for the value of θ, all we have to do is to compute the ML estimate (based on a specific RIS), make (37) equal to the corresponding critical value of this chi-squared distribution (i.e. 3.8415), and solve for θ (there will always be two solutions, providing the CI's boundaries). This produces CIs similar to those constructed in the last paragraph of the previous section, including the fast decrease of the CI's length with n. Theoretical justification of these statements, and finding out whether they are true in general (after excluding MLEs involving max or min, which have different, but easy-to-find, asymptotic distributions) will require further investigation.
Figure 9. Histogram of (37)-values and PDF of the chi-squared distribution with one degree of freedom.
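Taking (37) to be the usual likelihood-ratio statistic 2[ℓ(θ̂) − ℓ(θ)] (an assumption, since the displayed formula is not legible in our copy) and again assuming the density 3(θ² − x²)/(4θ³), the CI construction just described can be sketched as:

```python
import numpy as np

def loglik(theta, x):
    """Log-likelihood under the (assumed) density 3(theta^2 - x^2)/(4 theta^3)."""
    return np.sum(np.log(3.0 * (theta**2 - x**2)) - np.log(4.0)
                  - 3.0 * np.log(theta))

def mle(x):
    # bisection on the (assumed) likelihood equation, for theta > max|x|
    m = np.abs(x).max()
    lo, hi = m * (1 + 1e-9), 10.0 * m
    g = lambda th: np.mean(1.0 / (th**2 - x**2)) - 1.5 / th**2
    while g(hi) > 0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def lr_ci(x, crit=3.8415):
    """Invert 2*(loglik(mle) - loglik(theta)) = crit for its two roots."""
    th = mle(x)
    lmax = loglik(th, x)
    stat = lambda t: 2.0 * (lmax - loglik(t, x))
    def root(a, b):                 # stat - crit changes sign on [a, b]
        for _ in range(200):
            mid = 0.5 * (a + b)
            if (stat(mid) - crit) * (stat(a) - crit) > 0:
                a = mid
            else:
                b = mid
        return 0.5 * (a + b)
    m = np.abs(x).max()
    lower = root(m * (1 + 1e-12), th)   # stat blows up as theta -> max|x|
    b = 2.0 * th
    while stat(b) < crit:               # stat grows like 2n*ln(theta) eventually
        b *= 2.0
    upper = root(th, b)
    return lower, upper

rng = np.random.default_rng(3)
x = 2.0 * (2.0 * rng.beta(2, 2, size=100) - 1.0)   # hypothetical data, theta = 2
ci_lo, ci_hi = lr_ci(x)
```

Both roots bracket the ML estimate, matching the text's observation that there are always two solutions.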
We acknowledge that every case of non-regular estimation is unique and may require exploring other possibilities than those presented here; nevertheless, the results of our article provide guidelines for such an exploration.