Bayesian Set Estimation with Alternative Loss Functions: Optimality and Regret Analysis
1. Introduction
Together with point estimation and hypothesis testing, set estimation is one of the most common approaches to inference on the unknown parameter of a statistical model. In the decision-theoretic framework, however, set estimation has traditionally received less attention than the other two problems, from both the frequentist and the Bayesian points of view. Set estimators are in fact typically derived using non-decisional approaches. For instance, frequentist intervals are found using pivotal quantities or by inverting the acceptance regions of tests; Bayesian intervals, such as equal-tails or highest posterior density sets, are obtained by exploiting features of the posterior distribution of the parameter, with no explicit connection to a decision-theoretic context. This is probably a consequence of the difficulty of choosing a satisfactory loss function for this decision problem.
In the few studies on this topic, loss functions typically take into account two aspects of a candidate set estimator: its size (penalized by a specific coefficient) and its ability to contain the true value of the parameter. The most widespread loss function is the so-called linear loss, which is a linear function of the size of the set (see Expression (2) of Section 2). Despite its simplicity, this function has been found to produce optimal sets (both frequentist and Bayesian) with paradoxical flaws even in a very standard problem such as set estimation of the normal mean with unknown variance: see [1] and [2]. To address these drawbacks, the authors of [1] and [2] proposed other losses that are increasing but nonlinear functions of the size of the set under evaluation. Among these are the exponential and the rational losses (see Expressions (3) of Section 2).
The analysis contained in the present article is closely related to that in [1] and [2]. Our contribution is intended to fill the following three gaps. 1) The approach of [1] and [2] is mainly theoretical. In the present article we explore the features of optimal actions and posterior expected losses induced by three different loss functions (linear, exponential and rational) in several numerical examples. Specifically, we assess the sensitivity of optimal actions with respect to the choice of both the loss function and the prior distribution (informative vs non-informative priors). 2) As a second contribution, we propose a regret analysis: we evaluate the additional losses of standard highest posterior density sets of fixed posterior probability with respect to optimal sets under the three losses. 3) As far as we know, a formal decision-theoretic approach to set estimation under different loss functions has never been applied to a real experimental context. We provide here an application to clinical trial data.
As noted above, the literature on the subject is scant. In addition to the motivating articles [1] and [2] already mentioned, general introductions to Bayesian decision-theoretic set estimation can be found, for instance, in [3] [4] [5] [6] [7]. An objective Bayesian decisional approach is proposed in [8]. The use of posterior regret for interval estimation has recently been considered in [9]. The additional posterior loss of non-optimal actions for point estimation and testing has been examined in [10] and [11]. Considerations on the conflict between optimal and non-optimal Bayesian actions can be found in [12].
This article builds on a preliminary study in [13] for the Poisson model and is organized as follows. In Section 2 we introduce the basic elements of the Bayesian decision-theoretic approach to set estimation. Section 3 is devoted to the optimality analysis for interval estimation of a normal mean with known and unknown variance. Using both simulations and clinical trial data, we discuss the effect of the penalizing constant for interval length and the impact of prior choices on the features of optimal sets (length, posterior probability and posterior expected loss) under the three losses. In Section 4 we change perspective: from the search for optimal sets (whose length and credibility are not fixed in advance) we switch to a regret analysis that aims at evaluating the additional cost of using intervals with fixed credibility levels rather than optimal sets. Using the clinical trial example again, we examine the effects of the following quantities on the posterior regret associated with the exponential and rational losses: the value of the penalizing constant, the sample size and the credibility level. Finally, Section 5 summarizes the main points that emerged in the article. All computations are performed using the software R [14]; code is available upon request.
2. Methodology
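Given a parametric model $\{f_n(\cdot \mid \theta),\ \theta \in \Theta\}$, let $\pi(\theta)$ denote the prior distribution of $\theta$, $x_n$ an observed sample of size n and $\pi(\theta \mid x_n)$ the corresponding posterior distribution. For simplicity, suppose that $\Theta \subseteq \mathbb{R}$ and that $\pi(\theta \mid x_n)$ is a probability density function. To address set estimation of $\theta$ from a decision-theoretic perspective (see [3] and [7]), let $\mathcal{C}$ be a class of subsets of $\Theta$ and $L(\theta, C)$ the loss function for a generic set $C \in \mathcal{C}$. This approach prescribes selecting a set $C^*$ that minimizes the posterior expected loss $\rho(C) = E[L(\theta, C) \mid x_n]$ as C varies in $\mathcal{C}$, i.e.:
$$C^* = \arg\min_{C \in \mathcal{C}} \rho(C).$$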
The most widely used family of losses for set estimation is defined by setting
$$L(\theta, C) = S(C) - I_C(\theta), \qquad (1)$$
where the size $S(C)$ is an increasing function of $\mathrm{vol}(C)$, the Lebesgue measure of C, and $I_C(\theta)$ is the indicator function of the set C. The resulting posterior expected loss of C is
$$\rho(C) = S(C) - P(\theta \in C \mid x_n),$$
which embodies a compromise between the size of C and its posterior probability of containing $\theta$, denoted as $P(\theta \in C \mid x_n)$. One important property of the class of monotone size functions (see, for instance, [2]) is that, if $\theta$ is an absolutely continuous random variable (as we assume here), optimal actions are highest posterior density (HPD) sets, defined as $\{\theta \in \Theta : \pi(\theta \mid x_n) \ge h\}$, $h > 0$
. More specifically, we here assume that HPD sets are intervals
.
Let $\ell(C)$ be the length of a generic interval C. The simplest form of loss (1) for C is obtained by selecting
$$S_L(C) = a\,\ell(C) \qquad (2)$$
as size function, which yields the class of linear loss functions, $L_L(\theta, C) = a\,\ell(C) - I_C(\theta)$, where a is a penalizing constant for interval lengths. Casella, Hwang and Robert in [1] and [2] show that, in the case of an unbounded parameter space, optimal sets under the linear loss function may be dominated by unreasonable sets. For instance, in the case of the normal model $N(\theta, \sigma^2)$ with unknown variance, the standard Student's t-interval for $\theta$ is dominated by a set that is empty when the sample variance is sufficiently large. They also show that (under mild conditions) these kinds of problems are avoided if both the components of (1) assume values in $[0, 1]$ or, more specifically, if $S(C)$ is a nonlinear and increasing function of $\ell(C)$ that ranges monotonically in the unit interval, such that $S(C) \to 0$ as $\ell(C) \to 0$ and $S(C) \to 1$ as $\ell(C) \to \infty$. To resolve this paradox the authors propose the following two nonlinear functions
(3)
that result in the classes of exponential and rational loss functions. The posterior expected losses corresponding to the three size functions under examination in this article are then given by:
(4)
Note that whereas
, for all C, i.e. it is unbounded above, the values of
and
are upper bounded. As an example, Figure 1 shows
,
as a function of
: shaded areas represent the contributions of
(dotted lines) and of
(solid lines) respectively.
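The trade-off between size and posterior probability described above can be sketched numerically. The following is a minimal Python illustration, not the code used for the article (whose computations rely on R). All function names are hypothetical. The posterior expected loss is assumed to have the form "size minus posterior probability", and the exponential and rational size functions are written in one plausible parametrization, 1 - exp(-a*length) and a*length / (1 + a*length), which is an assumption made here and is not taken from Expressions (3).

```python
import math


def size_linear(length, a):
    """Linear size function: grows without bound with the interval length."""
    return a * length


def size_exponential(length, a):
    """Assumed exponential size function: increases from 0 towards 1."""
    return 1.0 - math.exp(-a * length)


def size_rational(length, a):
    """Assumed rational size function: increases from 0 towards 1."""
    return a * length / (1.0 + a * length)


SIZE_FUNCTIONS = {
    "linear": size_linear,
    "exponential": size_exponential,
    "rational": size_rational,
}


def posterior_expected_loss(length, coverage, a, loss="linear"):
    """Assumed form: size of the interval minus its posterior probability."""
    return SIZE_FUNCTIONS[loss](length, a) - coverage


if __name__ == "__main__":
    # Hypothetical interval: length 0.8 with posterior probability 0.95.
    for name in SIZE_FUNCTIONS:
        value = posterior_expected_loss(0.8, 0.95, a=0.5, loss=name)
        print(name, round(value, 4))
```

Under these assumptions the two nonlinear size functions stay in the unit interval for any length, whereas the linear one does not, which is the boundedness property discussed in the text.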
Posterior expected losses
allow us to follow two different decisional approaches:
(i) Optimality analysis, i.e. determination and comparison of optimal intervals
(ii) Regret analysis, i.e. evaluation of standard intervals of given credibility level
, denoted by
, in terms of their additional expected loss (or expected regret) with respect to
, quantified by
(5)
Note that, for any arbitrary choice of
,
.
Approaches (i) and (ii) will now be specialized to the normal model in Sections 3 and 4, respectively.
3. Optimality Analysis
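Let us assume that $X_1, \ldots, X_n \mid \theta$ are i.i.d. $N(\theta, \sigma^2)$, with $\sigma^2$ known, and that $\theta \sim N(\mu_0, \sigma^2/n_0)$. These assumptions imply that $\theta \mid x_n \sim N(\mu_n, \sigma_n^2)$, where $\mu_n = (n_0 \mu_0 + n \bar{x}_n)/(n_0 + n)$ and $\sigma_n^2 = \sigma^2/(n_0 + n)$. In this case the class of HPD sets, which are also equal tails (ET) intervals, is $\{C_k = [\mu_n - k\sigma_n,\ \mu_n + k\sigma_n],\ k > 0\}$, and
$$\rho_j(C_k) = S_j(C_k) - [2\Phi(k) - 1], \qquad (6)$$
where $\Phi$ is the standard normal cdf and $S_j(C_k)$, for $j = L, E, R$, are obtained by setting $\ell(C_k) = 2k\sigma_n$ in (2) and (3). The non-informative case is simply obtained by setting $n_0 = 0$, which yields standard confidence intervals $\bar{x}_n \pm k\sigma/\sqrt{n}$. Note that the minimizers $k_j^*$ can be determined as follows: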
(i)
;
Figure 1. Posterior expected losses
,
(dotted lines) and
(solid lines),
, as functions of
.
(ii)
satisfies the equation
, where
and
denotes the density function of a standard normal random variable;
(iii)
is determined numerically.
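As a worked illustration of how a closed form can arise for the linear loss, the following derivation is a sketch under stated assumptions. The posterior expected loss is taken to be "size minus posterior probability". The interval is $C_k = [\mu_n - k\sigma_n, \mu_n + k\sigma_n]$, with length $2k\sigma_n$. The symbols $\mu_n$ and $\sigma_n$ (posterior mean and standard deviation) and $\rho_L$, $k_L^*$ are notation assumed here. It is not claimed to reproduce items (i)-(iii) verbatim.

```latex
\begin{align*}
\rho_L(k) &= 2 a k \sigma_n - \bigl[2\Phi(k) - 1\bigr], \qquad k \ge 0, \\
\rho_L'(k) &= 2 a \sigma_n - 2\phi(k) = 0
  \iff \phi(k) = a \sigma_n, \\
k_L^* &=
\begin{cases}
\sqrt{-2 \ln\bigl(a \sigma_n \sqrt{2\pi}\bigr)} & \text{if } a < \dfrac{1}{\sigma_n \sqrt{2\pi}}, \\[2ex]
0 & \text{otherwise.}
\end{cases}
\end{align*}
```

Since $\phi(k)$ is decreasing for $k \ge 0$, the derivative changes sign at most once, from negative to positive, so the stationary point is a minimum. When $a\sigma_n \ge \phi(0) = 1/\sqrt{2\pi}$ the derivative is nonnegative everywhere and the optimal interval degenerates to a point, which is in line with the remark in Section 3.2 that the linear-loss optimum is nontrivial only for a below a threshold. Repeating the same steps with an assumed exponential size function $1 - \exp(-2ak\sigma_n)$ gives the stationarity condition $a\sigma_n \exp(-2ak\sigma_n) = \phi(k)$, which has no closed-form solution.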
For the unknown variance case let us assume that
and that
or, equivalently, an inverse chi-square r.v. of parameters
. In this case standard conjugate analysis yields that
, where
,
is unchanged with respect to the known variance case,
,
is the sample variance and
(see Lesaffre and Lawson (2012), p. 86-87). HPD sets are then
, and
(7)
where
is the cdf of the t random variable with
degrees of freedom, location
and scale
and
, for
, are obtained by setting
in (2) and (3). Note that the non-informative case is retrieved for
, i.e. using a flat prior for
and
. In this case HPD sets are
(see Lesaffre and Lawson (2012) p. 88). The minimizers
of (7) depend on the data and can be determined numerically, for
, or according to results similar to (i)-(iii). For instance, it can be checked that
satisfies the equation
, where
and
denotes the density function of a
random variable
with
degrees of freedom, location
and scale
. For further details see [2].
3.1. Numerical Examples
For each loss function and for selected values of a we determine the optimal sets using the following procedure. We first consider a grid of values for k and determine the corresponding ET/HPD sets
and their expected loss
. Then we select
as the minimizer of
, for
. Figure 2 shows the plots of
as functions of k for Normal posteriors in the known variance case, under an informative prior (
, left panels) and a non-informative prior (
, right panels). For each value of a the optimal
is circled. According to the definition of the different size functions in (2) and (3), the impact of the penalizing coefficient a on interval lengths is not directly comparable. However, in general, as expected, the larger a the smaller
,
. Among the three losses,
is the most sensitive with respect to the values of a, which results in the largest range of values for
. Note also that the optimal values
are the largest, yielding optimal intervals with posterior probability
larger than 0.94 with smaller variability than those of the linear and of the exponential loss. For all the explored a values, the exponential loss function provides optimal sets with intermediate levels of length and posterior probability, and overall smaller values of the posterior expected loss. As expected, for each value of a, the curves
as functions of k are slightly but uniformly higher for the non-informative case (right panels).
Figure 2. Posterior expected losses
,
, as functions of k for different values of a for Normal posteriors under an informative prior with
(left column) and a non-informative prior with
(right column). For each
,
, circles denote
.
Interestingly, even though a selected
can be surprisingly smaller in the non-informative case, the corresponding value of
is always greater than in the informative case, due to the larger posterior variance of
. For instance, under the exponential loss function with
,
and
are equal to 1.54 and 0.335 in the non-informative case and 1.70 and 0.264 in the informative case (see Table 1). The above comments are consistent with the numerical values of
,
,
and
reported in Table 1 for the Normal model with known variance, both for the informative and the non-informative case.
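The grid-search procedure described at the beginning of this subsection can be sketched as follows for a normal posterior with known variance. This is a minimal Python illustration under stated assumptions, not the R code used for the article. The names `expected_loss` and `optimal_k` are hypothetical, the loss is assumed to be "size minus posterior probability", the interval is taken as posterior mean plus or minus k posterior standard deviations (`sigma_n`), and the exponential and rational size functions use an assumed parametrization.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_loss(k, sigma_n, a, loss):
    """Assumed posterior expected loss of the interval mu_n +/- k * sigma_n."""
    length = 2.0 * k * sigma_n
    coverage = 2.0 * STD_NORMAL.cdf(k) - 1.0
    if loss == "linear":
        size = a * length
    elif loss == "exponential":  # assumed parametrization
        size = 1.0 - math.exp(-a * length)
    elif loss == "rational":  # assumed parametrization
        size = a * length / (1.0 + a * length)
    else:
        raise ValueError(f"unknown loss: {loss}")
    return size - coverage


def optimal_k(sigma_n, a, loss, k_max=5.0, step=0.01):
    """Return the grid value of k with the smallest expected loss."""
    n_points = int(round(k_max / step)) + 1
    grid = [i * step for i in range(n_points)]
    return min(grid, key=lambda k: expected_loss(k, sigma_n, a, loss))


if __name__ == "__main__":
    # Hypothetical posterior standard deviation and penalizing constant.
    for loss in ("linear", "exponential", "rational"):
        k_star = optimal_k(sigma_n=1.0, a=0.2, loss=loss)
        print(loss, k_star, round(expected_loss(k_star, 1.0, 0.2, loss), 4))
```

A grid is used here because it mirrors the procedure described in the text. The step size and the upper bound `k_max` are arbitrary choices that control the precision of the selected k.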
Table 2 reports
,
,
and
for the unknown variance case assuming
, (i)
and (ii)
. Notice that the values of Table 2 side (i) are very close to those of Table 1 side (ii), since
yields a prior for
highly concentrated on the variance parameter that is assumed to be known in the former case. As regards the comparison between the informative and the non-informative case, considerations similar to those made for the known variance case also apply to Table 2 [compare (i) and (ii)].
Table 1. Length, posterior probability, posterior expected loss for
and values of
under the three loss functions for selected values of a, for the Normal model with known variance
assuming (i) an informative prior for
(e.g.
) and (ii) a non-informative prior for
(e.g.
).
Table 2. Length, posterior probability, posterior expected loss for
and values of
under the three loss functions for selected values of a, for the Normal model with unknown variance, assuming
, (i)
and (ii)
.
3.2. Application to Clinical Data
In this section we consider an application from [15], in which an experimental drug (t), indicated for the treatment of iron deficiency anemia in adult patients with chronic kidney disease, is compared to a control treatment (c). The primary endpoint is the mean change in hemoglobin from baseline to day 35. Let
be the difference between the expected changes in hemoglobin under the two treatments. Let
be the difference between the sample mean changes in hemoglobin
and
under the two treatments, where
and
, with
the size of the arm
. In the example
,
,
and
. Here, we assume that
and
are known and equal to the sample standard deviations reported in the original example, i.e. 1.14 and 1 respectively. Using the non-informative prior
for
, the 95% credible interval is
.
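The credible interval for a difference between two arm means, under a flat prior and known standard deviations, can be sketched as follows. This is a minimal Python illustration: the function name is hypothetical, and the sample means and arm sizes used in the example call are placeholders, not the trial values. Only the standard deviations 1.14 and 1 are those quoted in the text. With independent arms and a flat prior, the posterior of the difference is normal, centred at the observed difference, with variance equal to the sum of the two squared standard errors.

```python
import math
from statistics import NormalDist


def credible_interval_difference(mean_t, mean_c, sd_t, sd_c, n_t, n_c, level=0.95):
    """Equal-tails credible interval for the difference of two normal means.

    Assumes known standard deviations and a flat (non-informative) prior.
    """
    diff = mean_t - mean_c
    post_sd = math.sqrt(sd_t ** 2 / n_t + sd_c ** 2 / n_c)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return diff - z * post_sd, diff + z * post_sd


if __name__ == "__main__":
    # Placeholder means (1.0, 0.5) and arm sizes (60, 60): NOT the trial data.
    lower, upper = credible_interval_difference(1.0, 0.5, 1.14, 1.0, 60, 60)
    print(round(lower, 3), round(upper, 3))
```

Under these assumptions the interval coincides numerically with the usual frequentist confidence interval for the difference of two means with known variances.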
Table 3 shows the optimal intervals under the three loss functions for
with different choices of the prior parameters: (A)
(no difference between treatment effects); (B)
(prior information perfectly matching sample data); (C)
(optimistic prior mean); (D)
(non-informative).
As
increases [from (A) to (C)], interval bounds are shifted towards larger values, but the selected
is the same for each given loss function, thus yielding the same values of
,
and
,
. In the non-informative case (D), the values of
and
are uniformly greater than in the previous cases, due to the larger posterior variance. The rational loss always yields the widest optimal intervals, with posterior probability close to the conventional level 95%.
Figure 3 shows the behavior of
,
,
and
as functions of the coefficient a. As a increases,
and
tend to decrease and the corresponding values of
tend to increase. As expected the linear loss (solid line) is the most sensitive to changes of the values of a. Due to remark (i) of Section 3,
is nontrivial (
) for values of
, which is equal to 2.08 in this example. This explains the presence of the cusp in the three plots. Note that in the second panel the curve representing
shows values always very close to 0.95, whereas
progressively decreases as a increases. Finally, inspection of the values of
allows an overall look at the sensitivity to a of the three losses (see third panel). Although a direct comparison of the absolute values of
is not meaningful, due to the different role of a in the three loss functions, the rate of increase of the three curves shows a greater degree of robustness of the exponential loss.
Figure 3. Length, posterior probability and posterior expected loss of the optimal set
as functions of a, under the linear (solid line), rational (dashed line) and exponential loss (dotted line), for the Normal model with known variance assuming
(non-informative).
Table 3. Bounds, length, posterior probability, posterior expected loss for optimal sets
and values of
under the three loss functions with
, for the Normal model with known variance, assuming four alternative priors: (A)
(no difference between treatment effects); (B)
(prior information perfectly matching sample data); (C)
(optimistic prior mean); (D)
(non-informative).
4. Regret Analysis
As discussed at the end of Section 2 one typically uses the suboptimal sets
that guarantee minimal length in the class of fixed
-credibility level intervals.
For the normal model with known variance
. In this section we are interested in evaluating
, the additional expected loss of sets
given by Equation (5). Our goal is to quantify the cost of using the more pragmatic interval
instead of the optimal set
under a given loss function. Here the comparison is restricted to the exponential and rational loss functions, due to the drawbacks of the linear loss previously discussed. The values of
are computed noting that
and determining the values of
numerically as discussed in the previous section.
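The computation of the additional expected loss can be sketched as follows for a normal posterior with known variance. This is a minimal Python illustration under stated assumptions, not the R code used for the article. The names `expected_loss`, `optimal_k` and `regret` are hypothetical, the loss is assumed to be "size minus posterior probability", `gamma` denotes the fixed credibility level, and the exponential and rational size functions use an assumed parametrization. The fixed-credibility interval is the one with half-width equal to the standard normal quantile of order (1 + gamma)/2 times the posterior standard deviation `sigma_n`.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_loss(k, sigma_n, a, loss):
    """Assumed posterior expected loss of the interval mu_n +/- k * sigma_n."""
    length = 2.0 * k * sigma_n
    coverage = 2.0 * STD_NORMAL.cdf(k) - 1.0
    if loss == "exponential":  # assumed parametrization
        size = 1.0 - math.exp(-a * length)
    elif loss == "rational":  # assumed parametrization
        size = a * length / (1.0 + a * length)
    else:
        raise ValueError(f"unknown loss: {loss}")
    return size - coverage


def optimal_k(sigma_n, a, loss, k_max=6.0, step=0.001):
    """Grid minimizer of the expected loss over k in [0, k_max]."""
    n_points = int(round(k_max / step)) + 1
    grid = [i * step for i in range(n_points)]
    return min(grid, key=lambda k: expected_loss(k, sigma_n, a, loss))


def regret(gamma, sigma_n, a, loss):
    """Additional expected loss of the gamma-credible interval (0 <= gamma < 1)."""
    k_gamma = STD_NORMAL.inv_cdf(0.5 + gamma / 2.0)
    k_star = optimal_k(sigma_n, a, loss)
    return expected_loss(k_gamma, sigma_n, a, loss) - expected_loss(k_star, sigma_n, a, loss)


if __name__ == "__main__":
    # Hypothetical posterior standard deviation and penalizing constant.
    for loss in ("exponential", "rational"):
        print(loss, regret(gamma=0.95, sigma_n=0.2, a=1.0, loss=loss))
```

Up to the grid resolution, the regret is nonnegative and vanishes when the credibility level equals the posterior probability of the optimal interval.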
We explore the behaviour of
with respect to a, n and
for the application of Section 3.2. Moreover, we compare two alternative prior assumptions, i.e. the non-informative case (with
) and an informative case (with
). Note that under the non-informative prior the
-credible interval coincides with the frequentist
-confidence interval
.
Application to Clinical Data (Continued)
Figures 4-6 show the values of
,
, as functions of a, n and
respectively. Solid lines correspond to a prior sample size
, dashed lines to the non-informative case (
).
1) Effect of a (Figure 4)
a) As a varies the curves
have a minimum in
, which represents the value of the penalizing coefficient such that
are the closest to the optimal sets
. Note that
when the value
is such that
is optimal under the loss function j.
b) In all panels but the last one (see next item),
is smaller for
than for
. This means that, since the non-informative prior yields larger sets, the degree of penalization of interval length has to be smaller than it can be for larger values of
.
c) Under the rational loss function, for
(last panel)
is essentially negligible for a large range of values of a. This is consistent with what we observed in the previous section (see for instance Figure 3, second panel), since the optimal sets under the rational loss have credibility approximately equal to 0.95 for several values of a.
d) Depending on the choice of a, either the standard confidence interval or the Bayesian credible interval can be preferred in terms of additional expected loss.
Figure 4. Additional expected loss
with
(top panels) and
(bottom panels) as functions of a, under the exponential (left column) and the rational loss (right column), for the Normal model with known variance assuming
(solid line) and
(dashed line) with
.
Figure 5. Additional expected loss
with
(top panels) and
(bottom panels) as functions of n, under the exponential (left column) and the rational loss (right column) with
, for the Normal model with known variance assuming
(solid line) and
(dashed line).
Figure 6. Additional expected loss
as functions of
, under the exponential (left column) and the rational loss (right column) with
(top panels)
(center panels) and
(bottom panels) for the normal model with known variance assuming
(solid line) and
(dashed line).
2) Effect of n (Figure 5)
a) The range of
is much smaller than that of
. Values of
are particularly small when
(bottom panels).
b) In all the plots the solid and dashed curves
as functions of n have a minimum point
. This means that there exists a value of the sample size such that
is optimal. Such a value
is smaller for
than for
since the informative prior implies shorter
-credible sets than the non-informative one.
c) For values larger than
the curves
increase. In particular the higher steepness of solid curves is due to the shorter length of intervals under the informative prior for each given sample size. Moreover, solid and dashed curves eventually tend to coincide because the effect of
becomes more and more negligible with respect to larger and larger values of n.
d) For
(top panels) there are no values of n such that
, i.e. sets
are never optimal.
3) Effect of
(Figure 6)
a) In all panels the maximum value of
is obtained for
(when
reduces to
).
b) For increasing values of
the curves
decrease and reach their minimum at
, i.e.
increases with
.
c) The smaller the value of
the larger the discrepancy between the lengths of
and
and this discrepancy is stronger for informative priors that yield shorter intervals with respect to non-informative priors. The opposite is observed for large values of
when the exponential loss is used, whereas the rational loss seems to be less sensitive to the prior, at least for values of
.
d) Depending on the chosen value for
, either the standard confidence interval or the Bayesian credible interval can have the smallest regret. However, for large values of
, which are typically chosen in practice, regret is smaller for the informative Bayesian interval.
5. Conclusions
We can now attempt to summarize some indications drawn from the numerical example and the clinical data application of Sections 3 and 4.
As regards the optimality analysis the main points are the following.
1) As shown by Table 1 and Table 2, the value of the penalizing coefficient a is critical: there are no general guidelines for its choice, which is nevertheless highly influential in determining optimal intervals. This is also true more generally in decision analysis, for instance in testing problems when generalized 0-1 losses are used.
2) Non-informative distributions imply larger values of
and, as a consequence, of
than those obtained with informative priors, as one can argue by looking at Table 3.
3) Figure 2 and Figure 3 allow us to sketch a comparison among the behaviour with respect to a of the posterior expected losses induced by the three loss functions: the linear loss is the most sensitive; the rational loss yields larger sets and values of
close to 0.95; the values of
are the most robust with respect to a.
In the regret analysis of Section 4 we have explored the impact of a, n and
. Here are the main comments.
1) The value
, that minimizes the additional loss of
, quantifies the degree of penalization such that
is as close as possible to being optimal. In this sense it can be useful to support the interpretation and the choice of a.
2) As shown in Figure 4, according to the choice of a either the standard confidence interval or the Bayesian credible interval can be preferred in terms of additional expected loss.
3) Figure 5 suggests that, as n varies, the range of
is smaller than that of
. Furthermore it hints that it is possible to find values of n such that
is optimal and sample size beyond which the weight of the prior on
becomes more and more negligible.
4) In terms of additional expected loss, it cannot be taken for granted that the Bayesian credible interval is preferable to the standard confidence interval: it depends on the chosen value for
. Nevertheless, in practice large values of
are typically fixed. According to Figure 6 for large values of
the informative Bayesian interval has (slightly) smaller regret than confidence intervals.
As pointed out in [2] “it is difficult to treat the set estimation problem in a decision-theoretic way” because of some limitations related to the available loss functions and the difficulty of selecting an appropriate loss function. Nevertheless the possibility of balancing size and credibility is still appealing. In this paper we have explored the features of some loss functions that are alternative to the linear loss function and that produce sensible sets for normal models. Bearing in mind that the distribution of the data plays an important role, we hope to extend the present analysis to other models, in the spirit of [13] .