A Comparison of Hierarchical Bayesian Models for Small Area Estimation of Counts

Small area estimation (SAE) tackles the problem of providing reliable estimates for small areas, i.e., subsets of the population for which sample information is not sufficient to warrant the use of a direct estimator. Hierarchical Bayesian approach to SAE problems offers several advantages over traditional SAE models including the ability of appropriately accounting for the type of surveyed variable. In this paper, a number of model specifications for estimating small area counts are discussed and their relative merits are illustrated. We conducted a simulation study by reproducing in a simplified form the Italian Labour Force Survey and taking the Local Labor Markets as target areas. Simulated data were generated by assuming population characteristics of interest as well as survey sampling design as known. In one set of experiments, numbers of employment/unemployment from census data were uti-lized, in others population characteristics were varied. Results show persistent model failures for some standard Fay-Herriot specifications and for generalized linear Poisson models with (log-)normal sampling stage, whilst either unmatched or nonnormal sampling stage models get the best performance in terms of bias, accuracy and reliability. Though, the study also found that any model noticeably improves on its performance by letting sampling variances be stochastically determined rather than assumed as known as is the general practice. Moreover, we address the issue of model determination to point out limits and possible deceptions of commonly used criteria for model selection and checking in SAE context.


Introduction
In recent years, small area estimation (SAE) has emerged as an important area of statistics as private and public agencies try to extract the maximum information from sample survey data. Sample surveys are generally designed to provide estimates of characteristics of interest for large areas or domains. However, governments are more and more interested in obtaining statistics for smaller geographical areas such as counties, districts or census divisions, or smaller demographic subsets such as specific age-sex-race subgroups. These domains are called small areas. SAE concerns statistical techniques aimed at producing estimates of characteristics of interest for small areas or domains. A review of SAE methods is in [1] [2] [3]. The simplest approach is to consider direct estimators, that is estimating the variable of interest using the domain-specific sample data.
However, it is well known that the domain sample sizes are rarely large enough to support reliable and accurate direct estimators since budget and other constraints usually prevent drawing adequate samples from each of the small areas.
When direct estimates are unreliable (or even non computable), a major direction considers the use of explicit small area models that "borrow strength" from related areas across space and/or time or through auxiliary information which is supposed to be correlated to the variable of interest. Explicit models can be classified into two categories: 1) area level models and 2) unit level models. They can be estimated by adopting several alternative approaches and one of these has been the hierarchical Bayesian (HB) paradigm. However, applications of HB models to SAE, though growing [1], still are quite a few. Moreover, they have mainly focused on continuous variables. To date, there is no thorough discussion on what is the most appropriate nonlinear specification of area level models when small area estimates are needed for discrete or categorical variables.
In this paper, we focus on HB area level models for producing small area estimates of counts. In the literature, Bayesian specifications commonly derive from classical models for SAE, e.g. the Fay-Harriot model [4], or more properly consider either a generalized linear Poisson model [5] [6] or a multinomial logit model [7]. [8] presented a Normal-logNormal model within the class of the so called unmatched models. Following the HB way of thinking, we independently proposed a Normal-Poisson-logNormal model arguing that this unmatched form could be more appropriate for taking explicitly into account the nature of the variable of interest [9] [10]. An application of this model, originally extended to enable the use of multiple data sources possibly misaligned with small areas, is in [11]. Moreover, we suggested a Gamma-Poisson-logNormal model, that introduces a nonnormal sampling error stage, and advocated a natural extension of the several above specifications by letting sampling variances be stochastically determined rather than fixed to design estimates as is the general practice [12].
For completeness, we mention [13] who compare four HB small area models for producing state estimates of proportions: the original proposal consists of a Beta sampling stage with a logit linking model. Still in a Bayesian context, [14] [15] [16] handle the problem of unknown sampling variances.
Under appropriate conditions each of these models may have some merits and whether it is appropriate depends on various circumstances like size of the areas, availability of good explanatory variables at area level, accuracy of sampling variance estimates, etc. Practical use of HB models has been boosted by the availability of software that implements Markov chain Monte Carlo (MCMC) simulations so that model estimation can be straightforward and relatively easy. Room is left for investigating the peculiarity of different specifications and for identifying criteria and guidelines for choosing among alternative Bayesian specifications.
Purpose of the present work is comparing alternative HB area level models for SAE of counts. Comparison is made first on a theoretical side and then by a simulation study. This last is aimed to reproduce one of the most relevant instances where SAE has proven its potential, i.e. estimation of labour force statistics at a local level finer than the survey planned domains. The specific framework for the simulation is estimation of the number of unemployed (employed) within Local Labor Markets (LLMs, i.e. areas including a group of municipalities which share the same labor market conditions). In most developed countries, the major source of information on the labor market is a Labor Force Survey (LFS).
In Italy, LFS design has been planned so that reliable (design-based) estimates of given precision can be obtained for regional and provincial quantities, quarterly and yearly respectively. LLMs are a finer regional partition and the sample sizes associated with such minor domains result inadequate to allow for stable (design-based) estimates [12]. Simulated data were generated by assuming population characteristics of interest as well as sampling survey design as known. In one set of experiments, the actual LLM unemployment (employment) figures from census data were utilized, in others population characteristics were varied (by changing the type of distribution symmetry). Still, LLM survey sample sizes were either maintained fixed at actual LFS values or given different values. The sampling design was kept quite simple across all studies, moreover, synthetic estimates comprise the sole source of auxiliary information incorporated into model framework. Although the core of models is quite basic, it is worth noting that it is the framework actually used in Italy to produce totals of unemployed for LLMs since late nineties.
In summary, this paper, through a number of HB area level models for SAE of totals, compares three broad classes: matched, unmatched and nonnormal sampling stage models. A first comparison, on the basis of a design-based simulation from census data, is made by assuming known sampling variances. Secondarily, once detected specific model failures in terms of bias, accuracy and reliability, this hypothesis is abandoned and minor ameliorations are furtherly carried out to models. The comparison is repeated also by varying the finite-population simulation. Moreover, we address the issue of model determination to point out limits and possible deceptions of commonly used criteria for model selection and checking in SAE context, namely, the deviance information criterion (DIC) and the posterior predictive p-value (PPp). In the sequel, Section 2 presents the alternative HB models at comparison specifying motivations behind their introduction, Section 3 describes the simulation study, discusses the results and pro-poses a number of model refinements, finally, Section 4 contains some concluding remarks.

Simulation Plan and Performance Measures
To compare the performance of the six HB models, we carried out a simulation study based on reproducing a simplified LFS in the "world" of 1991 Veneto  Table 7 in Section 5 presents census, sampling design as well as one sample of simulated LFS data that were used in our study.) To strengthen the results of our study, a further series of simulations was carried out by generating over the small areas set considered above three synthetic populations having, in particular, a positively asymmetric (Table 8), an essentially symmetric (Table 9) and a negatively asymmetric (Table 10) Table 7, the suffix e u denotes whether the quantity is related to employment or unemployment.) Variances i σ of sampling models (1)-when assumed as known-are set to for sampled areas with non-null direct estimates ˆi θ . Whilst, for those areas having y i = 0 with n i < 0, we impute the synthetic proportion of employed/unemployed, that is we replace ˆi p with p , in the formula for The quantities calculated, for each area i, are: relative bias (rb) and the absolute relative bias (arb), absolute relative error (are) and the relative root mean squared error (rE) , both relating to estimator accuracy; ) which is not necessarily the preferential one; furtherly, estimation goal is not unique (is triple according to [1] and [17]). Regardless, we need to know to what extent such an indicator is reliable for measuring the degree of uncertainty about the provided estimates.
Comparison among different HB models is finally completed by looking at some standard criteria for model selection in a Bayesian framework, namely we consider: 1) a likelihood based criterion, the DIC and 2) a predictive distribution-based method, the PPp. The DIC is based upon posterior expectation of the deviance which is defined as θ θ θ for a chosen likelihood ( ) ; L θ θ and for some standardizing function f , where θ serve as data in small area context. In Table 4 and similar ones, , according to the definition given by [18] in their article first proposing this criterion (and the notation therein). PPp is defined, for normal sampling stage models (NN, FH, YR, whereas, for the remaining ones (PlN, GPlN), as with ,  Table 1 and Table 2 display averages, over the I small areas, of rb, arb, are, rE, eff and rel (%), for design-based as well as HB model-based estimators, from simulated data (based on the real population) on employment and unemployment respectively. The "average" is measured in terms of the mean and the median (first and second columns respectively for each measure). In Table 2 and related ones which follow, averages were computed by excluding non-sampled areas as well (rows named as non-na).

First Findings
In Table 1   It is worth noting that any discrepancy in percent numbers of Table 1 can be appreciated just at one decimal digit. In fact, employment rates range from 34.7% to 44.5% (Table 7) whence i cv 's range from 13.8/4.3% to 11.2/3.5% with n i = 100/1000 ( Figure 1). Then, to assess model performance we will focus especially Outcomes pointed out in Table 2 Tables 8-10).
Moreover, it turns out that mse of synthetic estimates is badly estimated by Marker's method which tends to heavily overestimate it.
NN gets the worst scores in terms of (absolute) bias and accuracy (Table 2).
Looking at single small areas situation, ˆH B i θ s suffer from a terrific underestimation especially for the least populated/sampled areas (see the reduced bias for non-na areas and Figure 2, second column of panels). Moreover, the estimated cv is, in median terms, scarcely reliable ( Figure 2, fifth row) except for sampled areas with little information (it is above 100 just for non-na areas with the lowest  Table 2; see also Figure 2, second row, third and fourth panels) and their accuracy is low in terms of expectation (high are and rE ) again depending on the biased estimation for some areas (Figure 2, third and fouth rows), though improve in accuracy with respect to NN and both the design-based estimators. Estimated cv is scarcely reliable for both of them, yet more markedly for FH and for non-sampled areas. This is likely due to a noticeable shrinkage of the estimates toward the mean (yet a biased mean), as it results also from a low DIC (in particular, a small effective number of parameters, pD ; see Table 3, Table 4 which will be discussed later).
Right now, it is worth comparing the three aforementioned FH estimators: FH(2) and FH(3) are practically indistinguishable; FH is less biased than FH (2) (on the other hand, this is well expected since Jensen inequality) yet its cv is totally unreliable. Therefore, without loss of information, only FH(2) and FH(3) will be reported from now on. Since now, a minor difference is detectable between these two just as to reliability (this keeps through all the simulation experiments).
The unmatched models, YR and NPlN, and the non-normal sampling model GPlN show to have the best performance in terms of (absolute) bias, accuracy and reliability (see the highlighted values and regions of light or no shading in Table 2). The unmatched models show to be the most accurate while GPlN seems to be the most "well-centered" and to give the most "well-calibrated" efficiency measure (the last two columns of Figure 2 seem almost indistinguishable; at a careful reading, difference between the unmatched models and GPlN relates   Table 3 for explanation).
to non-na areas with lowest sample size: GPlN is somewhat less accurate though the estimated cv associated to such areas are fully reliable). However, a tendency to underestimation (negative bias) is a non negligible weak point: as we will see in the next section, such a deficiency can be promptly remedied by a (obvious) model refinement.  A further discussion on model selection and checking is beyond the scope of the paper. Yet, the problem of small area model diagnostics constitutes an important, and still largely unexplored, direction of SAE research.

Some Refinements
In this section we consider some refinements of the HB models previously introduced, which involve relevant improvements of their performance. The major development consists in letting sampling variances ( i σ or i σ when a logarithmic transformation occurs or i a in GPlN) be stochastic, whereas, so far, they have been assumed as known and fixed to off-set estimates of sampling design variance ( ) p i v θ . This extension-that we refer to as model-based sampling variance function -arises naturally from a model-based approach to SAE problems. If we assume that a certain model is valid for a certain phenomenon θ, then all the unknown quantities that depend on θ should be made model-generated instead of being imputed as fixed to the model. In our context, the design variance is assumed to be a function of the unknown quantity of interest, i.e.

( ) ( )
But, the design quantity to be imputed (whatever is: ) is usually function of the unknown i θ , e.g.
=− under the SRSWR hypothesis as for our simulation. Common practice essentially consists in replacing ˆi θ to i θ in the design quantity function, e.g., in our example, It is now easy to detect why the standard FH estimator ˆH B cv θ fixed to its true value (rows named as 'true cv').
According to the foregoing comments, in all of the HB models so far considered, we "plug-in" the unobservable i θ into the sampling design variance function ( ) ( ) modelled. Incidentally, we note that the implementation of such a strategy is relatively straightforward within the Bayesian approach (not as such in the frequentist one).
Before going through simulation results, we also mention some ameliorations to NN model. The linearity assumed for the linking model is likely inadequate.
Yet, since here the linear predictor (î S i i θ θ α ν = + + ) is quite nave (a null one), we may try to adjust it somehow before dropping the linear link definitely. In fact, the bad performance is probably due to mis-calibrated random effects i ν 's: in log-normal linking models, i θ depends in a multiplicative way on i ν Thereby, a simple remedy to NN model deficiency might consist in weighting i ν by population N i . By doing that, indeed, a surprising improvement of model performance is immediately obtained (see row NN ( i i N ν ⋅ ) in Table 5 and Table 6).
Model performance greatly improves for all of the models thanks to the aforementioned stochastic extension. Table 5 and Tables 8-10 show how all of  Results shown in Table 5 and Table 8 were obtained using, respectively, the simulated data from census and from a synthetic positively skewed population (thus maintaining the same type of asymmetry of the original unemployment dataset; see details in Section 3.1), hence both comparable to results in Table 2.
The most noticeable change in Table 5 concerns bias. FH sv estimator is now almost unbiased, moreover, YR, NPlN and GPlN are no more negatively biased: a stochastic variance helps model centering (the explanation for that has been given above). Such an improvement affects the rest of performance measures making all the sv models almost competitive.
In Table 6  Two further synthetic populations were considered: one with an essentially symmetric distribution (Table 9) and the other with negative skewness (Table   10), in order to explore (the expected worsening of) performance (with negative asymmetry) of standard FH models, and, at the same time, detect the parallel reaction of the models that have so far shown to perform better. We show the results only for canonical and extended FH, YR, NPlN models as well as for GPlN, this last just in the sv version (being the most naturally conceivable for it), to enlighten the difference in performance of the first one (that worsens with negative asymmetry, mostly in terms of expectation) compared to the competitive outcomes of the stochastic variance models. (Results on "true cv" FH have also been reported and written in italic to stress they are only theoretically conceivable.) Commenting on Tables from 8 onward would give rise to observations analogous to the ones already given above. We merely note that the advantage in using stochastic variance models diminishes with increasing sample size (letting fixed variance models regain ground in the rankings), up to the situation where direct estimators result better than model-based ones (see are and rE in the bottom panel of Table 9 and middle panel of Table 10).

Concluding Remarks
The benefits of using HB models for SAE problems have been largely recognized.
They include the availability of a wider set of tools to handle complex and more realistic models and to get reliable measures of variability. When estimating counts it might not be clear which specification is more appropriate but relevant quantities should be properly modeled. Object of the study was to show that different model specifications are possible from the ones customarily used in non-normal/non-linear cases. The purpose was definitely not to show the general superiority of one framework over the others, being the range of situations one has to face with in real analyses practically unlimited. Moreover, there are (at least) two kinds of reasons why the generalization of our results is rather restricted. First, the type of simulation study we have performed is not the one ordinarily carried out in statistical studies of model comparison. In fact, we have not simulated from a posited model, rather we have considered design-based simulations. That is, given a real phenomenon (in our case, LLMs unemployment in a given region and year either known from census data or simulated), a sample survey has been (repeatedly) simulated according to a fixed sampling design. The idea we follow is "all models are wrong but some are useful" [24], and, this concept goes quite naturally with the demand to a SAE analyst for providing estimates of "real world" quantities. A design-based simulation, then, meets the requirement of positing a "true" population instead of a "true" model. Unfortu On this regard, it is worth noting that, in presence of other sources of information to account for, comparison would have become even more complicated if the data are not simulated on the ground of a posited model, but are considered as real world observations and so generated by a "black box" mechanism.
However, the results from simulation study show persistent model failures for some standard Fay-Herriot specifications and for generalized linear Poisson models with (log-)normal sampling stage in SAE problems with count data, and how even minor model modifications can noticeably improve on their performance. In particular, we advocate the extension of model specification from assuming sampling variances fixed to some off-set estimates (typically design sampling variance estimates) to letting them be (stochastically) generated by the model (at least in their components depending from latent variables estimated within the model itself). On the other hand, unmatched and non-normal sampling stage models show definitely a better performance in terms of bias, accuracy and reliability even in the fixed sampling variance version. Notice that the extension to stochastic sampling variances is straightforward and relatively easy to implement within a Bayesian framework as well as the non-standard specifications are practically feasible only by means of HB models.
Moreover, the study brings out some limits and possible deceptions of com-monly used criteria for model determination in the context of SAE problems, namely DIC and PPp. The version of PPp here addressed is the one commonly used in Bayesian analyses-within SAE context yet also statistic-worldwidewhich tests the (global) validity of model first stage, i.e., in our case, the sampling model. Nevertheless, this particular PPp measure cannot inform us on many other possible model failures. On the other hand, the DIC is not comparable across the models, the main reason being that it is calculated-as routinely is-relatively to the first stage unobservable variables which vary in essence and number across the HB specifications.
In conclusion, future research should definitely focus on defining proper devices for model determination in the field of SAE. Besides, in order to magnify differences between models at comparison in simulation studies, model structure has to be complicated for considering more realistic situations (adding auxiliary information, taking spatial structure into account, etc.), more complex sampling designs are to be experienced, as well as different real population phenomena are to be explored (small areas at different level of territorial aggregation, small area population with different distributional characteristics, etc.