Scientific Research

An Academic Publisher

**A Comparison of Hierarchical Bayesian Models for Small Area Estimation of Counts** ()

*i.e.*, subsets of the population for which sample information is not sufficient to warrant the use of a direct estimator. Hierarchical Bayesian approach to SAE problems offers several advantages over traditional SAE models including the ability of appropriately accounting for the type of surveyed variable. In this paper, a number of model specifications for estimating small area counts are discussed and their relative merits are illustrated. We conducted a simulation study by reproducing in a simplified form the Italian Labour Force Survey and taking the Local Labor Markets as target areas. Simulated data were generated by assuming population characteristics of interest as well as survey sampling design as known. In one set of experiments, numbers of employment/unemployment from census data were utilized, in others population characteristics were varied. Results show persistent model failures for some standard Fay-Herriot specifications and for generalized linear Poisson models with (log-)normal sampling stage, whilst either unmatched or nonnormal sampling stage models get the best performance in terms of bias, accuracy and reliability. Though, the study also found that any model noticeably improves on its performance by letting sampling variances be stochastically determined rather than assumed as known as is the general practice. Moreover, we address the issue of model determination to point out limits and possible deceptions of commonly used criteria for model selection and checking in SAE context.

Keywords

Share and Cite:

*Open Journal of Statistics*,

**7**, 521-550. doi: 10.4236/ojs.2017.73036.

1. Introduction

In recent years, small area estimation (SAE) has emerged as an important area of statistics as private and public agencies try to extract the maximum information from sample survey data. Sample surveys are generally designed to provide estimates of characteristics of interest for large areas or domains. However, governments are more and more interested in obtaining statistics for smaller geographical areas such as counties, districts or census divisions, or smaller demographic subsets such as specific age-sex-race subgroups. These domains are called small areas. SAE concerns statistical techniques aimed at producing estimates of characteristics of interest for small areas or domains. A review of SAE methods is in [1] [2] [3] . The simplest approach is to consider direct estimators, that is estimating the variable of interest using the domain-specific sample data. However, it is well known that the domain sample sizes are rarely large enough to support reliable and accurate direct estimators since budget and other constraints usually prevent drawing adequate samples from each of the small areas. When direct estimates are unreliable (or even non computable), a major direction considers the use of explicit small area models that “borrow strength" from related areas across space and/or time or through auxiliary information which is supposed to be correlated to the variable of interest. Explicit models can be classified into two categories: 1) area level models and 2) unit level models. They can be estimated by adopting several alternative approaches and one of these has been the hierarchical Bayesian (HB) paradigm. However, applications of HB models to SAE, though growing [1] , still are quite a few. Moreover, they have mainly focused on continuous variables. To date, there is no thorough discussion on what is the most appropriate nonlinear specification of area level models when small area estimates are needed for discrete or categorical variables.

In this paper, we focus on HB area level models for producing small area estimates of counts. In the literature, Bayesian specifications commonly derive from classical models for SAE, e.g. the Fay-Harriot model [4] , or more properly consider either a generalized linear Poisson model [5] [6] or a multinomial logit model [7] . [8] presented a Normal-logNormal model within the class of the so called unmatched models. Following the HB way of thinking, we independently proposed a Normal-Poisson-logNormal model arguing that this unmatched form could be more appropriate for taking explicitly into account the nature of the variable of interest [9] [10] . An application of this model, originally extended to enable the use of multiple data sources possibly misaligned with small areas, is in [11] . Moreover, we suggested a Gamma-Poisson-logNormal model, that introduces a nonnormal sampling error stage, and advocated a natural extension of the several above specifications by letting sampling variances be stochastically determined rather than fixed to design estimates as is the general practice [12] .

For completeness, we mention [13] who compare four HB small area models for producing state estimates of proportions: the original proposal consists of a Beta sampling stage with a logit linking model. Still in a Bayesian context, [14] [15] [16] handle the problem of unknown sampling variances.

Under appropriate conditions each of these models may have some merits and whether it is appropriate depends on various circumstances like size of the areas, availability of good explanatory variables at area level, accuracy of sampling variance estimates, etc. Practical use of HB models has been boosted by the availability of software that implements Markov chain Monte Carlo (MCMC) simulations so that model estimation can be straightforward and relatively easy. Room is left for investigating the peculiarity of different specifications and for identifying criteria and guidelines for choosing among alternative Bayesian specifications.

Purpose of the present work is comparing alternative HB area level models for SAE of counts. Comparison is made first on a theoretical side and then by a simulation study. This last is aimed to reproduce one of the most relevant instances where SAE has proven its potential, i.e. estimation of labour force statistics at a local level finer than the survey planned domains. The specific framework for the simulation is estimation of the number of unemployed (employed) within Local Labor Markets (LLMs, i.e. areas including a group of municipalities which share the same labor market conditions). In most developed countries, the major source of information on the labor market is a Labor Force Survey (LFS). In Italy, LFS design has been planned so that reliable (design-based) estimates of given precision can be obtained for regional and provincial quantities, quarterly and yearly respectively. LLMs are a finer regional partition and the sample sizes associated with such minor domains result inadequate to allow for stable (design-based) estimates [12] . Simulated data were generated by assuming population characteristics of interest as well as sampling survey design as known. In one set of experiments, the actual LLM unemployment (employment) figures from census data were utilized, in others population characteristics were varied (by changing the type of distribution symmetry). Still, LLM survey sample sizes were either maintained fixed at actual LFS values or given different values. The sampling design was kept quite simple across all studies, moreover, synthetic estimates comprise the sole source of auxiliary information incorporated into model framework. Although the core of models is quite basic, it is worth noting that it is the framework actually used in Italy to produce totals of unemployed for LLMs since late nineties.

In summary, this paper, through a number of HB area level models for SAE of totals, compares three broad classes: matched, unmatched and nonnormal sampling stage models. A first comparison, on the basis of a design-based simulation from census data, is made by assuming known sampling variances. Secondarily, once detected specific model failures in terms of bias, accuracy and reliability, this hypothesis is abandoned and minor ameliorations are furtherly carried out to models. The comparison is repeated also by varying the finite-population simulation. Moreover, we address the issue of model determination to point out limits and possible deceptions of commonly used criteria for model selection and checking in SAE context, namely, the deviance information criterion (DIC) and the posterior predictive p-value (PPp). In the sequel, Section 2 presents the alternative HB models at comparison specifying motivations behind their introduction, Section 3 describes the simulation study, discusses the results and proposes a number of model refinements, finally, Section 4 contains some concluding remarks.

2. HB Models for Small Area Estimation with Count Data

2.1. General Framework

The core of classical small area models consists of linear mixed models. The basic small area model for area level data is the Fay-Herriot model [4] which consists of an area-linking model, e.g. ${\theta}_{i}={x}_{i}^{\text{T}}\beta +{\nu}_{i}$ , ${\nu}_{i}~N\left(\mathrm{0,}\tau \right)$ (hereafter ~ stays for “independently distributed as”), coupled with a sampling error model, e.g. ${\stackrel{^}{\theta}}_{i}={\theta}_{i}+{\u03f5}_{i}$ , ${\u03f5}_{i}\mathrm{|}{\theta}_{i}~N\left(\mathrm{0,}{\sigma}_{i}\right)$ , with ${\theta}_{i}$ , ${\stackrel{^}{\theta}}_{i}$ and ${x}_{i}$ denoting respectively characteristic of interest, survey estimate (when available) and possible auxiliary data, for each area i. The linking model is merely a linear model with mixed coefficients: fixed coefficients $\beta $ , accounting for $x$ effects valid for the entire population, and random area-specific effects ${\nu}_{i}$ . Sampling variances ${\sigma}_{i}$ are usually assumed to be known; parameters $\beta $ and $\tau $ have to be estimated.

Under a HB approach mixed models are stage-wise specified; in particular, the Fay-Herriot model gives rise to the following specification:

${\stackrel{^}{\theta}}_{i}\mathrm{|}{\theta}_{i}\mathrm{,}{\sigma}_{i}~N\left({\theta}_{i}\mathrm{,}{\sigma}_{i}\right)$ (1)

${\theta}_{i}\mathrm{|}\beta \mathrm{,}\tau ~N\left({x}_{i}^{\text{T}}\beta \mathrm{,}\tau \right)$ (2)

$\left(\beta \mathrm{,}\text{\hspace{0.05em}}\tau \right)~p\left(\beta \mathrm{,}\tau \right)\mathrm{.}$ (3)

Sampling and linking models, (1) and (2), are unchanged, whereas an additional hyperprior stage, (3), is required within a full HB approach.

Notwithstanding a (proper) informative prior distribution on the hyperparameters would be appropriate for a full Bayesian analysis, for ignorance or because we want inference to be driven solely by the data at hand, noninformative priors are often used (this is still mainstream practice in SAE analyses). In this case, to avoid posterior density to be improper, diffuse yet proper (otherwise said, weakly-informative) priors are routinely assumed. Such a choice―which however needs a careful sensitivity analysis especially when models are barely identified―generally ensures a valid inference.

The classical FH specification may be defective either because it: 1) assumes the sampling errors
${\u03f5}_{i}={\stackrel{^}{\theta}}_{i}-{\theta}_{i}$ as normal or because 2) sets a linear link
${\theta}_{i}={x}_{i}^{\text{T}}\beta +{\nu}_{i}$ directly between
${\theta}_{i}$ and
${x}_{i}$ . Indeed, q_{i}’s are counts (i.e. positive-integer valued variates), moreover, a non-identity link
$g\left({\theta}_{i}\right)={x}_{i}^{\text{T}}\beta +{\nu}_{i}$ may be more appropriate when the predicted variable
${\theta}_{i}$ is non-continuous and/or the covariates
${x}_{i}$ are thought to produce a non-additive effect on it. Of course, the Normal-Normal model (1 - 3) owes its popularity to being in general computationally convenient and inferentially tractable by classical estimation methods. On the other hand, in a HB approach inference is straightforward and computationally feasible thanks to MCMC methods, the most popular computing tools in Bayesian practice. The flexibility inherent to HB modeling and its computational tractability allow the choice of more realistic models for SAE problems than alternative approaches could never envision.

2.2. Alternative HB Models

We define alternative HB area level models for a SAE problem with count data. In the following, ${}_{S}{\stackrel{^}{\theta}}_{i}$ indicates a synthetic estimate for small area i, while ${\theta}_{i}$ , ${\stackrel{^}{\theta}}_{i}$ and ${x}_{i}$ have the same meaning as in Section 2.1.

Six model specifications are theoretically conceivable when the parameter of interest is the small area total (such as the number of unemployed or employed in our application). They are:

the Normal-Normal model (NN)

${\stackrel{^}{\theta}}_{i}\mathrm{|}{\theta}_{i}\mathrm{,}{\sigma}_{i}~N\left({\theta}_{i}\mathrm{,}{\sigma}_{i}\right)$ (4)

${\theta}_{i}\mathrm{|}\beta \mathrm{,}\tau ~N\left({}_{S}{\stackrel{^}{\theta}}_{i}+\alpha +\beta {x}_{i}\mathrm{,}\tau \right)\mathrm{,}$ (5)

the log-Normal-Normal model (FH)

$\mathrm{log}\left({\stackrel{^}{\theta}}_{i}\right)|{\theta}_{i},{\stackrel{\u02dc}{\sigma}}_{i}~N\left(\mathrm{log}\left({\theta}_{i}\right),{\stackrel{\u02dc}{\sigma}}_{i}\right)$ (6)

$\mathrm{log}\left({\theta}_{i}\right)|\beta ,\tau ~N\left(\mathrm{log}\left({}_{S}{\stackrel{^}{\theta}}_{i}\right)+\alpha +\beta {x}_{i},\tau \right),$ (7)

the Normal-logNormal model (YR)

${\stackrel{^}{\theta}}_{i}\mathrm{|}{\theta}_{i}\mathrm{,}{\sigma}_{i}~N\left({\theta}_{i}\mathrm{,}{\sigma}_{i}\right)$ (8)

$\mathrm{log}\left({\theta}_{i}\right)|\beta ,\tau ~N\left(\mathrm{log}\left({}_{S}{\stackrel{^}{\theta}}_{i}\right)+\alpha +\beta {x}_{i},\tau \right),$ (9)

the Normal-Poisson-logNormal model (NPlN)

${\stackrel{^}{\theta}}_{i}\mathrm{|}{\theta}_{i}\mathrm{,}{\sigma}_{i}~N\left({\theta}_{i}\mathrm{,}{\sigma}_{i}\right)$ (10)

${\theta}_{i}\mathrm{|}{\mu}_{i}~Poisson\left({\mu}_{i}\right)$ (11)

$\mathrm{log}\left({\mu}_{i}\right)|\beta ,\tau ~N\left(\mathrm{log}\left({}_{S}{\stackrel{^}{\theta}}_{i}\right)+\alpha +\beta {x}_{i},\tau \right),$ (12)

the Poisson-logNormal model (PlN)

${\stackrel{^}{\theta}}_{i}\mathrm{|}{\mu}_{i}~Poisson\left({\mu}_{i}\right)$ (13)

$\mathrm{log}\left({\mu}_{i}\right)|{\theta}_{i},{\stackrel{\u02dc}{\sigma}}_{i}~N\left(\mathrm{log}\left({\theta}_{i}\right),{\stackrel{\u02dc}{\sigma}}_{i}\right)$ (14)

$\mathrm{log}\left({\theta}_{i}\right)|\beta ,\tau ~N\left(\mathrm{log}\left({}_{S}{\stackrel{^}{\theta}}_{i}\right)+\alpha +\beta {x}_{i},\tau \right),$ (15)

and the Gamma-Poisson-logNormal model (GPlN)

${\stackrel{^}{\theta}}_{i}\mathrm{|}{\mu}_{i}~Poisson\left({\mu}_{i}\right)$ (16)

${\mu}_{i}\mathrm{|}{\theta}_{i}\mathrm{,}{a}_{i}~Gamma\left({a}_{i}\mathrm{,}{a}_{i}/{\theta}_{i}\right)$ (17)

$\mathrm{log}\left({\theta}_{i}\right)|\beta ,\tau ~N\left(\mathrm{log}\left({}_{S}{\stackrel{^}{\theta}}_{i}\right)+\alpha +\beta {x}_{i},\tau \right).$ (18)

For a fully Bayesian specification, the basis of every model hierarchy consists of an hyperprior stage, like as (3), which we leave generically expressed since the discussion is focused on analysing how closer-to-data stages can be variously specified.

Either the NN (4 - 5) or the FH (6 - 7) specifications consist in the Fay-Herriot model with ${\stackrel{^}{\theta}}_{i}$ and ${\theta}_{i}$ both untransformed (NN) or both log-transformed (FH). NN and FH are both members of the so-called matched models [7] in the sense that sampling and linking models can be combined to produce a linear mixed model. Namely, once a suitable function $g(\cdot )$ is chosen to relate the parameter of interest to auxiliary variables through a Normal model ( $g\left({\theta}_{i}\right)={x}_{i}\beta +{\nu}_{i}$ , with ${\nu}_{i}$ normal variates), also direct estimates are accordingly transformed in the sampling model ( $g\left({\stackrel{^}{\theta}}_{i}\right)=g\left({\theta}_{i}\right)+{e}_{i}$ , again with normal errors ${e}_{i}$ ) in order to combine the two equations into a single linear model (from the foregoing equations, $g\left({\stackrel{^}{\theta}}_{i}\right)={x}_{i}\beta +{\nu}_{i}+{e}_{i}$ ). Small area estimates are then obtained by inverting $g(\cdot )$ .

On the other side, YR (8 - 9) and NPlN (10 - 12) forms are both unmatched models in the sense that stage-1 and stage-2 models cannot be combined into a single equation model. You and Rao [8] proposed the YR form when g is specifically the log-function, but their arguments can be generalized to any nonlinear function of ${\theta}_{i}$ . They warned that customary hypotheses on sampling errors ${e}_{i}$ may be quite questionable when g is nonlinear and area sample size is small (in particular, they refer to the unbiasedness assumption $E\left({e}_{i}|{\theta}_{i}\right)=0$ and the Tay-

lor approximation ordinarily set for the variance, i.e. $var\left({e}_{i}\mathrm{|}{\theta}_{i}\right)\approx {\left\{{g}^{\prime}\left({\theta}_{i}\right)\right\}}^{2}{\sigma}_{i}$

with ${\sigma}_{i}=var\left({\u03f5}_{i}\mathrm{|}{\theta}_{i}\right)$ ). Thus their advice is to let sampling model ${\stackrel{^}{\theta}}_{i}={\theta}_{i}+{\u03f5}_{i}$ be unaltered so that condition $E\left({\u03f5}_{i}\mathrm{|}{\theta}_{i}\right)=0$ (i.e. design-unbiasedness of ${\stackrel{^}{\theta}}_{i}$ ) holds and, moreover, the design-variance ${v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)$ , which is taken as known, can be imputed to the sampling variance ${\sigma}_{i}$ . (Note that we use ${\u03f5}_{i}$ instead of ${e}_{i}$ whenever direct estimates ${\stackrel{^}{\theta}}_{i}$ are left untransformed in the sampling model.) Finally, they choose the HB approach since inference on non-standard specifications may not be feasible by means of classical estimation methods. We note also that with specific reference to FH (6 - 7) model, survey information may be partly wasted, in that transformed direct estimates $log\left({\stackrel{^}{\theta}}_{i}\right)$ are not defined when ${\stackrel{^}{\theta}}_{i}=0$ . Thus, missing data originate both from areas with null direct estimates (which may not be so rare when area sample size is small) and, as usually, from non-sampled small areas.

Trevisani and Torelli [9] [10] proposed the NPlN model which derives from building, stage by stage, a HB structure suited for SAE problems with count variables. Thereby, first stage (10) is modeled, similarly to YR (8), in terms of untransformed direct estimates ${\stackrel{^}{\theta}}_{i}$ , so that both design-unbiasedness and ${\sigma}_{i}={v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)$ assignment hold. Then, noting that response variable ${\theta}_{i}$ of the linking regression analysis is a count, second (11) and third (12) stages consist of a standard Poisson-log-Normal model.

Indeed, this choice has been (determined not only by the type of
${\theta}_{i}$ variable but also) borrowed from the extensive literature on disease mapping. In such an area, disease counts are modelled as Poisson variates with mean
${\mu}_{i}={\rho}_{i}\times {E}_{i}$ where
${\rho}_{i}$ is the relative risk in area i and
${E}_{i}$ the expected count. A regression equation is then usually set on a logarithmic scale,
$log\left({\rho}_{i}\right)={x}_{i}\beta +{\nu}_{i}$ , to opportunely accommodate for a linear predictor
${x}_{i}\beta $ and any random effect
${\nu}_{i}$ (which is customarily assumed to be normally distributed). Note that E_{i} may be taken as random but is usually taken as known and (for including any information prior to inference,) enters the regression equation as an offset. In SAE context, E_{i} can be set at known synthetic estimate
${}_{S}{\stackrel{^}{\theta}}_{i}$ and a model-like-disease mapping is then fitted to counts
${\theta}_{i}$ .

To some readers the specification of ${\stackrel{^}{\theta}}_{i}$ might appear inconsistent with that of ${\theta}_{i}$ : ${\stackrel{^}{\theta}}_{i}$ are generated from a continuous distribution over the real line while ${\theta}_{i}$ are drawn from a discrete distribution like the Poisson one. Yet, at a second insight, one realizes that it is the sampling model for ${\stackrel{^}{\theta}}_{i}$ ’s to be really inconsistent with the integer type of the variable of interest ${\theta}_{i}$ . Nonetheless, we decided to let sampling model be specified as standardly is in SAE literature (indeed, ${\stackrel{^}{\theta}}_{i}$ might be non-integer since it derives from an estimation process not necessarily constrained to produce integer values, though definitely it cannot be negative) while we originally assumed ${\theta}_{i}$ to be generated from a Poisson model. It is superfluous to remark that there is no inconsistency in restricting the parameter space of a Normal distribution mean ( ${\theta}_{i}$ ) to the sole integers. Moreover, there is no need of discretizing ${\stackrel{^}{\theta}}_{i}$ : ${\theta}_{i}$ naturally arises as an integer for being generated as a Poisson variate. (Incidentally, Bugs software allows specifying a Poisson prior for any continuous quantity.) We apologize to those readers not in need of clarification for such a “byzantine” digression. Even more, if one considers that the core feature for which we turned to a Poisson model was its variance-pro- portional-to-mean property; though nonnegativity and discreteness of its sampling space are undoubted advantages, they are not so urgent as to require the replacement of the standardly assumed Normal model.

Finally, the two last models, PlN (13 - 15) and GPlN (16 - 18), are characterized by non-normal first stage specifications. The characteristic of interest is a count, thus a canonical Poisson model is set from the very first stage. The PlN specification is a standard generalized linear mixed model for count variates― the ${\stackrel{^}{\theta}}_{i}$ s―here written in a form suitable to SAE problems. In particular, the log-Normal stage (14) depends on two sources of random variability: the sampling error, ${e}_{i}$ , and the random effect, ${\nu}_{i}$ . Again, the sampling error variance is set according to the Taylor approximation defined above, i.e. ${\stackrel{\u02dc}{\sigma}}_{i}={\theta}_{i}^{-2}{\sigma}_{i}$ . In order to remedy possible failures implied by the Taylor approximation, GPlN model sets a Gamma distribution for inflating the Poisson variance to the extent of the sampling variability, i.e. ${\mu}_{i}={\theta}_{i}{\gamma}_{i}$ where ${\gamma}_{i}~\Gamma \left({a}_{i}\mathrm{,}{a}_{i}/{\theta}_{i}\right)$ is conve-

niently specified so that both design unbiasedness $\left(E\left({\stackrel{^}{\theta}}_{i}\mathrm{|}{\theta}_{i}\right)={\theta}_{i}\right)$ and design

variance imputation $\left(var\left({\stackrel{^}{\theta}}_{i}\mathrm{|}{\theta}_{i}\right)={v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)\right)$ hold. An ordinary log-Normal model follows at the linking stage (18).

3. A Simulation Study

3.1. Simulation Plan and Performance Measures

To compare the performance of the six HB models, we carried out a simulation study based on reproducing a simplified LFS in the “world” of 1991 Veneto census data. That is, we generate each LLM population with employment and unemployment rates (% over population ${N}_{i}$ ) fixed at ${p}_{i}^{e}$ and ${p}_{i}^{u}$ values as derived from census data. Hence, a sample of size ${n}_{i}={r}_{i}{N}_{i}/1000$ , with sampling fraction ${r}_{i}$ (over population) fixed at 1999 LFS value, was repeatedly selected by simple random sampling without replacement (SRSWOR) from each LLM. (Table 7 in Section 5 presents census, sampling design as well as one sample of simulated LFS data that were used in our study.)

To strengthen the results of our study, a further series of simulations was carried out by generating over the small areas set considered above three synthetic populations having, in particular, a positively asymmetric (Table 8), an essentially symmetric (Table 9) and a negatively asymmetric (Table 10) distribution of numbers of unemployed (in all cases keeping fixed the mean to the historical value of 3.35% of the LLM unemployment rates). Moreover, as regards the LFS simulation, non-sampled areas were randomly selected (keeping fixed the number to 14 over the total 51 small areas) and sample sizes were varied from ${n}_{i}=100$ to 300 up to 500, $\forall i$ (i.e. balanced designs), yet with sampling fraction ${r}_{i}$ kept fixed to the realized historical value of 4/5‰.

From each simulated survey sample, the following estimators were calculated: (a) (poststratified) direct estimator
${\stackrel{^}{\theta}}_{i}={N}_{i}{\stackrel{^}{p}}_{i}$ , with
${\stackrel{^}{p}}_{i}={y}_{i}/{n}_{i}$ , y_{i} being the observed number of employed/unemployed over
${n}_{i}$ sampled units; (b) synthetic estimator
${}_{S}{\stackrel{^}{\theta}}_{i}={N}_{i}\stackrel{^}{p}$ , where
$\stackrel{^}{p}=y/n$ ,
$y={\displaystyle {\sum}_{i}{y}_{i}}$ and
$n={\displaystyle {\sum}_{i}{n}_{i}}$ ; (c) coeffi-

cient of variation (cv) of direct estimator was estimated by $\stackrel{^}{cv}{\text{\hspace{0.05em}}}_{i}\text{\hspace{0.17em}}=\sqrt{{\stackrel{^}{v}}_{p}\left({\stackrel{^}{\theta}}_{i}\right)/{\stackrel{^}{\theta}}_{i}}$

with ${\stackrel{^}{v}}_{p}\left({\stackrel{^}{\theta}}_{i}\right)\mathrm{=}{N}_{i}\left({N}_{i}-{n}_{i}\right){\stackrel{^}{p}}_{i}\left(1-{\stackrel{^}{p}}_{i}\right)/\left({n}_{i}-1\right)$ being the sampling design variance estimated according to a SRSWOR scheme; lastly; (d) $cv$ of synthetic

estimator was estimated by ${\text{\hspace{0.05em}}}_{S}\stackrel{^}{cv}{\text{\hspace{0.05em}}}_{i}\text{\hspace{0.17em}}=\sqrt{\stackrel{^}{mse}\left({}_{S}{\stackrel{^}{\theta}}_{i}\right)/{}_{S}{\stackrel{^}{\theta}}_{i}}\text{\hspace{0.05em}}$ with mean squared error

of synthetic estimator, $mse\left({}_{S}{\stackrel{^}{\theta}}_{i}\right)$ , computed by the method of Marker [1] . (In Table 7, the suffix $e/u$ denotes whether the quantity is related to employment or unemployment.)

Variances
${\sigma}_{i}$ of sampling models (1)―when assumed as known―are set to
${\stackrel{^}{\sigma}}_{i}={\stackrel{^}{v}}_{p}\left({\stackrel{^}{\theta}}_{i}\right)$ for sampled areas with non-null direct estimates
${\stackrel{^}{\theta}}_{i}$ . Whilst, for those areas having y_{i} = 0 with n_{i} > 0, we impute the synthetic proportion of employed/unemployed, that is we replace
${\stackrel{^}{p}}_{i}$ with
$\stackrel{^}{p}$ , in the formula for
${\stackrel{^}{v}}_{p}\left({\stackrel{^}{\theta}}_{i}\right)$ . Lastly, for those areas having n_{i} = 0, missing data
${\stackrel{^}{\theta}}_{i}$ are considered as latent variables according to a Bayesian approach. Various alternatives of value-impu- tation for the associated
${\sigma}_{i}$ ’s are then sensible: e.g. in the trials where synthetic estimates are given as initial values to MCMC chains of latent (missing)
${\stackrel{^}{\theta}}_{i}$ , the associated
${\sigma}_{i}$ ’s were consistently fixed at Marker’s estimate of
$mse\left({}_{S}{\stackrel{^}{\theta}}_{i}\right)$ .

To compare the performance of ${\stackrel{^}{\theta}}_{i}^{HB}$ estimators, based on the formerly introduced HB models, we compute a series of measures out of $R=100$ simulated samples. (We came to this number after found out that results kept fairly stable even stopping at 50 replications.) Let $\stackrel{^}{\theta}=\left({\stackrel{^}{\theta}}_{1}\mathrm{,}{\stackrel{^}{\theta}}_{2}\mathrm{,}\cdots \right)$ be the vector of the overall direct estimates, then ${\stackrel{^}{\theta}}_{i}^{HB}$ is defined as the posterior mean $E\left({\theta}_{i}\mathrm{|}\stackrel{^}{\theta}\right)$ of small area parameter ${\theta}_{i}$ in the considered HB model. For both FH and PlN models, where ${\theta}_{i}$ is modelled on the log-scale, a definition of ${\stackrel{^}{\theta}}_{i}^{HB}$ has to be properly settled. The one (ones) that we adopt are described in Section 3.2. Computations have been made by means of arm package which allows running Bugs, the best-known Bayesian inference software, from within the general statistical package R. Same standard (proper) non-informative hyperpriors are chosen for every model under comparison.

The quantities calculated, for each area i, are:

relative bias (rb) and the absolute relative bias (arb),

$r{b}_{i}=\frac{1}{R}{\displaystyle \underset{r=1}{\overset{R}{\sum}}}\left({\stackrel{^}{\theta}}_{ir}^{HB}/tru{e}_{i}-1\right),\text{\hspace{1em}}ar{b}_{i}=\frac{1}{R}\left|{\displaystyle \underset{r=1}{\overset{R}{\sum}}}\left({\stackrel{^}{\theta}}_{ir}^{HB}/tru{e}_{i}-1\right)\right|,$

both measuring the bias of the estimator ( ${\stackrel{^}{\theta}}_{ir}^{HB}$ denotes the value of the considered estimator of ${\theta}_{i}$ for the rth simulation, $tru{e}_{i}$ is the census number of employed/unemployed);

absolute relative error (are) and the relative root mean squared error (rE)

$ar{e}_{i}=\frac{1}{R}{\displaystyle \underset{r=1}{\overset{R}{\sum}}}\left|{\stackrel{^}{\theta}}_{ir}^{HB}/tru{e}_{i}-1\right|,\text{\hspace{1em}}r{E}_{i}=\sqrt{\stackrel{^}{mse}\left({\stackrel{^}{\theta}}_{i}^{HB}\right)}/tru{e}_{i}$

with $\text{\hspace{0.05em}}\stackrel{^}{mse}\left({\stackrel{^}{\theta}}_{i}^{HB}\right)={\displaystyle {\sum}_{r=1}^{R}{\left({\stackrel{^}{\theta}}_{ir}^{HB}-tru{e}_{i}\right)}^{2}/R\text{\hspace{0.05em}}}$ , both relating to estimator accuracy;

efficiency (eff)

$ef{f}_{i}=\frac{1}{R}{\displaystyle \underset{r=1}{\overset{R}{\sum}}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}\stackrel{^}{cv}\left({\stackrel{^}{\theta}}_{ir}^{HB}\right)$

where expressions for $\stackrel{^}{cv}\left({\stackrel{^}{\theta}}_{i}\right)=\stackrel{^}{cv}{\text{\hspace{0.05em}}}_{i}$ and $\stackrel{^}{cv}\left({}_{S}{\stackrel{^}{\theta}}_{i}\right)=\text{\hspace{0.17em}}{}_{S}\text{\hspace{0.05em}}\stackrel{^}{cv}{\text{\hspace{0.05em}}}_{i}$ have already been

given whereas $\stackrel{^}{cv}\left({\stackrel{^}{\theta}}_{i}^{HB}\right)$ is measured by $\text{\hspace{0.05em}}\stackrel{^}{sd}\left({\theta}_{i}|\stackrel{^}{\theta}\right)/{\stackrel{^}{\theta}}_{i}^{HB}$ with $sd\left({\theta}_{i}|\stackrel{^}{\theta}\right)$ de-

noting the posterior standard deviation of ${\theta}_{i}$ ; lastly

reliability (rel)

$re{l}_{i}=\sqrt{\frac{1}{R}{\displaystyle \underset{r=1}{\overset{R}{\sum}}}{\stackrel{^}{cv}}^{2}\left({\stackrel{^}{\theta}}_{ir}^{HB}\right)}/r{E}_{i}$

which is intended to measure how much reliable is a standardly used indicator of estimator efficiency (cv) when related to a comparable measure (rE) yet based on known true values.

The degree of cv reduction with respect to direct estimators is largely used for selecting the best (model- or design-based) estimator. Nevertheless, it consists of one aspect (essentially depending on the shrinkage degree of direct estimates to the mean level ${}_{S}{\stackrel{^}{\theta}}_{i}+\alpha +\beta {x}_{i}$ or ${}_{S}{\stackrel{^}{\theta}}_{i}exp\left(\alpha +\beta {x}_{i}\right)$ ) which is not necessarily the preferential one; furtherly, estimation goal is not unique (is triple according to [1] and [17] ). Regardless, we need to know to what extent such an indicator is reliable for measuring the degree of uncertainty about the provided estimates.

Comparison among different HB models is finally completed by looking at some standard criteria for model selection in a Bayesian framework, namely we consider: 1) a likelihood based criterion, the DIC and 2) a predictive distribution-based method, the PPp. The DIC is based upon posterior expectation of the deviance which is defined as $D\left(\theta \right)=-2\mathrm{log}L\left(\theta ;\stackrel{^}{\theta}\right)+2\mathrm{log}f\left(\stackrel{^}{\theta}\right)$ for a chosen likelihood $L\left(\theta ;\stackrel{^}{\theta}\right)$ and for some standardizing function $f$ , where $\stackrel{^}{\theta}$ serve as data in small area context. In Table 4 and similar ones, $\stackrel{\xaf}{D}=E\left(D\left(\theta \right)|\stackrel{^}{\theta}\right)$ , $D\left(\stackrel{\xaf}{\theta}\right)=D\left(E\left(\theta |\stackrel{^}{\theta}\right)\right)$ , ${p}_{D}=\stackrel{\xaf}{D}-D\left(\stackrel{\xaf}{\theta}\right)$ , $\text{DIC}=\stackrel{\xaf}{D}+{p}_{D}$ , according to the definition given by [18] in their article first proposing this criterion (and the notation therein). PPp is defined, for normal sampling stage models (NN, FH, YR, NPlN), as

$\text{PP}p=P\left\{{\displaystyle \underset{i}{\sum}}{\left({y}_{i\mathrm{,}rep}-{\theta}_{i}\right)}^{2}/{\sigma}_{i}>{\displaystyle \underset{i}{\sum}}{\left({\stackrel{^}{\theta}}_{i}-{\theta}_{i}\right)}^{2}/{\sigma}_{i}\mathrm{|}\stackrel{^}{\theta}\right\}$ (19)

whereas, for the remaining ones (PlN, GPlN), as

$\text{PP}p=P\left\{{\displaystyle \underset{i}{\sum}}{\left({y}_{i\mathrm{,}rep}-{\mu}_{i}\right)}^{2}/{\mu}_{i}>{\displaystyle \underset{i}{\sum}}{\left({\stackrel{^}{\theta}}_{i}-{\mu}_{i}\right)}^{2}/{\mu}_{i}\mathrm{|}\stackrel{^}{\theta}\right\},$ (20)

with ${y}_{i\mathrm{,}rep}$ indicating hypothetical replicated data under the assumed model and where the reference distribution is derived from the posterior distribution of $\left({y}^{rep}\mathrm{,}\theta \right)$ or, in the last case, $\left({y}^{rep}\mathrm{,}\mu \right)$ . According to these criteria, the smaller the DIC or $\left|\text{PP}p-0.5\right|$ the better the model.

3.2. First Findings

Table 1 and Table 2 display averages, over the I small areas, of rb, arb, are, rE, eff and rel (%), for design-based as well as HB model-based estimators, from simulated data (based on the real population) on employment and unemployment respectively. The “average” is measured in terms of the mean and the median (first and second columns respectively for each measure). In Table 2 and related ones which follow, averages were computed by excluding non-sampled areas as well (rows named as non-na).

In Table 1 and similar ones, three ways have been adopted to transform ${\eta}_{i}=log\left({\theta}_{i}\right)$ in the FH model back to the original scale. The “classical way” (FH) provides the required estimator by exponentiating ${\stackrel{^}{\eta}}_{i}^{HB}=E\left({\eta}_{i}\mathrm{|}\stackrel{^}{\theta}\right)$ i.e. as

Table 1. Comparison between design-based and HB model-based estimators (simulated data on employment).

Table 2. Comparison between design-based and HB model-based estimators (simulated data on unemployment).

${\stackrel{^}{\theta}}_{i}^{HB}=exp\left({\stackrel{^}{\eta}}_{i}^{HB}\right)$ ; the derived cv is necessarily given by multiplying $cv\left({\stackrel{^}{\eta}}_{i}^{HB}\right)=sd\left({\eta}_{i}\mathrm{|}\stackrel{^}{\theta}\right)/{\stackrel{^}{\eta}}_{i}^{HB}$ by ${\stackrel{^}{\theta}}_{i}^{HB}$ , which, though, shows to be clearly inadequate as a measure of efficiency (in Table 2, $\stackrel{\xaf}{rel}=re{l}^{M}=9\%$ , in Table 1, $\stackrel{\xaf}{rel}=12\%$ , $\text{\hspace{0.05em}}re{l}^{M}=10\%$ , and the size of mse coverage keeps being around 10% also in the subsequent unemployment simulation experiments). The “Bayesian way” (referred to as FH(2)) provides ${\stackrel{^}{\theta}}_{i}^{HB}$ as posterior expectation of the back-transformed parameter ${\theta}_{i}=exp\left({\eta}_{i}\right)$ . Lastly, FH(3) estimator stems from using the classical formulas for deriving expectation and standard deviation of a lognormal variable i.e. by computing ${\stackrel{^}{\theta}}_{i}^{HB}=\mathrm{exp}\left({\stackrel{^}{\eta}}_{i}^{HB}+var\left({\eta}_{i}|\stackrel{^}{\theta}\right)/2\right)$ and $var\left({\theta}_{i}\mathrm{|}\stackrel{^}{\theta}\right)=exp\left(2\left({\stackrel{^}{\eta}}_{i}^{HB}+var\left({\eta}_{i}\mathrm{|}\stackrel{^}{\theta}\right)\right)\right)-exp\left(2{\stackrel{^}{\eta}}_{i}^{HB}+var\left({\eta}_{i}\mathrm{|}\stackrel{^}{\theta}\right)\right)$ . (In each case, ordinary $cv\left({\stackrel{^}{\theta}}_{i}^{HB}\right)$ formula is then used for $cv$ computation.) PlN ${\stackrel{^}{\theta}}_{i}^{HB}$ is obtained solely by the Bayesian way.

It is worth noting that any discrepancy in percent numbers of Table 1 can be appreciated just at one decimal digit. In fact, employment rates range from 34.7% to 44.5% (Table 7) whence
$c{v}_{i}$ ’s range from 13.8/4.3% to 11.2/3.5% with n_{i} = 100/1000 (Figure 1). Then, to assess model performance we will focus especially

Figure 1. Percentual cv’s with respect to proportion p and varying n.

on simulation experiments on unemployment whose rates range from 2.1% to 8.4% whence
$c{v}_{i}$ ’s are much more relevant (from 68.6/21.6% to 33.2/10.4% with n_{i} = 100/1000).

Outcomes pointed out in Table 2 will be basically repeated in subsequent simulation experiments, thus revealing somehow a typical performance for each model at comparison. Ranking of model-based estimators is to be interpreted as follows: best performance values are boxed while the others are gray/green-co- lored with intensity growing with worsening performance (the range of expectation/median values, for each measure, has been divided into six equal intervals―being seven the models at comparison; in rel columns, only values less than 100% have been shadowed). FH row has been excluded from shadowing since FH estimator is clearly unusable (as will be clarified below). With regard to columns related to efficiency, it is of interest only detecting whether efficiency measures (as they are ordinarily computed) are instead a “lark-mirror”. Thus, low eff values (i.e. low cv whence high efficiency) coupled with low rel values (i.e. low reliability of cv in measuring actual estimator accuracy) are boxed and gray- filled.

Direct estimates are expected to be, on average, unbiased, increasingly accurate with growing sample size, as well as their $\stackrel{^}{cv}$ (since it is analitically derived from a formula consistent with the sampling design) ought to be a reliable measure of accuracy. Table 1 and Table 2 (as well as 8 - 10) prove these expectations, though showing a tendency of such $\stackrel{^}{cv}$ ’s to overestimate direct estimates’ uncertainty. On the contrary, synthetic estimates are typically biased and bias (and/or accuracy, their variance being irrelevant) gets worse for data tending to negative asymmetry (see the worsening of bias across Tables 8-10). Moreover, it turns out that mse of synthetic estimates is badly estimated by Marker’s method which tends to heavily overestimate it.

NN gets the worst scores in terms of (absolute) bias and accuracy (Table 2). Looking at single small areas situation, ${\stackrel{^}{\theta}}_{i}^{HB}$ s suffer from a terrific underestimation especially for the least populated/sampled areas (see the reduced bias for non-na areas and Figure 2, second column of panels). Moreover, the estimated cv is, in median terms, scarcely reliable (Figure 2, fifth row) except for sampled areas with little information (it is above 100 just for non-na areas with the lowest

Figure 2. Design-based and HB model-based estimators at comparison (simulated data on unemployment, Table 2). Areas are ordered by increasing sample size; non sampled areas are placed at the extreme left.

sample size). The adoption of a linear form for the linking model is likely inadequate, but, in this study, where the regression model is merely a null model (with the synthetic estimate offset consisting in the sole input of auxiliary information), we may try to handle its elementary components in an effective way before definitely dropping the linear link. In Section 3.3 we will see what are the refinements suited to improve NN performance.

FH and PlN exhibit a similar performance, though defects are exacerbated for FH. They tend to overestimate the ${\theta}_{i}$ ’s (positive bias, Table 2; see also Figure 2, second row, third and fourth panels) and their accuracy is low in terms of expectation (high $\stackrel{\xaf}{are}$ and $\stackrel{\xaf}{rE}$ ) again depending on the biased estimation for some areas (Figure 2, third and fouth rows), though improve in accuracy with respect to NN and both the design-based estimators. Estimated cv is scarcely reliable for both of them, yet more markedly for FH and for non-sampled areas. This is likely due to a noticeable shrinkage of the estimates toward the mean (yet a biased mean), as it results also from a low DIC (in particular, a small effective number of parameters, $pD$ ; see Table 3, Table 4 which will be discussed later).

Right now, it is worth comparing the three aforementioned FH estimators: FH(2) and FH(3) are practically indistinguishable; FH is less biased than FH(2) (on the other hand, this is well expected since $E\left\{\mathrm{exp}\left({\eta}_{i}\right)\right\}\ge \mathrm{exp}\left\{E\left({\eta}_{i}\right)\right\}$ by Jensen inequality) yet its cv is totally unreliable. Therefore, without loss of information, only FH(2) and FH(3) will be reported from now on. Since now, a minor difference is detectable between these two just as to reliability (this keeps through all the simulation experiments).

The unmatched models, YR and NPlN, and the non-normal sampling model GPlN show to have the best performance in terms of (absolute) bias, accuracy and reliability (see the highlighted values and regions of light or no shading in Table 2). The unmatched models show to be the most accurate while GPlN seems to be the most “well-centered” and to give the most “well-calibrated” efficiency measure (the last two columns of Figure 2 seem almost indistinguishable; at a careful reading, difference between the unmatched models and GPlN relates

Table 3. Model determination diagnostics (simulated data on employment).

^{a}PPp = 0.43 results by imputing
${\sigma}_{i}$ fixed at
${\stackrel{^}{\sigma}}_{i}={\left(\stackrel{^}{c}{v}_{i}{\stackrel{^}{\theta}}_{i}\right)}^{2}$ in (19); otherwise, PPp = 0.47 by using
${\stackrel{^}{\sigma}}_{i}={\left(\stackrel{^}{c}{v}_{i}{\theta}_{i}\right)}^{2}$ as it is generated by the model.

Table 4. Model determination diagnostics (simulated data on unemployment).

^{a}PPp = 0.40 by using
${\stackrel{^}{\sigma}}_{i}={\left(\stackrel{^}{c}{v}_{i}{\theta}_{i}\right)}^{2}$ in (19) (see Table 3 for explanation).

to non-na areas with lowest sample size: GPlN is somewhat less accurate though the estimated cv associated to such areas are fully reliable). However, a tendency to underestimation (negative bias) is a non negligible weak point: as we will see in the next section, such a deficiency can be promptly remedied by a (obvious) model refinement.

Finally, a poor estimation for non sampled areas (placed at the far left of x-axis in Figure 2) is detectable across all models. A reason for it is definitely the lack of direct information. Thereby, any improvement of model-based estimators over the synthetic ones crucially depends on the model ability of borrowing strength from all of the available information (thus producing a consistent estimate of the functional form $\alpha +\beta {x}_{i}$ which solely constitutes non sampled areas estimates). In the experiments under study which consider a null model (no ${x}_{i}$ ) the borrowing is clearly not sufficient.

Table 3 and Table 4 show $\stackrel{\xaf}{D}$ , $D\left(\stackrel{^}{\theta}\right)$ , ${p}_{D}$ , DIC and PPp values averaged over the R replications, for employment and unemployment simulation data respectively. We are essentially interested to detect whether commonly used model selection and validation criteria serve their purpose.

We start with the predictive criterion, the PPp, since it seems to pass the examination. We know “lights and shades” of the models under comparison from the foregoing external validation (i.e. by knowing the true values of the para- meter of interest): PPp effectively chooses the unmatched models and GPlN while rejects NN and FH. (PlN is clearly picked out only for the employment application, yet we stress that focus is on the unemployment study where models’ performance differs significantly.) However, PPp is based on a particular measure of discrepancy (the “event” under probability in (19)/(20)), which carries out a model validation relatively to a single aspect of (global) model performance. In particular, it answers the question whether the sampling stage of the examined model might be an adequate mechanism for generating our data. It is immediately evident that either NN or FH sampling model are quite unsuited (the associated PPp’s are far below 0.50), and this supports the thesis against either linear models for non-normal variables (NN) or questionable sampling models (the dubious hypotheses under the first stage of FH specification). Nevertheless, this particular PPp measure cannot inform us on many other possible model failures [19] [20] [21] .

On the other side, comparison between likelihoods is even more delicate [22] , especially with hierarchical models which allow the possibility of marginalization across levels in different ways. FH model that is the second-worst model in terms of performance (Table 1, Table 2) should be chosen according to the principle “lower the DIC better the model”. Is the DIC misleading?

First, the DIC calculated from (NN, FH, YR, NPlN) group cannot be compared to the one from the (PlN, GPlN) pair: the deviance in models with normal sampling stage is focused on θ’s whereas is on μ’s for poisson sampling stage specifications. In fact, the DIC has been calculated―as routinely is―relatively to the first stage unobservable variables (or parameters, in classical terminology) of the HB models. Instead, comparison would be made more consistent by marginalizing the likelihood of all the candidate HB models to the same parameters of interest (in SAE studies, the θ’s or small area quantities for they are the primary object). Even doing so, care must be taken in making comparisons of DIC’s, since, for instance, normalizing constants vary with the parametric family of distribution (thus all constants must be retained when making comparisons), or, the support for a model can vary with the possible (marginal) likelihoods [23] .

Second, the comparison of FH versus all the remaining normal sampling stage models is questionable as well, since data are the log-transformed direct estimates for FH while are the untransformed direct estimates for the other specifications. As we have already noted in Section 2.2, all the null direct estimates are excluded from the usable data for FH, hence the posterior expected deviance, $\stackrel{\xaf}{D}$ , is accordingly decreased [18] .

However, despite the critical comments above, can we draw any information from the DIC output for our study? Posterior expected deviance
$\stackrel{\xaf}{D}\left(\theta \right)$ has been standardized by the maximized log-likelihood, then, if the model is “true”, its expectation is approximately the number of the free parameters in θ, that is the total size of unit sample: in our case, it is the number of sampled areas (51 − 14 = 37, see Table 7). Thereby, computation of the standardized posterior deviance might be appropriate for checking the overall goodness-of-fit of the model. With this respect, NN clearly does not fit the data, FH and PlN show to be somewhat inconsistent with the data, whereas GPlN, NPlN and YR show, in order, to have the best fit. Such conclusion matches the one derived on the basis of PPp values. As for p_{D}/DIC, they tell of the model complexity. In our study (where a null model is considered) the number of effective parameters actually lacks in interest (neither mentioning Poisson sampling stage models for which
${p}_{D}$ is practically pointless).

A further discussion on model selection and checking is beyond the scope of the paper. Yet, the problem of small area model diagnostics constitutes an important, and still largely unexplored, direction of SAE research.

3.3. Some Refinements

In this section we consider some refinements of the HB models previously introduced, which involve relevant improvements of their performance. The major development consists in letting sampling variances ( ${\sigma}_{i}$ or ${\stackrel{\u02dc}{\sigma}}_{i}$ when a logari- thmic transformation occurs or ${a}_{i}$ in GPlN) be stochastic, whereas, so far, they have been assumed as known and fixed to off-set estimates of sampling design variance ${v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)$ .

This extension―that we refer to as model-based sampling variance function ―arises naturally from a model-based approach to SAE problems. If we assume that a certain model is valid for a certain phenomenon θ, then all the unknown quantities that depend on θ should be made model-generated instead of being imputed as fixed to the model. In our context, the design variance is assumed to be a function of the unknown quantity of interest, i.e. ${v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)=f\left({\theta}_{i}\right)$ , with f known from sampling design characteristics. Then, according to the outlined strategy, sampling variances will be obtained as model-based estimates, $f\left({\theta}_{i}\right)$ , through ${\theta}_{i}$ .

By the way, this general consideration allows us to explain a specific fault occurring with the customary FH model. Recall that for a FH model with link function g, it is assumed that (according to the Taylor approximation) ${\stackrel{\u02dc}{\sigma}}_{i}={\left\{{g}^{\prime}\left({\theta}_{i}\right)\right\}}^{2}{\sigma}_{i}\text{\hspace{0.05em}}$ , and, in the standard version of the model, sampling error variance ${\sigma}_{i}$ is fixed at the design variance ${v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)$ value. In particular, if g is the $\mathrm{log}$ function then ${\stackrel{\u02dc}{\sigma}}_{i}={\theta}_{i}^{-2}{\sigma}_{i}=c{v}_{i}^{2}$ , and $c{v}_{i}=\sqrt{{\sigma}_{i}/{\theta}_{i}}$ is likewise fixed at $c{v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)={v}_{p}^{1/2}\left({\stackrel{^}{\theta}}_{i}\right)/{\theta}_{i}$ . But, the design quantity to be imputed (whatever is:

${v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)$ , $c{v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)$ , etc.) is usually function of the unknown ${\theta}_{i}$ , e.g. $c{v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)\approx $

$\sqrt{\left(1-{p}_{i}\right)/\left({p}_{i}{n}_{i}\right)}=\sqrt{\left({N}_{i}-{\theta}_{i}\right)/\left({\theta}_{i}{n}_{i}\right)}$ under the SRSWR hypothesis as for our simulation. Common practice essentially consists in replacing ${\stackrel{^}{\theta}}_{i}$ to ${\theta}_{i}$ in the design quantity function, e.g., in our example, $\stackrel{^}{c}{v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)\approx \sqrt{\left(1-{\stackrel{^}{p}}_{i}\right)/\left({\stackrel{^}{p}}_{i}\left({n}_{i}-1\right)\right)}$ with ${\stackrel{^}{p}}_{i}={\stackrel{^}{\theta}}_{i}/{N}_{i}$ .

It is now easy to detect why the standard FH estimator ${\stackrel{^}{\theta}}_{i}^{HB}$ is so much positively biased. If, indeed, ${\theta}_{i}<{\stackrel{^}{\theta}}_{i}$ then $\stackrel{^}{c}{v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)$ is smaller than $c{v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)$ (recall

Figure 1), hence model estimator ${\stackrel{^}{\theta}}_{i}^{HB}$ ( $<{\stackrel{^}{\theta}}_{i}$ if the model is well posited), in

order to adequately fit ${\stackrel{^}{\theta}}_{i}$ , is pushed upward i.e. tends to overestimate ${\theta}_{i}$ ( ${\stackrel{^}{\theta}}_{i}^{HB}>{\theta}_{i}$ ). Vice versa, if ${\theta}_{i}>{\stackrel{^}{\theta}}_{i}$ , the greater $\stackrel{^}{c}{v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)$ (than $c{v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)$ ) com-

bined with ${\stackrel{^}{\theta}}_{i}^{HB}>{\stackrel{^}{\theta}}_{i}$ (if the model is well posited) does not pull ${\stackrel{^}{\theta}}_{i}^{HB}$ down

(rather, ${\stackrel{^}{\theta}}_{i}^{HB}$ is let to rise, in any case does not underestimate ${\theta}_{i}$ ). To prove such an insight we include in the following tables the results obtained from fitting a standard FH model yet with $c{v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)$ fixed to its true value (rows named as ‘true cv’).

According to the foregoing comments, in all of the HB models so far considered, we “plug-in” the unobservable ${\theta}_{i}$ into the sampling design variance function ${v}_{p}\left({\stackrel{^}{\theta}}_{i}\right)=f\left({\theta}_{i}\right)$ by which sampling variance ( ${\sigma}_{i}$ or ${\stackrel{\u02dc}{\sigma}}_{i}$ or ${a}_{i}$ ) is modelled. Incidentally, we note that the implementation of such a strategy is relatively straightforward within the Bayesian approach (not as such in the frequentist one).

Before going through simulation results, we also mention some ameliorations to NN model. The linearity assumed for the linking model is likely inadequate. Yet, since here the linear predictor (
${\theta}_{i}={}_{S}\stackrel{^}{\theta}{}_{i}+\alpha +{\nu}_{i}$ ) is quite nave (a null one), we may try to adjust it somehow before dropping the linear link definitely. In fact, the bad performance is probably due to mis-calibrated random effects
${\nu}_{i}$ ’s: in log-normal linking models,
${\theta}_{i}$ depends in a multiplicative way on
${\nu}_{i}$ (
${\theta}_{i}={}_{S}\stackrel{^}{\theta}{}_{i}{\text{e}}^{\alpha +\beta {x}_{i}+{\nu}_{i}}$ ), whence
${\nu}_{i}$ effect is weighted by
${}_{S}{\stackrel{^}{\theta}}_{i}$ (i.e. by population N_{i}). Thereby, a simple remedy to NN model deficiency might consist in weighting
${\nu}_{i}$ by population N_{i}. By doing that, indeed, a surprising improvement of model per- formance is immediately obtained (see row NN (
${\nu}_{i}\cdot {N}_{i}$ ) in Table 5 and Table 6).

Model performance greatly improves for all of the models thanks to the aforementioned stochastic extension. Table 5 and Tables 8-10 show how all of

Table 5. Comparison between design-based and HB model-based estimators (simulated data on unemployment).

Table 6. Model selection diagnostics (simulated data on unemployment).

^{a}PPp = 0.44 by using
${\stackrel{^}{\sigma}}_{i}={\left(\stackrel{^}{c}{v}_{i}{\theta}_{i}\right)}^{2}$ in (19) (see Table 3). ^{b}PPp = 0.26/0.36 results from including/excluding {
${\stackrel{^}{\theta}}_{i}=0$ with
${n}_{i}>0$ } cases from the computation.

them―denoted by $sv$ i.e. stochastic variance―reach comparable good scores (while maintaining performance characteristics detected in Section 3.2).

Results shown in Table 5 and Table 8 were obtained using, respectively, the simulated data from census and from a synthetic positively skewed population (thus maintaining the same type of asymmetry of the original unemployment dataset; see details in Section 3.1), hence both comparable to results in Table 2. The most noticeable change in Table 5 concerns bias. FH sv estimator is now almost unbiased, moreover, YR, NPlN and GPlN are no more negatively biased: a stochastic variance helps model centering (the explanation for that has been given above). Such an improvement affects the rest of performance measures making all the sv models almost competitive.

In Table 6 results on $\stackrel{\xaf}{D}$ , DIC and PPp show to be concordant in choosing the YR, NPlN and GPlN as the best tern of models. Further, note that the DIC is relatively larger for FH-true cv and sv models, thus the partial explanation given in Section 3.2 for the tendency of customary FH models to have lower DIC’s (the exclusion of ${\stackrel{^}{\theta}}_{i}=0$ values from the DIC computation) turns out not to be a sufficient reason by alone. Another reason why the DIC tends to be lower for fixed variance FH model is probably the over-shrinkage (thus a lower ${p}_{D}$ ) of ${\stackrel{^}{\theta}}_{i}^{HB}$ estimators. Finally, note that PPp gets farer from the ideal 0.5 value both for FH and NN sv specifications. Although we might have thought to remedy the marked deficiencies of these two models by merely making the sampling variance stochastic, other faults show to be still alive (at least, sampling model is not fully adequate to reproduce survey data). One more time, the YR, NPlN and GPlN tern proves to be the most natural response to model-based SAE of count variables.

Two further synthetic populations were considered: one with an essentially symmetric distribution (Table 9) and the other with negative skewness (Table 10), in order to explore (the expected worsening of) performance (with negative asymmetry) of standard FH models, and, at the same time, detect the parallel reaction of the models that have so far shown to perform better. We show the results only for canonical and extended FH, YR, NPlN models as well as for GPlN, this last just in the sv version (being the most naturally conceivable for it), to enlighten the difference in performance of the first one (that worsens with negative asymmetry, mostly in terms of expectation) compared to the competitive outcomes of the stochastic variance models. (Results on “true cv” FH have also been reported and written in italic to stress they are only theoretically conceivable.) Commenting on Tables from 8 onward would give rise to observations analogous to the ones already given above. We merely note that the advantage in using stochastic variance models diminishes with increasing sample size (letting fixed variance models regain ground in the rankings), up to the situation where direct estimators result better than model-based ones (see are and rE in the bottom panel of Table 9 and middle panel of Table 10).

4. Concluding Remarks

The benefits of using HB models for SAE problems have been largely recognized. They include the availability of a wider set of tools to handle complex and more realistic models and to get reliable measures of variability. When estimating counts it might not be clear which specification is more appropriate but relevant quantities should be properly modeled. Object of the study was to show that different model specifications are possible from the ones customarily used in non- normal/non-linear cases. The purpose was definitely not to show the general superiority of one framework over the others, being the range of situations one has to face with in real analyses practically unlimited. Moreover, there are (at least) two kinds of reasons why the generalization of our results is rather restricted. First, the type of simulation study we have performed is not the one ordinarily carried out in statistical studies of model comparison. In fact, we have not simulated from a posited model, rather we have considered design-based simulations. That is, given a real phenomenon (in our case, LLMs unemployment in a given region and year either known from census data or simulated), a sample survey has been (repeatedly) simulated according to a fixed sampling design. The idea we follow is “all models are wrong but some are useful” [24] , and, this concept goes quite naturally with the demand to a SAE analyst for providing estimates of “real world” quantities. A design-based simulation, then, meets the requirement of positing a “true” population instead of a “true” model. Unfortunately, this concept of simulation makes us totally ignorant about the possible presence of any useful/good model among the ones at comparison. It can be the case that all of them are inadequate to fit the phenomenon under study. Then, any comparison quite unlikely would lead to clear-cut results (or select the not-present “best” candidate). Second, in our application the model structure is kept as simple as possible to ease comparison. On the other hand, a null model, i.e. with only synthetic estimates entering the model and no additional auxiliary information, can be compared to traditional estimators and are free from further complications derived from specific characteristics of auxiliary information (like as variables measured with errors; type of relationship existing with small areas quantities of interest: e.g. multiplicative or additive effect? on which scale? etc.). On this regard, it is worth noting that, in presence of other sources of information to account for, comparison would have become even more complicated if the data are not simulated on the ground of a posited model, but are considered as real world observations and so generated by a “black box” mechanism.

However, the results from simulation study show persistent model failures for some standard Fay-Herriot specifications and for generalized linear Poisson models with (log-)normal sampling stage in SAE problems with count data, and how even minor model modifications can noticeably improve on their performance. In particular, we advocate the extension of model specification from assuming sampling variances fixed to some off-set estimates (typically design sampling variance estimates) to letting them be (stochastically) generated by the model (at least in their components depending from latent variables estimated within the model itself). On the other hand, unmatched and non-normal sampling stage models show definitely a better performance in terms of bias, accuracy and reliability even in the fixed sampling variance version. Notice that the extension to stochastic sampling variances is straightforward and relatively easy to implement within a Bayesian framework as well as the non-standard specifications are practically feasible only by means of HB models.

Moreover, the study brings out some limits and possible deceptions of commonly used criteria for model determination in the context of SAE problems, namely DIC and PPp. The version of PPp here addressed is the one commonly used in Bayesian analyses―within SAE context yet also statistic-worldwide― which tests the (global) validity of model first stage, i.e., in our case, the sampling model. Nevertheless, this particular PPp measure cannot inform us on many other possible model failures. On the other hand, the DIC is not comparable across the models, the main reason being that it is calculated―as routinely is―relatively to the first stage unobservable variables which vary in essence and number across the HB specifications.

In conclusion, future research should definitely focus on defining proper devices for model determination in the field of SAE. Besides, in order to magnify differences between models at comparison in simulation studies, model structure has to be complicated for considering more realistic situations (adding auxiliary information, taking spatial structure into account, etc.), more complex sampling designs are to be experienced, as well as different real population phenomena are to be explored (small areas at different level of territorial aggregation, small area population with different distributional characteristics, etc.).

Appendix

Some tables and figures referred to in the text are below listed.

Table 7. Census and one sample of simulated data reported for each LLM: (census data) employment and unemployment rates (% over population), ${p}_{i}^{e}$ and ${p}_{i}^{u}$ ; (sampling design data) sampling fraction (over population), ${r}_{i}$ , sample size, ${n}_{i}$ ; (simulated data) direct and synthetic estimates of employment and unemployment rates (%), ${\stackrel{^}{p}}_{i}^{e}$ , ${\stackrel{^}{p}}^{e}$ , ${\stackrel{^}{p}}_{i}^{u}$ and ${\stackrel{^}{p}}^{u}$ , as well as the associated $cv$ (%) estimates, $\stackrel{^}{c}{v}_{i}^{e}$ , ${}_{S}\stackrel{^}{c}{v}_{i}^{e}$ , $\stackrel{^}{c}{v}_{i}^{u}$ and ${}_{S}\stackrel{^}{c}{v}_{i}^{u}$ . LLMs have been ordered according to increasing population count.

Table 8. Simulated data on unemployment: $\stackrel{\xaf}{p}=3.35$ and positive asymmetry (real data); from top to bottom: n = 100/N = 25000, n = 300/N = 75000, n = 500/N = 100000 (hence r = 4, 4, 5‰).

Table 9. Simulated data on unemployment: $\stackrel{\xaf}{p}=3.35$ and symmetric data; from top to bottom: n = 100/N = 25000, n = 300/N = 75000, n = 500/N = 100000 (hence r = 4, 4, 5‰).

Table 10. Simulated data on unemployment: $\stackrel{\xaf}{p}=3.35$ and negative asymmetry; from top to bottom: n = 100/N = 25000, n = 300/N = 75000, n = 500/N = 100000 (hence r = 4, 4, 5‰).

Conflicts of Interest

The authors declare no conflicts of interest.

[1] |
Rao, J.N.K. and Molina, I. (2015) Small Area Estimation. 2nd Edition, Wiley, New York. https://doi.org/10.1002/9781118735855 |

[2] | Pfeffermann, D. (2002) Small Area Estimation—New Developments and Directions. International Statistical Review, 70, 125-143. |

[3] |
Jiang, J. and Lahiri, P. (2006) Mixed Model Prediction and Small Area Estimation. Test, 15, 1-96. https://doi.org/10.1007/BF02595419 |

[4] |
Fay, R.E. and Herriot, R.A. (1979) Estimates of Income for Small Places: An Application of James-Stein Procedures to Census Data. Journal of the American Statistical Association, 85, 398-409. https://doi.org/10.1080/01621459.1979.10482505 |

[5] |
Ghosh, M., Natarajan, K., Stroud, T.W.F. and Carlin, B.P. (1998) Generalized Linear Models for Small-Area Estimation. Journal of the American Statistical Association, 93, 273-282. https://doi.org/10.1080/01621459.1998.10474108 |

[6] | Lu, L. and Larsen, M. (2007) Small Area Estimation in a Survey of High School Students in iowa. Proceedings of the American Statistical Association Section on Survey Research Methods, 2627-2634. |

[7] |
Molina, I., Saei, A. and Lombardia, M.J. (2007) Small Area Estimates of Labour Force Participation under a Multinomial Logit Mixed Model. Journal of the Royal Statistical Society A, 170, 975-1000.
https://doi.org/10.1111/j.1467-985X.2007.00493.x |

[8] |
You, Y. and Rao, J.N.K. (2002) Small Area Estimation Using Unmatched Sampling and Linking Models. Canadian Journal of Statistics, 30, 3-15.
https://doi.org/10.2307/3315862 |

[9] | Trevisani, M. and Torelli, N. (2004) Small Area Estimation by Hierarchical Bayesian Models: Some Practical and Theoretical Issues. Atti della XLII Riunione Scientifica della Societ Italiana di Statistica, 273-276. |

[10] | Trevisani, M. and Torelli, N. (2006) Comparing Hierarchical Bayesian Models for Small Area Estimation. In: Liseo, Montanari, Torelli, Eds., Metodi statistici per l'integrazione di basi di dati da fonti diverse, Franco Angeli, Milano, 17-36. |

[11] | Trevisani, M. and Gelfand, A. (2013) Spatial Misalignment Models for Small Area Estimation: A Simulation Study. In: Advances in Theoretical and Applied Statistics, Springer-Verlag, Berlin Heidelberg, 269-279. |

[12] | Torelli, N. and Trevisani, M. (2008) Labour Force Estimates for Small Geographical Domains in Italy: Problems, Data and Models. International Review of Social Sciences, 4, 443-464. |

[13] | Liu, B., Lahiri, P. and Kalton, G. (2014) Hierarchical Bayes Modeling of Survey-Weighted Small Area Proportions. Survey Methodology, 40, 1-13. |

[14] | You, Y. and Chapman, B. (2006) Small Area Estimation Using Area Level Models and Estimated Sampling Variances. Survey Methodology, 32, 97-103. |

[15] | You, Y. (2008) An Integrated Modeling Approach to Unemployment Rate Estimation for Subprovincial Areas of Canada. Survey Methodology, 34, 19-27. |

[16] | Maples, J., Bell, W.R. and Huang, E.T. (2009) Small Area Variance Modeling with Application to County Poverty Estimates from the American Community Survey. Proceedings of the American Statistical Association Section on Survey Research Methods, 5056-5067. |

[17] |
Shen, A.C. and Louis, T.A. (1998) Triple-Goal Estimates in Two-Stage Hierarchical Models. Journal of the Royal Statistical Society Series B, 60, 455-471.
https://doi.org/10.1111/1467-9868.00135 |

[18] |
Spiegelhalter, D.K., Best, N., Carlin, B.P. and van der Linde, A. (2002) Bayesian Measures of Model Complexity and Fit (with Discussion). Journal of the Royal Statistical Society Series B, 64, 583-639. https://doi.org/10.1111/1467-9868.00353 |

[19] | Gelman, A., Meng, X.-L. and Stern, H. (1996) Posterior Predictive Assessment of Model Fitness via Realized Discrepancies. Statistica Sinica, 6, 733-807. |

[20] |
Bayarri, M.J. and Castellanos, M.E. (2007) Bayesian checking of the second levels of hierarchical models. Statistical Science, 22, 322-343.
https://doi.org/10.1214/07-STS235 |

[21] |
Dey, D.K., Gelfand, A.E., Swartz, T.B. and Vlachos, P.K. (1998) A Simulation-Intensive Approach for Checking Hierarchical Models. Test, 7, 325-346.
https://doi.org/10.1007/BF02565116 |

[22] |
Dempster, A.P. (1997) The Direct Use of Likelihood for Significance Testing. Statistics and Computing, 7, 247-252. https://doi.org/10.1023/A:1018598421607 |

[23] |
Trevisani, M. and Gelfand, A.E. (2003) Inequalities between Expected Marginal log-Likelihoods, with Implications for Likelihood-Based Model Complexity and Comparison Measures. Canadian Journal of Statistics, 31, 239-250.
https://doi.org/10.2307/3316084 |

[24] |
Box, J.E.P. (1976) Science and Statistics. Journal of the American Statistical Association, 71, 791-799. https://doi.org/10.1080/01621459.1976.10480949 |

Copyright © 2020 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.