Simulated Minimum Hellinger Distance Inference Methods for Count Data

In this paper, we consider simulated minimum Hellinger distance (SMHD) inference for count data. We consider grouped and ungrouped data and emphasize SMHD methods. The approaches extend the methods based on the deterministic version of Hellinger distance for count data. The methods are general: they only require that random samples can be drawn from the discrete parametric family, and they can be used as alternatives to estimation based on the probability generating function (pgf) or on matching moments. Although this paper focuses on count data, goodness-of-fit tests based on simulated Hellinger distance can also be applied to continuous distributions when continuous observations are grouped into intervals, as in the case of the traditional Pearson's statistics. Asymptotic properties of the SMHD methods are studied, and the methods appear to preserve the good efficiency and robustness of the deterministic version.


Introduction

New Distributions Created Using Probability Generating Functions
Nonnegative discrete parametric families of distributions are useful for modeling count data. Many of these families have neither closed form probability mass functions (pmfs) nor closed form formulas to express the pmf recursively. Their pmfs can only be expressed using an infinite series representation, but their corresponding Laplace transforms have a closed form and, in many situations, they are relatively simple. Probability generating functions are often used for discrete distributions, but Laplace transforms are equivalent and can also be used. In this paper, we use Laplace transforms, but they will be converted to probability generating functions (pgfs) whenever the need arises to link with results which already appear in the literature. We begin with a few examples to illustrate the situation often encountered when new distributions are created.
Example 1 (Discrete stable distributions) The random variable $X \geq 0$ follows a positive stable law if the probability generating function and Laplace transform are given respectively as $P(s) = \exp\{-\lambda (1-s)^{\alpha}\}$ and $\varphi(s) = \exp\{-\lambda (1-e^{-s})^{\alpha}\}$, with $\lambda > 0$ and $0 < \alpha \leq 1$. The distribution was introduced by Christoph and Schreiber [1].
It is easy to see that $\varphi(s) = P(e^{-s})$.
The Poisson distribution can be obtained by fixing $\alpha = 1$. The distribution is infinitely divisible and displays long tail behavior. The recursive formula for its mass function has been obtained; see expression (8) given by Christoph and Schreiber [1].
Now if we allow λ to be a random variable with an inverse Gaussian distribution whose Laplace transform is given by ( ) is the Laplace transform of a nonnegative infinitely divisible (ID) distribution.
We can see that it is not always straightforward to find the recursive formula for the pmf of a nonnegative count distribution. Even if it is available, it might still be too complicated to use numerically for inference, whereas the Laplace transform or pgf can have a relatively simple representation.
We can observe that the new distribution is obtained by using the inverse Gaussian distribution as a mixing distribution. This is also an example of the use of a power mixture (PM) operator to obtain a new distribution. The PM operator will be further discussed in Section 1.2.
From a statistical point of view, when neither a closed form pmf nor a recursive formula for the pmf exists, maximum likelihood estimation can be difficult to implement.
The power mixture operator was introduced by Abate and Whitt [2] (1996) as a way to create new distributions from an infinitely divisible (ID) distribution together with a mixing distribution using Laplace transforms (LT). We shall review it in the next section, after a definition of an ID distribution.
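Definition 1.1.3. A nonnegative random variable X is infinitely divisible if, for every $n = 1, 2, \ldots$, its Laplace transform can be written as $\varphi(s) = (\varphi_n(s))^n$, where $\varphi_n(s)$ is the Laplace transform of a random variable. See Panjer and Willmott [3] (1992, p42) for this definition.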
Abate and Whitt [2] (1996) introduced the power mixture (PM) operator for ID distributions and also some other operators. To the operators already developed by them, we add the Esscher transform operator and the shift operator. All operators considered are discussed below.

$H(y)$ can be discrete or continuous but needs to be ID. This is the PM method for creating new parametric families, i.e., using the PM operator. The PM method can be viewed as a form of continuous compounding method. The ID property can be dropped, but as a result the new distribution created using the PM operator need not be ID. For the traditional compounding methods, see Klugman et al. [4] (p141-148). Abate and Whitt [2] also mentioned other methods.
Gerber [5] used a different parameterization and named this distribution generalized gamma. It is also called the positive tempered stable distribution in finance.
Let ( ) ( ) ( ) The pgf is given by expression (21) in the paper by Gerber [5]. The GNB distribution is infinitely divisible. If stochastic processes are used instead of distributions, the distribution can also be derived by considering a Poisson process subordinated to a generalized gamma process and obtaining the new distribution as the distribution of the increments of the new process. See Section 6 of Abate and Whitt [2] (p92-93). See Zhu and Joe [7] for other distributions which are related to the GNB distribution.
Note that, if is the Laplace transform of a random variable expressible as a random sum. A random sum is also called a stopped sum in the literature; see Chapter 9 of Johnson et al. [8] (p343-403). The Neyman Type A distribution given below is an example of a distribution of a random sum.

Example 3 Let $X = \sum_{i=1}^{Y} U_i$, where the $U_i$'s, conditional on Y, are independent and identically distributed and follow a Poisson distribution with rate φ, and Y follows a Poisson distribution with rate λ. Using the power mixture operator, we conclude that the LT for X is $\exp\{\lambda(\exp\{\varphi(e^{-s}-1)\}-1)\}$ and the pgf is $P(s) = \exp\{\lambda(\exp\{\varphi(s-1)\}-1)\}$. Properties and applications of the Neyman Type A distribution have been studied by Johnson et al. [8] (p368-378). The mean and variance of X are given respectively by $E(X) = \lambda\varphi$ and $V(X) = \lambda\varphi(1+\varphi)$. From these expressions, moment (MM) estimators have closed form expressions; see Section 4.1 for comparisons between MM estimators and SMHD estimators in a numerical study. In applications, the parameter λ is often smaller than the parameter φ.
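The random-sum representation of Example 3 can be sketched in a few lines. The following is a minimal illustration only, assuming the mean $\lambda\varphi$ and variance $\lambda\varphi(1+\varphi)$ of the Neyman Type A distribution; the function names `poisson_draw`, `neyman_type_a_draw` and `moment_estimates` are hypothetical and are not taken from the paper.

```python
import math
import random


def poisson_draw(rate, rng):
    """Draw one Poisson(rate) variate by inverting the cdf with a uniform."""
    u = rng.random()
    k = 0
    p = math.exp(-rate)  # P(X = 0)
    cdf = p
    # The p > 0 guard stops the loop if the tail underflows before cdf reaches u.
    while u > cdf and p > 0.0:
        k += 1
        p *= rate / k
        cdf += p
    return k


def neyman_type_a_draw(lam, phi, rng):
    """Random sum: Y ~ Poisson(lam), then add Y independent Poisson(phi) draws."""
    y = poisson_draw(lam, rng)
    return sum(poisson_draw(phi, rng) for _ in range(y))


def moment_estimates(sample):
    """Match the first two moments: mean = lam*phi, variance = lam*phi*(1+phi)."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    phi_hat = var / mean - 1.0
    if phi_hat <= 0.0:
        # The moment equations have no admissible solution without overdispersion.
        raise ValueError("sample is not overdispersed; MM estimates do not exist")
    lam_hat = mean / phi_hat
    return lam_hat, phi_hat


rng = random.Random(2018)
sample = [neyman_type_a_draw(2.0, 3.0, rng) for _ in range(20000)]
lam_hat, phi_hat = moment_estimates(sample)
print(lam_hat, phi_hat)
```

The same two-step draw (mixing variable first, then the conditional distribution) is what makes simulated inference feasible when the pmf itself is awkward to evaluate.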

Esscher Transform Operator
By tilting the density function using the Esscher transform, the Esscher transform operator can be defined and, provided the tilting parameter τ introduced is identifiable, new distributions can be created from existing ones.
Let X be the original random variable with Laplace transform ( ) In some cases, even if the pmf of Y has a closed form, the maximum likelihood (ML) estimators might be attained at the boundaries and might then not have the regular optimum properties.
Note that, parallel to the closed form pgf expressions for these new discrete distributions, it is often simple to simulate from the new distributions if we can simulate from the original distribution before the operators are applied. For example, let us consider the new distribution obtained by using the Esscher operator. It suffices to simulate from the distribution before applying the operator and apply the acceptance-rejection method to obtain a sample from the Esscher transformed distribution. The situation is similar for new distributions created by the PM operator. If we can simulate one observation from the mixing distribution of Y, which gives a realized value t, and if it is not difficult to draw one observation from the distribution with LT $(\kappa(s))^t$, then by combining these two steps we obtain one observation from the new distribution created by the PM operator. Consequently, simulated methods of inference offer alternatives to inference methods based on matching selected points of the empirical pgf with its model counterpart or other related methods; see Doray et al. [9] for regression methods using selected points of the pgfs. For these methods there is some arbitrariness in the choice of points, which makes them difficult to apply. The techniques of using a continuum of points to match are more involved numerically; see Carrasco and Florens [10]. The new methods also avoid the arbitrariness of the choice of points which is needed for the regression methods and for the k-L procedures proposed by Feuerverger and McDunnough [11] if characteristic functions are used instead of probability generating functions, and they are more robust than methods based on matching moments (MM) in general. We can reach the same conclusions for another class of distributions, namely mixture distributions created by other mixing mechanisms; see Klugman et al. [4], Nadarajah and Kotz [12], Nadarajah and Kotz [13].
[17] (p2132) in their Theorem 2.6. We also use the property that the compact domains under consideration shrink as the sample size n → ∞ to verify the conditions of Theorem 3.3 given by Pakes and Pollard [16] (1989) for SMHD methods using grouped data and the conditions of Theorem 7.1 of Newey and McFadden [17] (p2185) for ungrouped data. This approach appears to be new and simpler than other approaches which have been used in the literature to establish asymptotic normality for estimators using simulations; previous approaches are very general but they are also more complicated to apply. A similar notion of continuity in probability has been introduced in the literature of stochastic processes.
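The acceptance-rejection step mentioned for the Esscher operator can be sketched as follows. This is a minimal illustration under an assumed sign convention, namely that the tilted pmf is proportional to $e^{-\tau x} p(x)$ with $\tau \geq 0$; the helper names `poisson_draw` and `esscher_draw` are hypothetical. With a Poisson base distribution of rate λ, the tilted distribution is again Poisson with rate $\lambda e^{-\tau}$, which gives a simple check.

```python
import math
import random


def poisson_draw(rate, rng):
    """Draw one Poisson(rate) variate by inverting the cdf with a uniform."""
    u = rng.random()
    k = 0
    p = math.exp(-rate)
    cdf = p
    while u > cdf and p > 0.0:
        k += 1
        p *= rate / k
        cdf += p
    return k


def esscher_draw(base_draw, tau, rng):
    """Draw from the pmf proportional to exp(-tau * x) * p(x) by acceptance-rejection.

    base_draw(rng) must return one nonnegative draw from the original pmf p.
    A candidate x is accepted with probability exp(-tau * x), which is at most 1.
    """
    if tau < 0.0:
        raise ValueError("this sketch assumes tau >= 0")
    while True:
        x = base_draw(rng)
        if rng.random() < math.exp(-tau * x):
            return x


rng = random.Random(7)
lam, tau = 4.0, 0.5
tilted = [esscher_draw(lambda r: poisson_draw(lam, r), tau, rng) for _ in range(20000)]
tilted_mean = sum(tilted) / len(tilted)
target_mean = lam * math.exp(-tau)  # mean of the Poisson(lam * exp(-tau)) target
print(tilted_mean, target_mean)
```

The average acceptance probability equals the Laplace transform of the original distribution evaluated at τ, so the method slows down as τ grows.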
It is worth mentioning that simulated methods of inference are relatively recent. In advanced econometrics textbooks such as the book by Davidson and MacKinnon [18], only Section 9.6 is devoted to simulated methods of inferences
is more appropriate for a version S, and it is already known that it generates minimum HD estimators which are as efficient as the minimum chi-square estimators or maximum likelihood (ML) estimators for grouped data; see the Cressie-Read divergence measure with $\lambda = -1/2$ given by Cressie and Read [19] (p457) for version D.
Note that and by using the Cauchy-Schwarz inequality, we have Since the objective function remains bounded and this property continues to hold for the ungrouped data case, this suggests that SMHD methods could preserve some of the nice robustness properties of version D.
For ungrouped data, which is equivalent to grouped data with intervals of unit length and an infinite number of classes, we shall develop SMHD estimation based on the objective function Note that for a data set the sum given by the RHS of the above expression only has a finite number of terms as The version D with has been investigated by Simpson [14], Simpson [15], who also shows that the MHD estimators have a high breakdown point of at least 50% and are first order as efficient as the ML estimators. For the Poisson case, the ML estimator is the sample mean, which has a zero breakdown point and is consequently far less robust than the HD estimators, yet the HD estimators are first order as efficient as the ML estimators. This feature makes HD estimators attractive. For the notion of finite sample breakdown point as a measure of robustness, see Hogg et al. [20] (p594-595) and Kloke and McKean [21] (p29); for the notion of asymptotic breakdown point for large samples, see Maronna et al. [22] (p58).
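The boundedness argument can be written out as a short worked derivation. The notation below is an assumption made for illustration: $p_n(j)$ denotes the empirical proportion of cell j and $p_\theta(j)$ the model (or simulated) proportion, each summing to one.

```latex
\begin{align*}
Q(\theta) &= \sum_{j} \Big( \sqrt{p_n(j)} - \sqrt{p_\theta(j)} \Big)^2
           = 2 - 2 \sum_{j} \sqrt{p_n(j)\, p_\theta(j)}, \\
0 \le \sum_{j} \sqrt{p_n(j)\, p_\theta(j)}
  &\le \Big( \sum_{j} p_n(j) \Big)^{1/2} \Big( \sum_{j} p_\theta(j) \Big)^{1/2} = 1
  \quad \text{(Cauchy-Schwarz)}, \\
\text{so that} \quad 0 &\le Q(\theta) \le 2 .
\end{align*}
```

The bound does not depend on the data, so a single extreme observation can move the objective function by only a limited amount, in contrast with a sample moment.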
Simpson [14], Simpson [15] extended the works of Beran [23] for continuous distributions to discrete distributions. Beran [23] appears to be the first to introduce a weaker form of robustness not based on a bounded influence function and shows that efficiency can be achieved for robust estimators not based on influence functions. Also, see Lindsay [24] for discussions on the robustness of Hellinger distance estimators. Simulated versions extending some of the seminal works of Simpson will be introduced in this paper.
SMHD methods appear to be useful for actuarial studies when there is a need for fitting discrete risk models; see Chapter 9 of Panjer and Willmott [3] (p292-238) for fitting discrete risk models using ML methods. The SMHD methods appear to be useful for other fields as well, especially when there is a need to analyze count data with efficiency and robustness but the pmfs of the models do not have closed form expressions. For minimizing the objective functions to obtain SMHD estimators, a derivative-free simplex algorithm can be used, and the R package already has built-in functions to implement these minimization procedures.
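To make the estimation procedure concrete, the following is a minimal sketch of SMHD estimation for ungrouped Poisson data. It is an illustration under stated assumptions, not the procedure used in the paper: the simulated sample has size $U = \tau n$, the same uniforms are reused for every θ (one way of keeping the same seed across parameter values), and a crude coarse-to-fine grid search stands in for the simplex algorithm. All function names are hypothetical.

```python
import bisect
import math
import random
from collections import Counter


def poisson_draw(rate, rng):
    """Draw one Poisson(rate) variate by inverting the cdf with a uniform."""
    u, k, p = rng.random(), 0, math.exp(-rate)
    cdf = p
    while u > cdf and p > 0.0:
        k += 1
        p *= rate / k
        cdf += p
    return k


def simulated_poisson_pmf(theta, sorted_uniforms):
    """Empirical pmf of the sample x_i = F_theta^{-1}(u_i) for fixed uniforms u_i."""
    size = len(sorted_uniforms)
    pmf, h, done = {}, 0, 0
    p = math.exp(-theta)
    cdf = p
    while done < size and p > 0.0:
        upto = bisect.bisect_right(sorted_uniforms, cdf)  # uniforms <= F_theta(h)
        if upto > done:
            pmf[h] = (upto - done) / size
            done = upto
        h += 1
        p *= theta / h
        cdf += p
    if done < size:  # rounding left a few uniforms above the computed cdf
        pmf[h] = pmf.get(h, 0.0) + (size - done) / size
    return pmf


def hellinger_objective(theta, data_pmf, sorted_uniforms):
    """Sum over counts of (sqrt(empirical pmf) - sqrt(simulated pmf))^2."""
    sim = simulated_poisson_pmf(theta, sorted_uniforms)
    support = set(data_pmf) | set(sim)
    return sum((math.sqrt(data_pmf.get(h, 0.0)) - math.sqrt(sim.get(h, 0.0))) ** 2
               for h in support)


def smhd_poisson(data, tau=10, lower=0.05, upper=20.0, seed=0, rounds=4, points=41):
    """Derivative-free coarse-to-fine grid search (a stand-in for a simplex search)."""
    n = len(data)
    data_pmf = {h: c / n for h, c in Counter(data).items()}
    rng = random.Random(seed)
    uniforms = sorted(rng.random() for _ in range(int(tau * n)))  # fixed for all theta
    lo, hi, best = lower, upper, lower
    for _ in range(rounds):
        step = (hi - lo) / (points - 1)
        grid = [lo + i * step for i in range(points)]
        best = min(grid, key=lambda t: hellinger_objective(t, data_pmf, uniforms))
        lo, hi = max(lower, best - step), min(upper, best + step)
    return best


data_rng = random.Random(42)
data = [poisson_draw(3.0, data_rng) for _ in range(500)]
contaminated = data + [100] * 5  # a few gross outliers
print(smhd_poisson(data), smhd_poisson(contaminated),
      sum(contaminated) / len(contaminated))
```

Because the uniforms are fixed, the objective is a deterministic (piecewise constant) function of θ, which is why a derivative-free search is the natural choice here.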

Outlines of the Paper
In this paper, we develop unified simulated methods of inference for grouped and ungrouped count data using Hellinger distances. The paper is organized as follows.
Asymptotic properties for SMHD methods are developed in Section 2 where and for version S, let which can be reexpressed as , , In general, the intervals $I_i$'s form a partition of the nonnegative real line $R_0^+$.
3) where we want to test goodness of fit for a continuous distribution with support on the entire real line, as used in financial studies, we might let Clearly the set up fits into the scope of their Theorems 3.1 and 3.3; we shall rearrange the results of these two theorems before applying them to version D and version S of Hellinger distance inference and verify that the regularity conditions of these two theorems are satisfied.

Consistency
We define MHD estimators as given by the vector $\hat{\theta}_G$ for version D and $\hat{\theta}_G^S$ for version S, but emphasize version S as version D has been studied by Simpson [14]. Both versions can be treated in a unified way using the following Theorem 1 for consistency, which is essentially Theorem 3.1 of Pakes and Pollard [16] (p1038); the proof has been given by the authors.

Theorem 1 (Consistency)
Under the following conditions $\hat{\theta}$ converges in probability to $\theta_0$: a) as it is easier to use this condition when there is a need to extend to the infinite dimensional case with the space $l^2$.
An expression is ( ) ( ) occurs at the values of the vector values of the HD estimators, so the conditions a) and b) are satisfied for both versions and compactness of the parameter space Ω is assumed. Also, for both versions ( ) otherwise, this implies that there exist real numbers u and v with $0 < u < v < \infty$ such that ( ) of Pakes and Pollard [16] is an elegant theorem; its proof is also concise using the norm concept of functional analysis, and it allows many results to be unified.
Essentially, the same theorem remains valid with the use of the Hilbert space $l^2$ and its norm instead of the Euclidean space $R^m$ and the Euclidean norm. By using $l^2$ and its norm, the consistency of the ungrouped SMHD estimators can also be established, but further asymptotic results for the ungrouped SMHD estimators will be postponed and given in Section 3.
Asymptotic normality is more complicated in general. For the grouped case, Theorem 3.3 given by Pakes and Pollard [16] (p1040) can be used to establish asymptotic normality for both versions of Hellinger distance estimators. We shall rearrange the results of Theorem 3.3 under Theorem 2 and Corollary 1 given in the next section to make them easier to apply for HD estimation using both versions.
Since the proofs have been given by the authors, we only discuss here the ideas of their proofs to make it easier to follow the results of Theorem 2 and Corollary 1 in Section (2.2.2).
For both versions, ( ) A regularity condition under which the approximation is of the right order, and which implies condition (iii) of their Theorem 3.3 (the most difficult to check), is given as This condition is used to formulate Theorem 2 below and is slightly more stringent than condition (iii) of their Theorem 3.3, but it is less technical and sufficient for SMHD estimation. Clearly, for SMHD estimation given by expression (9) or expression (10). For unweighted simulated minimum chi-square estimation, for this condition to hold, independent samples for each θ cannot be used; see Pakes and Pollard [16] (p1048). Otherwise, only consistency can be guaranteed for estimators using version S. For version S, the simulated samples are assumed to have size $U = \tau n$ and the same seed is used across different values of θ to draw samples of size U. We implicitly make these assumptions for SMHD methods. These two assumptions are standard for simulated methods of inference; see Section 9.6 on the method of simulated moments (MSM) given by Davidson and MacKinnon [18] (p383-394).

Asymptotic Normality
In this section, we shall state a theorem, namely Theorem 2, which is essentially
We also comment on the conditions needed to verify asymptotic normality for the HD estimators based on Theorem 2.

Theorem 2
Let $\hat{\theta}$ be a vector of consistent estimators for $\theta_0$, the unique vector which satisfies $G(\theta_0) = 0$. Under the following conditions: 1) The parameter space Ω is compact, $\hat{\theta}$ is an interior point of Ω. 2) 3) $G$ is differentiable at $\theta_0$ with a derivative matrix ( ) for every sequence $\{\delta_n\}$ of positive numbers which converge to zero.
Then, we have the following representation which will give the asymptotic distribution of $\hat{\theta}$ in Corollary 1, i.e., ( ) ( ) ( ) ( ) or equivalently, using equality in distribution, ( ) ( ) ( ) The proofs of these results follow from the results used to prove Theorem 3.3 given by Pakes and Pollard [16] (p1040-1043). For expression (13) or expression (14) to hold, in general only condition 5) of Theorem 2 is needed and there is no need to assume that $G_n(\theta_0)$ has an asymptotic distribution. From the results of Theorem 2, it is easy to see that we can obtain the main result of the following Corollary 1, which gives the asymptotic covariance matrix for the HD estimators for both versions.

Corollary 1.

Let ( ) The matrices T and V depend on $\theta_0$; we also adopt the notations ( ) We observe that condition 4) of Theorem 2, when applied to Hellinger distance or in general, involves technicalities. The condition 4) holds for version D, so we only need to verify it for version S. Note that to verify condition 4), it is equivalent to verify and for the grouped case, it is given by We need to verify that we have the sequence of functions We shall outline the approach by first defining the notion of continuity in The notion of continuity in probability has been used in a similar context in the literature of stochastic processes, see Gusak et al. [25]; it will be introduced in the next paragraph, and we also make a few assumptions which are summarized by Assumption 1 and Assumption 2 given below along with the notion of continuity in probability. A related continuity notion, namely the notion of continuity with probability one, has been mentioned by Newey and McFadden [17] in their Theorem 2.6, as mentioned earlier. They also commented that this notion can be used for establishing asymptotic properties of simulated estimators introduced by Pakes [26]. Pakes [26] has also used pseudo random numbers to estimate probability frequencies for some models. For SMHD estimation, we extend a standard result of analysis, which states that a continuous function attains its supremum on a compact set, to a version which holds in probability.
This approach seems to be new and simpler than the use of the more general stochastic equicontinuity condition given by Section 2.2 in Newey and McFadden [17] (p2136-2138) to establish uniform convergence of a sequence of random functions in probability. Our approach uses the fact that as n → ∞ the set $S(\theta_0, \delta_n)$ shrinks to $\theta_0$, a property which does not seem to have been used previously by other approaches to establish , n S δ → θ θ . It might be more precise to use the term sequence of random functions rather than just random function here for the notion of continuity in probability, as the random function will depend on n.
Below are the assumptions we need to make to establish asymptotic normality for SMHD estimators and they appear to be reasonable.
Assumption 1 1) The pmf of the parametric model has the continuity property with ( ) ( ) is differentiable with respect to θ.
In general, the condition 2) will be satisfied if the condition 1) holds, and implicitly we assume the same seed is used for obtaining the simulated samples across different values of θ. For ungrouped data, we also need the notion of differentiability in probability to facilitate the application of Theorem 7.1 given by Newey and McFadden (1994, p2185-2186).
Definition 2 (Differentiability in probability) The sequence of random functions ( ) with 1 occurring at the ith entry. Furthermore, the vector is continuous and bounded in probability for all for some $\delta_0 > 0$. This concept is similar to the notion of differentiability in real analysis for nonrandom functions.
A similar notion of differentiability in probability has been used in the stochastic processes literature, see Gusak et al. [25] (p33-34); a more stringent differentiability notion, namely differentiability in quadratic mean, has also been used to study the local asymptotic normality (LAN) property for a parametric family, see Keener [29] (p326). The notion of differentiability in probability will be used in Section 3 with Theorem 7.1 of Newey and McFadden [17] to establish asymptotic normality for the SMHD estimators for the ungrouped case. We make the following assumption for and the vector ( ) ( ) is the transpose of q and I is the identity matrix of dimension $r \times r$ with $r = k + 1$. Using the delta method, the asymptotic covariance matrix of ( ) of version D is simply the asymptotic covariance matrix of given by ( ) and the asymptotic covariance matrix of ( ) We then have the vectors of HD estimators for version D and version S given respectively by $\hat{\theta}_G$ and $\hat{\theta}_G^S$ with asymptotic distributions given by ( ) ( ) the simulated sample size is $U = \tau n$.
Note that for version D, the HD estimators are as efficient as the minimum chi-square estimators or ML estimators based on grouped data. The overall asymptotic relative efficiency (ARE) between version D and version S for HD estimation is simply $\mathrm{ARE} = \tau/(1+\tau)$, and we recommend setting $\tau \geq 10$ to minimize the loss of efficiency due to simulations.
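A few worked values show how quickly the loss of efficiency shrinks as the simulated sample grows; the values of τ below are chosen only for illustration.

```latex
\[
\mathrm{ARE}(\tau) = \frac{\tau}{1+\tau}, \qquad
\mathrm{ARE}(1) = 0.5, \quad
\mathrm{ARE}(5) \approx 0.833, \quad
\mathrm{ARE}(10) \approx 0.909, \quad
\mathrm{ARE}(20) \approx 0.952 .
\]
```

With $\tau = 10$ the simulated version therefore gives up roughly 9% of the asymptotic efficiency of the deterministic version.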
An estimate for the covariance matrix
The asymptotic covariance matrix of $\hat{\theta}_G^S$ can be estimated if we can estimate ( ). Using a result given by Pakes and Pollard (1989, p1043), an estimate for Γ is the matrix ( ) with 1 occurring at the ith entry of the vector , 1, , ≤ and in general we can let $\delta = 1/2$. Note that the columns of $\hat{\Gamma}_n$ estimate the corresponding partial derivatives given by the columns of Γ.
For ungrouped data and for version D, it is equivalent to choose are as efficient as ML estimators for version D, a result which was already obtained by Simpson [14]. We postpone until Section 3 a more rigorous approach to justify the related result for version S using Theorem 7.1 given by Newey and McFadden [17]. The SMHD estimators given by $\hat{\theta}^S$ for ungrouped data will be shown to have the property
Section 3 may be skipped by practitioners whose main interest is in applications of the results.
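A numerical derivative matrix of this kind can be sketched as follows. This is an illustration only: it assumes a forward difference with step $\epsilon_n = n^{-\delta}$ along each unit vector, which is one reading of the construction, and the function name `derivative_matrix_estimate` and the test function are hypothetical.

```python
def derivative_matrix_estimate(g_n, theta_hat, n, delta=0.5):
    """Forward-difference estimate of the derivative matrix of g_n at theta_hat.

    g_n maps a parameter list to a list of function values.
    The step is eps = n ** (-delta) (an assumed choice); column i of the result
    is (g_n(theta_hat + eps * e_i) - g_n(theta_hat)) / eps, where e_i is the
    unit vector with 1 at the ith entry.
    """
    eps = n ** (-delta)
    base = g_n(list(theta_hat))
    columns = []
    for i in range(len(theta_hat)):
        shifted = list(theta_hat)
        shifted[i] += eps
        moved = g_n(shifted)
        columns.append([(m - b) / eps for m, b in zip(moved, base)])
    # Transpose so that entry [row][i] is d g_row / d theta_i.
    return [list(row) for row in zip(*columns)]


def example_g(theta):
    """A smooth test function with known derivatives, used only for illustration."""
    return [theta[0] ** 2, theta[0] * theta[1]]


gamma_hat = derivative_matrix_estimate(example_g, [1.0, 2.0], n=10 ** 6)
print(gamma_hat)
```

For a simulated objective, the same seed should be used when evaluating `g_n` at the shifted and unshifted parameter values; otherwise the difference quotient is dominated by simulation noise.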

Simple Hypothesis
In this section, the Hellinger distance
The version S is of interest since it allows testing goodness of fit for discrete or continuous distributions without closed form pmfs or density functions; all we need is to be able to simulate from the specified distribution. We shall justify the asymptotic chi-square distributions given by expression (23) and expression (24) below.
Note that ( ) ( ) ( ) Using standard results for the distribution of quadratic forms and the property of the trace operator of a matrix with ( ) ( ) ( ) ( ) , see Luong and Thompson [30] (p247), we have the asymptotic chi-square distributions as given by expression (23) and expression (24). The problem of how to choose the intervals is rather complex, as it depends on the type of alternatives we would like to detect.
We can also follow the recommendations for the Pearson's statistics, see Greenwood and Nikulin [31]; also see Lehmann [32] (p341) for more discussions and references on this issue.
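A simulated goodness-of-fit statistic for a simple hypothesis can be sketched as below. The scaling $4n\tau/(1+\tau)$ and the degrees of freedom (number of cells minus one) are assumptions derived from the variance inflation factor $1 + 1/\tau$ of an independent simulated sample of size $U = \tau n$; they should be checked against expressions (23) and (24). The cell layout and all function names are hypothetical.

```python
import math
import random


def poisson_draw(rate, rng):
    """Draw one Poisson(rate) variate by inverting the cdf with a uniform."""
    u, k, p = rng.random(), 0, math.exp(-rate)
    cdf = p
    while u > cdf and p > 0.0:
        k += 1
        p *= rate / k
        cdf += p
    return k


def cell_proportions(sample, n_cells):
    """Proportions for the cells {0}, {1}, ..., {n_cells - 2}, {n_cells - 1, ...}."""
    counts = [0] * n_cells
    for x in sample:
        counts[min(x, n_cells - 1)] += 1
    return [c / len(sample) for c in counts]


def simulated_hd_statistic(data, null_draw, n_cells, tau, rng):
    """Return (statistic, degrees of freedom) for the simple null hypothesis.

    null_draw(rng) draws one observation from the fully specified null model.
    """
    n = len(data)
    simulated = [null_draw(rng) for _ in range(int(tau * n))]
    p_data = cell_proportions(data, n_cells)
    p_sim = cell_proportions(simulated, n_cells)
    distance = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p_data, p_sim))
    return 4.0 * n * tau / (1.0 + tau) * distance, n_cells - 1


rng = random.Random(11)
data_null = [poisson_draw(2.0, rng) for _ in range(1000)]  # generated under the null
data_alt = [poisson_draw(4.0, rng) for _ in range(1000)]   # generated under an alternative
stat_null, df = simulated_hd_statistic(data_null, lambda r: poisson_draw(2.0, r), 5, 10, rng)
stat_alt, _ = simulated_hd_statistic(data_alt, lambda r: poisson_draw(2.0, r), 5, 10, rng)
print(stat_null, stat_alt, df)
```

The statistic would then be compared with an upper quantile of the chi-square distribution with `df` degrees of freedom, for example 9.488 at the 5% level when `df` is 4.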

Composite Hypothesis
Just as the chi-square distance, the Hellinger distance $Q_n(\theta)$ can also be used for construction of the test statistics for the composite hypothesis, H 0 : data comes from a parametric model $\{F_\theta\}$, where $\{F_\theta\}$ can be a discrete or continuous parametric model. The chi-square test statistics are given by ( ) for version D and for version S, ( ) L θ as given by expression (11).
Also, using expression (11) and expression (13), ( ) ) and the matrix ( ) ( ) with the rank of the matrix B also equal to its trace. The argument used is very similar to the one used for the Pearson's statistics, see Luong and Thompson [30] (p249).
For version S, ( ) is based on expressions (9-10) for version S. This justifies the asymptotic chi-square distribution for version S as given by expression (25) and expression (26). This version is useful for model testing for nonnegative continuous models without closed form densities, see Luong [33] and we have I θ which is the Fisher information matrix.
For version D, we then have $Q(\theta)$ is differentiable in probability at $\theta_0$ with the derivative vector given by ( )
For the approximation to be valid, we define The regularity conditions (1-3) of Theorem 3 can easily be checked. The condition 4 follows from expression (27) established by Simpson [14]. The condition 5 might be the most difficult to check as it involves technicalities, and it is verified in TA2 of the Appendices. Assuming all conditions can be verified, we apply Theorem 3 for SMHD estimation with Assumption 1 and Assumption 2. Therefore, we have the following equality in distribution using the condition 4) of Theorem 3 and expression (27) One might want to define the extended Cramér-Rao lower bound for simulated method estimators to be ( ) ( ) , using the inequality ( ) are not close according to the discrepancy measure using SHD as n → ∞, an argument also used by Simpson [14] to justify his expression $\rho^* = 0$, see Simpson [14] (p805-806).
Using ( ) ( ) , we might conclude that in probability we have the inequalities ( )

Methods to Approximate Probabilities
Once the parameters are estimated, probabilities can be estimated. For situations where recursive formulas exist, Panjer's method can be used; see Chapter 9 of the book by Klugman et al. [4]. Otherwise, we might need to approximate probabilities by simulations or by analytic methods.
In this section, we discuss some methods for approximating the probabilities $p_h$, $h = 0, 1, \ldots$, for a discrete nonnegative random variable X with pgf $P(s)$, which can be used if a recursion formula for $p_h$ is not available. The saddlepoint method and the method based on inverting the characteristic function can be used.
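The inversion approach can be illustrated with a discrete Fourier inversion of the pgf evaluated on the unit circle, $p_h \approx N^{-1} \sum_{k=0}^{N-1} P(e^{2\pi i k/N}) e^{-2\pi i k h/N}$. This is a generic sketch rather than the specific method of the paper; the function name `pmf_from_pgf` and the choice $N = 1024$ are assumptions, and the approximation is accurate only when the probability mass beyond N is negligible.

```python
import cmath
import math


def pmf_from_pgf(pgf, h_max, n_points=1024):
    """Approximate p_0, ..., p_{h_max} from a pgf by discrete Fourier inversion.

    pgf must accept a complex argument on the unit circle. The returned value
    for p_h also contains the aliased mass p_{h + N} + p_{h + 2N} + ..., so
    n_points (N) should be large relative to the effective support.
    """
    if n_points <= h_max:
        raise ValueError("n_points must exceed h_max")
    values = [pgf(cmath.exp(2j * math.pi * k / n_points)) for k in range(n_points)]
    probs = []
    for h in range(h_max + 1):
        total = sum(v * cmath.exp(-2j * math.pi * k * h / n_points)
                    for k, v in enumerate(values))
        probs.append((total / n_points).real)
    return probs


def neyman_type_a_pgf(s, lam=2.0, phi=3.0):
    """pgf of the Neyman Type A distribution: exp(lam * (exp(phi * (s - 1)) - 1))."""
    return cmath.exp(lam * (cmath.exp(phi * (s - 1.0)) - 1.0))


def poisson_pgf(s, rate=3.0):
    """pgf of the Poisson distribution, used here only as a check case."""
    return cmath.exp(rate * (s - 1.0))


neyman_probs = pmf_from_pgf(neyman_type_a_pgf, 200)
poisson_probs = pmf_from_pgf(poisson_pgf, 20)
print(neyman_probs[:5])
```

A fast Fourier transform would compute all N values at once; the direct sum is kept here for clarity.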
See Butler [35] (p8-9) for details of the saddlepoint approximation. It can be described as using $p_h^a$ to approximate $p_h$, with ( ) ( ) ( )
ing the parameters and the corresponding ratios ARE are estimated using the simulated samples and the AREs are displayed in Table A.

Poisson Distribution
For the Poisson model with parameter λ, we compare the performance of $\hat{\lambda}_{ML}$, the ML estimator for λ, which is the sample mean, versus the SMHD estimator $\hat{\lambda}_S$ using the ratio which shows that $\hat{\lambda}_S$ performs much better than the sample mean, which is the ML estimator.
For drawing simulated samples from the DPS distribution, the algorithm given by Devroye [37] is used.

Conclusion
More simulation experiments are needed to further study the performance of the SMHD estimators versus commonly used estimators across various parametric models, and we do not have the computing facilities to carry out such large scale studies. Most of the computing work was carried out using only a laptop computer. So far, the simulation results confirm the theoretical asymptotic results, which show that SMHD estimators have the potential of having high efficiencies for parametric models with finite Fisher information matrices and that they are robust if data is contaminated; the last feature might not be shared by ML estimators.
The first two terms of the RHS of the above equation are bounded in probability as they have limiting distributions, and this implies that the third term is also bounded in probability by using the Cauchy-Schwarz inequality. Now using the conditions of Assumption 1 of Section (2.2.2) and implicitly the assumption that the same seed is used across different values of θ, we then have as

$H_\lambda$ is the distribution with Laplace transform $h(s)$. The resulting Laplace transform,

A. Luong et al., DOI: 10.4236/ojs.2018.81012, p. 189, Open Journal of Statistics
is the Laplace transform of a random variable. In many situa- same parametric family. See Panjer and also mentioned other methods.
Example 2 (Generalized negative binomial) The generalized negative binomial (GNB) distribution introduced by Gerber [5] can be viewed as a power variance function distribution mixture of a Poisson distribution. The power variance function distribution introduced by Hougaard [6] is obtained by tilting the positive stable distribution using a parameter θ. It is a three-parameter continuous nonnegative distribution with Laplace transform given by

( transform of a Poisson distribution with rate $\mu = 1$. The Laplace transform of the GNB distribution can be represented as ( )
sample of size U which is the proportion of observations of the simulated sample which have taken a value in $I_j$. To illustrate their theory, Pakes and Pollard [16] (p1047-1048) considered simulated estimators obtained by minimizing with respect to θ the objective function the estimators satisfy the regularity conditions of their Theorems 3.1 and 3.3, which leads to the conclusion that the simulated estimators are consistent and have an asymptotic normal distribution. As we already know, a weighted version can be more efficient; if we attempt a version S for the Pearson's chi-square distance, Hellinger distance as given by

$Q(\theta)$ remains always bounded. Therefore the objective function for version S can be defined as consistency and asymptotic normality are shown in Section 2.2. Based on asymptotic properties, consistency of the SMHD estimators holds in general, but high efficiencies of SMHD estimators can only be guaranteed if the Fisher information matrix of the parametric model exists, a situation which is similar to likelihood estimation. One can also view the estimators as fully efficient within the class of simulated estimators obtained with the model pmf being replaced by a simulated version. Chi-square goodness-of-fit test statistics are constructed in Section 2.3. The ungrouped case can be seen as grouped data with intervals of unit length and an infinite number of intervals; it is treated in Section 3, where the ungrouped SMHD estimators are shown to have good efficiencies. The breakdown point for the SMHD estimators remains at least 1/2, just as for the deterministic version. A limited simulation study is included in Section 4. First, we consider the Neyman Type A distribution and compare the efficiencies of the SMHD estimators versus moment (MM) estimators; simulation results appear to confirm the theoretical results showing that the SMHD estimators are more efficient than the MM estimators based on matching the first two empirical moments with their model counterparts for a selected range of parameters. The Poisson distribution is considered next, and the study shows that despite being less efficient than the ML estimator, the efficiency of the SMHD estimators remains high and the estimators are far more robust.
is a vector of random functions with values in a Euclidean space and $\|\cdot\|$ is the Euclidean norm and if Their theory is summarized by their Theorem 3.1 and Theorem 3.3 given in Pakes and Pollard [16] (p1038-1043). It is very general and it is clearly applicable for both versions D and S for Hellinger distance with grouped data. Let

(
we state condition b) as for both versions of ( ) n Q θ whether deterministic or simulated, DOI: 10.4236/ojs.2018.81012198 Open Journal of Statistics the minimum Hellinger distance estimators (MHD) are consistent.Theorem 3.1 ments(MSM) given by Davidson and McKinnon [19] (p383-394).For numerical optimization to find the minimum of the objective function ( ) n Q θ , we rely on direct search simplex methods which are derivative free and the R package already has prewritten functions to implement direct search methods.
compact set. The compactness of this set simplifies proofs and does not appear to have been used in previous approaches in the literature. Observe that belongs to the compact set $S(\theta_0, \delta_n)$ in probability. This is similar to the property of a nonrandom continuous function in real analysis.

probability as $n \to \infty$. The technical details of these arguments are given in technical appendices TA1.1 and TA1.2 in the Appendices section at the end of the paper.

Subsequently, we define the notion of continuity in probability, which is similar to the one used in stochastic processes; see Gusak et al. [25] (p33) for a related notion of continuity in probability for stochastic processes.

Definition 1 (Continuity in probability) A sequence of random functions

This can be viewed as an extension of the classical result of continuity in real analysis. It is also well known that the supremum of a continuous function on a compact domain is attained at a point of the compact domain; see Davidson and Donsig [27] (p81) or Rudin [28] (p89) for this classical result. The equivalent property for a random function which is only continuous in probability is that the supremum of the random function is attained at a point of the compact domain in probability. The compact domain we study here is given by

and as $n \to \infty$,

Theorem 7.1 given by Newey and McFadden (1994, p2185-2186). Before stating their Theorem 7.1, Newey and McFadden mention the notion of approximate derivative used in their theorem; the definition given below makes it clearer.

Definition 2 (Differentiability in probability)

same seed being used across different values of θ is differentiable in probability with the same derivative vector as ( ) given by the data, so we can focus on version D and make the adjustment for version S. We need the asymptotic covariance matrix Σ of the vector version D and for version S, we shall let S = T T. Recall that, from properties of the multinomial distribution, the covariances of

$I(\theta)$ is the Fisher information matrix for ungrouped data with elements given by

$Q_n(\theta)$ is used to construct goodness of fit test statistics for the simple hypothesis H0: the data come from a specified distribution $F_{\theta_0}$, where $F_{\theta_0}$ can be a discrete or a continuous distribution. The chi-square test statistics and their asymptotic distributions are given below with

where $\hat{\theta}_G$ and $\hat{\theta}_G^S$ are the vectors of HD estimators which minimize $Q_n(\theta)$ for version D and version S respectively, assuming $k > m$. To justify these asymptotic chi-square distributions, note that we have for version D,

Theorem 2 given by Simpson [14] (p804), which shows that the MHD estimators are as efficient as the maximum likelihood (ML) estimators. For version S with ungrouped data, it is more natural to use Theorem 7.1 of Newey and McFadden [17] (p2185-2186) to establish asymptotic normality for SMHD estimators. The ideas behind Theorem 7.1 can be summarized as follows. When the objective function $Q_n(\theta)$ is non-smooth and the estimator is the vector $\hat{\theta}$ obtained by minimizing $Q_n(\theta)$, we can consider the vector $\theta^*$ obtained by minimizing a smooth function

by the proofs of Theorem 7.1 given by Newey and McFadden. The following Theorem 3 is essentially Theorem 7.1 given by Newey and McFadden, but restated with estimators obtained by minimizing an objective function instead of maximizing an objective function, and it requires a condition more stringent than the original condition v) of their Theorem 7.1. We also require compactness of the parameter space Ω. Newey and McFadden do not use this assumption, but with it the proofs are less technical and simpler. It is also likely to be met in practice.
θ; with this definition, the asymptotic covariance matrix of SMHD estimators attains this bound just as the asymptotic covariance matrix of ML estimators attains the classical Cramér-Rao lower bound.

factor which also appears in other simulated methods. It can be interpreted as the adjustment factor when estimators are obtained by minimizing a simulated version of the objective function instead of the original objective function, with the model distribution being replaced by a sample distribution using a simulated sample; see Pakes and Pollard [16] (p1048) for the simulated minimum chi-square estimators, for example. Clearly, $I(\theta_0)$ can also be estimated numerically as in the grouped case, which is given in Section 2. Results of Theorem 2 and Corollary 1 allow us to establish asymptotic normality of the MHD estimators for both versions in a unified way.

We close this section by showing that the asymptotic breakdown point of SMHD estimators is the same as that of MHD estimators under the true model, namely at least 1/2. We use the argument given by Simpson for version D of the HD estimators, see Simpson [14] (p805-806), and assume that only the original data set might be contaminated; there is no contamination coming from simulated samples. This assumption appears to be reasonable as we can control the simulation procedures. We focus only on the strict parametric model, and the set up is less general than the one considered by Theorem 3 of Simpson [13] (p805), which also includes distributions near the parametric model.
true model, which is similar to version D. The only difference is that here we have an inequality in probability. From this result, we might conclude that the SMHD estimators preserve the robustness properties of version D and that the loss of asymptotic efficiency compared to version D can be kept small if τ ≥ 10.
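A quick calculation shows why a value such as τ ≥ 10 keeps the efficiency loss small. The sketch below assumes that the adjustment factor for version S takes the form 1 + 1/τ, with τ the ratio of the simulated sample size to the data sample size, as is usual for simulated estimators of the Pakes and Pollard type; both the form of the factor and the function name are assumptions made for illustration.

```python
def relative_efficiency(tau):
    """Efficiency of version S relative to version D, assuming the asymptotic
    variance is inflated by the factor (1 + 1/tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    return 1.0 / (1.0 + 1.0 / tau)

# Relative efficiency for a few illustrative ratios tau.
for tau in (1, 2, 5, 10, 50):
    print(tau, round(relative_efficiency(tau), 3))
```

Under this assumption, τ = 10 gives a relative efficiency of 10/11, about 0.909, while τ = 1 gives only 0.5.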
information matrix exists and we can check the efficiency and robustness of the SMHD estimator and compare it with the ML estimator, which is the sample mean. Since there is only one parameter to estimate, we are able to fix U = 10000 for the simulated sample size from the Poisson model without slowing down the computations.

of data coming from the discrete positive distribution with parameter λ 87.5592 43.6890 102.8376 85.9624 62.8738 51.2473 75.8619

Overall, the SMHD estimators appear to perform very well for the range of parameters often encountered in actuarial studies; here we observe that the asymptotic efficiencies range from 0.7 to 1.1. We also study a contaminated Poisson(λ) model with p = 90% of the observations coming from the Poisson(λ) model and q = 1 − p = 10% of the observations coming from a discrete positive stable (DPS) distribution with α = 0.9 and with λ having the same value as in the Poisson model. We compare the performance of the sample mean for λ, which is the ML estimator, versus the SMHD estimator $\hat{\lambda}^S$ under the contaminated Poisson model as described, and we assess the robustness of the SMHD estimator versus the ML estimator in the presence of contamination. The sample mean loses its efficiency and becomes very biased. The results are given at the bottom of Table
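The effect of contamination on the two estimators can be illustrated with a small sketch. It does not reproduce the study above: sampling from the DPS distribution is not attempted, and a fixed large count (the hypothetical value 50) stands in for the heavy-tailed contaminating observations. The seed, the sample sizes and the grid of candidate values are also assumptions, and a grid search replaces the simplex method since there is a single parameter.

```python
import bisect
import math
import random
from collections import Counter

def poisson_draws(lam, uniforms):
    """Poisson(lam) draws by inversion of the cdf at the supplied uniforms."""
    p = math.exp(-lam)
    cdf = [p]
    k = 0
    while 1.0 - cdf[-1] > 1e-12 and k < 10_000:
        k += 1
        p *= lam / k
        cdf.append(cdf[-1] + p)
    return [bisect.bisect_left(cdf, u) for u in uniforms]

def pmf(sample):
    """Relative frequencies of the observed counts."""
    n = len(sample)
    return {k: c / n for k, c in Counter(sample).items()}

def hellinger_sq(p, q):
    """Sum over k of (sqrt(p_k) - sqrt(q_k))^2."""
    return sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
               for k in set(p) | set(q))

rng = random.Random(7)               # hypothetical seed
lam, n, U = 3.0, 500, 5000           # illustrative values only
n_bad = n // 10                      # q = 10% contaminated observations
outlier = 50                         # stand-in for heavy-tailed DPS draws (assumption)
clean = poisson_draws(lam, [rng.random() for _ in range(n - n_bad)])
data = clean + [outlier] * n_bad
p_n = pmf(data)
fixed_u = [rng.random() for _ in range(U)]   # same uniforms for every candidate

# Grid of candidate values 0.5, 0.55, ..., 10.0 for the Poisson parameter.
grid = [0.5 + 0.05 * i for i in range(191)]
smhd = min(grid, key=lambda l: hellinger_sq(p_n, pmf(poisson_draws(l, fixed_u))))
mean = sum(data) / n
print("sample mean:", round(mean, 3), " SMHD estimate:", round(smhd, 3))
```

In this setup the sample mean is pulled far from the Poisson parameter by the contaminating counts, whereas the simulated Hellinger objective gives those counts almost no weight because the simulated Poisson samples place essentially no mass there.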

. Hellinger and Chi-Square Distance Estimation
Luong et al.

Table A. Asymptotic relative efficiencies between MM estimators and SMHD estimators