Stochastic SIR Household Epidemic Model with Misclassification ()
1. Introduction
Inference of the stochastic SIR household epidemic model without misclassification is well analyzed in [1] - [6]. The work of [1] and [7] provided maximum likelihood based algorithm for its inferences. But sometimes, the final size epidemic data is subject to misclassification error. This occurs in categorical data when the actual and recorded categories for subject differs [8] [9]. For example, the susceptibles may be wrongly be classified as infectives or an infectives wrongly classified as susceptibles. It then becomes necessary to adjust our inferences to such errors in order to get the precise parameter estimates and model that adequately fits the final size epidemic data. Using the theoretical framework developed in this work with simulations, we explored the estimates of the parameters of the models and the adequacy of their fitness to the final size epidemic data for a range misclassification probabilities
. We do this by exploring the plots of the root mean square error of the estimates defined over the range of the misclassification probabilities in [0, 0.5), in order to provide clarity on the nature of their fitness to the final size epidemic data. This enables us to identify the model that adequately fits better to the final size epidemic data.
2. Material and Methods
The Model
We have assumed that the stochastic SIR household final size data of [4], is subject to misclassification error; which may be caused by susceptibles wrongly classified as infectives or infectives wrongly classified as susceptibles.
The probability of observing i infectives in a household of size n given that the true number of infectives is j and that of the susceptibles is
takes cognisance of the true and false positives with their classification probabilities
and
.
Let x and y be the observed false and true positives in a household of size n. Then the probability of observing
positives, given that the true number of positives is j can be written as,
(1)
We can express, the probability of making correct and precise observation of an infective when it is a true infective, and a susceptible, when it is a true susceptible, independently as,
and
. The distribution of observing i number of infectives correctly and incorrectly is Binomial distributed,
, and
. Equally the probability of observing the susceptibles correctly and incorrectly are Binomial distributed,
and
respectively.
The number of infectives observed is the sum of the true and false positives and has the sum of the Binomial distributions,
(2)
Equally, the number of susceptibles observed is the sum of the true and false negatives and has the sum of the Binomial distributions,
(3)
The probability of observing i infectives in a household of size of n can then be written as,
(4)
Since,
(5)
(6)
where
,
are the final size probabilities. We can then write,
(7)
where
(8)
We can generalize the expression,
for
and any
using the results of
as,
(9)
Knowing the terms of
, the expression for
are evaluated. For example the probability of observing
infectives in a household of size n can be evaluated as,
where
are the final size probabilities, defined as the probability of observing j infectives in a household of size n, [10] [11].
Similarly, the chance of observing
infectives in a household of size n can be obtained using the terms of
. This probability reduces to,
In general, the probability of observing
infectives in a household of size n, is obtained as,
(10)
Equations (10) is the sum of two Binomial distributions,
and
defined as the probabilities of observing
true positives from the true j number of infectives and k false positives from the remaining
number of susceptibles in a household of size n.
Alternatively,
has the form,
(11)
Equation (11) is also the sum of two Binomial distributions in Equation (10) and defined as the probability of observing k true positives from the true j infectives and
false positives from the remaining
susceptibles in a household of size n.
Here, both Equations (10) and (11) for
satisfies,
3. The Three-Dimensional Model
If the false positive and false negative misclassification probabilities are the same then Equations (2) and (3) for the distribution of the number of infected individuals observed and those of the susceptible individuals observed only depend on the common misclassification probability denoted here as
. In these equations,
and
are replaced by
same as in the expressions for
and simplified as,
(12)
Alternatively, we can employ
(13)
Equations (12) and (13) for
which are particular cases of Equations (10) and (11) when the misclassification probabilities are the same are made of two Binomial distributions. While Equation (12) expresses the probability of observing
infectives from the true j infectives and k infectives from the remaining
susceptibles in the household of size n, Equation (13) expresses the probability of observing k infectives from the true j infectives and
infectives from the remaining
susceptibles in the household of size n.
Since they are probabilities, both equations
, must satisfy,
4. Maximum Likelihood Estimation
The distribution of the final size epidemic data
is multinomial, [12] where
are the number of households of size n in which i infectives are observed and
are the probabilities of observing i infectives in a household of size n, [1] [4] [13]. The approximate likelihood function of the model parameters is then a function of
and dependent on the parameters to be estimated from the four dimensional model. These are the local infection rate
, the probability of avoiding infection from outside the household
, the false positive misclassification probability,
and the false negative misclassification probability,
and hence
has the form
.
The approximate likelihood function can be written as,
(14)
where max is the maximum household size.
Since the estimates that maximize the approximate likelihood function also maximize the approximate loglikelihood function, we can write,
(15)
where
The approximate likelihood function for the three dimensional model also has similar representation with differences in the number of parameters to be estimated.
5. Numerical Simulation and Inference on the Three and Four Dimensional Final Size Epidemic Data
How precise are the maximum likelihood estimates from the numerical optimizations, given the minimum epidemic and population sizes, the proportion of the initial susceptibles infected and the magnitude of the misclassification probabilities? Which of these parameters are intractable to estimate in the face of large misclassification probabilities? Which model best fits the final size epidemic data in the face of varying misclassification probabilities in the permissible region,
? These are some of the questions to be explored in this section using simulation studies.
Fitting the Three Models to Data from the Four Dimensional Model
We demonstrate the computational procedures of fitting the three models to four dimensional epidemic data from simulation studies and examined the behaviours of the estimates using some functions and subroutines developed for this work as,
Run the function FourDimThreeATwoSNsimhousesScatterPlotsMisspec to simulate four dimensional household epidemic data with
infectious period distribution, theoretical parameters,
and
. It then calculate the corresponding parameters of the three models with
infectious period distribution computes, their mean, standard deviation and root mean square error of the estimates and plot the estimates using the following subroutines.
a)
, provides starting values for the two dimensional model parameters,
and
according to [12].
b)
, computes the negative of the loglikelihood function associated with the three dimensional model using the parameters of
infectious period distribution, the final size epidemic data and the starting parameters values obtained by inverse transformation of the parameter space.
c)
, computes the negative loglikelihood function associated with the two dimensional model from the parameters of
infectious period distribution, the final size epidemic data and the starting values according to [12].
d)
, computes the misclassification Probabilities associated with the three dimensional model from the misclassification probability parameter
and maximum household size n.
e)
computes the final size probabilities associated with the two dimensional model from the parameters of
infectious period distribution,
and maximum household size n.
f)
, computes the sum of the product of the misclassification probabilities and the final size probabilities associated with the three dimensional model for the computation of the negative loglikelihood function.
g)
, computes the misclassification probabilities associated with the four dimensional model.
h)
, computes the products of the misclassification probabilities and the final size probabilities associated with the loglikelihood function of the four dimensional model.
i)
, calculates z and
, from the parameters of
infectious period distribution, model parameters
and vector of household sizes, where houses is the vector of household sizes.
j)
calculates the threshold parameter,
from the parameters of
infectious period distribution, theoretical parameters
and vector of household sizes, houses.
Using the theoretical parameters,
,
,
,
,
, household structure in [1] but fifty times its population size given by 70700, minimum epidemic size of 1000 and simulation runs of 1000. The estimates of the parameters of the three models were obtained for the following pairs of the misclassification probabilities (
), (
) and (
) respectively shown in Figures 1-3 and analyzed in Tables 1-3 respectively.
Figure 1. Plots of the estimates of
,
and histogram of
when
. (a) Estim. (
): 4Dim. model; (b) Estim. (
): 3Dim. model; (c) Estim. (
): 2Dim. model; (d) Estim. (
): 4Dim. model; (e) Hist. of
3Dim. model.
Figure 2. Plots of the estimates of
,
and histogram of
when
. (a) Estim. (
): 4Dim. model; (b) Estim. (
): 3Dim. model; (c) Estim. (
): 2Dim. model; (d) Estim. (
): 4Dim. model; (e) Hist. of
3Dim. model.
Table 1. Table of the mean of the parameter estimates of the three models.
Table 2. Table of the standard deviation of the parameter estimates of the three models.
Table 3. Table of the root mean square error of the parameter estimates of the three models.
Figure 3. Plots of the estimates of
,
and histogram of
when
. (a) Estim. (
): 4Dim. model; (b) Estim. (
): 3Dim. model; (c) Estim. (
): 2Dim. model; (d) Estim. (
): 4Dim. model; (e) Hist. of
3Dim. model.
Figure 1 shows fitting the Two, Three and Four Dimensional Models to the Four Dimensional the final Size Epidemic Data, when
,
.
Figure 2 shows fitting the Two, Three and Four Dimensional Models to the Four Dimensional the final Size Epidemic Data, when
,
.
Figure 3 shows fitting the Two, Three and Four Dimensional Models to the Four Dimensional the final Size Epidemic Data, when
,
.
6. Comparison of the Models on the Four Dimensional Data
6.1. Simulations with the Theoretical Parameter,
,
,
,
,
We simulated household epidemic, with the following theoretical parameters,
,
,
,
and misclassification probabilities,
,
with step size of =0.005.
With theoretical parameters corresponding to
, we found the estimates of
for the two dimensional model to be imprecise and biased especially when the misclassification probabilities increase from zero as in Figure 4(a). The two dimensional model is not a sufficient fit to the four dimensional final epidemic data. These behaviours can be observed for other parameters for the two dimensional model as in Figures 4 (b)-(g).
The three dimensional model has precise estimates of
for misclassification probability in
, while the four dimensional model is best
Figure 4. Plots of the root mean square error of the maximum likelihood estimates of the parameters for the three models when
,
,
,
,
. (a) Estim. of
: Green = 4Dim, Yellow = 3Dim; Red = 2Dim; (b) Estim. of
: Green = 4Dim, Yellow = 3Dim; Red = 2Dim; (c) Estim. of
: Green = 4Dim, Yellow = 3Dim; Red = 2Dim; (d) Estim. of z: Green = 4Dim, Yellow = 3Dim; Red = 2Dim; (e) Estim. of
: Green = 4Dim, Yellow = 3Dim; (f) Estim. of
: Green = 4Dim, Yellow = 3Dim; (g) Estim. of
: Green = 4Dim, Yellow = 3Dim; Red = 2Dim.
if
and
. This shows that the four dimensional model has precise estimates of
compared to those of the two and three dimensional models, if the misclassification probabilities are large and far apart from each other.
In the case of
, the two dimensional model has imprecise and biased estimates, while those of the three dimensional model are precise if
, those of the four dimensional model are precise if,
and
.
In the case of
, the two dimensional model has precise estimates if,
, while the three dimensional model has precise estimates, if
, the four dimensional model is best if,
and
.
In the case of z, we found that the two dimensional model is best if,
, while the three dimensional model is best if
. The estimates of the four dimensional model are precise if,
and
.
In the case of the false positive misclassification probability estimates, the three dimensional model is best, if
, while the four dimensional model is best if
and
respectively.
In the case of the false negative misclassification probability, the three dimensional model is best if,
, while the four dimensional model is best if,
and
.
Similarly in the case of the threshold parameter, the two dimensional model is best if
, the three dimensional model is best, if
, while the four dimensional model is best, if
and
.
In summary, we see in Figures 4(a)-(g) that the estimates from the four dimensional model are more precise than those from the two and three dimensional models when the misclassification probabilities are large and far apart from each other.
However if
, then those of the three dimensional models are precise since the false negative misclassification probability,
reduces to the false positive misclassification probability, which is a particular case of the four dimensional model.
The three dimensional are precise if the two misclassification probabilities are close to each other while those of the two dimensional model are best if the misclassification probabilities are zero or close to it.
6.2. Simulations with Theoretical Parameters,
,
,
,
,
We simulated household epidemic with the following theoretical parameters along the line
,
, step size = 0.005.
,
,
,
.
We then obtained the estimates of the parameters of the three models and presented plots of their root mean square error in Figures 5 (a)-(g) for a range of misclassification probabilities in
.
From the simulation plots in Figure 5(a), we see that the estimates of
from the two dimensional model are driven by bias and are precise if,
, while the estimates of
from three dimensional model are precise if,
. Those of the four dimensional model are precise if,
and
.
In the case of
in Figures 5(b), the estimates of the two dimensional model are best if,
, those of the three dimensional model are best if,
, while those of the four dimensional model are best if
.
Also, in the case of
in Figure 5(c), the estimates of the two dimensional are best if,
, those of the three dimensional model are best if,
, while those of the four dimensional model are best if,
and
.
In the case of z, the estimates of the two dimensional model are best if,
, those of the three dimensional model are best if,
, while those of the four dimensional model are best if,
Figure 5. Plots of the root mean square error of the maximum likelihood estimates of the parameters for the three models when
,
,
,
. (a) Estim. of
: Green = 4Dim, Yellow = 3Dim; Red = 2Dim; (b) Estim. of
: Green = 4Dim, Yellow = 3Dim; Red = 2Dim; (c) Estim. of
: Green = 4Dim, Yellow = 3Dim; Red = 2Dim; (d) Estim. of z: Green = 4Dim, Yellow = 3Dim; Red = 2Dim; (e) Estim. of
: Green = 4Dim, Yellow = 3Dim; (f) Estim. of
: Green = 4Dim, Yellow = 3Dim; (g) Estim. of
: Green = 4Dim, Yellow = 3Dim; Red = 2Dim.
and
.
In the case of the false positive misclassification probability,
, the three dimensional model has precise estimates if,
, while the four dimensional has precise estimates if,
and
.
On the other hand, the estimates of the false negative misclassification probability from the three dimensional model are precise if
, while from the four dimensional model the estimates are precise if,
and
.
The threshold parameter,
has best estimates from the two dimensional model if,
, while it has best from the three dimensional model if,
. It has best estimates from the four dimensional model if,
and
.
7. Simulation with Three Dimensional Epidemic Data
We studied the properties of the estimates of the three models on three dimensional epidemic data in the face of
using simulations with
infectious period distribution and pair of theoretical parameters
using the function and, subroutines developed for this work.
Figure 6 shows fitting the Two, Three and Four Dimensional Models to the Three Dimensional Simulated Final Size Epidemic Data, when
.
Figure 7 shows fitting the Two, Three and Four Dimensional Models to the Three Dimensional Simulated Final Size Epidemic Data, when
.
Figure 8 shows fitting the Two, Three and Four Dimensional Models to the Three Dimensional Simulated Final Size Epidemic Data, when
.
Tableof Mean, Standard Deviation and Root Mean Square Error of the Estimates for the Two, Three and Four Dimensional Models, When
and
are shown in Table4 and Table5.
8. Simulations and Inferences of the Two and Three Dimensional Models for
We explored the estimates of the three models with two different sets of theoretical parameters with corresponding
and
away from their boundaries, simulation runs of 500, misclassification probabilities
, with stepsize of 0.01, household structure in [1] [4] [14] and 50 times its population size, minimum epidemic size of 1000, to understand the properties of the estimates. We then simulate and estimate the models parameters, compute and plot the root mean square of the estimates. Beginning with
Figure 6. Plots of the estimates of
,
and histogram of
when
. (a) Estim.
: 4Dim. model; (b) Estim.
: 3Dim. model; (c) Estim.
: 2Dim. model; (d) Estim.
: 4Dim. model; (e) Hist. of
3Dim. model.
Figure 7. Plots of the estimates of
,
and histogram of
when
. (a) Estim.
: 4Dim. model; (b) Estim.
: 3Dim. model; (c) Estim.
: 2Dim. model; (d) Estim.
: 4Dim. model; (e) Hist. of
3Dim. model.
Figure 8. Plots of the estimates of
,
and histogram of
when
. (a) Estim.
: 4Dim. model; (b) Estim.
: 3Dim. model; (c) Estim.
: 2Dim. model; (d) Estim.
: 4Dim. model; (e) Hist. of
3Dim. model.
Table 4. Mean of the parameter estimates of the two, three and four dimensional models where, 2Dim = two dimensional model, 3Dim = three dimensional model and 4Dim = four dimensional model.
Table 5. Standard deviation of the parameter estimates of the two, three and four dimensional models where, 2Dim = two dimensional model, 3Dim = three dimensional model and 4Dim = four dimensional model.
theoretical parameters,
,
,
,
,
, we simulate household epidemic, estimate the parameters of the models and examined their precision from the plots of the root mean square error for misclassification probabilities region
(Table 6).
Figure 9 shows Plots of the RMSE of the Parameter Estimates when,
,
,
,
,
.
Figure 10 shows Plots of the RMSE of the Parameter Estimates when
,
,
,
,
.
Figure 9. Plots of the RMSE estimates of
for three and two dimensional optimization when
,
,
,
,
. (a) Estim. of
: Green = 4Dim, Yellow = 3Dim; Red = 2Dim. models; (b) Estim. of
: Green = 4Dim, Yellow = 3Dim; Red = 2Dim models; (c) Estim. of
: Green = 4Dim, Yellow = 3Dim; Red = 2Dim models; (d) Estim. of
: Green = 4Dim, Yellow = 2Dim; Red = 2Dim models; (e) Estim. of the Miscla. prob: Green =
, Yellow =
, Red =
; (f) Estim. of
: Green = 4Dim, Yellow = 3Dim, Red = 2Dim models.
Table 6. Root mean square error of the parameter estimates of the two, three and four dimensional models where, 2Dim = two dimensional model, 3Dim = three dimensional model and 4Dim = four dimensional model.
Figure 10. Plots of the RMSE estimates of
for three and two dimensional optimization when
,
.
,
,
. (a) Estim. of
: Green = 4Dim, Yellow = 3Dim; Red = 2Dim. models; (b) Estim. of
: Green = 4Dim, Yellow = 3Dim; Red = 2Dim models; (c) Estim. of
: Green = 4Dim, Yellow = 3Dim; Red = 2Dim models; (d) Estim. of
: Green = 4Dim, Yellow = 2Dim; Red = 2Dim models; (e) Estim. of the Miscla. prob: Green =
, Yellow =
, Red =
; (f) Estim. of
: Green = 4Dim, Yellow = 3Dim, Red = 2Dim models.
9. Results and Discussion
In Figures 1(a)-(c), we see that the estimates of the local and global infection rates from the two and three dimensional models are biased, while those of the four dimensional models have more variability around their true values.
In Figure 2(b) and Figure 2(c), the estimates of the two and three dimensional models are biased and imprecise when the misclassification probabilities are large and far apart from each other as theoretically expected.
In Figure 3(a) and Figure 3(b), the scatter points of the estimates from the three and four dimensional models are centered at their true value with less variability for the three dimensional model, while those of the two dimensional model in Figure 3(c) are biased. The estimates of the three dimensional model are more precise than those of the two and four dimensional models.
Figures 4(a)-(g) are plots of the root mean square error of the maximum likelihood estimates of the parameters of the three models with regions of precision when the theoretical parameters corresponds
. We see that the root mean square error of the estimates from the four dimensional model are consistently stable throughout the misclassification probabilities region.
From Figure 6(c) the two dimensional models is beginning to struggle fitting to the three four dimensional data when
, while those of the three and four dimensional models are unbiased and precisely estimated as in Figure 6(a) and Figure 6(b).
Figures 5(a)-(g) provides general summary of the properties of the estimates of the three models on four dimensional final size epidemic data. Their behaviours along the diagonal of the misclassification probabilities region
are similar to those examined along the vertical and horizontal axes of
but have only chosen to present those of the former to avoid repetition.
From Figures 7(a)-(c), we see that when
, the parameter estimates from the two dimensional model become biased and imprecise, while those of the three and four dimensional models are unbiased and precise.
From Figure 8(c), we see that estimates from the two dimensional model are biased and imprecise while those from the three and four dimensional models in Figure 8(a) and Figure 8(b) are precise and unbiased as expected.
With large misclassification probability
the three and four dimensional models are the appropriate fit to three dimensional epidemic data. The three dimensional model with less number of parameters is often chosen in line with the principle of parsimony.
In Figures 10(a)-(f), similar pattern of behaviour are observed except that the estimates of
in Figure 10(c) for the four dimensional are less precise than those of the two dimensional model. This may be attributable to the size of the proportion infected z as compared to its behaviour with
in Figure 9(c).
We see from Table 4 that the maximum likelihood estimates of the two dimensional models are precise only when the misclassification probability is close to 0 and hence outperforms the three and four dimensional models, otherwise those of the three and four dimensional models have better precision.
Also, from the regions where the models outperform each other on the three dimensional final size household epidemic data for the set of theoretical parameters and misclassification probabilities
, we see that the two dimensional model is sufficient on the three dimensional final size epidemic data if
is close to 0, while the three and four dimensional model are also sufficient model fits, if the misclassification probability is large.
The estimates of the two dimensional model are initialized according to [10], with minimum computational cost. For example from the [1] A (H3N2) Tercumseh Michigan epidemic, we found the computational time for the estimates to be 1.2 seconds, while those of the Seattle 1975-9176 B (H1N1) epidemic, [15] is 9 seconds, those of 1978-1979 A (H1N1) epidemic, [15] is 4.2 seconds.
In summary, the computational time required for convergence of the maximum likelihood estimates depends on the choice of the starting values and population size. With appropriate choice of the starting values away from the boundaries and large population size the computational time is large compared to small population size. However inadequate population size leads to lack of information and hence makes convergence of the estimates impossible.
10. Conclusions and Suggestions
We observed that, once there is no misclassification error in the final size epidemic data, the best model fit to the two dimensional final size data is the two dimensional model. The model with smaller number of parameters is therefore preferred. Making the two dimensional model the appropriate model fit to two dimensional final size epidemic data if
.
However, if
is far from 0, then the two dimensional model struggled fitting to three dimensional final size data.
With increasing
, it becomes unreliable to use the two dimensional model. The three and four dimensional models provide good fit to the theoretical chi-square distribution in the face of increasing values of the misclassification probabilities.
Also, with large and different misclassification probabilities far apart from each other, the four dimensional model has precise estimates and therefore outperforms the two and three dimensional models on the four dimensional final size epidemic data as demonstrated.
With increasing misclassification probabilities, the two and three dimensional models struggled fitting to the four dimensional final size data, with disproportionate parameter estimates.
In summary, with large misclassification probabilities, the estimates of the four dimensional model are more precise than those from the two and three dimensional models in agreement with the discussion in Subsection 6.1.
Possible extension includes estimating the shape parameter of the Gamma infectious period distribution, if the infectious period distribution is unknown. For example if
is the assumed infectious period distribution, where k is known, then the shape parameter a can then estimate from the final size epidemic data.
Acknowledgements
The authors will like to acknowledge Dr. Owen D. Lyne and Professor Martin Ridout for their valuable contributions.