Mixture Models for Estimating the Number of Drug Users in Thailand 2005-2007

It is difficult to measure the sizes of illegal drug user populations directly by using the survey method because of many “hidden drug addicts” and the difficulty of receiving a true response. Systematic and routine information on treatment episodes of drug users is adopted to estimate the population size in this study. Mixture models of zero-truncated Poisson distributions using the nonparametric maximum likelihood estimators (NPMLE) by means of capture-recapture repeated count data were used to project the number of drug users. The method was applied to surveillance data of drug users identified by treatment episodes in over 1140 health treatment centers in Thailand from the Bureau of Health Service System Development, Ministry of Public Health. We presented how this mixture model could be utilized to construct the unobserved frequency of drug users with no treatment episode and further estimated the total population size of drug users in the country from 2005 to 2007. The result of simulation was confirmed that mixture model is suitable when population is large. By means of mixture models, the estimations for the number of drug users were fitted with excellent goodness-of-fit values and we were also compared to the conventional Chao estimates. The NPMLE for the total number of drug users in Thailand 2005, 2006, and 2007 were 184,045 (95% CI: 181,297 186,793), 230,665 (95% CI: 226,611 234,719), 299,670 (95% CI: 294,217 305,123), respectively, also 125,265 (95% CI: 123,092 127,142), 166,287 (95% CI: 163,222 169,352), 228,898 (95% CI: 224,766 233,030) for the number of methamphetamine (Yaba) users, and 11,559 (95% CI: 10,234 12,884), 11,333 (95% CI: 9276 13,390), 8953 (95% CI: 7878 10,028) for the number of heroin users, respectively. The numbers of marijuana, kratom-plant, opium, and inhalant users were under-estimated because their symptoms were mild and not severe enough to remedy in health treatment centers which led to the smaller size of the total number of drug users. The well-estimated sizes of heroin and methamphetamine addicts are high reliable because they are based on clearly evident count with a severe addiction problem to health treatment centers. The estimation by means of mixture models can be recommended to monitor drug demand trend and drug health service routinely; it is easy to calculate via the available programs MIXTP based on request.


Introduction
Drug abuse in Thailand has remained a serious health problem; its epidemic is still severe and widespread.Information on the number of illegal drug users is a benefit of the policy and the plan on narcotics control, to implement a reduction strategy, and to allocate resources to the health service.Nevertheless, it is difficult to measure the sizes of drug user populations directly because of many "hidden drug addicts".Surveys, especially on the large national scale, are unlikely to be the most efficient methods due to a huge cost and manpower, the difficulty of receiving a true response, the problems of dealing with a hidden population and ethical issues.
Capture-recapture methods are a classical and useful tool to solve a hidden population problem and to estimate a total population size because it can estimate and adjust for the extent of incomplete ascertainment using information from overlapping lists of cases from two or more distinct sources [1].Moreover, there are not only the conventional multiple sources methods but also the ap-proaches available based upon one source with repeated counts for each individual.In this study, a single source is considered from a surveillance system counting the number of times that a drug user went to a treatment institution.
There were few studies in Thailand which used the capture-recapture method for estimating the number of drug users.Mastro et al. [2] estimated the number of HIV-infected injection drug users in Bangkok under twosample sources of 18 methadone treatment centers and 72 urine testing police stations.Suppawattanabodee [3] used two sources of health treatment records and police arrestment records for estimating the number of drug users in Bangkok 2001.However, for one source with repeated count data, applications have been few relevant studies in Thailand.Böhning et al. [4] estimated the number of drug users in Bangkok 2001 by means of zero-truncated count mixture distributions.Viwatwongkasem, Kuhnert, and Satitvipawee [5] projected the number of heroin users in Bangkok 2002 using the mixture of zero-truncated Poisson models.Note that the zero-truncated Poisson mixture distributions are different from the mixture of zero-truncated Poisson models, at least the mixing distribution in both estimations.
A mixture model is a flexible approach to cope with long-tailed, skewed, and/or contaminated count distributions in a natural way.The mixing idea corresponds to a mixture representing the presence of sub-populations within an overall population.Formally, a mixture model can cope with not only two or more distributions (heterogeneity) but also includes the case of one distribution (homogeneous population) [6][7][8].Böhning and Schön [9] proposed the nonparametric maximum likelihood estimators (NPMLE) of population size based on the counting distribution.Böhning and Kuhnert [10] showed the equivalence of the zero-truncated count mixture distributions and the mixture of zero-truncated count distributions.They stated that for any mixing distribution of the truncated mixture, a usually different mixing distribution of the mixture of truncated counts could be found so that the likelihood surfaces for both models agreed; consequently, for estimating population size, two estimators associated with two models had equal values.Punyacharoensin and Viwatwongkasem [11] predicted HIV incidence in Thailand utilizing the backcalculation of mixture of the past AIDS incidence and AIDS incubation period distributions.Viwatwongkasem, Kuhnert, and Satitvipawee [5] compared the performance of population size estimators under the truncated count model with and without allowance for contaminations among Mc-Kendrick's, Mantel-Haenszel's, Zelterman's, Chao's, the maximum likelihood, and their proposed methods of the mixture of zero-truncated count models.The proposed estimator provided the best choice according to its small-est bias and smallest mean square error for a situation of sufficiently large population sizes and it also performed well even for a homogeneous situation.
Although, the mixture model has been used previously in many fields of application, it is still not very common; only few relevant studies were found in Thailand and, in addition, the numerical computation of mixture model estimates has not been directly provided in the existing standard statistical packages.With the motivation of having at present few relevant studies and unavailable statistical packages with the option or focus on estimating the size of a hidden population, we take this opportunity to address the gap by adopting the nonparametric maximum likelihood estimators (NPMLE) for estimating the mixture parameters of zero-truncated Poisson distributions leading to the population size estimate of interest.

The Horvitz-Thompson Approach
Suppose that a registration system identifies n observed cases, but not all cases of a population of size N, and the system can identify a case with probability 0 where p 0 is probability of the unidentified cases.This leads to the expected equation of the population size,  0 where 0 is the expected number of cases identified by the system which simply can be estimated by n, number of identified (observed) cases.It leads to the estimating equation which in other words can be stated that the population size N is the sum of both the unobserved and the observed cases (n).The Equation (1) can easily be solved for N to provide the Horvitz-Thompson estimator and 0 n N n   .The Horvitz-Thompson approach seems easy, but the unknown 0 probability of unobserved cases must be estimated and this is quite differently accomplished in the various methods of estimation.

Data Sources
The surveillance data on the drug addicts undergoing treatment and rehabilitation in the country over 1140 health treatment centers (

Statistical Methods
Suppose that Y is the number of treatment episodes in a case; obviously, Y has the values ranged from 1 to m (without zero value) where m is the largest number of treatment episodes in a case.Now data Y are tallied into a frequency table like Table 1.We let i be the number of treatment episodes in a case, n i be the number (frequency) of cases identified with i episodes where and a sample size m is the total number of observed cases.In Table 1, the observed frequencies of treatment episodes for heroin users in Thailand 2005 are n 1 = 3057, n 2 = 791, n 3 = 351, n 4 = 107, n 5 = 80, To estimate the population size N and the size of zero treatment episode n 0 , we let 1 be probabilities of cases identified 1,  times.Under homogeneity, the density function p i is assumed to be a zero-truncated Poisson since zero identification does not occur in the sample; that is, where .However, frequently the homogeneous model is not appropriate in real situations to fit an adequate model.Mixture models allowing for heterogeneity are more flexible and we consider a discrete mixture of truncated Poisson densities of the form where the mixing distribution . Then, the log-likelihood for the mixture of zero-truncated count densities is In this situation, with the help of gradient functions and the consideration at the boundaries of parameter space, the log-likelihood is concave on the parameter space of all discrete probability densities on which it can be maximized, leading to the nonparametric maximum likelihood estimator (NPMLE) of Q.To proceed in the EM context, we need the complete data log likelihood, which is given in this case as where the unobserved covariate i j is 1 if i belongs to component j and 0 otherwise.In the E-step, the unobserved indicator variates, i j , are replaced by their expected posterior probabilities, , leading to In the M-step, the new values 1 ˆ, , k    , 1 are found, which maximize the expected version of complete log likelihood (5).The results of the weighting estimates are obtained by Similarly, the solution after solving the equations of derivatives with respect to ˆj  is obtained by Note that (8) does not provide a close form solution; the iterative procedure is needed until the desired accuracy is achieved.Having identified the model and the associated parameter estimates, we can estimate the probability of zero treatment episodes p 0 as so that the Horvitz-Thompson approach leads to a population size estimate

Model Evaluation
It ailable.These resampled data were used to compute variances and confidence intervals as asymptotic normal intervals.
The BIC adjusts the log-likelihood with the number of parameters multiplied by the log-sample size; BIC works well as model selection criterion in mixture model since it does not suffer under likelihood irregularities that are typical for mixture models [8,12]. .The associated 95% confidence interval for heroin users in Thailand 2005 was established and lied between (10,234,12,884).As a general trend, the estimated size was about 3 times higher than the observed data n .Also, it was usually important to provide an estimate of completeness (of the surveillance stream) given as ˆ100%  n N , which was for heroin users 38.6% (95% CI: 34.7% -43.6%).

An Application
For the surveillance data of heroin users 2005 in Table 1, the observed frequencies were n 1 = 3057, n 2 = 791, n 3 = 351, n 4 = 107, n 5 = 80, n 6 = 59, n 7+ = 22.Table 2 showed that the mixture of two-components of zero-truncated Poisson model was the best fitting with the smallest BIC value.The results produced 0 for the unobserved number of heroin users without any treatment episodes and for the total number of heroin users whereas a well-established alternative Chao's [13] estimator

Results
Thailand Narcotics Annual Report 2006 of the Office of

Confidence Intervals
Bootstrap resampling technique was applied to compute the variance of mixtures of truncated count data since the direct computation via the information matrix was usually difficult.For each nonparametric bootstrap, frequencies were sampled from a multinomial  ).Adolescents aged between 15 and 24 years old were the biggest group (49%).Among them, 83% were new drug patients while 17% were the relapsing patients.33% of the total drug patients were unemployed while 19% were laborers and 13% were students.Drug epidemics were mostly found in Bangkok (50%) and 30% were located in the central region of Thailand while the rest were in the North, the South, and the Northeast, respectively.Methamphetamine (Yaba) addicts were still the biggest group of drug patients in all treatment centers (79%) because the ingredients were not hardly available and the price was not too high in comparing purity and severity.The second biggest group was cannabis (marijuana) addicts (11%); most marijuana was spread in many urban and rural areas; however, the price of marijuana was still cheaper compared with other illicit drugs.
Heroin epidemic was still important though it has a high price but the injuries were quite severe.Kratom plants were abused in many areas of the country side; farmers and peasants used kratom plants for working in the rice fields to work longer.The number of club drugs like ecstasy, ketamine, cocaine, and crystallized methamphetamine had an increasing trend in big cities and the rich persons.
The MIXTP program developed by authors was available to achieve the estimates of population size under the mixture of truncated count models via FORTRAN POWERSTATION and now it is available on the request.
Table 3 illustrated the sizes of drug users, estimated by the mixture of truncated Poisson models and classified by types of drugs in Thailand 2005-2007.Methamphetamine users were the biggest group and tended to increase from 2005-2007 while heroin users trended to decrease slightly because of its high price.The numbers of marijuana, kratom-plant, and inhalant users were under-estimated because of their mild severities.Trend of marijuana sizes increased from 2005-2007 while trends of kratom-plant, inhalant, and others were difficult to predict.

A Simulation Study
Although data fitting of mixture of truncated count model was well in the examples, we wish to ensure this in general case via the simulation experiment.Let count variables Y i be generated from a two-component mixture of Poisson distributions with equal weights attached to the component means 1 1   and 2  where 2 1, 2, , 5 . Population sizes were 200 (for small), 1000 (for medium), 5000 (for large), 10,000 (very large).Each simulated datum i was tallied to get frequencies with respect to the counts where Then was dropped and zerotruncated frequencies 1 were used to compute population size estimators of mixture model and Chao.This was done under 5000 replications; mean, standard deviation (SD), and root of mean square error (RMSE) of all estimates were computed and determined from these replications.
The results are found in Table 4 and we can conclude in the following:   , mixture model estimator with 1 component in this case performs well with smaller RMSE, regardless of population size; Chao's estimator is worse with larger RMSE under this homogeneity.
  , Chao's estimator performs better when population size is small to moderate (N = 200, 1000); mixture model estimator is better when population is large to very large (N ≥ 5000) and degrees of heterogeneity are strong    Furthermore, we found that if the weak degrees of heterogeneity occur     in combination with small to moderate population size (N = 200, 1000), mixture model estimator has a problem of the large excess values of standard deviation.

Discussion
The NPMLE method provides well-estimated sizes of various drug-user target populations, obtained from the surveillance data on the drug addicts with emphasis on methamphetamine (228,898 cases in 2007) and heroin (8953 cases in 2007) users.It can be expected that these surveillance data provide a high reliability because they are based on clearly evident contact counts of drug addicts with a severe addiction problem to health treatment centers.A comparison by means of a national household survey of ONCB [15] yielded the under-estimated sizes of 66,320 methamphetamine users and 3907 heroin users per year.
In contrast, the estimated sizes of this study using NPMLE of users with kratom-plant (less than 18,720 cases in 2007), marijuana (27,323 cases in 2007), and inhalant (13,362 cases in 2007) are frequently underestimated because of their low severity of symptoms to cure, leading to the smaller size of total number of drug users (299,670 cases in 2007).This fact is confirmed by a national household survey of the ONCB [15] that reported an estimate of 378,214 kratom-plant users, 57,527  The huge difference in values between two methods mentioned, stem mainly from the severity of symptoms of drug use.With this point of view, the estimated sizes of methamphetamine and heroin users from the surveillance data of this NPMLE study seem to be more useful than those from the national survey, in particular, if viewed from the perspective of a benefit of allocating resources on health service, monitoring drug epidemics, and planning policy on narcotics control.In general, the estimated sizes from this study are at least three times higher than the observed data.Hence, the completeness of identification is about 30% -40%.
Due to the result of simulation that mixture model estimator behaves well when population size is large, there is no reason to reject the use of mixture model to estimate the hidden population size and the total population size for each type of drugs since the observed total number of drug users is large enough.However, there is something called a boundary problem: extremely large observations in some samples.This could explain the overestimation effect seen in the simulation for N = 200.Kuhnert et al. [16] used the median for a series of estimates of population size in their simulation to avoid highly influential size estimates.Basically, the mixture model is a flexible approach to cope with homogeneity and heterogeneity, including long-tailed, skewed, and/or contaminated distributions in a natural way.
Recently, there has been an increased interest in zerotruncated count models.These models can be applied in many areas such as illegal immigrants, illegal gun owners, HIV epidemic, scrapie disease on sheep, or criminal persons.This article has shown how the mixture models allowing for heterogeneity can be applied to estimate the unobserved population size of drug users with zero treatment episodes and then estimate the total population size of illegal drug addicts.Indeed, there are not only the estimators available based upon mixture models but also there are the Mantel-Haenszel's [17], Zelterman's [18], Chao's [13], and maximum likelihood methods available in estimating population sizes.Viwatwongkasem, Kuhnert, and Satitvipawee [5] found that the mixture of zerotruncated count model and Chao's model provided the best choice among the above estimators, according to its smallest bias and smallest mean square error, especially for a situation of sufficiently large population sizes; furthermore, the mixture itself also performed well even for a homogeneous situation.Although the mixture model provides a nice estimate, its variance estimate is usually difficult to find.Bootstrap resampling technique was applied to compute the variance of mixture of truncated count data, instead of the direct computation via the information matrix.Further study should focus on the es-timation of the variance of mixture models.But this is a challenging task as stated by Chao [19], Cormack [20], and Böhning and Schön [9].Other parametric models such as the binomial model, the hypergeometric model, and the inverse sampling of the negative binomial model should be considered in any future research.
Fortunately, the appropriate NPMLE models for these surveillance data of drug users in 2005-2007 do not face a spurious value of overestimation.However, in few occurrences, the NPMLE of mixture may provide an overestimation.The occurrence of overestimates is due to the boundary problem of an estimate which is evaluated at the boundary of parameter space.The improvement in reducing overestimation bias should be investigated in any further study. p to the appropriate NPMLE model with k = 2 components in the mixture.Likewise, Figure 1 compared frequency distributions of treatment episodes among the observed frequencies, single Poisson with zero-truncation, and the mixture of zero-truncated Poisson with two components.The Poisson mixture provided an excellent goodness-of-fit to the observed frequencies whereas the simple Poisson was not adequate; it was clearly evident with the smallest BIC value.

Figure 1 .Table 1 .
Figure 1.Frequency distribution of treatment episodes among the observed counts, single Poisson, and mixture of two truncated Poisson.Table 1.Observed frequencies of treatment episodes of heroin users in Thailand 2005: n 1 = 3057, n 2 = 791, n 3 = 351, n 4 = 107, n 5 = 80, n 6 = 59, n 7+ = 22.Number of treatment episodes in a case (i) 0 1 2 … m = 7+ Total Number of cases (Frequency n i ) -3057 791 … 22 n = 4467 distribution with size parameter n and categorical probability parameters i n then N  as constructed using the BIC-selected mixture model.Suppose that there were B samples of size n each, population size estimates is crucial to select an appropriate model among various potential models differing in the number of components k.The smallest value of the Bayesian Information Criterion (BIC) is considered to choose the best model: the smaller   n ; w * ˆ B N N 