A Likelihood-Based Multiple Change Point Algorithm for Count Data with Allowance for Over-Dispersion
1. Purpose and Objectives
1.1. Introduction
Count data arising from stochastic processes often exhibit over-dispersion, where the sample variance exceeds the sample mean. Several count data models have been proposed in the literature, but the problem of over-dispersion remains unresolved, particularly in the context of change point analysis. Existing discrete change-point procedures work well for equi-dispersed data but produce biased estimates when data are over-dispersed. This study develops an algorithm for detecting and estimating multiple change points in the distribution of count data that exhibit either over-dispersion or equi-dispersion.
1.2. Objectives of the Study
1.2.1. General Objective
To develop a likelihood-based multiple change point algorithm for count data with allowance for over-dispersion.
1.2.2. Specific Objectives
To develop a likelihood-based multiple change point algorithm for count data under the Negative Binomial distribution.
To determine the critical values for the Likelihood Ratio Test for existence of change.
To test the Negative Binomial multiple changepoint algorithm using simulated data.
2. Methodology
2.1. The Negative Binomial Distribution
In probability and statistics, the Negative Binomial distribution is often used to model the number of successes in an infinite sequence of independent and identically distributed Bernoulli trials. The distribution has two parameters, r and p, where r is a constant representing a fixed or predefined threshold for the number of successes required and p is the success probability, which is constant from one Bernoulli trial to the next. The probability distribution of a Negative Binomial random variable has two possible formulations, contingent on the definition of the measure of interest. The version used in this study counts the number of failures, X, before the r-th success. The probability mass function of X is defined by:

P(X = x) = \binom{x + r - 1}{x} p^{r} (1 - p)^{x}   (1)

where x = 0, 1, 2, \ldots, 0 < p < 1 and r > 0.
The standard formulation of the Negative Binomial distribution is such that the mean and variance of the random variable X are both derived quantities whose values are obtained from the primary parameters r and p as:

\mathrm{E}(X) = \mu = \frac{r(1-p)}{p}, \qquad \mathrm{Var}(X) = \sigma^{2} = \frac{r(1-p)}{p^{2}}   (2)

The factor 1/r is the measure of over-dispersion, since the variance of X, σ², in Equation (2) exceeds the mean, μ, by a function of 1/r.
2.1.1. Parameterization of the Negative Binomial Distribution for the Proposed Change Point Algorithm
Under the formulation of the Negative Binomial distribution with the probability mass function defined as in Equation (1), consider the mean, μ, and variance, σ², as the primary quantities and the parameters r and p as derived quantities, such that the values of r and p can be derived from the mean and variance of the data distribution as:

p = \frac{\mu}{\sigma^{2}}, \qquad r = \frac{\mu^{2}}{\sigma^{2} - \mu}   (3)

The factor 1/r in the mean-variance relationship implied by Equation (2) is a sort of "clumping" parameter: as the difference between the variance and the mean decreases, that is, as σ² − μ → 0, then 1/r → 0 and r → ∞. In other words, as r → ∞ the variance, σ², approaches the mean, μ, so that NB(r, p) approaches the Poisson(μ) distribution with both mean and variance equal to the parameter μ. Therefore, under the parameterization given in Equation (3), the Poisson is the limiting distribution for the Negative Binomial.
\sigma^{2} = \mathrm{Var}(X) = \mu + \frac{\mu^{2}}{r}   (4)

A common scenario is one where the factor 1/r is large. This happens in cases where the variance exceeds the mean, so that the data are over-dispersed. It follows that if σ² > μ, then 1/r > 0 and r is finite. The larger the factor 1/r, the greater the amount of over-dispersion [1] [2]. Equation (4) justifies the use of the Negative Binomial distribution to model count data that are either over-dispersed or equi-dispersed; hence its advantage over the standard Poisson model for count data.
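To make the parameterization in Equation (3) concrete, the following R sketch (the helper name nb_mme is illustrative, not from the paper) converts a mean and variance into moment-based values of r and p and shows how the Poisson limit is approached as the variance shrinks toward the mean.

```r
# Method-of-moments parameterization of NB(r, p) from a mean and variance,
# Equation (3). nb_mme is a hypothetical helper name used for illustration.
nb_mme <- function(mu, sigma2) {
  stopifnot(sigma2 > mu)              # the parameterization requires over-dispersion
  list(p = mu / sigma2,               # p = mu / sigma^2
       r = mu^2 / (sigma2 - mu))      # r = mu^2 / (sigma^2 - mu)
}

nb_mme(mu = 10, sigma2 = 30)     # r = 5, p = 1/3
nb_mme(mu = 10, sigma2 = 10.5)   # near equi-dispersion: r = 200, approaching the Poisson limit
```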
2.1.2. The Likelihood Function of NB(r,p) Distribution
Suppose we have a sample x₁, x₂, …, xₙ from an NB(r, p) distribution with probability mass function as described in Equation (1). The likelihood function is obtained as:

L(r, p) = \prod_{i=1}^{n} \binom{x_i + r - 1}{x_i} p^{r} (1 - p)^{x_i}   (5)

The log-likelihood function is given by:

\ell(r, p) = \sum_{i=1}^{n} \log \binom{x_i + r - 1}{x_i} + n r \log p + \sum_{i=1}^{n} x_i \log (1 - p)   (6)
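Equation (6) translates directly into R; the short sketch below assumes a helper name nb_loglik and uses lchoose for the binomial coefficient term to keep the computation numerically stable.

```r
# Log-likelihood of an NB(r, p) sample, Equation (6).
# nb_loglik is an assumed helper name, not from the paper.
nb_loglik <- function(x, r, p) {
  n <- length(x)
  sum(lchoose(x + r - 1, x)) + n * r * log(p) + sum(x) * log(1 - p)
}

set.seed(1)
x <- rnbinom(100, size = 5, prob = 0.2)   # simulated NB(5, 0.2) counts
nb_loglik(x, r = 5, p = 0.2)
```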
2.2. Dispersion Test
A count data distribution may exhibit any of three kinds of dispersion: under-dispersion, equi-dispersion or over-dispersion. The type of dispersion is dependent on the mean-variance relationship of the sample data. In this study, the variance-to-mean ratio (VMR), otherwise referred to as the dispersion index (D), is determined for the entire sample and for each segment, following which a dispersion test is conducted prior to applying the change point algorithm.
Starting with the assumption that the sample count data are equi-dispersed, so that the index of dispersion D = 1 and the sample is in agreement with a theoretical Poisson series, a dispersion test is conducted with the following hypotheses:

H₀: D = 1 (data are equi-dispersed)
versus
H₁: D > 1 (data are over-dispersed)
versus
H₁: D < 1 (data are under-dispersed)

The index of dispersion is computed from sample statistics as:

D = \frac{S^{2}}{\bar{X}}   (7)

where S² is the sample variance and X̄ is the sample mean.
The VMR of the sample data informs the choice of the distribution model to fit the data as summarized in Table 1.
Table 1. Mean-variance relationships.

Dispersion index (D) | Dispersion type | Proposed distribution
D = 0 | Not dispersed | Constant variable
D < 1 | Under-dispersion | Binomial distribution
D = 1 | Equi-dispersion | Poisson distribution
D > 1 | Over-dispersion | Negative Binomial distribution
Tests for significant departure of the index of dispersion from 1 are performed under the Chi-square or Normal distribution, contingent on the size of the sample data, as described in [3]. For small samples, the dispersion test statistic is defined by Equation (8):

\chi^{2} = \frac{(n-1) S^{2}}{\bar{X}}   (8)

where χ² is approximated by a Chi-square distribution with (n − 1) degrees of freedom. The decision criteria are such that agreement with the Poisson distribution is accepted if χ² falls between the lower and upper chi-square critical values for the chosen level of the test. If χ² falls below the lower critical value, the alternative hypothesis of under-dispersion is accepted in place of H₀. If χ² exceeds the upper critical value, the alternative hypothesis of over-dispersion is accepted in place of H₀.
For large samples, the test statistic is given by:

Z = \sqrt{2\chi^{2}} - \sqrt{2(n-1) - 1}   (9)

where Z is approximately standard normal. The null hypothesis is rejected based on a comparison of the absolute value of Z against the two-sided critical values from the Standard Normal tables for a given level of the test. For a size-α test, H₀ is rejected if Z < −z(α/2) or Z > z(α/2), in which case the distribution is said to be under-dispersed or over-dispersed, respectively. The change detection and estimation algorithm described in Section 2.3 is applied to over-dispersed and equi-dispersed samples, while under-dispersed samples are discarded.
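A minimal R sketch of the dispersion test in this section is given below; the function name dispersion_test and the cutoff of n = 30 between the small- and large-sample versions are illustrative assumptions, since the paper does not fix that cutoff here.

```r
# Index-of-dispersion test (sketch; the name and the n >= 30 cutoff are assumptions)
dispersion_test <- function(x, alpha = 0.05, large_n = 30) {
  n <- length(x)
  D <- var(x) / mean(x)                          # Equation (7)
  chi2 <- (n - 1) * D                            # Equation (8)
  if (n < large_n) {
    lo <- qchisq(alpha / 2, df = n - 1)
    hi <- qchisq(1 - alpha / 2, df = n - 1)
    if (chi2 < lo) {
      verdict <- "under-dispersed"
    } else if (chi2 > hi) {
      verdict <- "over-dispersed"
    } else {
      verdict <- "equi-dispersed"
    }
  } else {
    z <- sqrt(2 * chi2) - sqrt(2 * (n - 1) - 1)  # normal approximation, Equation (9)
    if (z < -qnorm(1 - alpha / 2)) {
      verdict <- "under-dispersed"
    } else if (z > qnorm(1 - alpha / 2)) {
      verdict <- "over-dispersed"
    } else {
      verdict <- "equi-dispersed"
    }
  }
  list(D = D, verdict = verdict)
}

set.seed(2)
dispersion_test(rnbinom(200, size = 5, prob = 0.2))   # expected verdict: "over-dispersed"
```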
2.3. The Negative Binomial Multiple Change Point Algorithm
The Negative Binomial Multiple Change Point Algorithm (NBMCPA) is developed as an iterative process involving recursive binary segmentation, hypothesis testing for the existence of statistically significant change, and estimation of the change points, if any, using the maximum likelihood approach. Under the parameterization described in Equation (3), the parameter r is defined such that its value depends only on the mean, μ, and variance, σ², of the data distribution. On the other hand, the parameter p depends on the parameter r and the mean μ of the distribution, since p = r/(r + μ). As such, a change in the parameter r automatically implies a change in the parameter p. Therefore, for simplicity, the NBMCPA will explicitly consider only the hypotheses for a step change in the over-dispersion parameter r.
2.3.1. Assumptions for the Algorithm
The NBMCPA was built under the following data assumptions:
A1: The sample data arise from a count process, and are therefore discrete.
A2: There are no temporal dependencies in the sample data.
A3: A step change occurs simultaneously in both parameters r and p of the Negative Binomial distribution.
2.3.2. The Step-Wise Recursive Binary Segmentation Procedure
The step-wise recursive binary segmentation (SWRBS) procedure, similar to the method discussed in [4], is an iterative process involving five elementary steps, as illustrated in Figure 1.
Figure 1. The SWRBS procedure.
Starting with a chronological sequence of length n from the Negative Binomial (r, p) distribution, partition the sequence into two parts at an arbitrary point k, corresponding to the observation made at the random time point τ_k. Check for the existence of a statistically significant distinction between the over-dispersion parameters r₁ and r₂ of the two sub-sequences by conducting a likelihood ratio test. If the two partitions are found to differ significantly with regard to the distributional parameters, then a change exists at or near the time point τ_k. In this case, proceed to estimate the location of the said change using the maximum likelihood method.

Once the first change has been located, repeat the processes of partitioning, hypothesis testing and change point estimation in each of the two new sub-sequences formed. On the other hand, if no change is evident at the first arbitrary point k, seek an alternative arbitrary point and repeat the likelihood ratio test; hence obtain the MLE of the change point, if a change exists. The SWRBS procedure is repeated until no more significant changes are identified.
2.3.3. Partitioning the Data Sequence
Consider a sequence of random calendar time points τ₀ < τ₁ < τ₂ < ⋯ < τ_m. Let x₁, x₂, …, xₙ be a sequence of observed values or realizations of a stochastic process. Let n₁ be the length of the partition between consecutive time points 0 and 1, n₂ be the length of the partition between consecutive time points 1 and 2, and so on. Assume that each partition j, for j = 1, 2, …, m with n₁ + n₂ + ⋯ + n_m = n, of the original sequence x₁, x₂, …, xₙ consists of a random sub-sequence. Assume further that the observations in each partition follow a Negative Binomial distribution with parameter r_j, for j = 1, 2, …, m, as shown in Figure 2.

Figure 2. Timeline diagram showing sequence partitions.
2.3.4. The Likelihood Ratio Test for Existence of the First Change
An investigation as to whether a change exists at some random point k in a sequence of n observations is done by conducting a two-tailed likelihood ratio test (LRT). The null hypothesis states that there is no change in the over-dispersion parameter r across the entire sample. The alternative hypothesis seeks a difference in the over-dispersion parameter r between the two data segments, such that the first partition of the sequence in the interval (0, k] has the parameter r₁ while the second partition in the interval (k, n] has parameter r₂, where r₁ ≠ r₂. The mathematical hypotheses are described as in Equation (10):

H₀: r₁ = r₂ = r (no change in the over-dispersion parameter)
versus
H₁: r₁ ≠ r₂ (a change exists at the point k)   (10)
The log-likelihood function under H₀ is given by Equation (11) as:

\ell_0 = \sum_{i=1}^{n} \log \binom{x_i + \hat{r} - 1}{x_i} + n \hat{r} \log \hat{p} + \sum_{i=1}^{n} x_i \log (1 - \hat{p})   (11)

where the method of moments estimates (MME) of the model parameters are:

\hat{p} = \frac{\bar{X}}{S^{2}} \quad \text{and} \quad \hat{r} = \frac{\bar{X}^{2}}{S^{2} - \bar{X}}

The statistics S² and X̄ are the unbiased estimates of the population variance and mean respectively, obtained as:

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{and} \quad S^{2} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^{2}
The log-likelihood function under H₁ for an arbitrary change point k is defined in Equation (12):

\ell_1(k) = \sum_{i=1}^{k} \left[ \log \binom{x_i + \hat{r}_1 - 1}{x_i} + \hat{r}_1 \log \hat{p}_1 + x_i \log (1 - \hat{p}_1) \right] + \sum_{i=k+1}^{n} \left[ \log \binom{x_i + \hat{r}_2 - 1}{x_i} + \hat{r}_2 \log \hat{p}_2 + x_i \log (1 - \hat{p}_2) \right]   (12)

where the method of moments parameter estimates for the lower partition of the data, p̂₁ and r̂₁, are obtained as:

\hat{p}_1 = \frac{\bar{X}_1}{S_1^{2}} \quad \text{and} \quad \hat{r}_1 = \frac{\bar{X}_1^{2}}{S_1^{2} - \bar{X}_1}

whereas, for the upper segment of the sample data, the MME of the model parameters p̂₂ and r̂₂ are given by:

\hat{p}_2 = \frac{\bar{X}_2}{S_2^{2}} \quad \text{and} \quad \hat{r}_2 = \frac{\bar{X}_2^{2}}{S_2^{2} - \bar{X}_2}

The statistics S₁² and X̄₁ are the unbiased estimates of the sub-population variance and mean for the first k observations respectively, obtained as:

\bar{X}_1 = \frac{1}{k} \sum_{i=1}^{k} x_i \quad \text{and} \quad S_1^{2} = \frac{1}{k-1} \sum_{i=1}^{k} (x_i - \bar{X}_1)^{2}

Similarly, the statistics S₂² and X̄₂ are the unbiased estimates of the sub-population variance and mean for the remaining n − k observations respectively, obtained as:

\bar{X}_2 = \frac{1}{n-k} \sum_{i=k+1}^{n} x_i \quad \text{and} \quad S_2^{2} = \frac{1}{n-k-1} \sum_{i=k+1}^{n} (x_i - \bar{X}_2)^{2}
The likelihood ratio statistic at an arbitrary point k takes the form:

\Lambda_k = 2 \left[ \ell_1(k) - \ell_0 \right]   (13)

The change point k̂ corresponding to the time point τ_k is estimated such that the LRT statistic in Equation (13), or equivalently its square root, is maximized. Statistical significance of the estimated change point k̂ is determined by comparing the maximum value of √Λ_k against the critical value of the LRT developed in Section 2.4. Of interest is the optimal value of the likelihood ratio:

Z_n = \max_{2 \le k \le n-2} \sqrt{\Lambda_k}   (14)

The decision is made such that H₀ in Equation (10) is rejected if the LRT statistic is large, so that Z_n > C. The constant C is a critical value determined by the level of the test α, the sample size n, and the null distribution of the likelihood ratio test statistic in Equation (13), as in Gombay and Horvath [5]. Otherwise, small values of Z_n such that Z_n ≤ C indicate that a change may exist at the time point τ_k but is not statistically significant.
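Equations (11) to (14) can be computed with a short R sketch such as the one below, which reuses the nb_mme and nb_loglik helpers sketched earlier; lrt_scan is an assumed name. Candidate splits whose segment variance does not exceed the segment mean are skipped, since the moment estimate of r in Equation (3) is then undefined.

```r
# Square-root LRT statistic, sqrt(Lambda_k), over candidate change points k.
# lrt_scan is an assumed name; nb_mme and nb_loglik are the helpers sketched above.
lrt_scan <- function(x) {
  n <- length(x)
  ll0 <- with(nb_mme(mean(x), var(x)), nb_loglik(x, r, p))        # Equation (11)
  sapply(2:(n - 2), function(k) {
    x1 <- x[1:k]; x2 <- x[(k + 1):n]
    if (var(x1) <= mean(x1) || var(x2) <= mean(x2)) return(NA)    # MME undefined
    ll1 <- with(nb_mme(mean(x1), var(x1)), nb_loglik(x1, r, p)) +
           with(nb_mme(mean(x2), var(x2)), nb_loglik(x2, r, p))   # Equation (12)
    sqrt(pmax(2 * (ll1 - ll0), 0))                                # sqrt of Equation (13)
  })
}

set.seed(3)
y <- c(rnbinom(100, size = 8, prob = 0.4), rnbinom(100, size = 5, prob = 0.2))
stat <- lrt_scan(y)            # requires an over-dispersed sample, var(y) > mean(y)
k_hat <- which.max(stat) + 1   # Equation (14): stat[1] corresponds to k = 2
```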
2.3.5. Testing for the Second and Subsequent Change Points
Once the first change point is obtained, the likelihood ratio test is repeated in each of the two data segments formed. As a simple illustration, assume that there are n observations from the Negative Binomial distribution and that the first change point is estimated at time k₁. This results in two segments of the data set, x₁, …, x_{k₁} and x_{k₁+1}, …, xₙ, having different parameters. A single change point algorithm would stop at this first change point. However, a multiple change point algorithm proceeds to investigate whether additional change points exist in the data set. This is done by conducting a likelihood ratio test for change in the lower segment (x₁, …, x_{k₁}) and in the upper segment (x_{k₁+1}, …, xₙ) of the data, one at a time. To determine the possibility of change in the lower segment, the hypotheses in Equation (15) are tested:

H₀: the over-dispersion parameter r is constant over the lower segment x₁, …, x_{k₁}
versus
H₁: r changes at some point within the lower segment   (15)

The LRT procedure given in Section 2.3.4 is then followed. Similarly, to detect change in the upper data segment, the LRT procedure is applied with the test hypotheses defined in Equation (16):

H₀: the over-dispersion parameter r is constant over the upper segment x_{k₁+1}, …, xₙ
versus
H₁: r changes at some point within the upper segment   (16)
Detection and location of the second and third change points, assuming they both exist, results in further sub-partitions of the data, so that there are four segments in total for the entire sequence of size n. Each of the four smaller partitions is then tested for change and the splitting process is repeated; hence the name step-wise recursive binary segmentation. The position of any viable change point, k, must satisfy the inequality 2 ≤ k ≤ n − 2, so that a change point can occur neither at the first observation nor at the last two observations in the sequence. Instead, the change point must be sandwiched between two observations. In the problem of multiple change point analysis, given a sample of size n, the maximum possible number of change points is n − 3, which excludes the first observation and the last two observations.
2.4. Determining the Critical Values for the Likelihood Ratio Test for Existence of Change
The study makes use of the methods described by Gombay and Horvath on the asymptotics of maximum-likelihood ratio-type statistics for testing a sequence of observations for no change in parameters against a possible change while some nuisance parameters may remain constant over time [5]. In particular, Gombay and Horvath obtained extreme value approximations as well as Gaussian-type approximations for the square root of the likelihood ratio in Equation (13). They also approximated the maximum likelihood ratio using Ornstein-Uhlenbeck processes and obtained the upper bounds for the rate of approximation.
To derive critical values of the LRT statistic, this study makes use of the asymptotic distribution of Z_n as described in Equation (14). Different critical values were obtained for various small and medium sample sizes (n = 12, 20, 60, 100, 200, and 500) and various test sizes (α = 1%, 5% and 10%).
The asymptotic critical values of Z_n are derived as follows. Let α be the level of the double-sided likelihood ratio test. Define the trimming proportions h and t over which the maximization in Equation (14) is carried out, and define the Gombay-Horvath approximation to the null tail probability of Z_n given in Equation (17):

(17)

Then, according to Gombay et al., if the null hypothesis of no change holds and h and t are chosen such that both exceed 1/n, then the value C obtained by equating the tail probability in Equation (17) to α is an asymptotically correct critical value of size α. It can be shown that, for all x, the tail probability of Z_n can be approximated as in Equation (18), so that for any α:

(18)
The approximations for the distributions of the test statistics proposed by Gombay et al. were applied to data assumed to follow the exponential, Poisson and Normal distributions [6]. Further, the approximations were based on a convenient choice of the parameters h and t, with the value of the remaining constant in the approximation depending on the choice of h and t. This study extends the method discussed in Gombay et al. to data obtained from the Negative Binomial distribution. Alternative values of h and t are tested in an attempt to obtain the best asymptotic critical values.
The assumptions made are that the limiting distribution of Z_n is the double exponential distribution, which is achieved through careful selection of the negative binomial parameters during simulations. In particular, the parameters should be such that the amount of over-dispersion is neither extreme nor negligible. Further, moderate values of r and p should be chosen to ensure that the distribution exhibits properties conducive to convergence. Choosing r values that are not too small helps to avoid excessive variability, and p should be neither too high nor too low to maintain a balanced success rate.
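Because the closed-form Gombay-Horvath approximation is not reproduced here, a simulation-based cross-check of the critical values can be sketched as follows: simulate the null distribution of Z_n in Equation (14) and take its upper quantiles. The function name mc_critical and the choice r = 5, p = 0.2 are illustrative assumptions; this is an alternative to, not a reproduction of, the root-finding approach used in the study.

```r
# Monte Carlo critical values for Z_n = max_k sqrt(Lambda_k) under the null of no change.
# A simulation-based cross-check, not the asymptotic formula used in the paper;
# mc_critical is an assumed name and r = 5, p = 0.2 are assumed parameters.
mc_critical <- function(n, r = 5, p = 0.2, reps = 1000, alpha = c(0.10, 0.05, 0.01)) {
  z <- replicate(reps, {
    x <- rnbinom(n, size = r, prob = p)    # a series with no change
    max(lrt_scan(x), na.rm = TRUE)         # Z_n, Equation (14)
  })
  quantile(z, probs = 1 - alpha)           # upper-tail critical values
}

set.seed(4)
mc_critical(n = 100)   # compare against the asymptotic values reported in Table 2
```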
3. Results and Discussion
3.1. The Multiple Change Point Algorithm
The negative binomial multiple change point algorithm was developed as a set of 9 steps based on the model equations discussed in the foregoing methodology. The sequential iterative steps are outlined below:
Step 1
Input or load the sample data into statistical software such as R. These data could be either real observed count data or simulated data from a Negative Binomial distribution.
Step 2
Compute the sample mean X̄ and variance S² for these data. Conduct a test for dispersion. If the data are over-dispersed or equi-dispersed, proceed to estimate parameters; otherwise discard the sample. Using the sample mean and variance, find the method of moments estimates of the two parameters r and p of the negative binomial model according to Equation (11).
Step 3
Using the sample data and the parameter estimates in Step 2, compute the log-likelihood functions under the null and alternative hypotheses for various values of the arbitrary change point k, as described in Equations (11) and (12) respectively. Next, compute the likelihood ratio statistic described in Equation (13) for each possible value of the change point k. Store all computed values of the likelihood ratio statistic in a single vector.
Step 4
Plot a graph of the likelihood ratio statistic or, equivalently, its square root against the possible values of k. A line plot sufficiently shows at a glance the behavior of the likelihood ratio over sequential values of k.
Step 5
Investigate whether a change exists at some point k by visually inspecting the graph in Step 4. As a guide, when there is no change the likelihood ratio statistic does not have a unique maximum point. On the other hand, when a change exists at some point k, the graph of the likelihood ratio test statistic reaches a maximum value exactly at, or in the neighborhood of, the point k. Figure 3 and Figure 4 represent illustrative graphs of a likelihood ratio test statistic in the absence of change and in the presence of a single change respectively.
Step 6
In case there is no change in the model parameters for the given dataset, the change point algorithm comes to a stop and no further change points are sought. However, if a change exists, approximate the location of the change point using the maximum likelihood approach as the point k at which the statistic in Equation (14) attains its maximum.
Step 7
Once a change has been detected and its location estimated, determine whether the said change point is statistically significant by conducting a likelihood ratio test. The null hypothesis of no significant change is rejected if the test statistic in Equation (14) exceeds the critical value.
Step 8
If a change is found to be statistically insignificant, the algorithm comes to a stop and no further change points are sought. However, if a change is found to be significant at some point k, the current value of k is stored as a change point.
Figure 3. Sample graph showing a case of no change.
Figure 4. Sample graph showing a case of a single change.
Step 9
Partition the input data set into two segments with the boundary corresponding to the change point estimate k̂. Each segment is then taken, one at a time, and treated as the input data set in Step 1, then checked for the existence of change, and the process repeats until no further change points are found.

Given that this is an iterative process, the estimated values of k̂ at each iteration must be recorded and stored. This process constitutes the step-wise recursive binary segmentation procedure discussed in Section 2. Figure 5 represents a model flow chart for the Negative Binomial Multiple Change Point Algorithm.
Figure 5. Schematic representation of the Negative Binomial Multiple Change Point Algorithm.
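Steps 1 to 9 can be strung together into a compact recursive driver; the sketch below relies on the helpers sketched in Section 2 (dispersion_test, lrt_scan), the function name nbmcpa and the minimum segment length are illustrative assumptions, a single critical value (for example from Table 2) is applied at every level of the recursion for simplicity, and segments without sample over-dispersion are skipped because the moment parameterization in Equation (3) requires the variance to exceed the mean.

```r
# Recursive binary segmentation driver for the NBMCPA (illustrative sketch).
# nbmcpa and min_len are assumptions; crit is a critical value, e.g. from Table 2.
nbmcpa <- function(x, crit, offset = 0, min_len = 6) {
  n <- length(x)
  if (n < min_len) return(integer(0))                      # segment too short to split
  if (var(x) <= mean(x)) return(integer(0))                # MME needs over-dispersion
  if (dispersion_test(x)$verdict == "under-dispersed") return(integer(0))  # Step 2
  stat <- lrt_scan(x)                                      # Steps 3-4
  if (all(is.na(stat)) || max(stat, na.rm = TRUE) <= crit) return(integer(0))  # Steps 5-8
  k <- which.max(stat) + 1                                 # Step 6: MLE of the change point
  c(nbmcpa(x[1:k], crit, offset, min_len),                 # Step 9: recurse on lower segment
    offset + k,
    nbmcpa(x[(k + 1):n], crit, offset + k, min_len))       # Step 9: recurse on upper segment
}

set.seed(5)
y <- c(rnbinom(125, 8, 0.4), rnbinom(250, 3, 0.3), rnbinom(125, 5, 0.2))
nbmcpa(y, crit = 3.666)   # critical value at the 5% level (cf. Table 2)
```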
3.2. Critical Values of the Likelihood Ratio Test
This section refers back to the methods described in Subsection 2.4. The method proposed by Gombay and Horvath (1996) on the asymptotic distribution of the likelihood ratio test is applied in the determination of the critical values, according to Equation (18), restated here as Equation (19):

(19)
Various combinations of h and t were tested for the current study, ensuring the condition h, t > 1/n was met, with h and t chosen to be equal. The parameter d in Equation (19) was taken as d = 2, to equal the number of unknown parameters in the Negative Binomial model. The value of the remaining constant in Equation (19) was dependent on the choice of h and t; since h and t were chosen to be equal, it was computed accordingly.
The asymptotic critical values for Z_n were determined in the R environment over 1000 iterations as the roots of Equation (19) for various sample sizes n and different levels of significance α. The results are summarized in Table 2.
Table 2. Critical values for the LRT at varying sample sizes and levels of the test.

Sample size (n) | Level of the test (α) | Critical value (C)
12 | 0.10 | 2.900
12 | 0.05 | 3.184
12 | 0.01 | 3.735
20 | 0.10 | 3.019
20 | 0.05 | 3.294
20 | 0.01 | 3.830
60 | 0.10 | 3.209
60 | 0.05 | 3.467
60 | 0.01 | 3.978
100 | 0.10 | 3.275
100 | 0.05 | 3.527
100 | 0.01 | 4.029
200 | 0.10 | 3.349
200 | 0.05 | 3.594
200 | 0.01 | 4.086
500 | 0.10 | 3.428
500 | 0.05 | 3.666
500 | 0.01 | 4.148
The critical values obtained were compared against the critical values obtained in [5]. Gombay and Horvath obtained critical values for data derived from the Poisson distribution with a mean of 10. For comparability, the Negative Binomial parameters r and p were chosen to achieve a mean of 10, similar to the Poisson distribution. Table 3 shows a sample of critical values derived under the Negative Binomial distribution against those presented by Gombay and Horvath for various sample sizes. The two sets of critical values are fairly consistent. An assessment of how sensitive the critical values are to changes in the model parameters indicated no significant difference, provided the double exponential limiting distribution assumption holds.
Table 3. Critical values for the Negative Binomial LRT versus Gombay's critical values.

Sample size (n) | Level of the test (α) | Asymptotic critical values | Gombay's critical values
 | 0.10 | 3.019 | 3.11
 | 0.05 | 3.294 | 3.60
 | 0.01 | 3.830 | 4.70
 | 0.10 | 3.183 | 3.18
 | 0.05 | 3.443 | 3.62
 | 0.01 | 3.958 | 4.69
 | 0.10 | 3.275 | 3.23
 | 0.05 | 3.527 | 3.64
 | 0.01 | 4.029 | 4.57
 | 0.10 | 3.428 | 3.31
 | 0.05 | 3.666 | 3.69
 | 0.01 | 4.148 | 4.54
3.3. Results of the Simulation Study
3.3.1. Particulars of the Simulation Study
The change point algorithm developed in Section 3 was tested using simulated data from a Negative Binomial distribution. Monte Carlo simulations were performed using the R software to showcase three scenarios: a case of no change, followed by a case of a single change, and finally a case of multiple changes in the distribution parameters. In the cases of single and multiple changes, synthetic datasets were generated such that the true change points were known. This allowed the comparison of detected points against the true values. The single change point study considered synthetic data for a range of small and medium sample sizes. The change points were set at each of the locations n/4, n/2 and 3n/4, where the parameters r and p were assumed to change simultaneously while the distributional form remained the same throughout the samples. For simplicity, the multiple change-point study considered only a case of two changes located at positions n/4 and 3n/4 for a sample of size n.
The choices of the Negative Binomial model parameters r and p for the different segments in the simulation were made to demonstrate the impact of different parameters of the Negative Binomial distribution on the generated data. Different r and p combinations led to varying degrees of dispersion in the counts, providing an opportunity to analyze over-dispersion or changes in variability across the segments. The rationale behind the specific parameter choices was to create diversity in data generation and illustrate change points while mimicking real-world scenarios in count data distributions.
In the no-change setting, a single set of parameters, r = 5 and p = 0.2, was chosen for the entire data series. Setting r = 5 allows simulation of scenarios where 5 successful outcomes are expected (such as 5 successful sales or recoveries) before stopping the trials. A moderate value of r led to a manageable amount of variability in the data, allowing the model to capture enough complexity without becoming overly simplistic or excessively complex. The choice of parameter p = 0.2 indicated a 20% chance of success in each individual trial. A lower probability of success leads to a larger number of trials being needed to achieve the desired number of successes. This aligns with real-world scenarios where events may be rare, leading to larger counts of failures and capturing the essence of over-dispersed count data.
In the single change point setting, the parameters for the two segments were chosen as follows:
Segment 1: r = 8 and p = 0.4
The parameter choice for this segment indicates that one expects to see 8 successful outcomes (such as recoveries or disease incidences) before stopping the trials, with a success probability of 40%. A higher r value can produce more variability in the counts, reflecting scenarios where events are more frequent or where a greater number of successes is desirable before considering a stopping point. A moderate success probability reflects a reasonably likely event, which may be indicative of a favorable condition.
Segment 2: r = 5 and p = 0.2
The values of r and p were chosen such that 5 successes are expected with only a 20% chance of success for each trial. A lower r value in this segment may model a scenario where successes are less common, reflecting a different underlying process or condition. It can lead to less variability in counts, which might be appropriate if the success rate has significantly changed. A lower success probability in this segment means that more trials are needed to achieve the same number of successes, creating a higher variance in the outcomes. This can model a situation where conditions are critical, making successes more challenging to achieve.
In the multiple (two) change point setting, the parameter values for the three segments were chosen as follows:
Segment 1: r = 8 and p = 0.4
This segment simulates random observations where one expects to achieve 8 successes, with a 40% chance of success for each trial. A higher r value combined with a lower p results in increased over-dispersion, which means greater variability in counts. This setting simulates a scenario where the stochastic process is more unpredictable, leading to larger counts and more fluctuations.
Segment 2: r = 3 and p = 0.3
In this segment, data are simulated such that only 3 successes are expected, with a 30% chance of success for each trial. The lower r suggests a shift to a less successful outcome scenario, perhaps indicating a less favorable condition. A p value of 0.3 suggests that, while successes are still possible, they are less frequent than in the first segment, leading to a greater proportion of failures compared to successes.
Segment 3: r = 5 and p = 0.2
This segment is such that 5 successes are expected with a 20% chance of success per trial. A moderate r combined with a lower p reflects an even more challenging scenario for achieving successes, indicating a substantial decline in the likelihood of success compared to the previous segments. This can model conditions that have deteriorated significantly, resulting in fewer successful outcomes.
Overall, the parameter values were chosen to reflect a range of potential real-world scenarios where one might expect changes in counts due to different influencing factors (such as changes in policy, market conditions, or environmental factors). The chosen values for r and p led to a clear, yet not visually obvious, distinction between the data segments, making it easier to demonstrate the effectiveness of the change point detection algorithm developed.
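For reference, the three simulation settings just described translate directly into R; the sketch below generates one dataset per scenario using the stated parameter values, with sample sizes and the single-change location chosen purely for illustration.

```r
# Synthetic data for the three simulation scenarios (sample sizes are illustrative)
set.seed(6)

# No change: NB(r = 5, p = 0.2) throughout the series
x_none <- rnbinom(200, size = 5, prob = 0.2)

# Single change (here placed at n/2): NB(8, 0.4) followed by NB(5, 0.2)
x_single <- c(rnbinom(100, size = 8, prob = 0.4),
              rnbinom(100, size = 5, prob = 0.2))

# Two changes at n/4 and 3n/4: NB(8, 0.4), then NB(3, 0.3), then NB(5, 0.2)
x_multi <- c(rnbinom(50,  size = 8, prob = 0.4),
             rnbinom(100, size = 3, prob = 0.3),
             rnbinom(50,  size = 5, prob = 0.2))
```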
This section gives a summary of results for the simulation study and power analysis, showcasing the performance of the NBMCPA given different sample sizes, model parameters, and locations of change points. Throughout the simulation, the likelihood ratio tests were conducted at the 5% level of significance. Visualizations of the likelihood ratio statistic are displayed for small and medium sample sizes and for each of three predefined change-point locations. The maximum values of the LRT statistics were compared to the critical values of the likelihood ratio test obtained in Section 3.2 to check for significance of change.
3.3.2. A Case Where No Change Exists
A random sample was generated under the Negative Binomial (r = 5, p = 0.2) distribution for a case of no change. The entire data set was such that both parameters r and p of the distribution remained constant throughout the series. Figure 6 shows the graph of the square root of the LRT statistic, √Λ_k, for this sample.

The graph of the likelihood ratio test statistic (as illustrated in Figure 6, panel (b)) does not exhibit a unique or single maximum. Instead, the graph appears rugged, somewhat similar to the graph of the raw data displayed in panel (a). This indicates that no change is detected. In addition to the lack of a unique maximum on the graph, the absence of variation in the distribution parameters is emphasized by the fact that the highest point on the graph of the LRT statistic lies below the critical value, C, at α = 0.05 (see the dashed horizontal line). It is concluded that, at the 5% level of significance, there is no statistically significant change in the distribution parameters.
Figure 6. Graph showing a case of no change.
3.3.3. A Case Where a Single Change Exists
Synthetic data were obtained randomly from the Negative Binomial distribution with preset change points for various small and medium sample sizes, n. The change points were set such that both parameters r and p changed only once in the entire series, at one of the locations n/4, n/2 or 3n/4. However, both data segments formed followed the Negative Binomial distribution. The change point estimates were obtained using the NBMCP algorithm and the results are summarized in Table 4. The graphs of the square root of the LRT statistics, √Λ_k, with superimposed critical values are shown in Figures 7-15.
3.3.4. When the Sample Size Is n = 50
Figures 7-9 represent the raw simulated data (panel (a)) and the results of the likelihood ratio test (panel (b)) when the change point is set at n/4, n/2 and 3n/4 respectively. The estimated change point in each case is indicated on the graph of the likelihood ratio statistic √Λ_k using a vertical line. The horizontal dashed lines indicate the critical values of the test, used to determine whether or not a change, if present, is significant. Figure 7 shows the MLE of the change point when the change was set at position n/4; Figure 8 shows the estimate when the change was set at the middle of the series (n/2); and Figure 9 shows the estimate when the change was set further away from the initial data point, at position 3n/4.
3.3.5. When the Sample Size Is n = 200
Figures 10-12 represent the raw simulated data and the results of the likelihood ratio test when the change point is set at n/4, n/2 and 3n/4 respectively for n = 200. The estimated change point in each case is indicated on the graph of the likelihood ratio statistic (√Λ_k) using vertical lines. The maximum points on the graph of the LRT statistic in each case exceed the critical value C, indicating that the changes are significant at the 5% level.

Figure 7. Graph showing a case of a single change at n/4 for a sample size n = 50.
Figure 8. Graph showing a case of a single change at n/2 for a sample size n = 50.
Figure 9. Graph showing a case of a single change at 3n/4 for a sample size n = 50.
Figure 10. Graph showing a case of a single change at n/4 for a sample size n = 200.
Figure 11. Graph showing a case of a single change at n/2 for a sample size n = 200.
Figure 12. Graph showing a case of a single change at 3n/4 for a sample size n = 200.
3.3.6. When the Sample Size Is n = 500
Figures 13-15 represent the raw simulated data and the results of the likelihood ratio test when the change point is set at n/4, n/2 and 3n/4 respectively for n = 500. The estimated change point in each case is indicated on the graph of the likelihood ratio statistic (√Λ_k) using vertical lines. The horizontal dashed lines indicate the critical values of the test statistic at the 0.05 level of significance.

Figure 13. Graph showing a case of a single change at n/4 for a sample size n = 500.
Figure 14. Graph showing a case of a single change at n/2 for a sample size n = 500.
Figure 15. Graph showing a case of a single change at 3n/4 for a sample size n = 500.

Table 4 gives a summary of the estimated change points for different sample sizes and change locations for the foregoing single change-point scenario.
Table 4. Actual versus estimated changepoints across sample sizes and change locations.

Sample size (n) | Position of change | Actual cpt (k) | Estimated cpt (k̂) | Error (Δ)
12 | n/4 | 3 | 2 | 1
12 | n/2 | 6 | 5 | 1
12 | 3n/4 | 9 | 9 | 0
20 | n/4 | 5 | 4 | 1
20 | n/2 | 10 | 9 | 1
20 | 3n/4 | 15 | 15 | 0
60 | n/4 | 15 | 14 | 1
60 | n/2 | 30 | 30 | 0
60 | 3n/4 | 45 | 45 | 0
100 | n/4 | 25 | 24 | 1
100 | n/2 | 50 | 50 | 0
100 | 3n/4 | 75 | 75 | 0
200 | n/4 | 50 | 50 | 0
200 | n/2 | 100 | 100 | 0
200 | 3n/4 | 150 | 150 | 0
500 | n/4 | 125 | 125 | 0
500 | n/2 | 250 | 250 | 0
500 | 3n/4 | 375 | 375 | 0
The estimation error was calculated as the difference between the estimated change point and the true changepoint. The results showed that the algorithm detects and locates changes with very little (Δ = 1) or no (Δ = 0) error. Changes located further away from the initial data point were more accurately estimated for small sample sizes.
3.3.7. A Case Where Multiple Changes Exist
The NBMCP algorithm was found to work well for the single change point case, as indicated by the results in Table 4. The same algorithm was tested through a simulation study for a multiple change point case with a sample of size n. For simplicity, two change points were specified quarter-way (at n/4) and three-quarter-way (at 3n/4) respectively in the simulated data.
3.3.8. Location of First Change Point
The NBMCP algorithm detects changes in order of their magnitude, such that the first change point detected and estimated is the most pronounced among all changes present. The first change point lies at the point where the likelihood ratio test statistic attains its maximum value. The vertical blocked line in Figure 16 shows the estimate of the first change point, which corresponds to the 3n/4 position. Potential additional change points in the data series appear only as peaks on the graph of the likelihood ratio statistic but are not marked as change points at this stage.
Figure 16. Simulated observations (a) and likelihood ratio (b) showing the location of the first change point at position 3n/4.
3.3.9. Location of Second Change Point
Once the first change point is identified, the algorithm splits the time series into two parts at the first estimated change point. Change detection is then conducted in the lower and upper partitions, provided the segments are not under-dispersed. The lower segment was discarded due to under-dispersion. Figure 17 shows estimates of the first change point (vertical blocked line) and the second change point (vertical dashed line).

Figure 17. The location of the first change point at position 3n/4 (vertical blocked line) and the second change point at n/4 (vertical dashed line).
A dispersion test of the resulting three segments, given the two change points located, showed that the samples were under-dispersed and hence discarded. No further changes were sought in the lower, middle and upper sub-partitions of the data. Following the dispersion test, the algorithm comes to a halt and only two change points are reported.
3.4. Power Analysis
Investigations into the power of the likelihood ratio test for existence of a change were conducted via a simulation study at the 5% level of significance. A null hypothesis of no change in the distributional parameters between any two data segments was tested against the alternative that a change in the distributional parameters exists, so that the data series has two segments, each following a negative binomial distribution but with different dispersion parameters. Mathematically, the test hypotheses were given by Equation (20):

H₀: r₁ = r₂ (no change)
versus
H₁: r₁ ≠ r₂ (a single change exists at some point k)   (20)

Several datasets were generated from the negative binomial distribution under the alternative hypothesis and the LRT statistic in Equation (13) was computed for each dataset. The null hypothesis was rejected when the statistic defined in Equation (14) exceeded the critical value for the given sample size (see Table 2). The proportion of times the null hypothesis was correctly rejected constituted the power of the test, otherwise regarded as the probability of not making a Type II error, 1 − β.
The simulations were performed over 1000 iterations for each sample size and change point location to investigate the algorithm’s ability to detect and correctly estimate change points under different conditions. Power was calculated as the proportion of simulations where the change point was correctly detected.
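A minimal R sketch of this power simulation is given below; it reuses the lrt_scan helper sketched in Section 2.3.4, counts a detection as correct when the estimate is within the stated tolerance of 1 of the true location, and the function name, segment parameters and critical value shown are illustrative assumptions.

```r
# Empirical power of the LRT for a single change (illustrative sketch;
# power_sim is an assumed name and the segment parameters are assumptions).
power_sim <- function(n, k_true, crit, reps = 1000, tol = 1) {
  hits <- replicate(reps, {
    x <- c(rnbinom(k_true, size = 8, prob = 0.4),        # segment before the change
           rnbinom(n - k_true, size = 5, prob = 0.2))    # segment after the change
    stat <- lrt_scan(x)
    if (all(is.na(stat)) || max(stat, na.rm = TRUE) <= crit) return(FALSE)
    abs((which.max(stat) + 1) - k_true) <= tol           # correct detection within tolerance
  })
  mean(hits)
}

set.seed(7)
power_sim(n = 100, k_true = 50, crit = 3.527)   # change at n/2, 5% critical value (cf. Table 2)
```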
Sensitivity analysis of the test was performed in a single-change setting, with the change located at different positions (n/4, n/2 and 3n/4) and for varying small and medium sample sizes.
Power results for each combination of sample size and change point were stored and used to determine the optimal conditions for change detection through a comparative analysis. A tolerance level of 1 was considered in the analysis, so that an estimate falling within one observation of the true change point was counted as a correct detection. Table 5 summarizes the results of the power analysis for various sample sizes and locations of a single change.
Table 5. Power of the likelihood ratio test for change.

Sample size (n) | Position of change | Test power
12 | n/4 | 0.084
12 | n/2 | 0.166
12 | 3n/4 | 0.103
20 | n/4 | 0.311
20 | n/2 | 0.424
20 | 3n/4 | 0.367
60 | n/4 | 0.404
60 | n/2 | 0.434
60 | 3n/4 | 0.426
100 | n/4 | 0.477
100 | n/2 | 0.496
100 | 3n/4 | 0.494
200 | n/4 | 0.498
200 | n/2 | 0.533
200 | 3n/4 | 0.524
500 | n/4 | 0.513
500 | n/2 | 0.537
500 | 3n/4 | 0.533
The results showed that the likelihood ratio test was most powerful in detecting and locating change when the changepoint was located midway through the data set.
The change detection accuracy of the algorithm, in terms of the true positive rate, was higher for changes positioned three-quarter-way than for changes located closer to the first data point (quarter-way). These results were consistent with those presented in [7] and [8]. In addition, the power of the LRT was found to increase with the sample size, with the highest detection accuracy of 53.7% being recorded for the largest sample size (n = 500). Additional analysis showed that the test was most powerful when changes were larger, so that there was greater distinction between the segments. However, the test performed well even for subtle changes, especially for larger sample sizes. Figure 18 visualizes the trend in test power as the sample size increases and the change point location is varied.
Figure 18. Sensitivity analysis of the power of the LRT for change detection.
4. Conclusions and Recommendations
This study considered change detection for a range of small and medium-sized datasets between n = 12 and n = 500. Given the foregoing results, the Negative Binomial multiple change point algorithm produces the expected results for different sample sizes and change point locations. Importantly, the algorithm does not erroneously detect changes when none are present, for both large and small samples. This finding makes the NBMCP algorithm a robust and reliable method of detecting changes in a count data series. When a change is present, the algorithm produces better change point estimates for large and medium samples than for smaller datasets. The algorithm, when applied to larger data sets, detects multiple and subtle changes more accurately. While the accuracy of the algorithm is slightly lower in detecting small changes within small-sized datasets, there is an upside in reduced computation time compared to large datasets. In addition, the results showed that the NBMCPA produces better estimates of the change points when the actual points of variation are further away from, rather than closer to, the first data point. It was noted that when the change point was positioned three-quarter-way, the deviation of the estimates from the actual changepoints was smaller, often 0, than when the change point was set at the n/4 position.
In a nutshell, comparative analysis of the power results revealed that various factors influence power: the sample size, so that larger samples generally increase power; the effect size, with large differences between segments increasing power; the significance level (α), such that higher levels of the test increase power but also raise the risk of a Type I error; variability, so that less variation or noise in the data increases power; and the changepoint location, so that power is higher when changes are located further away from the first observation. The algorithm correctly indicates that there is no change in synthetic data with no predefined changepoint, indicating a high detection accuracy in this respect. However, further investigations can be done into the False Positive Rate (FPR) of the algorithm where there are multiple subtle changes in close proximity to each other within small and medium datasets.
The NBMCPA is developed such that only a single change can be identified at a time. In the multiple change point setting, the algorithm starts by checking for change points in the entire series. In cases where only one change exists, then the algorithm stops when the first and only change point is located. However, where there are two or more change points, the most pronounced change is detected and located first, then the next most pronounced change is identified and so on.
This simulation study limited investigations to small and medium-sized datasets between n = 12 and n = 500, for which critical values were obtained. Interested researchers are advised to look into the possibility of extending the methods of determining critical values for the likelihood ratio test to larger (n > 500) sample sizes. Finally, the algorithm developed in this study works well for over-dispersed and equi-dispersed datasets. Interested researchers may develop change point algorithms that work well for under-dispersed count data.