Extreme events are values of a variable that fall below or above a certain value, called the threshold. A well-chosen threshold helps to identify the extreme levels. Several methods have been used to determine a threshold for analyzing and modeling extreme events. One of the most successful is the maximum product of spacing (MPS) method. However, the method breaks down when there are ties among the exceedances. This study offers a way to model data even when it contains ties. To do so, it determines an optimal threshold that yields better parameter estimates for extreme events. The study derived a method that improves the MPS method for determining an optimal threshold for extreme values in a data set containing ties, estimated the Generalized Pareto Distribution (GPD) parameters for the derived threshold, and compared these with the GPD parameters obtained through the standard MPS model. The improved method uses the GPD and the peaks-over-threshold (POT) approach as the basis for identifying extreme values. The study will help statisticians in different sectors of the economy to model extreme events involving ties. For statisticians, the structure of the extreme levels in the tails of ordinary distributions is very important in analyzing, predicting and forecasting the likelihood of an extreme event.

Certain values in the tails of any distribution represent extreme events and are pointers to eventuality. The values in the tails are rare and few, but they can have a great impact on the conclusions reached by analysts. Different sectors of life experience extreme events, and here we mention just a few. According to [

The MPS method yields efficient estimators in non-regular cases where the maximum likelihood estimator (MLE) may not exist. This is especially relevant to the generalized extreme value (GEV) distribution, for which the MLE does not exist when $\varepsilon < -1$. According to [

$$D_i(\theta) = F_\theta(x_i) - F_\theta(x_{i-1}) \tag{1}$$

for $i = 1, 2, \cdots, n+1$. The maximum spacing estimator of $\theta_0$ was defined as the value that maximizes the logarithm of the geometric mean of the sample spacings [

$$\hat{\theta} = \underset{\theta \in \Theta}{\arg\max}\, S_n(\theta) \tag{2}$$

where

$$S_n(\theta) = \ln \sqrt[n+1]{D_1(\theta) \cdot D_2(\theta) \cdots D_{n+1}(\theta)} = \frac{1}{n+1} \sum_{i=1}^{n+1} \ln D_i(\theta)$$
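As an illustration of the definition of $S_n(\theta)$, the following minimal Python sketch computes the mean log-spacing of a sample under a candidate CDF. The function names `mps_objective` and `uniform_cdf` are hypothetical, and the uniform CDF is used only because its spacings are easy to check by hand; it is not the distribution studied in this paper. The second call shows what happens when a value is repeated: one spacing becomes zero and the objective is no longer finite.

```python
import math


def mps_objective(sample, cdf):
    """Mean log-spacing S_n of a sample under a candidate CDF."""
    xs = sorted(sample)
    # Pad the CDF values with F(-inf) = 0 and F(+inf) = 1.
    probs = [0.0] + [cdf(x) for x in xs] + [1.0]
    spacings = [b - a for a, b in zip(probs, probs[1:])]
    # A zero spacing has no logarithm; report -inf instead of raising.
    if any(d <= 0.0 for d in spacings):
        return float("-inf")
    return sum(math.log(d) for d in spacings) / len(spacings)


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1] (illustrative choice)."""
    return min(max(x, 0.0), 1.0)


s_no_ties = mps_objective([0.25, 0.5, 0.75], uniform_cdf)  # four spacings of 0.25
s_ties = mps_objective([0.25, 0.5, 0.5], uniform_cdf)      # one spacing is zero
```

Returning negative infinity for a zero spacing is a design choice of this sketch, so that an optimizer would simply reject such a parameter value.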

This maximum spacing estimator is sensitive to ties. That is, if for some $m \geq 1$

$$x_{i+m} = x_{i+m-1} = \cdots = x_i$$

then, by Equation (1), $D_{i+m}(\theta) = D_{i+m-1}(\theta) = \cdots = D_{i+1}(\theta) = 0$, and the logarithms of these spacings are undefined. This therefore collapses the method. The modified MPS method proposed here uses a grouped-data frequency table. Let $x_1, x_2, \cdots, x_n$ occur $f_1, f_2, \cdots, f_n$ times respectively. The geometric mean is given by

$$G = \left( x_1^{f_1} \cdot x_2^{f_2} \cdots x_n^{f_n} \right)^{\frac{1}{N}} = \left[ \prod_{i=1}^{n} x_i^{f_i} \right]^{\frac{1}{N}}$$

This implies that

$$\ln G = \frac{1}{N} \sum_{i=1}^{n} f_i \ln x_i \tag{3}$$
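Equation (3) can be checked numerically with a small sketch. It assumes that $N$ is the total number of observations, $N = \sum_i f_i$, which is the usual convention for grouped data but is not stated explicitly above; the function name `grouped_log_geometric_mean` and the sample values are illustrative only.

```python
import math
from collections import Counter


def grouped_log_geometric_mean(freq_table):
    """ln G from a {value: frequency} table, following Equation (3)."""
    n_total = sum(freq_table.values())  # assumed meaning of N
    return sum(f * math.log(x) for x, f in freq_table.items()) / n_total


raw = [2.0, 2.0, 8.0, 4.0]            # the value 2.0 is tied
table = Counter(raw)                  # frequencies: 2.0 -> 2, 8.0 -> 1, 4.0 -> 1
ln_g_grouped = grouped_log_geometric_mean(table)
ln_g_raw = sum(math.log(x) for x in raw) / len(raw)
```

Both routes give the same value, which is the property the modified method relies on: grouping tied values into frequencies does not change the logarithm of the geometric mean.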

This leads to the modified MPS method as

$$S_n(\theta) = \ln \sqrt[n+1]{D_1^{f_1}(\theta) \cdot D_2^{f_2}(\theta) \cdots D_{n+1}^{f_{n+1}}(\theta)} = \frac{1}{n+1} \sum_{i=1}^{n+1} f_i \ln D_i(\theta) \tag{4}$$

When $f_1 = f_2 = \cdots = f_{n+1} = 1$, Equation (4) reduces to the standard MPS. The spacings are such that $\sum_{i=1}^{n+1} D_i(\theta) = 1$. Under MPS, the $D_i(\theta)$'s are defined as:

$$D_1(\theta) = F(x_{1:n}, \theta)$$

$$D_i(\theta) = F(x_{i:n}, \theta) - F(x_{i-1:n}, \theta)$$

$$D_{n+1}(\theta) = 1 - F(x_{n:n}, \theta)$$

Therefore, Equation (4) can be partitioned as:

$$S_n(x_i; \theta, \varepsilon, \sigma) = \frac{1}{n+1} \left\{ f_1 \ln D_1(\theta) + \sum_{i=2}^{n} f_i \ln D_i(\theta) + f_{n+1} \ln D_{n+1}(\theta) \right\} \tag{5}$$
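The frequency-weighted objective can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `weighted_mps_objective` is hypothetical, the weight $f_{n+1}$ of the last spacing is not tied to any observation in the text and is assumed here to be 1, and a uniform CDF is used only to keep the arithmetic checkable.

```python
import math
from collections import Counter


def weighted_mps_objective(sample, cdf, last_weight=1):
    """Frequency-weighted mean log-spacing over the distinct sample values."""
    table = sorted(Counter(sample).items())        # (value, frequency) pairs
    xs = [x for x, _ in table]
    freqs = [f for _, f in table] + [last_weight]  # f_{n+1}: assumed value
    probs = [0.0] + [cdf(x) for x in xs] + [1.0]
    spacings = [b - a for a, b in zip(probs, probs[1:])]
    # Distinct values give strictly positive spacings for a strictly
    # increasing CDF, so every logarithm below is defined.
    total = sum(f * math.log(d) for f, d in zip(freqs, spacings))
    return total / len(spacings)


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1] (illustrative choice)."""
    return min(max(x, 0.0), 1.0)


tied = weighted_mps_objective([0.25, 0.5, 0.5], uniform_cdf)
untied = weighted_mps_objective([0.25, 0.5, 0.75], uniform_cdf)
```

With the tied sample the objective stays finite because the repeated value contributes through its frequency rather than through a zero spacing; with all frequencies equal to 1 the function returns the standard MPS value.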

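To estimate the parameters, we substitute the GPD

$$G(x; \varepsilon, \sigma) = \begin{cases} 1 - \left[ 1 + \varepsilon \left( \dfrac{x - u}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}}, & \varepsilon \neq 0 \\[2ex] 1 - \exp \left[ -\left( \dfrac{x - u}{\sigma} \right) \right], & \varepsilon = 0 \end{cases} \tag{6}$$

into the MPS method. This leads to two cases of estimating the GPD parameters.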
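For concreteness, Equation (6) can be written as a small Python function. The name `gpd_cdf` and its argument names are hypothetical. The two boundary checks (returning 0 below the threshold and 1 beyond the upper endpoint when the shape is negative) are standard properties of the GPD that Equation (6) leaves implicit; they are additions of this sketch.

```python
import math


def gpd_cdf(x, threshold, scale, shape):
    """Generalized Pareto CDF as in Equation (6), with support checks."""
    z = (x - threshold) / scale
    if z <= 0.0:
        return 0.0                      # below the threshold u
    if shape == 0.0:
        return 1.0 - math.exp(-z)       # exponential branch
    base = 1.0 + shape * z
    if base <= 0.0:
        return 1.0                      # past the upper endpoint (shape < 0)
    return 1.0 - base ** (-1.0 / shape)


p_exponential = gpd_cdf(1.0, 0.0, 1.0, 0.0)
p_heavy = gpd_cdf(1.0, 0.0, 1.0, 1.0)
p_near_zero_shape = gpd_cdf(1.0, 0.0, 1.0, 1e-8)
```

The last call illustrates why the two cases belong together: as the shape parameter approaches zero, the first branch approaches the exponential branch.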

In the first case, $\varepsilon \neq 0$, with the threshold written as the location parameter $\theta$:

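$$D_1 = 1 - \left[ 1 + \varepsilon \left( \frac{x_1 - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} \tag{7}$$

$$D_i = \left( 1 - \left[ 1 + \varepsilon \left( \frac{x_i - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} \right) - \left( 1 - \left[ 1 + \varepsilon \left( \frac{x_{i-1} - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} \right)$$

which leads to

$$D_i = \left[ 1 + \varepsilon \left( \frac{x_{i-1} - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} - \left[ 1 + \varepsilon \left( \frac{x_i - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} \tag{8}$$

and

$$D_{n+1} = 1 - \left( 1 - \left[ 1 + \varepsilon \left( \frac{x_n - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} \right) \tag{9}$$

implying that

$$D_{n+1} = \left[ 1 + \varepsilon \left( \frac{x_n - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}}$$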

Therefore, Equation (5) now becomes:

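$$S_n(x_i; \theta, \varepsilon, \sigma) = \frac{1}{n+1} \left\{ f_1 \ln \left( 1 - \left[ 1 + \varepsilon \left( \frac{x_1 - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} \right) + \sum_{i=2}^{n} f_i \ln \left( \left[ 1 + \varepsilon \left( \frac{x_{i-1} - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} - \left[ 1 + \varepsilon \left( \frac{x_i - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} \right) + f_{n+1} \ln \left[ 1 + \varepsilon \left( \frac{x_n - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} \right\} \tag{10}$$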
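A direct numerical evaluation of Equation (10) may help when checking the derivatives that follow. The sketch below is illustrative: `gpd_survival` and `s_n_gpd` are hypothetical names, the caller supplies all $n+1$ frequencies, and the inputs are assumed to satisfy $\theta < x_1$ and $1 + \varepsilon (x - \theta)/\sigma > 0$ so that every logarithm is defined. The coarse grid search over $\sigma$ at the end is a stand-in for solving the normal equations, not the procedure used in this paper.

```python
import math


def gpd_survival(x, theta, eps, sigma):
    """[1 + eps * (x - theta) / sigma] ** (-1 / eps), for eps != 0."""
    return (1.0 + eps * (x - theta) / sigma) ** (-1.0 / eps)


def s_n_gpd(xs, freqs, theta, eps, sigma):
    """Equation (10): xs sorted and distinct, len(freqs) == len(xs) + 1."""
    surv = [gpd_survival(x, theta, eps, sigma) for x in xs]
    spacings = [1.0 - surv[0]]                            # D_1
    spacings += [a - b for a, b in zip(surv, surv[1:])]   # D_2 .. D_n
    spacings.append(surv[-1])                             # D_{n+1}
    return sum(f * math.log(d) for f, d in zip(freqs, spacings)) / len(spacings)


# Two distinct values, unit frequencies, theta = 0, eps = 1, sigma = 1:
# the spacings are 1/2, 1/6 and 1/3.
value = s_n_gpd([1.0, 2.0], [1, 1, 1], theta=0.0, eps=1.0, sigma=1.0)

# Coarse grid search over sigma with theta and eps held fixed.
grid = [0.5, 1.0, 1.5, 2.0, 2.5]
best_sigma = max(grid, key=lambda s: s_n_gpd([1.0, 2.0], [1, 1, 1], 0.0, 1.0, s))
```

For this toy input the objective can be maximized by hand at $\sigma = \sqrt{2}$, so the grid search selects the nearest grid point, 1.5.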

The estimation of the parameters involves taking partial derivatives of Equation (10) with respect to each of the parameters and setting the results to zero. For the estimation of $\varepsilon$, the first term on the right-hand side is worked out as follows.

Let

$$K_1 = \ln \left( 1 - \left[ 1 + \varepsilon \left( \frac{x_1 - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} \right) \tag{11}$$

implying that

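$$\frac{\partial K_1}{\partial \varepsilon} = \frac{(x_1 - \theta)}{\sigma \varepsilon \left( 1 + \varepsilon \left( \frac{x_1 - \theta}{\sigma} \right) \right)} - \frac{1}{\varepsilon^2} \ln \left( 1 + \varepsilon \left( \frac{x_1 - \theta}{\sigma} \right) \right) \tag{12}$$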

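Working out the second term of Equation (10):

Let

$$K_2 = \ln \left\{ \left[ 1 + \varepsilon \left( \frac{x_{i-1} - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} - \left[ 1 + \varepsilon \left( \frac{x_i - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} \right\} \tag{13}$$

Therefore,

$$\frac{\partial K_2}{\partial \varepsilon} = \left( \frac{1}{\left[ 1 + \varepsilon \left( \frac{x_{i-1} - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} - \left[ 1 + \varepsilon \left( \frac{x_i - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}}} \right) \times \left\{ \left( \frac{1}{\varepsilon^2} \ln \left[ 1 + \varepsilon \left( \frac{x_{i-1} - \theta}{\sigma} \right) \right] - \frac{1}{\varepsilon} \times \frac{1}{1 + \varepsilon \left( \frac{x_{i-1} - \theta}{\sigma} \right)} \times \left( \frac{x_{i-1} - \theta}{\sigma} \right) \right) - \left( \frac{1}{\varepsilon^2} \ln \left[ 1 + \varepsilon \left( \frac{x_i - \theta}{\sigma} \right) \right] - \frac{1}{\varepsilon} \times \frac{1}{1 + \varepsilon \left( \frac{x_i - \theta}{\sigma} \right)} \times \left( \frac{x_i - \theta}{\sigma} \right) \right) \right\} \tag{14}$$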

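And the last term of Equation (10):

Let

$$K_3 = \ln \left[ 1 + \varepsilon \left( \frac{x_n - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} \tag{15}$$

Therefore,

$$\frac{\partial K_3}{\partial \varepsilon} = \left( \frac{1}{\left[ 1 + \varepsilon \left( \frac{x_n - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}}} \right) \left\{ \frac{1}{\varepsilon^2} \ln \left[ 1 + \varepsilon \left( \frac{x_n - \theta}{\sigma} \right) \right] - \frac{1}{\varepsilon} \left( \frac{x_n - \theta}{\sigma} \right) \right\} \tag{16}$$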

Similarly, the next parameter to estimate is $\sigma$. From the same three terms:

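$$\frac{\partial K_1}{\partial \sigma} = \frac{1}{\left[ 1 - \left[ 1 + \varepsilon \left( \frac{x_1 - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} \right]} \left[ \left[ 1 + \varepsilon \left( \frac{x_1 - \theta}{\sigma} \right) \right]^{-\frac{(1 + \varepsilon)}{\varepsilon}} \right] \times \left( \frac{x_1 - \theta}{\sigma^2} \right) \tag{17}$$

$$\frac{\partial K_2}{\partial \sigma} = \frac{\partial K_2}{\partial w} \times \frac{\partial w}{\partial \sigma}$$

$$\frac{\partial K_2}{\partial \sigma} = \frac{1}{\left[ 1 + \varepsilon \left( \frac{x_{i-1} - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}} - \left[ 1 + \varepsilon \left( \frac{x_i - \theta}{\sigma} \right) \right]^{-\frac{1}{\varepsilon}}} \times \left\{ \left( \frac{x_i - \theta}{\sigma^2} \right) \left[ 1 + \varepsilon \left( \frac{x_i - \theta}{\sigma} \right) \right]^{-\frac{(1 + \varepsilon)}{\varepsilon}} - \left( \frac{x_{i-1} - \theta}{\sigma^2} \right) \left[ 1 + \varepsilon \left( \frac{x_{i-1} - \theta}{\sigma} \right) \right]^{-\frac{(1 + \varepsilon)}{\varepsilon}} \right\} \tag{18}$$

and

$$\frac{\partial K_3}{\partial \sigma} = -\left( \frac{x_n - \theta}{\sigma^2} \right) \left[ 1 + \varepsilon \left( \frac{x_n - \theta}{\sigma} \right) \right]^{-1} \tag{19}$$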

Finally, we estimate

and

Therefore, after differentiating partially Equation (10) with respect to the parameters, we get the normal Equations (23), (24) and (25);

where the terms

where the terms

where the terms

Parameters under this case are estimated here. When

Therefore, Equation (5) can be written as;

Let

Therefore,

Similarly, let

Therefore,

Let

implying that;

The equations for estimating

Next, let

Simplifying to;

Lastly, let

Therefore, after differentiating partially Equation (29) with respect to the parameters, we get the normal Equations (39) and (40);

where the terms

where the terms
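For reference, the spacings in the second case can be written out by substituting the $\varepsilon = 0$ branch of Equation (6) into the definitions of $D_1$, $D_i$ and $D_{n+1}$, with the threshold written as $\theta$ as in Equations (7) to (9). This is a worked substitution offered as a sketch; it is not a transcription of the paper's own numbered equations for this case.

```latex
\begin{align*}
D_1 &= 1 - \exp\left[-\left(\frac{x_1 - \theta}{\sigma}\right)\right] \\
D_i &= \exp\left[-\left(\frac{x_{i-1} - \theta}{\sigma}\right)\right]
     - \exp\left[-\left(\frac{x_i - \theta}{\sigma}\right)\right],
     \qquad i = 2, \ldots, n \\
D_{n+1} &= \exp\left[-\left(\frac{x_n - \theta}{\sigma}\right)\right],
     \qquad\text{so that}\qquad
     \ln D_{n+1} = -\frac{x_n - \theta}{\sigma}
\end{align*}
```

Only $\theta$ and $\sigma$ remain in these expressions, which is why this case yields a two-parameter model.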

A simulation was performed to compare the standard MPS methodology with the improved MPS methodology. We simulated data from a gamma distribution with parameters shape = 2.6 and scale = 1:1000. Repetitions were then introduced in the order of 0, 20, 40 and 60; the repeated values gave rise to ties. The gamma distribution is known to have fairly heavy tails. To determine the threshold, we simulated a data set of 300 values: 100 values had no repetition, while another 100 values were each repeated once, giving them a frequency of 2 each. This data set was used in the improved MPS model. After the simulation, the data set was reorganized so that each of the 300 values had a frequency of 1, regardless of whether it was repeated. This reorganized data set was used in the standard MPS model. The normal equations derived above were used as the model for the improved MPS methodology: Equations (23), (24) and (25) for the three-parameter model, and Equations (39) and (40) for the two-parameter model.
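The construction of the two data sets can be sketched with Python's standard library. This is an illustration only: the function name `simulate_with_ties`, the seed, and the scale value of 100.0 are assumptions made so the example runs, since the simulation code and the exact scale setting are not given here.

```python
import random
from collections import Counter


def simulate_with_ties(n_single, n_repeated, shape, scale, seed=0):
    """Gamma sample in which n_repeated of the drawn values appear twice."""
    rng = random.Random(seed)
    singles = [rng.gammavariate(shape, scale) for _ in range(n_single)]
    doubles = [rng.gammavariate(shape, scale) for _ in range(n_repeated)]
    return singles + doubles + doubles   # each value in `doubles` is tied once


# scale=100.0 is a placeholder value, not the setting used in the study.
sample = simulate_with_ties(100, 100, shape=2.6, scale=100.0)

grouped = Counter(sample)    # value -> frequency, input for the improved model
ungrouped = sorted(sample)   # 300 entries of frequency 1, input for the standard model
```

The grouped table has 200 distinct values, 100 of them with frequency 2, while the ungrouped list keeps all 300 observations as separate entries.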

Suitable values for k and

The x-axis of

between 0 and 450. The values with big magnitude concentrate to the right of the distribution. The simulated values are Gamma distributed with a long right tail.

When the two parameter standard and improved models were used, the following results

The threshold (location parameter) from the improved MPS model was higher than that obtained through standard MPS model

The threshold (location parameter) determined from the improved MPS model was high compared to the threshold determined through the standard MPS model

From

Model | Location | Scale |
---|---|---|
Improved | 736.476 | 13.72969 |
Standard | 725.5767 | 16.31062 |

Model | Location | Scale | Shape |
---|---|---|---|
Improved | 738.1303 | 9.483573 | −0.84884 |
Standard | 726.3707 | 13.33941 | −5.49648 |

MLE and MPLE estimates | 2-par Standard | 2-par Improved | 3-par Standard | 3-par Improved |
---|---|---|---|---|
Threshold | 725.5767 | 736.476 | 726.3707 | 738.1303 |
No. above | 18 | 15 | 18 | 15 |
Proportion above | 0.06 | 0.05 | 0.06 | 0.05 |
Scale estimate | 146.3 | 163.9 | 98.69 | 130 |
Scale std. err | 34.49 | 42.31 | 31.66 | 40.8 |
Shape estimate | 0.3728 | 0.1903 | 0.3879 | 0.2146 |
Shape std. err | 0.2808 | 0.2573 | 0.2855 | 0.2644 |
Asympt Var-Cov-Scale | 1189 | 1790 | 1002 | 1664 |
Asympt Var-Cov-Shape | 0.07887 | 0.06623 | 0.08151 | 0.06993 |
Deviance | 214.8238 | 183.1161 | 214.5401 | 182.7789 |
Penalized Deviance | 215.4867 | 182.9699 | 215.796 | 183.3916 |
AIC | 216.8238 | 185.1161 | 216.5401 | 184.7789 |
Penalized AIC | 217.4867 | 184.9699 | 217.796 | 185.3916 |
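As a reading aid for the table above, the reported standard errors agree with the square roots of the corresponding asymptotic variance entries, up to rounding. The short check below copies those two pairs of rows; the helper name `max_relative_gap` is hypothetical.

```python
import math

# (asymptotic variance, reported std. err) pairs copied from the table above,
# in column order: 2-par standard, 2-par improved, 3-par standard, 3-par improved.
scale_pairs = [(1189, 34.49), (1790, 42.31), (1002, 31.66), (1664, 40.8)]
shape_pairs = [(0.07887, 0.2808), (0.06623, 0.2573),
               (0.08151, 0.2855), (0.06993, 0.2644)]


def max_relative_gap(pairs):
    """Largest relative difference between sqrt(variance) and the std. err."""
    return max(abs(math.sqrt(var) - se) / se for var, se in pairs)


scale_gap = max_relative_gap(scale_pairs)
shape_gap = max_relative_gap(shape_pairs)
```

Both gaps are well under one part in a thousand, which is consistent with rounding of the printed values.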

backtested, it produced excesses of 15 data values which was a proportion of 0.05

In this section, the effect of number of repetitions on threshold was investigated. The gamma distribution with parameters _{1} and y_{2} were made to take different values. To create a sample with 20 repetitions, variables

The repetitions cause ties; these samples therefore contain grouped ties. The improved MPS model used this raw sample. For the sample to be used with the standard MPS model, the data had to be ungrouped so that each value had a frequency of one. The densities of the distribution in the four cases of repetitions were plotted as shown in

The x-axis of the plots in

According to [

Repetitions | 0 | 20 | 40 | 60 |
---|---|---|---|---|
Location | 1111.897 | 1116.442 | 1121.845 | 1124.372 |
Scale | 5.993496 | 9.99465 | 3.056949 | 1.373637 |

Repetitions | 0 | 20 | 40 | 60 |
---|---|---|---|---|
Location | 1111.009 | 1127.579 | 1130.066 | 1140.35 |
Scale | 4.097801 | 7.830554 | 5.119941 | 4.395648 |

7.830554 then decreased as the number of ties increased. The samples with ungrouped ties were subjected to a three parameter standard MPS model and the results are as in

Plots to compare the performance of the parameters obtained were made as indicated in Figures 3-7.

Repetitions | 0 | 20 | 40 | 60 |
---|---|---|---|---|
Location | 1111.473 | 1118.298 | 1120.145 | 1121.675 |
Scale | 17.44149 | 12.04152 | 4.716338 | 4.950339 |
Shape | −0.05918 | −0.04098 | −1.90142 | −6.59819 |

Repetitions | 0 | 20 | 40 | 60 |
---|---|---|---|---|
Location | 1111.954 | 1128.368 | 1133.003 | 1141.156 |
Scale | 15.4904 | 7.009472 | 8.335507 | 9.42994 |
Shape | 0.05365 | 1.586433 | 2.54892 | −4.74385 |

When there was no tie

The trend observed in the two parameter model

A plot of scales was also made to compare the performance of the scale parameter as the number of repetitions increased for the two and three parameter MPS model

The two-parameter models had different scale parameters when there were no ties. As the repetitions increased up to 20, the scale of the standard MPS model increased faster than that of the improved MPS model, after which both models showed a downward trend

The three parameter MPS models

The two-parameter standard and improved models have location and scale parameters only; their shape parameter is zero. For the three-parameter standard and improved models, the shape parameter performed as shown in

The two models had the same shape parameter when the data had no repetitions (ties) (Figure 7). The shape parameter of the improved MPS model increased up to around 38 repetitions, after which it showed a downward trend. The standard model remained steady between 0 and 20 repetitions, after which it followed a downward trend. The rate of decrease in the shape parameter was higher for the improved model than for the standard model. The shape parameter decreased consistently beyond 20 repetitions for the standard model and beyond 38 repetitions for the improved model. A general observation was that the trend of all parameters changed between 30 and 40 repetitions.

The thresholds obtained from the different samples were back-tested on the sample data from which they were obtained, to assess their performance.
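The counting step of such a back-test can be sketched as below. The function name `backtest_threshold` and the toy data are illustrative; the sketch covers only the number and proportion of observations above the threshold, not the fitting of the GPD to the excesses.

```python
def backtest_threshold(sample, threshold):
    """Return the count and the proportion of observations above the threshold."""
    exceedances = [x for x in sample if x > threshold]
    return len(exceedances), len(exceedances) / len(sample)


count, proportion = backtest_threshold([1.0, 2.0, 3.0, 4.0, 5.0], 3.5)
```

Applied to a sample of 300 values, a count of 16 corresponds to a proportion of about 0.0533, matching the way the "number above" and "proportion above" rows are related in the tables.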

For two parameter models at 0 repetitions, both models, the standard and the improved MPS model

Estimate | 0 reps Std | 0 reps Impd | 20 reps Std | 20 reps Impd | 40 reps Std | 40 reps Impd | 60 reps Std | 60 reps Impd |
---|---|---|---|---|---|---|---|---|
Threshold | 1111.897 | 1111.009 | 1116.402 | 1127.579 | 1121.845 | 1130.066 | 1124.372 | 1140.35 |
No. above | 16 | 16 | 23 | 21 | 14 | 14 | 11 | 11 |
Proportion above | 0.0533 | 0.0533 | 0.0767 | 0.07 | 0.0467 | 0.0467 | 0.0367 | 0.0367 |
Scale estimate | 168.3 | 169.3 | 169.6 | 178.3 | 180.7 | 170.8 | 199.5 | 174.7 |
Scale std. err | 51.09 | 51.37 | 42.41 | 46.64 | 59.37 | 56.54 | 76.02 | 67.84 |
Shape estimate | 0.5588 | 0.5505 | 0.3235 | 0.6702 | 0.5486 | 0.6172 | 0.5641 | 0.7014 |
Shape std. err | 0.309 | 0.3065 | 0.1047 | 0.2861 | 0.3211 | 0.3412 | 0.3779 | 0.4265 |
Asympt Var-Cov-Scale | 2610 | 2639 | 1798 | 2175 | 3525 | 3197 | 5779 | 4602 |
Asympt Var-Cov-Shape | 0.09545 | 0.09395 | 0.1047 | 0.08186 | 0.1031 | 0.1164 | 0.1428 | 0.1819 |
Deviance | 208.5296 | 208.6634 | 301.9033 | 277.4688 | 184.4465 | 183.3083 | 147.1072 | 145.1982 |
Penalized Deviance | 204.2043 | 204.3792 | 292.9033 | 269.3051 | 181.331 | 180.0228 | 145.7075 | 144.0077 |
AIC | 210.5296 | 210.6634 | 303.9033 | 279.4688 | 186.4465 | 185.3038 | 149.1072 | 147.1982 |
Penalized AIC | 206.2043 | 206.3792 | 292.5338 | 271.3051 | 183.331 | 182.0228 | 147.7075 | 146.0077 |

Estimate | 0 reps Std | 0 reps Impd | 20 reps Std | 20 reps Impd | 40 reps Std | 40 reps Impd | 60 reps Std | 60 reps Impd |
---|---|---|---|---|---|---|---|---|
Threshold | 1111.473 | 1111.954 | 1118.298 | 1128.3679 | 1120.145 | 1133.003 | 1121.647 | 1141.156 |
No. above | 16 | 16 | 21 | 21 | 14 | 14 | 11 | 11 |
Proportion above | 0.0533 | 0.0533 | 0.07 | 0.07 | 0.0467 | 0.0467 | 0.0367 | 0.0367 |
Scale estimate | 168.8 | 168.2 | 188.5 | 177.5 | 182.6 | 167.1 | 203 | 173.3 |
Scale std. err | 51.23 | 51.07 | 49.17 | 46.42 | 59.96 | 55.54 | 77.09 | 67.39 |
Shape estimate | 0.5548 | 0.5594 | 0.5979 | 0.6765 | 0.5362 | 0.6467 | 0.5427 | 0.709 |
Shape std. err | 0.3078 | 0.3091 | 0.2705 | 0.2875 | 0.318 | 0.3488 | 0.3709 | 0.4295 |
Asympt Var-Cov-Scale | 2624 | 2608 | 2418 | 2155 | 3595 | 3085 | 5944 | 4542 |
Asympt Var-Cov-Shape | 0.09473 | 0.09555 | 0.07319 | 0.08265 | 0.1011 | 0.1217 | 0.1376 | 0.1845 |
Deviance | 208.5939 | 208.5209 | 279.2341 | 277.3027 | 184.6582 | 182.8486 | 147.3715 | 145.0816 |
Penalized Deviance | 204.2876 | 204.1926 | 271.4901 | 269.1153 | 181.5982 | 179.5488 | 145.997 | 143.9223 |
AIC | 210.5939 | 210.5209 | 281.231 | 279.3027 | 186.6582 | 184.8486 | 149.3715 | 147.0816 |
Penalized AIC | 206.2876 | 206.1926 | 273.4911 | 271.1153 | 183.5982 | 181.5488 | 147.997 | 145.9223 |

To assess the performance of the threshold obtained when the GPD had a shape parameter, back-testing was done on the samples through the two models, and the results were as in

For 0 repetitions, the number of observations above the threshold were 16 in both models

This study improved the MPS model by introducing the frequency f into both the three-parameter model, Equations (23), (24) and (25), and the two-parameter model, Equations (39) and (40). In simulation, both the two-parameter and three-parameter improved MPS models yielded a higher threshold than the corresponding standard MPS models.

The authors declare no conflicts of interest regarding the publication of this paper.

Murage, P., Mung’atu, J. and Odero, E. (2019) Optimal Threshold Determination for the Maximum Product of Spacing Methodology with Ties for Extreme Events. Open Journal of Modelling and Simulation, 7, 149-168. https://doi.org/10.4236/ojmsi.2019.73008