
Spatial modeling has largely been applied in epidemiology and disease modeling. Methods such as generalized linear models (GLMs) are available for predicting claim frequencies. However, because policies are heterogeneous, these methods do not generate precise and accurate claim frequency predictions; such parametric statistical methods depend heavily on limiting assumptions (linearity, normality, independence among predictor variables, and a pre-existing functional form relating the criterion variable to the predictive variables). This study investigates how to derive a spatial nonparametric model estimator based on smoothing splines for predicting claim frequencies. The simulation results showed that the proposed estimator is more efficient for predicting claim frequencies than its kernel-based counterpart. The derived estimator was applied to a sample of 6500 observations obtained from Cooperative Insurance Company, Kenya, for the period 2018-2020, and the results showed that the proposed method performs better than the kernel-based counterpart. Notably, including the spatial effects significantly improves the estimator's prediction of claim frequency.

Spatial modeling has recently been applied in many fields, including epidemiology, public health, and the insurance sector. Poisson models, generalized linear models (GLMs), credibility models, and Bayesian models are the models most commonly used to predict claim frequencies. However, the available literature suggests that these models are relatively inflexible. Although GLMs provide accurate and fast analysis of insurance data, they rest on model assumptions, and an incorrect assumption can cause model misspecification and lead to erroneous results. Nonparametric models reduce this shortcoming of the standard parametric models because they make fewer assumptions, and they are therefore suitable for modeling insurance data that are nonlinear, as described by [

[

The main difference between this research and [

The paper is organized as follows: Section 2 describes the development of the model estimator based on smoothing splines; Section 3 presents the data description and the main results, namely the simulation study and the analysis of the CIC insurance claims data; Section 4 presents the conclusion and suggestions for further research.

The study proposes a nonparametric regression model to predict the number of claims $Y_i$, $i = 1, \dots, n$, observed in region $J$, where $X_i$ is the covariate vector for the $i$-th claim. The nonparametric form relaxes restrictive assumptions on the distribution of the number of claims, since the claims in each region $J$ have a nonlinear relationship with the covariates $X_i$.

The nonparametric form of the model is given by the general form [

$$y_i = g(x_i) + Z_i^T b + \varepsilon_i$$

where $g(\cdot)$ is an unknown nonparametric function used to model the fixed effects, and $Z_i^T b$ and $\varepsilon_i$ account for the random effects.

The form of $Z_i^T b = R_i$ is unknown. The main work of this study is therefore to estimate $R_i$ and then establish the functional form of $Z_i^T b$ that captures the spatial effects.

Let $i = (i_1, \dots, i_N)^T$ and $n = (n_1, \dots, n_N)^T$ be two $N$-dimensional vectors. The spatial model is assumed to be

$$Y_i = g(X_i) + R_i, \quad \operatorname{var}(R) = \Sigma, \quad i \in \Lambda_n = \{1, \dots, n_1\} \times \cdots \times \{1, \dots, n_N\} \tag{1}$$

where $i = (i_1, \dots, i_N)$ in $\Lambda_n$ is referred to as a site, $R_i$ accounts for the spatial effects (random effects), and the cardinality of $\Lambda_n$ is $|\Lambda_n| = n$ [

The spatial data are modelled as a finite realization of a vector stochastic process indexed by $i \in \Lambda_n$. The vector $R = (R_1, \dots, R_n)^T$ is assumed to follow a joint Gaussian distribution with $E(R_i) = 0$ known for all $i \in \Lambda_n$, and $\Sigma = [\rho(R_i, R_j)]$ is the unknown correlation coefficient matrix, which needs to be estimated. Here $X_i = (X_{i1}, \dots, X_{id}) \in \mathbb{R}^d$, $Y_i \in \mathbb{R}$, and $g(\cdot)$ is the unknown trend function.

The aim is to estimate $g(x)$ for a given $x = (x_1, \dots, x_d) \in \mathbb{R}^d$. The response variable $Y_i$ is the claim frequency, and $X_i$ is a six-dimensional vector consisting of the following explanatory variables: gender, claim amount, age of the policyholder, vehicle age, model of the vehicle, and age category of the policyholder.

To estimate $g(x)$ at a point $x \in \mathbb{R}^d$, for $X_i$ in the neighbourhood of $x$, $g$ can be approximated using a smoothing spline [

To obtain the smoothing spline estimator $\hat{g}(\cdot)$ of $g(\cdot)$, the study minimizes the criterion

$$\sum_{i=1}^{n} \left( Y_i - g(x_i) \right)^2 + \lambda \int \left( g''(x) \right)^2 \, dx \tag{2}$$

over the function $g$. This criterion trades off the least squares error of $g$ over $(x_i, y_i)$, $i = 1, \dots, n$, against a regularization term that grows large when $g$ is wiggly, that is, when its second derivative is large in magnitude. The coefficients are chosen to minimize Equation (3), which is a simplified form of Equation (2):

$$\frac{1}{n} \sum_{i=1}^{n} \left\{ Y_i - g(X_i; \beta) \right\}^2 + \lambda \beta^T \Omega \beta \tag{3}$$

which can be represented in matrix form as

$$\| Y - G\beta \|^2 + \lambda \beta^T \Omega \beta$$

where $G \in \mathbb{R}^{n \times n}$ is the basis matrix defined as

$$G_{ij} = \psi_j(x_i), \quad i, j = 1, \dots, n$$

and $\psi_1, \dots, \psi_n$ are the truncated power basis functions with knots at $x_1, \dots, x_n$, evaluated at the data values

$$\psi_j(x) = \begin{cases} x^{j}, & 0 \le j \le p \\ (x - N_{j+1-p})_+^{p}, & p + 1 \le j \le N \end{cases} \tag{4}$$

where $(x - N_{j+1-p})_+^{p} = \max(0, x - N_{j+1-p})^{p}$, $\phi$ is a compact interval, $p$ is the degree of the spline, and $j_1 < \cdots < j_{N-p}$ are fixed points (knots) in $\phi$.

$\Omega \in \mathbb{R}^{n \times n}$ is the penalty matrix defined as

$$\Omega_{ij} = \int \psi_i''(x) \, \psi_j''(x) \, dx, \quad i, j = 1, \dots, n$$

Given the optimal coefficients $\hat{\beta}$ minimizing (3) through penalized least squares, the smoothing spline estimator at $x$ is defined as

$$\hat{g}(x) = \sum_{j=1}^{n} \hat{\beta}_j \psi_j(x) \tag{5}$$

The penalty term $\lambda \beta^T \Omega \beta$ shrinks the components of the estimate $\hat{\beta}$ towards zero, and $\lambda \ge 0$ is the smoothing parameter.

Each computed coefficient $\hat{\beta}_j$ corresponds to a particular basis function $\psi_j$. The term $\beta^T \Omega \beta$ in (3) imparts more shrinkage on the coefficients $\hat{\beta}_j$ that correspond to wigglier functions $\psi_j(x)$. Hence, as $\lambda$ increases, the fit shrinks away from the wigglier basis functions.

As in least squares regression, the coefficients $\hat{\beta}$ minimizing (3) are

$$\hat{\beta} = (G^T G + \lambda \Omega)^{-1} G^T Y = (X^T X + n \lambda D)^{-1} X^T Y$$

where $X$ is the design matrix with entries $x_i$ for $i = 1, \dots, n$, $Y$ is the vector of response variables, $D$ is a diagonal matrix with $p + 1$ zeros on the diagonal followed by $N$ ones, and $n \lambda D$ is the penalty term.
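The closed-form solution above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes NumPy is available, and the function names, the sine test signal, the knot placement, and the two values of `lam` are all hypothetical choices made only to show how the penalty shrinks the knot coefficients.

```python
import numpy as np

def truncated_power_basis(x, knots, p=2):
    """Design matrix with columns 1, x, ..., x^p and (x - knot)_+^p per knot."""
    x = np.asarray(x, dtype=float)
    poly = np.vander(x, p + 1, increasing=True)                 # shape (n, p + 1)
    trunc = np.maximum(x[:, None] - knots[None, :], 0.0) ** p   # shape (n, K)
    return np.hstack([poly, trunc])

def fit_penalized_spline(x, y, knots, lam, p=2):
    """Solve (X'X + n * lam * D) beta = X'y, with D = diag(0, ..., 0, 1, ..., 1)."""
    X = truncated_power_basis(x, knots, p)
    n = len(y)
    # p + 1 zeros leave the polynomial part unpenalized; ones penalize the knots
    D = np.diag(np.r_[np.zeros(p + 1), np.ones(len(knots))])
    beta = np.linalg.solve(X.T @ X + n * lam * D, X.T @ y)
    return beta, X

# Hypothetical one-dimensional example data (assumption, not from the paper)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)
knots = np.linspace(0.05, 0.95, 15)

beta_small, X = fit_penalized_spline(x, y, knots, lam=1e-6)
beta_large, _ = fit_penalized_spline(x, y, knots, lam=10.0)
fitted = X @ beta_small  # fitted values for the lightly penalized fit
```

Comparing `beta_small[3:]` with `beta_large[3:]` shows the knot coefficients being pulled towards zero as the smoothing parameter grows, while the first `p + 1` polynomial coefficients are left unpenalized.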

A smoothing spline can be seen as a linear smoother. With $k(x) = (\psi_1(x), \dots, \psi_n(x))^T$, Equation (5) can be represented as

$$\hat{g}(x) = k(x)^T \hat{\beta} = k(x)^T (X^T X + n \lambda D)^{-1} X^T Y \tag{6}$$

which is a linear combination of the points $y_i$, $i = 1, \dots, n$. The smoothing parameter $\lambda$ is estimated using the Generalized Cross Validation (GCV) method, given by

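$$\mathrm{GCV}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y(z_i) - \hat{Y}_{\lambda}^{-i}(z_i)}{1 - (p + \operatorname{tr}(S_\lambda))/n} \right)^2 \tag{7}$$

where $Y(z_i)$ is the observation at point $z_i$, $\hat{Y}_{\lambda}^{-i}(z_i)$ is the value predicted by a smoothing spline model fitted to the data with the $i$-th observation left out, and $\operatorname{tr}(S_\lambda)$ measures the degrees of freedom of the smoother $S_\lambda$.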
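Choosing $\lambda$ by GCV can be sketched as a grid search. The sketch below is an assumption-laden illustration rather than the exact criterion in Equation (7): it uses the widely used residual form of GCV, with full-data residuals and $\operatorname{tr}(S_\lambda)/n$ in the denominator, and the data, knots, grid, and function names are hypothetical.

```python
import numpy as np

def basis(x, knots, p=2):
    """Truncated power basis: 1, x, ..., x^p, then (x - knot)_+^p per knot."""
    poly = np.vander(x, p + 1, increasing=True)
    trunc = np.maximum(x[:, None] - knots[None, :], 0.0) ** p
    return np.hstack([poly, trunc])

def gcv_score(X, y, lam, p=2):
    """Return (GCV score, effective degrees of freedom) for one lambda."""
    n, m = X.shape
    D = np.diag(np.r_[np.zeros(p + 1), np.ones(m - p - 1)])
    # Smoother matrix S_lambda maps the responses to the fitted values
    S = X @ np.linalg.solve(X.T @ X + n * lam * D, X.T)
    resid = y - S @ y
    edf = np.trace(S)
    # Assumed residual form of GCV (not the leave-one-out form of Eq. (7))
    return np.mean((resid / (1.0 - edf / n)) ** 2), edf

# Hypothetical example data (assumption, not from the paper)
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 150))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)
X = basis(x, np.linspace(0.05, 0.95, 20))

grid = 10.0 ** np.arange(-8, 3)                  # candidate lambdas 1e-8 ... 1e2
scores = [gcv_score(X, y, lam)[0] for lam in grid]
best_lam = grid[int(np.argmin(scores))]          # lambda with the smallest GCV
```

The effective degrees of freedom returned alongside the score fall as `lam` increases, which is the sense in which the trace of the smoother matrix measures model complexity.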

The coefficient of determination $R^2$ is used to assess the performance of the predictor function, as proposed by [

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left[ g(x_i) - \hat{g}(x_i) \right]^2}{\sum_{i=1}^{n} \left[ g(x_i) - \bar{g} \right]^2} \tag{8}$$

where $\bar{g}$ is the sample mean of $g(x_i)$, $i = 1, \dots, n$.
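As a small worked example of this criterion, the following sketch computes $R^2$ for a toy set of values. The function name and the four numbers are hypothetical and serve only to make the arithmetic concrete.

```python
def r_squared(observed, predicted):
    """R^2 = 1 - SS_res / SS_tot, following the form of Equation (8)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical values: SS_res = 0.10 and SS_tot = 5.0, so R^2 = 0.98
g_true = [0.0, 1.0, 2.0, 3.0]
g_hat = [0.1, 0.9, 2.2, 2.8]
r2 = r_squared(g_true, g_hat)
```

A value close to 1 means the squared prediction errors are small relative to the spread of the target values around their mean.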

After estimating the function $g(\cdot)$, $R_i$ is estimated from (1) as $\hat{R}_i = Y_i - \hat{g}(X_i)$. Since $\Sigma$ in model Equation (1) is unknown, we assume that $R_i$, $i = 1, 2, \dots, n$, is a second-order stationary and isotropic process (that is, its dependence structure does not depend on direction).

Before prediction can be performed on spatial data sets, the variogram is usually estimated at various lags and a nonparametric model is fitted to those estimates.

Let $C(h)$ and $2\gamma(h)$ be the covariogram and variogram of the process, where $h$ represents the distance between two points at which the process is observed [

$$C(h) = C(0) - \gamma(h) \tag{9}$$

where $C(0) = \sigma^2 = \operatorname{var}(Y(z))$ and $Y(z)$ is the value of the process at spatial location $z$ within region $C$.

$$\lim_{h \to \infty} C(h) = 0$$

implies

$$\lim_{h \to \infty} \gamma(h) = \operatorname{Var}(Y(z)) = C(0)$$

For the variogram to be valid, the condition

$$\lim_{h \to \infty} \frac{2\gamma(h)}{h^2} = 0$$

must be met [

$\Sigma = [\rho(R_i, R_j)] = [C(\| z_i - z_j \|)/\sigma^2]$, where $z_i$ and $z_j$ are the spatial locations associated with the error values $R_i$ and $R_j$. Thus, to estimate $\Sigma$ it is sufficient to estimate $\gamma(h)$ [

$$2\hat{\gamma}(h) = \frac{1}{N(h)} \sum_{S(h)} \left[ r(z_i) - r(z_j) \right]^2 \tag{10}$$

where $S(h) = \{ (z_i, z_j) : |z_i - z_j| = h \}$, $h \in \mathbb{R}^d$, and $N(h)$ is the number of distinct pairs in $S(h)$. Since $r(z_i)$, the error at location $z_i$, is unobserved, this quantity has to be estimated as well.
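The empirical variogram can be sketched as follows, using estimated residuals in place of the unobserved errors. This is an illustrative sketch only: grouping pairs into distance bins rather than requiring an exact distance $h$ is an assumption made here for practicality, and the coordinates, residuals, and bin edges are hypothetical.

```python
import math

def empirical_variogram(coords, resid, bin_edges):
    """Estimate gamma(h) per distance bin from residuals at known locations."""
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):                 # each distinct pair once
            h = math.dist(coords[i], coords[j])
            for b in range(n_bins):
                if bin_edges[b] < h <= bin_edges[b + 1]:
                    sums[b] += (resid[i] - resid[j]) ** 2
                    counts[b] += 1
                    break
    # 2 * gamma_hat = sum / N(h), hence gamma_hat = sum / (2 * N(h))
    gamma = [s / (2 * c) if c else float("nan") for s, c in zip(sums, counts)]
    return gamma, counts

# Hypothetical residuals at four locations on a line (assumption)
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
resid = [1.0, 2.0, 4.0, 7.0]
gamma, counts = empirical_variogram(coords, resid, [0.5, 1.5, 2.5, 3.5])
```

In this toy layout there are three pairs at distance 1, two at distance 2, and one at distance 3, so the estimates for larger lags rest on fewer pairs, which is typical of empirical variograms.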

Since the variogram $\hat{\gamma}(h)$ in Equation (10) has to be estimated with a nonparametric approach [

$$\gamma(h) = \int_0^{\infty} \left( 1 - \omega_d(ht) \right) dM(t) \tag{11}$$

where $M(t)$ is a nonnegative, bounded, nondecreasing function for nodes (locations of the jumps) $t \ge 0$, and $\omega_d$ is a basis for functions in $\mathbb{R}^d$ ($d$ is the dimension of the spatial domain $D$) given by

$$\omega_d(ht) = \left( \frac{2}{ht} \right)^{(d-2)/2} \Gamma(d/2) \, J_{(d-2)/2}(ht)$$

where $\Gamma(\cdot)$ is the gamma function and $J_{\nu}(\cdot)$ is the Bessel function of the first kind of order $\nu$. Some familiar examples of $\omega_d$ are $\omega_1(ht) = \cos(ht)$ and $\omega_2(ht) = J_0(ht)$. Here $\omega_3(ht) = \sin(ht)/(ht)$ is chosen, which yields a nonparametric estimate that is conditionally negative definite for spatial data in one to three dimensions.
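When $M(t)$ is a step function with finitely many jumps, the integral in Equation (11) reduces to a finite sum over the nodes. The sketch below evaluates that sum with the $\omega_3$ basis; the nodes and jump sizes are hypothetical fixed values chosen for illustration, whereas in practice they would have to be estimated from the empirical variogram subject to the jumps being nonnegative.

```python
import math

def omega3(x):
    """omega_3(x) = sin(x) / x, with the limiting value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x

def variogram_model(h, nodes, jumps):
    """gamma(h) = sum_j m_j * (1 - omega_3(h * t_j)) for a step function M."""
    return sum(m * (1.0 - omega3(h * t)) for t, m in zip(nodes, jumps))

# Hypothetical nodes t_j and nonnegative jumps m_j (assumption)
nodes = [0.5, 1.0, 2.0]
jumps = [0.2, 0.5, 0.3]
gamma_values = [variogram_model(h, nodes, jumps) for h in (0.0, 1.0, 5.0, 1e6)]
```

Because each term $1 - \sin(x)/x$ is nonnegative and tends to 1 for large $x$, the modelled variogram starts at zero and levels off at the sum of the jumps, which plays the role of the sill $C(0)$.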

The performance of the estimator (11) is assessed using the integrated squared error (ISE) [

$$\mathrm{ISE}(\gamma) = \int_{h_1}^{h_k} \left\{ \hat{\gamma}(h) - \gamma(h) \right\}^2 dh \tag{12}$$

where $h_1$ and $h_k$ are the smallest and largest distances for which variogram estimates are available [

Model (1) can therefore be represented as

$$Y(z_i) = g(X_i(z_i)) + R(z_i), \quad i = 1, \dots, n \tag{13}$$

where $Y(z_i)$, $i = 1, \dots, n$, are the observations (claims) in region $z_i$ associated with the independent variables $X_i(z_i)$ in region $z_i$, $R(z_i)$ is the unobserved error in region $z_i$, and $g(\cdot)$ is the function estimated in (6).

To evaluate the proposed method, $R^2$ is used to assess its prediction accuracy:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left[ Y(z_i) - \hat{Y}(z_i) \right]^2}{\sum_{i=1}^{n} \left[ Y(z_i) - \bar{Y} \right]^2} \tag{14}$$

The study used motor third-party liability data for 2018-2020 from the Cooperative Insurance Company (CIC). The data include 6500 policies: many policies with total claim sizes other than zero, together with an appropriate number of policies without any claims. The following policy data were used: the region where the policy was taken, age, gender, type of vehicle, number of claims per policy, years of policy ownership, claim amount, number of insured cases for a user, and average claim size. During preparation, the data were cleaned and missing values were imputed. Age was categorized into Young (up to the age of 25), Middle (aged 25 - 50), and Old (over the age of 50). Policies with extremely low and extremely high average claim sizes were removed, and categorical variables with multiple categories were replaced with dummy (indicator) variables.

No. of Claims | Freq of Observations | % of Observations |
---|---|---|
0 | 4015 | 61.8 |
1 | 1967 | 30.3 |
2 | 431 | 6.6 |
3 | 71 | 1.1 |
4 | 16 | 0.2 |
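The age categorization and dummy coding described in the data preparation can be sketched as below. This is a hypothetical illustration: how the boundary ages 25 and 50 are assigned, the choice to drop the first level as a reference category, and all names are assumptions not stated in the text.

```python
def age_category(age):
    """Map an age to a category; boundary handling is an assumption."""
    if age <= 25:
        return "young"
    if age <= 50:
        return "middle"
    return "old"

def dummy_encode(values, drop_first=True):
    """Replace a categorical variable with 0/1 indicator variables."""
    levels = sorted(set(values))
    # Dropping one level keeps it as the reference category (assumption)
    kept = levels[1:] if drop_first else levels
    return [{f"is_{lvl}": int(v == lvl) for lvl in kept} for v in values]

# Hypothetical policyholder ages
ages = [22, 25, 37, 50, 63]
categories = [age_category(a) for a in ages]
dummies = dummy_encode(categories)
```

With three age categories and one reference level dropped, each policy is represented by two indicator columns, and a row of zeros identifies the reference category.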

This section describes the simulation study and its results for the proposed method. We simulate spatial data with $n = 100$ observations. The simulation is designed so that the simulated data mimic the real claims dataset, so that the results can be used to evaluate the performance of our method in data analysis. Sixty-five spatial sampling locations were selected randomly and denoted by $z_1, \dots, z_n$. The responses $Y(z_i)$, $i = 1, \dots, n$, are the observations and were simulated from the spatial nonparametric model (13) with $p = 2$:

$$Y(z) = g(x_i(z)) + R(z)$$

$R(z)$ is the term for the spatial effects between $z_i$ and $z_j$ in 2-dimensional space [

The $\mathrm{ISE}(\gamma)$, defined as $\mathrm{ISE}(\gamma) = \int_{h_1}^{h_k} \{ \hat{\gamma}(h) - \gamma(h) \}^2 \, dh$, was approximated numerically from the simulated data for the proposed estimator (11) and for the Nadaraya-Watson (NW) kernel estimator.
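One simple way to approximate such an integral numerically is the trapezoidal rule. The sketch below is only an illustration of that idea: the exponential form used for the "true" variogram, the constant offset used as a stand-in estimate, the integration limits, and the grid size are all hypothetical and are not taken from the study.

```python
import math

def ise(gamma_hat, gamma_true, h_min, h_max, n_grid=2001):
    """Trapezoidal approximation of the integrated squared error on [h_min, h_max]."""
    step = (h_max - h_min) / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        h = h_min + k * step
        sq_err = (gamma_hat(h) - gamma_true(h)) ** 2
        weight = 0.5 if k in (0, n_grid - 1) else 1.0   # half weight at the ends
        total += weight * sq_err
    return total * step

# Hypothetical variograms: an exponential model and an estimate offset by 0.1
def gamma_true(h):
    return 1.0 - math.exp(-h)

def gamma_est(h):
    return gamma_true(h) + 0.1

ise_value = ise(gamma_est, gamma_true, 1.0, 4.0)   # 0.1^2 * (4 - 1) = 0.03
```

A smaller ISE indicates that the estimated variogram tracks the target more closely over the range of lags considered.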

To assess how well the proposed method performs, we compare the proposed method, under which $R = \int_0^{\infty} (1 - \omega_d(ht)) \, dM(t)$, with the method under which the spatial component ($R$) is based on kernel estimation. We calculated the MSE and the $R^2$ of the estimators from 100 simulations; the results are reported in the table of MSE and $R^2$ values by sample size. The $R^2$ values for the proposed method were larger, ranging from 0.7003 to 0.99963 across all sample sizes, compared with those of the kernel-based estimator, which range from 0.6751 to 0.9694. The results thus demonstrate the superior performance of the proposed method over the kernel-based estimator.

The results in

Based on performance of the proposed method, the method was applied to the simulated data to check its performance in prediction of future values.

Estimator | h = 1 | h = 2 | h = 3 | h = 4 |
---|---|---|---|---|
NW kernel | 0.53883 | 0.53407 | 0.5177 | 0.40779 |
Proposed estimator | 0.30731 | 0.28762 | 0.26737 | 0.23037 |

Method | MSE (n = 10) | R^{2} (n = 10) | MSE (n = 100) | R^{2} (n = 100) | MSE (n = 400) | R^{2} (n = 400) | MSE (n = 1000) | R^{2} (n = 1000) |
---|---|---|---|---|---|---|---|---|
N (kernel) | 0.0308 | 0.6751 | 0.0285 | 0.7585 | 0.0210 | 0.9691 | 0.0176 | 0.9694 |
Proposed Method | 0.0221 | 0.7003 | 0.0118 | 0.8217 | 0.0105 | 0.9962 | 0.0102 | 0.99963 |

No. of Claims | Freq of Observations | % of Observations |
---|---|---|
1 | 998 | 99.8 |
2 | 2 | 0.2 |

The predicted values generated by the proposed method as presented in

The study considered claims data from CIC insurance observed in different parts of 7 counties of Kenya to demonstrate the performance of the proposed method. The main interest of this study was predicting claim frequencies, using a set of 6500 observations. Let $Y_i$ denote the claim frequency, and let $X_i = (X_1, \dots, X_6)^T$ be a vector consisting of the following explanatory variables: gender, claim amount, age of the policyholder, vehicle age, model of the vehicle, and age category of the policyholder. Using the estimated model (13), we predict claim frequencies. The observations were from a random process over a countable sample of spatial locations. The claim data at a particular location typically represent the entire region (

Using the proposed method future claim frequencies were predicted and the results were presented in

From the prediction results, $R^2$ values were computed using Equation (14) to assess the performance of the two methods, with the results presented in

The $R^2$ for N (kernel) was 0.543 and that for the proposed method was 0.566, which shows that the proposed method has a higher prediction accuracy than the kernel-based estimator. The study therefore concluded that the proposed method is more efficient than the N (kernel) model; that is, its predicted values were closer to the observed claims.

No. of Claims | Freq of Observations | % of Observations |
---|---|---|
1 | 3145 | 89.52 |
2 | 302 | 8.60 |
3 | 50 | 1.42 |
4 | 16 | 0.50 |

Method | N (kernel) | Proposed Method |
---|---|---|
R^{2} | 0.543 | 0.566 |

The idea of deriving an appropriate estimator for predicting claim frequencies in the insurance industry has gained growing interest in finance and statistical research. Many researchers rely heavily on parametric estimators; however, insurance datasets exhibit some non-linearity. Hence, researchers in statistics and econometrics are currently developing nonparametric models that incorporate spatial effects to improve on predictions from existing parametric models, such as aggregate claim models and GLMs, which are more restrictive in the transformed mean of the response; nonparametric methods provide a more flexible approach to prediction. The study proposed a spatial nonparametric estimator, based on splines, for predicting claim frequencies in motor insurance.

The simulation study showed that the proposed method performs better than the kernel-based estimator: the mean squared error values of the proposed method were smaller than those of the kernel estimator, with correspondingly higher values of R-squared, particularly in the presence of spatial dependence. The case study findings also showed that the proposed method performs better than the kernel-based estimator in predicting future claim frequencies. Therefore, compared with the kernel-based estimator, the proposed method provides a more efficient prediction method for motor insurance claim data and ultimately leads to more accurate predictions.

Suggestions

Additional exogenous variables, such as environmental and other institutional factors, may affect claim frequencies; a more robust spatial estimator therefore needs to be constructed using the proposed idea to investigate how these factors affect claim frequencies. Further research can also be done on the theoretical properties of the proposed model estimator. In addition, this study assumed that the errors were correlated; future studies could consider the case of an uncorrelated error structure.

Sincere thanks to my supervisors Dr. Kube Anada and Dr. Thomas Mageto for their professional contribution and performance, and special thanks to my parents for their moral support and rare attitude of high quality.

The authors declare no conflicts of interest.

Kipngetich, G., Kube, A. and Mageto, T. (2021) A Spatial-Nonparametric Approach for Prediction of Claim Frequency in Motor Insurance. Open Journal of Statistics, 11, 493-505. https://doi.org/10.4236/ojs.2021.114031