Goodness-of-Fit Test for Non-Stationary and Strongly Dependent Samples

Abstract

In this article we develop a goodness-of-fit test, of the Kolmogorov-Smirnov type, for equally distributed, but non-stationary and strongly dependent, data. The test is based on the asymptotic behavior of the empirical process, which is much more complex than in the classical case. Applications to simulated data and a discussion of the obtained results are provided. This is, to the best of our knowledge, the first result providing a general goodness-of-fit test for non-weakly dependent data.

Share and Cite:

Crisci, C., Perera, G. and Sampognaro, L. (2023) Goodness-of-Fit Test for Non-Stationary and Strongly Dependent Samples. Advances in Pure Mathematics, 13, 226-236. doi: 10.4236/apm.2023.135016.

1. Introduction

Kolmogorov-Smirnov (KS, for short, in the sequel) is one of the best-known goodness-of-fit tests for iid samples following a continuous distribution. For a small or moderate sample size, the critical values of the KS statistic, given the level of significance (or the p-value for a given KS statistic), can be computed exactly [1]. For large sample sizes, the asymptotic behavior of the empirical process, which we will recall later on, provides an approximation to the critical values. Several extensions of KS-type tests from the classical iid case to weakly dependent data have been developed, and there are substantial recent contributions in this regard (see, for instance, [2] [3]). However, the literature does not provide such a test for strongly dependent (and non-stationary) data, which is of deep interest for some applications, as we will see later on.

In previous work, we have used a model for strongly dependent and non-stationary data that can be applied in a wide range of fields and that allows one to develop different techniques, such as high-level exceedances [4], statistics on the mean of a random field [5], non-parametric regression [6], or the asymptotic behavior of extremes [7].

The basic idea behind such a model is that the data depend on two independent components: one is merely iid random noise, and the other, which specifies the “state” of the system under observation, is categorical but non-stationary and strongly dependent. For instance, if our data is the maximum wind speed in a given 10-minute period, different combinations of meteorological variables define a finite number of possible “states” of the atmosphere. These states do not determine the wind speed, but they have a clear influence on the maximum speed. In general, atmosphere states are not stationary and may present strong correlations with data from many years ago, while each year corresponds to 52,560 ten-minute periods; therefore, a strong dependency structure must be taken into account.

As mentioned before, the KS test and other goodness-of-fit tests are based on the theory of empirical processes [8]. In particular, the statistic of the KS test leads to consistency against any fixed alternative, thanks to the first theorem concerning the asymptotic behavior of the empirical process for large sample sizes: the well-known Glivenko-Cantelli theorem. The computation of critical values for the KS test in the case of large sample sizes relies on the second fundamental theorem of the asymptotic theory of empirical processes, namely, the Donsker invariance principle [9]. Finally, for some more intricate asymptotic computations, the so-called Hungarian embedding [10] [11] [12] [13] [14] due to Komlós-Major-Tusnády (KMT, for short, in the sequel) is a powerful tool.

Let us now recall these fundamental results in more detail. Consider an iid sequence of real random variables $X_1, \dots, X_n$, such that $X_1$ follows the continuous distribution $F$, and denote by $F_n$ the empirical distribution of the sample $X_1, \dots, X_n$, defined by

$$F_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \le t\}}, \quad t \in \mathbb{R}.$$

The Glivenko-Cantelli theorem establishes that

$$\sup_{t} \left| F_n(t) - F(t) \right| \xrightarrow[n \to \infty]{a.s.} 0.$$

If we denote by $b$ the Brownian bridge process, defined as the continuous, centered Gaussian stochastic process with continuous parameter in $[0,1]$ characterized by the covariance structure $E(b_s b_t) = s - st$, $0 \le s \le t \le 1$, then the Donsker invariance principle shows that $\sqrt{n}\,(F_n(t) - F(t)) \xrightarrow[n]{w} b_{F(t)}$, where “$\xrightarrow{w}$” denotes weak convergence as a stochastic process (in the Prohorov metric), which in turn implies that

$$P\left( \sqrt{n}\, \sup_t \left| F_n(t) - F(t) \right| \ge x \right) \xrightarrow[n]{} P\left( \sup_t \left| b_{F(t)} \right| \ge x \right) = P\left( \sup_{u \in [0,1]} \left| b_u \right| \ge x \right) := Q(x), \quad \forall x \ge 0,$$

where $Q$ is the tail of the well-known Kolmogorov-Smirnov distribution, allowing the computation of the critical values $x$, given a level of significance, for the KS test when $n$ is large.
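For the classical iid case, this recipe can be sketched numerically (a minimal illustration, not code from the paper: it computes the KS statistic against a standard normal $F$ and compares $\sqrt{n}\,D_n$ with the asymptotic 5% critical value derived from the leading term $Q(x) \approx 2e^{-2x^2}$):

```python
import numpy as np
from math import erf, sqrt, log

def normal_cdf(t):
    # Standard normal CDF, Phi(t)
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def ks_statistic(sample, cdf):
    # D_n = sup_t |F_n(t) - F(t)|, evaluated at the order statistics
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    F = np.array([cdf(v) for v in x])
    i = np.arange(1, n + 1)
    return max(np.max(i / n - F), np.max(F - (i - 1) / n))

rng = np.random.default_rng(0)
n = 2000
d_true = ks_statistic(rng.standard_normal(n), normal_cdf)         # H0 true
d_false = ks_statistic(rng.standard_normal(n) + 3.0, normal_cdf)  # H0 false

# Asymptotic critical value: solve 2 exp(-2 x^2) = alpha
alpha = 0.05
crit = sqrt(-log(alpha / 2.0) / 2.0)   # about 1.358
reject_false = sqrt(n) * d_false >= crit
```

Under the true model, $\sqrt{n}\,D_n$ typically falls below 1.358; under the shifted alternative, it lies far above it, illustrating the consistency provided by Glivenko-Cantelli.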

Finally, KMT provides a sequence of Brownian bridges $(b^m)_m$ and a finite, non-negative random variable $C$ such that

$$F_n(t) = F(t) + \frac{b^{n}_{F(t)}}{\sqrt{n}} + R_n(t), \quad \text{with } \sup_t \left| R_n(t) \right| \le C\, \frac{\log(n)}{n}.$$

2. Main Result

We shall consider the following model: $X_1, \dots, X_n$ will be our data, with $X_i = f(\xi_i, Y_i)$, where $(\xi_i)_i$ and $Y = (Y_i)_i$ are independent of each other, $(\xi_i)_i$ is iid, and $Y = (Y_i)_i$ satisfies:

$$Y_i \in \{1, \dots, k\} \quad \forall i,$$

$\forall j = 1, \dots, k$ there exists a random variable $\tau_j > 0$ such that

(H1) $\tau_n(j) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{Y_i = j\}} \xrightarrow[n]{a.s.} \tau_j(Y)$, where $\sum_{j=1}^{k} \tau_j = 1$.

Thinking of $Y_i$ as the state of the system at time $i$, even if the process $Y$ is not stationary, assumption (H1) means that the observed frequencies of each state converge on average (which holds true under seasonal effects or monotone trends); but since $Y$ also exhibits strong dependence, the corresponding limits are random variables.

The empirical distribution of our sample is:

$$F_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \le t\}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(\xi_i, Y_i) \le t\}} = \sum_{j=1}^{k} \left[ \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(\xi_i, j) \le t\}} \mathbf{1}_{\{Y_i = j\}} \right] \quad (1)$$

Let us define, for any $n$ and $j = 1, \dots, k$,

$$A_n(j) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(\xi_i, j) \le t\}} \mathbf{1}_{\{Y_i = j\}} \quad (2)$$

Let us denote by $S$ the space of all sequences taking values in $\{1, \dots, k\}$. Considering (H1), the subset of $S$ defined by:

$$\Omega_Y = \left\{ (y_i)_i \in S \;:\; \tau_n(j, y) := \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{y_i = j\}} \xrightarrow[n]{} \tau_j(y) \;\; \forall j = 1, \dots, k \right\}$$

fulfills that,

$$P_Y(\Omega_Y) = P(Y \in \Omega_Y) = 1 \quad (3)$$

therefore, by conditioning $A_n(j)$ on $Y = y$, with $y = (y_i)_i$, we can assume that $y \in \Omega_Y$ and, in such a case, by the independence of $(\xi_i)_i$ and $Y$:

$$\left( A_n(j) \right)_{j=1,\dots,k} \,/\, Y = y \;\sim\; \left( \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(\xi_i, j) \le t\}} \mathbf{1}_{\{y_i = j\}} \right)_{j=1,\dots,k} \quad (4)$$

Remark 1

It is a key fact that the $k$ subsamples $(f(\xi_i, j))_{\{i \,:\, y_i = j\}}$, $j = 1, \dots, k$, are independent, and each one has size $n\,\tau_n(j, y)$.

If we denote by $F_j$ the distribution of $f(\xi_0, j)$ and assume that $F_j$ is continuous, we have that:

$$\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(\xi_i, j) \le t\}} \mathbf{1}_{\{y_i = j\}} = \frac{n\,\tau_n(j, y)}{n} \cdot \frac{1}{n\,\tau_n(j, y)} \sum_{i=1}^{n} \mathbf{1}_{\{f(\xi_i, j) \le t\}} \mathbf{1}_{\{y_i = j\}} = \tau_n(j, y)\, F^{j}_{n\,\tau_n(j, y)}(t) \quad (5)$$

where $F^{j}_{n\,\tau_n(j, y)}$ is the empirical distribution of the $j$-th subsample of Remark 1, whose elements follow the distribution $F_j$; these subsamples are independent among them. Then, from Glivenko-Cantelli,

$$\tau_n(j, y)\, F^{j}_{n\,\tau_n(j, y)}(t) \xrightarrow[n]{a.s.} \tau_j(y)\, F_j(t) \quad \forall t, \; j = 1, \dots, k \quad (6)$$

and

$$\sup_t \left| \tau_n(j, y)\, F^{j}_{n\,\tau_n(j, y)}(t) - \tau_j(y)\, F_j(t) \right| \xrightarrow[n]{a.s.} 0 \quad (7)$$

Then, applying Equations (1) to (7), we have that:

$$P\left( \sup_t \left| F_n(t) - \sum_{j=1}^{k} \tau_j F_j(t) \right| \xrightarrow[n]{} 0 \right) = \int_{\Omega_Y} P\left( \sup_t \left| F_n(t) - \sum_{j=1}^{k} \tau_j F_j(t) \right| \xrightarrow[n]{} 0 \,/\, Y = y \right) dP_Y(y)$$
$$= \int_{\Omega_Y} P\left( \sup_t \left| \sum_{j=1}^{k} \left( \tau_n(j, y)\, F^{j}_{n\,\tau_n(j, y)}(t) - \tau_j F_j(t) \right) \right| \xrightarrow[n]{} 0 \,/\, Y = y \right) dP_Y(y)$$
$$\ge \int_{\Omega_Y} P\left( \sum_{j=1}^{k} \sup_t \left| \tau_n(j, y)\, F^{j}_{n\,\tau_n(j, y)}(t) - \tau_j F_j(t) \right| \xrightarrow[n]{} 0 \,/\, Y = y \right) dP_Y(y) = 1$$

(since each value of the last integrand equals one by (7)).

In conclusion we get:

Theorem 1

Under the previous hypotheses,

$$\sup_t \left| F_n(t) - \sum_{j=1}^{k} \tau_j F_j(t) \right| \xrightarrow[n]{a.s.} 0$$

Remark 2

It should be noticed that

$$\sum_{j=1}^{k} \tau_j F_j(t) \quad (8)$$

is a random mixture of the distributions F j , j = 1 , , k .

Let us look more closely at a very simple case. Assume that $k = 2$ (therefore, $\tau_2 = 1 - \tau_1$), and that $\tau_1$ takes the values 0 or 1 with $P(\tau_1 = 0) = p$, $P(\tau_1 = 1) = 1 - p$, where $0 < p < 1$.

Then, with probability $p$ (when $\tau_1 = 0$) the random mixture (8) equals $F_2(t)$, and with probability $1 - p$ (when $\tau_1 = 1$) it equals $F_1(t)$; hence, (8) is just an ordinary mixture of $F_1$ and $F_2$.
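This binary case is easy to simulate (an illustrative sketch, not from the paper, with hypothetical choices $p = 0.3$, $F_1 = N(0,1)$, $F_2 = N(3,1)$). Note that the whole sample shares the single draw of $\tau_1$, unlike iid sampling from an ordinary mixture:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
p = 0.3    # P(tau_1 = 0), illustrative
n = 5000

# One realization of the random mixture: draw tau_1 once for the whole sample.
tau1 = 0 if rng.random() < p else 1
mu = 0.0 if tau1 == 1 else 3.0          # F1 = N(0,1) if tau_1 = 1, else F2 = N(3,1)
sample = rng.standard_normal(n) + mu

def limit_cdf(t):
    # The realized limiting CDF (8) for this particular sample
    return 0.5 * (1.0 + erf((t - mu) / sqrt(2.0)))
```

Across independent realizations, the realized limiting CDF is $F_2$ with probability $p$ and $F_1$ with probability $1 - p$, which is exactly the "random mixture collapsing to an ordinary mixture" phenomenon described above.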

The preceding result shows that a KS-type test will be consistent against any given alternative in this context; but to implement the test, computing the critical value for a given significance level (or p-values), we need a refinement of Theorem 1 providing the asymptotic distribution of the test statistic. This will be obtained in Theorem 2.

Given $j = 1, \dots, k$ fixed, the sequence $(f(\xi_i, j))_i$ is iid with distribution $F_j$; therefore, denoting by $F^{j}_n$ its corresponding empirical distribution, that is:

$$F^{j}_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(\xi_i, j) \le t\}},$$

then, from KMT, there exists a sequence of Brownian bridges $(b^{m, j})_m$ such that:

$$F^{j}_n(t) = F_j(t) + \frac{b^{n, j}_{F_j(t)}}{\sqrt{n}} + R^{j}_n(t),$$

where $\sup_t | R^{j}_n(t) | \le C_j \frac{\log n}{n}$ and $C_j$ is a finite and non-negative random variable.

Remark 3

Since the sequence of bridges $(b^{m, j})_m$ originates from $(f(\xi_i, j))_i$, it depends on $(\xi_i)_i$, which is independent of $Y = (Y_i)_i$; therefore, it must be taken into account that all the bridges $(b^{m, j})_m$, $j = 1, \dots, k$, are independent of $Y$.

Let us then consider $x \ge 0$ and compute:

$$P\left( \sqrt{n}\, \sup_t \left| F_n(t) - \sum_{j=1}^{k} \tau_j F_j(t) \right| \ge x \right) = \int_{\Omega_Y} P\left( \sqrt{n}\, \sup_t \left| F_n(t) - \sum_{j=1}^{k} \tau_j F_j(t) \right| \ge x \,/\, Y = y \right) dP_Y(y) \quad (9)$$

Now, given that Y = y = ( y i ) i ,

$$F_n(t) = \frac{1}{n} \sum_{j=1}^{k} \sum_{i=1}^{n} \mathbf{1}_{\{f(\xi_i, j) \le t\}} \mathbf{1}_{\{y_i = j\}} = \sum_{j=1}^{k} \tau_n(j, y)\, \frac{1}{n\,\tau_n(j, y)} \sum_{i=1}^{n} \mathbf{1}_{\{f(\xi_i, j) \le t\}} \mathbf{1}_{\{y_i = j\}} \quad (10)$$

But (10), as a stochastic process, has the same distribution as:

$$\sum_{j=1}^{k} \tau_n(j, y)\, \frac{1}{n\,\tau_n(j, y)} \sum_{i=1}^{n\,\tau_n(j, y)} \mathbf{1}_{\{f(\xi^{j}_i, j) \le t\}}$$

where $(\xi^{j}_i)_i$ is iid with the same distribution as $\xi_0$ and such that, when $j$ varies, the sequences $(\xi^{j}_i)_i$ are independent among them.

If we now return to Remark 3, building the Hungarian embedding for each $(\xi^{j}_i)_i$, we may assume that the KMT representation for the empirical distribution holds with a sequence of Brownian bridges $(b^{m, j})_m$, $j = 1, \dots, k$, that are not only independent of $Y$ but also independent among them when $j$ varies. Therefore, since keeping the distribution unchanged does not affect the probabilities, we have that (9) equals:

$$\int_{\Omega_Y} P\left( \sqrt{n}\, \sup_t \left| \sum_{j=1}^{k} \left( \tau_n(j, y)\, F^{j}_{n\,\tau_n(j, y)}(t) - \tau_j F_j(t) \right) \right| \ge x \,/\, Y = y \right) dP_Y(y)$$
$$= \int_{\Omega_Y} P\left( \sqrt{n}\, \sup_t \left| \sum_{j=1}^{k} \tau_n(j, y) \left( F_j(t) + \frac{b^{n\,\tau_n(j, y),\, j}_{F_j(t)}}{\sqrt{n\,\tau_n(j, y)}} + R^{j}_{n\,\tau_n(j, y)}(t) \right) - \tau_j F_j(t) \right| \ge x \,/\, Y = y \right) dP_Y(y) \quad (11)$$

Considering in (11) that the terms $R^{j}_{n\,\tau_n(j, y)}$ are negligible and that, as indicated above, the distribution of the sum as a process (and therefore the probability) is not changed if, instead of $(b^{n\,\tau_n(j, y),\, j})_{j=1,\dots,k}$, we take $(b^{j})_{j=1,\dots,k}$ Brownian bridges independent among them and of $Y$ (and therefore of the $\tau_n(j, \cdot)$ and the $\tau_j$), we have that (11) equals:

$$\int_{\Omega_Y} P\left( \sup_t \left| \sqrt{n} \sum_{j=1}^{k} \left( \tau_n(j, y) - \tau_j \right) F_j(t) + \sum_{j=1}^{k} b^{j}_{F_j(t)} \sqrt{\tau_n(j, y)} \right| \ge x \right) dP_Y(y) = P\left( \sup_t \left| \sum_{j=1}^{k} \sqrt{n} \left( \tau_n(j) - \tau_j \right) F_j(t) + \sum_{j=1}^{k} b^{j}_{F_j(t)} \sqrt{\tau_n(j)} \right| \ge x \right) \quad (12)$$

If we take the limit as $n$ tends to infinity in (12), under the following additional hypothesis:

(H2) the sequence of random vectors $\sqrt{n}\,(\tau_n(j) - \tau_j)_{j=1,\dots,k} \xrightarrow[n]{w} D = (D_1, \dots, D_k)$, where $D$ is a random vector in $\mathbb{R}^k$, degenerate (since $\sum_{j=1}^{k} D_j = 0$), but such that the vectors of $\mathbb{R}^{k-1}$ obtained by suppressing any one of the $k$ coordinates of $D$ are non-degenerate, and where $D$ is independent of the Brownian bridges $(b^{j})_{j=1,\dots,k}$, we finally have that (12) tends to:

$$P\left( \sup_t \left| \sum_{j=1}^{k} D_j F_j(t) + \sum_{j=1}^{k} b^{j}_{F_j(t)} \sqrt{\tau_j} \right| \ge x \right)$$

Therefore we have:

Theorem 2

Under the previous hypotheses, $\forall x \ge 0$:

$$P\left( \sqrt{n}\, \sup_t \left| F_n(t) - \sum_{j=1}^{k} \tau_j F_j(t) \right| \ge x \right) \xrightarrow[n]{} P\left( \sup_t \left| \sum_{j=1}^{k} D_j F_j(t) + \sum_{j=1}^{k} b^{j}_{F_j(t)} \sqrt{\tau_j} \right| \ge x \right) := T(x)$$

Remark 4

The value of $T(x)$ can be computed by Monte Carlo, as will be seen in the next section.
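Such a Monte Carlo computation can be sketched as follows (a minimal illustration, not the authors' code: it assumes $k = 2$ and, purely for concreteness, plugs in the sampler for $(D, \tau)$ of Equations (18)-(19) of the next section, with hypothetical parameters $p = 0.3$, $\delta = 0.3$, $\eta = 0.6$ and components $F_1 = N(0,1)$, $F_2 = N(3,1)$; the Brownian bridges are discretized on a grid):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)
p, delta, eta = 0.3, 0.3, 0.6   # illustrative parameters

def F(t_arr, mu):
    # Component CDFs: normal with unit variance and mean mu
    return np.array([0.5 * (1.0 + erf((t - mu) / sqrt(2.0))) for t in t_arr])

def sample_D_tau():
    # (H2) limit for the two-state model: D_2 = -D_1, D_1 = (2-U) W_1 + (U-1) W_2
    U = 1 if rng.random() < p else 2
    tau1 = delta if U == 1 else eta
    W1 = rng.normal(0.0, sqrt(delta * (1.0 - delta)))
    W2 = rng.normal(0.0, sqrt(eta * (1.0 - eta)))
    D1 = (2 - U) * W1 + (U - 1) * W2
    return (D1, -D1), (tau1, 1.0 - tau1)

u_grid = np.linspace(0.0, 1.0, 501)
t_grid = np.linspace(-6.0, 9.0, 601)
F1, F2 = F(t_grid, 0.0), F(t_grid, 3.0)

def brownian_bridge():
    # Discretized bridge b_u = W_u - u W_1 on u_grid
    dW = rng.normal(0.0, np.sqrt(np.diff(u_grid, prepend=0.0)))
    W = np.cumsum(dW)
    return W - u_grid * W[-1]

def one_replicate():
    # One draw of sup_t | sum_j D_j F_j(t) + sum_j b^j_{F_j(t)} sqrt(tau_j) |
    (D1, D2), (t1, t2) = sample_D_tau()
    b1 = np.interp(F1, u_grid, brownian_bridge())   # b^1 evaluated at F_1(t)
    b2 = np.interp(F2, u_grid, brownian_bridge())   # b^2 evaluated at F_2(t)
    process = D1 * F1 + D2 * F2 + b1 * sqrt(t1) + b2 * sqrt(t2)
    return float(np.max(np.abs(process)))

samples = np.array([one_replicate() for _ in range(2000)])

def T(x):
    # Monte Carlo estimate of the tail T(x)
    return float(np.mean(samples >= x))
```

The critical value at level $\alpha$ is then the empirical $(1 - \alpha)$-quantile of `samples`.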

Remark 5

Obviously, for practical purposes, the $D_j$ and $\tau_j$ should often be replaced by their empirical estimates.

3. A Model for Simulated Data

For our simulations, we will use the model of Example 2 of [7] , with some minor modifications.

Consider $\sigma(1), \sigma(2)$ independent, such that

$$P(\sigma(1) = 1) = \delta, \quad P(\sigma(1) = 2) = 1 - \delta, \quad P(\sigma(2) = 1) = \eta, \quad P(\sigma(2) = 2) = 1 - \eta,$$

with $0 < \delta < 1$, $0 < \eta < 1$, and $\delta \ne \eta$.

Let $(\sigma_1(1), \sigma_1(2)), \dots, (\sigma_n(1), \sigma_n(2)), \dots$ be iid with the same distribution as $(\sigma(1), \sigma(2))$, and consider a fixed random variable $U$, independent of $(\sigma_1(1), \sigma_1(2)), \dots, (\sigma_n(1), \sigma_n(2)), \dots$, such that

$$P(U = 1) = p, \quad P(U = 2) = 1 - p, \quad 0 < p < 1,$$

and define

$$Y_i := \sigma_i(U) \quad (13)$$

Then $k = 2$, and:

$$\tau_n(1) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{Y_i = 1\}}$$

Hence

$$\tau_n(1) \,/\, U = 1 \;\sim\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{\sigma_i(1) = 1\}} \xrightarrow[n]{a.s.} \delta$$

(by the Strong Law of Large Numbers), and in a similar way

$$\tau_n(1) \,/\, U = 2 \;\sim\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{\sigma_i(2) = 1\}} \xrightarrow[n]{a.s.} \eta, \quad \text{and then } \tau_n(1) \xrightarrow[n]{a.s.} \tau_1,$$

with

$$\tau_1 = \begin{cases} \delta & \text{if } U = 1 \\ \eta & \text{if } U = 2 \end{cases} \quad (14)$$

On the other hand,

$$\tau_n(2) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{Y_i = 2\}}$$

and

$$\tau_n(2) \,/\, U = 1 \;\sim\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{\sigma_i(1) = 2\}} \xrightarrow[n]{a.s.} 1 - \delta,$$

and

$$\tau_n(2) \,/\, U = 2 \;\sim\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{\sigma_i(2) = 2\}} \xrightarrow[n]{a.s.} 1 - \eta,$$

and then $\tau_n(2) \xrightarrow[n]{a.s.} \tau_2$, with

$$\tau_2 = \begin{cases} 1 - \delta & \text{if } U = 1 \\ 1 - \eta & \text{if } U = 2 \end{cases} \quad (15)$$

and thus, by (14) and (15), (H1) is satisfied.
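A quick numerical check of (H1) for this model can be sketched as follows (an illustrative simulation with hypothetical parameters; note that the limit of $\tau_n(1)$ is the random variable (14), equal to $\delta$ or $\eta$ depending on the single draw of $U$):

```python
import numpy as np

rng = np.random.default_rng(3)
p, delta, eta = 0.3, 0.3, 0.6   # illustrative parameters
n = 200_000

U = 1 if rng.random() < p else 2
# sigma_i(1), sigma_i(2) iid over i; the observed states are Y_i = sigma_i(U)
sigma1 = np.where(rng.random(n) < delta, 1, 2)
sigma2 = np.where(rng.random(n) < eta, 1, 2)
Y = sigma1 if U == 1 else sigma2

tau_n1 = float(np.mean(Y == 1))      # tau_n(1)
tau1 = delta if U == 1 else eta      # the realized limit (14)
```

Re-running with different seeds shows $\tau_n(1)$ clustering near $\delta$ in roughly a fraction $p$ of the runs and near $\eta$ otherwise, which is exactly the random limit of (H1).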

Now consider the bivariate random vector $\sqrt{n}\,(\tau_n(j) - \tau_j)_{j=1,2}$.

Then

$$\sqrt{n}\,\left( \tau_n(1) - \tau_1, \tau_n(2) - \tau_2 \right) \,/\, U = 1 \;\sim\; \left( \sqrt{n}\left( \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{\sigma_i(1) = 1\}} - \delta \right), \sqrt{n}\left( \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{\sigma_i(1) = 2\}} - (1 - \delta) \right) \right) \xrightarrow[n]{w} Z_1,$$

a bivariate, centered, degenerate Gaussian random vector with covariance matrix

$$\begin{pmatrix} \delta(1 - \delta) & -\delta(1 - \delta) \\ -\delta(1 - \delta) & \delta(1 - \delta) \end{pmatrix} \quad (16)$$

by the ordinary Central Limit Theorem, and using the fact that

$$\mathbf{1}_{\{\sigma_i(1) = 1\}} + \mathbf{1}_{\{\sigma_i(1) = 2\}} = 1 \quad \forall i.$$

On the other hand,

$$\sqrt{n}\,\left( \tau_n(1) - \tau_1, \tau_n(2) - \tau_2 \right) \,/\, U = 2 \;\sim\; \left( \sqrt{n}\left( \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{\sigma_i(2) = 1\}} - \eta \right), \sqrt{n}\left( \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{\sigma_i(2) = 2\}} - (1 - \eta) \right) \right) \xrightarrow[n]{w} Z_2,$$

a bivariate, centered, degenerate Gaussian random vector with covariance matrix

$$\begin{pmatrix} \eta(1 - \eta) & -\eta(1 - \eta) \\ -\eta(1 - \eta) & \eta(1 - \eta) \end{pmatrix} \quad (17)$$

Therefore, setting:

$$D = \begin{cases} Z_1 & \text{if } U = 1 \\ Z_2 & \text{if } U = 2 \end{cases} \quad (18)$$

then $D$ is a centered, degenerate bivariate random vector whose distribution is a mixture of Gaussian laws, and the suppression of either of its two coordinates yields a non-degenerate one-dimensional mixture of Gaussian distributions.

Furthermore, if we write $D = (D_1, D_2)$, then it is very easy to check that $D_2 = -D_1$, and that $D_1$ may be represented as $(2 - U) W_1 + (U - 1) W_2$, with $W_1, W_2$ independent of each other and of $U$, such that

$$W_1 \sim N(0, \delta(1 - \delta)), \quad W_2 \sim N(0, \eta(1 - \eta)) \quad (19)$$

and therefore, (H2) is satisfied.
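The CLT step behind (H2) can also be checked empirically (a sketch conditional on $U = 1$, with an illustrative $\delta$; it verifies that $\sqrt{n}\,(\tau_n(1) - \delta)$ is approximately centered with variance $\delta(1 - \delta)$, the upper-left entry of (16)):

```python
import numpy as np

rng = np.random.default_rng(6)
delta = 0.3
n, reps = 2000, 4000

# Conditional on U = 1: sqrt(n) (tau_n(1) - delta), replicated many times
vals = np.array([
    np.sqrt(n) * (np.mean(rng.random(n) < delta) - delta) for _ in range(reps)
])
mean_hat, var_hat = float(vals.mean()), float(vals.var())
```

The empirical variance should sit close to $\delta(1 - \delta) = 0.21$, matching the Gaussian limit $W_1$ of (19).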

Finally, consider $F_1$ and $F_2$, two continuous distributions such that $F_1 \ne F_2$, and two independent sequences $V_1^{(1)}, \dots, V_n^{(1)}, \dots$ iid $\sim F_1$ and $V_1^{(2)}, \dots, V_n^{(2)}, \dots$ iid $\sim F_2$, and set:

1) If $\sigma_i(U) = 1$, $X_i = V_i^{(1)}$;

2) If $\sigma_i(U) = 2$, $X_i = V_i^{(2)}$.

Then, as seen before, Theorem 1 and Theorem 2 apply to $(X_i)_i$; therefore, we will simulate large samples of this type of data (for different choices of the pair $F_1, F_2$), perform the KS-type test given by Theorem 2, and discuss the results.
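The data-generating rules 1) and 2) can be sketched as follows (an illustration with the hypothetical choices $F_1 = N(0,1)$, $F_2 = N(3,1)$, matching the simulations of the next section):

```python
import numpy as np

rng = np.random.default_rng(4)
p, delta, eta = 0.3, 0.3, 0.6   # illustrative parameters
n = 10_000

U = 1 if rng.random() < p else 2
# States sigma_i(U): equal to 1 with probability delta (if U = 1) or eta (if U = 2)
sigma_U = np.where(rng.random(n) < (delta if U == 1 else eta), 1, 2)

V1 = rng.standard_normal(n)          # V_i^(1) iid ~ F1 = N(0,1)
V2 = rng.standard_normal(n) + 3.0    # V_i^(2) iid ~ F2 = N(3,1)
X = np.where(sigma_U == 1, V1, V2)   # rules 1) and 2)
```

Conditional on the states, the sample mean of `X` is close to $3$ times the observed frequency of state 2, a quick sanity check on the generator.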

4. Application to Simulated Data

Following the model of the previous section, we simulated large samples on which the KS-type test provided by Theorem 2 was performed.

We chose the required parameters in the following way: $p = 0.3$, $\delta = 0.3$, $\eta = 0.6$. With this choice, and assuming as the true model the corresponding mixture with $F_1$ a $N(0, 1)$ distribution and $F_2$ a $N(3, 1)$ distribution, we simulated 4000 independent samples of size $n = 500$ from the true model to compute p-values by Monte Carlo.

We also simulated an extra independent sample, following the true model, to apply our test.

We proposed for fitting (i.e., as $H_0$ in our test) a similar mixture model, but taking as $F_1$ a Cauchy distribution with location parameter 0 and scale parameter 1, and as $F_2$ a Cauchy distribution with location parameter 3 and scale parameter 1.

In this context, the corresponding critical value for the KS statistic (the maximal difference between the empirical distribution and the proposed one) was 0.3311402, and the Monte Carlo computations led to a p-value of 0.0285, which clearly implies rejection.
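The procedure of this section can be sketched end-to-end (a scaled-down, hypothetical re-implementation, not the authors' code: fewer replicates, the same parameters, the true normal mixture as data generator, the proposed Cauchy mixture as $H_0$, and the mixture weight estimated from the observed state frequencies; one plausible reading of the Monte Carlo step described above):

```python
import numpy as np
from math import erf, sqrt, atan, pi

rng = np.random.default_rng(5)
p, delta, eta = 0.3, 0.3, 0.6
n, n_rep = 500, 400   # the paper uses 4000 replicates; scaled down here for speed

def simulate_true(n):
    # True model: X_i = V_i^(sigma_i(U)) with F1 = N(0,1), F2 = N(3,1)
    U = 1 if rng.random() < p else 2
    s = np.where(rng.random(n) < (delta if U == 1 else eta), 1, 2)
    X = np.where(s == 1, rng.standard_normal(n), rng.standard_normal(n) + 3.0)
    return X, float(np.mean(s == 1))    # sample and estimated tau_1

def sup_diff(sample, cdf):
    # KS statistic: sup_t |F_n(t) - G(t)| for a proposed CDF G
    x = np.sort(sample)
    m = x.size
    G = np.array([cdf(v) for v in x])
    i = np.arange(1, m + 1)
    return float(max(np.max(i / m - G), np.max(G - (i - 1) / m)))

def normal_mix(tau1):
    return lambda t: (tau1 * 0.5 * (1 + erf(t / sqrt(2)))
                      + (1 - tau1) * 0.5 * (1 + erf((t - 3) / sqrt(2))))

def cauchy_mix(tau1):
    return lambda t: (tau1 * (0.5 + atan(t) / pi)
                      + (1 - tau1) * (0.5 + atan(t - 3) / pi))

# Reference distribution: statistic computed on true-model samples
# against their own (normal) mixture.
null_stats = np.array([sup_diff(*((lambda Xt: (Xt[0], normal_mix(Xt[1])))(simulate_true(n))))
                       for _ in range(n_rep)])

# Extra sample, tested against the proposed Cauchy mixture (H0)
X_obs, t1_obs = simulate_true(n)
stat = sup_diff(X_obs, cauchy_mix(t1_obs))
p_value = float(np.mean(null_stats >= stat))
```

The empirical $(1 - \alpha)$-quantile of `null_stats` plays the role of the critical value, and `p_value` the role of the Monte Carlo p-value; with these scaled-down sizes, the exact numbers of the paper (0.3311402 and 0.0285) are of course not reproduced.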

Figure 1 shows the comparison between the empirical distribution of the simulated sample of the true model used for testing and the theoretical distribution of the proposed model. The difference between them is striking; in particular, it should be noticed that the distribution of the proposed model, for larger values of the argument, is always clearly below the empirical distribution of the true model. This reflects the fact that the proposed model is much more heavy-tailed than the true model.

Figure 1. Proposed distribution vs. empirical distribution based on simulated data of the true model.

5. Conclusions & Further Work

As seen in the previous section, a KS-type test may be performed for non-stationary and strongly dependent samples of large size. Its performance, both in terms of statistical efficiency and computational complexity, is satisfactory. A large variety of real data may be analyzed using this tool and other related ones.

In particular, in a forthcoming paper by the same authors, this goodness-of-fit test plays a key role in the determination of the number of components and relative weights of a mixture of extremal distributions. The previous paper [7] shows that these types of mixtures are suitable for the extremal analysis of many environmental datasets where non-stationarity and strong dependence appear.

Another direction of further work is the extension of this paper to other testing tools based on the asymptotic behavior of the empirical process and related statistical procedures.

Acknowledgements

This work was partially supported by Proyecto CSIC-VUSP “Análisis de eventos climáticos extremos y su incidencia sobre la producción hortifrutícola en Salto” (Uruguay). The authors thank an anonymous referee for highly valuable suggestions.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Gibbons, J.D. and Chakraborti, S. (2021) Nonparametric Statistical Inference. 6th Edition, Chapman & Hall/CRC, Boca Raton.
[2] Tanguep, E. and Njomen, D. (2021) Kolmogorov-Smirnov APF Test for Inhomogeneous Poisson Processes with Shift Parameter. Applied Mathematics, 12, 322-335.
https://doi.org/10.4236/am.2021.124023
[3] Zhao, J. and Li, X. (2022) Goodness of Fit Test Based on BLUS Residuals for Error Distribution of Regression Model. Applied Mathematics, 13, 672-682.
https://doi.org/10.4236/am.2022.138042
[4] Bellanger, L. and Perera, G. (2003) Compound Poisson Limit Theorems for High-Level Exceedances of Some Non-Stationary Processes. Bernoulli, 9, 497-515.
https://doi.org/10.3150/bj/1065444815
[5] Perera, G. (2002) Irregular Sets and Central Limit Theorems. Bernoulli, 8, 627-642.
[6] Aspirot, L., Bertin, K. and Perera, G. (2009) Asymptotic Normality of the Nadaraya-Watson Estimator for Nonstationary Functional Data and Applications to Telecommunications. Journal of Nonparametric Statistics, 21, 535-551.
https://doi.org/10.1080/10485250902878655
[7] Crisci, C. and Perera, G. (2022) Asymptotic Extremal Distribution for Non-Stationary, Strongly-Dependent Data. Advances in Pure Mathematics, 12, 479-489.
https://doi.org/10.4236/apm.2022.128036
[8] Shorack, G.R. and Wellner, J.A. (2009) Empirical Processes with Applications to Statistics. Classics in Applied Mathematics, SIAM, Philadelphia.
https://doi.org/10.1137/1.9780898719017
[9] Billingsley, P. (1999) Convergence of Probability Measures. 2nd Edition, John Wiley & Sons, Inc., Hoboken.
https://doi.org/10.1002/9780470316962
[10] Komlós, J., Major, P. and Tusnády, G. (1975) An Approximation of Partial Sums of Independent RV’s, and the Sample DF. I. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 32, 111-131.
https://doi.org/10.1007/BF00533093
[11] Komlós, J., Major, P. and Tusnády, G. (1976) An Approximation of Partial Sums of Independent RV’s, and the Sample DF. II. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 34, 33-58.
https://doi.org/10.1007/BF00532688
[12] Bretagnolle, J. and Massart, P. (1989) Hungarian Constructions from the Nonasymptotic Viewpoint. Annals of Probability, 17, 239-256.
https://doi.org/10.1214/aop/1176991506
[13] Koning, A.J. (1994) KMT-Type Inequalities and Goodness-of-Fit Tests. Statistica Neerlandica, 48, 117-132.
https://doi.org/10.1111/j.1467-9574.1994.tb01437.x
[14] Van der Vaart, A.W. (2000) Asymptotic Statistics. Cambridge University Press, Cambridge.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.