
This paper studies model-based methods in cluster analysis for classifying data elements into clusters, and applies this classification to time series in order to choose an appropriate mixture model. The mixture-model cluster analysis technique is presented under different covariance structures of the component densities. This model captures the compactness, orientation, shape, and volume of the component clusters in one expert system to handle Gaussian high-dimensional heterogeneous data sets, and thus achieves flexibility beyond currently practiced cluster analysis techniques. The Expectation-Maximization (EM) algorithm is used to estimate the parameters of the covariance matrix. Several criteria are used to judge the goodness of the models for the covariance matrices produced by the simulation; these models have not been tackled in previous studies. The results showed the superiority of the ICOMP_PEU criterion over the other criteria, in addition to the success of the Gaussian cluster-based model in prediction using the covariance matrices considered in this study. The study also found that the optimal number of clusters can be determined by choosing the number of clusters corresponding to the lowest values of the different criteria used in the study.

Cluster analysis is one of the statistical methods that deal with the division and classification of data elements into several groups that are homogeneous within one group (cluster) and different from the other groups (clusters). Cluster analysis is defined as "a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual. All such methods essentially try to imitate what the eye-brain system does so well in two dimensions" (Everitt and Skrondal [

Banfield and Raftery [

Different constraints on the covariance matrix provide different models that are applicable to different data structures, which is another advantage of model-based clustering. In 1995, Celeux and Govaert [

Later in 2016, Chi et al., [

Cluster analysis is used in various fields of science. Tóth et al., [

In 2000, Bozdogan [

The main contribution of the present paper is to propose the mixture-model cluster analysis technique under different covariance structures of the component densities, and to determine the optimal number of clusters by selecting the number of clusters corresponding to the lowest values of the different criteria. Four models for covariance structures that have not been applied in previous studies are examined using three information complexity criteria.

This paper is organized as follows: Section one is the introduction, and in section two the Gaussian Mixture Model-based Clustering (GMMC) is discussed. In section three, the Expectation-Maximization (EM) algorithm is introduced. The model selection criteria are introduced in section four. Finally, sections five and six contain the numerical results and the conclusion, respectively (

The Gaussian mixture model is a powerful clustering algorithm used in cluster analysis, and the most widely used clustering method of this kind is the one based on learning a mixture of Gaussians. It assumes that there is a certain number of Gaussian distributions, each of which represents a cluster. Hence, a Gaussian mixture model tends to group the data points belonging to a single distribution together. Gaussian mixture models are probabilistic models and use the soft clustering approach for distributing the points among clusters. Since it is difficult to determine the right model parameters directly, the Expectation-Maximization method is used to estimate them.

Given data X ∈ ℝ^(n×p) (p-dimensional data of size n), we are interested in estimating the number of clusters K. The observations x_ij (i = 1, ⋯, n, j = 1, ⋯, p) are assumed to be drawn from the following mixture of K distributions, each corresponding to a different cluster:

Parameter | Nomenclature | Parameter | Nomenclature |
---|---|---|---|
π_k | mixing proportion | λ_k | scalar controlling the volume of the ellipsoid |
θ_k | vector of unknown parameters | A_k | diagonal matrix |
S_k | covariance matrix | D_k | orthogonal matrix |
μ_k | mean vector | | |

f(x; π, θ) = ∑_{k=1}^{K} π_k g_k(x; θ_k)

Here π_1, ⋯, π_K are the mixing proportions that satisfy π_k > 0 and ∑_{k=1}^{K} π_k = 1. θ_k is the vector of unknown parameters of the kth component, and π_k represents the probability that an observation belongs to the kth component. The Gaussian mixture model assumes that the components of the mixture are multivariate normal distributions, so the density function becomes:

f(x; π, μ, Σ) = ∑_{k=1}^{K} π_k g_k(x; μ_k, Σ_k)

The mixture components (i.e. clusters) are ellipsoids centered at μ k with other geometric features, such as volume, shape, and orientation, determined by the covariance matrix Σ k . (Titterington et al. [

In this case, the component densities g k are given by:

g_k(x; μ_k, Σ_k) = (2π)^{−p/2} |Σ_k|^{−1/2} exp{ −(1/2)(x − μ_k)′ Σ_k^{−1} (x − μ_k) }
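As a quick sketch, the component density above can be evaluated directly with NumPy (the function name and the example point are ours, for illustration):

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """g_k(x; mu_k, Sigma_k): multivariate normal density as defined above."""
    p = len(mu)
    diff = x - mu
    # (2*pi)^(-p/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)' Sigma^{-1} (x-mu))
    norm_const = (2 * np.pi) ** (-p / 2) * np.linalg.det(sigma) ** (-0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff))

# Standard bivariate normal evaluated at its mean: (2*pi)^(-1) ≈ 0.1592
val = gaussian_density(np.zeros(2), np.zeros(2), np.eye(2))
```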

Parsimonious parameterizations of the covariance matrices can be obtained by using the eigenvalue decomposition of the covariance matrix. The eigenvalue decomposition of the kth covariance matrix is given as:

Σ_k = λ_k D_k A_k D_k^T

where: λ_k is a scalar controlling the volume of the ellipsoid.

A_k is a diagonal matrix specifying the shape of the density contours, with det(A_k) = 1.

D_k is an orthogonal matrix which determines the orientation of the corresponding ellipsoid (Banfield and Raftery [

In one dimension, there are just two models: E for equal variance and V for varying variance. In the multivariate setting, the volume, shape, and orientation of the covariance can be constrained to be equal or variable across groups. Thus, 14 possible models with different geometric characteristics can be specified.
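The eigenvalue decomposition behind these models can be illustrated with a short NumPy sketch; the numbers below are illustrative choices, not the paper's simulation settings:

```python
import numpy as np

# Build Sigma = lambda * D * A * D^T for one component.
lam = 2.0                                        # volume
theta = np.pi / 8                                # rotation angle for the orientation
D = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # orthogonal orientation matrix
A = np.diag([3.0, 1.0 / 3.0])                    # shape matrix with det(A) = 1
Sigma = lam * D @ A @ D.T

# Since det(A) = 1 and D is orthogonal, det(Sigma) = lam^p,
# so lambda alone controls the volume of the ellipsoid.
```

Constraining λ, A, or D to be shared across components (or fixing A = I, D = I) yields the 14 models listed below.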

Approaching the clustering problem from this probabilistic standpoint reduces the whole problem to the parameter estimation of a mixture density. The unknown parameters of the Gaussian mixture density are the mixing proportions, π_k, the mean vectors, μ_k, and the covariance matrices, Σ_k. Therefore, to estimate these parameters, we need to maximize the log-likelihood given by:

log L(θ | x) = ∑_{i=1}^{n} log[ ∑_{k=1}^{K} π_k g_k(x_i | μ_k, Σ_k) ]

The estimates of the mixing proportion, π k , the mean vector μ k , and the covariance matrix Σ k for the kth population are given as:

No. | Model | Covariance | Distribution | Volume | Shape | Orientation |
---|---|---|---|---|---|---|
1 | EII | λI | Spherical | Equal | Equal | – |
2 | VII | λ_k I | Spherical | Variable | Equal | – |
3 | EEI | λA | Diagonal | Equal | Equal | Coordinate axes |
4 | VEI | λ_k A | Diagonal | Variable | Equal | Coordinate axes |
5 | EVI | λA_k | Diagonal | Equal | Variable | Coordinate axes |
6 | VVI | λ_k A_k | Diagonal | Variable | Variable | Coordinate axes |
7 | EEE | λDAD^T | Ellipsoidal | Equal | Equal | Equal |
8 | EVE | λDA_k D^T | Ellipsoidal | Equal | Variable | Equal |
9 | VEE | λ_k DAD^T | Ellipsoidal | Variable | Equal | Equal |
10 | VVE | λ_k DA_k D^T | Ellipsoidal | Variable | Variable | Equal |
11 | EEV | λD_k AD_k^T | Ellipsoidal | Equal | Equal | Variable |
12 | VEV | λ_k D_k AD_k^T | Ellipsoidal | Variable | Equal | Variable |
13 | EVV | λD_k A_k D_k^T | Ellipsoidal | Equal | Variable | Variable |
14 | VVV | λ_k D_k A_k D_k^T | Ellipsoidal | Variable | Variable | Variable |

π̂_k = (1/n) ∑_{i=1}^{n} I_k(Ŷ_i)

μ̂_k = (1/(π̂_k n)) ∑_{i=1}^{n} x_i I_k(Ŷ_i)

Σ̂_k = (1/(π̂_k n)) ∑_{i=1}^{n} (x_i − μ̂_k)′(x_i − μ̂_k) I_k(Ŷ_i)

where the indicator I_k(Ŷ_i) equals 1 if Ŷ_i = k and 0 otherwise.
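Under hard cluster labels, the three estimators above can be sketched as follows (function and variable names are ours, for illustration):

```python
import numpy as np

def estimate_components(X, labels, K):
    """pi_k, mu_k and Sigma_k from hard labels, per the indicator estimators above."""
    n = len(X)
    pi_hat, mu_hat, sigma_hat = [], [], []
    for k in range(K):
        mask = labels == k                            # the indicator I_k(Y_i)
        pi_hat.append(mask.sum() / n)                 # proportion of points in cluster k
        mu_k = X[mask].mean(axis=0)                   # within-cluster mean
        diff = X[mask] - mu_k
        mu_hat.append(mu_k)
        sigma_hat.append(diff.T @ diff / mask.sum())  # within-cluster covariance
    return np.array(pi_hat), np.array(mu_hat), np.array(sigma_hat)
```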

This estimation requires the non-linear optimization of the mixture likelihood for high-dimensional data sets. However, there is no closed-form solution to ∂ log L(θ̂ | x)/∂θ = 0 for any mixture density, so the likelihood has to be maximized numerically. For this numerical optimization, the Expectation-Maximization (EM) algorithm of Dempster et al. [ is used, treating the cluster labels Ŷ_i as missing.

The expectation-maximization (EM) algorithm is an iterative procedure used to find maximum likelihood estimates when data are incomplete or are treated as incomplete. The seminal reference for the EM algorithm is the famous paper by Dempster et al. [

The EM algorithm is an iterative procedure consisting of two alternating steps, given some starting values for all parameters ( π ^ k , μ ^ k and Σ ^ k ). The algorithm can be summarized as follows at iteration (t + 1):

1) In the E-step, the posterior probability T̂_ik of the ith observation belonging to the kth component is estimated, given the current parameter estimates:

T̂_ik = π̂_k^(t) g_k(x_i | μ̂_k^(t), Σ̂_k^(t)) / ∑_{k′=1}^{K} π̂_{k′}^(t) g_{k′}(x_i | μ̂_{k′}^(t), Σ̂_{k′}^(t))

2) In the M-step, the parameter estimates of π k , μ k and Σ k are updated given the estimated posterior probabilities, using the update equations

π̂_k^(t+1) = (1/n) ∑_{i=1}^{n} T̂_ik

μ̂_k^(t+1) = (1/(n π̂_k^(t+1))) ∑_{i=1}^{n} x_i T̂_ik

Σ̂_k^(t+1) = (1/(n π̂_k^(t+1))) ∑_{i=1}^{n} T̂_ik (x_i − μ̂_k^(t+1))′(x_i − μ̂_k^(t+1))

3) Iterate the first two steps until convergence.
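The full E- and M-step loop can be sketched in NumPy as below. This is a generic unconstrained (VVV) implementation for illustration, not the paper's MATLAB program; the small ridge terms and the optional initial means `mu0` are our additions for numerical stability and reproducibility:

```python
import numpy as np

def em_gmm(X, K, max_iter=1000, tol=1e-6, mu0=None, seed=0):
    """EM for an unconstrained Gaussian mixture (a sketch of the steps above)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, K, replace=False)] if mu0 is None else np.array(mu0, float)
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(p) for _ in range(K)])
    loglik_old = -np.inf
    for _ in range(max_iter):
        # E-step: posterior probabilities T_ik of observation i belonging to cluster k
        dens = np.empty((n, K))
        for k in range(K):
            diff = X - mu[k]
            quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(sigma[k]), diff)
            dens[:, k] = (pi[k] * (2 * np.pi) ** (-p / 2)
                          * np.linalg.det(sigma[k]) ** -0.5 * np.exp(-0.5 * quad))
        T = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions, means and covariances
        Nk = T.sum(axis=0)
        pi = Nk / n
        mu = (T.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (T[:, k, None] * diff).T @ diff / Nk[k] + 1e-9 * np.eye(p)
        # Convergence check on the observed-data log-likelihood
        loglik = np.log(dens.sum(axis=1)).sum()
        if abs(loglik - loglik_old) < tol:
            break
        loglik_old = loglik
    return pi, mu, sigma, loglik
```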

The EM algorithm requires two issues to be addressed: determining the number of components, K, and initialization of the parameters.

After estimating the parameters of the covariance matrix, the next step in determining the optimal cluster structure is selecting the best model. Despite the vast number of model selection criteria in the literature, Schwarz's Bayesian Criterion (SBC) (Schwarz [

AIC = −2 log L(θ̂ | x) + 2m

SBC = −2 log L(θ̂ | x) + m log(n)

where: L ( θ ^ | x ) is the likelihood function.

m is the number of independent parameters to be estimated.

θ ^ is the maximum likelihood estimate for parameter θ.
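The two criteria are straightforward to compute once the maximized log-likelihood and parameter count are known. A minimal sketch follows; the log-likelihood values and the parameter count m = 6K − 1 (for an unconstrained bivariate mixture) are illustrative placeholders, not the paper's numbers:

```python
import numpy as np

def aic(loglik, m):
    """AIC = -2 log L(theta|x) + 2m."""
    return -2 * loglik + 2 * m

def sbc(loglik, m, n):
    """SBC = -2 log L(theta|x) + m log(n)."""
    return -2 * loglik + m * np.log(n)

# Placeholder maximized log-likelihoods for K = 1, 2, 3 clusters
logliks = {1: -1090.0, 2: -910.0, 3: -908.0}
n = 250
scores = {K: sbc(ll, m=6 * K - 1, n=n) for K, ll in logliks.items()}
best_K = min(scores, key=scores.get)  # decision rule: smallest score wins
```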

ICOMP, originally introduced by Bozdogan [

ICOMP = −2 log L(θ̂ | x) + 2C(Ĉov(θ̂))

where: L ( θ ^ | x ) is the likelihood function.

C is a real-valued complexity measure.

Ĉov(θ̂) is the estimated model covariance matrix.

The covariance matrix is estimated by the estimated inverse Fisher information matrix (IFIM), F̂^{−1}, given by:

F̂^{−1} = { −E[ ∂² log L(θ̂) / ∂θ ∂θ′ ] }^{−1}

That is to say, IFIM is the negative expectation of the matrix of the second partial derivatives of the maximized log-likelihood of the fitted model, evaluated at the maximum likelihood estimators θ ^ .

For a multivariate normal model, the general form of ICOMP is defined as:

ICOMP_PEU(F̂^{−1}) = −2 log L(θ̂ | x) + m + log(n) C_1(F̂^{−1})

where:

C_1(F̂^{−1}) = (s/2) log[ tr(F̂^{−1}) / s ] − (1/2) log |F̂^{−1}|

s = dim(F̂^{−1}) = rank(F̂^{−1})
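The C_1 measure can be sketched directly from its definition. It compares the arithmetic mean of the eigenvalues (via the trace) with their geometric mean (via the determinant), so it vanishes when all eigenvalues are equal and grows as they spread apart:

```python
import numpy as np

def c1(F_inv):
    """C1(F^-1) = (s/2) log(tr(F^-1)/s) - (1/2) log|F^-1|."""
    s = F_inv.shape[0]
    return ((s / 2) * np.log(np.trace(F_inv) / s)
            - 0.5 * np.log(np.linalg.det(F_inv)))

# Equal eigenvalues give zero complexity; unequal eigenvalues are penalized.
zero = c1(2 * np.eye(3))
pos = c1(np.diag([1.0, 4.0]))  # log(2.5) - log(2) ≈ 0.223
```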

For all the above criteria, the decision rule is to select the model that gives the minimum score for the loss function.

All results were obtained by using MATLAB.

The Gaussian mixture-model based clustering, which implements the EM algorithm for inference, is applied to four simulated data sets. The maximum number of clusters is taken as K_max = 6 for all examples. The convergence tolerance of the EM algorithm is set to 10^{−6} and a maximum of 1000 iterations is allowed. After confirming the validity of the mathematical equations and the program, four models of the covariance matrix were applied. These models are:

Model: EVV with the covariance matrix (λD_k A_k D_k^T).

Model: VII with the covariance matrix (λ_k I).

Model: VEE with the covariance matrix (λ_k DAD^T).

Model: VVE with the covariance matrix (λ_k DA_k D^T).

These models have been selected due to their distinguishing features: they represent different cases of the covariance matrix, where the models [EVV], [VEE], and [VVE] belong to the general family (Celeux and Govaert [

1) Model: EVV with the covariance matrix (λD_k A_k D_k^T) (

From

Given below in

For the selected model, GMMC identifies the cluster labels with a misclassification rate of 0.4% (one misclassified observation out of 250). The misclassification rate is calculated as follows:

No. of clusters | AIC | SBC | ICOMP_PEU |
---|---|---|---|
1 | 2188.1 | 2188.3 | 2174.1 |
2 | 1837.8 | 1844.3 | 1826.8 |
3 | 1839.7 | 1852.3 | 1828.7 |
4 | 1842 | 1860.8 | 1831 |
5 | 1843.4 | 1868.4 | 1832.4 |
6 | 1853.2 | 1884.5 | 1842.2 |

Model: EVV (λD_k A_k D_k^T)

Input: n = 250, n₁ = 175, n₂ = 75, K = 2

Parameters (true): λ = 2; A₁ = [1 0; 0 1]; A₂ = [1 0; 0 3]; COV₁ = [1.2929 1.2483; 1.2483 2.000]; COV₂ = [4.7071 −4.6268; −4.6268 5.4142]; π_k = [0.7, 0.3]; μ_k = [2, 2], [−3, 0]

Output (estimated, No. of simulations = 100): COV₁ = [1.2671 1.2559; 1.2559 2.0334]; COV₂ = [3.738 −3.6801; −3.6801 4.5328]; π_k = [0.6997, 0.3003]; μ_k = [1.9466, 1.8859], [−2.8553, −0.1221]

Actual \ Predicted | 1 | 2 | Total |
---|---|---|---|
1 | 174 | 1 | 175 |
2 | 0 | 75 | 75 |
Total | 174 | 76 | 250 |

(1 − (a₁₁ + a₂₂)/n) × 100 = (1 − (174 + 75)/250) × 100 = (1 − 0.996) × 100 = 0.4
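In code, the same calculation is just the complement of the trace ratio of the confusion matrix (a sketch; for the tabulated EVV matrix this gives 0.4):

```python
import numpy as np

# Confusion matrix for the EVV model (rows: actual, columns: predicted)
confusion = np.array([[174, 1],
                      [0, 75]])
# Misclassification rate = (1 - correctly classified / total) * 100
rate = (1 - np.trace(confusion) / confusion.sum()) * 100
```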

2) Model: VII with the covariance matrix (λ_k I) (

Using

3) Model: VEE with the covariance matrix (λ_k DAD^T) (

From the results in

For this model, the misclassification rate was 15%.

4) Model: VVE with the covariance matrix (λ_k DA_k D^T) (

No. of clusters | AIC | SBC | ICOMP_PEU |
---|---|---|---|
1 | 2034.3 | 2034.5 | 2018.8 |
2 | 1873.5 | 1904.8 | 1859.8 |
3 | 1878.1 | 1890.8 | 1864.4 |
4 | 1877.7 | 1896 | 1863.5 |
5 | 1882.4 | 1907.4 | 1868.7 |
6 | 1888.74 | 1893.8 | 1873.7 |

Model: VII (λ_k I)

Input: n = 250, n₁ = 175, n₂ = 75, K = 2

Parameters (true): λ_k = 1, 2; A₁ = [1 0; 0 1]; A₂ = [1 0; 0 3]; COV₁ = [1.2929 1.2483; 1.2483 2.000]; COV₂ = [4.7071 −4.6268; −4.6268 5.4142]; π_k = [0.7, 0.3]; μ_k = [2, 2], [−3, 0]

Output (estimated, No. of simulations = 100): COV₁ = [1.0460 −0.0659; −0.0659 0.8913]; COV₂ = [2.4341 −0.5134; −0.5134 1.9715]; π_k = [0.7040, 0.2960]; μ_k = [1.9139, 2.0450], [−2.7437, −0.1582]

Actual \ Predicted | 1 | 2 | Total |
---|---|---|---|
1 | 175 | 0 | 175 |
2 | 3 | 72 | 75 |
Total | 178 | 72 | 250 |

No. of clusters | AIC | SBC | ICOMP_PEU |
---|---|---|---|
1 | 2043.4 | 2043.7 | 2028.1 |
2 | 1738.4 | 1744.9 | 1727.6 |
3 | 1736.1 | 1748.7 | 1725.2 |
4 | 1742.7 | 1761.5 | 1731.8 |
5 | 1752.6 | 1777.7 | 1741.8 |
6 | 1752.7 | 1784 | 1741.9 |

Model: VEE (λ_k DAD^T)

Input: n = 500, n₁ = 150, n₂ = 200, n₃ = 150, K = 3

Parameters (true): λ_k = 1, 1.5, 3; A = [1 0; 0 1]; D = [cos(6π/8) sin(π/8); −sin(π/8) cos(π/8)]; COV₁ = [0.6464 0.6242; 0.6242 1.000]; COV₂ = [0.9697 0.9362; 0.9362 1.500]; COV₃ = [1.9393 1.8725; 1.8725 3.000]; π_k = [0.3, 0.5, 0.2]; μ₁,₂,₃ = [0.5, 1], [1, 1], [0, −0.5]

Output (estimated, No. of simulations = 100): COV₁ = [0.8652 0.6431; 0.6431 0.8904]; COV₂ = [1.7439 2.1360; 2.1360 3.2356]; COV₃ = [1.2078 0.8378; 0.8378 1.7065]; π_k = [0.4559, 0.4162, 0.1279]; μ_k = [0.7230, 1.2093], [0.3678, 0.0200], [1.1760, 0.3249]

Actual \ Predicted | 1 | 2 | 3 | Total |
---|---|---|---|---|
1 | 126 | 24 | 0 | 150 |
2 | 30 | 170 | 0 | 200 |
3 | 20 | 1 | 129 | 150 |
Total | 176 | 195 | 129 | 500 |

The fitted number of clusters for this model was two clusters (

It was shown that the misclassification rate was 0% from the data in

In this paper, Gaussian mixture model-based clustering is used. The cluster-based mixture models are able to predict accurately if the appropriate covariance matrix model is selected. The method is applied using four models:

No. of clusters | AIC | SBC | ICOMP_PEU |
---|---|---|---|
1 | 1884.1 | 1884.3 | 1872.6 |
2 | 1523.1 | 1529.5 | 1516.1 |
3 | 1527.7 | 1540.4 | 1520.7 |
4 | 1523.8 | 1542.6 | 1516.8 |
5 | 1536.7 | 1561.8 | 1529.7 |
6 | 1534.7 | 1566 | 1527.7 |

Model: VVE (λ_k DA_k D^T)

Input: n = 250, n₁ = 175, n₂ = 75, K = 2

Parameters (true): λ_k = 1, 1.5; A₁ = [1 0; 0 1]; A₂ = [1 0; 0 3]; D = [cos(6π/8) sin(π/8); −sin(π/8) cos(π/8)]; COV₁ = [1.1980 1.6492; 1.6492 3.0186]; COV₂ = [1.4402 1.7089; 1.7089 3.2965]; π_k = [0.7, 0.3]; μ_k = [2, 2], [−3, 0]

Output (estimated, No. of simulations = 100): COV₁ = [0.6007 0.5619; 0.5619 0.9121]; COV₂ = [1.1635 1.8430; 1.8430 4.1089]; π_k = [0.7039, 0.2961]; μ_k = [2.0486, 2.0247], [−3.0599, −0.0947]

Actual \ Predicted | 1 | 2 | Total |
---|---|---|---|
1 | 175 | 0 | 175 |
2 | 0 | 75 | 75 |
Total | 175 | 75 | 250 |

1) Model [EVV] (λD_k A_k D_k^T) represents the case of equal volume with variable shape and orientation. It was shown that the optimal number of clusters equals two. From the values of the complexity criteria in

2) Model [VII] (λ_k I) represents the spherical case with variable volume. Also, in this model, the optimal number of clusters equals two and the ICOMP_PEU criterion attains the lowest value compared with the other two criteria (the values in

3) Model [VEE] (λ_k DAD^T) represents the case of variable volume with equal shape and orientation. From

4) Model [VVE] (λ_k DA_k D^T) represents the case of variable volume and shape with equal orientation. As in the first and second models, the optimal number of clusters equals two, and the ICOMP_PEU criterion attains the lowest value compared with the other two criteria (values are found in

The results showed that the ICOMP_PEU criterion was superior to the rest of the criteria, in addition to the success of the Gaussian cluster-based model in prediction using the covariance matrices. The study also showed that the optimal number of clusters can be determined by selecting the number of clusters corresponding to the lowest values of the different criteria.

For the number of clusters k = 1, ..., 6, the three different selection criteria all chose the VVE model with two clusters as the optimal model. For the selected model, the Gaussian Mixture Model-based Clustering (GMMC) recovers the cluster classification with a 0% misclassification rate.

The author declares no conflicts of interest regarding the publication of this paper.

Morad, N.A. (2020) Modeling Methods in Clustering Analysis for Time Series Data. Open Journal of Statistics, 10, 565-580. https://doi.org/10.4236/ojs.2020.103034