Applying K-Means Clustering and Fuzzy C-Means Clustering in Vehicle Crashes
1. Introduction
In machine learning, the process of dividing a dataset into groups of similar data points is called clustering, an unsupervised learning technique [1]-[3]. The data are unlabeled, meaning there is no dependent variable (or response variable) to serve as ground truth for the algorithm. Instead, the goal is to split the dataset into groups so that data points in the same group are similar to one another and dissimilar to data points in other groups. Clustering algorithms do not depend on the common assumptions of conventional statistical methods, such as an underlying distribution of the data, and are therefore useful in situations where little prior knowledge exists [4]-[6]. Clustering can reveal underlying patterns in data in many applications, including classification, image processing, stock market analysis, pattern recognition, customer segmentation, and identification. Types of clustering include exclusive clustering, in which each data point belongs to exactly one cluster, as in K-means clustering; overlapping clustering, in which data points may belong to multiple clusters, as in Fuzzy C-means clustering; and hierarchical clustering, which groups similar objects into nested clusters such that each cluster is distinct from the others and the objects within a cluster are similar to each other [7] [8]. This paper applies K-means and Fuzzy C-means clustering to vehicle crash data.
2. K-Means
K-means is a widely used clustering algorithm due to its simplicity and clarity. The idea was first proposed by the Polish mathematician Hugo Steinhaus in 1956, while the term "K-means" was first used by James MacQueen in 1967. Later, David Arthur and Sergei Vassilvitskii introduced the k-means++ seeding improvement in 2007 [9]-[13]. The value K represents the number of clusters. The algorithm starts by randomly placing K centroids in the dataset; clusters are then formed by assigning each data point to its nearest centroid, and new centroids are computed as the means of the assigned points. This iterative process continues until convergence, yielding the final clustering. The algorithm's effectiveness relies on an appropriate distance measure, such as Euclidean distance, to quantify similarity between data points. K-means uses a centroid-based method, iteratively updating cluster centers to minimize intra-cluster distances, i.e., the Within-Cluster Sum of Squares (WCSS). Limitations of K-means include its sensitivity to outliers, which can distort centroid positions and produce suboptimal clusters; its difficulty with non-linearly separable structures in the data; and the fact that it does not handle categorical data directly [14]-[17]. A minimal implementation sketch is given below.
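To make the procedure concrete, the following is a minimal NumPy sketch of the iteration just described (random initialization, nearest-centroid assignment, mean update). It is illustrative only, not the NCSS implementation used later in this paper.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal Lloyd's-style K-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct data points as the initial centroids
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # keep the old centroid if a cluster ends up empty
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # convergence check
            break
        centers = new_centers
    return centers, labels
```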
In order to understand the math behind K-means, note that each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the observations assigned to that cluster. K-means clustering uses the sum of Euclidean distances from points to their nearest cluster centroids. The Euclidean distance between two points $x$ and $y$ in $n$-dimensional space is given by [18]-[20]:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
The objective function for K-means is given by:

$$J = \sum_{k=1}^{K} \sum_{i=1}^{m} w_{ik} \,\lVert x_i - c_k \rVert^2$$

where:

$w_{ik} = 0$ if data point $x_i$ does not belong to cluster $k$, and $w_{ik} = 1$ if it does;

$J$: objective function;

$K$: number of clusters;

$m$: number of cases;

$x_i$: case $i$;

$c_k$: centroid of cluster $k$.
Choosing the optimal number of clusters is a critical step in K-means, and since cluster counts are not predefined in unsupervised learning, a systematic approach is needed to determine the best K value. The Elbow Method is widely used for this purpose. We iterate over a range of K values, typically from 1 to n (where n is a user-specified upper bound), and for each K we calculate the Within-Cluster Sum of Squares, i.e., the sum of squared distances between each member of a cluster and its centroid. The K value is then chosen at the point where further increases in K yield only marginal decreases in these distances, i.e., where the curve forms an "elbow" [21]-[23]. A sketch of this procedure is shown below.
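As an illustration, here is a sketch of the Elbow Method using scikit-learn's KMeans (an assumed dependency; the paper itself uses NCSS). The placeholder data array is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # placeholder data; substitute your own dataset

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the within-cluster sum of squares
# Plot k against wcss and pick the k where the curve flattens (the "elbow").
```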
3. Fuzzy C-Means
Fuzzy C-means is a clustering algorithm that can manage overlapping clusters. It is a soft clustering technique in which every data point is assigned to each cluster along with a degree of membership, interpretable as the probability of the point belonging to that cluster; thus, it allows data points to belong to multiple clusters [24]-[29]. The algorithm works by randomly initializing cluster centroids from the dataset and specifying a fuzziness parameter (m) to control the degree of fuzziness in the clustering. It then calculates the membership of each data point in each cluster based on its distance to the cluster centroids, using a distance metric such as Euclidean distance. The process iterates until a specified number of iterations is reached or the membership values and centroids converge to stable values. The algorithm is less sensitive to outliers than K-means because it allows soft, probabilistic cluster assignments. The math behind Fuzzy C-means is as follows [27]-[30]:
Our goal is to minimize the objective function, which is as follows:

$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{C} u_{ij}^{\,m} \,\lVert x_i - c_j \rVert^2$$

where:

$n$: number of data points;

$C$: number of clusters;

$x_i$: the $i$-th data point;

$c_j$: centroid of the $j$-th cluster;

$u_{ij}$: membership value of data point $x_i$ in cluster $j$;

$m$: fuzziness parameter ($m > 1$).
Then, we need to update the membership values using the formula:

$$u_{ij} = \left[\sum_{k=1}^{C} \left(\frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert}\right)^{\frac{2}{m-1}}\right]^{-1}$$
Then, we need to update the cluster centroids using a weighted average of the data points:

$$c_j = \frac{\sum_{i=1}^{n} u_{ij}^{\,m}\, x_i}{\sum_{i=1}^{n} u_{ij}^{\,m}}$$
We keep updating the membership values and the cluster centers until they stop changing significantly or a predefined number of iterations is reached. Each data point is then assigned to the cluster (or clusters) for which it has the highest membership value. A minimal implementation sketch follows.
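The loop can be sketched in a few lines of NumPy. This is a minimal illustration of the update rules above under the stated convergence criteria, not the NCSS routine used later in the paper.

```python
import numpy as np

def fuzzy_c_means(X, C, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal Fuzzy C-means: returns (centroids, membership matrix U)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), C))
    U /= U.sum(axis=1, keepdims=True)      # each row of U sums to 1
    for _ in range(max_iter):
        Um = U ** m
        # Centroid update: membership-weighted mean of the data points
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distances from every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)              # guard against division by zero
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:  # stop when memberships stabilize
            return centers, U_new
        U = U_new
    return centers, U
```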
In order to choose the appropriate number of clusters in Fuzzy C-means, Dunn's partition coefficient can be used, which measures how close the fuzzy solution is to the corresponding K-means (hard) solution. Dunn's partition coefficient is a clustering measure that quantifies the degree of separation between different clusters. Its formula is [28] [29]:

$$F(U) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{K} u_{ij}^{2}$$
where $u_{ij}$ represents the membership value of object $i$ in cluster $j$, and $n$ is the total number of observations. This coefficient ranges from $1/K$ to 1; its value is $1/K$ when all memberships are equal to $1/K$. Dunn's partition coefficient may be normalized so that it varies from 0 (completely fuzzy) to 1 (hard clustering). The normalized version is:

$$F_c(U) = \frac{F(U) - 1/K}{1 - 1/K}$$
Another partition coefficient, given in Kaufman [30], is $D(U)$. This coefficient ranges from 0 (hard clusters) to $1 - 1/K$ (completely fuzzy). The normalized version of this coefficient, which ranges from 0 to 1, is [30]:

$$D_c(U) = \frac{D(U)}{1 - 1/K}$$
$F_c(U)$ and $D_c(U)$ together provide a good indication of the optimal number of fuzzy clusters: we should choose C so that $F_c(U)$ is large and $D_c(U)$ is small. A sketch for computing the partition coefficient is given after this paragraph.
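Given a membership matrix U, Dunn's partition coefficient and its normalized form can be computed directly from the formulas above; a small sketch:

```python
import numpy as np

def partition_coefficients(U):
    """Dunn's partition coefficient F(U) and its normalized version Fc(U)."""
    n, K = U.shape
    F = (U ** 2).sum() / n                 # ranges from 1/K (fuzzy) to 1 (hard)
    Fc = (F - 1.0 / K) / (1.0 - 1.0 / K)   # rescaled to [0, 1]
    return F, Fc
```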
4. Data
The data used in this paper record the number of vehicle crashes at two local streets, A and B, in the city of Columbia, Missouri, USA, from 2008 to 2019. The data include three columns, namely "Year", "Number of crashes at street A", and "Number of crashes at street B", as shown in Table 1.
Table 1. Number of vehicle crashes at street A and B from 2008-2019.
| Year | Number of crashes at street A | Number of crashes at street B |
|------|-------------------------------|-------------------------------|
| 2008 | 61 | 52 |
| 2009 | 53 | 38 |
| 2010 | 74 | 42 |
| 2011 | 69 | 51 |
| 2012 | 81 | 44 |
| 2013 | 48 | 71 |
| 2014 | 59 | 52 |
| 2015 | 61 | 78 |
| 2016 | 47 | 56 |
| 2017 | 82 | 49 |
| 2018 | 74 | 68 |
| 2019 | 83 | 47 |
In order to determine the K value, the Elbow Method was applied using the NCSS software, and K = 5 was chosen for both K-means and Fuzzy C-means. Figure 1 shows the number of vehicle crashes at streets A and B from 2008 to 2019. The crashes at both streets for the years 2010, 2012, 2017, and 2019 are grouped as cluster 1; for the years 2013 and 2015 as cluster 2; for the years 2008, 2014, and 2016 as cluster 3; for the year 2009 as cluster 4; and for the years 2011 and 2018 as cluster 5. A scikit-learn sketch of this fit is given after Figure 1.
Figure 1. Number of vehicle crashes at street A and B from 2008-2019.
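For reference, a scikit-learn sketch of the K = 5 fit on the Table 1 data is shown below (an assumed library; cluster numbering and assignments may differ from the NCSS output reported here).

```python
import numpy as np
from sklearn.cluster import KMeans

# Crash counts (street A, street B) for 2008-2019, from Table 1
X = np.array([[61, 52], [53, 38], [74, 42], [69, 51], [81, 44], [48, 71],
              [59, 52], [61, 78], [47, 56], [82, 49], [74, 68], [83, 47]])

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each year
print(km.cluster_centers_)  # cluster means for each street
```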
Figure 2 shows the trendlines (i.e., increasing, constant, or decreasing) of vehicle crashes at streets A and B across years. For example, the number of crashes at street A in cluster 3 decreases from 61 in 2008 to 59 in 2014 and 47 in 2016, while at street B it is constant from 2008 to 2014 and then increases to 56 in 2016. Figure 3 shows the clusters of crashes at street A vs. street B.
Figure 2. The trendlines of vehicle crashes at street A and B from 2008 to 2019.
Figure 3. Clusters of vehicle crashes at street A vs street B.
The summary statistics for the dataset are shown in Table 2. The mean number of crashes at street A is 66 with a standard deviation of 13.003, while the mean at street B is 54 with a standard deviation of 12.269. The minimum and maximum numbers of crashes are 47 and 83 at street A, and 38 and 78 at street B. A sketch reproducing these statistics is given after Table 2.
Table 2. Summary statistics of the dataset.
| Statistic | Crashes at Street A | Crashes at Street B |
|-----------|---------------------|---------------------|
| Mean | 66 | 54 |
| Standard Deviation | 13.003 | 12.269 |
| Standard Error | 3.7537 | 3.5419 |
| Lower 95% CL Mean | 57.737 | 46.204 |
| Upper 95% CL Mean | 74.262 | 61.795 |
| Minimum | 47 | 38 |
| Maximum | 83 | 78 |
| Shapiro-Wilk Normality Test | 0.9219 | 0.9693 |
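The Table 2 statistics can be reproduced with NumPy and SciPy (assumed libraries); a brief sketch:

```python
import numpy as np
from scipy import stats

A = np.array([61, 53, 74, 69, 81, 48, 59, 61, 47, 82, 74, 83])
B = np.array([52, 38, 42, 51, 44, 71, 52, 78, 56, 49, 68, 47])

for name, x in [("Street A", A), ("Street B", B)]:
    n = len(x)
    se = x.std(ddof=1) / np.sqrt(n)                                 # standard error
    lo, hi = stats.t.interval(0.95, n - 1, loc=x.mean(), scale=se)  # 95% CI for mean
    W, p = stats.shapiro(x)                                         # Shapiro-Wilk W
    print(name, x.mean(), x.std(ddof=1), se, lo, hi, W)
```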
The mean of each cluster at each street is shown in Table 3. For example, the mean of cluster 1 at street A is 51.33, while at street B it is 59.66; the mean of cluster 3 at street A is 74.66, while at street B it is 45.66.
Table 3. The cluster’s mean at street A and B.
| Street | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 |
|--------|-----------|-----------|-----------|-----------|-----------|
| A | 51.33 | 57 | 74.66 | 67.5 | 82.5 |
| B | 59.66 | 45 | 45.66 | 73 | 48 |
5. Discussion
The clusters of vehicle crashes were formed based on similarity at both streets. Using the NCSS software, the Analysis of Variance (ANOVA) for the K-means algorithm is shown in Table 4. For street A, the between-cluster mean square is 395.4167, the within-cluster mean square is 39.7619, the F-ratio is 9.94, and the p-value is 0.00515; for street B, the between-cluster mean square is 315.1667, the within-cluster mean square is 56.4761, the F-ratio is 5.58, and the p-value is 0.00436. The p-values are small and significant at both streets, indicating that the K-means clusters are well separated. A sketch of this ANOVA is given after Table 4.
Table 4. The ANOVA table of vehicle crashes at street A and street B.
| Variable | DF1 | DF2 | Mean Square Between | Mean Square Within | F-ratio | p-value |
|----------|-----|-----|---------------------|--------------------|---------|---------|
| Number of crashes street A | 4 | 7 | 395.4167 | 39.7619 | 9.94 | 0.00515 |
| Number of crashes street B | 4 | 7 | 315.1667 | 56.4761 | 5.58 | 0.00436 |
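The per-street one-way ANOVA across cluster labels can be sketched with scipy.stats.f_oneway (an assumed library; with labels from a re-run of K-means, the exact figures may differ from the NCSS values in Table 4):

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.cluster import KMeans

X = np.array([[61, 52], [53, 38], [74, 42], [69, 51], [81, 44], [48, 71],
              [59, 52], [61, 78], [47, 56], [82, 49], [74, 68], [83, 47]])
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

for col, street in [(0, "A"), (1, "B")]:
    groups = [X[labels == c, col] for c in np.unique(labels)]
    F, p = f_oneway(*groups)  # between/within mean squares yield F-ratio, p-value
    print(f"Street {street}: F = {F:.2f}, p = {p:.5f}")
```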
The matrix of distances from each crash point to each cluster center, using Euclidean distances, is shown in Table 5. These distances play a crucial role in forming the crash clusters. For example, cluster 2 consists of two crash points: the first lies at distance 2.0071 from center 1, 0.6628 from center 2, 1.4364 from center 3, 2.9558 from center 4, and 3.2454 from center 5; the second lies at distance 2.3071 from center 1, 0.6628 from center 2, 1.8640 from center 3, 3.7024 from center 4, and 3.4701 from center 5. A sketch for computing such a distance matrix is given after Table 5.
Table 5. The distance matrix from each crash point to each cluster center (Euclidean distances).
| Cluster | Distance to Center 1 | Distance to Center 2 | Distance to Center 3 | Distance to Center 4 | Distance to Center 5 |
|---------|----------------------|----------------------|----------------------|----------------------|----------------------|
| 2 | 2.0071 | 0.6628 | 1.4364 | 2.9558 | 3.2454 |
| 2 | 2.3071 | 0.6628 | 1.8640 | 3.7024 | 3.4701 |
| 3 | 2.5605 | 1.3936 | 0.4109 | 3.1438 | 2.3642 |
| 3 | 1.7887 | 1.2536 | 0.6155 | 2.3569 | 2.2151 |
| 3 | 2.6933 | 2.0870 | 0.5767 | 2.8674 | 1.6997 |
| 1 | 1.0275 | 2.5548 | 2.9625 | 1.7938 | 3.5321 |
| 1 | 0.8641 | 1.6359 | 1.5525 | 1.9589 | 2.1455 |
| 4 | 1.6791 | 3.2524 | 3.0463 | 0.7675 | 3.0666 |
| 1 | 0.6435 | 2.3921 | 2.6757 | 2.1034 | 2.8611 |
| 5 | 2.6200 | 3.0594 | 1.7779 | 2.2558 | 0.2916 |
| 4 | 2.1293 | 3.4879 | 2.6618 | 0.7675 | 1.7562 |
| 5 | 2.9447 | 3.5363 | 2.3121 | 2.5282 | 0.2916 |
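A point-to-center distance matrix like Table 5 can be computed with SciPy's cdist. The magnitudes in Table 5 suggest the variables were standardized before clustering, so this sketch assumes z-scored data; that standardization step is an assumption on my part, not stated in the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

X = np.array([[61, 52], [53, 38], [74, 42], [69, 51], [81, 44], [48, 71],
              [59, 52], [61, 78], [47, 56], [82, 49], [74, 68], [83, 47]])
Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # assumed z-scoring

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(Xz)
D = cdist(Xz, km.cluster_centers_)  # 12 x 5 Euclidean distance matrix
```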
The matrix of cluster membership probabilities for all crash points in the Fuzzy C-means algorithm is shown in Table 6. For example, cluster 2 has two crash points: the first has probability 0.1305 in cluster 1, 0.6258 in cluster 2, 0.0767 in cluster 3, 0.0887 in cluster 4, and 0.0783 in cluster 5; the second has probability 0.1072 in cluster 1, 0.6543 in cluster 2, 0.0681 in cluster 3, 0.0772 in cluster 4, and 0.0932 in cluster 5. The membership probability of each crash point is vital in assigning crash points to clusters in Fuzzy C-means, and it shows that each crash point can be associated with more than one cluster, as determined by its membership probabilities.
Goodness-of-fit statistics for Fuzzy C-means are shown in Table 7. We choose the optimal number of clusters (C) so that Fc(U) is large and Dc(U) is small. Fc(U) has its largest value at 5 clusters (0.5505), and Dc(U) has its smallest value at 5 clusters (0.1087), as shown in Table 7. Therefore, we keep our initial value of C = 5 clusters. These goodness-of-fit statistics are essential for obtaining the optimal C value for the fuzzy algorithm. A usage sketch combining the earlier code sketches is given after Table 7.
Table 6. The matrix of cluster membership probability for all crash points.
| Cluster ID | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 |
|------------|-----------|-----------|-----------|-----------|-----------|
| 1 | 0.4859 | 0.1969 | 0.1437 | 0.0973 | 0.0763 |
| 1 | 0.8532 | 0.0543 | 0.0415 | 0.0269 | 0.0241 |
| 1 | 0.0938 | 0.7504 | 0.0487 | 0.0514 | 0.0557 |
| 2 | 0.1305 | 0.6258 | 0.0767 | 0.0887 | 0.0783 |
| 2 | 0.1072 | 0.6543 | 0.0681 | 0.0772 | 0.0932 |
| 3 | 0.0768 | 0.0588 | 0.7690 | 0.0573 | 0.0382 |
| 3 | 0.1949 | 0.1394 | 0.4889 | 0.0978 | 0.0790 |
| 4 | 0.0115 | 0.0143 | 0.0120 | 0.9481 | 0.0141 |
| 3 | 0.1307 | 0.0955 | 0.6287 | 0.0823 | 0.0627 |
| 5 | 0.0310 | 0.0492 | 0.0259 | 0.0458 | 0.8482 |
| 4 | 0.1112 | 0.1611 | 0.1116 | 0.3967 | 0.2194 |
| 5 | 0.0296 | 0.0446 | 0.0256 | 0.0429 | 0.8574 |
Table 7. Goodness-of-fit statistics for the Fuzzy C-means algorithm.
| Number of Clusters | Average Distance | Average Silhouette | F(U) | Fc(U) | D(U) | Dc(U) |
|--------------------|------------------|--------------------|------|-------|------|-------|
| 2 | 2.951851 | 0.334028 | 0.6048 | 0.2095 | 0.1988 | 0.3975 |
| 3 | 1.676369 | 0.511612 | 0.6079 | 0.4119 | 0.1815 | 0.2722 |
| 4 | 1.097818 | 0.572472 | 0.6047 | 0.4729 | 0.1582 | 0.2109 |
| 5 | 0.698199 | 0.702357 | 0.6404 | 0.5505 | 0.0870 | 0.1087 |
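As a usage sketch, the earlier fuzzy_c_means and partition_coefficients sketches can be combined to scan candidate values of C and tabulate Fc(U). Exact values will differ from the NCSS output in Table 7, and the z-scoring of the data is again an assumption.

```python
# Assumes the fuzzy_c_means and partition_coefficients sketches above are in
# scope, along with the crash-count array X and its standardized form Xz.
for C in range(2, 6):
    centers, U = fuzzy_c_means(Xz, C)
    F, Fc = partition_coefficients(U)
    print(f"C = {C}: Fc(U) = {Fc:.4f}")
```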
6. Conclusion
Clustering is a powerful unsupervised machine learning technique for grouping unlabeled datasets. In order to reveal different patterns in vehicle crashes, this paper presented two clustering algorithms, K-means and Fuzzy C-means. In both algorithms, the goal is to identify cluster centers that best represent the data points in each cluster; the centers are calculated as the mean (centroid) of each cluster's assigned data points, and the algorithms iteratively update them until convergence to find the optimal clusters. Fuzzy C-means additionally assigns each data point a probability of belonging to each cluster, and therefore gives better results than K-means clustering on overlapping datasets.
Conflicts of Interest
The author declares no conflicts of interest.