Clustering Countries on COVID-19 Data among Different Waves Using K-Means Clustering

Abstract

The COVID-19 pandemic has caused an unprecedented spike in confirmed cases across 230 countries. In this work, a COVID-19 outbreak dataset is analyzed with two well-known unsupervised techniques: K-means clustering and correlation analysis. K-means automatically searches for previously unidentified clusters among the nations affected by the virus. To examine the spread of COVID-19 before a vaccine became widely available, we used unsupervised approaches to identify the crucial country-level variables: confirmed cases, death cases, recovered cases, total_cases_per_million, and total_deaths_per_million. Using this feature subspace, we grouped countries into meaningful clusters to support more in-depth disease analysis. We then used a clustering technique to examine trends in COVID-19 incidence and mortality across nations. This technique extracts the key components of a trajectory and incorporates them into a K-means clustering process: we decomposed the trend lines into measures that characterize various features of a trend, reduced the measures in dimension, and clustered them with the K-means algorithm. The method was applied separately to the incidence and death rates, and the results were then compared.

Share and Cite:

Muhtasim and Masud, Md.A. (2023) Clustering Countries on COVID-19 Data among Different Waves Using K-Means Clustering. Journal of Computer and Communications, 11, 1-14. doi: 10.4236/jcc.2023.117001.

1. Introduction

The COVID-19 pandemic started in December 2019, and as of December 29, 2022, there had been 619,391,055 confirmed cases, including 6,537,201 fatalities [1]. These data cover more than 230 countries, regions, or territories affected by COVID-19 [2]. The illness pattern was not consistent across these locations, and understanding this heterogeneity is an essential source of knowledge for academics and policymakers. Carrillo et al. use unsupervised machine learning to categorize 155 nations with similar COVID-19 profiles. Clustering is performed on confirmed COVID-19 cases, with the following feature variables: disease prevalence, male population, air quality index, socioeconomic metrics, and health system indicators [3]. The resulting clusters shed light on the similarities and contrasts between nations in terms of how COVID-19 has affected them. The COVID-19 fatality rate is not included in this model to stratify nations [4]. Similar economic and health aspects of COVID-19 dissemination are discussed by Farseev et al. Their research reveals important connections between COVID-19 and other national indicators and, based on metrics for each nation's economy and health system, distinguishes four groupings [5]. Stojkoski et al. present further socioeconomic factors influencing COVID-19, identifying the socioeconomic, medical, demographic, and environmental elements relevant to its spread [6]. COVID-19 is a worldwide epidemic that endangers the lives and livelihoods of millions of people. Specialists have struggled to produce accurate projections for this illness due to its novelty and rapid spread [7] [8]. This is caused in part by variations in human behavior and environmental factors that affect the spread of diseases. It is particularly true for region-specific prediction models, owing either to a lack of case histories or to other distinctive characteristics of the region [9].

To study COVID-19 propagation prior to the wide availability of a vaccine, this work uses unsupervised methods to identify the critical country-level variables: confirmed cases, death cases, recovered cases, total_cases_per_million, and total_deaths_per_million. Using this feature subspace, we group countries into meaningful clusters to support more in-depth disease analysis [10]. We therefore use a clustering strategy to examine patterns in COVID-19 mortality and incidence across nations.

1.1. Problem Statement

The most crucial step of the k-means clustering technique is determining the number of clusters, k. We apply k-means clustering to datasets containing the numbers of coronavirus cases and deaths in various nations. Several traditional techniques exist for choosing k, but they have significant drawbacks, so we aim to develop an approach that avoids them. The number of deaths changes across the several waves of the coronavirus and also varies by nation within each wave. Homogeneous nations are therefore grouped into distinct clusters according to their numbers of deaths and cases in each wave. We introduce related work on functional data clustering and the methodology of several clustering methods of interest. We also bring up the idea of sequential functional data clustering, which could capture the stage-by-stage changes in the COVID-19 data. We provide a three-phase method for grouping country-specific COVID-19 incidence and death trends. This technique extracts the key components of a trajectory and incorporates them into a K-means clustering process. In the first phase, we decompose the trend lines into measures that characterize various features of a trend. In the second phase, the measures are reduced in dimension, and in the third they are clustered using the K-means algorithm. The method is applied separately to the incidence and death rates, and the results are then compared.
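The dimension-reduction phase described above can be illustrated with a short sketch. The source does not name the reduction technique or the exact trend measures, so this example assumes principal component analysis (computed via SVD) and four hypothetical measures per trajectory (peak value, week of peak, mean, and standard deviation). The function name `pca_reduce` and the synthetic trajectories are illustrative only.

```python
import numpy as np

def pca_reduce(M, n_components):
    """Project the rows of M onto the top principal components (PCA via SVD)."""
    centered = M - M.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = (S ** 2) / (S ** 2).sum()  # share of variance per component
    return centered @ Vt[:n_components].T, explained[:n_components]

# Synthetic stand-in for weekly trajectories: 12 hypothetical countries x 54 weeks.
rng = np.random.default_rng(1)
trajectories = rng.poisson(lam=50, size=(12, 54)).astype(float)

# Hypothetical trend measures (assumed, not taken from the paper).
measures = np.column_stack([
    trajectories.max(axis=1),     # peak weekly count
    trajectories.argmax(axis=1),  # week in which the peak occurs
    trajectories.mean(axis=1),    # average weekly count
    trajectories.std(axis=1),     # week-to-week variability
])

# Standardize so that no single measure dominates, then reduce to 2 dimensions.
z = (measures - measures.mean(axis=0)) / measures.std(axis=0)
reduced, explained = pca_reduce(z, n_components=2)
print(reduced.shape, explained)
```

The `reduced` matrix, with one row per country, is what would then be passed to K-means in the final phase.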

1.2. Objectives of the Research

The main objectives of our work are:

• To use K-means clustering to identify the behavior of groups of similar countries based on attributes such as country-level confirmed_cases, death_case, recover_case, total_cases_per_million, and total_deaths_per_million, among other country-level factors;

• To present a time series analysis of selected features from the dataset across regions worldwide;

• To show how confirmed cases and death cases are distributed globally across regions, for visualization, prediction, and forecasting;

• To make predictions using machine learning models.

2. Methodology

The suggested framework combines machine learning techniques, including decision trees, random forests, k-means, and hierarchical clustering, with a novel mathematical model built on clustering. Its flowchart is laid out as follows. The study's data collection comprised 15 distinct types of data gathered from 230 nations. The numbers of cases, tests, fatalities, recovered patients, and active patients were gathered every day from the start of the study to the present. Before being used in the study, the gathered data were summed. Additional variables include GDP, the prevalence of diabetes, smoking rates among men and women, population, and hospital beds. The current data, collected from GitHub, are expressed per 100 k. The data are organized into 230 rows and 15 continuous columns. After the analysis, the efficiency values were added to the data set as a factor type: nations with an efficiency level of 90% or more were coded as 1, and all others as 0. The decision tree and random forest algorithms both used the generated findings as inputs. The data fall into three distinct groups. The first comprises the most recent general COVID-19 data, the second comprises data describing nations' social and economic systems, and the third comprises data that are not usually utilized in COVID-19 research. The most popular data in the literature include total cases, total recovered, total tests, total fatalities, and total active cases. The data describing the social and economic structure of a nation include GDP, poverty index, population, and similar statistics. This study also takes into account variables that are rarely used in the literature, such as the stringency index and fatality rate. Our proposed overall methodology is shown in Figure 1.

Figure 1. Proposed methodology.

2.1. Data Collection

We retrieved the daily new cases and deaths of COVID-19 for 230 countries and territories from the publicly available WHO COVID-19 dashboard. The data record the daily growth of active COVID-19 cases and deaths for each country. We apply functional data clustering methods to the discrete, consecutive daily case observations, which lie on a fine grid and can therefore be treated as functional data. We describe the source of the COVID-19 data and its components, together with our preprocessing work. Preprocessing is quite challenging for COVID-19 data: countries may have different timelines for their case records, and data quality differs from country to country. We propose several data cleaning and imputation techniques to address this problem. The data were acquired from international official health organizations [2]. For our experiment, we used OWID COVID-19 data [3], collected from the open-source Our World in Data dataset [11]. This dataset contains a total of 214,707 instances and 42 features. It comprises a variety of attributes based on different scenarios that are simulated on day-wise COVID-19 data. The study period ran from the first day of a COVID-19 case to 22 September 2022 [1] [12]. During data preparation, we saw daily trajectories with rather high volatility, and some nations revised their reports days later. We therefore aggregated the daily incidence and death counts into weekly sums [9] [13]. A few nations supplied data of low quality, with significant incompleteness or daily volatility [6] [7], and others had zero-inflated data for several days. These issues do not affect the pattern of COVID-19 itself, so we do not consider these records in the following study. Several nations also had fewer than 54 weeks of data, so we disregarded them as well. In the end, this analysis employed 206 one-year-long trajectories for 206 distinct nations and territories.
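The weekly aggregation and the 54-week filter described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the helper name `weekly_sums`, the choice to drop an incomplete trailing week, the truncation of longer series to exactly 54 weeks, and the two toy countries are all hypothetical.

```python
MIN_WEEKS = 54  # minimum trajectory length stated in the text

def weekly_sums(daily):
    """Sum consecutive 7-day blocks; an incomplete trailing week is dropped (assumption)."""
    n_weeks = len(daily) // 7
    return [sum(daily[7 * w: 7 * (w + 1)]) for w in range(n_weeks)]

# Toy daily counts for two hypothetical countries.
daily_by_country = {
    "A": [1] * (7 * 60),      # 60 full weeks of data
    "B": [2] * (7 * 20 + 3),  # 20 full weeks plus 3 extra days
}

weekly = {name: weekly_sums(series) for name, series in daily_by_country.items()}

# Keep only countries with enough weeks, truncated to a common length (assumption).
kept = {name: w[:MIN_WEEKS] for name, w in weekly.items() if len(w) >= MIN_WEEKS}
print(sorted(kept), len(kept["A"]))
```

Summing over weeks smooths day-of-week reporting effects and late revisions, which is the motivation the text gives for moving away from daily counts.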

2.2. Data Preprocessing

The dataset needed to be pre-processed before the clustering algorithm could be used. Preprocessing data may be done using a variety of tools and techniques, such as the following.

Handling Missing Values

Our dataset contains missing values, which may create problems for the machine learning classification model, so we used a Python library function to handle them. Nearly every kind of data analysis, data science, or AI development requires some form of data preprocessing to produce accurate, precise, and robust findings. A correlation heatmap of the features of the dataset is shown in Figure 2.

Figure 2. Correlation heatmap of the features of dataset.
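The text does not name the library function used to handle missing values, nor the correlation coefficient behind the heatmap. As one plausible sketch, the example below assumes column-mean imputation and Pearson correlation, both implemented with NumPy; the small matrix is made-up data, not taken from the study.

```python
import numpy as np

# Toy feature matrix (rows: countries, columns: features) with missing entries.
X = np.array([
    [1.0, 2.0, 10.0],
    [2.0, np.nan, 8.0],
    [3.0, 6.0, 6.0],
    [4.0, 8.0, np.nan],
])

# Column-mean imputation (assumed strategy): NaNs are replaced by the
# mean of the observed values in the same column.
col_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_means, X)

# Pearson correlation between features; rowvar=False treats columns as variables.
corr = np.corrcoef(X_filled, rowvar=False)
print(np.round(corr, 2))
```

A heatmap such as Figure 2 is a color-coded rendering of a matrix like `corr`.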

2.3. Feature Selection

By eliminating uninformative attributes, we can reduce computation time. The Random Forest embedded approach, which combines the strengths of the filter and wrapper techniques, was used to choose the key features for the machine learning classifiers. With this feature selection method, we found that 15 features are the most important and most related to the label column; the remaining features are ignored. Table 1 summarizes the 15 selected features that are crucial to our machine learning model, and their importance rates are shown in Figure 3.

The following table shows the selected features used to cluster countries. The 15 features were selected by the Random Forest algorithm.

Table 1. Short description of 15 selected important features.

Figure 3. Feature importance rates of the dataset.
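Random Forest feature ranking of the kind summarized in Table 1 and Figure 3 can be sketched with scikit-learn. This is an assumption-laden illustration: the source does not state which library or hyperparameters were used, and the synthetic data, the binary label, and the cutoff `top_n` are hypothetical (the study keeps 15 of its 42 features).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_features = 300, 6
X = rng.normal(size=(n_samples, n_features))

# Hypothetical binary label driven almost entirely by feature 0.
y = (X[:, 0] + 0.1 * rng.normal(size=n_samples) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalized to sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important feature first

top_n = 3  # illustrative cutoff
selected = ranking[:top_n]
print(selected, np.round(importances[selected], 3))
```

Because the importances come from the fitted model itself, this is an embedded selection method: features are ranked as a by-product of training rather than by a separate filter or wrapper search.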

2.4. K-Means Clustering

K-means is a popular clustering algorithm, suggested by [2] [12]. K-means seeks a partition in which the squared error between each observation and the mean of its cluster is as small as possible. The squared error between the mean of cluster $C_j$, denoted $\mu_{C_j}$, and all of its observations is given by Equation (1):

$$J(C_j) = \sum_{x_i \in C_j} \lVert x_i - \mu_{C_j} \rVert^2 \qquad (1)$$

In order to determine the partition, the following unconstrained minimization problem is solved across all k clusters:

$$J(C) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_{C_j} \rVert^2 \qquad (2)$$

The k-means algorithm operates as follows:

• Pick k (random) data points as the initial centroids (cluster centers).

• Assign each data point to its nearest centroid.

• Recalculate the centroids using the current cluster memberships.

• Repeat the assignment and update steps until a convergence criterion is satisfied.
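The steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `kmeans`, the iteration cap, the seed, the empty-cluster handling, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means; returns (labels, centroids, objective J(C))."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and objective value, as in Equation (2).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    objective = float((dists[np.arange(len(X)), labels] ** 2).sum())
    return labels, centroids, objective

# Two well-separated toy groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids, J = kmeans(X, k=2)
print(labels, J)
```

Because the result depends on the random initial centroids, practical implementations usually run the algorithm several times and keep the partition with the lowest objective.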

Finding Number of Clusters

The success of K-means clustering depends on forming highly effective clusters, but choosing the right number of clusters can be challenging. Although there are several ways to determine the optimal number of clusters, in this study we concentrate on the most effective method [10]. The Elbow Curve Method is used to obtain the right value of k for our clustering technique.

The elbow method determines the best number of clusters through the following steps:

• Run K-means clustering on the given dataset for different values of K (ranging from 1 to 10).

• Compute the WCSS score for each value of K.

• Plot a curve of the WCSS values against the number of clusters K.

• Select as the best K the value at the sharp bend of the curve, the point where the plot resembles the elbow of an arm.
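The elbow procedure can be sketched numerically as follows. The elbow is normally read off the plot by eye; to keep the example self-contained, it is located here with a simple second-difference heuristic, which is an assumption of this sketch and not something the source specifies. The helper `kmeans_wcss`, the number of restarts, and the three synthetic groups are likewise illustrative.

```python
import numpy as np

def kmeans_wcss(X, k, rng, n_init=20, n_iter=100):
    """Lowest WCSS found over n_init random restarts of k-means."""
    best = np.inf
    for _restart in range(n_init):
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _step in range(n_iter):
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            new = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(new, centroids):
                break
            centroids = new
        # WCSS: squared distance of every point to its nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        best = min(best, float(d2.min(axis=1).sum()))
    return best

# Synthetic data: three well-separated groups of 30 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.66]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

ks = list(range(1, 11))  # K from 1 to 10, as in the text
wcss = [kmeans_wcss(X, k, rng) for k in ks]

# Heuristic elbow: the K where the curve bends most sharply,
# measured by the discrete second difference of WCSS.
curvature = [wcss[i - 1] - 2 * wcss[i] + wcss[i + 1] for i in range(1, len(ks) - 1)]
best_k = ks[1 + int(np.argmax(curvature))]
print(best_k)
```

On real data the bend is often less pronounced than in this synthetic case, so the plotted curve should still be inspected rather than relying on the heuristic alone.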

3. Results

Highly effective clusters are essential to the success of the K-means clustering algorithm, so the number of clusters must be chosen carefully.

3.1. Visualization Results

Results are visualized using choropleth maps. To show the K-means results for COVID-19 confirmed cases and death cases, maps are created for the 230 nations with currently available data. The first map displays the nations based on COVID-19 confirmed cases. The graphics make the grouping of nations by related variables easier to grasp. Figure 4 shows a time series analysis of the dataset, plotting how the number of affected patients, which we call confirmed cases, increases over time.

Figure 4. Global distribution of confirmed cases over time.

Figure 5 shows a time series analysis of the dataset, plotting how the number of patients who died, which we call death cases, increases over time.

Figure 5. Global distribution of confirmed deaths over time.

Figure 6 shows confirmed cases on the world map for various regions. The countries affected by COVID-19 are colored blue, showing how the virus has spread across regions.

Figure 6. Spread of the virus across various regions shown on the world map.

Figure 7 shows death cases on the world map for various regions. The countries affected by COVID-19 deaths are colored red, showing how deaths are distributed across regions.

Figure 7. COVID-19 deaths across various regions shown on the world map.

3.2. Clustering Results

Nations can be grouped according to several characteristics. Here, we group nations according to their mortality and recovery rates. COVID-19 mortality rates vary across nations for many reasons, and recovery rates vary as well because of national pandemic control policies. The Mortality Rate and Recovery Rate take all cases into account, including Confirmed, Recovered, and Deaths. We must first determine the value of k. The Elbow method, one of the most widely used techniques for determining the ideal number of clusters, lets us find it. The method is based on the WCSS (Within-Cluster Sum of Squares) value, which measures the total variation within the clusters. The formula for WCSS (for 3 clusters) is:

$$\mathrm{WCSS} = \sum_{P_i \in \text{Cluster 1}} \operatorname{distance}(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster 2}} \operatorname{distance}(P_i, C_2)^2 + \sum_{P_i \in \text{Cluster 3}} \operatorname{distance}(P_i, C_3)^2$$
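For reference, the three-cluster expression generalizes to any number of clusters k. Assuming that distance denotes the Euclidean distance and that $C_j$ in the WCSS formula denotes the centroid of cluster j (the text does not state this explicitly, and note that $C_j$ denotes the cluster itself in Equation (2)), WCSS coincides with the K-means objective $J(C)$ of Equation (2):

```latex
\mathrm{WCSS}(k)
  = \sum_{j=1}^{k} \sum_{P_i \in \text{Cluster } j} \operatorname{distance}(P_i, C_j)^2
  = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_{C_j} \rVert^2
  = J(C)
```

Under this reading, the elbow curve plots the minimized K-means objective against k. The value can only decrease or stay the same as k grows, which is why the bend in the curve, rather than its minimum, is used to choose k.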

We apply the above method to determine the number of clusters for our dataset. It yields 3 clusters, as shown in Figure 8.

Figure 8. Elbow method.

3.3. Summary of Clusters

Figure 9 shows the correlation heatmap of the features of the dataset.

Figure 10 shows our clustering result: 3 clusters, differentiated by three different colors. The countries assigned to each of the 3 clusters found by the clustering approach are listed in Table 2.

Cluster 0 is a set of countries that have a Low Mortality Rate and a High Recovery Rate. These are the countries that have been able to control COVID-19 by following pandemic-controlling practices rigorously.

Cluster 1 is a set of countries that have a Low Mortality Rate and a Low Recovery Rate. These countries need to speed up their Recovery Rate to get out of the pandemic. Some of them have a really high number of Infected Cases, but their Low Mortality Rate is a positive sign.

Cluster 2 is a set of countries that have a High Mortality Rate and a considerably Good Recovery Rate. A few countries in this cluster have already seen the worst of the pandemic but are now recovering with a healthy Recovery Rate.
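The two features behind these cluster descriptions can be computed as in the sketch below. The source does not give the formulas, so this example assumes that both rates are expressed as a percentage of confirmed cases. The function name `rates` and the three sets of country totals are hypothetical and chosen only to mirror the three profiles described above.

```python
def rates(confirmed, deaths, recovered):
    """Return (mortality_rate, recovery_rate) as percentages of confirmed cases."""
    if confirmed <= 0:
        raise ValueError("confirmed must be positive")
    return 100.0 * deaths / confirmed, 100.0 * recovered / confirmed

# Hypothetical totals: (confirmed, deaths, recovered).
totals = {
    "Country A": (1000, 20, 900),    # low mortality, high recovery
    "Country B": (5000, 50, 1500),   # low mortality, low recovery
    "Country C": (2000, 200, 1600),  # high mortality, good recovery
}

features = {name: rates(*t) for name, t in totals.items()}
for name, (mortality, recovery) in features.items():
    print(f"{name}: mortality {mortality:.1f}%, recovery {recovery:.1f}%")
```

Each country then becomes a point in the two-dimensional (mortality rate, recovery rate) space on which K-means with k = 3 is run.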

Figure 9. Correlation heatmap of the features of dataset.

Figure 10. Clustering countries.

Table 2. Clustering countries based on the clustering output.

4. Conclusion

With K-means clustering, we were able to quickly search for hidden or unidentified clusters across numerous COVID-19-infected nations and to discover the correlations between various variables. The correlation matrix and the K-means algorithm were used to examine COVID-19 risk in pandemic-affected countries. This study covered the exceptional increase in confirmed COVID-19 cases that has been observed in 230 countries globally. Based on global pandemic observation, the correlation matrix showed a significant positive link between the total number of fatalities and the number of patients classified as critical, which suggests that a large number of critically ill people did not recover. The results are based on data already available from an existing source, and future research is advised to take advantage of additional data. We are hopeful that these findings will provide the Centers for Disease Control and Prevention (CDC) with knowledge that will help it make crucial decisions regarding infection control, countermeasures, and precise mitigation responses.

To study the factors directly associated with the spread of the disease, 230 countries were clustered using an unsupervised K-means algorithm based on socioeconomic, disease prevalence, and health system indicators. COVID-19 confirmed cases and COVID-19 death cases were used as evaluation parameters, and the elbow method was used to determine the ideal number of clusters. The incidence of confirmed COVID-19 cases is significantly positively correlated with the prevalence of asthma, diabetes mellitus, cardiovascular illness, dietary inadequacies, and health expenditure. Applying K-means to COVID-19 confirmed cases and death cases produces three clusters. Cluster 0 consists of nations with high recovery rates and low mortality rates. These are the nations that have successfully contained COVID-19 by strictly adhering to pandemic control procedures. Nations in Cluster 1 have low mortality and low recovery rates. Some of these countries have an extremely large number of infected cases; their low mortality is a good sign, but they need to speed up their recovery rate to get out of the pandemic. Cluster 2 is a group of nations with a high mortality rate and a considerably good recovery rate. A few nations in this cluster have already experienced the worst of the epidemic but are currently recovering with a strong recovery rate. Disease prevalence has a substantial correlation with COVID-19, whereas environmental health indicators have a weaker one. Policymakers can use these results to inform better choices for containing the epidemic.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Agrebi, S. and Larbi, A. (2020) Use of Artificial Intelligence in Infectious Diseases. In: Barh, D., Ed., Artificial Intelligence in Precision Health, Elsevier, Amsterdam, 415-438.
https://doi.org/10.1016/B978-0-12-817133-2.00018-5
[2] Carrillo-Larco, R.M. and Castillo-Cara, M. (2020) Using Country-Level Variables to Classify Countries According to the Number of Confirmed COVID-19 Cases: An Unsupervised Machine Learning Approach. Wellcome Open Research, 5, 56.
https://doi.org/10.12688/wellcomeopenres.15819.3
[3] World Health Organization (2019) WHO Coronavirus Disease (COVID-19) Dashboard.
https://covid19.who.int/table
[4] Farseev, A., Chu-Farseeva, Y.-Y., Yang, Q. and Loo, D.B. (2020) Understanding Economic and Health Factors Impacting the Spread of COVID-19 Disease.
https://doi.org/10.31227/osf.io/7utqe
[5] Siddiqui, M.K., Morales-Menendez, R., Gupta, P.K., Iqbal, H.M., Hussain, F., Khatoon, K. and Ahmad, S. (2020) Correlation between Temperature and COVID-19 (Suspected, Confirmed and Death) Cases Based on Machine Learning Analysis. Journal of Pure and Applied Microbiology, 14, 1017-1024.
https://doi.org/10.22207/JPAM.14.SPL1.40
[6] Imtyaz, A., Haleem, A. and Javaid, M. (2020) Analyzing Governmental Response to the COVID-Pandemic. Journal of Oral Biology and Craniofacial Research, 10, 504-513.
https://doi.org/10.1016/j.jobcr.2020.08.005
[7] Pachetti, M., Marini, B., Giudici, F., Benedetti, F., Angeletti, S., Ciccozzi, M., Masciovecchio, C., Ippodrino, R. and Zella, D. (2020) Impact of Lockdown on COVID-19 Case Fatality Rate and Viral Mutations Spread in 7 Countries in Europe and North America. Journal of Translational Medicine, 18, Article No. 338.
https://doi.org/10.1186/s12967-020-02501-x
[8] Gabriel, P., Nestor, B. and Juliana, G. (2023) Identifying the Most Relevant Attributes to Explain Peaks of COVID-19 Infections and Deaths by Machine Learning Methods. International Journal of Computer Theory and Engineering, 15, 1-9.
https://doi.org/10.7763/IJCTE.2023.V15.1326
[9] Porcheddu, R., Serra, C., Kelvin, D., Kelvin, N. and Rubino, S. (2020) Similarity in Case Fatality Rates (CFR) of COVID-19/SARS-COV-2 in Italy and China. The Journal of Infection in Developing Countries, 14, 125-128.
https://doi.org/10.3855/jidc.12600
[10] Garg, P. and Joshi, D. (2021) A Region-Specific Clustering Approach to Investigate Risk-Factors in Mortality Rate During COVID-19: Comprehensive Statistical Analysis from 208 Countries. Journal of Medical Engineering & Technology, 45, 284-289.
https://doi.org/10.1080/03091902.2021.1893398
[11] Oppenheim, B., Gallivan, M., Madhav, N., Brown, N., Serhiyenko, V., Wolfe, N., et al. (2019) Assessing Global Preparedness for the Next Pandemic: Development and Application of an Epidemic Preparedness Index. BMJ Global Health, 4, e001157.
https://doi.org/10.1136/bmjgh-2018-001157
[12] van Seventer, J.M. and Hochberg, N.S. (2017) Principles of Infectious Diseases: Transmission, Diagnosis, Prevention, and Control. In: Quah, S.R., Ed., International Encyclopedia of Public Health, Elsevier, Amsterdam, 22-39.
https://doi.org/10.1016/B978-0-12-803678-5.00516-6
[13] Xiao, X., van Hoek, A.J., Kenward, M.G., Melegaro, A. and Jit, M. (2016) Clustering of Contacts Relevant to the Spread of Infectious Disease. Epidemics, 17, 1-9.
https://doi.org/10.1016/j.epidem.2016.08.001

Copyright © 2025 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.