A Review on Clustering Methods for Climatology Analysis and Its Application over South America
1. Introduction
In various climate and meteorological studies, it is often essential to categorize data (observations or variables) into distinct subgroups containing elements that share similar characteristics. For instance, this separation can be used to create spatially homogeneous observations, either by utilizing weather stations or gridded data, and considering different variables like temperature or precipitation. This allows for climate regionalization, aiding in understanding regional climate patterns. Additionally, data grouping based on different temporal scales, such as hours, days, or months, and various parameters, can be employed to identify patterns, such as synoptic types. This facilitates the analysis of meteorological events and their underlying dynamics. Furthermore, for forecast analysis and evaluation, data grouping plays a crucial role. It enables the grouping of ensemble members, which can help in assessing the uncertainty and performance of forecasting models [1] [2] [3] .
Cluster Analysis is a versatile technique that facilitates various types of studies. Its roots can be traced back to Tryon’s proposal in 1939 [4], where it was primarily applied in biological taxonomy. However, it gained significant attention in the 1960s with the advent of faster computers, and by the 1970s, it began finding applications in diverse fields such as biology, sociology, and medicine [1]. Over the years, with the advancements in high-speed computers and data science, Cluster Analysis has evolved and is now classified as an unsupervised learning method [2]. This means that it can identify patterns and structures within data without the need for labeled examples, making it a valuable tool in modern data analysis and exploration.
Several studies have been conducted over South America to classify homogeneous regions based on climate variables, employing various clustering methods. These investigations are of significant importance, given the continent’s vast territory and diverse climate patterns. The outcomes of such studies find valuable applications in multiple domains, including agriculture, natural disaster management, understanding climate impacts, water resources management, and climate and weather forecasting [5] .
In [6], gauge stations over Brazil were used to classify precipitation into six homogeneous regions with the Ward method. In [7], precipitation regimes over South America were reviewed and classified into eight homogeneous regions through a subjective analysis (graphical climatological analysis of meteorological station data). [5] used a multivariate technique based on fuzzy theory (Grade of Membership) to identify nine climate profiles over Brazil from precipitation, relative humidity, and maximum and minimum temperature for 1980-2013. [8] applied the K-Means clustering method to monthly precipitation data from ECMWF-SEAS5 and CPC for the period 1993-2016 and found eight homogeneous regions.
In addition to general studies across South America, specific research focusing on particular regions has also been conducted. For instance, [9] used clustering analyses to investigate the distributions of sea surface temperature (SST) anomalies and moisture sources in the South Atlantic Ocean during extreme dry events in southeastern Brazil throughout the austral autumn, winter, and spring. This targeted approach allows for a deeper understanding of the factors influencing such extreme events in the specified region, shedding light on the complex interactions between SST and moisture source patterns during these critical periods. Cluster Analysis has also been employed at a regional scale in South America in numerous studies, primarily to demarcate homogeneous zones with respect to rainfall and temperature patterns [10] [11] [12], and such data have also been used in the context of grain production [13]. The versatility of the methodology has further allowed its application to the temporal characterization of temperature variability [14] and to the delineation of large-scale meteorological patterns within the South American region [15].
However, each of these studies used a single, previously chosen method. Comparisons between methods have been performed in only a few studies, and for other regions. [1] highlight the importance of intercomparing clustering techniques on geophysical data rather than on synthetic data, since the skill of a method can only be known by applying it to real data. [16] compared four hierarchical methods using tropical rainfall stations, showing no significant difference in performance between the methods. [1] found that nonhierarchical methods outperformed hierarchical ones for central-eastern North America. [3] found that the K-Means clustering method produced stable cluster boundaries compared to other methods for Ethiopian precipitation. [17] regionalized annual precipitation for Iran using the K-Means and Self-Organizing Maps methods and showed that K-Means performs better (according to the Silhouette Coefficient, Dunn index and Davies-Bouldin index). [18] found that K-Means gives better results (according to the Calinski-Harabasz and Davies-Bouldin measures) than the Ward and Self-Organizing Maps methods for clustering precipitation over southeastern Brazil.
For the entire South American continent, no studies using Self-Organizing Maps (SOM) for clustering regionalization were found. Furthermore, no studies comparing different clustering methodologies for climate regionalization over the continent were found. Given this gap, the objective of this study is to review different clustering methodologies (Ward, K-means and SOM) and the metrics for their evaluation (Silhouette value, Calinski-Harabasz index, Davies-Bouldin index, Elbow Method and Modified Elbow Method). This study also presents a comparison between different reanalysis datasets for temperature (ERA5, ERA5-Land and CRU) and precipitation (ERA5, ERA5-Land and CPC). The clustering methods and evaluation metrics were applied to precipitation and temperature over South America, using only ERA5-Land data, to create spatially homogeneous groups of these climate variables.
2. Clustering Methods
According to [19], clusters are regions in the attribute space that contain a high density of patterns and are separated from each other by regions of low density. The central idea of clustering methods is to divide a data set into groups (clusters) such that the elements within each group are similar to one another while behaving distinctly from the elements of other groups. The literature on cluster analysis is very extensive, and its applications span from signal processing to psychology, archaeology, and linguistics [20].
We can express a clustering method as a function $g: I \rightarrow G$, where $I = \{x_1, \ldots, x_m\}$ is a set of m observed examples/objects defined on the attribute space X, and $G = \{G_1, \ldots, G_c\}$ is a partition of I into c subsets. It is important to emphasize that there is no prior knowledge about the labels of the examples contained in I. According to the general overview presented by [21], clustering methods can be categorized into hierarchical methods, methods based on cost function optimization (non-hierarchical), and others, including neural network-based methods. The following sections discuss notable clustering methods from each of these categories.
2.1. Hierarchical Methods
Hierarchical methods are commonly used to synthesize the organizational structure of how the elements are related to each other. A representation based on a dendrogram, exemplified in Figure 1, supports the understanding of this structure. A dendrogram is a diagram that shows the hierarchical relationship between objects; its main use is to work out the best way to allocate objects to clusters [2]. Through dissimilarity values, the existence of subdivisions becomes evident with respect to a given threshold τi. These subdivisions naturally determine the configuration of the clusters. In the dendrogram in Figure 1, the height of each link indicates the order in which the clusters were joined. We can see that x4 and x5 are the most similar objects, as the height of the link that joins them is the smallest. The next two most similar objects are x3 and x6.
The different algorithms proposed in the literature lead to the construction of a hierarchical relationship among the data. Agglomerative hierarchical approaches derive this relationship through consecutive clustering steps on the dataset until a single cluster is obtained at the end. Conversely, a divisive hierarchical algorithm starts with a single cluster composed of all the data involved in the problem and undergoes successive subdivisions until clusters composed of a single example are obtained.
Figure 1. Example of dendrogram. The hierarchical structure depicted in terms of dissimilarity shows how the data is clustered.
Conveniently and in a generic way, we denote by $d(x_i, x_j)$ the dissimilarity between the objects $x_i$ and $x_j$, where $x_i, x_j \in I$. According to this measure, a larger value of $d(x_i, x_j)$ indicates a higher dissimilarity between input patterns, and conversely, greater similarity is observed as $d(x_i, x_j) \rightarrow 0$. Furthermore, we can denote $D(G_i, G_j)$ as the observed dissimilarity between clusters $G_i$ and $G_j$.
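Based on the above-presented concepts of dissimilarity, the hierarchical method of Ward clusters the data ensuring the minimum internal variability within the clusters by adopting the dissimilarity measure defined in Equation (1). According to this measure, the dissimilarity between $G_k$ and a given cluster, resulting from grouping $G_i$ and $G_j$, is recursively computed. Initially, each element in the dataset defines a cluster and, in this case, the dissimilarity $D(G_i, G_j)$ stands for the Euclidean distance between $x_i$ and $x_j$.

$$D(G_k, G_i \cup G_j) = \frac{(n_i + n_k)\, D(G_k, G_i) + (n_j + n_k)\, D(G_k, G_j) - n_k\, D(G_i, G_j)}{n_i + n_j + n_k} \qquad (1)$$

where $n_i$, $n_j$ and $n_k$ denote the number of elements in $G_i$, $G_j$ and $G_k$, respectively.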
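The recursive dissimilarity update used by Ward's method can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the implementation used in this study: the function name `ward_clusters` is hypothetical, the update is written in its common Lance-Williams form, and the sketch starts from squared Euclidean distances between single elements, for which the recursion stays proportional to the increase in within-cluster sum of squares caused by each merge.

```python
def ward_clusters(points, n_clusters):
    """Agglomerative clustering with a Ward-type recursive update (illustrative)."""
    # Every point starts as its own cluster: cluster id -> list of point indices.
    members = {i: [i] for i in range(len(points))}
    # Initial dissimilarities (assumption: squared Euclidean distance).
    dist = {}
    for i in members:
        for j in members:
            if i < j:
                dist[(i, j)] = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))

    def d(a, b):
        # Dissimilarities are stored once, with the smaller id first.
        return dist[(a, b)] if a < b else dist[(b, a)]

    next_id = len(points)
    while len(members) > n_clusters:
        # Merge the two clusters with the smallest dissimilarity.
        i, j = min(dist, key=dist.get)
        ni, nj = len(members[i]), len(members[j])
        updated = {}
        for k in members:
            if k in (i, j):
                continue
            nk = len(members[k])
            # Dissimilarity between cluster k and the cluster formed by i and j.
            updated[k] = (
                (ni + nk) * d(k, i) + (nj + nk) * d(k, j) - nk * d(i, j)
            ) / (ni + nj + nk)
        # Discard all entries involving the two merged clusters.
        dist = {pair: v for pair, v in dist.items() if i not in pair and j not in pair}
        members[next_id] = members.pop(i) + members.pop(j)
        for k, value in updated.items():
            dist[(k, next_id)] = value  # k is always smaller than next_id
        next_id += 1
    return [sorted(m) for m in members.values()]


# Toy usage: five 2-D points reduced to three clusters of point indices.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
toy_clusters = ward_clusters(toy_points, 3)
```

Stopping the merging loop at a chosen number of clusters plays the same role as cutting the dendrogram at a given dissimilarity threshold.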
2.2. K-Means
Clustering methods based on “function optimization” seek a partition of the dataset such that the internal variability of the clusters is minimized while the separation between clusters is maximized. The K-Means algorithm is a widely known algorithm based on this concept [22]. To partition a given dataset I into k clusters, the following objective function should be minimized:

$$J = \sum_{j=1}^{k} \sum_{x_i \in G_j} \| x_i - \mu_j \|^2 \qquad (2)$$

Two straightforward steps characterize this algorithm: 1) assigning elements to clusters based on the smallest dissimilarity, expressed in terms of Euclidean distance, between a given pattern and the mean vector of the cluster, represented by the cluster centroids $\mu_j$, $j = 1, \ldots, k$; 2) updating the centroid vector that represents each cluster according to the average vector computed from the elements assigned in the previous step.
These two described steps are performed iteratively until convergence is reached. Commonly adopted convergence criteria include the absence of changes in element assignments between clusters and/or no alteration in the internal variability of the clusters.
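The two steps and the convergence criterion described above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in this study; the function name `kmeans`, the random initialization from data points, the fixed seed and the handling of empty clusters are all assumptions made for the example.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means: alternate assignment and centroid update (illustrative)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k data points drawn at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each point to the closest centroid
        # (squared Euclidean distance gives the same ranking as the distance).
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Convergence: stop when no element changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2: move each centroid to the mean of the points assigned to it.
        for j in range(k):
            cluster = [p for p, lab in zip(points, labels) if lab == j]
            if cluster:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return labels, centroids


# Toy usage: two well-separated groups of 2-D points.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
```

Because the result depends on the initial centroids, practical applications usually repeat the procedure with several initializations and keep the partition with the lowest objective value.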
2.3. Self-Organizing Maps
Self-Organizing Maps (SOM) comprises a neural network-based model for data clustering. In this model, neurons are represented by topologically organized maps where the location/coordinate of these neurons expresses a specific feature of the input data [23] .
Conveniently, a map of neurons is represented by a matrix $W$ of size $L_1 \times L_2 \times n$, where $L_1$ and $L_2$ define the neuron map dimensions and n stands for the dimension of the attribute space $X$. A given neuron in this map, at coordinates (u, v), has an associated weight vector denoted by $w_{uv}$. Thus, for a given object $x \in X$, it is possible to assess its similarity to each neuron in the network and make adjustments to their respective associated weights when necessary. This relationship is summarized in Figure 2, where the attributes of an object are compared to each neuron in this network through a weight associated with the neuron.
Figure 2. The SOM architecture. Each input is compared to each neuron according to the respective components and weights.
During the execution of such a neural network, three main processes are involved: competition, cooperation, and adaptation. The competitive process consists of determining the neuron in the network that has the minimum dissimilarity to the presented object. The neuron selection according to the minimum object-neuron dissimilarity is expressed by:

$$(u, v) = \arg\min_{(p, q)} d(x, w_{pq}) \qquad (3)$$

where $d(x, w_{pq})$ is the dissimilarity between the object x and the weight vector of the neuron at coordinates (p, q).
Once the neuron at coordinates (u, v) demonstrates the highest similarity (i.e., lowest dissimilarity) to the pattern x, corrections must be made to all other neurons in the network based on the configuration of the identified neuron and the presented pattern. Such corrections are conducted to benefit the neurons located in the neighborhood of (u, v), thus characterizing a cooperative process.
For this purpose, “topological neighborhood functions” are used. Among different proposals in the literature, the Gaussian function $h(p, q) = \exp\left(-\frac{(p - u)^2 + (q - v)^2}{2\sigma^2}\right)$ is widely used, where $(p, q)$ represents a spatial coordinate pair, and $\sigma$ controls the range of the neighborhood.
Lastly, the adaptive process is responsible for adjusting the weights of the neurons as patterns are presented to the neural network. Assuming that the neuron at coordinates (u, v) was identified during the competitive process, and $(p, q)$ represents the coordinates of the neurons in $W$, the adaptive process is defined as:

$$w_{pq} \leftarrow w_{pq} + \eta \, h(p, q) \, (x - w_{pq}) \qquad (4)$$

where $\eta$ represents a learning rate and $h(p, q)$ is the topological neighborhood function evaluated at $(p, q)$.
In general, the training process is iteratively executed until convergence is achieved in the weight adjustment process. Once convergence is detected, the final configuration of the neurons, with their adjusted weights, provides a flat representation of the analyzed object features and, consequently, allows the data assigned to specific regions of the neuron map to be grouped.
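The competition, cooperation and adaptation processes described above can be sketched in plain Python as follows. This is a hedged illustration, not the configuration used in this study: the function names `train_som` and `best_matching_unit`, the initialization from random data points, the linear decay of the learning rate and of the neighborhood range, and the fixed number of iterations used in place of a convergence test are all assumptions.

```python
import math
import random


def best_matching_unit(w, x):
    """Competition: coordinates of the neuron whose weights are closest to x."""
    return min(w, key=lambda pq: sum((a - b) ** 2 for a, b in zip(x, w[pq])))


def train_som(data, rows, cols, n_iter=2000, eta0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols map; returns {(p, q): weight vector} (illustrative)."""
    rng = random.Random(seed)
    # Assumption: each weight vector starts as a copy of a random data point.
    w = {(p, q): list(rng.choice(data)) for p in range(rows) for q in range(cols)}
    for t in range(n_iter):
        frac = 1.0 - t / n_iter
        eta = eta0 * frac                 # learning rate, decayed linearly
        sigma = max(sigma0 * frac, 0.01)  # neighborhood range, decayed linearly
        x = rng.choice(data)              # present one object to the network
        u, v = best_matching_unit(w, x)   # competition
        for (p, q), w_pq in w.items():
            # Cooperation: Gaussian neighborhood centered on the winner (u, v).
            h = math.exp(-((p - u) ** 2 + (q - v) ** 2) / (2.0 * sigma ** 2))
            # Adaptation: move the weights towards x, scaled by eta and h.
            for i in range(len(x)):
                w_pq[i] += eta * h * (x[i] - w_pq[i])
    return w


# Toy usage: a 1 x 2 map trained on two well-separated groups of points.
toy_data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_map = train_som(toy_data, rows=1, cols=2)
toy_groups = [best_matching_unit(toy_map, x) for x in toy_data]
```

After training, each object is assigned to its closest neuron, so the neuron coordinates act as cluster labels.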
3. Cluster Evaluation and Number of Clusters
Cluster evaluation can be performed using indices that allow the comparison between methods for different numbers of groups. Silhouette Value [24], Calinski-Harabasz [25] and Davies-Bouldin [26] are some of these indices. For Silhouette Values (SL), each cluster is represented by a silhouette, based on a comparison of its tightness and separation (how similar an element is to others in the same cluster, compared to elements of other clusters) [24]. The Silhouette values are calculated by:

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \qquad (5)$$

where $a_i$ is the average intra-cluster distance between station i and all other stations associated with medoid k and $b_i$ is the smallest average distance between station i and all other stations associated with a medoid different from k. The values of $s_i$ are in the interval $[-1, 1]$, and $s_i$ close to 1 indicates better results (intra-cluster distance is much smaller than the inter-cluster distance).
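As an illustration of the silhouette computation, the sketch below evaluates the value of each element from its average distance to its own cluster and the smallest average distance to another cluster. The function name `silhouette_values`, the use of Euclidean distance and the convention of returning 0 for single-element clusters are assumptions of this example; at least two clusters are required.

```python
import math


def silhouette_values(points, labels):
    """Silhouette value of every point for a given partition (illustrative)."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)  # assumption: single-element clusters score 0
            continue
        # a: average distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest average distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in idxs) / len(idxs)
            for other, idxs in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append(0.0 if denom == 0 else (b - a) / denom)
    return values


# Toy usage: a compact partition versus a deliberately poor one.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_values(toy_points, [0, 0, 1, 1])
poor = silhouette_values(toy_points, [0, 1, 0, 1])
```

Averaging the individual values gives a single score per partition, which is how partitions with different numbers of clusters are usually compared.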
The Calinski-Harabasz index (CH), also called the Variance Ratio Criterion, is calculated by:

$$CH_k = \frac{SS_B}{SS_W} \times \frac{N - k}{k - 1} \qquad (6)$$

where:

$$SS_B = \sum_{i=1}^{k} n_i \, \| m_i - m \|^2 \qquad (7)$$

is the overall between-cluster variance. And the overall within-cluster variance is:

$$SS_W = \sum_{i=1}^{k} \sum_{x \in c_i} \| x - m_i \|^2 \qquad (8)$$

with $n_i$ being the number of observations in cluster i, $m_i$ the centroid of cluster i, m the overall mean of the sample data, x a data point, $c_i$ the ith cluster, k the number of clusters, and N the number of observations. The values of $CH_k$ are in the interval $[0, \infty)$, and the highest $CH_k$ indicates the best data partition, with a large between-cluster variance $SS_B$ and a small within-cluster variance $SS_W$ [25].
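A minimal sketch of the Calinski-Harabasz computation is given below, assuming squared Euclidean distances for both the between-cluster and within-cluster terms. The function name `calinski_harabasz` is hypothetical, and the sketch requires at least two clusters and a non-zero within-cluster variance.

```python
def calinski_harabasz(points, labels):
    """Variance ratio criterion for a given partition (illustrative)."""
    n, dim = len(points), len(points[0])
    # Overall mean of the sample data.
    overall = [sum(p[d] for p in points) / n for d in range(dim)]

    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)

    ss_between = 0.0
    ss_within = 0.0
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # Between-cluster term: cluster size times squared centroid offset.
        ss_between += len(members) * sum((c - m) ** 2 for c, m in zip(centroid, overall))
        # Within-cluster term: squared distances of members to their centroid.
        ss_within += sum(sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members)
    return (ss_between / ss_within) * (n - k) / (k - 1)


# Toy usage: between-cluster term 100, within-cluster term 4, N = 4, k = 2.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
toy_ch = calinski_harabasz(toy_points, [0, 0, 1, 1])  # (100 / 4) * (2 / 1) = 50.0
```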
The Davies-Bouldin criterion (DB) is based on a ratio of within-cluster and between-cluster distances, defined by:

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} D_{i,j} \qquad (9)$$

where $D_{i,j}$ is the within-to-between cluster distance ratio for the ith and jth clusters. The values of DB are in the interval $[0, \infty)$, and the lowest DB indicates the optimal clustering solution [26].
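The sketch below illustrates the Davies-Bouldin computation. It assumes a common form of the within-to-between ratio, namely the sum of the average member-to-centroid distances of two clusters divided by the Euclidean distance between their centroids; this specific form and the function name `davies_bouldin` are assumptions of the example. At least two clusters with distinct centroids are required.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin criterion for a given partition (illustrative)."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    centroid, scatter = {}, {}
    for lab, members in clusters.items():
        centroid[lab] = [sum(c) / len(members) for c in zip(*members)]
        # Within-cluster scatter: average distance of members to the centroid.
        scatter[lab] = sum(dist(p, centroid[lab]) for p in members) / len(members)

    total = 0.0
    for i in clusters:
        # Worst (largest) within-to-between ratio of cluster i against the others.
        total += max(
            (scatter[i] + scatter[j]) / dist(centroid[i], centroid[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy usage: scatter 1 in each cluster, centroids 10 apart, so the ratio is 0.2.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
toy_db = davies_bouldin(toy_points, [0, 0, 1, 1])
```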
The choice of the number of clusters is a key point in cluster analysis, and although the methods themselves are objective, the determination of the number of groups involves some subjectivity [2] [3].
For hierarchical clustering methods, the choice of k can be made with a traditional subjective approach, inspecting the plot of the distances between merged clusters as a function of the stage of the analysis. A stage where the difference between successive distances is large (a big jump between points occurs) indicates that the clusters being merged are not close, and the process can be stopped just before these distances become large [2].
Another approach to determine the number of clusters is by utilizing objective methods. The evaluation techniques introduced for assessing cluster methods can also be applied to make the selection of “k” (as presented in Table 1).
Table 1. Objective methods to choose the number of clusters and evaluation.
The Elbow Method can also be used to choose the number of clusters for nonhierarchical methods. It consists of a graphical analysis of the within-cluster sum of squared errors (WSS) for different k values (searching for an elbow), defined by [3]:

$$WSS = \sum_{j=1}^{k} \sum_{g \in C_j} \| t_g - \mu_j \|^2 \qquad (10)$$

where WSS is the sum of the squared errors between the time series in each grid cell g ($t_g$) in cluster j ($C_j$) and the average time series in cluster j ($\mu_j$ is the centroid), summed over all k clusters.
However, the Elbow Method may be problematic, mainly when a large number of elements is clustered into a small number of groups (the curve is smooth and no clear elbow appears), as is the case when a large number of grid points is used to find climate patterns [3]. For this reason, [3] proposed a modified Elbow Method based on the analysis of the differences between $WSS_k$ and $WSS_{k+1}$, which presents a more apparent elbow in these cases.
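Both elbow analyses can be illustrated with the short sketch below. It assumes that a partition is already available for each candidate k (for instance from K-Means), and that the modified version inspects the drop in WSS between k and k + 1 clusters; the function names `wss` and `modified_elbow`, and this exact form of the difference, are assumptions of the example.

```python
def wss(series, labels):
    """Within-cluster sum of squared errors for one partition (illustrative)."""
    clusters = {}
    for s, lab in zip(series, labels):
        clusters.setdefault(lab, []).append(s)
    total = 0.0
    for members in clusters.values():
        # Centroid: average series of the cluster.
        centroid = [sum(c) / len(members) for c in zip(*members)]
        total += sum(sum((a - b) ** 2 for a, b in zip(s, centroid)) for s in members)
    return total


def modified_elbow(wss_by_k):
    """Drop in WSS when going from k to k + 1 clusters, for each available k."""
    return {k: wss_by_k[k] - wss_by_k[k + 1] for k in sorted(wss_by_k) if k + 1 in wss_by_k}


# Toy usage with hand-made partitions of four short series.
toy_series = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
toy_partitions = {1: [0, 0, 0, 0], 2: [0, 0, 1, 1], 3: [0, 0, 1, 2], 4: [0, 1, 2, 3]}
toy_wss = {k: wss(toy_series, labels) for k, labels in toy_partitions.items()}
toy_drops = modified_elbow(toy_wss)
```

In this toy case the WSS values are 104, 4, 2 and 0 for k = 1 to 4, so the drop is large from k = 1 to k = 2 and small afterwards, which points to k = 2 as the elbow.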
4. Cluster Analysis Application over South America
For each variable, we used three distinct sources of monthly reanalysis data: ERA5 [27] and ERA5-Land [28] for both temperature and precipitation, CRU [29] exclusively for temperature, and CPC [30] solely for precipitation. Detailed descriptions of each data source are in Table 2. To facilitate intercomparison, all the data were interpolated onto a common grid with a resolution of 0.5˚ × 0.5˚.
Table 2. Reanalysis data used in the comparison of temperature and precipitation over South America.
The seasonal temperature patterns in South America are well captured by the three reanalyses. ERA5 generally exhibits higher temperatures compared to ERA5-Land across most of South America, except for specific areas like northeastern Argentina, central Brazil, and the far north of the continent. These differences are generally within 2˚C (both positive and negative). Similarly, CRU displays temperature patterns close to those of ERA5 and ERA5-Land. CRU tends to be warmer in the majority of the continent, except for regions like southern Argentina and the Andes. In winter, ERA5 and ERA5-Land appear warmer in the center and northeast of Brazil, with differences usually within 4˚C (both positive and negative).
Concerning precipitation patterns, the three reanalyses provide a reliable representation, reflecting the seasonality that is largely influenced by atmospheric systems in the region and the impact of sea surface temperature anomalies in the Atlantic and Pacific Oceans [7]. ERA5 and ERA5-Land exhibit a comparable precipitation pattern, with only minor localized discrepancies, possibly attributable to spatial resolution and differences in surface representation between the two datasets. On the other hand, when comparing CPC with ERA5 and ERA5-Land, substantial differences arise for some regions (up to 100 mm/month), particularly during the summer and in the northern part of the continent.
Given the superior representation of ERA5-Land (regridded) over the continent, it was chosen for conducting the cluster analysis in South America. A summary of all the steps involved in the cluster analysis and evaluation is depicted in Figure 3. Notably, as there is a high number of grid points (6148), dendrograms for hierarchical methods will not be presented.
5. Results
Within this section, we present the outcomes acquired through the methodologies delineated earlier, and proceed to analyze the diverse models thus identified.
5.1. Temperature Clustering Over South America
Figure 4 depicts the Cluster analysis applied to temperature patterns across the South American region. This analysis incorporates the employment of SL Values, CH Index, and DB Index, encompassing clustering scenarios ranging from k = 1 to 50. The dashed lines within the figure represent the methodological averages of Ward (depicted in blue), K-means (represented in red), and SOM (illustrated in green). It is noteworthy that the most favorable outcomes are observed in the case of K-means clustering, wherein higher SL values and CH index averages are evident, coupled with the attainment of the lowest DB index average.
Figure 3. Summary of all steps for cluster analysis and evaluation.
Figure 4. Clustering assessment for temperature over South America based on (a) Silhouette Value; (b) CH Index and (c) DB Index and number of clusters.
The application of these metrics also lends itself to the determination of the optimal cluster count. In the case of the Ward Method, an optimal selection for the number of temperature clusters emerges at k = 10. At this point, the SL values and CH index ascend to their zenith prior to encountering a decline (occurring at k = 11), concomitantly with a decline in the DB index, which attains its nadir. Meanwhile, in the context of K-means clustering, an appropriate choice for the number of temperature clusters is discerned at k = 8. Here, the SL values and CH index exhibit a peak before undergoing a precipitous descent (commencing at k = 9), while the DB index experiences a descent followed by an ascent (at k = 9). Conversely, in the case of SOM, the metrics do not converge on a unanimous optimal value for k. Specifically, the SL values indicate k = 8, the CH index suggests k = 10 (with both metrics showcasing an ascending pattern before declining at this point), and the DB index demonstrates its lowest value at k = 7. In light of these considerations, a prudent selection could be made at k = 8.
Regarding the non-hierarchical K-means method, supplementary approaches such as the Elbow Method and the Changed Elbow Method can be employed. Nevertheless, discerning the optimal number of clusters remains a formidable task within this framework. Figure 5 shows an inflection point at k = 5 and k = 6. However, given the dimension of the South American region, these cluster counts are relatively small and do not adequately capture the multifaceted nature of temperature variations across the region. Because the mathematical outcome must align with physical significance, it is crucial to underscore that the determination of the cluster count, even when facilitated by metrics, entails a degree of subjectivity.
Figure 5. Metrics for choosing the number of clusters for temperature over South America with the K-Means method: (a) Traditional Elbow Method (WSS) and (b) Changed Elbow Method (differences in WSS between consecutive values of k).
Considering the best choice for each method, spatial maps were produced to verify the spatial distribution of the temperature groups (Figure 6). The methods yield relatively distinct group distributions, as expected. However, certain clusters are similar across all three methods. Noteworthy examples include the clusters encompassing the Amazon basin and a portion of northeastern Brazil. Another instance is the northern coastal region of Argentina, extending inland almost to the Chilean border and southward to nearly 45˚S. Furthermore, the Chilean coastline shows a consistent cluster distribution across all three cases. No previous studies on temperature clustering were identified that could serve as comparative benchmarks for the findings presented in this article.
5.2. Precipitation Clustering Over South America
The evaluation of clustering with regard to precipitation patterns across South America is depicted in Figure 7, using SL Values, CH Index, and DB Index for cluster counts spanning from k = 1 to 50. The dashed lines within the figure correspond to the averages for each method: Ward (blue), K-means (red), and SOM (green). According to the SL values and the CH index, the most promising outcomes are obtained with K-means clustering, which attains the highest average values. Conversely, the DB index indicates that SOM offers the most favorable performance, with the lowest average values.
Figure 6. Spatial results for cluster analysis over South America for temperature using (a) Ward (k = 10); (b) K-means (k = 8); (c) SOM (k = 8) methods, where k is the number of groups.
Figure 7. Clustering assessment for precipitation over South America based on (a) Silhouette Value; (b) CH Index and (c) DB Index and number of clusters.
These metrics also serve a pivotal role in guiding the determination of an optimal cluster count. Within the framework of the Ward Method, an apt selection for the number of precipitation clusters emerges at k = 11, where both SL values and the CH index culminate in their peak values before undergoing a decline (manifesting at k = 12), concomitant with the DB index exhibiting a decline and attaining its nadir. In the context of K-means clustering, a judicious choice for the number of precipitation clusters is discerned at k = 12 for SL values and k = 11 for the CH index. In both cases, these values are characterized by an ascent to higher values prior to a subsequent decline. Furthermore, the DB index indicates an optimal selection at k = 12, corresponding to its lowest value. When considering SOM, congruence between SL values and DB index is observed at k = 12, while the CH index advocates for k = 14 as an optimal choice. Taking these factors into account, a judicious selection could indeed be made at k = 12.
In the context of the non-hierarchical K-means method, additional approaches such as the Elbow Method and the Changed Elbow method can be employed. However, in this instance, the task of determining the optimal cluster count is notably intricate due to the absence of a distinctly discernible inflection point (Figure 8).
Figure 8. Metrics for choosing the number of clusters for precipitation over South America with the K-Means method: (a) Traditional Elbow Method (WSS) and (b) Changed Elbow Method (differences in WSS between consecutive values of k).
Once the most suitable option for each methodology was determined, spatial maps were created to examine the geographical distribution of precipitation within the identified groups. Within the SOM framework, a pronounced large-scale cluster emerges (Figure 9(c)), accompanied by several smaller clusters located in the northern part of the continent. The Ward method (Figure 9(a)) and the K-means approach (Figure 9(b)) differ mainly in the number of clusters within the southwestern, northern, and northeastern sectors of the continent. Having obtained the best scores with the K-means approach, the spatial distribution of precipitation patterns across South America was examined in detail. Although the indices indicate good performance for both K-means and SOM, the spatial distribution produced by the latter does not align with physical reality. For precipitation patterns, the results therefore favor the K-means methodology. Comparing the findings of K-means (k = 12) with the study conducted by [8], which also employed the K-means technique but with a distinct dataset (k = 8), similarities in the distribution patterns across certain regions become apparent despite the difference in the number of clusters, which affects how the groups are segmented. The salient distinctions stem from the omission of data pertaining to the western geographic sector of the continent.
Figure 9. Spatial results for cluster analysis over South America for precipitation using (a) Ward (k = 11); (b) K-means (k = 12); (c) SOM (k = 12) methods, where k is the number of groups.
6. Conclusions
The analysis of clustering holds a pivotal role in climatic studies, facilitating the discernment of intricate structures within climatological datasets. A notable observation pertains to the scarcity of comparative endeavors across distinct methodologies within the context of South America’s climatic investigation. Addressing this gap, this study has introduced three distinct clustering methodologies—Ward, K-means, and Self-Organizing Maps (SOM)—while elucidating the processes encompassing method assessment, selection, and cluster count determination.
Silhouette Value, CH Index, and DB Index have emerged as indispensable tools for cluster validation and the judicious determination of cluster counts. It is imperative to acknowledge that conventional approaches, such as the Elbow Method and the Changed Elbow Method employed within non-hierarchical frameworks for cluster count selection, yield challenges that compound the decision-making process.
In relation to temperature patterns, K-means has showcased superior performance, leading to the formation of 8 distinct clusters across South America. Likewise, for precipitation, the most favorable outcomes have been achieved through K-means clustering, resulting in the identification of 12 distinct clusters across the same region. Consequently, K-means emerges as a robust method for the climatic regionalization of both temperature and precipitation patterns across South America.
In conclusion, this study not only serves as a valuable reference for the exploration of climatic clustering methodologies but also lays the foundation for future investigations focused on the continent’s climatic intricacies.
Acknowledgements
Luana Albertani Pampuch thanks the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) for the Research Financial Support (Universal Project grant 426530/2018-7). Cassiano Bortolozo also thanks the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) for the Postdoctoral Scholarship (grant 152269/2022-3) and for the Research Financial Support (Universal Project grant 433481/2018-8). Rogério Galante Negri thanks the São Paulo Research Foundation (FAPESP, grant 2021/01305-6) and CNPq (grant 305220/2022-5).