Domain Delineation Using Geological Data, Variogram Analysis, and Clustering Algorithms
1. Introduction
In earth sciences, domaining plays a crucial role in enhancing the accuracy and reducing the uncertainty of subsurface modeling. The primary rationale behind domaining is to partition a site into domains where the variables of interest exhibit homogeneous variations and can be represented by stationary spatial random fields. Understanding the nature of the boundaries between neighboring domains is a vital aspect of this process, as it influences whether data from one domain can be effectively used for modeling another. Boundaries can be categorized as hard (distinct and non-interdependent), gradational (allowing domains to merge seamlessly), or fuzzy (gradual merging up to a certain distance before becoming distinct) (Hossein Morshedy et al., 2015; Kubler et al., 2016).
Clustering, as an unsupervised pattern classification method, is extensively used to identify groups within datasets (Kaufman & Rousseeuw, 2009). The main objective of clustering is to maximize the similarity within each cluster while ensuring significant differences between clusters (Jain, 2010). Clustering methods can broadly be categorized into hard and fuzzy clustering. Hard clustering methods assign each data point to a single cluster, whereas fuzzy clustering methods allow data points to belong to multiple clusters with varying degrees of membership (Emmert-Streib et al., 2020). Both hard and fuzzy clustering methods utilize cluster validity indices to determine the optimal number of clusters and evaluate clustering quality (Oyelade et al., 2016; Khorram et al., 2021).
There is a wide variety of clustering algorithms, which can be classified into probability model-based and nonparametric algorithms. Probability model-based algorithms assume that the data follow a mixture probability model, while nonparametric algorithms rely on an objective function based on similarity or dissimilarity measures. Nonparametric algorithms can be further divided into hierarchical and partitional methods, with partitional algorithms being the most commonly used. These methods assume that the dataset can be represented by a finite number of cluster prototypes, each associated with its own objective function. To implement these algorithms, it is necessary to define the dissimilarity (or distance) between data points and cluster prototypes.
In general, clustering does not impose spatial contiguity or compactness in the determination of clusters. To explicitly incorporate the spatial component, Oliver and Webster (1989) integrated a spatial variogram model into the dissimilarity measure, using an isotropic exponential structure with parameters such as nugget effect, sill, and range. Bourgault et al. (1992) extended this approach by employing a multivariate (co)variogram to account for both spatial and attribute correlations in clustering. Fouedjio (2016) introduced the spatial component into the dissimilarity measure within the agglomerative hierarchical clustering algorithm, where the dissimilarity/similarity between two observations is not simply a Euclidean measure but a function of their spatial correlation. However, the effect of negative correlations in cross-variograms on this dissimilarity measure, particularly when spatial correlation is low, remains unclear. Romary et al. (2015) further incorporated the spatial component into the distance metric of a hierarchical clustering method. Their distance function accounts for spatial connectivity introduced by a moving neighborhood, allowing the user to define attribute weights that are integrated into the distance measure. Additionally, the coordinates are treated as attributes in this framework.
In this study, we analyze various spatial geoscientific data from a copper deposit in Iran, including lithology, alteration, geochemical zoning, and values of key geochemical ore elements. We apply different clustering methods to determine the optimal domain structure for the region. Hard and fuzzy cluster validity indices are calculated to identify the ideal number of domains. To model the spatial structure of the data, directional variograms are computed for each domain. Finally, the anisotropic search ellipsoids of all domains are compared in order to select the clustering method and dataset that produce the highest average anisotropic dissimilarity between domains.
2. Study Area: East Kahang Copper Deposit
The Kahang Cu-Mo porphyry deposit is located 73 km northeast of Isfahan and 10 km east of the town of Zefreh in central Iran. This porphyry system is part of the Urumieh-Dokhtar magmatic arc (UDMA), which trends northwest-southeast and hosts several of Iran’s major copper-bearing deposits, including Sarcheshmeh, Sungun, Meiduk, Darehzar, Sarkuh, and Daralu (Haschke et al., 2010; Sun et al., 2015). The Kahang area is divided into three main sectors: eastern, central, and western. Detailed exploration drilling was conducted in the eastern anomaly to further evaluate its potential.
2.1. Regional Geology
In the eastern part of the Kahang deposit, the primary rock types include andesite, andesite porphyry, dacite porphyry, quartz monzonite, diorite, andesitic breccia, and quartz veins filled with iron oxides. The mineralization comprises chalcocite, covellite, chalcopyrite, pyrite, malachite, chalcanthite, magnetite, hematite, jarosite, and goethite. The oldest igneous units are andesitic lavas and Eocene volcanic breccias, which exhibit propylitic alteration and are considered to have high potential for Cu-Mo mineralization. The alteration types in the region include silicification, phyllic, argillic, propylitic, and potassic. The dacitic unit, characterized by a porphyritic texture, is associated with quartz-sericite, argillic, and quartz-goethite alterations. A semi-deep section of the andesite porphyry is intruded by quartz monzonite and diorite stocks, which display phyllic and propylitic alterations. Geochemical zones within the deposit are classified into oxide, leach, supergene, and hypogene, extending from the surface to greater depths. This porphyry system is also influenced by two fault systems with NE-SW and NW-SE orientations (NICICO, 2011; Afzal et al., 2012; Afshooni et al., 2013). Figure 1 shows the geological map along with the location of the eastern part of the Kahang deposit within the structural map of Iran.
2.2. Data sets
In the first phase of exploration, 2-meter core samples from 10 boreholes were logged to obtain information on lithology, alteration, and mineral zones (Table 1). These samples were then assayed for Cu, Mo, and Ag grades. The boreholes varied in depth from 100 m to a maximum of 450 m, with dips ranging from approximately 60 degrees to vertical. In the subsequent detailed exploration phase, the number of boreholes increased to 36, including both vertical and directional holes, with depths ranging from 100 m to 700 m. The core samples were assayed for Cu, Mo, Ag, and Au.
Figure 1. Geological map of the eastern part of the Kahang deposit (b), with its location (blue square) in the structural map of Iran (a). Adapted from Alavi.
Table 1. Summary of logged classes for lithology, alteration, and mineral zone in initial exploration campaign.
| Lithology | Alteration | Mineral zone |
|---|---|---|
| Alluvium | Unaltered | Oxide |
| Tuff | Chloritic | Leach |
| Dacite | Quartz-sericitic | Transitional |
| Quartz diorite | Potassic | Supergene |
| Diorite | Phyllic | Hypogene |
| Andesite | Propylitic | |
| | Argillic | |
| | Silicification | |
Figure 2. Three-dimensional distribution of ore grades in the 36 boreholes: (a) Cu; (b) Mo.
For illustration, the distributions of copper and molybdenum grades are shown in Figure 2. The selected geochemical dataset includes less than 5% of values below the detection limit. Preprocessing involved outlier detection and removal, followed by standardization of the data to prepare it for clustering analysis.
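As a rough illustration of this preprocessing step, the sketch below removes outliers with a simple z-score rule and standardizes the assay variables with scikit-learn. The DataFrame layout, the column names, and the z-score threshold are assumptions for illustration; the paper does not specify the exact outlier criterion used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame, assay_cols, z_cut=3.0) -> pd.DataFrame:
    """Drop rows flagged as outliers by a z-score rule, then standardize.

    `assay_cols` (e.g. ["Cu", "Mo", "Ag", "Au"]) and `z_cut` are hypothetical
    choices, not the study's exact settings.
    """
    values = df[assay_cols].to_numpy(dtype=float)
    z = np.abs((values - values.mean(axis=0)) / values.std(axis=0))
    keep = (z < z_cut).all(axis=1)              # keep rows within the threshold
    cleaned = df.loc[keep].copy()
    cleaned[assay_cols] = StandardScaler().fit_transform(cleaned[assay_cols])
    return cleaned
```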
3. Methodology
3.1. Clustering Methods
Clustering algorithms are essential tools for grouping data points based on similarity, and they play a crucial role in geological applications, particularly in the domain of rock type classification and geological modeling. In the context of rock types, clustering algorithms can help group rocks with similar characteristics, such as mineral composition, texture, and physical properties, which are essential for tasks like resource estimation, ore body modeling, and the optimization of mining operations (Zhu & Li, 2017).
K-means is one of the most well-known and widely used clustering algorithms. It partitions the data into a predefined number of clusters, K, by iteratively assigning each data point to the nearest centroid and recalculating the centroids based on the points assigned to each cluster. This process continues until the centroids no longer change. While K-means is fast and efficient, it assumes that clusters are spherical and equally sized, which can be a limitation for datasets with more complex structures. Moreover, the number of clusters K must be specified before running the algorithm, which can be a challenge when the optimal number of clusters is unknown (Xu & Wunsch, 2005).
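A minimal K-means sketch using scikit-learn is shown below. The input matrix X, the number of clusters, and the random placeholder data are illustrative assumptions, not the configuration used in the study.

```python
import numpy as np
from sklearn.cluster import KMeans

# X: (n_samples, n_features) standardized attributes, possibly with scaled
# coordinates appended (see the input combinations of Table 4).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                  # placeholder data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)                 # hard domain label for each sample
centroids = kmeans.cluster_centers_            # prototype of each cluster
```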
In contrast, hierarchical clustering creates a tree-like structure called a dendrogram, where data points are merged or split at different levels based on their similarity. This method does not require the number of clusters to be predefined and allows for a more flexible analysis of the data. There are two main types: agglomerative (bottom-up) and divisive (top-down). In agglomerative clustering, each data point starts as its own cluster, and the algorithm merges the closest clusters iteratively. In divisive clustering, all points start in a single cluster, and the algorithm divides it recursively. This flexibility makes hierarchical clustering ideal for understanding the nested structure of data, but it can be computationally expensive for large datasets (Gan et al., 2020).
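A brief agglomerative (bottom-up) sketch with SciPy is given below; the Ward linkage, the cut into three clusters, and the placeholder data are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # placeholder standardized data

Z = linkage(X, method="ward")                  # build the merge tree (dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 domains
# scipy.cluster.hierarchy.dendrogram(Z) can be plotted to inspect the nesting.
```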
Affinity propagation takes a different approach by identifying exemplars, or representative points, that best define each cluster, without the need to predefine the number of clusters. The algorithm uses a message-passing mechanism, where data points send messages to one another to indicate how well they serve as exemplars. These messages are updated iteratively until the algorithm converges on a set of exemplars, which define the clusters. Affinity propagation is advantageous in cases where the number of clusters is uncertain and where clusters may have varying sizes (Frey & Dueck, 2007).
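The scikit-learn implementation can be used as sketched below; the damping value and the placeholder data are assumptions, and the resulting number of clusters emerges from the message passing rather than being fixed in advance.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # placeholder standardized data

ap = AffinityPropagation(damping=0.9, random_state=0)
labels = ap.fit_predict(X)                     # cluster label per sample
exemplars = ap.cluster_centers_indices_        # indices of the representative samples
```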
The self-organizing map (SOM), an unsupervised learning method, is particularly useful for visualizing high-dimensional data by mapping it onto a 2D grid. SOMs are inspired by neural networks and work by adjusting the positions of neurons in response to data patterns, creating a map where similar data points are grouped together. This method is especially valuable for exploring and visualizing complex datasets and understanding their underlying structure. SOM is often used in cases where data visualization is as important as clustering (Miljković, 2017).
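A minimal SOM sketch in plain NumPy is shown below; the grid size, learning-rate and neighborhood schedules, and iteration count are placeholder choices and not the authors' configuration. Samples sharing a best-matching unit (or groups of neighboring units) can then be treated as clusters.

```python
import numpy as np

def train_som(X, grid=(6, 6), n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal self-organizing map: returns the trained weight grid and, for
    each sample, the (row, col) of its best-matching unit (BMU)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    w = rng.normal(size=(rows, cols, X.shape[1]))            # neuron weight vectors
    grid_idx = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                    indexing="ij"), axis=-1)  # neuron grid positions
    for t in range(n_iter):
        lr = lr0 * np.exp(-t / n_iter)                       # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)                 # shrinking neighborhood
        x = X[rng.integers(len(X))]                          # random training sample
        dist = np.linalg.norm(w - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dist), dist.shape)  # best-matching unit
        d2 = ((grid_idx - np.array(bmu)) ** 2).sum(axis=-1)  # grid distance to BMU
        h = np.exp(-d2 / (2 * sigma ** 2))[..., None]        # Gaussian neighborhood
        w += lr * h * (x - w)                                # pull neighbors toward x
    bmus = np.array([np.unravel_index(np.argmin(np.linalg.norm(w - x, axis=-1)),
                                      (rows, cols)) for x in X])
    return w, bmus
```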
Fuzzy C-means offers a flexible alternative to K-means by allowing data points to belong to multiple clusters with varying degrees of membership. Instead of making a hard assignment of each point to a single cluster, fuzzy C-means computes the degree to which each data point belongs to each cluster, with membership values between 0 and 1. This approach is helpful when the boundaries between clusters are ambiguous, and data points may share characteristics with more than one cluster. Like K-means, fuzzy C-means iterates between assigning membership values and updating the cluster centers (Ikotun et al., 2023).
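A compact fuzzy C-means sketch in NumPy follows; the fuzziness exponent, iteration limits, and initialization are illustrative assumptions. Each row of the returned membership matrix sums to 1, so a hard assignment can still be obtained by taking the cluster with the highest membership.

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=1.2, n_iter=200, tol=1e-6, seed=0):
    """Minimal FCM: returns cluster centers and the (n_samples, c)
    membership matrix U, with rows summing to 1."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                        # random initial memberships
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]       # membership-weighted centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Standard FCM membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        p = 2.0 / (m - 1.0)
        U_new = 1.0 / (d ** p * (1.0 / d ** p).sum(axis=1, keepdims=True))
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U
```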
Finally, Gustafson-Kessel is an extension of fuzzy C-means that overcomes one of its limitations: the assumption that clusters are spherical. Gustafson-Kessel adapts the covariance matrix of each cluster, allowing for the detection of elliptical clusters, which makes it more versatile when dealing with data that does not fit the spherical cluster model. This adaptation makes Gustafson-Kessel particularly effective for complex datasets with clusters of varying shapes and sizes (Ruspini et al., 2019; Gustafson & Kessel, 1978).
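The sketch below extends the FCM loop with a per-cluster fuzzy covariance and a Mahalanobis-type distance, which is the essence of the Gustafson-Kessel idea; the regularization term and parameter values are assumptions for numerical stability, not the authors' settings.

```python
import numpy as np

def gustafson_kessel(X, c=3, m=1.2, n_iter=100, seed=0):
    """Minimal Gustafson-Kessel sketch: FCM with an adaptive covariance-based
    distance per cluster, so elliptical clusters can be recovered."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d2 = np.empty((n, c))
        for i in range(c):
            diff = X - centers[i]
            # Fuzzy covariance of cluster i
            F = (Um[:, i, None, None] * (diff[:, :, None] * diff[:, None, :])).sum(0)
            F /= Um[:, i].sum()
            F += 1e-9 * np.eye(dim)                          # regularize for stability
            A = (np.linalg.det(F) ** (1.0 / dim)) * np.linalg.inv(F)
            d2[:, i] = np.einsum("nd,de,ne->n", diff, A, diff)  # squared GK distance
        d2 = np.maximum(d2, 1e-12)
        U = 1.0 / (d2 ** (1 / (m - 1)) * (1.0 / d2 ** (1 / (m - 1))).sum(axis=1, keepdims=True))
    return centers, U
```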
The difference between hard and fuzzy clustering lies in how data samples are assigned to clusters (Everitt et al., 2011; Xu & Tian, 2015). In hard clustering, a sharp boundary describes two domains that are in contact with each other but are otherwise unrelated, resulting in a perfect separation of the domains. With a fuzzy boundary, the domains can overlap up to a finite distance, whereas a gradational boundary can be seen as the extreme case in which the overlap distance becomes infinitely large.
4. Discussion and Results
4.1. Comprehensive Method
In this study, four hard clustering algorithms and two fuzzy clustering algorithms are applied to define domains and inter-domain relationships in the eastern Kahang deposit (Table 2). These algorithms identify clusters based on optimality measures, such as minimizing intra-cluster distance or maximizing inter-cluster distance (Wang et al., 2022; Oyelade et al., 2016). To assess the clustering performance, cluster validity indices (Table 3) are employed to determine the optimal number of clusters and evaluate the quality of the results. For each clustering algorithm, the validity indices are calculated based on four distinct combinations of input data (Table 4), representing the four cases under investigation. These combinations include geological layers (lithology, alteration, and mineral zone), geochemical assay data (Cu, Mo, Ag, and Au grades) from borehole core samples collected during both exploration phases, and the 3D coordinates of these samples.
Table 2. Clustering algorithms under consideration.
Table 3. Cluster validity indices under consideration.
Table 4. Four combinations of input geological and geochemical layers used as input to the clustering algorithms. The indices 10 and 36 refer to the assays of the initial and advanced exploration campaigns, respectively.

| Case | Coordinates | Lithology | Alteration | Mineral zone | Cu10 | Mo10 | Ag10 | Cu36 | Mo36 | Ag36 | Au36 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | YES | YES | YES | YES | | | | | | | |
| 2 | YES | | | | YES | YES | YES | | | | |
| 3 | YES | YES | YES | YES | YES | YES | YES | | | | |
| 4 | YES | | | | | | | YES | YES | YES | YES |
The implementation involves preparing the input data, applying the clustering algorithms, and calculating the cluster validity indices for partitions with varying numbers of clusters within the integer range [2, 10]. This range corresponds to the minimum and maximum number of domains considered feasible for the eastern Kahang deposit (Figure 3). The resulting indices, computed for different combinations of input layers, are presented in Figure 4.
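The loop over candidate numbers of clusters can be sketched as follows. The study evaluates several hard validity indices (C, CH, DB, D, H, KL, S); only three indices readily available in scikit-learn are shown here, with K-means and placeholder data standing in for the actual algorithms and input-layer combinations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                    # placeholder input-layer combination

scores = {}
for k in range(2, 11):                           # candidate numbers of domains, 2..10
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = {
        "silhouette": silhouette_score(X, labels),          # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
    }
```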
Figure 3. Steps for domaining from input layers to output of clustering consisting of a partition of the data into 2 to 10 domains.
Figure 4. Validity indices computed for hard clustering algorithms: (a) C; (b) CH; (c) DB; (d) D; (e) H; (f) KL; (g) S.
In addition to the number of clusters, fuzzy partitioning algorithms are sensitive to other parameters. For example, in FCM clustering, the validity criteria can be modeled as a function of both the number of clusters and a weighting exponent (fuzziness parameter) within the range of 1 to 2, as shown in Figure 5. Similarly, in the modified Gustafson-Kessel algorithm, a tuning parameter (ranging from 0 to 1) must also be considered, as depicted in Figure 6.
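The joint sensitivity to the number of clusters and the fuzziness exponent can be explored with a simple grid search, as sketched below. The sketch reuses the `fuzzy_c_means` function from the methodology section, scores each partition with the partition coefficient (one of several possible fuzzy validity indices), and uses placeholder data; the grid bounds follow the 1 to 2 range mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                    # placeholder data

results = {}
for c in range(2, 11):                           # candidate numbers of domains
    for m in np.arange(1.1, 2.01, 0.1):          # candidate fuzziness exponents
        _, U = fuzzy_c_means(X, c=c, m=float(m)) # sketch defined in Section 3.1
        results[(c, round(float(m), 1))] = (U ** 2).sum() / len(X)  # partition coefficient
best_params = max(results, key=results.get)      # (c, m) with the highest coefficient
```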
Figure 5. Validity indices computed for FCM clustering: (a) ADI; (b) CE; (c) PC; (d) S; (e) P; (f) XB.
Figure 6. Validity indices computed for GK clustering: (a) ADI; (b) CE; (c) PC; (d) S; (e) P; (f) XB.
Due to the direct or inverse relationship between each validity index and the number of clusters, a sharp decrease or increase in an index serves as a criterion for identifying optimal clustering parameters, such as the ideal number of clusters. Additionally, partitions with fewer clusters are preferred when the differences between validity indices are minimal (Hadiloo et al., 2018). Considering both the validity indices and expert geological judgment, the optimal number of domains ranges from 3 to 4, depending on the hard clustering algorithm and the combination of input data. In contrast, for the fuzzy clustering algorithms, the fuzziness and weighting parameters are selected within the ranges of 1.1 - 1.3 and 0.1 - 0.3, respectively, with the optimal number of domains consistently defined as 3.
4.2. Post Processing of Clustering Outputs
The objective of domaining is to partition the region into domains that exhibit the maximum statistical and spatial differences. A key tool for analyzing the spatial structure of a regionalized variable is the variogram, which quantifies the average squared difference of paired data points as a function of the distance between them. For each clustering algorithm and resulting domain, variograms of the feature data are computed in various directions to capture anisotropy, which can be represented as an ellipsoid characterized by dimensional (range) and directional (orientation) parameters. The goal is to identify a domaining scheme with a high degree of dissimilarity between the anisotropy ellipsoids of each domain, as this indicates better spatial differentiation and discrimination of the data’s spatial properties.
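For concreteness, a minimal sketch of an experimental directional variogram is given below. It bins pairs of samples by separation distance within an angular tolerance around a chosen azimuth; the tolerances, lag definition, and placeholder data are assumptions, and the coordinates are assumed to be ordered as (east, north, elevation). The study computes such variograms in 24 directions per domain and feature.

```python
import numpy as np

def directional_variogram(coords, values, azimuth_deg, lags, lag_tol, angle_tol_deg=15.0):
    """Experimental semivariogram along one horizontal direction: mean of
    0.5*(z_i - z_j)^2 over pairs within an angular cone and each lag bin."""
    n = len(values)
    i, j = np.triu_indices(n, k=1)
    sep = coords[j] - coords[i]                               # pair separation vectors
    h = np.linalg.norm(sep, axis=1)
    az = np.radians(azimuth_deg)
    direction = np.array([np.sin(az), np.cos(az), 0.0])       # azimuth from north, clockwise
    cos_ang = np.abs(sep @ direction) / np.maximum(h, 1e-12)
    in_cone = cos_ang >= np.cos(np.radians(angle_tol_deg))    # pairs near the direction
    sq_diff = 0.5 * (values[i] - values[j]) ** 2               # semivariance of each pair
    gamma = np.full(len(lags), np.nan)
    for k, lag in enumerate(lags):
        mask = in_cone & (np.abs(h - lag) <= lag_tol)
        if mask.any():
            gamma[k] = sq_diff[mask].mean()
    return gamma

# Placeholder usage for one domain and one feature (e.g. Cu grade):
rng = np.random.default_rng(0)
coords = rng.uniform(0, 500, size=(300, 3))
values = rng.normal(size=300)
gamma_045 = directional_variogram(coords, values, azimuth_deg=45.0,
                                  lags=np.arange(20, 201, 20), lag_tol=10.0)
```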
The outputs of the SOM and GK algorithms under cases 3 and 4, using the selected input data, are chosen to assess the domaining performance. For each data feature and domain, experimental variograms were computed for 24 directions and fitted with theoretical models. Figure 7 illustrates the experimental variograms and their corresponding fitted models, highlighting the primary anisotropy direction (with the largest correlation range) for each domain obtained from the SOM and GK clustering algorithms. The correlation ranges of the directional variograms were used to calculate the anisotropy ellipsoid parameters, such as anisotropy factors (ratios of the major axis range to the semi-major axis range and the major axis range to the minor axis range) and rotation angles (azimuth, dip, and plunge). The anisotropy ellipsoid effectively summarizes the spatial data structure within each domain, as shown in Figure 8.
The Jensen-Shannon divergence (JS) is used to measure the dissimilarity between probability distributions; it is built from Jensen's inequality and the Shannon entropy (Somani et al., 2017; Jiang et al., 2021). For each pair of domains, the Jensen-Shannon divergence is computed from the covariances associated with the anisotropy ellipsoids (Table 5). The results show that, based on the average Jensen-Shannon divergence, the dataset of Case 3, which combines qualitative (logging) and quantitative (assay) data, together with the SOM clustering algorithm, is the most effective at distinguishing between domains.
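One possible way to evaluate this divergence is sketched below, under the assumption that each anisotropy ellipsoid is represented by the covariance matrix of a zero-mean Gaussian; since the JS divergence between Gaussians has no closed form, it is approximated by Monte Carlo sampling. The covariance construction and sample sizes are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def js_divergence_gaussians(cov_a, cov_b, n_samples=100_000, seed=0):
    """Monte Carlo estimate (in nats) of the Jensen-Shannon divergence between
    two zero-mean Gaussians whose covariances encode the anisotropy ellipsoids."""
    rng = np.random.default_rng(seed)
    dim = cov_a.shape[0]
    p = multivariate_normal(mean=np.zeros(dim), cov=cov_a)
    q = multivariate_normal(mean=np.zeros(dim), cov=cov_b)
    xa = p.rvs(n_samples, random_state=rng)
    xb = q.rvs(n_samples, random_state=rng)

    def kl_to_mixture(x, dist):
        # KL(dist || M) with M = (P + Q)/2, estimated from samples of dist
        log_m = np.logaddexp(p.logpdf(x), q.logpdf(x)) - np.log(2.0)
        return np.mean(dist.logpdf(x) - log_m)

    return 0.5 * kl_to_mixture(xa, p) + 0.5 * kl_to_mixture(xb, q)

# Placeholder covariances built from squared principal ranges (no rotation applied):
cov_1 = np.diag([180.0, 120.0, 60.0]) ** 2
cov_2 = np.diag([150.0, 90.0, 40.0]) ** 2
js = js_divergence_gaussians(cov_1, cov_2)
```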
In practice, once the borehole core samples have been clustered, translating these clusters into a comprehensive 3D model may involve various interpolation techniques, such as manual drawing, implicit modeling, kriging, or geostatistical simulation. These methods facilitate a nuanced and accurate representation of the clustered data in three-dimensional space, enhancing the depth of understanding in practical applications.
Table 5. The Jensen-Shannon divergences for each pair of domains obtained with SOM and GK as clustering algorithms.

| Data set | Domain pair | SOM | GK |
|---|---|---|---|
| Case 3 | 1-2 | 0.5598 | 0.4747 |
| Case 3 | 1-3 | 0.6783 | 0.0140 |
| Case 3 | 2-3 | 0.2107 | 0.4241 |
| Case 3 | Average | 0.4829 | 0.3043 |
| Case 4 | 1-2 | 0.3231 | 0.0056 |
| Case 4 | 1-3 | 0.1633 | 0.0257 |
| Case 4 | 2-3 | 0.1173 | 0.0162 |
| Case 4 | Average | 0.2012 | 0.0158 |
Incorporating spatial coordinates into the clustering algorithms aims to generate clusters that exhibit greater spatial cohesion. Several alternative approaches to spatial clustering have been proposed, diverging from the traditional method of treating coordinates as feature variables. Instead, these approaches integrate coordinates as constraints or parameters within the calculation of similarity between data points (Romary et al., 2015; Fouedjio, 2016; Faraj & Ortiz, 2021). It is important to emphasize that the focus of this study lies in the analysis of the obtained clusters through variogram analysis. Our approach parallels the work of Faraj and Ortiz (2021), where clusters are delineated based on localized changes in the spatial correlation of the data.
Finally, incorporating geoscientific data (such as geophysical or geological surveys) offers an opportunity to complement the logging and assay information collected from the boreholes. There are two potential ways to leverage this supplementary data: (a) integrating it directly into the clustering methodologies as additional input, treating it similarly to other variables.
Figure 7. Experimental variograms (dots) with fitted models (solid lines) along the main anisotropy direction for different domains. The feature variable here is the copper grade.
Figure 8. Representations of the anisotropy ellipsoids for each domain. The feature variable here is the copper grade.
However, this integration poses challenges, as geoscientific data may not share the same coordinate system as the borehole data, complicating the analysis setup. (b) Using this additional information to interpret the clustering outcomes obtained from borehole data. This approach allows for a correlation or alignment between clusters identified in the borehole data and interpretations derived from geophysical or geological surveys, aiding in the validation and enrichment of the clustered information by leveraging complementary datasets.
5. Conclusion
This paper presents a novel approach for delineating geological domains using data-driven methods, specifically clustering algorithms combined with variogram anisotropy modeling. Unlike conventional knowledge-based techniques, this approach leverages the available data layers to address the inherent limitations of those techniques. The study focused on the eastern part of the Kahang copper deposit in central Iran, incorporating geological (lithology, alteration, and mineral zones), geochemical (Cu, Mo, Ag, and Au grades), and spatial data from boreholes across two exploration phases. The optimal number of domains was determined by evaluating several hard and fuzzy cluster validity indices. In the fuzzy clustering analysis, appropriate fuzziness and weighting parameters were identified within specified ranges. The results were validated through comparison with geological observations and expert assessments, with the SOM and GK algorithms yielding the best outcomes when using the data from cases 3 and 4. Anisotropy modeling for spatial domaining involved assigning data to specific domains and calculating directional variograms to derive the dimensional and directional parameters of the anisotropy ellipsoid for each domain. The SOM algorithm, applied as a hard clustering method, clearly discriminated the anisotropy parameters among the domains, as evidenced by a high average value of the Jensen-Shannon (JS) divergence (~0.5). Overall, the data-driven approach combining clustering algorithms with variogram anisotropy modeling proved effective for geological domaining, offering a more reliable and comprehensive understanding of the geological structure in the study area. The results of this study can be helpful in guiding exploration activities, targeted drilling programs, anomaly detection, enhanced resource estimation, grade control, optimized mine design, economic evaluation, and geotechnical stability assessment.
Acknowledgments
The authors acknowledge the support of the National Agency for Research and Development of Chile, through postgraduate study scholarship ANID-Subdirección de Capital Humano/Doctorado Nacional/2022-21220317 (F. Khorram) and grant ANID AFB230001 (F. Khorram and X. Emery).