Different Feature Selection of Soil Attributes Influenced Clustering Performance on Soil Datasets

Feature selection is essential for obtaining meaningful and interpretable results from a clustering analysis. In soil data clustering, however, the response of clustering performance to different feature subsets is poorly understood. In this paper, we compared the performance of the k-means, fuzzy c-means, and spectral clustering algorithms under different feature subsets of soil datasets. The experimental results showed that spectral clustering generally performed better than k-means and fuzzy c-means across the feature subsets. Feature subsets containing environmental attributes improved clustering performance more than those containing spatial attributes and produced more accurate and meaningful clustering results. These results suggest that combining spectral clustering with feature subsets containing environmental attributes, rather than spatial attributes, may be a better choice in applications of soil data clustering.


Introduction
Clustering generally divides a dataset (in which each data object has certain attributes) into k sub-clusters such that similar objects are within the same sub-cluster and dissimilar objects are in different sub-clusters [1]. Generally, ideal features should be useful in distinguishing patterns belonging to different clusters, immune to noise, and easy to extract and interpret, and they should decrease the workload and simplify the subsequent design process [2]. The selection of appropriate features ensures meaningful and interpretable results. Although a number of methods for selecting appropriate feature subsets have been developed and reviewed [3], to our knowledge the influence of different feature subsets on clustering performance in real applications of soil data clustering is still not well understood.
On the other hand, no single clustering method is a panacea that can be applied in all clustering conditions. Thus, different clustering methods have been developed to solve specific clustering problems [1] [4]. In the fields of agriculture and soil science, clustering analysis has been applied to recognize soil patterns [5] [6] [7], manage soil nutrients [8] [9], design good soil sampling strategies [10] [11], and identify soil microbial communities [12] [13] [14]. However, few studies have compared the performance of different clustering methods on soil data.
Motivated by these gaps, we chose three classical clustering algorithms, namely k-means, fuzzy c-means, and spectral clustering, which are widely used and representative of the current state of the art. To the best of our knowledge, this is the first evaluation of the influence of different feature sets on the performance of these three clustering methods on soil data. Our research provides a reference for selecting a suitable combination of clustering algorithm and feature subset in applications of soil data clustering.

Clustering Methods
K-means clustering is a very simple and widely applied clustering method. Given a set of n observations, each of which is a d-dimensional vector, k-means clustering partitions the n observations into k (≤ n) subsets. To achieve the optimal clustering result, k-means clustering minimizes the within-cluster sum of squares (WCSS) [1] [4]. The clustering process has two steps: 1) randomly select k observations as the initial means (centers) of the sub-clusters, assign each remaining data object to the nearest sub-cluster based on its distance to each of the cluster centers, and then recalculate the centers of the sub-clusters; 2) repeat the assignment and recalculation in (1) until the WCSS no longer decreases.
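The two steps above can be sketched in Python with NumPy as follows. This is a minimal illustration, not the implementation used in the study: the function name `kmeans`, the squared Euclidean distance, the iteration cap `n_iter`, and the fixed random `seed` are all assumptions made for the example.

```python
import numpy as np


def _nearest(X, centers):
    """Index of the nearest center (squared Euclidean distance) for each row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def kmeans(X, k, n_iter=100, seed=0):
    """Illustrative k-means; returns (labels, centers, wcss)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its nearest center.
        labels = _nearest(X, centers)
        # Recompute each center as the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 2: stop once the centers (and hence the WCSS) stop changing.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = _nearest(X, centers)
    wcss = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, wcss
```

Because the initial centers are random, the procedure may stop at a local rather than global minimum of the WCSS; running it from several seeds and keeping the lowest WCSS is a common precaution.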
The allocation of observations to clusters can be difficult when each data object must be placed into exactly one cluster, but this can be simplified by considering a fuzzy property between observations. To represent the fuzzy boundaries between observations, fuzzy c-means clustering allows each observation to belong to more than one sub-cluster and associates each observation with a set of membership levels, one per sub-cluster. Fuzzy c-means clustering first assigns the membership levels between observations and sub-clusters, and then uses these to allocate observations to one or more clusters. Fuzzy c-means clustering minimizes a membership-weighted form of the WCSS objective function [15], in which $u_{ij}$ is the degree of membership of observation $x_i$ in cluster $C_j$ and $c_j$ is the center of cluster $C_j$. This objective is optimized iteratively, with the memberships $u_{ij}$ and the cluster centers $c_j$ updated in turn at each iteration.
Spectral clustering was developed to handle data with any shape and ensure convergence to the global optimum. This method constructs an affinity graph, which is partitioned according to the eigen-spectrum of the corresponding Laplacian [16]. First, a graph is formed based on the similarity between observations. Each graph node corresponds to one observation, nodes are connected by edges, and the edge weights denote the degree of similarity between observations [17] [18]. The graph is further characterized by the adjacency matrix W. Let D be the diagonal matrix whose diagonal elements are $D_{ii} = \sum_j W_{ij}$, and define the Laplacian matrix L = D − W. The k eigenvectors of L corresponding to its k smallest eigenvalues are calculated, and these k eigenvectors are arranged as columns to form an n × k matrix, where each row can be taken as a k-dimensional vector. Finally, the k-means algorithm is applied to the rows of this n × k matrix, and the output is the spectral clustering result.
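The two procedures described above can be sketched in Python with NumPy as follows. Several details are assumptions, because the text does not fix them. For fuzzy c-means, the sketch uses the standard membership and center updates with a fuzzifier `m` (set to 2 here). For spectral clustering, it uses a Gaussian similarity with bandwidth `sigma` as edge weights and the eigenvectors of the unnormalized Laplacian L = D − W that belong to the k smallest eigenvalues. The function names are hypothetical.

```python
import numpy as np


def fuzzy_c_means(X, k, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Illustrative fuzzy c-means; returns (U, centers) with U[i, j] = u_ij."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Center update: mean of the observations weighted by u_ij ** m.
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Membership update: u_ij is proportional to d_ij ** (-2 / (m - 1)).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers


def spectral_embedding(X, k, sigma=1.0):
    """Rows of the returned n x k matrix are the points passed on to k-means."""
    X = np.asarray(X, dtype=float)
    # Adjacency matrix W from a Gaussian similarity (assumed), no self-loops.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Degree matrix D_ii = sum_j W_ij and unnormalized Laplacian L = D - W.
    D = np.diag(W.sum(axis=1))
    L = D - W
    # eigh returns eigenvalues in ascending order; keep the first k eigenvectors.
    _, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, :k]
```

Taking the row-wise maximum of `U` gives a hard assignment when one is needed. For spectral clustering, running k-means on the rows of the embedding completes the procedure.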

Selection of Feature Subsets
Soil samples normally contain three types of soil attributes: geographical coordinates, environmental factors, and soil conditions determined by physical or chemical analysis. These attributes differ in their physical meaning. Spatial attributes carry spatial structure information, which is normally used to characterize the spatial variability of certain soil conditions, whereas environmental attributes reflect the factors influencing the soil conditions. To a certain extent, the more similar the environmental conditions between soil samples, the more similar the soil conditions [19]. In agricultural activities, soil conditions are generally of greater practical interest than spatial and environmental attributes.
A good understanding of how the clustering performance will respond to soil

Data Acquisition and Preprocessing
Two real soil datasets, each containing 520 soil samples collected in Pangtang Town, Taoyuan County, Hunan Province, were used to verify the effect of different feature sets on clustering performance. Each soil sample in the two datasets contains five attribute fields: spatial position (x, y coordinates), terrain factors (elevation, slope), and a soil condition (SOC or soil pH). Before applying the clustering models, the values of all soil attributes were normalized according to $x'_{ij} = (x_{ij} - x_j^{\min}) / (x_j^{\max} - x_j^{\min})$, where $x_{ij}$, $x_j^{\min}$, and $x_j^{\max}$ denote the value of soil attribute j for soil sample i, the minimum value of soil attribute j, and the maximum value of soil attribute j, respectively.
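The normalization step can be sketched as follows, assuming it is the usual min-max rescaling of each attribute to the interval [0, 1], which is what the three quantities named above suggest. The function name `min_max_normalize` and the handling of constant columns are assumptions made for the example.

```python
import numpy as np


def min_max_normalize(X):
    """Rescale every column (soil attribute) of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)  # minimum value of each attribute j
    col_max = X.max(axis=0)  # maximum value of each attribute j
    span = col_max - col_min
    # A constant attribute has zero range; map it to 0 instead of dividing by zero.
    span[span == 0] = 1.0
    return (X - col_min) / span
```

Rescaling matters here because coordinates, elevation, slope, and soil conditions are measured in different units, and distance-based clustering would otherwise be dominated by the attribute with the largest numeric range.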
The spatial distribution of the soil samples and environmental variables (elevation and slope) is shown in Figure 1. In the soil datasets, SOC is highly correlated with elevation and slope [20], but this is not the case for pH. Moreover, there is a significant difference between the SOC values in the top (elevation > 95 m) and bottom (elevation < 95 m) regions. Therefore, the dataset that includes the SOC field can simply be divided into two sub-clusters according to the elevation threshold of 95 m. The spatial distribution of sub-clusters C1 and C2 is shown in Figure 2(a). The box plot clearly indicates a significant difference in SOC content between C1 and C2 (Figure 2(b)).
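The control partition described above amounts to thresholding elevation at 95 m, which can be sketched as follows. Two points are assumptions because the text does not specify them: which side of the threshold is labelled C1 or C2, and how a sample at exactly 95 m is treated. The sketch therefore returns neutral labels 0 (at or below the threshold) and 1 (above it).

```python
import numpy as np


def control_partition(elevation, threshold=95.0):
    """Label samples 1 if elevation is above the threshold, otherwise 0."""
    elevation = np.asarray(elevation, dtype=float)
    # Samples exactly at the threshold fall into group 0 (an assumption).
    return (elevation > threshold).astype(int)
```

The resulting labels can serve as the reference against which the sub-clusters produced by each algorithm are compared.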

Validation
The k-means clustering, fuzzy c-means clustering and spectral clustering algorithms were applied to the experimental datasets. Good clustering results should exhibit a significant difference between the soil conditions of interest in different sub-clusters. In this study, we use two indicators to evaluate the clustering performance: the clustering dissimilarity index (DI), and the root mean square of clustering dissimilarity index (RSDI).
where $\hat{C}_i$ and $\hat{C}_j$ are the average values of a certain soil condition in sub-clusters $C_i$ and $C_j$, respectively, and k is the total number of clusters.

Programming Implementation
The k-means, fuzzy c-means, and spectral clustering algorithms were implemented in MATLAB 2010 on a Windows XP operating system. The digital maps of soil samples and topographic factors were produced using ArcGIS 9.0.

Clustering Performance under Different Soil Feature Subsets
We tested the influence of different soil feature sets (SA, EV, SA + SCV, SA + EV, EV + SCV, and WA) on the clustering performance of the three clustering algorithms. For each soil feature set, the three clustering algorithms were executed so as to form sub-clusters with respect to SOC. The spatial distribution of the soil samples in the resulting clusters clearly reflects the response of the clustering performance to the selection of different soil features. The clustered samples produced by all three clustering methods generally match the control (Figure 2, Figure 4(r)). Under the SA + EV treatment, spectral clustering produces the clustering result that best matches the control (Figure 4(o)), followed by fuzzy c-means (Figure 4(i)), with k-means the worst performer (Figure 4(c)). Compared with the results for the above-mentioned feature subsets, SA and SA + SCV both resulted in worse clustering. Under SA and SA + SCV, all three clustering methods generated two sub-clusters that were scattered to the north or south and deviated significantly from the control.
DI and RSDI were further used to quantitatively evaluate the influence of different soil feature sets on clustering performance. These indexes measure the deviation in SOC between the two sub-clusters. Generally speaking, DI and RSDI are higher under EV, EV + SCV, and WA than under SA + EV, with the smallest index values occurring under SA and SA + SCV (Figure 5). This demonstrates that EV, EV + SCV, and WA produced better clustering than SA + EV, with SA and SA + SCV producing the worst results. Additionally, the performance of the clustering methods can differ under the same feature set. Overall, spectral clustering generated higher values of DI and RSDI than k-means and fuzzy c-means for EV, EV + SCV, SA + EV, and WA, but not for SA and SA + SCV. This indicates that spectral clustering is more robust to changes in the feature sets than k-means and fuzzy c-means.
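Building the six feature sets amounts to selecting columns from the sample matrix, as sketched below. The reading of the abbreviations is an assumption inferred from the attribute fields listed in the data description: SA as the spatial attributes (x, y), EV as the environmental variables (elevation, slope), SCV as the soil condition variable (SOC here), and WA as all attributes. The column order and the names `COLUMNS`, `FEATURE_SETS`, and `select_features` are hypothetical.

```python
import numpy as np

# Hypothetical column layout of the sample matrix.
COLUMNS = {"x": 0, "y": 1, "elevation": 2, "slope": 3, "soc": 4}

# Assumed meaning of the six feature-set abbreviations.
FEATURE_SETS = {
    "SA": ["x", "y"],
    "EV": ["elevation", "slope"],
    "SA + SCV": ["x", "y", "soc"],
    "SA + EV": ["x", "y", "elevation", "slope"],
    "EV + SCV": ["elevation", "slope", "soc"],
    "WA": ["x", "y", "elevation", "slope", "soc"],
}


def select_features(data, name):
    """Return the columns of `data` that belong to the named feature set."""
    idx = [COLUMNS[c] for c in FEATURE_SETS[name]]
    return np.asarray(data)[:, idx]
```

Each clustering algorithm can then be run once per entry of `FEATURE_SETS` on the normalized data, so that all three algorithms see the same six inputs.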

Influence of Correlation between Environmental Variables and Soil Conditions on Clustering Performance
We also tested whether the pH values in each sub-cluster were significantly different. DI and RSDI were again used to evaluate the deviation in pH values between the two sub-clusters under different soil feature subsets. Generally speaking, the values of DI and RSDI are very similar for all six soil feature sets (Figure 6). Additionally, the clustering performance of the three clustering methods did not vary for the same feature set. This demonstrates that the resulting sub-clusters have no significant differences in pH under the six soil feature subsets considered here.
Because the topographical factors (elevation and slope) correlate well with SOC but not with pH, whether feature subsets containing environmental factors improve clustering performance depends on the correlation of the environmental attributes with the soil condition of interest (Figure 6). Additionally, in the case of SOC, the poor clustering results under SA, SA + SCV, and SA + EV further suggest that spatial attributes contribute poorly to the clustering models.
In many practical applications, environmental data collected by remote sensing techniques are abundant and easily accessible, whereas only relatively small amounts of soil condition data can be obtained, and at greater cost in terms of human resources and time. Thus, using environmental attributes that correlate well with soil conditions, rather than spatial attributes, will enable better recognition of soil patterns and allow information on soil conditions to be applied in the analysis of soil data.

Conclusion
The present study examined the effect of different soil feature subsets on clustering performance. Feature subsets containing environmental variables generally improved the clustering performance of the k-means, fuzzy c-means, and spectral clustering methods more than those containing spatial attributes. Additionally, spectral clustering was clearly more robust to changes in the feature subsets than k-means and fuzzy c-means in our study case. Thus, combining spectral clustering with feature subsets containing environmental variables can produce useful soil patterns when applied to soil survey data, especially data with an irregular shape. In future work, diverse soil datasets will be used to further validate our results at a larger spatial scale.