Application of Surface Water Quality Classification Models Using Principal Components Analysis and Cluster Analysis

Water quality monitoring is one of the highest priorities in surface water protection policy. A variety of approaches are used to interpret and analyze the latent variables that determine the variance of observed water quality at various source points. A considerable proportion of these approaches are based on statistical methods, multivariate statistical techniques in particular. In the present study, multivariate techniques are used to reduce the large number of water quality variables of the Nile River upstream of the Cairo Drinking Water Plants (CDWPs) and to determine the relationships among them for easy and robust evaluation. By means of principal components analysis (PCA) and of the Fuzzy C-Means (FCM) and K-means algorithms for cluster analysis, this study attempted to determine the dominant factors responsible for the variations of Nile River water quality upstream of the CDWPs. Furthermore, cluster analysis classified 21 sampling stations into three clusters based on similarities of water quality features. The PCA results show that 6 principal components contain the key variables and account for 75.82% of the total variance of surface water quality in the study area; the dominant water quality parameters were conductivity, iron, Biological Oxygen Demand (BOD), Total Coliform (TC), ammonia (NH3), and pH. The results of both FCM clustering and the K-means algorithm, based on the concentrations of the dominant parameters, determined 3 cluster groups and produced cluster centers (prototypes). Based on the clustering classification, water quality deteriorated as the cluster number increased from 1 to 3. The cluster grouping can be used to identify the physical, chemical and biological processes creating the variations in the water quality parameters.
This study revealed that multivariate analysis techniques, as the extracted water quality dominant parameters and clustered information can be used in reducing the

How to cite this paper: Hamed, M. A. R. (2019). Application of Surface Water Quality Classification Models Using Principal Components Analysis and Cluster Analysis. Journal of Geoscience and Environment Protection, 7, 26-41. https://doi.org/10.4236/gep.2019.76003
Received: April 1, 2019
Accepted: June 18, 2019
Published: June 21, 2019
Copyright © 2019 by author(s) and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). http://creativecommons.org/licenses/by/4.0/
Open Access


Introduction
The Nile constitutes the essential source of life in Egypt; it provides people with their fresh water needs. It is an essential factor of production and vital for agriculture, transport, tourism and hence the socio-economic development of the country. However, the Nile has become, to a great extent, adversely affected by human activities. In particular, industrial waste discharge, sewage leakage from urban agglomerations and agricultural runoff contribute to the contamination of the Nile [1].
Surface water quality at the intakes of Cairo water treatment plants along the River Nile has deteriorated because the concentrations of some pollutants have risen above the guidelines; this has drawn public concern and may cause health hazards. Thus, better management of the quality of the water sources of Cairo treatment plants is becoming essential.
Multivariate statistical techniques can be used to characterize and evaluate surface water quality; they are useful in verifying temporal and spatial variations caused by natural and anthropogenic factors linked to seasonality. Multivariate analysis of variance determines if there are any significant differences between several groups of multivariate data.
Principal component analysis transforms correlated variables with the purpose of reducing the number of variables and explaining most of the variance with fewer variables (principal components).
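The idea can be illustrated with a minimal sketch that builds a correlation matrix and extracts its leading component. It is an assumption-laden illustration, not the procedure used in the study: the readings are hypothetical, the function names are invented here, and power iteration stands in for a full eigendecomposition.

```python
from math import sqrt
from statistics import mean, stdev

def correlation_matrix(columns):
    """Pearson correlation matrix of equally long columns."""
    n = len(columns[0])
    z = []
    for col in columns:
        mu, sd = mean(col), stdev(col)
        z.append([(v - mu) / sd for v in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def leading_component(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    p = len(matrix)
    # Asymmetric start vector so it is unlikely to be orthogonal to the answer.
    vec = [1.0 / (i + 1) for i in range(p)]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Rayleigh quotient of the converged unit vector.
    eigenvalue = sum(
        vec[i] * sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)
    )
    return eigenvalue, vec

# Hypothetical readings (not data from the study): EC, TDS and pH at five stations.
ec = [380.0, 410.0, 455.0, 520.0, 600.0]
tds = [245.0, 262.0, 290.0, 335.0, 384.0]
ph = [7.6, 7.9, 8.1, 8.0, 7.7]

columns = [ec, tds, ph]
corr = correlation_matrix(columns)
eigenvalue, weights = leading_component(corr)
# The trace of a correlation matrix equals the number of variables,
# so eigenvalue / p is the share of total variance on the first component.
share_percent = 100.0 * eigenvalue / len(columns)
```

Because EC and TDS are almost perfectly correlated in this toy data, most of the variance concentrates on a single component, which is the reduction the paragraph describes.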
Fuzzy C-means (FCM) clustering can be improved through more careful and informed initialization based on data content. By selecting initial cluster centers that are dispersed through the data space, the resulting FCM approach starts from cluster centers that are well spread in the input space, resulting in both faster convergence times and higher quality solutions.
K-means can be used for grouping monitoring stations with similar water quality characteristics. K-means cluster analysis is a divisive clustering method in which the number of groups, k, is set a priori [2]. Once the number of clusters is set as an input and the cluster centroids are initialized, observations are added iteratively to the most similar cluster, whose centroid is then recalculated, until all of the observations are grouped [3].
In the present study, multivariate techniques are used to reduce the large number of water quality variables of the Nile River upstream of the Cairo Drinking Water Plants (CDWPs) and to determine the relationships among them for easy and robust evaluation. By means of principal components analysis (PCA) and of the Fuzzy C-Means (FCM) and K-means algorithms for cluster analysis, this study attempted to determine the dominant factors responsible for the variations of Nile River water quality upstream of the CDWPs. Furthermore, cluster analysis classified 21 sampling stations into three clusters based on similarities of water quality features.
The PCA results show that 6 principal components contain the key variables and account for 75.82% of the total variance of surface water quality in the study area; the dominant water quality parameters were conductivity, iron, Biological Oxygen Demand (BOD), Total Coliform (TC), ammonia (NH3), and pH.
The results of both FCM clustering and the K-means algorithm, based on the concentrations of the dominant parameters, determined 3 cluster groups and produced cluster centers (prototypes). Based on the clustering classification, water quality deteriorated as the cluster number increased from one to three; thus the cluster grouping can be used to identify the physical, chemical and biological processes creating the variations in the water quality parameters.

Cairo drinking water plants (CDWPs)
Cairo water company (CWC), a subsidiary of the holding company of water and wastewater, produces potable water in an amount reaching 6 million m3/day, used by the inhabitants of Greater Cairo. This is done through 13 Cairo drinking water plants (Tibeen, Kafr Elw, North Helwan, Maadi, Fostat, El Roda, Rod El Farg, Amerea, Mostrod, El Marg, El Obour, El Asher and Shubra el Khiema) distributed across Greater Cairo. Table 1 shows the annual average raw water, treated water, and sludge and washing water for the Greater Cairo drinking water plants [4]. From Table 1

Data requirements
Surface water samples were collected from various sampling locations on the rivers, canals, drains and industrial pollution sources of the study area. The water samples were analyzed for twenty water quality parameters according to the standard methods for the examination of water and wastewater, for twelve consecutive months during two years (2017 and 2018), to show the effect of spatial and temporal variation.

Methods
The methods consisted of four main components as follows:
Before the computation, the test data were standardized in order to avoid misclassifications arising from the different orders of magnitude of the tested variables. Therefore, the original data were mean (average) centered and scaled by their standard deviations.
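The standardization step can be sketched as follows. This is a minimal illustration with hypothetical readings; the column names are invented, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev

def standardize(columns):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    scaled = {}
    for name, values in columns.items():
        mu = mean(values)
        sigma = stdev(values)  # assumption: sample standard deviation (n - 1)
        scaled[name] = [(v - mu) / sigma for v in values]
    return scaled

# Hypothetical readings, not data from the study.
raw = {
    "EC_uS_cm": [380.0, 410.0, 455.0, 520.0],
    "pH": [7.6, 7.9, 8.1, 8.0],
}
z = standardize(raw)
```

After this step each variable has zero mean and unit standard deviation, so a parameter measured in the hundreds (such as conductivity) no longer dominates one measured in single digits (such as pH).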
Procedural steps of the PCA [6] are:
• A number of components equal to the number of variables is generated
• The number of components to retain is determined
• Components are rotated (rotation is a linear transformation of the solution to make interpretation easier)
• The rotated solution is interpreted
Dominant water quality parameters: In this study, to determine the main dominant water quality parameters, varimax rotation was used as an effective orthogonal rotation method that minimizes the number of variables that have high loadings on each factor. Varimax loadings greater than 0.75 are considered strong and indicate that a high proportion of the variable's variance is explained by the factor; loadings between 0.50 and 0.75 are considered moderate; and loadings of 0.30-0.50 are considered weak, indicating that much of that attribute's variance remains unexplained and that it is less important [7].
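The loading thresholds above can be expressed as a small helper. This is a sketch only: the function name is hypothetical, taking the absolute value of the loading is an assumption, and the label for loadings below 0.30 is chosen here rather than taken from the source.

```python
def loading_strength(loading):
    """Classify a rotated loading using the 0.75 / 0.50 / 0.30 thresholds."""
    size = abs(loading)  # assumption: the sign of the loading is ignored
    if size > 0.75:
        return "strong"
    if size >= 0.50:
        return "moderate"
    if size >= 0.30:
        return "weak"
    return "negligible"  # label chosen here; the source does not name this range

# Three of the PC1 loadings reported in the paper.
pc1_loadings = {"EC": 0.902, "calcium": 0.726, "total alkalinity": 0.65}
pc1_strengths = {name: loading_strength(v) for name, v in pc1_loadings.items()}
```

Applied to these reported values, EC falls in the strong band while calcium and total alkalinity fall in the moderate band, matching how the paper describes them.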

Fuzzy C-means clustering (FCM) analysis:
FCM was applied to cluster the raw data into several categories using the selected operators, without reference to any predetermined criteria for each category. Most of the rules designed for FCM are based on the search for centroids or representative objects around which all observations are clustered on a minimum-distance basis [8,9].
FCM seeks to minimize an objective function, C, made up of cluster memberships and distances [10].
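For reference, the fuzzy clustering objective in the Kaufman and Rousseeuw formulation, which the citation suggests is the one intended, is usually written as below. The notation is an assumption introduced here: m_{ik} is the membership of object i in cluster k, d_{ij} is the dissimilarity between objects i and j, N is the number of objects and K the number of clusters.

```latex
C = \sum_{k=1}^{K}
    \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} m_{ik}^{2}\, m_{jk}^{2}\, d_{ij}}
         {2 \sum_{j=1}^{N} m_{jk}^{2}},
\qquad
m_{ik} \ge 0, \quad \sum_{k=1}^{K} m_{ik} = 1 \ \text{for every } i
```

Each object thus carries a full set of memberships that sum to one, rather than a single hard label.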
In fuzzy clustering, the following coefficients must be determined:
i. Dunn's partition coefficient, F(U). It may be normalized so that it varies from 0 (completely fuzzy) to 1 (hard cluster); the normalized version is denoted Fc(U).
ii. Another partition coefficient, D(U), given in Kaufman (1990).
iii. The normalized version of this coefficient, Dc(U).
Fc(U) and Dc(U) together give a good indication of the optimum number of clusters. K should be chosen so that it maximizes the value of Fc(U) and minimizes Dc(U) [10].
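A sketch of how the two normalized coefficients could be computed from a membership matrix follows. The definitions used are the commonly cited ones and are an assumption here, since the equations are not reproduced in the text: F(U) is the mean of the squared memberships, D(U) is the mean squared gap between the memberships and the closest hard partition, and both are normalized with the factor 1 - 1/K. The function name is hypothetical.

```python
def partition_coefficients(memberships):
    """Return (Fc, Dc) for memberships[i][k], where each row sums to 1."""
    n = len(memberships)
    k = len(memberships[0])
    # Dunn's coefficient F(U): mean of squared memberships.
    f = sum(m * m for row in memberships for m in row) / n
    # Kaufman's coefficient D(U): mean squared gap to the closest hard partition,
    # where the hard partition puts each object in its highest-membership cluster.
    d = 0.0
    for row in memberships:
        winner = row.index(max(row))
        d += sum(((1.0 if j == winner else 0.0) - m) ** 2 for j, m in enumerate(row))
    d /= n
    fc = (f - 1.0 / k) / (1.0 - 1.0 / k)
    dc = d / (1.0 - 1.0 / k)
    return fc, dc

# Hypothetical memberships of four stations in three clusters.
u = [
    [0.90, 0.05, 0.05],
    [0.80, 0.15, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.10, 0.85],
]
fc, dc = partition_coefficients(u)
```

Under these definitions a crisp partition gives Fc = 1 and Dc = 0, while completely uniform memberships give Fc = 0 and Dc = 1, which is why a good K is one that pushes Fc up and Dc down.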
K-means algorithm: K-means is a simple and efficient algorithm. It divides n observations into a given number K of clusters, and each observation belongs to the cluster with the nearest mean.
It uses the sum of squared errors (SSE) criterion: a cluster pattern is accepted when the SSE is at a minimum. The SSE for K-means is given by Kaufman and Rousseeuw [10]:
\[ \mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - m_i \rVert^{2} \]
where m_i is the mean of the i-th cluster and x ∈ C_i is a pattern assigned to that cluster. K-means clustering has an advantage over other methods in that it can be used to assign new cases to the existing clusters.
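A minimal K-means sketch is given below. It uses the batch (Lloyd-style) variant, which reassigns all points and then recomputes all means, rather than a one-observation-at-a-time update; the data, the initial centers and the function names are hypothetical.

```python
from math import dist

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))

def kmeans(points, centers, max_iter=100):
    """Batch K-means from given initial centers; returns (labels, centers, sse)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the means are too
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    # Sum of squared distances from each point to its cluster mean.
    sse = sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centers, sse

# Hypothetical standardized readings for four stations (two parameters each).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers, sse = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])

# Assigning a new case to the existing clusters only needs the stored centers.
new_case_cluster = nearest((9.0, 9.5), centers)
```

The last line shows the property the paragraph highlights: once the centers are stored, a new sample can be allocated without re-running the clustering.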

Results and Discussion
Descriptive statistics
Table 2 shows the descriptive statistics for the water quality variables measured over the two years.

Principal component analysis
The calculated principal components loadings, eigenvalues, total variance and cumulative variance are shown in Table 3, while the scree plot of the eigenvalues of observed components is depicted in Figure 2.
The results of the principal components analysis in Table 3 and the Cattell scree plot in Figure 2 show that, of the 20 components, only 6 had eigenvalues over 1 [11]. This follows the Chatfield and Collin [12] assumption that components with an eigenvalue of less than 1 should be eliminated. The 6 extracted components were subsequently rotated by varimax rotation in order to make interpretation easier and to clarify the significance of the extracted components for the water quality status over the study period. After rotation, the percentages of total variance of the 6 extracted components add up to 75.82% (their cumulative variance) of the total variance of the observed variables. This indicates that most of the variance of the observed variables is accounted for by these 6 extracted components.
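The retention rule can be sketched as a short function. The eigenvalues below are hypothetical and chosen only to make the arithmetic easy to follow; the function name is also an assumption.

```python
def retain_components(eigenvalues, threshold=1.0):
    """Keep components with eigenvalue above threshold; report explained variance."""
    total = sum(eigenvalues)  # for a correlation matrix this equals the number of variables
    kept = [ev for ev in sorted(eigenvalues, reverse=True) if ev > threshold]
    explained = [100.0 * ev / total for ev in kept]
    cumulative = sum(explained)
    return kept, explained, cumulative

# Hypothetical eigenvalues for a four-variable problem (not values from Table 3).
kept, explained, cumulative = retain_components([3.0, 1.5, 0.9, 0.6])
```

With these toy values two components pass the eigenvalue-greater-than-1 rule and together carry three quarters of the total variance, which mirrors how the 6 retained components in the study carry 75.82%.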
The first principal component (PC1), which accounts for 31.48% of the total variation, can be regarded as a salt component because it is mainly loaded with conductivity and hardness (including calcium). PC1 shows strong loadings on EC (0.902), TDS (0.889), total hardness (0.887), sulfates (0.883), chlorides (0.881) and magnesium (0.811), and moderate loadings on calcium (0.726), nitrates (0.674) and total alkalinity (0.65). Electrical conductivity (EC) measurements indicate the presence of dissolved salts and electrolytic contaminants, but give no information about specific ion compositions [13]. TDS and EC values were strongly positively correlated (r = +0.99), so the study results were in accordance with Toufeek and Korium [14].
The second principal component (PC2), which accounts for 13.29%, is associated with strong loadings on iron (0.879) and manganese (0.819) and a moderate loading on calcium (0.536). The higher iron and manganese concentrations are attributed to intense human activities and industrial effluents from iron and steel companies.
The third principal component (PC3), which described 9.73% of the total variance, had strong positive loadings on BOD (0.938) and COD (0.926). These loadings explain the effects of organic pollution and reflect the strong influence of anthropogenic activities in the area, probably from domestic and industrial waste. High BOD and COD levels in the study area are related to the high bacterial load and organic matter, as well as relatively high temperatures that enhance bacterial growth. All water samples from the study area were higher than the permissible limit (COD should not exceed 10 mg/L) of the Egyptian national water quality standards, Law 48/1982 regarding the protection of the River Nile and waterways from pollution.
The fourth principal component (PC4) explains 6.55% of the total variance and is mainly carried by TC, an indicator of water contamination, with a strong positive loading (0.76). The high total coliform counts might be due to pollution by industrial activities discharging their wastes to the Nile water in Cairo [15]. All Nile water samples were higher than the permissible limit (TC should not exceed 5000 cfu/100 mL) according to Tebbutt [16]. The study results also agree with Rabeh [17]. Additionally, 7.36% of the total variance of water quality is explained by the fifth principal component (PC5), on which NH3 has a strong positive loading. NH3 is closely related to the organic matter content of the sediment, and this high amount of nutrients might also result from the application of manure in agricultural activities [18].
The sixth principal component (PC6), with 7.4% of the total variance, consists mainly of pH (0.701) and DO (0.701), both with moderate loadings. This factor results from anaerobic conditions in the river caused by the heavy load of dissolved organic matter, which leads to the formation of organic acids. pH affects biological and chemical reactions and controls metal ion solubility, and thus affects natural aquatic life. The study results were in accordance with Toufeek and Korium [14].
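As a quick consistency check, the six percentages quoted for PC1 to PC6 can be summed. The small gap to the reported cumulative variance of 75.82% is presumably due to rounding of the individual values; this is an assumption, since the unrounded values are not given here.

```latex
31.48 + 13.29 + 9.73 + 6.55 + 7.36 + 7.40 = 75.81 \approx 75.82 \ (\%)
```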
Based on the component loadings, the variables are grouped accordingly with their designated components as follows: f. Component 6: pH and DO.

Dominant water quality parameters
The dominant parameters identified by the PCA are EC, iron, BOD, TC, NH3 and pH (Table 3). The previous discussion indicated that most of the measured water quality parameters, such as EC, TDS, total hardness, the major ions and total alkalinity, load positively and strongly on PC1. EC has the highest loading in PC1. Thus, EC is considered a dominant parameter.
Iron is considered the next dominant water quality parameter, as it has the highest positive loading in PC2. Also, iron concentrations in the Nile water exceed the drinking water guidelines, particularly at the points of anthropogenic impact; iron is regulated as a secondary drinking water contaminant that may cause offensive taste, odor, color, corrosion or staining problems.
BOD is considered the third dominant water quality parameter, as it has the highest loading in PC3 (0.938). BOD and COD may have a strong relationship with each other, particularly where industrial and agricultural effluents containing large amounts of organic matter are discharged [19].
TC is considered the fourth dominant water quality parameter, as it has the highest loading in PC4 (0.760).
Ammonia is considered the fifth dominant water quality parameter, as it has the highest loading in PC5 (0.853). Ammonia may result from fertilizers present in soil; it is relatively easily oxidized to nitrite and finally to nitrate [20], and it poses a serious threat to public health.
pH is considered the sixth dominant water quality parameter, as it has the highest loading in PC6 (0.701). pH affects biological and chemical reactions and controls metal ion solubility, and thus affects natural aquatic life. Moreover, pH can control the growth of pathogenic microorganisms [21].

Cluster analysis
Optimum number of clusters: FCM was applied to determine the optimum number of clusters (k), which maximizes the value of Fc(U) and minimizes Dc(U) [10]. Table 4 lists the values of Fc(U) and Dc(U) with the corresponding number of clusters. From the FCM results in Table 4, the optimum number of clusters for the study area is three, which satisfies the above conditions.
Clusters characteristics: Using the optimum number of clusters determined by FCM in the previous step, the K-means algorithm was applied to produce the generalized cluster characteristics from the dominant parameters. After the cluster centers are found, the clusters are developed by assigning each object of the dataset to the nearest cluster center. The dissimilarity between each object in the dataset and the cluster centers is measured using Euclidean distance, and cluster centers are selected on the basis of minimum distance. The silhouette is used for interpretation and validation of the clusters [10]. Table 5 and Figure 3 show the generalized characteristics and the mean values of the six dominant parameters for the three clusters, respectively. For the K-means results in Table 5 and Figure 3, as the cluster number changes from 1 to 3, the values of the six dominant parameters increase and water quality deteriorates [22][23][24][25][26][27].
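Silhouette validation can be sketched as follows, using the usual definition of the silhouette width, (b - a) / max(a, b), with Euclidean distance. The points and labels are hypothetical, the function name is invented, and treating a singleton cluster as width 0 is a convention adopted here.

```python
from math import dist

def silhouette_mean(points, labels):
    """Average silhouette width over all points; needs at least two clusters."""
    clusters = sorted(set(labels))
    widths = []
    for i, p in enumerate(points):
        # a: mean distance to the other members of the point's own cluster.
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: width taken as 0
            widths.append(0.0)
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / sum(1 for lab in labels if lab == c)
            for c in clusters if c != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Hypothetical standardized readings for four stations.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = silhouette_mean(points, [0, 0, 1, 1])  # compact, well-separated grouping
bad = silhouette_mean(points, [0, 1, 0, 1])   # grouping that mixes the two groups
```

Values near 1 indicate that stations sit well inside their own cluster, while values near 0 or below suggest that the allocation should be revisited.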

DWPs and monitoring stations clusters allocation:
Based on the generalized cluster characteristics from the K-means algorithm, the allocation of CDWPs and monitoring stations to clusters was developed. The output of the cluster characteristics analysis is displayed in a dendrogram (Figure 4). The dendrogram gives a picture of the clusters, describing the spatial variation in water quality and the monitoring stations and CDWPs grouped in each cluster.
Based on the results of the cluster analysis, with stations and CDWPs grouped under each cluster in Figure 4, it was concluded that: The first cluster, mainly located upstream in the study area with less polluted (LP) stations, included the stations from 878 to 868 and three DWPs (Tibeen, Kafr Elw and North Helwan). The changes in water quality in this cluster were mainly due to agricultural drainage water mixed with partially treated or untreated domestic wastewater, industrial wastewater, and wastewater from the sludge disposal of these three drinking water plants.
The second cluster, which comprised only the three DWPs (Maadi, Fostat and El Roda) with moderate pollution (MP), is mainly affected by the cumulative pollution from the previous cluster in addition to the wastewater from the sludge disposal of these three drinking water plants. The common feature of these sites was relatively high concentrations of the dominant parameters compared to the first cluster.
The third cluster located in the downstream of the study area,

Conclusions
This study presents the application of multivariate statistical techniques to evaluate the water quality upstream of the Cairo drinking water plants along the Nile River. The paper outcomes can be beneficial for:
• Understanding the quality of source waters (i.e., lakes, rivers, and other water bodies) that supply drinking water to big and small communities in any region of the world,
• Applying the study methodology to monthly, seasonal or yearly water quality sampling data to identify the major principal components by principal component analysis (PCA) and extract the dominant parameters, and
• Allocating clusters to source waters, which might help to understand the effect of natural processes, pollution types, and seasonal changes on the water quality of source waters.