Assessment of Water Quality of Euphrates River Using Cluster Analysis
1. Introduction
Globally, pollution of rivers and streams has become one of the most crucial environmental problems of the 20th century [1]. It is therefore important to control water pollution and to monitor water quality [2,3]. Multivariate statistical techniques such as cluster analysis (CA), principal component analysis (PCA) and factor analysis (FA) help to identify the components or factors that account for most of the variance in a system [4,5]. They also aid the interpretation of complex databases, giving a better understanding of temporal and spatial variations and identifying the discriminant parameters that are useful for optimizing a monitoring network [5,6,7]. Over the last decade, these techniques have been applied to water quality assessment and source apportionment of water bodies [3,9-20].
The aim of this study is to identify the water quality parameters responsible for temporal and spatial variations in Euphrates River water using cluster analysis.
2. Materials and Methods
2.1. Study Area
The study area is located in Al-Anbar governorate between latitudes 33˚24'N - 33˚39'N and longitudes 42˚47'E - 43˚16'E (Figure 1). The area includes the largest urban centers in Al-Anbar governorate (the cities of Ramadi and Heet).
2.2. Sampling, Measuring and Analysis
Eleven sampling stations were chosen; their coordinates are listed in Table 1. Sampling was carried out during 2008-2009. Sixteen samples were collected at each sampling station, two per month.
Sixteen physical, chemical and microbiological parameters were measured and analyzed (Table 2). These parameters were sampled monthly. Electrical conductivity (EC), total dissolved solids (TDS) and dissolved oxygen (DO) were measured in the field at the time of sampling using a portable WTW EC meter and a portable HANNA HI 9142 dissolved oxygen meter. Total suspended solids (TSS), turbidity, total hardness, biological oxygen demand (BOD), K+, Na+, Ca2+, Cl–, , , , , and total coliform (T. coli) were determined according to APHA [21].
2.3. Cluster Analysis
Cluster analysis is an exploratory data analysis tool for solving classification problems. Its objective is to sort cases, data, or objects (events, people, things, etc.) into groups or clusters. The resulting clusters of objects should exhibit high internal (within-cluster) homogeneity and high external (between-cluster) heterogeneity [22]. Hierarchical CA, the most common approach, starts with each case in a separate cluster and joins clusters together step by step until only one cluster remains [23,24].
Table 2. Euphrates River water quality parameters.
The Euclidean distance usually gives the similarity between two samples, and a distance can be represented by the difference between analytical values from the samples [25]. The squared Euclidean distance (D2) between location 1 and location 2 is calculated from the normalized values as follows:
$$D^2 = (Z_{DO1} - Z_{DO2})^2 + (Z_{BOD1} - Z_{BOD2})^2 + \cdots \qquad (1)$$
where ZDO1 and ZDO2 are the normalized values of DO at locations 1 and 2. Similarly, ZBOD1 and ZBOD2 are the corresponding normalized values of BOD.
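As a minimal sketch of Equation (1), assuming only two parameters (DO and BOD) and made-up values at three locations, the squared Euclidean distance on normalized values could be computed as follows. The helper names `zscore` and `squared_euclidean` and all numbers are hypothetical and are not taken from the study.

```python
import numpy as np


def zscore(values):
    """Normalize a 1-D array to mean 0 and variance 1 (population variance)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


def squared_euclidean(z_a, z_b):
    """Sum of squared differences between two vectors of normalized values."""
    diff = np.asarray(z_a, dtype=float) - np.asarray(z_b, dtype=float)
    return float(np.sum(diff ** 2))


# Hypothetical measurements at three locations (illustration only).
do = [7.0, 8.0, 9.0]    # dissolved oxygen
bod = [2.0, 4.0, 6.0]   # biological oxygen demand

# One row per location, one column per normalized parameter.
z = np.column_stack([zscore(do), zscore(bod)])

d2_12 = squared_euclidean(z[0], z[1])  # location 1 vs location 2
d2_13 = squared_euclidean(z[0], z[2])  # location 1 vs location 3
```

With more parameters, each one simply contributes one more squared difference to the sum.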
The results of the clustering technique are best described using a dendrogram, or binary tree. The dendrogram provides a visual summary of the clustering process, presenting a picture of the groups and their proximity with a dramatic reduction in the dimensionality of the original data [5,26]. In this study, hierarchical CA was performed on the normalized dataset using Ward's method with squared Euclidean distances as the measure of similarity. Ward's method uses an analysis of variance (ANOVA) approach to evaluate the distances between clusters, seeking to minimize the sum of squares of any two clusters that can be formed at each step. Both temporal and spatial variations of water quality were determined from hierarchical CA using linkage distance. Cluster analysis requires variables to conform to a normal distribution, so the normality of the data distribution was checked with the one-sample Kolmogorov-Smirnov test. The data should also be standardized (mean = 0; variance = 1) before cluster analysis. Standardization tends to increase the influence of variables whose variance is small and reduce the influence of those whose variance is large [27], and it minimizes the effect of differing scales of measurement. All mathematical and statistical calculations were performed with STATISTICA 7 software.
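The workflow described above (normality check, standardization, Ward's linkage) can be sketched in Python with SciPy. This is an illustration only: the study used STATISTICA 7, the data below are randomly generated, and SciPy's `ward` linkage operates on plain rather than squared Euclidean distances, so linkage heights would not be directly comparable with those in the paper's dendrograms.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import kstest, zscore

rng = np.random.default_rng(0)

# Hypothetical data (not the study's): 6 objects x 3 parameters,
# generated as two well-separated groups of three objects each.
group_a = rng.normal(loc=0.0, scale=0.1, size=(3, 3))
group_b = rng.normal(loc=5.0, scale=0.1, size=(3, 3))
data = np.vstack([group_a, group_b])

# Standardize each parameter (column) to mean 0 and variance 1.
z = zscore(data, axis=0)

# Agglomerative clustering with Ward's minimum-variance criterion.
tree = linkage(z, method="ward")

# Cut the tree so that at most two clusters remain.
labels = fcluster(tree, t=2, criterion="maxclust")

# One-sample Kolmogorov-Smirnov test of a hypothetical parameter against
# a normal distribution. Estimating the mean and standard deviation from
# the same sample makes the p-value only approximate.
sample = rng.normal(loc=7.5, scale=1.2, size=50)
ks = kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1)))
```

Here `tree` holds one row per merge step, and `ks.statistic` and `ks.pvalue` give the test result; `scipy.cluster.hierarchy.dendrogram(tree)` could be used to draw the tree.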
3. Results and Discussion
3.1. Temporal Similarity and Period Grouping
An initial exploratory approach applied hierarchical cluster analysis to standardized, log-transformed data sorted by season. Temporal CA generated the dendrogram shown in Figure 2, which groups the 8 months into three clusters. Cluster I comprised April, cluster II included May and June, and cluster III consisted of the remaining months (November, December, January, February and March). Cluster III corresponds approximately to the wet season in Iraq. Figure 2 shows that the temporal patterns of water quality were not purely consistent with the dry/wet seasons. Among the monitored months, April had the highest pollution level and the other months (November, December, January, February, March, May and June) had lower pollution levels. The temporal variation in the levels of the physical, chemical and microbiological parameters (Figure 3) confirmed that April had the highest level of pollution. This is attributed to the highest level of total dissolved solids (TDS), which was recorded in that month. A high concentration of TDS in April has also been reported for the study area [28].
Figure 2. Dendrogram of temporal clustering of sampling periods.
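The temporal grouping can be illustrated with a small sketch that log-transforms and standardizes monthly values before cutting a Ward tree into three clusters. The two-parameter values below are invented so that the grouping is easy to see; they are not the study's measurements, and SciPy stands in for the STATISTICA software used by the authors.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

months = ["Nov", "Dec", "Jan", "Feb", "Mar", "Apr", "May", "Jun"]

# Hypothetical monthly means for two parameters (illustration only).
values = np.array([
    [500.0, 2.0],
    [510.0, 2.1],
    [505.0, 2.0],
    [495.0, 1.9],
    [500.0, 2.1],
    [900.0, 6.0],
    [650.0, 3.5],
    [660.0, 3.6],
])

# Log-transform, then standardize each column to mean 0 and variance 1.
logged = np.log10(values)
z = (logged - logged.mean(axis=0)) / logged.std(axis=0)

# Ward's linkage, then cut the tree into three clusters.
tree = linkage(z, method="ward")
labels = fcluster(tree, t=3, criterion="maxclust")

# Collect the months that fall into each cluster.
groups = {}
for month, label in zip(months, labels):
    groups.setdefault(int(label), []).append(month)
```

The cluster numbers returned by `fcluster` are arbitrary labels; only the membership of each group is meaningful.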
3.2. Spatial Similarity and Site Grouping
In this study, sampling sites were classified by cluster analysis (z-transformation of the input data, Euclidean distance as the similarity measure and Ward's method of linkage) based on the standardized means of the 16 measured parameters. According to the dendrogram, the sampling sites were grouped into two statistically significant clusters (Figure 4). Cluster I included only sampling site 7 (S7). Cluster II comprised sampling sites 1 - 6 and 8 - 11. Site 7 had the lowest pollution level, while the other sites (1 - 6 and 8 - 11) had higher pollution levels. This result is in good agreement with the variation in the water quality parameters measured at the sampling sites, shown in Figure 5: the mean concentrations of most parameters were high at sites 1 - 6 and 8 - 11 and lower at site 7.
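The site classification above works on per-site means rather than on individual samples. A minimal sketch of that step, assuming a hypothetical array of four sites, three samples per site and two parameters (none of it from the study), might look like this:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

sites = ["S1", "S2", "S3", "S4"]  # hypothetical site names

# Hypothetical measurements, shape (sites, samples, parameters).
samples = np.array([
    [[10.0, 100.0], [12.0, 110.0], [11.0, 105.0]],
    [[11.0, 102.0], [13.0, 108.0], [12.0, 106.0]],
    [[10.0, 104.0], [11.0, 109.0], [12.0, 108.0]],
    [[4.0, 40.0], [5.0, 45.0], [6.0, 50.0]],
])

# Mean of each parameter at each site, shape (sites, parameters).
site_means = samples.mean(axis=1)

# z-transformation of the site means, column by column.
z = (site_means - site_means.mean(axis=0)) / site_means.std(axis=0)

# Ward's linkage on Euclidean distances, cut into two clusters.
tree = linkage(z, method="ward")
labels = fcluster(tree, t=2, criterion="maxclust")

# Sites that end up alone in their cluster.
singletons = [s for s, lab in zip(sites, labels) if list(labels).count(lab) == 1]
```

In this made-up example the low-valued site is isolated in its own cluster, which mirrors the kind of one-site cluster reported for S7.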
The results showed that the CA technique is useful for classifying river water in the study region, and that the number of sampling sites and the associated monitoring costs can be reduced without losing much information. This result accords with the results of many studies carried out on other rivers [7,11-13,17,29,30].
4. Conclusion
In this study, cluster analysis was applied to a dataset for the Euphrates River, Iraq. The results show the importance and usefulness of cluster analysis of large and complex databases for obtaining better information about surface water quality. Hierarchical CA grouped the 8 months into three clusters and classified the 11 sampling sites into two clusters based on the similarity of their water quality parameters. The temporal pattern shows that April has a high pollution level compared with the other months. The spatial pattern shows that sampling site 7 (S7) has the lowest level of pollution, while the other sampling sites have higher pollution levels. Based on this information, it is possible to design an optimal future sampling strategy that could reduce the sampling frequency, the number of sampling sites and the associated sampling costs.
Figure 3. Temporal variation of water quality parameters at Euphrates River.
Figure 4. Dendrogram of spatial clustering of sampling sites.
Figure 5. Spatial variation of water quality parameters at Euphrates River.